Lecture 16, Part 1 – Data 100, Fall 2020

by Joseph Gonzalez (Spring 2020)

Note: scikit-learn's Pipeline functionality is explored at length in this notebook. IT IS NOT IN SCOPE FOR FALL 2020. Instead, focus on the bigger picture: how we are splitting our data into train and test, and how we are using cross-validation.

Train Test Split and Cross Validation

In this notebook we will work through the train-test split and the process of cross-validation.

Imports

As with the other notebooks, we will use the same set of standard imports.

In [1]:
import numpy as np
import pandas as pd
In [2]:
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
In [3]:
from sklearn.linear_model import LinearRegression

The Data

For this notebook, we will use the seaborn mpg dataset which describes the fuel mileage (measured in miles per gallon or mpg) of various cars along with characteristics of those cars. Our goal will be to build a model that can predict the fuel mileage of a car based on the characteristics of that car.

In [4]:
from seaborn import load_dataset
data = load_dataset("mpg")
data
Out[4]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.0 2790 15.6 82 usa ford mustang gl
394 44.0 4 97.0 52.0 2130 24.6 82 europe vw pickup
395 32.0 4 135.0 84.0 2295 11.6 82 usa dodge rampage
396 28.0 4 120.0 79.0 2625 18.6 82 usa ford ranger
397 31.0 4 119.0 82.0 2720 19.4 82 usa chevy s-10

398 rows × 9 columns

Train Test Split

The first thing we will want to do with this data is construct a train/test split. Constructing the split before EDA and data cleaning can often be helpful: it lets us check whether our data cleaning and any conclusions we draw from visualizations generalize to new data, by re-running the cleaning and EDA process on the test dataset.

Using Pandas Operations

We can sample the entire dataset to get a permutation and then select a range of rows.

In [5]:
shuffled_data = data.sample(frac=1., random_state=42)
shuffled_data
Out[5]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
198 33.0 4 91.0 53.0 1795 17.4 76 japan honda civic
396 28.0 4 120.0 79.0 2625 18.6 82 usa ford ranger
33 19.0 6 232.0 100.0 2634 13.0 71 usa amc gremlin
208 13.0 8 318.0 150.0 3940 13.2 76 usa plymouth volare premier v8
93 14.0 8 318.0 150.0 4237 14.5 73 usa plymouth fury gran sedan
... ... ... ... ... ... ... ... ... ...
71 19.0 3 70.0 97.0 2330 13.5 72 japan mazda rx2 coupe
106 12.0 8 350.0 180.0 4499 12.5 73 usa oldsmobile vista cruiser
270 21.1 4 134.0 95.0 2515 14.8 78 japan toyota celica gt liftback
348 37.7 4 89.0 62.0 2050 17.3 81 japan toyota tercel
102 26.0 4 97.0 46.0 1950 21.0 73 europe volkswagen super beetle

398 rows × 9 columns

Selecting a range of rows for training and test:

In [6]:
split_point = int(shuffled_data.shape[0]*0.90)
split_point
Out[6]:
358
In [7]:
tr = shuffled_data.iloc[:split_point]
te = shuffled_data.iloc[split_point:]
In [8]:
tr
Out[8]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
198 33.0 4 91.0 53.0 1795 17.4 76 japan honda civic
396 28.0 4 120.0 79.0 2625 18.6 82 usa ford ranger
33 19.0 6 232.0 100.0 2634 13.0 71 usa amc gremlin
208 13.0 8 318.0 150.0 3940 13.2 76 usa plymouth volare premier v8
93 14.0 8 318.0 150.0 4237 14.5 73 usa plymouth fury gran sedan
... ... ... ... ... ... ... ... ... ...
391 36.0 4 135.0 84.0 2370 13.0 82 usa dodge charger 2.2
134 16.0 6 258.0 110.0 3632 18.0 74 usa amc matador
306 28.8 6 173.0 115.0 2595 11.3 79 usa chevrolet citation
381 36.0 4 107.0 75.0 2205 14.5 82 japan honda accord
319 31.3 4 120.0 75.0 2542 17.5 80 japan mazda 626

358 rows × 9 columns

In [9]:
te
Out[9]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
243 21.5 3 80.0 110.0 2720 13.5 77 japan mazda rx-4
54 35.0 4 72.0 69.0 1613 18.0 71 japan datsun 1200
50 28.0 4 116.0 90.0 2123 14.0 71 europe opel 1900
174 18.0 6 171.0 97.0 2984 14.5 75 usa ford pinto
189 15.5 8 304.0 120.0 3962 13.9 76 usa amc matador
395 32.0 4 135.0 84.0 2295 11.6 82 usa dodge rampage
187 17.5 8 305.0 140.0 4215 13.0 76 usa chevrolet chevelle malibu classic
169 20.0 6 232.0 100.0 2914 16.0 75 usa amc gremlin
58 25.0 4 97.5 80.0 2126 17.0 72 usa dodge colt hardtop
48 18.0 6 250.0 88.0 3139 14.5 71 usa ford mustang
344 39.0 4 86.0 64.0 1875 16.4 81 usa plymouth champ
235 26.0 4 97.0 75.0 2265 18.2 77 japan toyota corolla liftback
252 19.2 6 231.0 105.0 3535 19.2 78 usa pontiac phoenix lj
21 24.0 4 107.0 90.0 2430 14.5 70 europe audi 100 ls
313 28.0 4 151.0 90.0 2678 16.5 80 usa chevrolet citation
160 17.0 6 231.0 110.0 3907 21.0 75 usa buick century
276 21.6 4 121.0 115.0 2795 15.7 78 europe saab 99gle
191 22.0 6 225.0 100.0 3233 15.4 76 usa plymouth valiant
293 31.9 4 89.0 71.0 1925 14.0 79 europe vw rabbit custom
343 39.1 4 79.0 58.0 1755 16.9 81 japan toyota starlet
257 19.4 6 232.0 90.0 3210 17.2 78 usa amc concord
308 33.5 4 151.0 90.0 2556 13.2 79 usa pontiac phoenix
149 24.0 4 120.0 97.0 2489 15.0 74 japan honda civic
130 26.0 4 122.0 80.0 2451 16.5 74 usa ford pinto
151 31.0 4 79.0 67.0 2000 16.0 74 europe fiat x1.9
359 28.1 4 141.0 80.0 3230 20.4 81 europe peugeot 505s turbo diesel
99 18.0 6 232.0 100.0 2945 16.0 73 usa amc hornet
372 27.0 4 151.0 90.0 2735 18.0 82 usa pontiac phoenix
87 13.0 8 350.0 145.0 3988 13.0 73 usa chevrolet malibu
330 40.9 4 85.0 NaN 1835 17.3 80 europe renault lecar deluxe
214 13.0 8 302.0 130.0 3870 15.0 76 usa ford f108
121 15.0 8 318.0 150.0 3399 11.0 73 usa dodge dart custom
397 31.0 4 119.0 82.0 2720 19.4 82 usa chevy s-10
20 25.0 4 110.0 87.0 2672 17.5 70 europe peugeot 504
188 16.0 8 318.0 150.0 4190 13.0 76 usa dodge coronet brougham
71 19.0 3 70.0 97.0 2330 13.5 72 japan mazda rx2 coupe
106 12.0 8 350.0 180.0 4499 12.5 73 usa oldsmobile vista cruiser
270 21.1 4 134.0 95.0 2515 14.8 78 japan toyota celica gt liftback
348 37.7 4 89.0 62.0 2050 17.3 81 japan toyota tercel
102 26.0 4 97.0 46.0 1950 21.0 73 europe volkswagen super beetle

Checking that they add up.

In [10]:
len(tr) + len(te) == len(data)
Out[10]:
True
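
We could also confirm that the two sets don't overlap. A quick extra sanity check (not part of the original split logic):

# No row index should appear in both the training and test sets.
assert len(set(tr.index) & set(te.index)) == 0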

Using SKLearn

We can use the train_test_split function from sklearn.model_selection to do this easily.

In [11]:
from sklearn.model_selection import train_test_split
In [12]:
tr, te = train_test_split(data, test_size=0.1, random_state=83)
In [13]:
tr
Out[13]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
143 26.0 4 97.0 78.0 2300 14.5 74 europe opel manta
391 36.0 4 135.0 84.0 2370 13.0 82 usa dodge charger 2.2
182 28.0 4 107.0 86.0 2464 15.5 76 europe fiat 131
63 14.0 8 400.0 175.0 4385 12.0 72 usa pontiac catalina
52 30.0 4 88.0 76.0 2065 14.5 71 europe fiat 124b
... ... ... ... ... ... ... ... ... ...
394 44.0 4 97.0 52.0 2130 24.6 82 europe vw pickup
256 20.5 6 225.0 100.0 3430 17.2 78 usa plymouth volare
295 35.7 4 98.0 80.0 1915 14.4 79 usa dodge colt hatchback custom
23 26.0 4 121.0 113.0 2234 12.5 70 europe bmw 2002
82 23.0 4 120.0 97.0 2506 14.5 72 japan toyouta corona mark ii (sw)

358 rows × 9 columns

In [14]:
te
Out[14]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
232 16.0 8 351.0 149.0 4335 14.5 77 usa ford thunderbird
365 20.2 6 200.0 88.0 3060 17.1 81 usa ford granada gl
225 17.5 6 250.0 110.0 3520 16.4 77 usa chevrolet concours
277 16.2 6 163.0 133.0 3410 15.8 78 europe peugeot 604sl
115 15.0 8 350.0 145.0 4082 13.0 73 usa chevrolet monte carlo s
314 26.4 4 140.0 88.0 2870 18.1 80 usa ford fairmont
219 25.5 4 122.0 96.0 2300 15.5 77 usa plymouth arrow gs
33 19.0 6 232.0 100.0 2634 13.0 71 usa amc gremlin
312 37.2 4 86.0 65.0 2019 16.4 80 japan datsun 310
306 28.8 6 173.0 115.0 2595 11.3 79 usa chevrolet citation
280 21.5 6 231.0 115.0 3245 15.4 79 usa pontiac lemans v6
97 18.0 6 225.0 105.0 3121 16.5 73 usa plymouth valiant
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
100 18.0 6 250.0 88.0 3021 16.5 73 usa ford maverick
29 27.0 4 97.0 88.0 2130 14.5 71 japan datsun pl510
300 23.9 8 260.0 90.0 3420 22.2 79 usa oldsmobile cutlass salon brougham
281 19.8 6 200.0 85.0 2990 18.2 79 usa mercury zephyr 6
252 19.2 6 231.0 105.0 3535 19.2 78 usa pontiac phoenix lj
388 26.0 4 156.0 92.0 2585 14.5 82 usa chrysler lebaron medallion
376 37.0 4 91.0 68.0 2025 18.2 82 japan mazda glc custom l
278 31.5 4 89.0 71.0 1990 14.9 78 europe volkswagen scirocco
101 23.0 6 198.0 95.0 2904 16.0 73 usa plymouth duster
76 18.0 4 121.0 112.0 2933 14.5 72 europe volvo 145e (sw)
343 39.1 4 79.0 58.0 1755 16.9 81 japan toyota starlet
362 24.2 6 146.0 120.0 2930 13.8 81 japan datsun 810 maxima
327 36.4 5 121.0 67.0 2950 19.9 80 europe audi 5000s (diesel)
170 23.0 4 140.0 78.0 2592 18.5 75 usa pontiac astro
50 28.0 4 116.0 90.0 2123 14.0 71 europe opel 1900
246 32.8 4 78.0 52.0 1985 19.4 78 japan mazda glc deluxe
32 25.0 4 98.0 NaN 2046 19.0 71 usa ford pinto
89 15.0 8 318.0 150.0 3777 12.5 73 usa dodge coronet custom
230 15.5 8 350.0 170.0 4165 11.4 77 usa chevrolet monte carlo landau
313 28.0 4 151.0 90.0 2678 16.5 80 usa chevrolet citation
86 14.0 8 304.0 150.0 3672 11.5 73 usa amc matador
46 22.0 4 140.0 72.0 2408 19.0 71 usa chevrolet vega (sw)
364 26.6 8 350.0 105.0 3725 19.0 81 usa oldsmobile cutlass ls
272 23.8 4 151.0 85.0 2855 17.6 78 usa oldsmobile starfire sx
171 24.0 4 134.0 96.0 2702 13.5 75 japan toyota corona
71 19.0 3 70.0 97.0 2330 13.5 72 japan mazda rx2 coupe
231 15.5 8 400.0 190.0 4325 12.2 77 usa chrysler cordoba

Quick Visualization

Out of curiosity, what does the mpg field look like? I am going to look at both the train and test distributions, but in practice we should avoid looking at the test data.

In [15]:
ff.create_distplot([tr['mpg'], te['mpg']], ['train mpg', 'test mpg'])





Building A Basic Model

Let's go through the process of building a model. We will start by looking at just engine characteristics, like "cylinders" and "displacement". First we will use just our own feature function (as we did in previous lectures). Then we will introduce sklearn Pipelines, which combine feature functions and models. As we will see, by combining the feature function and model we simplify subsequent training and testing, since we are guaranteed that the same feature functions are applied to both the training and test datasets.

My first feature function will just extract the two features that I want to use in my model.

In [16]:
def phi(df):
    return df[["cylinders", "displacement"]]

Then I fit an sklearn LinearRegression model to my training data.

In [17]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
In [18]:
model.fit(phi(tr), tr['mpg'])
Out[18]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

To evaluate the error we will use the Root Mean Squared Error (RMSE), which is like the mean squared error but in the original units (mpg) instead of squared units (mpg^2).
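
Concretely, for true values $y_i$ and predictions $\hat{y}_i$ over $n$ points:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$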

In [19]:
def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat)**2))

The training error is:

In [20]:
Y_hat = model.predict(phi(tr))
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
Training Error (RMSE): 4.589009902639974

Don't try this at home!

The test error is:

In [21]:
Y_hat = model.predict(phi(te))
Y = te['mpg']
print("Test Error (RMSE):", rmse(Y, Y_hat))
Test Error (RMSE): 5.034009303909175

Oh no! We just used the test data to evaluate our model! We shouldn't have done that.

(Don't worry, we are trained professionals and this is only for demonstration purposes. But seriously, don't try this at home.)

Notice: The test error is slightly higher than the training error. This is typically (but not always) the case. Sometimes we get lucky and the test data is "easier to predict" or happens to closely follow the training data.





SKLearn Pipelines

Again, for Fall 2020, you do not need to know how to use Pipelines in scikit-learn. They are quite involved. Fortunately, they are merely an accessory to the concepts of this lecture, and not the core of it. If you treat each instance of a Pipeline as a black-box way of specifying which features our model should have, you will be able to understand the cross-validation content just fine.

We have removed much of the dialogue around Pipeline from this lecture, but if you're interested, you can skim the documentation on pipelines.

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

model = Pipeline([
    ("SelectColumns", ColumnTransformer([("keep", "passthrough", ["cylinders", "displacement"])])),
    ("LinearModel", LinearRegression())
])

model.fit(tr, tr['mpg']);
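
Because the pipeline bundles the column selection with the linear model, we can call predict directly on raw DataFrame rows; a quick illustration:

# The pipeline applies the feature selection before the linear model,
# so raw rows go in and mpg predictions come out.
model.predict(tr.head())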

Keeping track of all the models.

In this notebook (and in life) we will want to keep track of all our models. Here I will store the models in a dictionary with a (not great) name so I can remember which model is which and can easily compare my models in a plot.

In [23]:
models = {"c+d": model}

More Feature Transformations

We might also want to look at the displacement per cylinder. This is an additional feature transformation that we can add to the first stage of our pipeline. To define this transformation we first need to create a function transformer:

In [24]:
from sklearn.preprocessing import FunctionTransformer

def compute_volume(X):
    # X[:, 0] is cylinders and X[:, 1] is displacement, so this computes
    # displacement per cylinder as a single new column.
    return np.expand_dims(X[:, 1] / X[:, 0], axis=1)

volume_transformer = FunctionTransformer(compute_volume, validate=True)

We can then add this as an additional column transformation:

In [25]:
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", ["cylinders", "displacement"]),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"])])),
    ("LinearModel", LinearRegression())
])
In [26]:
model.fit(tr, tr['mpg']);

Again, we evaluate the model on our training dataset and see a reduction in error:

In [27]:
Y_hat = model.predict(tr)
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
Training Error (RMSE): 4.318799994795116
In [28]:
models["c+d+d/c"] = model

Adding More Features

We can now add additional features about the car.

In [29]:
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration"]
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"])])),
    ("LinearModel", LinearRegression())
])

I have put the following code in a try/except statement because I know it will raise an error. What do you think will go wrong?

In [30]:
try:
    model.fit(tr, tr['mpg'])
except ValueError as err:
    print(err)
Input contains NaN, infinity or a value too large for dtype('float64').

There appear to be NaN (missing) values in the data (take a look at the horsepower column). We need to deal with these missing values. In previous lectures I mentioned imputation; a standard technique is to replace each missing value with the mean of its column. Scikit-learn has a built-in imputation function that we can add to our pipeline after we select the desired columns. The imputation will then be applied to all the selected columns; if we wanted to apply it to a specific column, we would need to put it inside the ColumnTransformer, as sketched below.

Notice: The imputation function actually needs to be fit to data so it is also part of the model.
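
If we did want to impute just one column (say horsepower), a minimal sketch of that alternative, not used in this notebook, might look like:

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Hypothetical variant: impute only horsepower inside the
# ColumnTransformer and pass the other columns through unchanged.
ColumnTransformer([
    ("impute_hp", SimpleImputer(strategy="mean"), ["horsepower"]),
    ("keep", "passthrough", ["cylinders", "displacement", "weight", "acceleration"]),
    ("cyl_vol", volume_transformer, ["cylinders", "displacement"])
])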

In [31]:
from sklearn.impute import SimpleImputer
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"])])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LinearRegression())
])

We can now train our model.

In [32]:
model.fit(tr, tr['mpg']);

Saving the model for later comparison:

In [33]:
models['c+d+d/c+h+w+a'] = model

Evaluating the training error:

In [34]:
Y_hat = model.predict(tr)
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
Training Error (RMSE): 4.025034093127813

We reduced the training error, but what about the test error? We really shouldn't look at the test error, so instead we will use cross-validation to compare the models:





Cross Validation

In the following function we use the sklearn KFold cross-validation class.

Here we define a five-fold cross-validation with

five_fold = KFold(n_splits=5)

Then we loop over the 5 splits and get the indices (tr_ind) in the training data to use for training and the indices (va_ind) in the training data to use for validation:

for tr_ind, va_ind in five_fold.split(tr):
In [35]:
from sklearn.model_selection import KFold
from sklearn.base import clone

def cross_validate_rmse(model):
    # Work on a fresh (unfitted) copy so we don't disturb the model
    # that was already fit on the full training set.
    model = clone(model)
    five_fold = KFold(n_splits=5)
    rmse_values = []
    for tr_ind, va_ind in five_fold.split(tr):
        # Fit on the training folds and score on the held-out validation fold.
        model.fit(tr.iloc[tr_ind, :], tr['mpg'].iloc[tr_ind])
        rmse_values.append(rmse(tr['mpg'].iloc[va_ind], model.predict(tr.iloc[va_ind, :])))
    return np.mean(rmse_values)

Validating the model.

In [36]:
cross_validate_rmse(model)
Out[36]:
4.05180814386984
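
As an aside, scikit-learn's cross_val_score can perform this loop for us. A minimal sketch, assuming scikit-learn 0.22+ (where the 'neg_root_mean_squared_error' scorer is available); with cv=5 it uses the same unshuffled 5-fold split, so the result should closely match the manual loop above:

from sklearn.model_selection import cross_val_score

# cross_val_score negates error metrics so that larger is always better;
# flip the sign to recover the mean per-fold RMSE.
scores = cross_val_score(model, tr, tr['mpg'], cv=5,
                         scoring='neg_root_mean_squared_error')
print(-scores.mean())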

The following helper function generates a plot comparing all the models in the models dictionary.

In [37]:
def compare_models(models):
    # Compute the training error for each model
    training_rmse = [rmse(tr['mpg'], model.predict(tr)) for model in models.values()]
    # Compute the cross validation error for each model
    validation_rmse = [cross_validate_rmse(model) for model in models.values()]
    # Compute the test error for each model (don't do this!)
    test_rmse = [rmse(te['mpg'], model.predict(te)) for model in models.values()]
    names = list(models.keys())
    fig = go.Figure([
        go.Bar(x = names, y = training_rmse, name="Training RMSE"),
        go.Bar(x = names, y = validation_rmse, name="CV RMSE"),
        go.Bar(x = names, y = test_rmse, name="Test RMSE", opacity=.3)])
    return fig
In [38]:
fig = compare_models(models)
fig.update_yaxes(range=[2,5.1], title="RMSE")

Notice I made the Test RMSE invisible(ish) because you shouldn't look at it until we are done. But again, for demonstration purposes, I plotted it so we can see how it compares to the training and cross-validation errors.

Can you improve the model further? Let's try adding the model year.

In [39]:
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"])
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LinearRegression())
])
In [40]:
model.fit(tr, tr['mpg'])
models['c+d+d/c+h+w+a+y'] = model

Comparing the models

In [41]:
fig = compare_models(models)
fig.update_yaxes(range=[2,5.1], title="RMSE")

The model year improved accuracy quite a bit! This improvement appears to generalize as well, since it also reduced the cross-validation error.

Going too Far?

Can we use the car's name to predict MPG? The name contains general features, like the manufacturer, that might help, but it also contains the specific model name, which is probably too specific, and not all the vehicles in the test set will appear in training.

Let's try applying the CountVectorizer (which implements bag-of-words features). At this point we are also likely to have too many dimensions in our model, and we are not applying any regularization technique to compensate (because we haven't covered regularization in lecture yet).
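
To get a feel for the bag-of-words encoding, here is a small illustration on a hypothetical mini-corpus of names (the actual vocabulary will be learned from the training names):

from sklearn.feature_extraction.text import CountVectorizer

sample_names = ["ford pinto", "ford maverick", "toyota corona"]
vec = CountVectorizer()
counts = vec.fit_transform(sample_names)
print(vec.get_feature_names())  # the distinct words found in the names
print(counts.toarray())         # one row of word counts per name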

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", "passthrough", quantitative_features),
        ("cyl_vol", volume_transformer, ["cylinders", "displacement"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LinearRegression())
])

Notice that we are using an additional column transformation.

In [43]:
model.fit(tr, tr['mpg'])
models['c+d+d/c+h+w+a+y+n'] = model
In [44]:
fig = compare_models(models)
fig.update_yaxes(range=[0,5.1], title="RMSE")

Overfitting! We substantially reduced the training error but actually made the generalization error worse!

In [45]:
best_model = clone(models['c+d+d/c+h+w+a+y'])
In [46]:
best_model.fit(data, data['mpg']);
In [47]:
rmse(best_model.predict(te), te['mpg'])
Out[47]:
2.8167549672284293