Lecture 16 – Cross-Validation and Regularization

by Joseph Gonzalez, Paul Shao (Spring 2020), and Suraj Rampure (Summer 2020)

Important: This lecture is a combination of several lectures from Spring 2020 (this is why the video titles don’t match our numbering), plus a piece of “glue” added this summer. Read this before proceeding with the lectures, as it details what materials you should focus on.

Sections 16.1 through 16.4 discuss train-test splits and cross-validation.

  • 16.1 walks through why we need to split our data into train and test in the first place, and how cross-validation works. It primarily consists of slides.
  • 16.2 and 16.3 walk through the process of creating a basic train-test split, and evaluating models that we’ve fit on our training data using our testing data. Code is in “Part 1”.
  • 16.4 walks through the process of implementing cross-validation. In this video there are references to a Pipeline object in scikit-learn. This is not in scope for Summer 2020, so do not worry about its details. Code is in “Part 1”.
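
The workflow described in 16.2 through 16.4 can be sketched as follows. This is a minimal illustration, not the notebook code from “Part 1”: the synthetic data, the 80/20 split, the choice of 5 folds, and the helper name `rmse` are all assumptions made for this example, and it deliberately avoids Pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data (an assumption for illustration): y is a linear
# function of 3 features plus Gaussian noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Hold out 20% of the rows as a test set that is never used for fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training data only.
model = LinearRegression()
model.fit(X_train, y_train)


def rmse(y_true, y_pred):
    """Root mean squared error (hypothetical helper for this sketch)."""
    return np.sqrt(mean_squared_error(y_true, y_pred))


train_rmse = rmse(y_train, model.predict(X_train))
test_rmse = rmse(y_test, model.predict(X_test))

# 5-fold cross-validation, again using only the training data:
# each fold takes one turn as the validation set.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=kfold, scoring="neg_mean_squared_error",
)
cv_rmse_per_fold = np.sqrt(-neg_mse)  # scikit-learn reports negated MSE
cv_rmse = cv_rmse_per_fold.mean()

print(f"train RMSE: {train_rmse:.3f}")
print(f"test RMSE:  {test_rmse:.3f}")
print(f"CV RMSE:    {cv_rmse:.3f}")
```

Note that cross-validation here touches only the training rows, so the test set remains an untouched final check once a model has been selected.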

Sections 16.5 and 16.6 discuss regularization.

  • 16.5 discusses why we need to regularize, and how penalties on the norm of our parameter vector accomplish this goal.
  • 16.6 gives an explicit expression for the optimal model parameters when using the L2 penalty on our linear model (called “ridge regression”).
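
As a rough illustration of the ridge solution mentioned in 16.6, the sketch below computes the L2-penalized parameters directly and compares them with scikit-learn. It assumes the objective is the sum of squared errors plus `alpha` times the squared L2 norm of the parameters, with no intercept term; the lecture may scale the penalty differently (for example by the number of observations), so treat the exact formula as an assumption. The data and the value of `alpha` are made up.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, -2.0, 3.0]) + rng.normal(scale=0.1, size=50)

alpha = 2.0  # hypothetical regularization strength

# Assumed objective: ||y - X theta||^2 + alpha * ||theta||^2
# Its minimizer is theta_hat = (X^T X + alpha * I)^(-1) X^T y.
d = X.shape[1]
theta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Ordinary least squares for comparison (alpha = 0).
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# scikit-learn's Ridge minimizes the same objective when no intercept is fit.
ridge = Ridge(alpha=alpha, fit_intercept=False)
ridge.fit(X, y)

print("closed form:", theta_ridge)
print("sklearn:    ", ridge.coef_)
print("norm (OLS):  ", np.linalg.norm(theta_ols))
print("norm (ridge):", np.linalg.norm(theta_ridge))
```

The penalty pulls the parameter vector toward zero, which is why the ridge solution has a smaller norm than the unregularized one.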

There are also three supplementary videos accompanying this lecture. They don’t introduce any new material, but may still be helpful for your understanding. They are listed as supplementary and not required since the runtime of this lecture is already quite long. They do not have accompanying Quick Checks for this reason.

  • 16.7 and 16.8 walk through implementing ridge and LASSO regression in a notebook. These videos are helpful in explaining how regularization and cross-validation are used in practice. These videos again use Pipeline, which is not in scope. Code is in “Part 2”.
  • 16.9 is another supplementary video, created by Paul Shao (a TA for Data 100 in Spring 2020). It gives a great high-level overview of both the bias-variance tradeoff and regularization.
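
For a sense of how regularization and cross-validation combine in practice, here is a small sketch that uses scikit-learn's `RidgeCV` and `LassoCV` to pick the regularization strength. It is not the “Part 2” notebook code: the synthetic data, the candidate grid `alphas`, and the choice of 5 folds are assumptions. The features are generated on a common scale, so the sketch skips the standardization step that the videos handle with Pipeline.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic data (an assumption): only the first 3 of 10 features matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 10))
true_theta = np.zeros(10)
true_theta[:3] = [3.0, -2.0, 1.5]
y = X @ true_theta + rng.normal(scale=0.5, size=150)

# Candidate regularization strengths, evenly spaced on a log scale.
alphas = np.logspace(-3, 2, 30)

# Each estimator runs 5-fold cross-validation over the candidates and
# refits on all of the data using the best one.
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("ridge: chosen alpha =", ridge_cv.alpha_)
print("lasso: chosen alpha =", lasso_cv.alpha_)

# The L1 penalty can set coefficients exactly to zero; the L2 penalty
# generally only shrinks them.
print("lasso coefficients equal to zero:", int(np.sum(lasso_cv.coef_ == 0)))
print("ridge coefficients equal to zero:", int(np.sum(ridge_cv.coef_ == 0)))
```

Comparing the two coefficient vectors is a quick way to see the practical difference between the penalties on a given dataset.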

| Video | Description | Quick Check |
| --- | --- | --- |
| 16.1 | Training error vs. testing error. Why we need to split our data into train and test. How cross-validation works, and why it is useful. | 16.1 |
| 16.2 | Using scikit-learn to construct a train-test split. | 16.2 |
| 16.3 | Building a linear model and determining its training and test error. | 16.3 |
| 16.4 | Implementing cross-validation, and using it to help select a model. | 16.4 |
| 16.5 | An overview of regularization. | 16.5 |
| 16.6 | Ridge regression and LASSO regression. | 16.6 |
| 16.7 | *Supplemental.* Using ridge regression and cross-validation in scikit-learn. | N/A |
| 16.8 | *Supplemental.* Using LASSO regression and cross-validation in scikit-learn. | N/A |
| 16.9 | *Supplemental.* An overview of the bias-variance tradeoff, and how it interfaces with regularization. | N/A |