In [1]:
import numpy as np
import pandas as pd
In [2]:
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
In [3]:
from sklearn.linear_model import LinearRegression

Data 100 Spring 2020

Lecture 22

A. Adhikari

This notebook accompanies the slides for Lecture 22. You can think of it as a reading guide for the slides. It takes you through the ideas and results on the slides, in the context of one example.

Start by going over the Review of Linear Regression.

  • The first three slides in that section are taken from Lecture 15.
  • Slides 7-8 remind you of the sizes of the vectors and matrices involved. They also write the regression model as "signal + noise" though as yet there are no assumptions about the properties of the noise.

Now step away from the slides and examine the data we are about to analyze.

The Data

The Snowy Plover is a tiny bird that lives on the coast in parts of California and elsewhere. It is so small that it is vulnerable to many predators and to people and dogs that don't look where they are stepping when they go to the beach. It is considered endangered in many parts of the US.

The data are about the eggs and newly-hatched chicks of the Snowy Plover. Here's a parent bird and some eggs.

[image: plover and eggs]

The data were collected at the Point Reyes National Seashore by a former student at Berkeley. The goal was to see how the size of an egg could be used to predict the weight of the resulting chick. The bigger the newly-hatched chick, the more likely it is to survive.

[image: plover and chick]

Each row of the data frame below corresponds to one Snowy Plover egg and the resulting chick. Note how tiny the bird is:

  • Egg Length and Egg Breadth (widest diameter) are measured in millimeters
  • Egg Weight and Bird Weight are measured in grams; for comparison, a standard paper clip weighs about one gram
In [4]:
birds = pd.read_csv('snowy_plover.csv')
birds.head()
Out[4]:
Egg Length Egg Breadth Egg Weight Bird Weight
0 28.80 21.84 7.4 5.2
1 29.04 22.45 7.7 5.4
2 29.36 22.48 7.9 5.6
3 30.10 21.71 7.5 5.3
4 30.17 22.75 8.3 5.9
In [5]:
birds.shape
Out[5]:
(44, 4)

We are going to be regressing Bird Weight on the size of the eggs. The scatter plot below shows the relation between Bird Weight and Egg Weight. You can see that the relation is roughly linear, but the vertical spread is not the same throughout.

In [6]:
fig1 = go.Figure()
data_scatter = go.Scatter(x=birds["Egg Weight"], y=birds["Bird Weight"], 
                            mode="markers",
                            marker=dict(size=8))
fig1.add_trace(data_scatter)
fig1.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=400,
                 xaxis_title="Egg Weight",
                 yaxis_title="Bird Weight")
fig1

Here is a scatter plot of Bird Weight versus the two length variables, Egg Length and Egg Breadth. You can rotate the plot to get a sense of the shape.

In [7]:
fig2 = go.Figure()
data_scatter = go.Scatter3d(x=birds["Egg Length"], y=birds["Egg Breadth"], z=birds["Bird Weight"], 
                            mode="markers",
                            marker=dict(size=4))
fig2.add_trace(data_scatter)
fig2.update_layout(dict(margin=dict(l=0, r=0, t=0, b=0), 
                  height=500,
                 scene=dict(
                 xaxis_title="Egg Length",
                 yaxis_title="Egg Breadth",
                 zaxis_title="Bird Weight")
                      ))
fig2

Regression on All Three Covariates

It seems reasonable to fit a linear model to these data. Let's start by regressing Bird Weight on all three measurements on the eggs. That means $d=3$.

We will first use scikit-learn, just as you have done before.

Note that we are using a model with an intercept. That is, every element of the first column of the design matrix $\mathbb{X}$ is 1.

The other columns of $\mathbb{X}$ are Egg Length, Egg Breadth, and Egg Weight.

The observed responses $\mathbb{Y}$ are the column Bird Weight.

In [8]:
model = LinearRegression(fit_intercept=True)
In [9]:
model.fit(birds[["Egg Length", "Egg Breadth", "Egg Weight"]], 
          birds[["Bird Weight"]])
Out[9]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Estimated Parameters (Slide 8)

Here are the intercept and the slopes (also known as weights or coefficients) of Egg Length, Egg Breadth, and Egg Weight.

In terms of our model, these are the values in the vector $\hat{\theta}$.

  • The units of the intercept are grams
  • The units of the coefficients of Egg Length and Egg Breadth are grams per millimeter
  • The units of the coefficient of Egg Weight are grams per gram
In [10]:
model.intercept_, model.coef_
Out[10]:
(array([-4.60567046]), array([[0.06657019, 0.215914  , 0.43122888]]))
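
To make those arrays easier to read, here is a small sketch (reusing the fitted model above) that labels each estimated value with its covariate; the names covariate_names and theta_hat are just for this illustration.

In [ ]:
# Sketch: label the estimated intercept and slopes with their covariate names.
# Assumes the same column order used when fitting `model` above.
covariate_names = ["Egg Length", "Egg Breadth", "Egg Weight"]
theta_hat = pd.Series(model.coef_[0], index=covariate_names)
theta_hat["Intercept"] = model.intercept_[0]
theta_hat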

Fitted Values (Slide 8)

We can now find our estimates of Bird Weight. These are called fitted values and can be used to predict the weight of the next little chick based on measurements on the egg.

In terms of our model, the fitted values are $\hat{\mathbb{Y}}$ which is calculated as $\mathbb{X}\hat{\theta}$.

In [11]:
# Note: this makes all_covariates refer to the same DataFrame as birds,
# so the columns added below will also show up in birds later on.
all_covariates = birds
all_covariates["Fitted"] = model.predict(
    birds[["Egg Length", "Egg Breadth", "Egg Weight"]]
    )
In [12]:
all_covariates.head()
Out[12]:
Egg Length Egg Breadth Egg Weight Bird Weight Fitted
0 28.80 21.84 7.4 5.2 5.218207
1 29.04 22.45 7.7 5.4 5.495260
2 29.36 22.48 7.9 5.6 5.609285
3 30.10 21.71 7.5 5.3 5.319802
4 30.17 22.75 8.3 5.9 5.893995
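
As a check, here is a sketch that builds the design matrix $\mathbb{X}$ explicitly (with the all-1s intercept column) and verifies that the fitted values are indeed $\mathbb{X}\hat{\theta}$. The names X_design and theta_hat_vec are just for this illustration.

In [ ]:
# Sketch: construct X with an all-ones intercept column, stack the intercept
# and slopes into one vector, and confirm that X @ theta_hat reproduces the
# fitted values computed by model.predict.
X_design = np.hstack([np.ones((len(birds), 1)),
                      birds[["Egg Length", "Egg Breadth", "Egg Weight"]].values])
theta_hat_vec = np.append(model.intercept_, model.coef_[0])
np.allclose(X_design @ theta_hat_vec, all_covariates["Fitted"])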

Residuals (Slide 10-12)

As soon as you make an estimate, you have to consider the error in it. The residuals below are defined by $e = \mathbb{Y} - \hat{\mathbb{Y}}$.

In [13]:
all_covariates["Residuals"] = all_covariates["Bird Weight"] - all_covariates["Fitted"]
In [14]:
all_covariates.head()
Out[14]:
Egg Length Egg Breadth Egg Weight Bird Weight Fitted Residuals
0 28.80 21.84 7.4 5.2 5.218207 -0.018207
1 29.04 22.45 7.7 5.4 5.495260 -0.095260
2 29.36 22.48 7.9 5.6 5.609285 -0.009285
3 30.10 21.71 7.5 5.3 5.319802 -0.019802
4 30.17 22.75 8.3 5.9 5.893995 0.006005
In [15]:
# The sum of the residuals is 0

all_covariates["Residuals"].sum()
Out[15]:
-6.217248937900877e-14
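
There is more going on than the sum being 0: for least squares, the residual vector is orthogonal to every column of the design matrix, and the sum above is essentially 0 because one of those columns is the all-1s intercept column. Here is a sketch of that check; X_design is just an illustrative name.

In [ ]:
# Sketch: each column of the design matrix has (numerically) zero dot product
# with the residual vector. The first entry is the dot product with the
# all-ones column, i.e. the sum of the residuals.
X_design = np.hstack([np.ones((len(birds), 1)),
                      birds[["Egg Length", "Egg Breadth", "Egg Weight"]].values])
X_design.T @ all_covariates["Residuals"].values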

Observed Responses and Fitted Values (Slide 13)

These two ($\mathbb{Y}$ and $\hat{\mathbb{Y}}$) have the same average:

In [16]:
all_covariates["Bird Weight"].mean()
Out[16]:
6.145454545454546
In [17]:
all_covariates["Fitted"].mean()
Out[17]:
6.145454545454546

Residual Plot (Slides 13-15)

The plot shows the residuals versus the fitted values. Notice:

  • The plot has no trend; the residuals and fitted values are uncorrelated.
  • The residuals average out to 0, so the least-squares line through this plot is the horizontal line at height 0.
  • There's no pattern in the residuals but the vertical spread varies.
In [18]:
fig3 = go.Figure()
data_scatter = go.Scatter(x=birds["Fitted"], y=birds["Residuals"], 
                            mode="markers",
                            marker=dict(color='red', size=8))
fig3.add_trace(data_scatter)
fig3.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=400,
                 xaxis_title="Fitted Values",
                 yaxis_title="Residuals")
fig3

Multiple $R^2$ (Slides 17-19)

Here are the observed responses $\mathbb{Y}$ plotted against the fitted values $\hat{\mathbb{Y}}$.

  • The trend has to be positive: the larger the observed response, the larger its fitted value will tend to be.
In [19]:
fig4 = go.Figure()
data_scatter = go.Scatter(x=birds["Fitted"], y=birds["Bird Weight"], 
                            mode="markers",
                            marker=dict(size=8))
fig4.add_trace(data_scatter)
fig4.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=400,
                 xaxis_title="Fitted Values",
                 yaxis_title="Bird Weight")
fig4

Here is a correlation matrix that shows the correlations between all the pairs you can make from among $\mathbb{Y}$ (Bird Weight), the fitted values, and the residuals.

Note:

  • The entry in Row $i$ and Column $j$ is the correlation between Variable $i$ and Variable $j$.
  • The matrix is symmetric because the correlation between Variable $i$ and Variable $j$ is the same as the correlation between Variable $j$ and Variable $i$.
  • Each element on the diagonal is the correlation between a variable and itself, and hence is 1.
  • The correlation between the residuals and the fitted values is 0.
In [20]:
birds[["Bird Weight", "Fitted", "Residuals"]].corr()
Out[20]:
Bird Weight Fitted Residuals
Bird Weight 1.000000 8.508845e-01 5.253529e-01
Fitted 0.850884 1.000000e+00 -1.070668e-16
Residuals 0.525353 -1.070668e-16 1.000000e+00

The correlation between the fitted values and the observed Bird Weight is pretty high: more than 0.85.

The multiple $R^2$ is the square of this correlation. It is one way of measuring how close the fitted values are to the observed responses.

In [21]:
# Multiple R^2

0.850884**2
Out[21]:
0.724003581456
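
Rather than squaring a rounded-off correlation, here is a sketch that computes the multiple $R^2$ directly from the columns built above, both as the squared correlation and as $1 - \text{SSE}/\text{SST}$; for least squares with an intercept the two agree.

In [ ]:
# Sketch: multiple R^2 computed directly. For least squares with an intercept,
# the squared correlation between Y and the fitted values equals 1 - SSE/SST.
r = np.corrcoef(birds["Bird Weight"], birds["Fitted"])[0, 1]
sse = (birds["Residuals"] ** 2).sum()
sst = ((birds["Bird Weight"] - birds["Bird Weight"].mean()) ** 2).sum()
(r ** 2, 1 - sse / sst)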

Towards Inference

Everything we have done thus far has been descriptive. We have not considered generalizing beyond the data that we have.

Also, everything that we have done thus far makes no assumptions about the shape of the scatter plots of the data.

To generalize – that is, to make predictions based on new eggs – we do have to make some assumptions about randomness.

Inference (Slides 24-27)

Slides 25 and 26 state the model, which should feel familiar from Lecture 21.

For a particular egg, $x$ is the vector of its length, breadth, and weight. The model says the observed bird weight is $f_\theta(x) + \epsilon$, where

$$ f_\theta(x) ~ = ~ \theta_0 + \theta_1 \cdot \text{egg length} + \theta_2 \cdot \text{egg breadth} + \theta_3 \cdot \text{egg weight} $$
  • For each $i$, the parameter $\theta_i$ is a fixed number but it is unobservable. We can only estimate it.
  • The random error $\epsilon$ is also unobservable, but it is assumed to have expectation 0 and be independent and identically distributed across eggs.

To carry out the calculations needed for constructing confidence intervals and so on, I have used the module statsmodels. It's a pretty standard way of doing regression in Python. See the textbook, for example; scroll down until you see statsmodels.api being imported.

  • Notice the use of add_constant to specify that we want an intercept.
  • Notice also that there's a ton of output, much of which is incomprehensible unless you have taken the relevant Stat classes.

Many regression programs produce such output. Data scientists learn to use what they understand and pretend the rest isn't there. The more you study regression, the more of the entries you will be able to understand. You can go a long way just with the pieces that you understand from Data 8 and 100.

In [22]:
import statsmodels.api as sm
In [23]:
y = birds["Bird Weight"]
In [24]:
X_all = sm.add_constant(
    (birds[["Egg Length", "Egg Breadth", "Egg Weight"]]).values)

results_all = sm.OLS(y, X_all).fit()
results_all.summary()
Out[24]:
OLS Regression Results
Dep. Variable: Bird Weight R-squared: 0.724
Model: OLS Adj. R-squared: 0.703
Method: Least Squares F-statistic: 34.98
Date: Sat, 11 Apr 2020 Prob (F-statistic): 2.90e-11
Time: 13:57:15 Log-Likelihood: 5.5617
No. Observations: 44 AIC: -3.123
Df Residuals: 40 BIC: 4.013
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -4.6057 4.843 -0.951 0.347 -14.394 5.183
x1 0.0666 0.083 0.803 0.426 -0.101 0.234
x2 0.2159 0.229 0.944 0.351 -0.246 0.678
x3 0.4312 0.317 1.360 0.181 -0.209 1.072
Omnibus: 3.670 Durbin-Watson: 1.716
Prob(Omnibus): 0.160 Jarque-Bera (JB): 3.399
Skew: -0.029 Prob(JB): 0.183
Kurtosis: 4.360 Cond. No. 5.74e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.74e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

We are going to ignore the majority of the output. We will just focus on $R^2$ and the coefficients.

Note that the value of R-squared at the top of the table is the same as what we got earlier.
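
If you want to pull that number out of the results object rather than read it off the table, here's a quick sketch using the rsquared attribute of the fitted results.

In [ ]:
# Sketch: statsmodels stores R^2 as an attribute of the fitted results.
results_all.rsquared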

Skip to the middle section where you see the labels coef, std err, etc.

  • const, x1, x2, and x3 are respectively Intercept, Egg Length, Egg Breadth, and Egg Weight.
  • coef is the observed value of $\hat{\theta}$, the best coefficients based on our data. You should go back and check that this is exactly what we got in the section called Estimated Parameters earlier in this notebook.
  • std err is short for standard error and is an estimate, computed from the data, of the SD of the coefficient. The coefficient has an SD because it is a random variable: we calculated its value based on this sample, but it could have come out differently for a different sample. (The short resampling sketch after this list illustrates that.)
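
Here is a rough sketch of that idea, using the resampling (bootstrap) approach from earlier in the course: refit the regression on resampled rows and look at how much the coefficients vary. The loop and the name boot_coefs are just for illustration; the spread of each coefficient should be in the same ballpark as the std err column.

In [ ]:
# Sketch (illustration only): resample the rows of the data with replacement,
# refit, and collect the coefficients. Their spread across resamples gives a
# rough sense of the standard errors reported above.
boot_coefs = []
for _ in range(1000):
    resample = birds.sample(frac=1, replace=True)
    lm = LinearRegression().fit(
        resample[["Egg Length", "Egg Breadth", "Egg Weight"]],
        resample["Bird Weight"])
    boot_coefs.append(np.append(lm.intercept_, lm.coef_))
np.std(boot_coefs, axis=0)  # rough SDs for const, x1, x2, x3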

Confidence Intervals for the True Parameters

Skip the next two columns and look at the two at the end. Those form approximate 95% confidence intervals for the true coefficients. For example,

  • $(-14.394, 5.183)$ is an approximate 95% confidence interval for $\theta_0$
  • $(-0.101, 0.234)$ is an approximate 95% confidence interval for $\theta_1$

And so on.

You should check that the endpoints of the intervals aren't very different from what you would have got by going two standard errors on either side of the estimated value.

In [25]:
# Normal-curve based rough 95% confidence interval 
# for the true intercept theta_0

-4.6057 + 4.83*np.array([-2, 2])
Out[25]:
array([-14.2657,   5.0543])
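
And here is a sketch that does the same rough calculation for all four coefficients at once, using the estimates (results_all.params) and standard errors (results_all.bse), next to the 95% intervals statsmodels itself reports; they should be close.

In [ ]:
# Sketch: rough normal-curve intervals (estimate +/- 2 std errs) for every
# coefficient, compared with the 95% intervals statsmodels reports.
rough_intervals = pd.DataFrame({
    "lower": results_all.params - 2 * results_all.bse,
    "upper": results_all.params + 2 * results_all.bse,
})
rough_intervals, results_all.conf_int()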

Testing if a Slope is 0 (Slides 27-28)

Before you read this, you might want to skim the textbook, especially if you didn't take Data 8. But if you understand the hypotheses and the reasons for them, then keep reading here.

Look at the columns labeled t and P>|t|. The first is a test statistic and the second is a $p$-value.

  • For each true slope $\theta_i$, we are testing the null hypothesis $\theta_i = 0$ versus the alternative hypothesis $\theta_i \neq 0$.
  • The test statistic is the estimated coefficient converted to standard units under the null hypothesis:
$$ t ~ = ~ \frac{\text{estimate} - 0}{\text{std err}} $$

Small $p$-values lead us to reject the null hypothesis. Equivalently, you can just look at the confidence interval. If it doesn't contain 0, reject the null (at the 5% level). If the interval does contain 0 (that is, if the two ends are of different signs), then the data are consistent with the null hypothesis that the true coefficient is 0.
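
As a sketch, you can reconstruct the t column yourself as estimate divided by std err and compare it with what statsmodels reports; it also stores the $p$-values.

In [ ]:
# Sketch: the t statistics are just the coefficients in units of their
# standard errors; statsmodels stores them, along with the p-values.
t_by_hand = results_all.params / results_all.bse
t_by_hand, results_all.tvalues, results_all.pvalues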

Collinearity (Slides 21-22)

This brings us to a mystery.

  • The regression is pretty good: $R^2$ is high, the residual plot looks OK.
  • But none of the coefficients is significantly different from 0. Each confidence interval contains 0.

What we are realizing is that while the model fits well, none of the individual coefficients has meaning on its own. We have a good overall fit but no detailed interpretation.

Now read Slides 21-22. The problem is collinearity. The covariates are related to each other, not just to the response. As a result, we can't interpret the individual coefficients.

Here is the correlation matrix of the covariates and the response.

In [26]:
birds[["Egg Length", "Egg Breadth", "Egg Weight", "Bird Weight"]].corr()
Out[26]:
Egg Length Egg Breadth Egg Weight Bird Weight
Egg Length 1.000000 0.402764 0.792449 0.676142
Egg Breadth 0.402764 1.000000 0.839077 0.733687
Egg Weight 0.792449 0.839077 1.000000 0.847228
Bird Weight 0.676142 0.733687 0.847228 1.000000
  • First look at the Bird Weight column. All the covariates are correlated with the response. That's good news for regressing Bird Weight on the other variables.
  • But now look at the correlations between the covariates. Those are high too. For example, Egg Breadth and Egg Weight have a high correlation.

This means you can't increase one covariate while keeping the others constant. The individual slopes have no meaning.
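
One way to see this with the tools we already have is to regress one covariate on the others. Here is a sketch regressing Egg Weight on the two length measurements; a large $R^2$ here is the collinearity in action (the exact value isn't important). The name X_lengths is just for this illustration.

In [ ]:
# Sketch: the covariates largely predict each other. Regress Egg Weight on
# Egg Length and Egg Breadth; a high R^2 here means you can't change Egg
# Weight much while holding the other two fixed.
X_lengths = sm.add_constant(birds[["Egg Length", "Egg Breadth"]].values)
sm.OLS(birds["Egg Weight"], X_lengths).fit().rsquared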

Variable Selection

  • Egg Weight is highly correlated with the other two covariates. We can try regressing on Egg Weight alone.
  • Egg Breadth and Egg Length have the smallest correlation among all pairs of the covariates. We can try regressing on just those two.

Here are the two regressions. Notice that $R^2$ hardly drops at all compared to the regression with all three covariates, but in each regression below the slopes are significantly different from 0. Separating Egg Weight from the two length dimensions of the egg shows that the variables do matter; Egg Length and Egg Breadth together have almost the same explanatory power as Egg Weight by itself.

The regression on egg weight alone looks like the one to use. It's no surprise that if you want to predict the weight of the newly-hatched chick, using the weight of the egg is your best move.

Regression on Egg Weight Alone

In [27]:
X_ew = sm.add_constant(
    (birds[["Egg Weight"]]).values)

results_ew = sm.OLS(y, X_ew).fit()
results_ew.summary()
Out[27]:
OLS Regression Results
Dep. Variable: Bird Weight R-squared: 0.718
Model: OLS Adj. R-squared: 0.711
Method: Least Squares F-statistic: 106.8
Date: Sat, 11 Apr 2020 Prob (F-statistic): 4.15e-13
Time: 13:57:15 Log-Likelihood: 5.0722
No. Observations: 44 AIC: -6.144
Df Residuals: 42 BIC: -2.576
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -0.0583 0.601 -0.097 0.923 -1.271 1.155
x1 0.7185 0.070 10.336 0.000 0.578 0.859
Omnibus: 2.737 Durbin-Watson: 1.797
Prob(Omnibus): 0.254 Jarque-Bera (JB): 1.986
Skew: -0.033 Prob(JB): 0.370
Kurtosis: 4.039 Cond. No. 158.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Regression on Egg Length and Egg Breadth

In [28]:
X_el_eb = sm.add_constant(
    (birds[["Egg Length", "Egg Breadth"]]).values)

results_el_eb = sm.OLS(y, X_el_eb).fit()
results_el_eb.summary()
Out[28]:
OLS Regression Results
Dep. Variable: Bird Weight R-squared: 0.711
Model: OLS Adj. R-squared: 0.697
Method: Least Squares F-statistic: 50.49
Date: Sat, 11 Apr 2020 Prob (F-statistic): 8.73e-12
Time: 13:57:15 Log-Likelihood: 4.5668
No. Observations: 44 AIC: -3.134
Df Residuals: 41 BIC: 2.219
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -10.7386 1.788 -6.007 0.000 -14.349 -7.128
x1 0.1695 0.034 4.955 0.000 0.100 0.239
x2 0.5057 0.084 6.006 0.000 0.336 0.676
Omnibus: 4.685 Durbin-Watson: 1.648
Prob(Omnibus): 0.096 Jarque-Bera (JB): 5.379
Skew: 0.016 Prob(JB): 0.0679
Kurtosis: 4.713 Cond. No. 2.04e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.04e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Check Your Assumptions!

The last slide (29) is the most important, because it's easy to get carried away by computation and forget that the calculations are based on some underlying assumptions about the data.