Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [ ]:
NAME = ""
COLLABORATORS = ""


Lab 6: Multiple Linear Regression and Feature Engineering¶

In this lab, we will work through the process of:

1. Implementing a linear model
2. Defining loss functions
3. Feature engineering
4. Minimizing loss functions using numeric libraries and analytical methods

This lab will continue using the toy tip calculation dataset used in Lab 5.

This assignment should be completed and submitted by Monday October 1, 2018 at 11:59pm

Collaboration Policy¶

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you write your solutions individually. If you do discuss the assignments with others, please include their names at the top of this notebook.

In [ ]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")
%matplotlib inline


To begin, let's load the tips dataset from the seaborn library. This dataset contains records of tips, total bill, and information about the person who paid the bill. This is the same dataset used in Lab 5, so it should look familiar!

In [ ]:
data = sns.load_dataset("tips")

print("Number of Records:", len(data))


Question 1: Defining the Model and Feature Engineering¶

In the previous lab, we defined a simple linear model with only one parameter. Now let's make a more complicated model that utilizes other features in our dataset. Let our prediction for tip be a combination of the following features:

$$\text{Tip} = \theta_1 \cdot \text{total_bill} + \theta_2 \cdot \text{sex} + \theta_3 \cdot \text{smoker} + \theta_4 \cdot \text{day} + \theta_5 \cdot \text{time} + \theta_6 \cdot \text{size}$$

Notice that some of these features are not numbers! But our linear model will need to predict a numerical value. Let's start by converting some of these non-numerical values into numerical values. Below we split the tips and the features.

In [ ]:
tips = data['tip']
X = data.drop(columns='tip')


Question 1a: Feature Engineering¶

First, let's convert our features to numerical values. A straightforward approach is to map some of these non-numerical features into numerical ones.

For example, we can treat the day as a value from 1-7. However, one of the disadvantages in directly translating to a numeric value is that we unintentionally assign certain features disproportionate weight. Consider assigning Sunday to the numeric value of 7, and Monday to the numeric value of 1. In our linear model, Sunday will have 7 times the influence of Monday, which can lower the accuracy of our model.

Instead, let's use one-hot encoding to better represent these features!

As discussed in lecture, one-hot encoding will produce a binary vector indicating the non-numeric feature. Sunday would be encoded as a [0 0 0 0 0 0 1]. This assigns a more even weight across each category in non-numeric features. Complete the code below to one-hot encode our dataset.

In [ ]:
def one_hot_encode(data):
"""
Return the one-hot encoded dataframe of our input, data.

Parameters
-----------
data: a dataframe that may include non-numerical features

Returns
-----------
A one-hot encoded dataframe that only contains numeric features

Hint: Check out the pd.get_dummies function
"""
raise NotImplementedError()


Run the cell below to see the transformed dataset named one_hot_X. This dataframe holds our "featurized" data, which is also often denoted by $\phi$.

In [ ]:
one_hot_X = one_hot_encode(X)

In [ ]:
assert one_hot_X.shape == (244, 12)
# Makes sure all columns are numeric
assert np.all([np.issubdtype(one_hot_X[column].dtype, np.number) for column in one_hot_X])


Question 1b: Defining the Model¶

Now that all of our data is numeric, we can begin to define our model function. Notice that after one-hot encoding our data, we now have 12 features instead of 6. Therefore, our linear model now looks like:

$$\text{Tip} = \theta_1 \cdot \text{size} + \theta_2 \cdot \text{total_bill} + \theta_3 \cdot \text{day_Thur} + \theta_4 \cdot \text{day_Fri} + ... + \theta_{11} \cdot \text{time_Lunch} + \theta_{12} \cdot \text{time_Dinner}$$

We can represent the linear combination above as a matrix-vector product. Implement the linear_model function to evaluate this product.

Hint: You can use np.dot, pd.DataFrame.dot, or the @ operator to multiply matrices/vectors. However, while the @ operator can be used to multiply numpy arrays, it generally will not work between two pandas objects, so keep that in mind when computing matrix-vector products!

In [ ]:
def linear_model(thetas, X):
"""
Return the linear combination of thetas and features as defined above.

Parameters
-----------
thetas: a 1D vector representing the parameters of our model ([theta1, theta2, ...])
X: a 2D dataframe of numeric features

Returns
-----------
A 1D vector representing the linear combination of thetas and features as defined above.
"""
raise NotImplementedError()

In [ ]:
assert linear_model(np.arange(1,5), np.arange(1,5)) == 30
assert (linear_model(2*np.eye(100), np.ones(100)) == 2*np.ones(100)).all()
test_theta = np.array([[1, 2], [3, 4], [5, 6]])
test_x = np.array([[1, 3, 5], [2, 4, 6]])
expected = np.array([[35, 44], [44, 56]])
actual = linear_model(test_theta, test_x)
assert np.array_equal(actual, expected)


Question 2: Fitting the Model using Numeric Methods¶

Recall in the previous lab we defined multiple loss functions and found optimal theta using the scipy.minimize function. Adapt the loss functions and optimization code from the previous lab (provided below) to work with our new linear model.

In [ ]:
from scipy.optimize import minimize

def l1(y, y_hat):
return np.abs(y - y_hat)

def l2(y, y_hat):
return (y - y_hat)**2

def minimize_average_loss(loss_function, model, X, y):
"""
Minimize the average loss calculated from using different theta vectors, and
estimate the optimal theta for the model.

Parameters
-----------
loss_function: either the squared or absolute loss functions defined above
model: the model (as defined in Question 1b)
X: a 2D dataframe of numeric features (one-hot encoded)
y: a 1D vector of tip amounts

Returns
-----------
The estimate for the optimal theta vector that minimizes our loss
"""

## Notes on the following function call which you need to finish:
#
# 0. The first '...' should be replaced with the average loss evaluated on
#       the data X, y using the model and appropriate loss function.
# 1. x0 are the initial values for THETA.  Yes, this is confusing
#       but optimization people like x to be the thing they are
#       optimizing. Replace the second '...' with an initial value for theta,
#       and remember that theta is now a vector. DO NOT hard-code the length of x0;
#       it should depend on the number of features in X.
raise NotImplementedError()
return minimize(lambda theta: ..., x0=...)['x']
# Notice above that we extract the 'x' entry in the dictionary returned by minimize.
# This entry corresponds to the optimal theta estimated by the function.

minimize_average_loss(l2, linear_model, one_hot_X, tips)

In [ ]:
expected_l1 = np.array([0.1021857, 0.08590592, 0.1181548, 0.28102618, 0.13510419, 0.2640767,
-0.0567563, 0.20579663, -0.03585098, 0.28599161, 0.17696859, 0.22221226])
expected_l2 = np.array([0.094487, 0.1759912, 0.18411085, 0.21655235, 0.15712761, 0.24353555,
0.01519972, 0.17746092, 0.05600946, 0.15199343, 0.23440129, 0.1662616])
actual_l1 = np.array(minimize_average_loss(l1, linear_model, one_hot_X, tips))
actual_l2 = np.array(minimize_average_loss(l2, linear_model, one_hot_X, tips))

# Check if your values for theta predict similar tips as our values for theta do
assert np.isclose(one_hot_X @ expected_l1, one_hot_X @ actual_l1, rtol=0.1).all()
assert np.isclose(one_hot_X @ expected_l2, one_hot_X @ actual_l2, rtol=0.001).all()


Question 3: Fitting the Model using Analytic Methods¶

Let's also fit our model analytically, for the l2 loss function. In this question we will derive an analytical solution, fit our model and compare our results with our numerical optimization results.

Recall that if we're fitting a linear model with the l2 loss function, we are performing least squares! Remember, we are solving the following optimization problem for least squares:

$$\min_{\theta} ||X\theta - y||^2$$

Let's begin by deriving the analytic solution to least squares. We begin by expanding out the l2 norm and multiplying out all of the terms.

Math Explanation
$$||X\theta - y||^2 = (X\theta - y)^T (X\theta - y)$$ Expand l2 norm using the definition for matrices.
$$= (\theta^T X^T - y^T) (X\theta - y)$$ Distribute the transpose operator. Remember that $(AB)^T = B^TA^T$.
$$= \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y$$ Multiply out all of the terms (FOIL).
$$= \theta^T X^T X \theta - 2\theta^T X^T y + y^T y$$ The two middle terms are both transposes of each other, and they are both scalars (since we have a 1xn row vector times an nxn matrix times an nx1 column vector). Since the transpose of a scalar is still the same scalar, we can combine the two middle terms.

Whew! Now that we have everything expanded out and somewhat simplified, let's take the gradient of the expression above and set it to the zero vector. This will allow us to solve for the optimal $\theta$ that will minimize our loss.

Math Explanation
$$\nabla_\theta (\theta^T X^T X \theta) - \nabla_\theta(2\theta^TX^T y) + \nabla_\theta(y^T y) = \vec{0}$$ Let's take derivatives one term at a time.
$$(X^T X + (X^T X)^T)\theta - \nabla_\theta(2\theta^TX^T y) + \nabla_\theta(y^T y) = \vec{0}$$ For the first term, we use the identity $\frac{\partial}{\partial x} x^T A x = (A + A^T)x$.
$$(X^T X + (X^T X)^T)\theta - 2X^T y + \nabla_\theta(y^T y) = \vec{0}$$ For the second term, we use the identity $\frac{\partial}{\partial x} x^T A = A$.
$$(X^T X + (X^T X)^T)\theta - 2X^T y + \vec{0} = \vec{0}$$ The last derivative is the easiest, since $y^T y$ does not depend on $\theta$.
$$2X^T X\theta = 2X^T y$$ Notice that $(X^T X)^T = X^T X$, so we can combine the $X^T X$ terms into $2X^TX$. We also move $2X^Ty$ to the right side of the equation.
$$\theta = (X^T X)^{-1} X^T y$$ Divide by 2 on both sides, then left-multiply by $(X^T X)^{-1}$ on both sides to solve for $\theta$.

Question 3a: Solving for Theta¶

Now that we have the analytic solution for $\theta$, let's find the optimal numerical thetas for our tips dataset. Fill out the function below.

Hints:

1. Use the np.linalg.inv function to compute matrix inverses
2. To compute the transpose of a matrix, you can use X.T or X.transpose()
In [ ]:
def get_analytical_sol(X, y):
"""
Computes the analytical solution to our least squares problem

Parameters
-----------
X: a 2D dataframe of numeric features (one-hot encoded)
y: a 1D vector of tip amounts

Returns
-----------
The estimate for theta computed using the equation mentioned above
"""
raise NotImplementedError()

In [ ]:
analytical_thetas = get_analytical_sol(one_hot_X, tips)
print("Our analytical loss is: ", l2(linear_model(analytical_thetas, one_hot_X),tips).mean())
print("Our numerical loss is: ", l2(linear_model(minimize_average_loss(l2, linear_model, one_hot_X, tips), one_hot_X), tips).mean())

In [ ]:
assert np.isclose(l2(linear_model(analytical_thetas, one_hot_X),tips).mean(), 328.61824697, rtol=1)


Question 3b: Weird Results?¶

Our analytical loss is surprisingly much worse than our numerical loss. Why is this?