Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE`

or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [ ]:

```
NAME = ""
COLLABORATORS = ""
```

In this lab, we will work through the process of:

- Implementing a linear model
- Defining loss functions
- Feature engineering
- Minimizing loss functions using numeric libraries and analytical methods

This lab will continue using the toy tip calculation dataset used in Lab 5.

**This assignment should be completed and submitted by Monday October 1, 2018 at 11:59pm**

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others, please **include their names** at the top of this notebook.

In [ ]:

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")
%matplotlib inline
```

To begin, let's load the tips dataset from the `seaborn`

library. This dataset contains records of tips, total bill, and information about the person who paid the bill. This is the same dataset used in Lab 5, so it should look familiar!

In [ ]:

```
data = sns.load_dataset("tips")
print("Number of Records:", len(data))
data.head()
```

In the previous lab, we defined a simple linear model with only one parameter. Now let's make a more complicated model that utilizes other features in our dataset. Let our prediction for tip be a combination of the following features:

$$ \text{Tip} = \theta_1 \cdot \text{total_bill} + \theta_2 \cdot \text{sex} + \theta_3 \cdot \text{smoker} + \theta_4 \cdot \text{day} + \theta_5 \cdot \text{time} + \theta_6 \cdot \text{size} $$

Notice that some of these features are not numbers! But our linear model will need to predict a numerical value. Let's start by converting some of these non-numerical values into numerical values. Below we split the tips and the features.

In [ ]:

```
tips = data['tip']
X = data.drop(columns='tip')
```

First, let's convert our features to numerical values. A straightforward approach is to map some of these non-numerical features into numerical ones.

For example, we can treat the day as a value from 1-7. However, one of the disadvantages in directly translating to a numeric value is that we unintentionally assign certain features disproportionate weight. Consider assigning Sunday to the numeric value of 7, and Monday to the numeric value of 1. In our linear model, Sunday will have 7 times the influence of Monday, which can lower the accuracy of our model.

Instead, let's use one-hot encoding to better represent these features!

As discussed in lecture, one-hot encoding will produce a binary vector indicating the non-numeric feature. Sunday would be encoded as a [0 0 0 0 0 0 1]. This assigns a more even weight across each category in non-numeric features. Complete the code below to one-hot encode our dataset.

In [ ]:

```
def one_hot_encode(data):
"""
Return the one-hot encoded dataframe of our input, data.
Parameters
-----------
data: a dataframe that may include non-numerical features
Returns
-----------
A one-hot encoded dataframe that only contains numeric features
Hint: Check out the pd.get_dummies function
"""
# YOUR CODE HERE
raise NotImplementedError()
```

Run the cell below to see the transformed dataset named `one_hot_X`

. This dataframe holds our "featurized" data, which is also often denoted by $\phi$.

In [ ]:

```
one_hot_X = one_hot_encode(X)
one_hot_X.head()
```

In [ ]:

```
assert one_hot_X.shape == (244, 12)
# Makes sure all columns are numeric
assert np.all([np.issubdtype(one_hot_X[column].dtype, np.number) for column in one_hot_X])
```

Now that all of our data is numeric, we can begin to define our model function. Notice that after one-hot encoding our data, we now have 12 features instead of 6. Therefore, our linear model now looks like:

$$ \text{Tip} = \theta_1 \cdot \text{size} + \theta_2 \cdot \text{total_bill} + \theta_3 \cdot \text{day_Thur} + \theta_4 \cdot \text{day_Fri} + ... + \theta_{11} \cdot \text{time_Lunch} + \theta_{12} \cdot \text{time_Dinner} $$

We can represent the linear combination above as a matrix-vector product. Implement the `linear_model`

function to evaluate this product.

**Hint**: You can use np.dot, pd.DataFrame.dot, or the `@`

operator to multiply matrices/vectors. However, while the `@`

operator can be used to multiply `numpy`

arrays, it generally will not work between two `pandas`

objects, so keep that in mind when computing matrix-vector products!

In [ ]:

```
def linear_model(thetas, X):
"""
Return the linear combination of thetas and features as defined above.
Parameters
-----------
thetas: a 1D vector representing the parameters of our model ([theta1, theta2, ...])
X: a 2D dataframe of numeric features
Returns
-----------
A 1D vector representing the linear combination of thetas and features as defined above.
"""
# YOUR CODE HERE
raise NotImplementedError()
```

In [ ]:

```
assert linear_model(np.arange(1,5), np.arange(1,5)) == 30
assert (linear_model(2*np.eye(100), np.ones(100)) == 2*np.ones(100)).all()
test_theta = np.array([[1, 2], [3, 4], [5, 6]])
test_x = np.array([[1, 3, 5], [2, 4, 6]])
expected = np.array([[35, 44], [44, 56]])
actual = linear_model(test_theta, test_x)
assert np.array_equal(actual, expected)
```

Recall in the previous lab we defined multiple loss functions and found optimal theta using the scipy.minimize function. Adapt the loss functions and optimization code from the previous lab (provided below) to work with our new linear model.

In [ ]:

```
from scipy.optimize import minimize
def l1(y, y_hat):
return np.abs(y - y_hat)
def l2(y, y_hat):
return (y - y_hat)**2
def minimize_average_loss(loss_function, model, X, y):
"""
Minimize the average loss calculated from using different theta vectors, and
estimate the optimal theta for the model.
Parameters
-----------
loss_function: either the squared or absolute loss functions defined above
model: the model (as defined in Question 1b)
X: a 2D dataframe of numeric features (one-hot encoded)
y: a 1D vector of tip amounts
Returns
-----------
The estimate for the optimal theta vector that minimizes our loss
"""
## Notes on the following function call which you need to finish:
#
# 0. The first '...' should be replaced with the average loss evaluated on
# the data X, y using the model and appropriate loss function.
# 1. x0 are the initial values for THETA. Yes, this is confusing
# but optimization people like x to be the thing they are
# optimizing. Replace the second '...' with an initial value for theta,
# and remember that theta is now a vector. DO NOT hard-code the length of x0;
# it should depend on the number of features in X.
# 2. Your answer will be very similar to your answer to question 2 from lab 5.
# YOUR CODE HERE
raise NotImplementedError()
return minimize(lambda theta: ..., x0=...)['x']
# Notice above that we extract the 'x' entry in the dictionary returned by `minimize`.
# This entry corresponds to the optimal theta estimated by the function.
minimize_average_loss(l2, linear_model, one_hot_X, tips)
```

In [ ]:

```
expected_l1 = np.array([0.1021857, 0.08590592, 0.1181548, 0.28102618, 0.13510419, 0.2640767,
-0.0567563, 0.20579663, -0.03585098, 0.28599161, 0.17696859, 0.22221226])
expected_l2 = np.array([0.094487, 0.1759912, 0.18411085, 0.21655235, 0.15712761, 0.24353555,
0.01519972, 0.17746092, 0.05600946, 0.15199343, 0.23440129, 0.1662616])
actual_l1 = np.array(minimize_average_loss(l1, linear_model, one_hot_X, tips))
actual_l2 = np.array(minimize_average_loss(l2, linear_model, one_hot_X, tips))
# Check if your values for theta predict similar tips as our values for theta do
assert np.isclose(one_hot_X @ expected_l1, one_hot_X @ actual_l1, rtol=0.1).all()
assert np.isclose(one_hot_X @ expected_l2, one_hot_X @ actual_l2, rtol=0.001).all()
```

Let's also fit our model analytically, for the l2 loss function. In this question we will derive an analytical solution, fit our model and compare our results with our numerical optimization results.

Recall that if we're fitting a linear model with the l2 loss function, we are performing least squares! Remember, we are solving the following optimization problem for least squares:

$$\min_{\theta} ||X\theta - y||^2$$

Let's begin by deriving the analytic solution to least squares. We begin by expanding out the l2 norm and multiplying out all of the terms.

Math | Explanation |
---|---|

$$||X\theta - y||^2 = (X\theta - y)^T (X\theta - y)$$ | Expand l2 norm using the definition for matrices. |

$$ = (\theta^T X^T - y^T) (X\theta - y)$$ | Distribute the transpose operator. Remember that $(AB)^T = B^TA^T$. |

$$ = \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y$$ | Multiply out all of the terms (FOIL). |

$$ = \theta^T X^T X \theta - 2\theta^T X^T y + y^T y$$ | The two middle terms are both transposes of each other, and they are both scalars (since we have a 1xn row vector times an nxn matrix times an nx1 column vector). Since the transpose of a scalar is still the same scalar, we can combine the two middle terms. |

Whew! Now that we have everything expanded out and somewhat simplified, let's take the gradient of the expression above and set it to the zero vector. This will allow us to solve for the optimal $\theta$ that will minimize our loss.

Math | Explanation |
---|---|

$$\nabla_\theta (\theta^T X^T X \theta) - \nabla_\theta(2\theta^TX^T y) + \nabla_\theta(y^T y) = \vec{0}$$ | Let's take derivatives one term at a time. |

$$(X^T X + (X^T X)^T)\theta - \nabla_\theta(2\theta^TX^T y) + \nabla_\theta(y^T y) = \vec{0}$$ | For the first term, we use the identity $\frac{\partial}{\partial x} x^T A x = (A + A^T)x$. |

$$(X^T X + (X^T X)^T)\theta - 2X^T y + \nabla_\theta(y^T y) = \vec{0}$$ | For the second term, we use the identity $\frac{\partial}{\partial x} x^T A = A$. |

$$(X^T X + (X^T X)^T)\theta - 2X^T y + \vec{0} = \vec{0}$$ | The last derivative is the easiest, since $y^T y$ does not depend on $\theta$. |

$$2X^T X\theta = 2X^T y$$ | Notice that $(X^T X)^T = X^T X$, so we can combine the $X^T X$ terms into $2X^TX$. We also move $2X^Ty$ to the right side of the equation. |

$$\theta = (X^T X)^{-1} X^T y$$ | Divide by 2 on both sides, then left-multiply by $(X^T X)^{-1}$ on both sides to solve for $\theta$. |

Now that we have the analytic solution for $\theta$, let's find the optimal numerical thetas for our tips dataset. Fill out the function below.

Hints:

- Use the
`np.linalg.inv`

function to compute matrix inverses - To compute the transpose of a matrix, you can use
`X.T`

or`X.transpose()`

In [ ]:

```
def get_analytical_sol(X, y):
"""
Computes the analytical solution to our least squares problem
Parameters
-----------
X: a 2D dataframe of numeric features (one-hot encoded)
y: a 1D vector of tip amounts
Returns
-----------
The estimate for theta computed using the equation mentioned above
"""
# YOUR CODE HERE
raise NotImplementedError()
```

In [ ]:

```
analytical_thetas = get_analytical_sol(one_hot_X, tips)
print("Our analytical loss is: ", l2(linear_model(analytical_thetas, one_hot_X),tips).mean())
print("Our numerical loss is: ", l2(linear_model(minimize_average_loss(l2, linear_model, one_hot_X, tips), one_hot_X), tips).mean())
```

In [ ]:

```
assert np.isclose(l2(linear_model(analytical_thetas, one_hot_X),tips).mean(), 328.61824697, rtol=1)
```

Our analytical loss is surprisingly much worse than our numerical loss. Why is this?

YOUR ANSWER HERE

You're done!

Before submitting this assignment, ensure to:

- Restart the Kernel (in the menubar, select Kernel->Restart & Run All)
- Validate the notebook by clicking the "Validate" button

Finally, make sure to **submit** the assignment via the Assignments tab in Datahub