# Models and Estimation

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.simplefilter('ignore', FutureWarning) # Seaborn triggers warnings in scipy
%matplotlib inline

In [2]:
# Configure nice plotting defaults - (this must be done in a cell separate
# from %matplotlib call)
plt.style.use('seaborn')
sns.set_context('talk', font_scale=1.4)
plt.rcParams['figure.figsize'] = (10, 7)


We're also going to use plotly for interactive plots.

In [3]:
import plotly.offline as py
py.init_notebook_mode(connected=True) # True if online -> smaller notebooks without plotly.js

import plotly.graph_objs as go
import plotly.figure_factory as ff



# The Tips Dataset

A waiter maintained a record of his tips, total bill, and information about the person who paid the bill.

In [4]:
data = sns.load_dataset("tips")
print("Number of Records:", len(data))

Number of Records: 244

Out[4]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

### Why did the waiter collect this information?

Perhaps the waiter was interested in:

1. Predicting the tip as a function of the bill, time of day, day of the week, or information about the table.
2. Inferring the relationships between tips and other factors. For example, does the percent tip decrease or increase with bill size?

We build models to help answer both of these questions.

# EDA: Tip Distribution

Let's take a closer look at the tip data.

In [5]:
sns.distplot(data['tip'], rug=True)
plt.xlabel("Tip in Dollars")
plt.savefig("tip_in_dollars.pdf")


### What do we observe?

1. The distribution is skewed right.
2. There are a few large tip values (perhaps big bills?)

# EDA: Derived Variable Percent Tip

In many countries it is common to tip a percentage of the bill.

$$\Large \texttt{pcttip} = \frac{\texttt{tip}}{\texttt{totalbill}} * 100$$

However, the exact percentage varies by region, dining environment, and service quality.

Question: What percentage tip does this waiter receive? This will be our first estimation goal.

We will extend our data with an additional derived variable containing the percent tip:

In [7]:
data['pcttip'] = data['tip']/data['total_bill'] * 100


Examining its distribution, we see:

In [8]:
sns.distplot(data['pcttip'], rug=True)
plt.xlabel("Percent Tip")
plt.savefig("percent_tip.pdf")


There are some large outliers skewing the plot. What might explain them?

In [9]:
text = ["bill $" + str(bill) + ", tip $" + str(tip) for
        (bill, tip) in data[['total_bill', 'tip']].itertuples(index=False)]
py.iplot(ff.create_distplot([data['pcttip']], group_labels=["Percent Tip"],
                            rug_text=[text]))


We can try to remove the outliers to get a better picture.

In [10]:
pcttips = data['pcttip']
# Removing data points that are greater than the 99% quantile
sns.distplot(pcttips[pcttips < pcttips.quantile(0.99)], rug=True)
plt.xlabel("Percent Tip")
plt.savefig("percent_tip.pdf")

In [11]:
pcttips = data['pcttip']
# Same distribution, interactive; again removing points above the 0.99 quantile
py.iplot(ff.create_distplot([pcttips[pcttips < pcttips.quantile(0.99)]],
                            group_labels=["Percent Tip"], bin_size=.5))


### Observations?

1. This distribution is more symmetric than before (perhaps slightly skewed to the right) and appears to be centered around $15\%$, which is the customary amount to tip in the United States.
2. How many modes do we see? One, maybe two? Is this something we would expect (15% and 20% tips)?

# Defining the Model

Let's begin with the goal of estimating the percentage tip that this waiter receives. The percentage tip could depend on many factors, from the time of day to the characteristics of the patrons and the quality of the service.

However, let's start simple by ignoring these factors and defining the percentage tip as a constant value:

$$\Large \text{percentage tip} = \theta^*$$

Here the parameter $\theta^*$ is the true tip rate. We don't know this value, but we believe it exists. The $*$ indicates that $\theta^*$ is the unknown quantity determined by the universe.

Is this model correct? Maybe ... probably not. Still, this model likely provides reasonable predictions about future tips and may offer some insight into common tipping practices.

How do we estimate the value for $\theta^*$ from our data? This is the fundamental question of estimation.

## How do we estimate the parameter $\theta^*$ of our tip model?

There are actually many ways we could estimate this parameter.

1. Use $15.0$, since that's what Wikipedia suggests.
2. Pick a data point at random.
3. Look at the histogram and choose the mode as our estimate.
4. Choose the mean, the balance point of the histogram, as our estimate.
5. Choose the median as our estimate.

Which of these strategies is best? It depends on our goals.
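Two of these strategies are one-liners. Here is a minimal sketch, on a small hypothetical sample of tip percentages rather than the waiter's actual data:

```python
import numpy as np

# Hypothetical tip percentages, used only for illustration
sample = np.array([13.0, 15.5, 16.0, 18.2, 14.7, 22.0, 15.0])

mean_est = sample.mean()        # balance point of the histogram
median_est = np.median(sample)  # middle value; robust to outliers
print(mean_est, median_est)
```

Note how the single large value (22.0) pulls the mean above the median; this sensitivity to outliers is exactly what distinguishes the loss functions we examine below.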

# Optimizing an Objective Function

One method to estimate the value of the parameter $\theta^*$ is to define an objective function that characterizes the quality of a candidate value $\theta$ with respect to the data. We can then choose the parameter $\theta$ that is best with respect to this objective function.

How do we define an objective function?

Perhaps a natural objective function would penalize values of $\theta$ that are inconsistent with the data.

# Loss Functions

At the core of most estimation procedures is a loss function, which measures how well a model fits the data. Suppose we observe a single tip percentage $y$ and our model proposes $\theta$ as a guess for $\theta^*$. Then any of the following might be an appropriate loss function:

1. Squared Loss (also known as the $L^2$ loss pronounced "ell-two"): $$\Large L\left(\theta, y \right) = \left( y - \theta \right)^2$$
2. Absolute Loss (also known as the $L^1$ loss pronounced "ell-one"): $$\Large L\left(\theta, y \right) = \left| y - \theta \right|$$
3. Huber Loss (combines both Squared and Absolute Loss): $$\Large L_\alpha\left( \theta, y \right) = \begin{cases} \frac{1}{2}\left( y - \theta \right)^2 & \left| y - \theta \right| < \alpha \\ \alpha \left(\left| y - \theta \right| - \frac{\alpha}{2} \right) & \text{otherwise}  \end{cases}$$

These loss functions effectively measure the error in our estimate relative to the data. Let's examine how they work on real observations.

## Visualizing the Loss

Suppose we observe a single tip percentage of $16\%$:

In [12]:
y_obs = 16


We will try a range of possible $\theta$ values:

In [13]:
thetas = np.linspace(0, 50, 200)


### Squared Loss

$$\Large L\left(\theta, y \right) = \left( y - \theta \right)^2$$

In [14]:
def squared_loss(est, y_obs):
    return (est - y_obs)**2

In [15]:
thetas = np.linspace(0, 50, 200)
loss = squared_loss(thetas, y_obs)
plt.plot(thetas, loss, label="Squared Loss")
plt.vlines(y_obs, -100, 0, colors="r", label="Observation")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend(loc=4)
plt.savefig("l2_plt.pdf")


From the above plot we see that the value for $\theta$ that minimizes the loss corresponds to $\theta = y = 16\%$. Let's look at the other loss functions:

### Absolute Loss

$$\Large L\left(\theta, y \right) = \left| y - \theta \right|$$

In [16]:
def abs_loss(est, y_obs):
    return np.abs(y_obs - est)

In [17]:
thetas = np.linspace(0, 50, 200)
loss = abs_loss(thetas, y_obs)
plt.plot(thetas, loss, label="Abs Loss")
plt.vlines(y_obs, -5, -2,colors="r", label="Observation")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend()
plt.savefig("l1_plt.pdf")


Again we see that the loss function is minimized when $\theta = y = 16\%$.

### Huber Loss

$$\Large L_\alpha\left( \theta, y \right) = \begin{cases} \frac{1}{2}\left( y - \theta \right)^2 & \left| y - \theta \right| < \alpha \\ \alpha \left(\left| y - \theta \right| - \frac{\alpha}{2} \right) & \text{otherwise} \end{cases}$$

In [18]:
def huber_loss(est, y_obs, alpha=1):
    d = abs_loss(est, y_obs)
    return np.where(d < alpha,
                    squared_loss(est, y_obs) / 2.0,
                    alpha * (d - alpha / 2.0))
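At the transition point $|y - \theta| = \alpha$, both branches of the Huber loss evaluate to $\alpha^2/2$, so the function is continuous there. A standalone sanity check (redefining the losses above so it runs on its own):

```python
import numpy as np

def squared_loss(est, y_obs):
    return (est - y_obs)**2

def abs_loss(est, y_obs):
    return np.abs(y_obs - est)

alpha, y = 5.0, 16.0
theta = y - alpha  # an estimate exactly alpha away from the observation

quad_branch = squared_loss(theta, y) / 2.0               # (1/2)(y - theta)^2
lin_branch = alpha * (abs_loss(theta, y) - alpha / 2.0)  # alpha(|y - theta| - alpha/2)
print(quad_branch, lin_branch)  # both 12.5 = alpha^2 / 2
```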

In [19]:
thetas = np.linspace(0, 50, 200)
loss = huber_loss(thetas, y_obs, alpha=5)
plt.plot(thetas, loss, label="Huber Loss")
plt.vlines(y_obs, -20, -5,colors="r", label="Observation")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend()
plt.savefig('huber_loss.pdf')


## An interactive exploration of the Huber loss function

Let's define `huberf` as the Huber loss as a function of $y$ alone, fixing $\theta = 0$:

In [20]:
def huberf(y, α=1):
    ay = abs(y)
    return np.where(ay < α, (y**2) / 2, α * (ay - α/2))

In [21]:
from ipywidgets import interact

y = np.linspace(-10, 10)
h1 = huberf(y, α=1)
y2 = .5*y**2

@interact
def _(α=(0, 8, .25)):
    plt.plot(y, h1, label=r"$\alpha = 1$")
    plt.plot(y, huberf(y, α=α), label=rf"$\alpha = {α:.2f}$")
    plt.plot(y, y2, label=r"$\frac{1}{2} y^2$")
    plt.legend(loc='upper center')
    plt.axvline(-α, color='r', lw=1, ls='-.')
    plt.axvline( α, color='r', lw=1, ls='-.')
    plt.ylim(0, 20)


### Comparing all the loss functions at once

The following plot shows all three loss functions.

In [22]:
thetas = np.linspace(0, 50, 200)
plt.plot(thetas, squared_loss(thetas, y_obs), label="Squared Loss")
plt.plot(thetas, abs_loss(thetas, y_obs), color='k', label="Abs Loss")
plt.plot(thetas, huber_loss(thetas, y_obs, alpha=2), color='m', label="Huber Loss")
plt.vlines(y_obs, -10, -5, colors="r", label="Observation")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.ylim([-20, 50])
plt.legend()
plt.savefig("loss_functions_compared.pdf")


### What do we observe?

1. All the loss functions have the same minimizer.
2. Different shapes around the minimum:
    1. Abs loss is pointy.
    2. Huber and Squared Loss are smooth.
3. Different behavior far away from the minimum:
    1. Squared loss penalizes errors more than the Abs or Huber loss.
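The difference in tail behavior is easy to check numerically. For an estimate 30 percentage points from the observation, with $\alpha = 2$ for the Huber loss, a quick sketch:

```python
err = 30.0    # |y - theta|, the distance from the observation
alpha = 2.0

print(err**2)                     # squared loss: 900.0, grows quadratically
print(abs(err))                   # absolute loss: 30.0, grows linearly
print(alpha * (err - alpha / 2))  # Huber loss: 58.0, also linear far from the minimum
```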

# Average Loss of the Data

The loss functions we have examined so far define the loss of the model with respect to a single data point. How can we extend this notion of loss to the entire dataset?

A simple and intuitive way to extend the loss to the entire dataset is to compute the average loss over all the data points. More formally, let the dataset $\mathcal{D}$ be the set of observations:

$$\Large \mathcal{D} = \{y_1, \ldots, y_n\}$$

where each $y_i$ corresponds to a single tip percentage and there are $n=244$ observations in our dataset.

Then we can define the average loss over the dataset as:

$$\Large L\left(\theta, \mathcal{D}\right) = \frac{1}{n} \sum_{i=1}^n L(\theta, y_i)$$
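In code, the average loss is just a mean over the per-point losses. A minimal sketch with the squared loss on a small hypothetical sample (the real observations live in `data['pcttip']`):

```python
import numpy as np

def squared_loss(est, y_obs):
    return (est - y_obs)**2

ys = np.array([13.0, 15.5, 16.0, 18.2])  # hypothetical tip percentages
theta = 15.0

# Average loss L(theta, D) = (1/n) * sum_i L(theta, y_i)
avg = np.mean(squared_loss(theta, ys))
print(avg)
```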

## Plotting the Average Loss

In [23]:
def avg_loss(loss, est, data):
    return np.mean(np.array([loss(est, y_obs) for y_obs in data]), axis=0)


### Average Squared Loss

In [24]:
thetas = np.linspace(0, 50, 200)
loss = avg_loss(squared_loss, thetas, data['pcttip'])
plt.plot(thetas, loss, label="Squared Loss")
plt.vlines(data['pcttip'], -100, -30, colors="r", linewidth=0.4, label="Observations")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend(loc=5)
print("Minimizing Theta", thetas[np.argmin(loss)])
plt.savefig("l2_avg_loss.pdf")

Minimizing Theta 16.08040201005025


### Average Absolute Loss

In [25]:
thetas = np.linspace(0, 50, 200)
loss = avg_loss(abs_loss, thetas, data['pcttip'])
plt.plot(thetas, loss, label="Absolute Loss")
plt.vlines(data['pcttip'], -5, -2, colors="r", linewidth=0.4, label="Observations")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend(loc=5)
print("Minimizing Theta", thetas[np.argmin(loss)])
plt.savefig("l1_avg_loss.pdf")

Minimizing Theta 15.57788944723618


### Average Huber Loss

In [26]:
thetas = np.linspace(0, 50, 200)
loss = avg_loss(huber_loss, thetas, data['pcttip'])
plt.plot(thetas, loss, label="Huber Loss")
plt.vlines(data['pcttip'], -5, -2, colors="r", linewidth=0.4, label="Observations")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend(loc=5)
print("Minimizing Theta", thetas[np.argmin(loss)])
plt.savefig("huber_avg_loss.pdf")

Minimizing Theta 15.57788944723618

In [27]:
thetas = np.linspace(0, 50, 200)
a = plt.plot(thetas, avg_loss(squared_loss, thetas, data['pcttip'])/10, label="Squared Loss / 10")
plt.vlines(thetas[np.argmin(avg_loss(squared_loss, thetas, data['pcttip']))], 0, 20, linestyle='--',
           color=a[0].get_color())
a = plt.plot(thetas, avg_loss(abs_loss, thetas, data['pcttip']), label="Absolute Loss")
plt.vlines(thetas[np.argmin(avg_loss(abs_loss, thetas, data['pcttip']))], 0, 20,
           color=a[0].get_color(), linewidth=4)
a = plt.plot(thetas, avg_loss(huber_loss, thetas, data['pcttip']), label="Huber Loss")
plt.vlines(thetas[np.argmin(avg_loss(huber_loss, thetas, data['pcttip']))], 0, 20, linestyle='--',
           color=a[0].get_color(), linewidth=2)

plt.vlines(data['pcttip'], -5, -2, colors="r", linewidth=0.4, label="Observations")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend(loc=5)
plt.ylim([-7,15])
plt.xlim([5,30])
plt.savefig("all_loss.pdf")


## What do we observe?

1. The Absolute loss appears smoother. This is due to averaging. If we zoom in we see that it is still pointy.
2. The minimizing $\theta$ values are not all the same. Why?!
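One answer to the "Why?": the average squared loss is minimized by the sample mean, while the average absolute loss is minimized by the sample median, and these generally differ on skewed data. A sketch on a small hypothetical sample with one large value:

```python
import numpy as np

ys = np.array([12.0, 14.0, 15.0, 16.0, 43.0])  # hypothetical sample, one outlier
thetas = np.linspace(0, 50, 5001)

avg_sq = np.array([np.mean((ys - t)**2) for t in thetas])
avg_abs = np.array([np.mean(np.abs(ys - t)) for t in thetas])

best_sq = thetas[np.argmin(avg_sq)]    # minimizer of the average squared loss
best_abs = thetas[np.argmin(avg_abs)]  # minimizer of the average absolute loss
print(best_sq, ys.mean())       # the mean, 20.0
print(best_abs, np.median(ys))  # the median, 15.0
```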

In the following plot we sample 5 data points at random and plot the various loss functions. A sample of 5 points is used to make it easier to visualize the corners in the absolute loss.

In [28]:
thetas = np.linspace(0, 50, 200)
np.random.seed(42) # seed the generator so plot doesn't change
tmp = data['pcttip'].sample(5)
loss_top = 20 # Need a top for the red lines
# Add the data points to the candidate thetas to enable visualization of corners
theta_sharp = np.sort(np.hstack((thetas, tmp)))

sq = avg_loss(squared_loss, theta_sharp, tmp)

lines = [  # Plot the loss functions
    go.Scatter(x=theta_sharp, y=sq - sq.min(), name='Squared Loss (shifted)'),
    go.Scatter(x=theta_sharp, y=avg_loss(abs_loss, theta_sharp, tmp), name='Abs. Loss'),
    go.Scatter(x=theta_sharp, y=avg_loss(huber_loss, theta_sharp, tmp), name='Huber Loss')
] + [  # Plot the red lines corresponding to data points
    go.Scatter(x=[yobs, yobs], y=[0, loss_top],
               mode='lines', showlegend=False,
               line=dict(color='red', width=0.5)) for yobs in tmp
]

# Assemble the plot
py.iplot(go.Figure(data=lines,
                   layout=dict(xaxis=dict(title="Theta Values"),
                               yaxis=dict(range=[0, loss_top]))))


In the above plot observe:

1. The steep slope of the Squared Loss.
2. The corners in the Absolute Loss.
3. The Huber Loss is very similar to the Absolute Loss but smoother.
4. The minimizers of the loss functions seem to align. Why?


## Dealing with Outliers

The differences among the above loss functions largely reduce to their sensitivity to outliers. Recall the distribution of tips (plotted below for interactive exploration).
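To make this sensitivity concrete: one extreme point drags the squared-loss minimizer (the mean) a long way, while the absolute-loss minimizer (the median) barely moves. A sketch on hypothetical data:

```python
import numpy as np

base = np.array([14.0, 15.0, 15.5, 16.0, 17.0])  # hypothetical tip percentages
with_outlier = np.append(base, 70.0)             # add one huge percent tip

# Squared-loss minimizer is the mean; absolute-loss minimizer is the median
print(base.mean(), with_outlier.mean())          # 15.5 -> about 24.58
print(np.median(base), np.median(with_outlier))  # 15.5 -> 15.75
```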

In [29]:
py.iplot(ff.create_distplot([data['pcttip']], group_labels=["Percent Tip"],
                            rug_text=[["$" + str(s) for s in data['total_bill'].values]]))

### Role of each data point in the loss

Below we plot the contribution of each data point to the loss function (holding $\theta = 15\%$).

In [30]:
yobs = np.sort(data['pcttip'])
theta_est = 15.0

plt.plot(yobs, squared_loss(yobs, theta_est) / np.sum(squared_loss(yobs, theta_est)), '-*', label='Squared Loss')
plt.plot(yobs, abs_loss(yobs, theta_est) / np.sum(abs_loss(yobs, theta_est)), '-*', label='Absolute Loss')
plt.plot(yobs, huber_loss(yobs, theta_est) / np.sum(huber_loss(yobs, theta_est)), '-*', label='Huber Loss')
plt.xlabel(r"Tip as % of total bill")
plt.ylabel(r"Proportion of the Average Loss")
plt.legend()
# plt.ylim([-7,15])
# plt.xlim([5,30])
plt.savefig("prop_loss.pdf")