In [1]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.simplefilter('ignore', FutureWarning) # Seaborn triggers warnings in scipy
%matplotlib inline
```

In [2]:

```
# Configure nice plotting defaults - (this must be done in a cell separate
# from %matplotlib call)
plt.style.use('seaborn')
sns.set_context('talk', font_scale=1.4)
plt.rcParams['figure.figsize'] = (10, 7)
```

We're also going to use plotly for interactive plots.

In [3]:

```
import plotly.offline as py
py.init_notebook_mode(connected=True) # connected=True loads plotly.js from a CDN -> smaller notebooks
import plotly.graph_objs as go
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')
```

A waiter maintained a record of his tips, total bill, and information about the person who paid the bill.

In [4]:

```
data = sns.load_dataset("tips")
print("Number of Records:", len(data))
data.head()
```

Out[4]:

Perhaps the waiter was interested in:

- **Predicting** the tip as a function of the bill, time of day, day of the week, or information about the table.
- **Inferring** the relationships between tips and other factors. For example, does the **percent tip** decrease or increase with bill size?

We build models to help answer both of these questions.

In [5]:

```
sns.distplot(data['tip'], rug=True)
plt.xlabel("Tip in Dollars")
plt.savefig("tip_in_dollars.pdf")
```

- The distribution is skewed right.
- There are a few large tip values (perhaps big bills?)
- There appears to be a mode around $2.00. Perhaps there is a tradition of tipping 2 dollars for small meals (e.g., coffee?)
- Everyone provided a tip.

In [6]:

```
sns.distplot(data['total_bill'], rug=True)
plt.xlabel("Total Bill in Dollars")
plt.savefig("bill.pdf")
```

- As with the tips, the distribution is skewed right.
- The distribution also appears much smoother, with a clear single mode around $15.

In many countries it is common to tip a percentage of the bill.

$$\Large \texttt{pcttip} = \frac{\texttt{tip}}{\texttt{totalbill}} * 100 $$

However, the exact percentage varies by region, dining environment, and service quality.

**Question:** What is the percentage tip that this waiter receives? This will be our first estimation goal.

We will extend our data with an additional **derived variable** containing the **percent tip**:

In [7]:

```
data['pcttip'] = data['tip']/data['total_bill'] * 100
```

Examining its distribution we see:

In [8]:

```
sns.distplot(data['pcttip'], rug=True)
plt.xlabel("Percent Tip")
plt.savefig("percent_tip.pdf")
```

There are some **large outliers** that are skewing the plot. Why? What might be an explanation?

In [9]:

```
text = ["bill $" + str(bill) + ", tip $" + str(tip) for
        (bill, tip) in data[['total_bill', 'tip']].itertuples(index=False)]
py.iplot(ff.create_distplot([data['pcttip']], group_labels=["Percent Tip"],
                            rug_text=[text]))
```

We can try to remove the outliers to get a better picture.

In [10]:

```
pcttips = data['pcttip']
# Remove data points above the 0.99 quantile
sns.distplot(pcttips[pcttips < pcttips.quantile(0.99)], rug=True)
plt.xlabel("Percent Tip")
plt.savefig("percent_tip.pdf")
```

In [11]:

```
pcttips = data['pcttip']
# The same truncated distribution, now as an interactive plotly figure
py.iplot(ff.create_distplot([pcttips[pcttips < pcttips.quantile(0.99)]],
                            group_labels=["Percent Tip"], bin_size=.5))
```

- This distribution is more symmetric than before (perhaps slightly skewed to the right) and appears to be centered around $15\%$, which is the customary amount to tip in the United States.
- How many modes do we see? One, and maybe two? Is this something we would expect (15% and 20% tips)?

Let's begin with the goal of **estimating the percentage tip** that this waiter receives. The percentage tip could depend on many factors, including the time of day, the characteristics of the patrons, and the quality of the service.

However, let's start simple by ignoring these factors and defining the percentage tip as a constant value:

$$\Large \text{percentage tip} = \theta^* $$

Here the parameter $\theta^*$ is the true tip rate. We don't know this value, but we believe it exists; the $*$ indicates that it is an unknown quantity. I like to remember that the star $*$ quantity is determined by the universe.

Is this model correct? Maybe ... probably not. However, this model probably provides reasonable predictions about future tips and might provide some insight into common tipping practices.

How do we estimate the value for $\theta^*$ from our data? This is the fundamental question of estimation.

There are actually many ways we could estimate this parameter.

- 15.0 <-- that's what Wikipedia suggests.
- Pick a data point at random.
- Look at the histogram and choose the **mode** for our estimate.
- Choose the **mean** for our estimate. This is the balance point of the histogram.
- Choose the **median** for our estimate.
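Each strategy is simple to compute. Here is a sketch using a small hypothetical sample in place of the `pcttip` column (the mode is approximated by the center of the fullest histogram bin):

```python
import numpy as np
import pandas as pd

# Hypothetical tip percentages standing in for data['pcttip']
pct = pd.Series([13.9, 16.1, 22.8, 11.7, 14.7, 18.6, 15.0, 10.2])

wiki_est = 15.0                                     # Wikipedia's rule of thumb
random_est = pct.sample(1, random_state=0).iloc[0]  # a data point at random
mean_est = pct.mean()                               # balance point of the histogram
median_est = pct.median()                           # middle observation

# Approximate the mode as the center of the fullest histogram bin
counts, edges = np.histogram(pct, bins=5)
k = np.argmax(counts)
mode_est = (edges[k] + edges[k + 1]) / 2

print(mean_est, median_est)  # 15.375 14.85
```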

**Which of these strategies is best?** It depends on our goals.

One method to estimate the value for the parameter $\theta^*$ is to define an objective function which characterizes the quality of our choice of $\theta^*$ with respect to the data. We can then choose the *best* parameter $\theta$ with respect to this objective function.

**How do we define an objective function?**

Perhaps a natural objective function would penalize choices of $\theta$ that are inconsistent with the data.

At the core of most estimation procedures is a **loss function** which measures how well a model fits the data. Suppose we observed a single tip percentage $y$ and our model predicts $\theta$ as our guess for $\theta^*$; then any of the following might be an appropriate **loss function**:

**Squared Loss** (also known as the $L^2$ loss, pronounced "ell-two"):

$$\Large L\left(\theta, y \right) = \left( y - \theta \right)^2 $$

**Absolute Loss** (also known as the $L^1$ loss, pronounced "ell-one"):

$$\Large L\left(\theta, y \right) = \left| y - \theta \right| $$

**Huber Loss** (combines both Squared and Absolute Loss):

$$\Large L_\alpha\left( \theta, y \right) = \begin{cases} \frac{1}{2}\left( y - \theta \right)^2 & \left| y - \theta \right| < \alpha \\ \alpha \left(\left| y - \theta \right| - \frac{\alpha}{2} \right) & \text{otherwise} \end{cases} $$

These loss functions effectively measure the **error** in our estimate relative to the data. Let's examine how they work on real observations.

Suppose we observe a single tip percentage of $16\%$:

In [12]:

```
y_obs = 16
```

We will try a range of possible $\theta$ values:

In [13]:

```
thetas = np.linspace(0, 50, 200)
```

$$\Large L\left(\theta, y \right) = \left( y - \theta \right)^2 $$
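For the squared loss we can also find the minimizer analytically: setting the derivative with respect to $\theta$ to zero gives

$$\Large \frac{\partial}{\partial \theta} \left( y - \theta \right)^2 = -2\left( y - \theta \right) = 0 \quad \Longrightarrow \quad \theta = y $$

so the loss is minimized exactly at the observation.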

In [14]:

```
def squared_loss(est, y_obs):
    return (est - y_obs)**2
```

In [15]:

```
thetas = np.linspace(0, 50, 200)
loss = squared_loss(thetas, y_obs)
plt.plot(thetas, loss, label="Squared Loss")
plt.vlines(y_obs, -100, 0, colors="r", label="Observation")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend(loc=4)
plt.savefig("l2_plt.pdf")
```

From the above plot we see that the value for $\theta$ that minimizes the loss corresponds to $\theta = y = 16\%$. Let's look at the other loss functions:

$$\Large L\left(\theta, y \right) = \left| y - \theta \right| $$

In [16]:

```
def abs_loss(est, y_obs):
    return np.abs(y_obs - est)
```

In [17]:

```
thetas = np.linspace(0, 50, 200)
loss = abs_loss(thetas, y_obs)
plt.plot(thetas, loss, label="Abs Loss")
plt.vlines(y_obs, -5, -2,colors="r", label="Observation")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend()
plt.savefig("l1_plt.pdf")
```

Again we see that the loss function is minimized when $\theta = y = 16\%$.

$$\Large L_\alpha\left( \theta, y \right) = \begin{cases} \frac{1}{2}\left( y - \theta \right)^2 & \left| y - \theta \right| < \alpha \\ \alpha \left(\left| y - \theta \right| - \frac{\alpha}{2} \right) & \text{otherwise} \end{cases} $$

In [18]:

```
def huber_loss(est, y_obs, alpha=1):
    d = abs_loss(est, y_obs)
    return np.where(d < alpha,
                    squared_loss(est, y_obs) / 2.0,
                    alpha * (d - alpha / 2.0))
```

In [19]:

```
thetas = np.linspace(0, 50, 200)
loss = huber_loss(thetas, y_obs, alpha=5)
plt.plot(thetas, loss, label="Huber Loss")
plt.vlines(y_obs, -20, -5,colors="r", label="Observation")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend()
plt.savefig('huber_loss.pdf')
```

Let's define `huberf` as just the loss function, fixing $\theta=0$:

In [20]:

```
def huberf(y, α=1):
    ay = abs(y)
    return np.where(ay < α, (y**2) / 2, α * (ay - α/2))
```

In [21]:

```
from ipywidgets import interact
y = np.linspace(-10, 10)
h1 = huberf(y, α=1)
y2 = .5 * y**2
@interact
def _(α=(0, 8, .25)):
    plt.plot(y, h1, label=r"$\alpha = 1$")
    plt.plot(y, huberf(y, α=α), label=rf"$\alpha = {α:.2f}$")
    plt.plot(y, y2, label=r"$\frac{1}{2} y^2$")
    plt.legend(loc='upper center')
    plt.axvline(-α, color='r', lw=1, ls='-.')
    plt.axvline( α, color='r', lw=1, ls='-.')
    plt.ylim(0, 20)
```

The following plot shows all three loss functions together.

In [22]:

```
thetas = np.linspace(0, 50, 200)
plt.plot(thetas, squared_loss(thetas, y_obs), label="Squared Loss")
plt.plot(thetas, abs_loss(thetas, y_obs), color='k', label="Abs Loss")
plt.plot(thetas, huber_loss(thetas, y_obs, alpha=2), color='m', label="Huber Loss")
plt.vlines(y_obs, -10, -5, colors="r", label="Observation")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.ylim([-20, 50])
plt.legend()
plt.savefig("loss_functions_compared.pdf")
```

- All the loss functions have the same minimizer.
- Different shapes around the minimum:
  - Abs loss is pointy.
  - Huber and Squared Loss are smooth.
- Different behavior far away from the minimum:
  - Squared loss penalizes errors more than the Abs or Huber loss.
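To make the last point concrete, we can evaluate all three losses at an estimate that is off by $10$ (a self-contained sketch that re-defines the loss functions from above):

```python
import numpy as np

def squared_loss(est, y_obs):
    return (est - y_obs) ** 2

def abs_loss(est, y_obs):
    return np.abs(y_obs - est)

def huber_loss(est, y_obs, alpha=1):
    d = abs_loss(est, y_obs)
    return np.where(d < alpha,
                    squared_loss(est, y_obs) / 2.0,
                    alpha * (d - alpha / 2.0))

y, theta = 16, 6  # an estimate that is off by 10
print(squared_loss(theta, y))         # 100 -- grows quadratically
print(abs_loss(theta, y))             # 10  -- grows linearly
print(huber_loss(theta, y, alpha=2))  # 18.0 -- linear far from the minimum
```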

The loss functions we have examined so far define the loss of the model with respect to a single data point. How can we extend this notion of loss to the entire dataset?

A simple and intuitive way to extend the loss to the entire dataset is to compute the average loss over all the data points. More formally, let the dataset $\mathcal{D}$ be the set of observations:

$$\Large \mathcal{D} = \{y_1, \ldots, y_n\} $$

where each $y_i$ corresponds to a single tip percentage and there are $n=244$ observations in our dataset.

Then we can define the average loss over the dataset as:

$$\Large L\left(\theta, \mathcal{D}\right) = \frac{1}{n} \sum_{i=1}^n L(\theta, y_i) $$

In [23]:

```
def avg_loss(loss, est, data):
    return np.mean(np.array([loss(est, y_obs) for y_obs in data]), axis=0)
```

In [24]:

```
thetas = np.linspace(0, 50, 200)
loss = avg_loss(squared_loss, thetas, data['pcttip'])
plt.plot(thetas, loss, label="Squared Loss")
plt.vlines(data['pcttip'], -100, -30, colors="r", linewidth=0.4, label="Observations")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend(loc=5)
print("Minimizing Theta", thetas[np.argmin(loss)])
plt.savefig("l2_avg_loss.pdf")
```

In [25]:

```
thetas = np.linspace(0, 50, 200)
loss = avg_loss(abs_loss, thetas, data['pcttip'])
plt.plot(thetas, loss, label="Absolute Loss")
plt.vlines(data['pcttip'], -5, -2, colors="r", linewidth=0.4, label="Observations")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend(loc=5)
print("Minimizing Theta", thetas[np.argmin(loss)])
plt.savefig("l1_avg_loss.pdf")
```

In [26]:

```
thetas = np.linspace(0, 50, 200)
loss = avg_loss(huber_loss, thetas, data['pcttip'])
plt.plot(thetas, loss, label="Huber Loss")
plt.vlines(data['pcttip'], -5, -2, colors="r", linewidth=0.4, label="Observations")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend(loc=5)
print("Minimizing Theta", thetas[np.argmin(loss)])
plt.savefig("huber_avg_loss.pdf")
```

In [27]:

```
thetas = np.linspace(0, 50, 200)
a = plt.plot(thetas, avg_loss(squared_loss, thetas, data['pcttip'])/10, label="Squared Loss / 10")
plt.vlines(thetas[np.argmin(avg_loss(squared_loss, thetas, data['pcttip']))], 0, 20,
           linestyle='--', color=a[0].get_color())
a = plt.plot(thetas, avg_loss(abs_loss, thetas, data['pcttip']), label="Absolute Loss")
plt.vlines(thetas[np.argmin(avg_loss(abs_loss, thetas, data['pcttip']))], 0, 20,
           color=a[0].get_color(), linewidth=4)
a = plt.plot(thetas, avg_loss(huber_loss, thetas, data['pcttip']), label="Huber Loss")
plt.vlines(thetas[np.argmin(avg_loss(huber_loss, thetas, data['pcttip']))], 0, 20,
           linestyle='--', color=a[0].get_color(), linewidth=2)
plt.vlines(data['pcttip'], -5, -2, colors="r", linewidth=0.4, label="Observations")
plt.xlabel(r"Choice for $\theta$")
plt.ylabel(r"Loss")
plt.legend(loc=5)
plt.ylim([-7, 15])
plt.xlim([5, 30])
plt.savefig("all_loss.pdf")
```

- The Absolute loss appears smoother. This is due to averaging: if we zoom in, we see that it is still pointy.
- **The minimizing $\theta$ values are not all the same.** *Why?!*
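One explanation: the average squared loss is minimized by the **mean**, while the average absolute loss is minimized by the **median**, and on right-skewed data these differ. A numeric sketch with a hypothetical skewed sample:

```python
import numpy as np

# Hypothetical right-skewed sample of tip percentages
y = np.array([12.0, 14.0, 15.0, 16.0, 40.0])
thetas = np.linspace(0, 50, 5001)

avg_sq = np.mean((y[:, None] - thetas) ** 2, axis=0)
avg_abs = np.mean(np.abs(y[:, None] - thetas), axis=0)

# The squared-loss minimizer sits at the mean, pulled up by the outlier;
# the absolute-loss minimizer sits at the median.
print(thetas[np.argmin(avg_sq)], np.mean(y))     # both ~19.4
print(thetas[np.argmin(avg_abs)], np.median(y))  # both ~15.0
```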

In the following plot we sample 5 data points at random and plot the various loss functions. A sample of 5 points is used to make it easier to visualize the corners in the absolute loss.

In [28]:

```
thetas = np.linspace(0, 50, 200)
np.random.seed(42) # seed the generator so the plot doesn't change
tmp = data['pcttip'].sample(5)
loss_top = 20 # Top of the red observation lines
# Add the data points to the candidate thetas to make the corners visible
theta_sharp = np.sort(np.hstack((thetas, tmp)))
lines = [ # Plot the loss functions
    go.Scatter(x=theta_sharp, y=avg_loss(squared_loss, theta_sharp, data['pcttip'])-30,
               name='Squared Loss - 30'),
    go.Scatter(x=theta_sharp, y=avg_loss(abs_loss, theta_sharp, tmp), name='Abs. Loss'),
    go.Scatter(x=theta_sharp, y=avg_loss(huber_loss, theta_sharp, tmp), name='Huber Loss')
] + [ # Plot the red lines corresponding to data points
    go.Scatter(x=[yobs, yobs], y=[0, loss_top],
               mode='lines', showlegend=False,
               line=dict(color='red', width=0.5)) for yobs in tmp
]
# Assemble the plot
py.iplot(go.Figure(data=lines,
                   layout=dict(xaxis=dict(title="Theta Values"),
                               yaxis=dict(range=[0, loss_top]))))
```

In the above plot observe:

- The extremely sharp slope of the Squared Loss.
- The corners in the absolute loss.
- The Huber loss is very similar to the absolute loss, but smoother.
- The minimizers of the loss functions seem to align. Why?


The differences between the above loss functions largely reduce to their sensitivity to outliers. Recall the distribution of tips (plotted below for interactive exploration).
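To see this sensitivity directly, compare how a single extreme value moves the mean versus the median (hypothetical numbers):

```python
import numpy as np

y = np.array([13.0, 14.5, 15.0, 15.5, 16.0])  # hypothetical tip percentages
y_out = np.append(y, 70.0)                     # one extreme outlier added

print(np.mean(y), np.mean(y_out))      # 14.8 -> 24.0: the mean jumps
print(np.median(y), np.median(y_out))  # 15.0 -> 15.25: the median barely moves
```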

In [29]:

```
py.iplot(ff.create_distplot([data['pcttip']], group_labels=["Percent Tip"],
rug_text=[["$" + str(s) for s in data['total_bill'].values]]))
```

Below we plot the contribution of each data point to the total loss (holding $\theta=15\%$):

In [30]:

```
yobs = np.sort(data['pcttip'])
theta_est = 15.0
plt.plot(yobs, squared_loss(yobs, theta_est) / np.sum(squared_loss(yobs, theta_est)), '-*', label='Squared Loss')
plt.plot(yobs, abs_loss(yobs, theta_est) / np.sum(abs_loss(yobs, theta_est)), '-*', label='Absolute Loss')
plt.plot(yobs, huber_loss(yobs, theta_est) / np.sum(huber_loss(yobs, theta_est)), '-*', label='Huber Loss')
plt.xlabel(r"Tip as % of total bill")
plt.ylabel(r"Proportion of the Total Loss")
plt.legend()
plt.savefig("prop_loss.pdf")
```