## Scipy has Optimization Package¶

from scipy.optimize import minimize

minimize(loss, x0 = ...)

So... let's call it a day?

## Overview¶

• Loss Minimization Framework
• Application to Recommender Systems
• Optimization Problems
• GD for Recommender Systems

## Estimation by Loss Minimization - a Recipe¶

• Design a model
• Choose loss function
• Minimize loss function

## Example - Summarize data with a single number¶

• Observe: numeric data $X_1,\dots, X_n$.
• Design a model: a single numeric summary $\theta$.
• Choose loss function: $\mathbb{L}_2$-loss, or sum-of-square deviations $\ell(\theta) = \sum_{i=1}^n (X_i-\theta)^2$.
• Minimize loss function: $0 = \left.\frac{\text{d}\ell}{\text{d}\theta}\right|_{\theta = \widehat\theta} = \sum_{i=1}^n2(\widehat\theta - X_i)~$ $\Longrightarrow \widehat\theta = \frac{1}{n}\sum_{i=1}^n X_i$.

## Example - Recommender System¶

• a company sells items, collects feedback (usually called a rating) from customers (called users).
• rather than listing the most popular items, the company wants to make personalized recommendations
• the cost of errors varies across applications

<img src="nytrecs.png", width=50%>

• Observe: a "rating" $r_{ui}$ by user $u$ on item $i$. A form of feedback:
• $5$-star rating
• Purchase
• Dwell time
• Not clicking "next"
• User $u$ does not rate every item $i$
• some ratings are unobserved
• We (may) observe features $x_i = (x_{i1},x_{i2})$ about each item $i$.
• Design a model: a vector of "weights" $\theta = (\theta_1,\theta_2)$ such that $$r_{ui} \approx x_i\cdot \theta = x_{i1}\theta_1 + x_{i2}\theta_2.$$

• Notice the right-hand-side above does not depend on $u$, so this model is not personalized.

• Design a personalized model: a "preference" vector $y_u = (y_{u1},y_{u2})$ for user $u$ such that $$r_{ui} \approx x_i\cdot y_u$$ (different weights for each user)
• Choose loss function: $\mathbb{L}_2$-loss, or sum-of-square deviations $\ell(y_1,\dots,y_u) = \sum_{u,i} (r_{ui}-x_i\cdot y_u)^2$.
• notice the above loss function only sums over pairs $(u,i)$ for which we observe $r_{ui}$
• Minimize loss function (but how?)

<img src="volinsky.png", width=50%>

(Koren, Bell & Volinsky 2009)

## Function Minimization¶

An unconstrained optimization problem is denoted

\begin{align} \min_{x}\,f(x), \end{align} where

• $x$ is a collection of decision variables.
• $f(x)$ is the associated cost.

In our context, we have a loss function $\ell(\theta)$ and want to solve

\begin{align} \min_{\theta}\,\ell(\theta), \end{align}

which is to say we want to find $\widehat\theta$ such that $\ell\left(\widehat\theta\right) \le \ell(\theta)$ for all $\theta$.

## Minimizing Analytically¶

Say $\ell(\theta) = (\theta - 1)^2$.

In [ ]:
plt.figure(figsize=(8,5))
x = np.arange(-5, 5, .01)
plt.plot(x, (x-1)**2, 'b')
plt.plot(1, 0.0, '|', color = 'r',
markersize = 20, markeredgewidth = 4)
plt.xlabel('$\\theta$')
plt.ylabel('$\ell(\\theta)$')
plt.show()

• look at the graph and see $\widehat\theta\approx 1$.
• notice $\ell(\theta) \ge 0$ and $\ell(1) = 0$.
• set derivative to zero. $0 = \left.\frac{\text{d}\ell}{\text{d}\theta}\right|_{\theta = \widehat\theta} = 2(\widehat\theta - 1) \text{ so } \widehat\theta = 1.$

## Minimizing Analytically (cont'd)¶

Now $\ell(\theta) = (\theta - 1)^2 + (\theta+2)^2$

In [ ]:
plt.figure(figsize=(8,5))
theta = np.arange(-4, 4, .01)
a, b = -2, 1
plt.plot(theta, (theta-a)**2, 'k--')
plt.plot(theta, (theta-b)**2, 'k--')
plt.plot(theta, (theta-a)**2+(theta-b)**2, 'b')
plt.plot([a,b], [0]*2, '|', color = 'k',
markersize = 20, markeredgewidth = 4)
plt.plot(0.5*(a+b), 4.5, '|', color = 'r',
markersize = 20, markeredgewidth = 4)
plt.xlabel('$\\theta$')
plt.ylabel('$\ell(\\theta)$')
plt.show()


## Minimizing Analytically (cont'd)¶

Now $\ell(\theta) = \theta^2(\theta^2-4)^2$

In [ ]:
plt.figure(figsize=(8,5))
x = np.arange(-2.5, 2.5, .01)
plt.plot(x, x**2*(x**2-4)**2, 'b')
plt.plot([-2,-np.sqrt(4/3),0,np.sqrt(4/3),2],
[0, 256/27, 0, 256/27, 0], '|', color='r',
markersize = 20, markeredgewidth = 4)
plt.xlabel('$\\theta$')
plt.ylabel('$\ell(\\theta)$')
plt.show()


Remember that $\frac{\text{d}\ell}{\text{d}\theta} = 0$ does not imply $\theta$ is a minimizer!

(might try $2^{\text{nd}}$ derivative test)

## Convexity Revisited¶

• Assuming convexity, $\ell'(\eta) = 0$ does tell us that $\eta$ is a global minimizer, as we will now show.

## Proof by Picture¶

In [ ]:
plt.figure(figsize=(20,10))

xx = np.linspace(-4,4,50)
yy = 3*(xx+1)**2 + 2

plt.subplot(1,2,1)
plt.plot(xx,yy)
plt.plot(np.linspace(-2,1,50), 3*np.linspace(-2,1,50) + 11,'r-')
plt.plot([-2,1],[5,14],'kx',markersize=20,markeredgewidth=2)
plt.ylim([-20,80])
plt.title("zeroth order condition for convexity") #  $f(\lambda x+(1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$

plt.subplot(1,2,2)
plt.plot(xx,yy)
plt.plot(xx,12*xx+2,'r-')
plt.plot(1,14,'kx',markersize=20,markeredgewidth=2)
plt.ylim([-20,80])
plt.title("first order condition for convexity") # $f(y) \geq f(x) + (y-x)\\frac{df}{dx}(x)$
plt.show()


## Direct Proof¶

• Recall convextiy for the loss $\ell$ means for all $0\le\lambda\le 1$ $$\ell(\lambda\theta + (1-\lambda)\eta) \le \lambda\ell(\theta) + (1-\lambda)\ell(\eta)$$
• Group terms with $\lambda$ in front:

$$\ell(\eta +\lambda(\theta - \eta)) \le \ell(\eta) + \lambda(\ell(\theta) - \ell(\eta)).$$

• Subtract $\ell(\eta)$ and divide by $\lambda$:

$$\frac{\ell(\eta +\lambda(\theta - \eta)) - \ell(\eta)}{\lambda} \le \ell(\theta) - \ell(\eta).$$

• Make ratio look like a derivative:

$$\underbrace{\frac{\ell(\eta +\lambda(\theta - \eta)) - \ell(\eta)}{\lambda(\theta - \eta)}}_{\rightarrow \ell'(\eta)} (\theta - \eta) \le (\ell(\theta) - \ell(\eta)).$$

• Hence (if $\ell$ is convex and differentiable)

$$\ell(\eta) + \ell'(\eta)(\theta - \eta) \le \ell(\theta).$$

• If $\ell(\theta)$ is convex, it lies above its tangent approximation everywhere.

## Why convexity? Recap¶

• We showed if $\ell$ is convex and differentiable, then for all $\eta,\theta$, $$\ell(\eta) + \ell'(\eta)(\theta - \eta) \le \ell(\theta).$$
• In particular, if $\ell'(\eta) = 0$ (i.e. $\eta$ is a critical point), then $$\ell(\eta) \le \ell(\theta) \text{ for all } \theta.$$
• Checking that $\ell'(\eta) = 0$ tells us that $\eta$ is a global minimizer.

## What about when $\theta$ is not a single number?¶

• Example. $\theta = (\theta_1,\theta_2)$ is a vector and $\ell(\theta) = \theta_1^2 + 4\theta_2^2$.
• The gradient is $\nabla\ell(\theta) = \left(\frac{\partial \ell}{\partial \theta_1}, \frac{\partial \ell}{\partial \theta_2}\right) = (2\theta_1, 8\theta_2)$.
In [ ]:
a, b = -3, 3
x = np.arange(a, b, .05)
y = np.arange(a, b, .05)
X, Y = np.meshgrid(x, y)
z = X**2 + 4*Y**2

p2 = Surface(name = "Loss Surface",
x = x,
y = y,
z = z,
colorscale = 'Viridis',
reversescale = True,
showscale = False)

paraboloid = Contour(name = "Loss Contour",
x = x,
y = y,
z = z,
colorscale = 'Viridis',
reversescale = True,
autocontour = True,
xaxis = 'x2',
yaxis = 'y2')

lay = Layout(xaxis = {'range' : [a, b],
'title' : '$\\theta_1$'},
yaxis = {'range' : [a, b],
'title' : '$\\theta_2$'},
scene = {"domain": {'x': [0, 0.48],
'y': [0, 1]},
'xaxis' : {'title' : 'θ1'},
'yaxis' : {'title' : 'θ2'},
'zaxis' : {'title' : 'ℓ(θ1,θ2)'}},
xaxis2= {'domain' : [.52, 1],
'range' : [a, b],
'title' : '$\\theta_1$'},
yaxis2= {'anchor' : 'x2',
'range' : [a, b],
'title' : '$\\theta_2$'},
autosize = False,
width = 1000,
height = 600)

iplot(Figure(data = Data([p2, paraboloid]), layout = lay))


• If $\theta$ is a single variable, $\ell'(\theta) = 0$ means the loss is locally flat around $\theta$.
• If $\theta = (\theta_1,\theta_2)$, the partial derivatives $\frac{\partial \ell}{\partial \theta_j}$ represent the derivative along one coordinate with the other held fixed.
• Hence if the gradient $\nabla \ell(\theta) = \left(\frac{\partial \ell}{\partial \theta_1}, \frac{\partial \ell}{\partial \theta_2}\right) = 0$, the function is locally flat around $\theta$.
• If $\ell$ is convex and $\nabla\ell(\eta) = 0$, then $\eta$ is a global minimizer.

Gradient: steepest ascent, perpendicular to level curves (contours)

In [ ]:
import plotly.figure_factory as ff

a, b = -3, 3
x = np.arange(a, b, .05)
y = np.arange(a, b, .05)

trace1 = Contour(name = "Loss Contour",
x = x,
y = y,
z = z,
colorscale = 'Viridis',
reversescale = True,
showscale = False,
autocontour = True,)

lay = Layout(xaxis = {'range' : [a, b],
'title' : '$\\theta_1$'},
yaxis = {'range' : [a, b],
'title' : '$\\theta_2$'},
width = 600,
height = 600)

xx = np.linspace(a, b, 7)
yy = np.linspace(a, b, 7)
X, Y = np.meshgrid(xx, yy)

trace2 = ff.create_quiver(X, Y, -2*X, -8*Y,
scale = .025,
arrow_scale = .15,)['data'][0]
trace2['marker']['color'] = 'black'
trace2['marker']['line']['width'] = 8
trace2['marker']['size'] = 8

iplot(Figure(data = Data([trace1, trace2]), layout = lay))


• Iterative optimization algorithm: given a candidate minimizer $\theta^{(t)}$ at step $t$, want an improved version $\theta^{(t+1)}$.
• Gradient moves along direction of steepest ascent, so $-\nabla\ell(\theta)$ moves along direction of steepest descent
• So to get from $\theta^{(t)}$ to $\theta^{(t+1)}$, we might as well follow the gradient!
In [ ]:
xx = np.linspace(-4,4,50)
f = lambda x: 3*(x+1)**2 + 2
df = lambda x: 6*(x+1)
yy = f(xx)

alpha = .2

x0 = 1
x1 = x0 - alpha*df(x0)
x2 = x1 - alpha*df(x1)

plt.figure(figsize=(10,10))

plt.plot(xx,yy)
plt.plot(xx,12*xx+2,'r-')

plt.plot(x0,-19,'|k',
markersize = 20, markeredgewidth = 4)
plt.plot(x0,f(x0),'.k',
markersize = 15, markeredgewidth = 4)
plt.annotate('$\\theta^{(0)}$', [x0 - .1, -17])
plt.arrow(x0, -19, -alpha*df(x0), 0,
plt.annotate('$-0.2\ell\'(\\theta^{(0)})$', [x1-1, -17])

plt.axvline(x=-1, ls='--')
plt.ylim([-20,80])
plt.title('$\\ell(\\theta) = 3(\\theta+1)^2+2$')
plt.show()

In [ ]:
plt.figure(figsize=(10,10))

plt.plot(xx,yy)
plt.plot(xx,12*xx+2,'r-')

plt.plot(x1,-19,'|k',
markersize = 20, markeredgewidth = 4)
plt.plot(x1,f(x1),'.k',
markersize = 15, markeredgewidth = 4)
plt.annotate('$\\theta^{(1)}$', [x1 - .1, -17])
plt.arrow(x1, -19, -alpha*df(x1), 0,
plt.annotate('$-0.2\ell\'(\\theta^{(1)})$', [x1+1, -17])

plt.axvline(x = -1, ls='--')
plt.ylim([-20, 80])
plt.title('$\\ell(\\theta) = 3(\\theta+1)^2+2$')
plt.show()

In [ ]:
plt.figure(figsize=(10,10))

plt.plot(xx,yy)
plt.plot(xx,12*xx+2,'r-')

plt.plot(x2,-19,'|k',
markersize = 20, markeredgewidth = 4)
plt.plot(x2,f(x2),'.k',
markersize = 15, markeredgewidth = 4)
plt.annotate('$\\theta^{(2)}$', [x2 - .1, -17])
plt.arrow(x2, -19, -alpha*df(x2), 0,

plt.axvline(x = -1, ls='--')
plt.ylim([-20, 80])
plt.title('$\\ell(\\theta) = 3(\\theta+1)^2+2$')
plt.show()


• Initialize value $\theta^{(0)}$ (zeros, random guess, or output of another method)

• For $t$ from $0$ until convergence, update

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \alpha\nabla \ell(\theta^{(t)})$$

• $\alpha$ is called the learning rate and need not be constant (i.e. use $\alpha_t$).
• For sufficiently small $\alpha$ we will have $\ell(\theta^{(t+1)}) \le \ell(\theta^{(t)})$.
• We might stop once $\nabla\ell(\theta^{(t)})$ is small.
In [ ]:
TOGGLER

In [ ]:
def gradient_descent(theta0, grad, alpha, tol = 1e-4, max_iter = 1e5):
theta_path = [theta0]
i = 0
while(np.linalg.norm(grad(theta_path[-1])) > tol) & (i < max_iter):
i += 1
theta_t = theta_path[-1]
return np.array(theta_path)

In [ ]:
import plotly.figure_factory as ff

a, b = -3, 3
x = np.arange(a, b, .05)
y = np.arange(a, b, .05)

trace1 = Contour(name = "Loss Contour",
x = x,
y = y,
z = z,
colorscale = 'Viridis',
reversescale = True,
showscale = False,
autocontour = True,)

lay = Layout(xaxis = {'range' : [a, b],
'title' : '$\\theta_1$'},
yaxis = {'range' : [a, b],
'title' : '$\\theta_2$'},
width = 600,
height = 600)

def trace2(theta0 = np.array([2.3, 2.5]), alpha = 0.01):
lambda theta: np.array([2*theta[0], 8*theta[1]]),
alpha,
tol = 1e-2)
return Scatter(name = "Theta Path",
x = theta_path[:,0],
y = theta_path[:,1],
mode = "lines+markers")

title = lambda a : '$\\alpha = ' + str(a) + '$'
iplot(Figure(data = [trace1, trace2(alpha = 0.01)], layout = {**lay, **{'title': title(.01)}}))

In [ ]:
iplot(Figure(data = [trace1, trace2(alpha = 0.1)], layout = {**lay, **{'title': title(.1)}}))

In [ ]:
iplot(Figure(data = [trace1, trace2(alpha = 0.23)], layout = {**lay, **{'title': title(.23)}}))


## Automatic Differentiation¶

    import autograd.numpy as np

def loss(theta):
...

grad(loss)

## Example - Recommender System¶

In [ ]:
!pip install surprise

In [ ]:
rnames = ['uid', 'iid', 'rating', 'time']
sep = '\t',
names = rnames)\
.drop(columns = ['time'])

inames = ['iid', 'title', 'release date', 'video release date', 'IMDb URL',
'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance',
'Sci-Fi', 'Thriller', 'War', 'Western']
sep = '|',
names = inames,
encoding = 'iso8859_2')\
.drop(columns = ['release date', 'video release date',
'IMDb URL', 'unknown'])

item_names = items.iloc[:,[0,1]].set_index('iid')

tot_rating = ratings.groupby('iid').sum().drop(columns = ['uid'])
avg_rating = ratings.groupby('iid').mean().drop(columns = ['uid'])
top_titles = avg_rating[(tot_rating > 1300) & (avg_rating > 3.82)].dropna().join(item_names)
inds = list(top_titles.index-1)
top_titles.sort_values('rating', ascending = False)


In this model, both user vectors $y_u$ and item vectors $x_i$ are unobserved: $$r_{ui} \approx x_i\cdot y_u$$

In [ ]:
plt.figure(figsize = (17, 17))
plt.plot(algo.qi[inds, 0], algo.qi[inds, 1], 'k.')
for i in inds:
title = item_names.loc[i+1,'title'][:-6]
if title.strip().endswith(', The'):
title = "The " + title[:-6]
plt.annotate(title, algo.qi[i])
plt.xticks([])
plt.yticks([])
plt.show()


## On Worrying Selectively¶

"Since all models are wrong the scientist
must be alert to what is importantly wrong.
It is inappropriate to be concerned about
mice when there are tigers abroad."
- George E. P. Box, Science and Statistics

## Recap¶

• Loss Minimization as a Recipe for Estimation
• Concepts and Tools from Numerical Optimization
• Application to Recommender Systems