## Why is KL Divergence Positive?

Notebook authored by Prof. Ani Adhikari, with minor modifications for Data 100 by Prof. F. Perez.

In [1]:
import numpy as np

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from ipywidgets import interact


### Preliminary: Jensen's Inequality

Let $g$ be a convex function and $X$ a random variable. Then $E(g(X)) \ge g(E(X))$.

Because $g$ is convex, between any two points $a$ and $b$ its graph lies on or below the secant line through $(a, g(a))$ and $(b, g(b))$.

In [4]:
x0, x1 = 0.05, 3

@interact(a=(2*x0, 1), b=(1.1, 0.95*x1))
def secplot(a, b):

    # Compute -log(x)
    x = np.linspace(x0, x1, 200)
    g = -np.log(x)

    # Compute secant through a,b
    ga, gb = -np.log(a), -np.log(b)
    m = (gb-ga)/(b-a)  # slope
    k = ga - m*a  # intercept

    secant = m*x + k

    # Plot all
    plt.figure(figsize=(10, 7))
    plt.plot(x, g, lw=2, color='darkblue', label=r'$g(X) = -\log(X)$')
    plt.plot(x, secant, lw=2, color='red', label=f'Secant through $(a,b) = ({a:.1f},{b:.1f})$')
    plt.scatter([a, b], [ga, gb], s=100, color='orange', edgecolors='black', linewidths=2)
    plt.legend()
    plt.ylim(-1.5, 3.5)
    plt.xlabel('x')
    plt.ylabel('y');


As we saw before, the secant is a linear interpolant that can be written as

$$t g(a) + (1-t) g(b)$$

for $t \in [0,1]$. The value of the function at the corresponding point between $a$ and $b$ is

$$g(t a + (1-t)b)$$

so our convexity condition is that

$$t g(a) + (1-t) g(b) \ge g(t a + (1-t)b)$$
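
The two-point condition above can be checked numerically. The following is a minimal sketch, assuming $g(x) = -\log(x)$ as in the plot; the endpoints `a = 0.5` and `b = 2.0` and the helper name `convexity_gap` are illustrative choices, not part of the original notebook.

```python
import numpy as np

def g(x):
    """The convex function used in the plot: g(x) = -log(x)."""
    return -np.log(x)

def convexity_gap(a, b, t):
    """Secant value minus function value at the point t*a + (1-t)*b."""
    secant_value = t * g(a) + (1 - t) * g(b)
    function_value = g(t * a + (1 - t) * b)
    return secant_value - function_value

# Illustrative endpoints (assumed values inside the plotted range)
a, b = 0.5, 2.0
ts = np.linspace(0, 1, 11)

# The gap should be zero at the endpoints and nonnegative in between
gaps = convexity_gap(a, b, ts)
print(np.round(gaps, 4))
```

A nonnegative gap for every $t$ is exactly the statement that the secant lies on or above the graph between $a$ and $b$.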

The value of the secant is a weighted average of the function's values at $a$ and $b$, which makes the convexity condition a statement of Jensen's inequality for a random variable that takes only two values: $a$ with probability $t$ and $b$ with probability $1-t$. If we generalize this to a weighted average over all the points in the $(a,b)$ interval, weighted by a probability distribution, we end up with the full version of Jensen's inequality, stated as

$$E[g(X)] \ge g(E[X]).$$
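
As a quick sanity check of the general statement, here is a small sketch that compares both sides of the inequality for a discrete random variable. The five values and their probabilities are randomly generated for illustration only (an assumption, not data from the notebook), and $g(x) = -\log(x)$ is used again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete random variable X:
# five positive values with randomly drawn probabilities that sum to 1
values = rng.uniform(0.1, 3.0, size=5)
probs = rng.dirichlet(np.ones(5))

def g(x):
    """Convex function g(x) = -log(x)."""
    return -np.log(x)

lhs = np.sum(probs * g(values))   # E[g(X)]
rhs = g(np.sum(probs * values))   # g(E[X])

print(f"E[g(X)] = {lhs:.4f}, g(E[X]) = {rhs:.4f}")
```

Changing the seed or the number of values should never produce a case where `lhs` is smaller than `rhs`.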

### The Sign of the Kullback-Leibler Divergence

I'll work in the discrete case. In the continuous case replace probabilities by densities and sums by integrals.

\begin{align*}
D_{KL}(p || q) ~ &= ~ E_{p}\left( \log\left( \frac{p(X)}{q(X)} \right) \right) \\
&= ~ E_{p}\left( -\log\left( \frac{q(X)}{p(X)} \right) \right) \\
&\ge ~ -\log \left( E_{p}\left( \frac{q(X)}{p(X)} \right) \right) ~~~ \text{by Jensen, because } g(x) = -\log(x) \text{ is a convex function} \\
&= ~ -\log \left( \sum_{\text{all } x} \frac{q(x)}{p(x)} \cdot p(x) \right) \\
&= ~ -\log \left( \sum_{\text{all } x} q(x) \right) \\
&= ~ -\log(1) \\
&= ~ 0
\end{align*}
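
The result can also be checked empirically. The sketch below assumes discrete distributions stored as NumPy arrays, with $q(x) > 0$ wherever $p(x) > 0$; the function name `kl_divergence` and the randomly drawn distributions are illustrative assumptions rather than part of the original notebook.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability arrays.

    Assumes q(x) > 0 wherever p(x) > 0. Terms with p(x) = 0 contribute 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return np.sum(p[support] * np.log(p[support] / q[support]))

rng = np.random.default_rng(42)

# Draw 1000 random pairs of distributions on six outcomes
divergences = [
    kl_divergence(rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6)))
    for _ in range(1000)
]
print("smallest divergence:", min(divergences))

# The divergence of a distribution from itself is zero
p = np.array([0.2, 0.3, 0.5])
print("D_KL(p || p) =", kl_divergence(p, p))
```

Every divergence in the sample should be nonnegative, and the only way to reach zero is to compare a distribution with itself.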