Relative Entropy for Fun and Profit

## Category: Uncategorized

### Optimization talk at PyData London 2014

I was pleased to be asked to speak at PyData London this year. I gave an overview of some optimization tools in Python which they’ve put up on their Youtube channel.

My talk is here.

Here are the slides (which probably don’t make much sense without the talk), and here are the associated IPython notebooks I used:

### Kullback-Leibler divergence asymmetry

It’s widely known that the KL divergence is not symmetric, i.e.

$\displaystyle D_{KL}(p\Vert q) \neq D_{KL}(q \Vert p)$

But what’s the intuitive picture of how the symmetry fails? Recently I saw Will Penny explain this (at the Free Energy Principle workshop, of which hopefully more later). I am going to “borrow” very liberally from his talk. I think he also mentioned that he was using slides from Chris Bishop’s book, so this material might be well known from there (I haven’t read it).

f <- function(x) { dnorm(x, mean=0, sd=0.5) };
g <- function(x) { dnorm(x, mean=3, sd=2) };
curve(f(x),-6,6, col="red", ylim=c(-1,1));
scale <- 0.05;
curve( scale*(log(g(x))-log(f(x))),-6,6, col="blue", lty=3, add=T);

The plot shows two Gaussians, a lower variance distribution $f$ in red and a wider distribution $g$ in blue. You can also see the (scaled) quantity $\log(\frac{f}{g}) = -\log(\frac{g}{f})$ in red, and its inverse in blue.

The KL divergence $D_{KL}(f\Vert g)$ is the expectation under the red pdf of the red dotted line, and $D_{KL}(g\Vert f)$ is the corresponding expectation for the blue pair. A couple of observations reveal two sources of disagreement between them:

• The KL from $f$ gives little weight to discrepancies outside the narrow region where it has the most probability density; what the broader density $g$ does out here will affect it very little
• The quantity inside the expectations, the difference of logs, is actually anti-symmetric. (Somehow it’s remarkable that we can take two differently positively weighted sums of the same quantity with and without a minus sign, but always guarantee a positive result, since $D_{KL}\geq 0$ !)

## Consequences for optimization

Say we have a multi-modal distribution $p$, like this mixture of two Gaussians:

mu1 <- 4.0;
sd1 <- 0.8;
mu2 <- 1;
sd2 <- 0.5;
curve(0.5*dnorm(x, mean=mu1, sd=sd1) + 0.5*dnorm(x, mean=mu2, sd=sd2),-6,6);

and want to approximate it with a simpler distribution $q$ – we’ll use a single Gaussian – by minimizing the KL divergence.

We can choose to minimize either $D_{KL}(q\Vert p)$ or $D_{KL}(p \Vert q)$. Penny’s claim is that:

• minimizing $D_{KL}(q\Vert p)$ results in matching a Gaussian to one of the modes (the approximating distribution is more compact than the true distribution)
• minimizing $D_{KL}(p\Vert q)$ results in matching moments (our approximating distribution will have the same mean and variance as the mixture, so will be more spread out than either of the two peaks in the true distributions).

In the first case, the approximating density $q$ can choose to “ignore” troublesome parts of the target density $p$ which are difficult to fit by reducing its variance. We’ve seen above that the tails will receive little weight and can’t affect the divergence much.

In the second case, the expectation is taken under the target density $p$, so $q$ has to try to accommodate the whole thing as best it can, multi modes and all.

(Once again David MacKay’s wonderful ITILA book has some interesting material related to this; Chapter 33 on variational free energy minimization.)

I wanted to try this optimization out. The KL divergence between two Gaussians has a closed form but it seems this is not the case for mixtures of Gaussians. So let’s fake it by sampling our densities and using discrete distributions:

xx <- -1000:1000/100;
p <- dmog(xx,4,0.8,1,0.5);

dmog <- function(x, mu1, sd1, mu2, sd2) {
0.5*dnorm(x, mean=mu1, sd=sd1) + 0.5*dnorm(x, mean=mu2, sd=sd2);
}

kl.div <- function(p, q) {
p %*% log(p/q);
}

dkl.pq <- function(pars) {
q <- dnorm(xx, mean=pars[1], sd=pars[2]);
kl.div(p, q);
}

dkl.qp <- function(pars) {
q <- dnorm(xx, mean=pars[1], sd=pars[2]);
kl.div(q, p);
}
init <- c(0,2);
result.qp <- optim(init, dkl.qp);
result.pq <- optim(init, dkl.pq);
plot(xx, p, type="l", col="red", ylim=c(0,1));
curve(dnorm(x, mean=result.qp$par[1], sd=result.qp$par[2]), -10,10,col="blue",add=T)
curve(dnorm(x, mean=result.pq$par[1], sd=result.pq$par[2]), -10,10,col="green",add=T)
result.qp
result.pq

We get the expected result:

• the green curve is the minimum $D_{KL}(p\Vert q)$ fit, which approximate the mixture distribution’s mean (and variance?)
• the blue curve is the minimum $D_{KL}(q \Vert p)$ fit, which here approximately fits the mean and sd of one of the two mixture components ($\mu \approx 1$, $\sigma \approx 0.5$). It’s a bit more touchy though, and depends on the initial conditions now.

UPDATE:

After I posted this Iain Murray linked to some nice further material on Twitter:

• This note describing in more detail when the “$D_{KL}(q\Vert p)$ is more compact” property does and doesn’t apply
• A set of video lectures by David MacKay to accompany the ITILA book

In particular Lecture 14, on approximating distributions by variational free energy minimization, covers much of this material.

It also sets the mathematical scene for what I’m intending to talk about next: Karl Friston’s Free Energy Principle, which applies extensions of these ideas to understanding the brain and behaviour.

## Probability distributions and distances

There’s an interesting correspondence between unimodal p.d.f.’s on a metric space (like the reals $\mathbb{R}^{N}$), and distance functions. I will dig through this for amusement below, with some R code to generate the pictures.

### Squared Euclidean distance and Gaussians

For example, in one dimension the normal p.d.f. is

$f(x;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2 }$

so the log likelihood for a single $x$ and fixed $\sigma$ is proportional to the squared distance from the mean: $\log\mathcal{L}(\mu) = \frac{(x-\mu)^{2}}{2\sigma^{2}}$. Here’s the log likelihood for unit variance with $\mu=\frac{1}{2}$:

mu <- 0.5;
par(mfrow=c(2,1));
curve(dnorm(x, mean=mu, sd=1, log=T),-4,5,xlab="x",ylab=expression(log(phi*("x"))));
abline(v=mu);
curve(dnorm(x, mean=mu, sd=1, log=F),-4,5,xlab="x",ylab=expression(phi*("x")));
abline(v=mu);
text(mu,0,expression(mu=="0.5"));

This is basic stuff at the core of classical statistical techniques: least squares models are computationally straightforward to fit, but they also happen to imply Gaussian errors, which is a reasonable first approximation for a lot of real life data.

Of course we could look at it the other way. If I want to measure the Euclidean distance from a point $a$ to a point $b$, I can back this out from the log likelihood of $a$ under a unit variance Gaussian centred at $b$.

Putting $\sigma=1$:

$\displaystyle f(x;\mu) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} \left(x-\mu\right)^2 }$

$\displaystyle \log f(x;\mu) = - \frac{1}{2}\log(2\pi) -\frac{1}{2}\left(x-\mu \right)^2$

$\displaystyle \left(x-\mu \right)^2 = - 2\log f(x;\mu) - \log(2\pi)$

$\displaystyle \left(x-\mu \right) = \sqrt{-2\log f(x;\mu) - \log(2\pi)}$

a <- 2;
b <- 4;
sqrt(-2*dnorm(b, mean=a, sd=1, log=T)  - log(2*pi));

This is rather a roundabout way of measuring distances. Would you ever want to do this? Well, in the multivariate case, if the covariance matrix isn’t spherical you get a distance measure that depends on direction; something a bit like the Mahalanobis distance.

### L-1 distance and Laplace distribution

For our next trick, we can try measure distances with other norms than $L_{2}$, the good old squared Euclidean distance. For instance if we minimize the absolute error instead of the squared error (e.g. for Quantile Regression) we’re still measuring distances but now they’re in the achingly fashionable $L_{1}$ norm. We’ve implicitly assumed the errors follow the Laplace or “double exponential” distribution so the log likelihood looks like this:

mu <- 0.5;
b <- 1;
par(mfrow=c(2,1));
curve(log(exp(-abs(x-mu)/b)/2*b),-4,5,xlab="x",ylab=expression(log(f(x))));
abline(v=mu);
curve(exp(-abs(x-mu)/b)/2*b,-4,5,xlab="x",ylab=expression(f(x)));
abline(v=mu);
text(mu,0,expression(mu==0.5));

### KL divergence and Dirichlet distributions

What if we go a bit further off-piste: let’s consider the Kullback-Leibler divergence between two arbitrary N-dimensional probability distributions $\mathbf{p}$ and $\mathbf{q}$:

$\displaystyle D_{KL}(\mathbf{p} \Vert \mathbf{q}) = \sum_{i} p_{i} \log \frac{p_{i}}{q_{i}}$

This is a bit different than the examples above, where the log-probability depends on metric distances between points in $\mathbb{R}^{N}$, Firstly: the KL divergence isn’t a proper metric (it’s not symmetric and doesn’t obey the triangle inequality). However, it is does act as a distance in the sense that $D_{KL}(p,q) \geq 0$ with equality only when $\mathbf{p}=\mathbf{q}$, which is good enough for our purposes (it’s a “premetric”).

Secondly, we’re now confining ourselves to points that live on probability simplex $\mathbb{P}^{N}$, instead of anywhere in $\mathbb{R}^{N}$. That means our distributions $\mathbf{p}$ have to satisfy $\sum_{i} p_{i}=1$, and $p_{i}>0$.

A widely used distribution on the simplex is the Dirichlet distribution:

$\displaystyle f(\mathbf{p} ; \mathbf{\alpha}) \propto \prod_{i=1}^K p_i^{\alpha_i - 1}$

which is also called the beta distributionin the case when $N=2$. Then we have a single degree of freedom $p$:

$\displaystyle f(p ; \alpha,\beta) \propto p^{\alpha-1}(1-p)^{\beta-1}$

Now; is it too much to hope for that the log-likelihood for the beta distribution follows the same form as the Gaussian and Laplace distributions above, but with the KL divergence replacing the metric distance? (SPOILERS: yes, but not by much, and let’s see where…)

#### Concentration and mean parameters

Firstly, the other distributions had two parameters: a mean or location ($\mathbf{\mu}$) and a scale or dispersion parameter ($\sigma$ or $b$). Here we seem to have just an unnormalized $N$-dimensional parameter vector of pseudo-counts $\mathbf{\alpha}$. But as David Mackay will tell you in ITILA, you can factor this into a positive real concentration parameter and a normalized mean vector which lives on the simplex: $\mathbf{\alpha} = c\mathbf{\mu}$.

Let’s write out the log-likelihood in this form, lumping all the messy gamma functions at the front into a normalizing factor $Z$.

$\displaystyle \log f(\mathbf{p} ; c, \mathbf{\mu}) = \sum_{i=1}^K (c\mu_i - 1) \log p_i - \log Z$

$\displaystyle = c\sum_{i=1}^K \mu_i \log p_i - \sum_{i=1}^K \log p_i - \log Z$

Recall that the KL divergence from the mean $\mathbf{\mu}$ to $\mathbf{p}$ is

$\displaystyle D_{KL}(\mathbf{\mu} \Vert \mathbf{p}) = \sum_{i} \mu_{i} \log \frac{\mu_{i}}{p_{i}} = \sum_{i} \mu_{i} \log \mu_{i} - \sum_{i} \mu_{i} \log p_{i} = -H(\mu) - \sum_{i} \mu_{i} \log p_{i}$

where $H(\mathbf{\mu})$ is the entropy of the mean distribution $\mathbf{\mu}$. Then our log likelihood is:

$\displaystyle \log f(\mathbf{p} ; c, \mathbf{\mu}) = -c D_{KL}(\mathbf{\mu} \Vert \mathbf{p}) - c H(\mathbf{\mu}) - \sum_{i=1}^K \log p_i - \log Z$

which is looking promising, especially because the $H(\mathbf{\mu})$ doesn’t depend on $p$ so will just become part of the normalizing constant $Z$. But what about the other term?

That’s a more awkward customer, but we can still look at it in terms of KL divergence; not from our mean $\mathbf{\mu}$ but from the uniform distribution $\mathbf{u}$ with $u_{i}=\frac{1}{N}$.

If we substitute in

$\displaystyle D_{KL}(\mathbf{u} \Vert \mathbf{p}) = \sum_{i} u_{i} \log \frac{u_{i}}{p_{i}} = -H( \mathbf{u} ) - \frac{1}{N} \sum_{i} \log p_{i}$

and pull out the constants we end up with this:

$\displaystyle = -c D_{KL}(\mathbf{\mu} \Vert \mathbf{p}) + N D_{KL}(\mathbf{u} \Vert \mathbf{p}) - \log Z$

Curiouser and curiouser!

#### How does it behave?

Our Dirichlet log likelihood doesn’t depend solely on the “distance” from the mean like other two examples, but:

• It does depend on the KL divergence from the mean, and asymptotically solely on that as we make $c \gg N$.
• We can also see that it does so exactly when $\mathbf{\mu} = \mathbf{u}$, with an effective concentration of $c-N$.

So the contours of the Dirichlet distribution with uniform mean are balls of constant $D_{KL}$, which is pleasing. (Someone has made nice 3d prints of these and other related “Bregman balls” here.)

dirichlet.logZ <- function(c, m) {
sum(lgamma(c*m)) - lgamma(sum(c*m));
}
logddirichlet <- function(p, c, m) {
(c*m-1) %*% log(p) - dirichlet.logZ(c,m);
}

m <- c(1,1,1)/3.;
c <- 25.;

x <- 1:99/100.;
y <- 1:99/100.;
z <- matrix(nrow=99,ncol=99)
for (xx in 1:99) {
for (yy in 1:99) {
p <- c(x[xx],y[yy],1-x[xx]-y[yy]);
z[yy,xx] <- logddirichlet(p, c, m);
}
}
filled.contour(x,y,exp(z));

There’s also a repulsive dependence on the divergence from the uniform p.d.f. $\mathbf{u}$, which is easiest to see in the univariate (beta distribution) case.

This accounts for the skewness of the distributions when $p \neq \frac{1}{2}$, and it also explains why the distribution becomes multimodal with peaks at the boundaries when $c$ is small: then the “centrifugal” term pushing away from $\mathbf{u}$ dominates over the “centripetal” term which pulls us toward $\mathbf{m}$.

par(mfrow=c(2,1));
curve(dbeta(x, 25, 5));
curve(dbeta(x, 0.8, 0.8));

#### Connections

Why does this unexpected “centrifugal” term appear in the Dirichlet? It’s to do with the pesky $-1$‘s in the exponents… why are they there?
Sometimes people want to approximate beta distributions by Gaussians. John D.Cook points out that this works reasonably well when the parameters are “large and approximately equal” (exactly the circumstances under which the distribution depends more on $D_{KL}$ from the mean). This paper of David MacKay and Andrew Gelman’s blog both suggest that if you need to approximate Dirichlet distributions with Gaussians, this is best done with a change of coordinates to the softmax basis.
It feels to me like there is already a similarity between $D_{KL}$ and squared Euclidean distance, and so between the Dirichlet and the Gaussian. Maybe it can be made more precise with a better understanding of information geometry; and maybe this might help to come up with or reinterpret a generalization of the Dirichlet with a more general precision/concentration structure for $N>2$.