Say we have a multi-modal distribution , like this mixture of two Gaussians:

mu1 <- 4.0;
sd1 <- 0.8;
mu2 <- 1;
sd2 <- 0.5;
curve(0.5*dnorm(x, mean=mu1, sd=sd1) + 0.5*dnorm(x, mean=mu2, sd=sd2),-6,6);

and want to approximate it with a simpler distribution – we’ll use a single Gaussian – by minimizing the KL divergence.

We can choose to minimize either or . Penny’s claim is that:

- minimizing results in matching a Gaussian to one of the modes (the approximating distribution is more compact than the true distribution)
- minimizing results in matching moments (our approximating distribution will have the same mean and variance as the mixture, so will be more spread out than either of the two peaks in the true distributions).

In the first case, the approximating density can choose to “ignore” troublesome parts of the target density which are difficult to fit by reducing its variance. We’ve seen above that the tails will receive little weight and can’t affect the divergence much.

In the second case, the expectation is taken under the target density , so has to try to accommodate the whole thing as best it can, multi modes and all.

(Once again David MacKay’s wonderful ITILA book has some interesting material related to this; Chapter 33 on variational free energy minimization.)

I wanted to try this optimization out. The KL divergence between two Gaussians has a closed form but it seems this is not the case for mixtures of Gaussians. So let’s fake it by sampling our densities and using discrete distributions:

xx <- -1000:1000/100;
p <- dmog(xx,4,0.8,1,0.5);
dmog <- function(x, mu1, sd1, mu2, sd2) {
0.5*dnorm(x, mean=mu1, sd=sd1) + 0.5*dnorm(x, mean=mu2, sd=sd2);
}
kl.div <- function(p, q) {
p %*% log(p/q);
}
dkl.pq <- function(pars) {
q <- dnorm(xx, mean=pars[1], sd=pars[2]);
kl.div(p, q);
}
dkl.qp <- function(pars) {
q <- dnorm(xx, mean=pars[1], sd=pars[2]);
kl.div(q, p);
}
init <- c(0,2);
result.qp <- optim(init, dkl.qp);
result.pq <- optim(init, dkl.pq);
plot(xx, p, type="l", col="red", ylim=c(0,1));
curve(dnorm(x, mean=result.qp$par[1], sd=result.qp$par[2]), -10,10,col="blue",add=T)
curve(dnorm(x, mean=result.pq$par[1], sd=result.pq$par[2]), -10,10,col="green",add=T)
result.qp
result.pq

We get the expected result:

- the green curve is the minimum fit, which approximate the mixture distribution’s mean (and variance?)
- the blue curve is the minimum fit, which here approximately fits the mean and sd of one of the two mixture components (, ). It’s a bit more touchy though, and depends on the initial conditions now.

UPDATE:

After I posted this Iain Murray linked to some nice further material on Twitter:

- This note describing in more detail when the “ is more compact” property does and doesn’t apply
- A set of video lectures by David MacKay to accompany the ITILA book

In particular Lecture 14, on approximating distributions by variational free energy minimization, covers much of this material.

It also sets the mathematical scene for what I’m intending to talk about next: Karl Friston’s Free Energy Principle, which applies extensions of these ideas to understanding the brain and behaviour.