Relative Entropy for Fun and Profit

### Cauchy-Schwarz for outer products as a matrix inequality

If you read the Wikipedia page on the Cramér-Rao bound in statistics, there is an elegant and concise proof given of the scalar version of the bound. However, no proof of the full multivariate case is given there.

Indeed, it seems at first like the same approach will not work, because multivariate Cramér-Rao is a matrix inequality, while the scalar proof relies on the Cauchy-Schwarz inequality, which is a statement about inner products. Since an inner product is just a real-valued number, surely a different approach is required for proofs about matrices?

But after reading this 1980 paper of Bultheel I think the same short proof goes through, if we generalise the definition of “inner product” slightly. In fact, this form of Cauchy-Schwarz holds for the familiar outer product and the inner product version is just a special case!

Below we’ll confine ourselves to the reals for simplicity, unlike Bultheel who works more abstractly.

We’ll review the scalar case, then extend it to matrices.

## Inner product definition

An inner product, $\langle\cdot, \cdot\rangle$ on a vector space $V$ over the reals takes two vectors and returns a real number. The prototypical example is the dot product on $\mathbb{R}^N$, $\langle \mathbf{x}, \mathbf{y}\rangle = \sum_i x_i y_i$, but we can allow others if they satisfy these requirements:

• Symmetry: $\langle \mathbf{x},\mathbf{y} \rangle = \langle \mathbf{y},\mathbf{x} \rangle$
• Bilinearity: $\langle a\mathbf{x}+b\mathbf{y}, c\mathbf{z} \rangle = ac \langle \mathbf{x}, \mathbf{z} \rangle + bc \langle \mathbf{y}, \mathbf{z} \rangle$
• Positive definiteness: $\langle \mathbf{x},\mathbf{x} \rangle \geq 0$, with equality if and only if $\mathbf{x}=0$.

## Cauchy-Schwarz from inner products

The Cauchy-Schwarz inequality is the following statement about products of inner products:

$\langle \mathbf{x},\mathbf{x} \rangle\langle \mathbf{y},\mathbf{y} \rangle \geq \langle \mathbf{x},\mathbf{y} \rangle^2$

We can show this using the definition of the inner product above. Take a vector $\mathbf{x} + c\mathbf{y}$ where $c$ is a real scalar.

Positive definiteness says:

$\langle \mathbf{x} + c\mathbf{y}, \mathbf{x} + c\mathbf{y} \rangle \geq 0$

We can use bilinearity to expand this:

$\langle \mathbf{x}, \mathbf{x} \rangle + c\langle \mathbf{x}, \mathbf{y} \rangle + c\langle \mathbf{y}, \mathbf{x} \rangle + c^2 \langle \mathbf{y}, \mathbf{y} \rangle \geq 0$

and symmetry to obtain

$\langle \mathbf{x}, \mathbf{x} \rangle + 2c\langle \mathbf{x}, \mathbf{y} \rangle + c^2 \langle \mathbf{y}, \mathbf{y} \rangle \geq 0$

Now if we make the choice $c = -\frac{\langle \mathbf{x},\mathbf{y}\rangle}{\langle \mathbf{y},\mathbf{y} \rangle}$ and simplify:

$\langle \mathbf{x}, \mathbf{x} \rangle - 2\frac{\langle \mathbf{x},\mathbf{y}\rangle}{\langle \mathbf{y},\mathbf{y} \rangle}\langle \mathbf{x}, \mathbf{y} \rangle + \frac{\langle \mathbf{x},\mathbf{y}\rangle^2}{\langle \mathbf{y},\mathbf{y} \rangle^2} \langle \mathbf{y}, \mathbf{y} \rangle \geq 0$

We obtain the desired inequality:

$\langle \mathbf{x}, \mathbf{x} \rangle \geq \frac{\langle \mathbf{x},\mathbf{y}\rangle}{\langle \mathbf{y},\mathbf{y} \rangle}\langle \mathbf{x}, \mathbf{y} \rangle$

## Matrix-valued inner product axioms

Now we’d like something a little more powerful. We can get this if we are willing to generalise the notion of an inner product to something that returns a matrix instead of a number. I’ll denote this new “inner product” by $\langle \cdot, \cdot\rangle_M$.

We also need to generalise our axioms slightly for this wider definition. I will follow Bultheel’s definition, simplifying by considering only real-valued matrices. So:

• Symmetry holds up to a transpose. Now $\langle \mathbf{x}, \mathbf{y}\rangle_M$ is a matrix, we need to add matrix transposition if we swap the arguments, but we still have:

$\langle \mathbf{x}, \mathbf{y}\rangle_M = \langle \mathbf{y}, \mathbf{x}\rangle_M^T$

• Bilinearity still applies, not only with scalar coefficients but also with $d \times d$ matrices. We have to be careful about whether we are multiplying on the left or right, because matrix multiplication is not commutative. So we have:

$\langle \mathbf{A}\mathbf{x}, \mathbf{B}\mathbf{y} + \mathbf{C}\mathbf{z}\rangle_M = \mathbf{A}\langle \mathbf{x}, \mathbf{y}\rangle_M \mathbf{B}^T + \mathbf{A}\langle \mathbf{x}, \mathbf{z}\rangle_M \mathbf{C}^T$

• Positive definiteness: we will demand that $\langle \mathbf{x}, \mathbf{x} \rangle_M$ is itself positive definite, i.e. $\langle \mathbf{x}, \mathbf{x} \rangle_M \succeq 0$ as a matrix inequality. We can also insist that $\mathrm{Tr}(\langle \mathbf{x}, \mathbf{x} \rangle_M) = 0$ implies $\mathbf{x} = \mathbf{0}$.

## Matrix-valued Cauchy-Schwarz

Now a multivariate Cauchy-Schwarz follows from these axioms just as it did in the scalar case, though again we must take care of the transpositions.

Positive definiteness:

$\langle \mathbf{x} + \mathbf{A}\mathbf{y}, \mathbf{x} + \mathbf{A}\mathbf{y} \rangle \succeq 0$

Bilinearity:

$\langle \mathbf{x}, \mathbf{x} \rangle + \langle \mathbf{x}, \mathbf{y} \rangle \mathbf{A}^T + \mathbf{A}\langle \mathbf{y}, \mathbf{x} \rangle + \mathbf{A} \langle \mathbf{y}, \mathbf{y} \rangle \mathbf{A}^T \succeq 0$

We substitute $\mathbf{A} = -\langle \mathbf{x},\mathbf{y}\rangle \langle \mathbf{y},\mathbf{y} \rangle^{-1}$:

$\langle \mathbf{x}, \mathbf{x} \rangle - \langle \mathbf{x}, \mathbf{y} \rangle (\langle \mathbf{x},\mathbf{y}\rangle \langle \mathbf{y},\mathbf{y} \rangle^{-1})^T - \langle \mathbf{x},\mathbf{y}\rangle \langle \mathbf{y},\mathbf{y} \rangle^{-1} \langle \mathbf{y}, \mathbf{x} \rangle + \langle \mathbf{x},\mathbf{y}\rangle \langle \mathbf{y},\mathbf{y} \rangle^{-1} \langle \mathbf{y}, \mathbf{y} \rangle (\langle \mathbf{x},\mathbf{y}\rangle \langle \mathbf{y},\mathbf{y} \rangle^{-1})^T \succeq 0$

Using transpose symmetry to tidy up:

$\langle \mathbf{x}, \mathbf{x} \rangle - \langle \mathbf{x}, \mathbf{y} \rangle \langle \mathbf{y},\mathbf{y} \rangle^{-1} \langle \mathbf{y},\mathbf{x}\rangle - \langle \mathbf{x},\mathbf{y}\rangle \langle \mathbf{y},\mathbf{y} \rangle^{-1} \langle \mathbf{y}, \mathbf{x} \rangle + \langle \mathbf{x},\mathbf{y}\rangle \langle \mathbf{y},\mathbf{y} \rangle^{-1} \langle \mathbf{y},\mathbf{x}\rangle \succeq 0$

we obtain the matrix form of Cauchy-Schwarz:

$\langle \mathbf{x}, \mathbf{x} \rangle \succeq \langle \mathbf{x}, \mathbf{y} \rangle \langle \mathbf{y},\mathbf{y} \rangle^{-1} \langle \mathbf{y},\mathbf{x}\rangle$

In particular, in the $d=1$ scalar case, this reduces to the usual scalar form of the inequality.

## Matrix-valued metrics?

I think this is cute! For one thing, we’ve just defined the outer product to be an inner product!

(The outer product between two $N-$ dimensional vectors $\mathbf{u}, \mathbf{v}$ is the $N \times N$ matrix $\mathbf{uv^T}$, while the Euclidean dot product is the scalar $\mathbf{u^T v}$.)

Yet since the outer product is transpose symmetric, bilinear, and results in a positive definite matrix for a single vector, it’s a perfectly good inner product for these purposes. I wonder why Cauchy-Schwarz is more commonly known in the less general inner product form?

I’m also intrigued by the geometric connotations of “matrix-valued inner products”. The inner product is an algebraic construction which is geometrically motivated, and so bridges these two aspects of mathematics. The inner product is at the core of geometry and defines:

• length of vectors (from the induced norm $\Vert \mathbf{x} \Vert = \sqrt{\langle \mathbf{x},\mathbf{x} \rangle}$)
• angles between vectors (from $\cos(\angle \mathbf{x}\mathbf{y}) = \frac{\langle \mathbf{x},\mathbf{y} \rangle}{\Vert \mathbf{x}\Vert\Vert \mathbf{y}\Vert}$), and in particular orthogonality when $\langle \mathbf{x},\mathbf{y} \rangle = 0$
• projections onto sets (by minimizing the norm, or by orthogonality)

So – what would it mean geometrically for a length or an angle to be matrix valued?

I don’t know! But it does occur to me that if you have two ordinary, independent scalar metrics $g_1$ and $g_2$, you can always compose these into a new “matrix-valued metric” $\mathbf{g} = \mathrm{diag}(g_1, g_2)$. (This is still positive definite in the sense above). That declares two vectors to be orthogonal when both of the component metrics are: this means that the trace of our matrix, which is the sum of its eigenvalues, will be zero. (If we had used the determinant instead, it would declare orthogonality whenever any of the constituents did.)

Furthermore, the trace of the outer product recovers the inner product. In fact, the trace already gives a proper inner product between square matrices, thought of as a vector space: $\langle \mathbf{A}, \mathbf{B}\rangle = \mathrm{tr}(\mathbf{A}\mathbf{B})$. So we can squash our matrix-valued inner product back to an ordinary scalar inner product by taking the trace. And if we do this for our diagonal matrix of independent metrics, we recover the usual metric on $\mathbb{R}^N$! It was there all along, but the matrix-valued metric additionally preserves more information about along which basis directions the vectors agree and disagree.

### Variational Bayes and the evidence lower bound

Variational methods for Bayesian inference have been enjoying a renaissance recently in machine learning.

### Problem: normalization can be intractable when applying Bayes’ Theorem

Given a likelihood function and a prior distribution that we can evaluate, $p(y \vert z)p(z) = p(y,z)$ is the joint likelihood.

The posterior is just $p(z\vert y) = p(y\vert z)p(z)/p(y)$ where we divide by the evidence $p(y)$:

$p(y) = \int p(y\vert z) p(z) dz = \mathbb{E}_{p(z)}[p(y\vert z)]$

However, $z$ is frequently intractable. For example, it may not admit a closed form solution, and it is frequently high-dimensional, so even numerical methods like quadrature may not help much.

Here are three techniques we can use to approach it:

1. If the posterior $p(z\vert y)$ and prior $p(z)$ are of particular forms so that they are conjugate to one another, then the integral will have a simple closed form. Then the updates to the prior from the likelihood are often trivial to calculate.
2. It is possible to draw samples from the posterior before we know the normalization factor. Then we can approximate the expectation stochastically by sample averages. By drawing enough samples the expectations converge to the true values – the theory behind MCMC techniques. Two hindrances arise: it is difficult to know how many samples is “enough”; and “enough samples” can also take a long time to generate.
3. We can introduce an approximating distribution $q(z) \approx p(z\vert y)$. If we can somehow measure the quality of the approximation and iteratively improve it as far as possible, we can use this approximation with confidence. This is the variational approach derived below.

### Variational lower bound

Start with the KL from $q$ to $p$, and rearrange to isolate the interesting quantity $p(y)$:

$D_{KL}(q(z) \Vert p(z \vert y)) = \int q(z) \log \frac{q(z)}{p(z \vert y)} dz = \int q(z) \log \frac{q(z)p(y)}{p(z, y)} dz$

$= \mathbb{E}_{q(z)}[\log q(z) - \log p(z, y) + \log p(y)] dz$

But $p(y)$ doesn’t depend on $z$ so we can pull it out of the expectation:

$D_{KL}(q(z) \Vert p(z \vert y)) = \mathbb{E}_{q(z)}[\log q(z) - \log p(z, y)] + \log p(y)$

Rearranging, we get

$\log p(y) = D_{KL}(q(z) \Vert p(z \vert y) ) + \mathbb{E}_{q(z)}[\log p(z, y) - \log q(z) ]$

Because we saw previously that $D_{KL} \geq 0$, we have

$\log p(y) \geq \mathbb{E}_{q(z)}[\log p(z, y) - \log q(z)] = L[q]$

This quantity $L[q]$ is the evidence lower bound (ELBO). It is a functional of the approximating distribution $q(z)$.

This bound is valuable because it can be calculated without the unknown normalizing constant $p(y)$. However it is equal to $p(y)$ at its maximum, when $D_{KL}(q(z)\Vert p(z\vert y))=0$, which also implies $q(z) = p(z \vert y)$.

We have transformed the problem of taking expectations into one of optimization. Now a new question arises – is this problem any easier than the integral we started with? Not necessarily! However we can now make different choices for the form of $q(z)$, so if we can find a family of distributions that is amenable to our available optimization techniques and which also contains a good approximation to the true posterior, we will be happy with the trade-off.

For example, we can rewrite $L$ in terms of yet another KL divergence, this time between the approximate posterior $q(z)$ and the prior $p(z)$ on the latent variables:

$L[q] = \mathbb{E}_{q(z)}[\log p(y\vert z)p(z) - \log q(z)]$

$= \mathbb{E}_{q(z)}[\log p(y\vert z) - \log \frac{q(z)}{p(z)}]$

$= \mathbb{E}_{q(z)}[\log p(y\vert z) ] - D_{KL}(q(z) \Vert p(z))$

If $q(z)$ and $p(z)$ have the same form – for instance we have chosen them both to be Gaussian – then the second term will have a closed form expression, so if we have a good way to evaluate the first term, as in Kingma & Welling 2013, then we’ll be able to optimize $q(z)$ and solve the problem.

(This bound gets its name from the variational principle, applying the calculus of variations to optimize the function $q(z)$ without assuming a particular form. However, we frequently assume a fixed-form approximation using a parametric form of this density. In this case no calculus of variations is required, and the problem reduces to an ordinary optimization.)

### Non-negativity of the KL Divergence

It is straightforward to show that the KL divergence is never negative using Jensen’s inequality and the concavity of the $\log$ function.

Jensen implies that $\mathbb{E}[f(x)] \geq f(\mathbb{E}[x])$ when $f(x)$ is convex.

Setting $f(x)=-\log(x)$ gives

\begin{aligned} D_{KL}(p\Vert q) & = \mathbb{E}_q\left[-\log \frac{p(x)}{q(x)}\right] \\ & \geq -\log \mathbb{E}_q\left[\frac{p(x)}{q(x)}\right] \\ & = -\log \int q(x) \frac{p(x)}{q(x)} dx \\ & = -\log \int p(x) dx = -\log 1 = 0 \\ \end{aligned}

Furthermore, the KL divergence is just one member of a more general family, the Csiszár $f$-divergences. These have the form

$D_{f}(p\Vert q) = \int q(x) f\left(\frac{p(x)}{q(x)}\right) dx$

for some convex function $f$. The same argument applies here (noting that the lower bound is now $f(1)$, so this will only translate into non-negativity for particular choices of $f$).

### Optimization talk at PyData London 2014

I was pleased to be asked to speak at PyData London this year. I gave an overview of some optimization tools in Python which they’ve put up on their Youtube channel.

My talk is here.

Here are the slides (which probably don’t make much sense without the talk), and here are the associated IPython notebooks I used:

### Kullback-Leibler divergence asymmetry

It’s widely known that the KL divergence is not symmetric, i.e.

$\displaystyle D_{KL}(p\Vert q) \neq D_{KL}(q \Vert p)$

But what’s the intuitive picture of how the symmetry fails? Recently I saw Will Penny explain this (at the Free Energy Principle workshop, of which hopefully more later). I am going to “borrow” very liberally from his talk. I think he also mentioned that he was using slides from Chris Bishop’s book, so this material might be well known from there (I haven’t read it).

f <- function(x) { dnorm(x, mean=0, sd=0.5) };
g <- function(x) { dnorm(x, mean=3, sd=2) };
curve(f(x),-6,6, col="red", ylim=c(-1,1));
scale <- 0.05;
curve( scale*(log(f(x))-log(g(x))),-6,6, col="red", lty=3, add=T);
curve( scale*(log(g(x))-log(f(x))),-6,6, col="blue", lty=3, add=T);

The plot shows two Gaussians, a lower variance distribution $f$ in red and a wider distribution $g$ in blue. You can also see the (scaled) quantity $\log(\frac{f}{g}) = -\log(\frac{g}{f})$ in red, and its inverse in blue.

The KL divergence $D_{KL}(f\Vert g)$ is the expectation under the red pdf of the red dotted line, and $D_{KL}(g\Vert f)$ is the corresponding expectation for the blue pair. A couple of observations reveal two sources of disagreement between them:

• The KL from $f$ gives little weight to discrepancies outside the narrow region where it has the most probability density; what the broader density $g$ does out here will affect it very little
• The quantity inside the expectations, the difference of logs, is actually anti-symmetric. (Somehow it’s remarkable that we can take two differently positively weighted sums of the same quantity with and without a minus sign, but always guarantee a positive result, since $D_{KL}\geq 0$ !)

## Consequences for optimization

Say we have a multi-modal distribution $p$, like this mixture of two Gaussians:

mu1 <- 4.0;
sd1 <- 0.8;
mu2 <- 1;
sd2 <- 0.5;
curve(0.5*dnorm(x, mean=mu1, sd=sd1) + 0.5*dnorm(x, mean=mu2, sd=sd2),-6,6);

and want to approximate it with a simpler distribution $q$ – we’ll use a single Gaussian – by minimizing the KL divergence.

We can choose to minimize either $D_{KL}(q\Vert p)$ or $D_{KL}(p \Vert q)$. Penny’s claim is that:

• minimizing $D_{KL}(q\Vert p)$ results in matching a Gaussian to one of the modes (the approximating distribution is more compact than the true distribution)
• minimizing $D_{KL}(p\Vert q)$ results in matching moments (our approximating distribution will have the same mean and variance as the mixture, so will be more spread out than either of the two peaks in the true distributions).

In the first case, the approximating density $q$ can choose to “ignore” troublesome parts of the target density $p$ which are difficult to fit by reducing its variance. We’ve seen above that the tails will receive little weight and can’t affect the divergence much.

In the second case, the expectation is taken under the target density $p$, so $q$ has to try to accommodate the whole thing as best it can, multi modes and all.

(Once again David MacKay’s wonderful ITILA book has some interesting material related to this; Chapter 33 on variational free energy minimization.)

I wanted to try this optimization out. The KL divergence between two Gaussians has a closed form but it seems this is not the case for mixtures of Gaussians. So let’s fake it by sampling our densities and using discrete distributions:

xx <- -1000:1000/100;
p <- dmog(xx,4,0.8,1,0.5);

dmog <- function(x, mu1, sd1, mu2, sd2) {
0.5*dnorm(x, mean=mu1, sd=sd1) + 0.5*dnorm(x, mean=mu2, sd=sd2);
}

kl.div <- function(p, q) {
p %*% log(p/q);
}

dkl.pq <- function(pars) {
q <- dnorm(xx, mean=pars[1], sd=pars[2]);
kl.div(p, q);
}

dkl.qp <- function(pars) {
q <- dnorm(xx, mean=pars[1], sd=pars[2]);
kl.div(q, p);
}
init <- c(0,2);
result.qp <- optim(init, dkl.qp);
result.pq <- optim(init, dkl.pq);
plot(xx, p, type="l", col="red", ylim=c(0,1));
curve(dnorm(x, mean=result.qp$par[1], sd=result.qp$par[2]), -10,10,col="blue",add=T)
curve(dnorm(x, mean=result.pq$par[1], sd=result.pq$par[2]), -10,10,col="green",add=T)
result.qp
result.pq

We get the expected result:

• the green curve is the minimum $D_{KL}(p\Vert q)$ fit, which approximate the mixture distribution’s mean (and variance?)
• the blue curve is the minimum $D_{KL}(q \Vert p)$ fit, which here approximately fits the mean and sd of one of the two mixture components ($\mu \approx 1$, $\sigma \approx 0.5$). It’s a bit more touchy though, and depends on the initial conditions now.

UPDATE:

After I posted this Iain Murray linked to some nice further material on Twitter:

• This note describing in more detail when the “$D_{KL}(q\Vert p)$ is more compact” property does and doesn’t apply
• A set of video lectures by David MacKay to accompany the ITILA book

In particular Lecture 14, on approximating distributions by variational free energy minimization, covers much of this material.

It also sets the mathematical scene for what I’m intending to talk about next: Karl Friston’s Free Energy Principle, which applies extensions of these ideas to understanding the brain and behaviour.

## Probability distributions and distances

There’s an interesting correspondence between unimodal p.d.f.’s on a metric space (like the reals $\mathbb{R}^{N}$), and distance functions. I will dig through this for amusement below, with some R code to generate the pictures.

### Squared Euclidean distance and Gaussians

For example, in one dimension the normal p.d.f. is

$f(x;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2 }$

so the log likelihood for a single $x$ and fixed $\sigma$ is proportional to the squared distance from the mean: $\log\mathcal{L}(\mu) = \frac{(x-\mu)^{2}}{2\sigma^{2}}$. Here’s the log likelihood for unit variance with $\mu=\frac{1}{2}$:

mu <- 0.5;
par(mfrow=c(2,1));
curve(dnorm(x, mean=mu, sd=1, log=T),-4,5,xlab="x",ylab=expression(log(phi*("x"))));
abline(v=mu);
curve(dnorm(x, mean=mu, sd=1, log=F),-4,5,xlab="x",ylab=expression(phi*("x")));
abline(v=mu);
text(mu,0,expression(mu=="0.5"));

This is basic stuff at the core of classical statistical techniques: least squares models are computationally straightforward to fit, but they also happen to imply Gaussian errors, which is a reasonable first approximation for a lot of real life data.

Of course we could look at it the other way. If I want to measure the Euclidean distance from a point $a$ to a point $b$, I can back this out from the log likelihood of $a$ under a unit variance Gaussian centred at $b$.

Putting $\sigma=1$:

$\displaystyle f(x;\mu) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} \left(x-\mu\right)^2 }$

$\displaystyle \log f(x;\mu) = - \frac{1}{2}\log(2\pi) -\frac{1}{2}\left(x-\mu \right)^2$

$\displaystyle \left(x-\mu \right)^2 = - 2\log f(x;\mu) - \log(2\pi)$

$\displaystyle \left(x-\mu \right) = \sqrt{-2\log f(x;\mu) - \log(2\pi)}$

a <- 2;
b <- 4;
sqrt(-2*dnorm(b, mean=a, sd=1, log=T)  - log(2*pi));

This is rather a roundabout way of measuring distances. Would you ever want to do this? Well, in the multivariate case, if the covariance matrix isn’t spherical you get a distance measure that depends on direction; something a bit like the Mahalanobis distance.

### L-1 distance and Laplace distribution

For our next trick, we can try measure distances with other norms than $L_{2}$, the good old squared Euclidean distance. For instance if we minimize the absolute error instead of the squared error (e.g. for Quantile Regression) we’re still measuring distances but now they’re in the achingly fashionable $L_{1}$ norm. We’ve implicitly assumed the errors follow the Laplace or “double exponential” distribution so the log likelihood looks like this:

mu <- 0.5;
b <- 1;
par(mfrow=c(2,1));
curve(log(exp(-abs(x-mu)/b)/2*b),-4,5,xlab="x",ylab=expression(log(f(x))));
abline(v=mu);
curve(exp(-abs(x-mu)/b)/2*b,-4,5,xlab="x",ylab=expression(f(x)));
abline(v=mu);
text(mu,0,expression(mu==0.5));

### KL divergence and Dirichlet distributions

What if we go a bit further off-piste: let’s consider the Kullback-Leibler divergence between two arbitrary N-dimensional probability distributions $\mathbf{p}$ and $\mathbf{q}$:

$\displaystyle D_{KL}(\mathbf{p} \Vert \mathbf{q}) = \sum_{i} p_{i} \log \frac{p_{i}}{q_{i}}$

This is a bit different than the examples above, where the log-probability depends on metric distances between points in $\mathbb{R}^{N}$, Firstly: the KL divergence isn’t a proper metric (it’s not symmetric and doesn’t obey the triangle inequality). However, it is does act as a distance in the sense that $D_{KL}(p,q) \geq 0$ with equality only when $\mathbf{p}=\mathbf{q}$, which is good enough for our purposes (it’s a “premetric”).

Secondly, we’re now confining ourselves to points that live on probability simplex $\mathbb{P}^{N}$, instead of anywhere in $\mathbb{R}^{N}$. That means our distributions $\mathbf{p}$ have to satisfy $\sum_{i} p_{i}=1$, and $p_{i}>0$.

A widely used distribution on the simplex is the Dirichlet distribution:

$\displaystyle f(\mathbf{p} ; \mathbf{\alpha}) \propto \prod_{i=1}^K p_i^{\alpha_i - 1}$

which is also called the beta distributionin the case when $N=2$. Then we have a single degree of freedom $p$:

$\displaystyle f(p ; \alpha,\beta) \propto p^{\alpha-1}(1-p)^{\beta-1}$

Now; is it too much to hope for that the log-likelihood for the beta distribution follows the same form as the Gaussian and Laplace distributions above, but with the KL divergence replacing the metric distance? (SPOILERS: yes, but not by much, and let’s see where…)

#### Concentration and mean parameters

Firstly, the other distributions had two parameters: a mean or location ($\mathbf{\mu}$) and a scale or dispersion parameter ($\sigma$ or $b$). Here we seem to have just an unnormalized $N$-dimensional parameter vector of pseudo-counts $\mathbf{\alpha}$. But as David Mackay will tell you in ITILA, you can factor this into a positive real concentration parameter and a normalized mean vector which lives on the simplex: $\mathbf{\alpha} = c\mathbf{\mu}$.

Let’s write out the log-likelihood in this form, lumping all the messy gamma functions at the front into a normalizing factor $Z$.

$\displaystyle \log f(\mathbf{p} ; c, \mathbf{\mu}) = \sum_{i=1}^K (c\mu_i - 1) \log p_i - \log Z$

$\displaystyle = c\sum_{i=1}^K \mu_i \log p_i - \sum_{i=1}^K \log p_i - \log Z$

Recall that the KL divergence from the mean $\mathbf{\mu}$ to $\mathbf{p}$ is

$\displaystyle D_{KL}(\mathbf{\mu} \Vert \mathbf{p}) = \sum_{i} \mu_{i} \log \frac{\mu_{i}}{p_{i}} = \sum_{i} \mu_{i} \log \mu_{i} - \sum_{i} \mu_{i} \log p_{i} = -H(\mu) - \sum_{i} \mu_{i} \log p_{i}$

where $H(\mathbf{\mu})$ is the entropy of the mean distribution $\mathbf{\mu}$. Then our log likelihood is:

$\displaystyle \log f(\mathbf{p} ; c, \mathbf{\mu}) = -c D_{KL}(\mathbf{\mu} \Vert \mathbf{p}) - c H(\mathbf{\mu}) - \sum_{i=1}^K \log p_i - \log Z$

which is looking promising, especially because the $H(\mathbf{\mu})$ doesn’t depend on $p$ so will just become part of the normalizing constant $Z$. But what about the other term?

That’s a more awkward customer, but we can still look at it in terms of KL divergence; not from our mean $\mathbf{\mu}$ but from the uniform distribution $\mathbf{u}$ with $u_{i}=\frac{1}{N}$.

If we substitute in

$\displaystyle D_{KL}(\mathbf{u} \Vert \mathbf{p}) = \sum_{i} u_{i} \log \frac{u_{i}}{p_{i}} = -H( \mathbf{u} ) - \frac{1}{N} \sum_{i} \log p_{i}$

and pull out the constants we end up with this:

$\displaystyle = -c D_{KL}(\mathbf{\mu} \Vert \mathbf{p}) + N D_{KL}(\mathbf{u} \Vert \mathbf{p}) - \log Z$

Curiouser and curiouser!

#### How does it behave?

Our Dirichlet log likelihood doesn’t depend solely on the “distance” from the mean like other two examples, but:

• It does depend on the KL divergence from the mean, and asymptotically solely on that as we make $c \gg N$.
• We can also see that it does so exactly when $\mathbf{\mu} = \mathbf{u}$, with an effective concentration of $c-N$.

So the contours of the Dirichlet distribution with uniform mean are balls of constant $D_{KL}$, which is pleasing. (Someone has made nice 3d prints of these and other related “Bregman balls” here.)

dirichlet.logZ <- function(c, m) {
sum(lgamma(c*m)) - lgamma(sum(c*m));
}
logddirichlet <- function(p, c, m) {
(c*m-1) %*% log(p) - dirichlet.logZ(c,m);
}

m <- c(1,1,1)/3.;
c <- 25.;

x <- 1:99/100.;
y <- 1:99/100.;
z <- matrix(nrow=99,ncol=99)
for (xx in 1:99) {
for (yy in 1:99) {
p <- c(x[xx],y[yy],1-x[xx]-y[yy]);
z[yy,xx] <- logddirichlet(p, c, m);
}
}
filled.contour(x,y,exp(z));

There’s also a repulsive dependence on the divergence from the uniform p.d.f. $\mathbf{u}$, which is easiest to see in the univariate (beta distribution) case.

This accounts for the skewness of the distributions when $p \neq \frac{1}{2}$, and it also explains why the distribution becomes multimodal with peaks at the boundaries when $c$ is small: then the “centrifugal” term pushing away from $\mathbf{u}$ dominates over the “centripetal” term which pulls us toward $\mathbf{m}$.

par(mfrow=c(2,1));
curve(dbeta(x, 25, 5));
curve(dbeta(x, 0.8, 0.8));

#### Connections

I’ve got some more questions about this interpretation of the beta/Dirichlet distributions as depending on divergences, please comment if you know the answers!

Why does this unexpected “centrifugal” term appear in the Dirichlet? It’s to do with the pesky $-1$‘s in the exponents… why are they there?

Sometimes people want to approximate beta distributions by Gaussians. John D.Cook points out that this works reasonably well when the parameters are “large and approximately equal” (exactly the circumstances under which the distribution depends more on $D_{KL}$ from the mean). This paper of David MacKay and Andrew Gelman’s blog both suggest that if you need to approximate Dirichlet distributions with Gaussians, this is best done with a change of coordinates to the softmax basis.

It feels to me like there is already a similarity between $D_{KL}$ and squared Euclidean distance, and so between the Dirichlet and the Gaussian. Maybe it can be made more precise with a better understanding of information geometry; and maybe this might help to come up with or reinterpret a generalization of the Dirichlet with a more general precision/concentration structure for $N>2$.