Distances, Divergences, Dirichlet distributions

by benmoran

Probability distributions and distances

There’s an interesting correspondence between unimodal p.d.f.’s on a metric space (like the reals \mathbb{R}^{N}), and distance functions. I will dig through this for amusement below, with some R code to generate the pictures.

Squared Euclidean distance and Gaussians

For example, in one dimension the normal p.d.f. is

f(x;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2 }

so the log likelihood for a single x and fixed \sigma is proportional to the squared distance from the mean: \log\mathcal{L}(\mu) = \frac{(x-\mu)^{2}}{2\sigma^{2}}. Here’s the log likelihood for unit variance with \mu=\frac{1}{2}:

mu <- 0.5;
curve(dnorm(x, mean=mu, sd=1, log=T),-4,5,xlab="x",ylab=expression(log(phi*("x"))));
curve(dnorm(x, mean=mu, sd=1, log=F),-4,5,xlab="x",ylab=expression(phi*("x")));


This is basic stuff at the core of classical statistical techniques: least squares models are computationally straightforward to fit, but they also happen to imply Gaussian errors, which is a reasonable first approximation for a lot of real life data.

Of course we could look at it the other way. If I want to measure the Euclidean distance from a point a to a point b, I can back this out from the log likelihood of a under a unit variance Gaussian centred at b.

Putting \sigma=1:

\displaystyle f(x;\mu) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} \left(x-\mu\right)^2 }

\displaystyle \log f(x;\mu) = - \frac{1}{2}\log(2\pi) -\frac{1}{2}\left(x-\mu \right)^2

\displaystyle \left(x-\mu \right)^2 = - 2\log f(x;\mu) - \log(2\pi)

\displaystyle \left(x-\mu \right) = \sqrt{-2\log f(x;\mu) - \log(2\pi)}

a <- 2;
b <- 4;
sqrt(-2*dnorm(b, mean=a, sd=1, log=T)  - log(2*pi));

This is rather a roundabout way of measuring distances. Would you ever want to do this? Well, in the multivariate case, if the covariance matrix isn’t spherical you get a distance measure that depends on direction; something a bit like the Mahalanobis distance.

L-1 distance and Laplace distribution

For our next trick, we can try measure distances with other norms than L_{2}, the good old squared Euclidean distance. For instance if we minimize the absolute error instead of the squared error (e.g. for Quantile Regression) we’re still measuring distances but now they’re in the achingly fashionable L_{1} norm. We’ve implicitly assumed the errors follow the Laplace or “double exponential” distribution so the log likelihood looks like this:

mu <- 0.5;
b <- 1;


KL divergence and Dirichlet distributions

What if we go a bit further off-piste: let’s consider the Kullback-Leibler divergence between two arbitrary N-dimensional probability distributions \mathbf{p} and \mathbf{q}:

\displaystyle D_{KL}(\mathbf{p} \Vert \mathbf{q}) = \sum_{i} p_{i} \log \frac{p_{i}}{q_{i}}

This is a bit different than the examples above, where the log-probability depends on metric distances between points in \mathbb{R}^{N}, Firstly: the KL divergence isn’t a proper metric (it’s not symmetric and doesn’t obey the triangle inequality). However, it is does act as a distance in the sense that D_{KL}(p,q) \geq 0 with equality only when \mathbf{p}=\mathbf{q}, which is good enough for our purposes (it’s a “premetric”).

Secondly, we’re now confining ourselves to points that live on probability simplex \mathbb{P}^{N}, instead of anywhere in \mathbb{R}^{N}. That means our distributions \mathbf{p} have to satisfy \sum_{i} p_{i}=1, and p_{i}>0.

A widely used distribution on the simplex is the Dirichlet distribution:

\displaystyle f(\mathbf{p} ; \mathbf{\alpha}) \propto \prod_{i=1}^K p_i^{\alpha_i - 1}

which is also called the beta distributionin the case when N=2. Then we have a single degree of freedom p:

\displaystyle f(p ; \alpha,\beta) \propto p^{\alpha-1}(1-p)^{\beta-1}

Now; is it too much to hope for that the log-likelihood for the beta distribution follows the same form as the Gaussian and Laplace distributions above, but with the KL divergence replacing the metric distance? (SPOILERS: yes, but not by much, and let’s see where…)

Concentration and mean parameters

Firstly, the other distributions had two parameters: a mean or location (\mathbf{\mu}) and a scale or dispersion parameter (\sigma or b). Here we seem to have just an unnormalized N-dimensional parameter vector of pseudo-counts \mathbf{\alpha}. But as David Mackay will tell you in ITILA, you can factor this into a positive real concentration parameter and a normalized mean vector which lives on the simplex: \mathbf{\alpha} = c\mathbf{\mu}.

Let’s write out the log-likelihood in this form, lumping all the messy gamma functions at the front into a normalizing factor Z.

\displaystyle \log f(\mathbf{p} ; c, \mathbf{\mu}) = \sum_{i=1}^K (c\mu_i - 1) \log p_i - \log Z

\displaystyle = c\sum_{i=1}^K \mu_i \log p_i - \sum_{i=1}^K \log p_i - \log Z

Recall that the KL divergence from the mean \mathbf{\mu} to \mathbf{p} is

\displaystyle D_{KL}(\mathbf{\mu} \Vert \mathbf{p}) = \sum_{i} \mu_{i} \log \frac{\mu_{i}}{p_{i}} = \sum_{i} \mu_{i} \log \mu_{i} - \sum_{i} \mu_{i} \log p_{i} = -H(\mu) - \sum_{i} \mu_{i} \log p_{i}

where H(\mathbf{\mu}) is the entropy of the mean distribution \mathbf{\mu}. Then our log likelihood is:

\displaystyle \log f(\mathbf{p} ; c, \mathbf{\mu}) = -c D_{KL}(\mathbf{\mu} \Vert \mathbf{p}) - c H(\mathbf{\mu}) - \sum_{i=1}^K \log p_i - \log Z

which is looking promising, especially because the H(\mathbf{\mu}) doesn’t depend on p so will just become part of the normalizing constant Z. But what about the other term?

That’s a more awkward customer, but we can still look at it in terms of KL divergence; not from our mean \mathbf{\mu} but from the uniform distribution \mathbf{u} with u_{i}=\frac{1}{N}.

If we substitute in

\displaystyle D_{KL}(\mathbf{u} \Vert \mathbf{p}) = \sum_{i} u_{i} \log \frac{u_{i}}{p_{i}} = -H( \mathbf{u} ) - \frac{1}{N} \sum_{i} \log p_{i}

and pull out the constants we end up with this:

\displaystyle = -c D_{KL}(\mathbf{\mu} \Vert \mathbf{p}) + N D_{KL}(\mathbf{u} \Vert \mathbf{p}) - \log Z

Curiouser and curiouser!

How does it behave?

Our Dirichlet log likelihood doesn’t depend solely on the “distance” from the mean like other two examples, but:

  • It does depend on the KL divergence from the mean, and asymptotically solely on that as we make c \gg N.
  • We can also see that it does so exactly when \mathbf{\mu} = \mathbf{u}, with an effective concentration of c-N.

So the contours of the Dirichlet distribution with uniform mean are balls of constant D_{KL}, which is pleasing. (Someone has made nice 3d prints of these and other related “Bregman balls” here.)

dirichlet.logZ <- function(c, m) {
 sum(lgamma(c*m)) - lgamma(sum(c*m));
logddirichlet <- function(p, c, m) {
  (c*m-1) %*% log(p) - dirichlet.logZ(c,m);

m <- c(1,1,1)/3.;
c <- 25.;

x <- 1:99/100.;
y <- 1:99/100.;
z <- matrix(nrow=99,ncol=99)
for (xx in 1:99) { 
  for (yy in 1:99) { 
    p <- c(x[xx],y[yy],1-x[xx]-y[yy]);
    z[yy,xx] <- logddirichlet(p, c, m);


There’s also a repulsive dependence on the divergence from the uniform p.d.f. \mathbf{u}, which is easiest to see in the univariate (beta distribution) case.

This accounts for the skewness of the distributions when p \neq \frac{1}{2}, and it also explains why the distribution becomes multimodal with peaks at the boundaries when c is small: then the “centrifugal” term pushing away from \mathbf{u} dominates over the “centripetal” term which pulls us toward \mathbf{m}.

curve(dbeta(x, 25, 5));
curve(dbeta(x, 0.8, 0.8));



I’ve got some more questions about this interpretation of the beta/Dirichlet distributions as depending on divergences, please comment if you know the answers!

Why does this unexpected “centrifugal” term appear in the Dirichlet? It’s to do with the pesky -1‘s in the exponents… why are they there?

Sometimes people want to approximate beta distributions by Gaussians. John D.Cook points out that this works reasonably well when the parameters are “large and approximately equal” (exactly the circumstances under which the distribution depends more on D_{KL} from the mean). This paper of David MacKay and Andrew Gelman’s blog both suggest that if you need to approximate Dirichlet distributions with Gaussians, this is best done with a change of coordinates to the softmax basis.

It feels to me like there is already a similarity between D_{KL} and squared Euclidean distance, and so between the Dirichlet and the Gaussian. Maybe it can be made more precise with a better understanding of information geometry; and maybe this might help to come up with or reinterpret a generalization of the Dirichlet with a more general precision/concentration structure for N>2.