Indeed, it seems at first like the same approach will not work, because multivariate Cramér-Rao is a matrix inequality, while the scalar proof relies on the Cauchy-Schwarz inequality, which is a statement about inner products. Since an inner product is just a real-valued number, surely a different approach is required for proofs about matrices?

But after reading this 1980 paper of Bultheel I think the same short proof goes through, if we generalise the definition of “inner product” slightly. In fact, this form of Cauchy-Schwarz holds for the familiar *outer* product and the inner product version is just a special case!

Below we’ll confine ourselves to the reals for simplicity, unlike Bultheel who works more abstractly.

We’ll review the scalar case, then extend it to matrices.

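An inner product $\langle u, v \rangle$ on a vector space $V$ over the reals takes two vectors $u, v \in V$ and returns a real number. The prototypical example is the dot product on $\mathbb{R}^n$, $\langle u, v \rangle = u^\top v$, but we can allow others if they satisfy these requirements:

- Symmetry: $\langle u, v \rangle = \langle v, u \rangle$
- Bilinearity: $\langle a u + b v, w \rangle = a \langle u, w \rangle + b \langle v, w \rangle$ for real scalars $a, b$
- Positive definiteness: $\langle u, u \rangle \ge 0$, with equality if and only if $u = 0$.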

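The Cauchy-Schwarz inequality is the following statement about products of inner products:

$$\langle u, v \rangle^2 \le \langle u, u \rangle \langle v, v \rangle$$

We can show this using the definition of the inner product above. Take a vector $u - \lambda v$ where $\lambda$ is a real scalar.

Positive definiteness says:

$$\langle u - \lambda v, u - \lambda v \rangle \ge 0$$

We can use bilinearity to expand this:

$$\langle u, u \rangle - \lambda \langle u, v \rangle - \lambda \langle v, u \rangle + \lambda^2 \langle v, v \rangle \ge 0$$

and symmetry to obtain

$$\langle u, u \rangle - 2 \lambda \langle u, v \rangle + \lambda^2 \langle v, v \rangle \ge 0$$

Now if we make the choice $\lambda = \langle u, v \rangle / \langle v, v \rangle$ (assuming $v \ne 0$) and simplify:

$$\langle u, u \rangle - \frac{\langle u, v \rangle^2}{\langle v, v \rangle} \ge 0$$

We obtain the desired inequality:

$$\langle u, v \rangle^2 \le \langle u, u \rangle \langle v, v \rangle$$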

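Now we’d like something a little more powerful. We can get this if we are willing to generalise the notion of an inner product to something that returns a matrix instead of a number. I’ll denote this new “inner product” by $\langle\langle u, v \rangle\rangle$.

We also need to generalise our axioms slightly for this wider definition. I will follow Bultheel’s definition, simplifying by considering only real-valued matrices. So:

- Symmetry holds up to a transpose. Now that $\langle\langle u, v \rangle\rangle$ is a matrix, we need to add matrix transposition if we swap the arguments, but we still have: $\langle\langle u, v \rangle\rangle = \langle\langle v, u \rangle\rangle^\top$

- Bilinearity still applies, not only with scalar coefficients but also with matrices. We have to be careful about whether we are multiplying on the left or right, because matrix multiplication is not commutative. So we have: $\langle\langle A u + B v, w \rangle\rangle = A \langle\langle u, w \rangle\rangle + B \langle\langle v, w \rangle\rangle$ and $\langle\langle w, A u + B v \rangle\rangle = \langle\langle w, u \rangle\rangle A^\top + \langle\langle w, v \rangle\rangle B^\top$

- Positive definiteness: we will demand that $\langle\langle u, u \rangle\rangle$ is itself positive definite, i.e. $\langle\langle u, u \rangle\rangle \succeq 0$ as a matrix inequality. We can also insist that $\langle\langle u, u \rangle\rangle = 0$ implies $u = 0$.

Now a multivariate Cauchy-Schwarz follows from these axioms just as it did in the scalar case, though again we must take care of the transpositions.

Positive definiteness, for any real matrix $\Lambda$:

$$\langle\langle u - \Lambda v, u - \Lambda v \rangle\rangle \succeq 0$$

Bilinearity:

$$\langle\langle u, u \rangle\rangle - \Lambda \langle\langle v, u \rangle\rangle - \langle\langle u, v \rangle\rangle \Lambda^\top + \Lambda \langle\langle v, v \rangle\rangle \Lambda^\top \succeq 0$$

We substitute $\Lambda = \langle\langle u, v \rangle\rangle \langle\langle v, v \rangle\rangle^{-1}$ (assuming $\langle\langle v, v \rangle\rangle$ is invertible):

$$\langle\langle u, u \rangle\rangle - \langle\langle u, v \rangle\rangle \langle\langle v, v \rangle\rangle^{-1} \langle\langle v, u \rangle\rangle - \langle\langle u, v \rangle\rangle \langle\langle v, v \rangle\rangle^{-\top} \langle\langle u, v \rangle\rangle^\top + \langle\langle u, v \rangle\rangle \langle\langle v, v \rangle\rangle^{-1} \langle\langle v, v \rangle\rangle \langle\langle v, v \rangle\rangle^{-\top} \langle\langle u, v \rangle\rangle^\top \succeq 0$$

Using transpose symmetry to tidy up (so that $\langle\langle v, v \rangle\rangle$ is symmetric and $\langle\langle u, v \rangle\rangle^\top = \langle\langle v, u \rangle\rangle$):

$$\langle\langle u, u \rangle\rangle - \langle\langle u, v \rangle\rangle \langle\langle v, v \rangle\rangle^{-1} \langle\langle v, u \rangle\rangle \succeq 0$$

we obtain the *matrix form of Cauchy-Schwarz*:

$$\langle\langle u, u \rangle\rangle \succeq \langle\langle u, v \rangle\rangle \langle\langle v, v \rangle\rangle^{-1} \langle\langle v, u \rangle\rangle$$

In particular, in the scalar case, this reduces to the usual scalar form of the inequality.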
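A quick numerical sanity check of the matrix inequality is easy to sketch in R. This is only an illustration under my own assumptions: I take the matrix-valued inner product of two random vectors to be their sample second moment, an estimate of $E[u v^\top]$, and `mip` is a hypothetical helper name, not anything from Bultheel's paper.

```r
# Matrix-valued "inner product" of two samples of random vectors, estimated by
# the sample second moment E[u v^T]. Rows are draws, columns are components.
# (Assumption: this choice of inner product is for illustration only.)
mip <- function(U, V) crossprod(U, V) / nrow(U)   # t(U) %*% V / n

set.seed(1)
n <- 500
U <- matrix(rnorm(n * 3), n, 3)
V <- U %*% matrix(rnorm(9), 3, 3) + matrix(rnorm(n * 3), n, 3)  # correlated with U

# Gap in the matrix Cauchy-Schwarz inequality:
#   <<u,u>> - <<u,v>> <<v,v>>^{-1} <<v,u>>
gap <- mip(U, U) - mip(U, V) %*% solve(mip(V, V)) %*% mip(V, U)

# The inequality says the gap is positive semi-definite, so its eigenvalues
# should be non-negative (up to rounding error).
gap.eigen <- eigen((gap + t(gap)) / 2, symmetric = TRUE, only.values = TRUE)$values
print(gap.eigen)
```

The gap here is a Schur complement of the joint second-moment matrix of the stacked vectors, which is one way to see why it cannot have negative eigenvalues.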

I think this is cute! For one thing, we’ve just defined the *outer* product to be an *inner* product!

(The outer product between two $n$-dimensional vectors $u, v$ is the $n \times n$ matrix $u v^\top$, while the Euclidean dot product is the scalar $u^\top v$.)

Yet since the outer product is transpose symmetric, bilinear, and results in a positive semi-definite matrix for a single vector, it’s a perfectly good inner product for these purposes. I wonder why Cauchy-Schwarz is more commonly known in the less general inner product form?

I’m also intrigued by the geometric connotations of “matrix-valued inner products”. The inner product is an algebraic construction which is geometrically motivated, and so bridges these two aspects of mathematics. The inner product is at the core of geometry and defines:

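- length of vectors (from the induced norm $\|u\| = \sqrt{\langle u, u \rangle}$)
- angles between vectors (from $\cos \theta = \langle u, v \rangle / (\|u\| \|v\|)$), and in particular orthogonality when $\langle u, v \rangle = 0$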
- projections onto sets (by minimizing the norm, or by orthogonality)

So – what would it mean geometrically for a length or an angle to be matrix valued?

I don’t know! But it does occur to me that if you have two ordinary, independent scalar metrics $\langle u, v \rangle_1$ and $\langle u, v \rangle_2$, you can always compose these into a new “matrix-valued metric” $\mathrm{diag}(\langle u, v \rangle_1, \langle u, v \rangle_2)$. (This is still positive definite in the sense above.) That declares two vectors to be orthogonal when *both* of the component metrics are: this means that the trace of our matrix, which is the sum of its eigenvalues, will be zero. (If we had used the determinant instead, it would declare orthogonality whenever any of the constituents did.)

Furthermore, the trace of the outer product recovers the inner product. In fact, the trace already gives a proper inner product between square matrices, thought of as a vector space: $\langle A, B \rangle = \mathrm{tr}(A^\top B)$. So we can squash our matrix-valued inner product back to an ordinary scalar inner product by taking the trace. And if we do this for our diagonal matrix of independent metrics, we recover the usual metric on $\mathbb{R}^2$! It was there all along, but the matrix-valued metric additionally preserves more information about *along which basis directions* the vectors agree and disagree.
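The trace remark is easy to check by hand in R. A tiny sketch, with vectors I made up for the purpose:

```r
u <- c(1, 2, 3)
v <- c(4, -5, 6)

outer.uv <- u %o% v        # 3 x 3 outer product u v^T
dot.uv   <- sum(u * v)     # Euclidean dot product u^T v

# The trace of the outer product recovers the dot product.
trace.uv <- sum(diag(outer.uv))

# The diagonal keeps the per-coordinate contributions that the trace adds up.
per.coordinate <- diag(outer.uv)
```

Here `per.coordinate` shows that the vectors agree along the first and third axes and disagree along the second, which is the extra information the scalar dot product throws away.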

]]>

Given a **likelihood** function $p(x \mid z)$ and a **prior** distribution $p(z)$ that we can evaluate, $p(x, z) = p(x \mid z) \, p(z)$ is the joint likelihood.

The posterior is just $p(z \mid x) = p(x, z) / p(x)$, where we divide by the **evidence** $p(x)$:

$$p(x) = \int p(x, z) \, dz$$

However, $p(x)$ is frequently intractable. For example, it may not admit a closed form solution, and it is frequently high-dimensional, so even numerical methods like quadrature may not help much.

Here are three techniques we can use to approach it:

- If the likelihood and prior are of particular forms so that the prior is conjugate to the likelihood, then the posterior stays in the same family as the prior and the integral has a simple closed form. Then the updates to the prior from the likelihood are often trivial to calculate.
- It is possible to draw samples from the posterior before we know the normalization factor. Then we can approximate the expectation stochastically by sample averages. By drawing enough samples the expectations converge to the true values – the theory behind MCMC techniques. Two hindrances arise: it is difficult to know how many samples is “enough”; and “enough samples” can also take a long time to generate.
- We can introduce an approximating distribution $q(z)$. If we can somehow measure the quality of the approximation and iteratively improve it as far as possible, we can use this approximation with confidence. This is the variational approach derived below.

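Start with the KL from $q(z)$ to $p(z \mid x)$, and rearrange to isolate the interesting quantity $\log p(x)$:

$$\mathrm{KL}(q(z) \,\|\, p(z \mid x)) = \mathbb{E}_{q}[\log q(z) - \log p(z \mid x)] = \mathbb{E}_{q}[\log q(z) - \log p(x, z) + \log p(x)]$$

But $\log p(x)$ doesn’t depend on $z$ so we can pull it out of the expectation:

$$\mathrm{KL}(q(z) \,\|\, p(z \mid x)) = \mathbb{E}_{q}[\log q(z) - \log p(x, z)] + \log p(x)$$

Rearranging, we get

$$\log p(x) = \mathbb{E}_{q}[\log p(x, z) - \log q(z)] + \mathrm{KL}(q(z) \,\|\, p(z \mid x))$$

Because we saw previously that $\mathrm{KL} \ge 0$, we have

$$\log p(x) \ge \mathbb{E}_{q}[\log p(x, z) - \log q(z)] = \mathcal{L}(q)$$

This quantity is the evidence lower bound (ELBO). It is a functional of the approximating distribution $q$.

This bound is valuable because it can be calculated without the unknown normalizing constant $p(x)$. However it is equal to $\log p(x)$ at its maximum, when $q(z) = p(z \mid x)$, which also implies $\mathrm{KL}(q(z) \,\|\, p(z \mid x)) = 0$.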

We have transformed the problem of taking expectations into one of optimization. Now a new question arises – is this problem any easier than the integral we started with? Not necessarily! However we can now make different choices for the form of $q$, so if we can find a family of distributions that is amenable to our available optimization techniques and which also contains a good approximation to the true posterior, we will be happy with the trade-off.

For example, we can rewrite the ELBO $\mathcal{L}(q)$ in terms of yet another KL divergence, this time between the approximate posterior and the prior on the latent variables:

$$\mathcal{L}(q) = \mathbb{E}_{q}[\log p(x \mid z)] - \mathrm{KL}(q(z) \,\|\, p(z))$$

If $q(z)$ and $p(z)$ have the same form – for instance we have chosen them both to be Gaussian – then the second term will have a closed form expression, so if we have a good way to evaluate the first term, as in Kingma & Welling 2013, then we’ll be able to optimize and solve the problem.
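To see the bound in action, here is a minimal sketch in R. The toy model is my own assumption, chosen because everything is available in closed form: a standard normal prior on the latent variable and a unit-variance Gaussian likelihood, so the evidence and the exact posterior are known and we can check that the ELBO never exceeds the log evidence and touches it at the true posterior.

```r
# Toy model (an assumption for illustration, not part of the derivation):
#   prior       z ~ N(0, 1)
#   likelihood  x | z ~ N(z, 1)
# so the evidence is x ~ N(0, 2) and the posterior is z | x ~ N(x/2, 1/2).
x <- 1.3
log.evidence <- dnorm(x, mean = 0, sd = sqrt(2), log = TRUE)

# ELBO for a Gaussian q(z) = N(m, s^2):
#   E_q[log p(x|z)] - KL(q(z) || p(z)), both in closed form here.
elbo <- function(m, s) {
  expected.loglik <- -0.5 * log(2 * pi) - 0.5 * ((x - m)^2 + s^2)
  kl.to.prior     <- 0.5 * (s^2 + m^2 - 1 - log(s^2))
  expected.loglik - kl.to.prior
}

# Maximize over (m, log s); optim minimizes, so negate the ELBO.
fit   <- optim(c(0, 0), function(par) -elbo(par[1], exp(par[2])))
m.hat <- fit$par[1]
s.hat <- exp(fit$par[2])

print(c(elbo = elbo(m.hat, s.hat), log.evidence = log.evidence))
print(c(m = m.hat, s = s.hat, posterior.mean = x / 2, posterior.sd = sqrt(0.5)))
```

Optimizing over `log s` rather than `s` keeps the standard deviation positive without needing a constrained optimizer.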

(This bound gets its name from the variational principle, applying the calculus of variations to optimize the function without assuming a particular form. However, we frequently assume a fixed-form approximation using a parametric form of this density. In this case no calculus of variations is required, and the problem reduces to an ordinary optimization.)

]]>

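Jensen implies that $\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$ when $f$ is convex.

Setting $f = -\log$ and $X = p(x)/q(x)$ with $x \sim q$ gives

$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q}\left[-\log \frac{p(x)}{q(x)}\right] \ge -\log \mathbb{E}_{q}\left[\frac{p(x)}{q(x)}\right] = -\log 1 = 0$$

Furthermore, the KL divergence is just one member of a more general family, the Csiszár $f$-divergences. These have the form

$$D_f(p \,\|\, q) = \mathbb{E}_{q}\left[f\left(\frac{p(x)}{q(x)}\right)\right]$$

for some convex function $f$. The same argument applies here (noting that the lower bound is now $f(1)$, so this will only translate into non-negativity for particular choices of $f$).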
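The $f(1)$ lower bound can be checked numerically for discrete distributions. A small sketch in R; the helper name `f.div` and the three example functions are my own choices for illustration:

```r
# Generic f-divergence between discrete distributions: E_q[ f(p/q) ].
f.div <- function(p, q, f) sum(q * f(p / q))

set.seed(42)
p <- runif(5); p <- p / sum(p)
q <- runif(5); q <- q / sum(q)

neg.log <- function(t) -log(t)      # gives KL(q || p); f(1) = 0
t.log.t <- function(t) t * log(t)   # gives KL(p || q); f(1) = 0
shifted <- function(t) (t - 2)^2    # convex, but f(1) = 1

print(c(f.div(p, q, neg.log), f.div(p, q, t.log.t), f.div(p, q, shifted)))
```

The first two values are non-negative because $f(1) = 0$ for those choices, while the third is bounded below by 1 rather than 0, and equals 1 exactly when the two distributions coincide.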

]]>

My talk is here.

Here are the slides (which probably don’t make much sense without the talk), and here are the associated IPython notebooks I used:

]]>

But what’s the intuitive picture of how the symmetry fails? Recently I saw Will Penny explain this (at the Free Energy Principle workshop, of which hopefully more later). I am going to “borrow” very liberally from his talk. I think he also mentioned that he was using slides from Chris Bishop’s book, so this material might be well known from there (I haven’t read it).

    # A narrow Gaussian (red) and a wide one (blue)
    f <- function(x) { dnorm(x, mean=0, sd=0.5) }
    g <- function(x) { dnorm(x, mean=3, sd=2) }
    curve(f(x), -6, 6, col="red", ylim=c(-1,1))
    curve(g(x), -6, 6, col="blue", add=TRUE)
    # Scaled log-ratios (dotted): log f - log g in red, log g - log f in blue
    scale <- 0.05
    curve(scale*(log(f(x))-log(g(x))), -6, 6, col="red", lty=3, add=TRUE)
    curve(scale*(log(g(x))-log(f(x))), -6, 6, col="blue", lty=3, add=TRUE)

The plot shows two Gaussians, a lower variance distribution $f$ in red and a wider distribution $g$ in blue. You can also see the (scaled) quantity $\log f(x) - \log g(x)$ as the red dotted line, and its negative $\log g(x) - \log f(x)$ as the blue dotted line.

The KL divergence $\mathrm{KL}(f \,\|\, g)$ is the expectation under the red pdf of the red dotted line, and $\mathrm{KL}(g \,\|\, f)$ is the corresponding expectation for the blue pair. A couple of observations reveal two sources of disagreement between them:

- The KL from $f$ gives little weight to discrepancies outside the narrow region where $f$ has the most probability density; what the broader density does out here will affect it very little
- The quantity inside the expectations, the difference of logs, is actually *anti*-symmetric. (Somehow it’s remarkable that we can take two differently positively weighted sums of the same quantity with and without a minus sign, but always guarantee a positive result, since $\mathrm{KL} \ge 0$!)
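For two univariate Gaussians the KL divergence has a well-known closed form, so we can put numbers on the asymmetry for the pair plotted above (narrow: mean 0, sd 0.5; wide: mean 3, sd 2). A short sketch, where `kl.gauss` is a helper name of my own:

```r
# Closed-form KL( N(mu1, sd1^2) || N(mu2, sd2^2) ) for univariate Gaussians.
kl.gauss <- function(mu1, sd1, mu2, sd2) {
  log(sd2 / sd1) + (sd1^2 + (mu1 - mu2)^2) / (2 * sd2^2) - 0.5
}

kl.fg <- kl.gauss(0, 0.5, 3, 2)   # expectation under the narrow (red) density
kl.gf <- kl.gauss(3, 2, 0, 0.5)   # expectation under the wide (blue) density
print(c(kl.fg = kl.fg, kl.gf = kl.gf))
```

The two directions come out at roughly 2.04 and 24.1 nats: the wide density puts a lot of mass where the narrow one is vanishingly small, and is penalized heavily for it.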

Say we have a multi-modal distribution $p(x)$, like this mixture of two Gaussians:

    mu1 <- 4.0; sd1 <- 0.8
    mu2 <- 1; sd2 <- 0.5
    # Equal-weight mixture of two Gaussians
    curve(0.5*dnorm(x, mean=mu1, sd=sd1) + 0.5*dnorm(x, mean=mu2, sd=sd2), -6, 6)

and want to approximate it with a simpler distribution – we’ll use a single Gaussian – by minimizing the KL divergence.

We can choose to minimize either $\mathrm{KL}(q \,\|\, p)$ or $\mathrm{KL}(p \,\|\, q)$. Penny’s claim is that:

- minimizing $\mathrm{KL}(q \,\|\, p)$ results in matching a Gaussian to one of the modes (the approximating distribution is more compact than the true distribution)
- minimizing $\mathrm{KL}(p \,\|\, q)$ results in matching moments (our approximating distribution will have the same mean and variance as the mixture, so will be more spread out than either of the two peaks in the true distribution).

In the first case, the approximating density can choose to “ignore” troublesome parts of the target density which are difficult to fit by reducing its variance. We’ve seen above that the tails will receive little weight and can’t affect the divergence much.

In the second case, the expectation is taken under the target density $p$, so $q$ has to try to accommodate the whole thing as best it can, multiple modes and all.

(Once again David MacKay’s wonderful ITILA book has some interesting material related to this; Chapter 33 on variational free energy minimization.)

I wanted to try this optimization out. The KL divergence between two Gaussians has a closed form but it seems this is not the case for mixtures of Gaussians. So let’s fake it by sampling our densities and using discrete distributions:

    # Evaluate the target mixture on a grid so the KL integrals become sums
    dmog <- function(x, mu1, sd1, mu2, sd2) {
      0.5*dnorm(x, mean=mu1, sd=sd1) + 0.5*dnorm(x, mean=mu2, sd=sd2)
    }
    xx <- -1000:1000/100
    p <- dmog(xx, 4, 0.8, 1, 0.5)   # dmog must be defined before this call

    # Discrete KL divergence, up to the constant grid spacing
    kl.div <- function(p, q) { p %*% log(p/q) }
    # KL(p||q) and KL(q||p) as functions of the Gaussian's (mean, sd)
    dkl.pq <- function(pars) { q <- dnorm(xx, mean=pars[1], sd=pars[2]); kl.div(p, q) }
    dkl.qp <- function(pars) { q <- dnorm(xx, mean=pars[1], sd=pars[2]); kl.div(q, p) }

    init <- c(0,2)
    result.qp <- optim(init, dkl.qp)
    result.pq <- optim(init, dkl.pq)

    # Target in red, KL(q||p) fit in blue, KL(p||q) fit in green
    plot(xx, p, type="l", col="red", ylim=c(0,1))
    curve(dnorm(x, mean=result.qp$par[1], sd=result.qp$par[2]), -10, 10, col="blue", add=TRUE)
    curve(dnorm(x, mean=result.pq$par[1], sd=result.pq$par[2]), -10, 10, col="green", add=TRUE)
    result.qp
    result.pq

We get the expected result:

- the green curve is the minimum $\mathrm{KL}(p \,\|\, q)$ fit, which approximates the mixture distribution’s mean (and variance?)
- the blue curve is the minimum $\mathrm{KL}(q \,\|\, p)$ fit, which here approximately fits the mean and sd of one of the two mixture components. It’s a bit more touchy though, and depends on the initial conditions now.
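On the “(and variance?)” question: for a Gaussian approximating family, the minimizer of $\mathrm{KL}(p \,\|\, q)$ is known to match both the mean and the variance of the target, so we can compute where the green curve ought to land without running the optimizer. A sketch for the mixture above, with variable names of my own choosing:

```r
# Target: the equal-weight mixture of N(4, 0.8^2) and N(1, 0.5^2) used above.
w  <- c(0.5, 0.5)
mu <- c(4, 1)
s  <- c(0.8, 0.5)

# Mean and variance of a mixture, from the component moments.
mix.mean <- sum(w * mu)
mix.var  <- sum(w * (s^2 + mu^2)) - mix.mean^2
mix.sd   <- sqrt(mix.var)

# Cross-check with a Riemann sum over a fine grid.
dx   <- 0.001
grid <- seq(-10, 15, by = dx)
dens <- w[1] * dnorm(grid, mu[1], s[1]) + w[2] * dnorm(grid, mu[2], s[2])
grid.mean <- sum(grid * dens) * dx
grid.var  <- sum((grid - grid.mean)^2 * dens) * dx

print(c(mean = mix.mean, sd = mix.sd))
```

This gives a mean of 2.5 and a standard deviation of about 1.64, which can be compared against `result.pq$par` from the optimization.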

UPDATE:

After I posted this Iain Murray linked to some nice further material on Twitter:

- This note describing in more detail when the “$q$ is more compact” property does and doesn’t apply
- A set of video lectures by David MacKay to accompany the ITILA book

In particular Lecture 14, on approximating distributions by variational free energy minimization, covers much of this material.

It also sets the mathematical scene for what I’m intending to talk about next: Karl Friston’s Free Energy Principle, which applies extensions of these ideas to understanding the brain and behaviour.

]]>

There’s an interesting correspondence between unimodal p.d.f.’s on a metric space (like the reals $\mathbb{R}$), and distance functions. I will dig through this for amusement below, with some R code to generate the pictures.

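For example, in one dimension the normal p.d.f. is

$$\phi(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right)$$

so the log likelihood for a single $x$ and fixed $\sigma$ is, up to an additive constant, proportional to the squared distance from the mean: $\log \phi(x) = -\frac{(x - \mu)^2}{2 \sigma^2} + \text{const}$. Here’s the log likelihood for unit variance with $\mu = 0.5$: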

    mu <- 0.5
    par(mfrow=c(2,1))
    # Top panel: log density, a downward parabola centred on mu
    curve(dnorm(x, mean=mu, sd=1, log=TRUE), -4, 5, xlab="x", ylab=expression(log(phi*("x"))))
    abline(v=mu)
    # Bottom panel: the density itself
    curve(dnorm(x, mean=mu, sd=1, log=FALSE), -4, 5, xlab="x", ylab=expression(phi*("x")))
    abline(v=mu)
    text(mu, 0, expression(mu=="0.5"))

This is basic stuff at the core of classical statistical techniques: least squares models are computationally straightforward to fit, but they also happen to imply Gaussian errors, which is a reasonable first approximation for a lot of real life data.

Of course we could look at it the other way. If I want to measure the Euclidean distance from a point $a$ to a point $b$, I can back this out from the log likelihood of $b$ under a unit variance Gaussian centred at $a$.

Putting $a = 2$ and $b = 4$:

    a <- 2
    b <- 4
    # Undo the Gaussian log density: -2*log(phi) = (b-a)^2 + log(2*pi)
    sqrt(-2*dnorm(b, mean=a, sd=1, log=TRUE) - log(2*pi))

This is rather a roundabout way of measuring distances. Would you ever want to do this? Well, in the multivariate case, if the covariance matrix isn’t spherical you get a distance measure that depends on direction; something a bit like the Mahalanobis distance.
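To make the multivariate remark concrete, here is a sketch in R that backs a distance out of a bivariate Gaussian log density and compares it with base R's `mahalanobis()`. The centre, point and covariance matrix are values I picked for illustration, and the log density is written out by hand to avoid depending on an extra package.

```r
# Back out a direction-dependent distance from a multivariate Gaussian
# log density with a non-spherical covariance (values chosen for illustration).
a     <- c(0, 0)                           # centre of the Gaussian
b     <- c(2, 1)                           # point to measure to
Sigma <- matrix(c(4, 1, 1, 1), nrow = 2)   # covariance matrix

# Log density of N(a, Sigma) at b, written out by hand.
d       <- length(a)
quad    <- drop(t(b - a) %*% solve(Sigma) %*% (b - a))
logdens <- -0.5 * (d * log(2 * pi) + log(det(Sigma)) + quad)

# Invert the log density to recover the distance ...
dist.from.loglik <- sqrt(-2 * logdens - d * log(2 * pi) - log(det(Sigma)))
# ... and compare with mahalanobis(), which returns the squared distance.
dist.mahalanobis <- sqrt(mahalanobis(b, center = a, cov = Sigma))

print(c(dist.from.loglik, dist.mahalanobis))
```

With these numbers the two agree, so in the Gaussian case the distance you recover is not just a bit like the Mahalanobis distance but the same thing.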

For our next trick, we can try measuring distances with other norms than $L_2$, the good old squared Euclidean distance. For instance if we minimize the absolute error instead of the squared error (e.g. for Quantile Regression) we’re still measuring distances but now they’re in the achingly fashionable $L_1$ norm. We’ve implicitly assumed the errors follow the Laplace or “double exponential” distribution so the log likelihood looks like this:

    mu <- 0.5
    b <- 1
    par(mfrow=c(2,1))
    # Laplace density: exp(-|x - mu| / b) / (2 * b); note the parentheses around 2*b
    curve(log(exp(-abs(x-mu)/b)/(2*b)), -4, 5, xlab="x", ylab=expression(log(f(x))))
    abline(v=mu)
    curve(exp(-abs(x-mu)/b)/(2*b), -4, 5, xlab="x", ylab=expression(f(x)))
    abline(v=mu)
    text(mu, 0, expression(mu==0.5))
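The practical difference between the two norms shows up as soon as the data contain an outlier. A small sketch with made-up data: minimizing squared error recovers the mean, while minimizing absolute error (maximum likelihood under Laplace errors) recovers the median.

```r
x <- c(1, 2, 3, 7, 50)   # made-up sample with one outlier

# Location estimates that minimize squared (L2) and absolute (L1) error.
l2.fit <- optimize(function(m) sum((x - m)^2),  range(x))$minimum
l1.fit <- optimize(function(m) sum(abs(x - m)), range(x))$minimum

print(c(l2.fit = l2.fit, mean = mean(x), l1.fit = l1.fit, median = median(x)))
```

The absolute-error fit stays near the bulk of the data, while the squared-error fit is dragged towards the outlier.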

What if we go a bit further off-piste: let’s consider the Kullback-Leibler divergence between two arbitrary $N$-dimensional probability distributions $p$ and $q$:

$$\mathrm{KL}(p \,\|\, q) = \sum_{i=1}^{N} p_i \log \frac{p_i}{q_i}$$

This is a bit different from the examples above, where the log-probability depends on metric distances between points in $\mathbb{R}^N$. Firstly: the KL divergence isn’t a proper metric (it’s not symmetric and doesn’t obey the triangle inequality). However, it does act as a distance in the sense that $\mathrm{KL}(p \,\|\, q) \ge 0$ with equality only when $p = q$, which is good enough for our purposes (it’s a “premetric”).

Secondly, we’re now confining ourselves to points that live on the probability simplex $\Delta^{N-1}$, instead of anywhere in $\mathbb{R}^N$. That means our distributions have to satisfy $p_i \ge 0$, and $\sum_i p_i = 1$.

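A widely used distribution on the simplex is the Dirichlet distribution:

$$f(p; \alpha) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_{i=1}^{N} p_i^{\alpha_i - 1}$$

which is also called the beta distribution in the case when $N = 2$. Then we have a single degree of freedom $p_1$ (with $p_2 = 1 - p_1$):

$$f(p_1; \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} p_1^{\alpha_1 - 1} (1 - p_1)^{\alpha_2 - 1}$$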

Now; is it too much to hope for that the log-likelihood for the beta distribution follows the same form as the Gaussian and Laplace distributions above, but with the KL divergence replacing the metric distance? (SPOILERS: yes, but not by much, and let’s see where…)

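Firstly, the other distributions had two parameters: a mean or location ($\mu$) and a scale or dispersion parameter ($\sigma$ or $b$). Here we seem to have just an unnormalized $N$-dimensional parameter vector of pseudo-counts $\alpha$. But as David MacKay will tell you in ITILA, you can factor this into a positive real concentration parameter $c$ and a normalized mean vector $m$ which lives on the simplex: $\alpha = c \, m$.

Let’s write out the log-likelihood in this form, lumping all the messy gamma functions at the front into a normalizing factor $Z(c, m)$.

$$\log f(p) = \sum_i (c \, m_i - 1) \log p_i - \log Z(c, m)$$

Recall that the KL divergence from the mean $m$ to $p$ is

$$\mathrm{KL}(m \,\|\, p) = \sum_i m_i \log \frac{m_i}{p_i} = -H(m) - \sum_i m_i \log p_i$$

where $H(m)$ is the entropy of the mean distribution $m$. Then our log likelihood is:

$$\log f(p) = -c \, \mathrm{KL}(m \,\|\, p) - c \, H(m) - \sum_i \log p_i - \log Z(c, m)$$

which is looking promising, especially because the $c \, H(m)$ doesn’t depend on $p$ so will just become part of the normalizing constant $Z$. But what about the other term?

That’s a more awkward customer, but we can still look at it in terms of KL divergence; not from our mean $m$ but from the uniform distribution $u$ with $u_i = 1/N$.

If we substitute in

$$\mathrm{KL}(u \,\|\, p) = \sum_i \frac{1}{N} \log \frac{1/N}{p_i} = -\log N - \frac{1}{N} \sum_i \log p_i$$

and pull out the constants we end up with this:

$$\log f(p) = -c \, \mathrm{KL}(m \,\|\, p) + N \, \mathrm{KL}(u \,\|\, p) + \text{const}$$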

Curiouser and curiouser!

Our Dirichlet log likelihood doesn’t depend solely on the “distance” from the mean like the other two examples, but:

- It does depend on the KL divergence from the mean, and asymptotically solely on that as we make $c \to \infty$.
- We can also see that it does so exactly when $m = u$, with an effective concentration of $c - N$.

So the contours of the Dirichlet distribution with uniform mean are balls of constant $\mathrm{KL}(u \,\|\, p)$, which is pleasing. (Someone has made nice 3d prints of these and other related “Bregman balls” here.)

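    # log of the Dirichlet normalizer Z(c, m) for concentration c and mean m
    dirichlet.logZ <- function(c, m) { sum(lgamma(c*m)) - lgamma(sum(c*m)) }
    # log density of Dirichlet(c*m) at a point p on the simplex
    logddirichlet <- function(p, c, m) { (c*m-1) %*% log(p) - dirichlet.logZ(c,m) }

    m <- c(1,1,1)/3
    c <- 25
    x <- 1:99/100
    y <- 1:99/100
    z <- matrix(nrow=99, ncol=99)
    for (xx in 1:99) {
      for (yy in 1:99) {
        p <- c(x[xx], y[yy], 1-x[xx]-y[yy])
        # z stays NA outside the simplex, where the third coordinate is not positive;
        # filled.contour expects z[i,j] to be the value at (x[i], y[j])
        if (p[3] > 0) z[xx,yy] <- logddirichlet(p, c, m)
      }
    }
    filled.contour(x, y, exp(z))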
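The decomposition itself can also be checked numerically. The sketch below assumes the form in which the log density depends on the point only through minus the concentration times the KL divergence from the mean, plus the dimension times the KL divergence from the uniform distribution; that form is consistent with the effective concentration in the bullet above. Function and variable names (`kl`, `log.dirichlet`, `conc`) are my own, and the non-uniform mean is an arbitrary choice.

```r
# KL divergence between discrete distributions on the simplex.
kl <- function(a, b) sum(a * log(a / b))

# Dirichlet log density with concentration conc and mean m (alpha = conc * m).
log.dirichlet <- function(p, conc, m) {
  alpha <- conc * m
  sum((alpha - 1) * log(p)) - (sum(lgamma(alpha)) - lgamma(sum(alpha)))
}

N    <- 3
conc <- 25
m    <- c(0.5, 0.3, 0.2)   # arbitrary non-uniform mean
u    <- rep(1 / N, N)      # uniform distribution

# Whatever is left after removing the two KL terms should not depend on p.
remainder <- function(p) log.dirichlet(p, conc, m) + conc * kl(m, p) - N * kl(u, p)

set.seed(7)
ps <- replicate(4, { r <- rexp(N); r / sum(r) }, simplify = FALSE)
remainders <- sapply(ps, remainder)

# Predicted constant: -conc * H(m) + N * log(N) - log Z
H.m       <- -sum(m * log(m))
predicted <- -conc * H.m + N * log(N) - (sum(lgamma(conc * m)) - lgamma(conc))

print(rbind(remainders, predicted))
```

If the decomposition holds, every entry of `remainders` matches `predicted` regardless of which random point on the simplex was drawn.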

There’s also a *repulsive* dependence on the divergence from the uniform p.d.f. $u$, which is easiest to see in the univariate (beta distribution) case.

This accounts for the skewness of the distributions when $m \ne u$, and it also explains why the distribution becomes multimodal with peaks at the boundaries when $c$ is small: then the “centrifugal” term pushing away from $u$ dominates over the “centripetal” term which pulls us toward $m$.

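    par(mfrow=c(2,1))
    # Large concentration, non-uniform mean: unimodal but skewed
    curve(dbeta(x, 25, 5))
    # Small concentration: peaks at the boundaries
    curve(dbeta(x, 0.8, 0.8))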

I’ve got some more questions about this interpretation of the beta/Dirichlet distributions as depending on divergences, please comment if you know the answers!

Why does this unexpected “centrifugal” term appear in the Dirichlet? It’s to do with the pesky $-1$’s in the exponents… why are they there?

Sometimes people want to approximate beta distributions by Gaussians. John D. Cook points out that this works reasonably well when the parameters are “large and approximately equal” (exactly the circumstances under which the distribution depends more on the $\mathrm{KL}$ from the mean). This paper of David MacKay and Andrew Gelman’s blog both suggest that if you need to approximate Dirichlet distributions with Gaussians, this is best done with a change of coordinates to the softmax basis.

It feels to me like there is already a similarity between $\mathrm{KL}$ and squared Euclidean distance, and so between the Dirichlet and the Gaussian. Maybe it can be made more precise with a better understanding of information geometry; and maybe this might help to come up with or reinterpret a generalization of the Dirichlet with a more general precision/concentration structure.

]]>