### Variational Bayes and the evidence lower bound

#### by benmoran

*Variational methods* for Bayesian inference have been enjoying a renaissance recently in machine learning.

### Problem: normalization can be intractable when applying Bayes’ Theorem

Given a **likelihood** function and a **prior** distribution that we can evaluate, is the joint likelihood.

The posterior is just where we divide by the **evidence** :

However, is frequently intractable. For example, it may not admit a closed form solution, and it is frequently high-dimensional, so even numerical methods like quadrature may not help much.

Here are three techniques we can use to approach it:

- If the posterior and prior are of particular forms so that they are conjugate to one another, then the integral will have a simple closed form. Then the updates to the prior from the likelihood are often trivial to calculate.
- It is possible to draw samples from the posterior before we know the normalization factor. Then we can approximate the expectation stochastically by sample averages. By drawing enough samples the expectations converge to the true values – the theory behind MCMC techniques. Two hindrances arise: it is difficult to know how many samples is “enough”; and “enough samples” can also take a long time to generate.
- We can introduce an approximating distribution . If we can somehow measure the quality of the approximation and iteratively improve it as far as possible, we can use this approximation with confidence. This is the variational approach derived below.

### Variational lower bound

Start with the KL from to , and rearrange to isolate the interesting quantity :

But doesn’t depend on so we can pull it out of the expectation:

Rearranging, we get

Because we saw previously that , we have

This quantity is the evidence lower bound (ELBO). It is a functional of the approximating distribution .

This bound is valuable because it can be calculated without the unknown normalizing constant . However it is equal to at its maximum, when , which also implies .

We have transformed the problem of taking expectations into one of optimization. Now a new question arises – is this problem any easier than the integral we started with? Not necessarily! However we can now make different choices for the form of , so if we can find a family of distributions that is amenable to our available optimization techniques and which also contains a good approximation to the true posterior, we will be happy with the trade-off.

For example, we can rewrite in terms of yet another KL divergence, this time between the approximate posterior and the prior on the latent variables:

If and have the same form – for instance we have chosen them both to be Gaussian – then the second term will have a closed form expression, so if we have a good way to evaluate the first term, as in Kingma & Welling 2013, then we’ll be able to optimize and solve the problem.

(This bound gets its name from the variational principle, applying the calculus of variations to optimize the function without assuming a particular form. However, we frequently assume a fixed-form approximation using a parametric form of this density. In this case no calculus of variations is required, and the problem reduces to an ordinary optimization.)

Very nicely explained ðŸ™‚

in the beginning p(y,z) should be the joint probability not joint likelihood right? or I am mistaken here

Hi Mohamed. You’re right that p(x, y) is the joint probability density. I’m being sloppy to call it this quantity the likelihood when we’re not considering it as a function of the parameters, e.g. .

nice post, man

Cool. You say â€˜we replace expectations by optimizationâ€™. In my own mind, I see it more like â€˜replace marginalization/integration, over all z, by expectationâ€™. Thoughts?

Hi Hugh, you’re right that we could have said “replace integration/marginalization by optimization”. But it’s also true that the quantity we’re trying to calculate is an expectation: . The central problem is always estimating the normalizing constant, but we can choose to solve it in three different ways.

– Calculate the integral analytically,

– Use Monte Carlo to sample from and use empirical averages of the sample to approximate any expectation we want,

– Or use variational methods to optimize an approximation .

Thank you for the nice post.

I have some minor questions regarding the approximation of the posterior p(y|z).

How do we approximate a conditional probability p(y|z) which should in fact be some function of (y, z) by a single variable function q(z)? I am somewhat confused about this part. Any comments? Thanks!

Hi Daniel. Interesting question!

The *posterior* distribution is a distribution over alone, once we have conditioned on the observed data . And is also a distribution over . So after fixing the value of , it does make sense to compare these two distributions over , and ask about the divergence between them.

The *joint* distribution on the other hand is a bivariate distribution over all the possible values of and together.