### Variational Bayes and the evidence lower bound

#### by benmoran

*Variational methods* for Bayesian inference have been enjoying a renaissance recently in machine learning.

### Problem: normalization can be intractable when applying Bayes’ Theorem

Given a **likelihood** function and a **prior** distribution that we can evaluate, is the joint likelihood.

The posterior is just where we divide by the **evidence** :

However, is frequently intractable. For example, it may not admit a closed form solution, and it is frequently high-dimensional, so even numerical methods like quadrature may not help much.

Here are three techniques we can use to approach it:

- If the posterior and prior are of particular forms so that they are conjugate to one another, then the integral will have a simple closed form. Then the updates to the prior from the likelihood are often trivial to calculate.
- It is possible to draw samples from the posterior before we know the normalization factor. Then we can approximate the expectation stochastically by sample averages. By drawing enough samples the expectations converge to the true values – the theory behind MCMC techniques. Two hindrances arise: it is difficult to know how many samples is “enough”; and “enough samples” can also take a long time to generate.
- We can introduce an approximating distribution . If we can somehow measure the quality of the approximation and iteratively improve it as far as possible, we can use this approximation with confidence. This is the variational approach derived below.

### Variational lower bound

Start with the KL from to , and rearrange to isolate the interesting quantity :

But doesn’t depend on so we can pull it out of the expectation:

Rearranging, we get

Because we saw previously that , we have

This quantity is the evidence lower bound (ELBO). It is a functional of the approximating distribution .

This bound is valuable because it can be calculated without the unknown normalizing constant . However it is equal to at its maximum, when , which also implies .

We have transformed the problem of taking expectations into one of optimization. Now a new question arises – is this problem any easier than the integral we started with? Not necessarily! However we can now make different choices for the form of , so if we can find a family of distributions that is amenable to our available optimization techniques and which also contains a good approximation to the true posterior, we will be happy with the trade-off.

For example, we can rewrite in terms of yet another KL divergence, this time between the approximate posterior and the prior on the latent variables:

If and have the same form – for instance we have chosen them both to be Gaussian – then the second term will have a closed form expression, so if we have a good way to evaluate the first term, as in Kingma & Welling 2013, then we’ll be able to optimize and solve the problem.

(This bound gets its name from the variational principle, applying the calculus of variations to optimize the function without assuming a particular form. However, we frequently assume a fixed-form approximation using a parametric form of this density. In this case no calculus of variations is required, and the problem reduces to an ordinary optimization.)

Very nicely explained 🙂

in the beginning p(y,z) should be the joint probability not joint likelihood right? or I am mistaken here

Hi Mohamed. You’re right that p(x, y) is the joint probability density. I’m being sloppy to call it this quantity the likelihood when we’re not considering it as a function of the parameters, e.g. .

nice post, man

Cool. You say ‘we replace expectations by optimization’. In my own mind, I see it more like ‘replace marginalization/integration, over all z, by expectation’. Thoughts?