Variational Bayes and the evidence lower bound

by benmoran

Variational methods for Bayesian inference have been enjoying a renaissance recently in machine learning.

Problem: normalization can be intractable when applying Bayes’ Theorem

Given a likelihood function p(y \vert z) and a prior distribution p(z) that we can evaluate, their product p(y \vert z)p(z) = p(y,z) is the joint distribution.

The posterior is just p(z\vert y) = p(y\vert z)p(z)/p(y) where we divide by the evidence p(y):

p(y) = \int p(y\vert z) p(z) dz = \mathbb{E}_{p(z)}[p(y\vert z)]

However, this integral is frequently intractable. It may not admit a closed-form solution, and z is often high-dimensional, so even numerical methods like quadrature may not help much.
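
To see the expectation form in action, here is a minimal sketch using a toy one-dimensional model (z \sim N(0,1), y \vert z \sim N(z,1), chosen purely for illustration) where the exact evidence is available in closed form. Naive Monte Carlo, averaging the likelihood over draws from the prior, works fine here, but in high dimensions this kind of estimate quickly becomes useless, which is exactly the problem described above.

```python
# Naive Monte Carlo estimate of the evidence p(y) = E_{p(z)}[p(y|z)]
# for a toy model: z ~ N(0, 1), y | z ~ N(z, 1), observed y = 0.5.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = 0.5
prior = stats.norm(0.0, 1.0)

# Average the likelihood over samples drawn from the prior.
z_samples = prior.rvs(size=100_000, random_state=rng)
p_y_mc = np.mean(stats.norm(z_samples, 1.0).pdf(y))

# Exact evidence for comparison: marginally, y ~ N(0, sqrt(2)).
p_y_exact = stats.norm(0.0, np.sqrt(2.0)).pdf(y)

print(p_y_mc, p_y_exact)  # the two values should agree to a few decimal places
```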

Here are three techniques we can use to approach it:

  1. If the prior p(z) and the likelihood p(y\vert z) have particular forms such that they are conjugate to one another, then the posterior takes the same form as the prior and the integral has a simple closed form. The updates to the prior from the likelihood are then often trivial to calculate (see the sketch just after this list).
  2. It is possible to draw samples from the posterior even without knowing the normalization factor, and to approximate expectations stochastically by sample averages. With enough samples the averages converge to the true values; this is the theory behind MCMC techniques. Two hindrances arise: it is difficult to know how many samples are “enough”, and “enough” samples can take a long time to generate.
  3. We can introduce an approximating distribution q(z) \approx p(z\vert y). If we can somehow measure the quality of the approximation and iteratively improve it as far as possible, we can use this approximation with confidence. This is the variational approach derived below.
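
As an illustration of the first option, here is a minimal sketch of a conjugate update, using the standard Beta-Bernoulli example (a coin with unknown bias): the Beta prior is conjugate to the Bernoulli likelihood, so the posterior is again a Beta whose parameters we get by simply counting successes and failures.

```python
# Conjugate update: Beta prior + Bernoulli likelihood.
# The posterior is Beta(alpha + heads, beta + tails); no integration required.
import numpy as np

def beta_bernoulli_update(alpha, beta, observations):
    """Return posterior Beta parameters after observing 0/1 outcomes."""
    heads = int(np.sum(observations))
    tails = len(observations) - heads
    return alpha + heads, beta + tails

# Start from a uniform Beta(1, 1) prior, then observe 7 heads and 3 tails.
alpha_post, beta_post = beta_bernoulli_update(1.0, 1.0, [1, 1, 1, 0, 1, 1, 0, 1, 1, 0])
print(alpha_post, beta_post)                   # Beta(8.0, 4.0)
print(alpha_post / (alpha_post + beta_post))   # posterior mean of the bias ≈ 0.67
```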

Variational lower bound

Start with the KL from q to p, and rearrange to isolate the interesting quantity p(y):

D_{KL}(q(z) \Vert p(z \vert y)) = \int q(z) \log \frac{q(z)}{p(z \vert y)} dz = \int q(z) \log \frac{q(z)p(y)}{p(z, y)} dz

= \mathbb{E}_{q(z)}[\log q(z) - \log p(z, y) + \log p(y)]

But p(y) doesn’t depend on z so we can pull it out of the expectation:

D_{KL}(q(z) \Vert p(z \vert y)) = \mathbb{E}_{q(z)}[\log q(z) - \log p(z, y)] + \log p(y)

Rearranging, we get

\log p(y) = D_{KL}(q(z) \Vert p(z \vert y) ) + \mathbb{E}_{q(z)}[\log p(z, y) - \log q(z) ]

Because we saw previously that D_{KL} \geq 0, we have

\log p(y) \geq \mathbb{E}_{q(z)}[\log p(z, y) - \log q(z)] = L[q]

This quantity L[q] is the evidence lower bound (ELBO). It is a functional of the approximating distribution q(z).

This bound is valuable because it can be calculated without the unknown normalizing constant p(y). Moreover it is equal to \log p(y) at its maximum, which is attained when D_{KL}(q(z)\Vert p(z\vert y))=0, i.e. when q(z) = p(z \vert y).
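
To make the bound concrete, here is a minimal sketch (reusing the toy Gaussian model from the earlier snippet, where \log p(y) is known exactly) that evaluates a Monte Carlo estimate of L[q] for two choices of Gaussian q(z): the bound is tight when q(z) is the true posterior, and strictly looser for a worse approximation.

```python
# Check numerically that the ELBO lower-bounds log p(y), with equality when
# q(z) is the true posterior. Toy model: z ~ N(0, 1), y | z ~ N(z, 1), y = 0.5.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = 0.5
prior = stats.norm(0.0, 1.0)
log_p_y = stats.norm(0.0, np.sqrt(2.0)).logpdf(y)   # exact log evidence

def elbo(q_mean, q_std, n_samples=200_000):
    """Monte Carlo estimate of E_q[log p(y, z) - log q(z)] for a Gaussian q."""
    q = stats.norm(q_mean, q_std)
    z = q.rvs(size=n_samples, random_state=rng)
    log_joint = stats.norm(z, 1.0).logpdf(y) + prior.logpdf(z)
    return np.mean(log_joint - q.logpdf(z))

# The true posterior for this model is N(y/2, 1/2) (mean y/2, variance 1/2).
print(log_p_y)
print(elbo(y / 2, np.sqrt(0.5)))   # essentially equal to log p(y)
print(elbo(0.0, 1.0))              # a worse q gives a strictly smaller value
```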

We have transformed the problem of taking expectations into one of optimization. Now a new question arises – is this problem any easier than the integral we started with? Not necessarily! However we can now make different choices for the form of q(z), so if we can find a family of distributions that is amenable to our available optimization techniques and which also contains a good approximation to the true posterior, we will be happy with the trade-off.

For example, we can rewrite L in terms of yet another KL divergence, this time between the approximate posterior q(z) and the prior p(z) on the latent variables:

L[q] = \mathbb{E}_{q(z)}[\log p(y\vert z)p(z) - \log q(z)]

= \mathbb{E}_{q(z)}[\log p(y\vert z) - \log \frac{q(z)}{p(z)}]

= \mathbb{E}_{q(z)}[\log p(y\vert z) ] - D_{KL}(q(z) \Vert p(z))

If q(z) and p(z) have the same form – for instance we have chosen them both to be Gaussian – then the second term will have a closed form expression, so if we have a good way to evaluate the first term, as in Kingma & Welling 2013, then we’ll be able to optimize q(z) and solve the problem.
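
Here is a minimal sketch of that decomposition on the same toy Gaussian model (not the model from the paper): the Gaussian-to-Gaussian KL term is computed in closed form, the expected log-likelihood is estimated by Monte Carlo using the reparameterization z = \mu + \sigma \epsilon with \epsilon \sim N(0,1), and a generic optimizer fits the parameters of q(z). The optimum lands near the true posterior, and the optimal value of the bound approaches \log p(y).

```python
# Decomposed ELBO with Gaussian q(z) and prior p(z) = N(0, 1):
# closed-form KL term plus a reparameterized Monte Carlo reconstruction term.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
y = 0.5
eps = rng.standard_normal(100_000)   # fixed noise, reused across evaluations

def neg_elbo(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) )
    kl = 0.5 * (sigma**2 + mu**2 - 1.0) - log_sigma
    # Reparameterized Monte Carlo estimate of E_q[log p(y|z)]
    z = mu + sigma * eps
    recon = np.mean(stats.norm(z, 1.0).logpdf(y))
    return -(recon - kl)

result = optimize.minimize(neg_elbo, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_opt, sigma_opt = result.x[0], np.exp(result.x[1])
print(mu_opt, sigma_opt)   # close to the true posterior: y/2 = 0.25, sqrt(0.5) ≈ 0.71
print(-result.fun)         # close to the exact log p(y) from the previous snippet
```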

(This bound gets its name from the variational principle, applying the calculus of variations to optimize the function q(z) without assuming a particular form. However, we frequently assume a fixed-form approximation using a parametric form of this density. In this case no calculus of variations is required, and the problem reduces to an ordinary optimization.)