2611 - 2614.

See 2504 for the ELBO.

SGA = stochastic gradient ascent.

p_θ(X) is the "evidence" for θ (also called the marginal likelihood, since z has been integrated out).

We can use the evidence lower bound (ELBO).
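For reference, a sketch of the ELBO in its standard form. The notes defer the exact statement to 2504, so the notation here (a script L for the bound, a single latent z with variational distribution q_φ(z)) is an assumption rather than a copy of that page:

```latex
\log p_\theta(x)
  \;\ge\;
  \mathcal{L}(\theta, \phi)
  \;=\;
  \mathbb{E}_{q_\phi(z)}\!\left[ \log p_\theta(x, z) - \log q_\phi(z) \right]
```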

φ is the variational parameter. We keep it fixed, presuming we already have a good one. (Remember φ is the argmax of the ELBO, so the bound is as tight as possible.)

We know the ELBO is a lower bound on the log-evidence; it might also be a good approximation to it.

We drop the prior p(z_i), because it doesn't depend on θ.
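Written out, this step might look as follows. This is a sketch that assumes the model factorizes over datapoints as p_θ(x_i | z_i) p(z_i), with a prior that has no θ in it, and that q_φ factorizes the same way; the index N for the number of datapoints is also an assumption:

```latex
\mathcal{L}(\theta, \phi)
  = \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z_i)}\!\left[ \log p_\theta(x_i \mid z_i) \right]
  + \underbrace{\sum_{i=1}^{N} \mathbb{E}_{q_\phi(z_i)}\!\left[ \log p(z_i) - \log q_\phi(z_i) \right]}_{\text{constant in } \theta}

\nabla_\theta \mathcal{L}(\theta, \phi)
  = \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z_i)}\!\left[ \nabla_\theta \log p_\theta(x_i \mid z_i) \right]
```

Under the same assumptions the log q_φ(z_i) term drops for the same reason: with φ fixed, it does not depend on θ either.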

We take a minibatch of datapoints and, for each one, a Monte Carlo sample of z from q_φ.
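One common way to write such an estimator is sketched below. The symbols N (dataset size), B (a minibatch of indices drawn uniformly at random) and S (number of Monte Carlo samples per datapoint) are notation assumed here, not taken from the notes:

```latex
\hat{g}
  = \frac{N}{|B|} \sum_{i \in B} \frac{1}{S} \sum_{s=1}^{S}
    \nabla_\theta \log p_\theta\!\left( x_i \mid z_i^{(s)} \right),
  \qquad z_i^{(s)} \sim q_\phi(z_i)
```

The factor N/|B| rescales the minibatch sum so that it estimates the sum over the whole dataset.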

The estimate is unbiased (its expectation is the true gradient). The Monte Carlo sample gives an unbiased estimate of the expectation over z, and the minibatch gives an unbiased estimate of the sum over the data. The tower property (law of total expectation) shows that the combination is unbiased overall.
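A quick numerical check of the unbiasedness claim, as a minimal sketch on a toy model that is not from the notes. Assumed here: z_i ~ N(0, 1), x_i | z_i ~ N(θ z_i, 1), and a fixed Gaussian q(z_i) = N(m_i, s_i^2) whose means and standard deviations are made up for illustration. For this model the exact gradient of the θ-dependent part of the ELBO has a closed form, so the average of many minibatch/Monte Carlo estimates can be compared against it.

```python
import numpy as np


def exact_grad(theta, x, m, s):
    # Closed form of sum_i E_{q_i}[(x_i - theta * z_i) * z_i],
    # using E[z_i] = m_i and E[z_i^2] = m_i^2 + s_i^2.
    return np.sum(x * m - theta * (m**2 + s**2))


def stochastic_grad(theta, x, m, s, batch_size, rng):
    n = x.shape[0]
    # Minibatch: indices drawn uniformly without replacement.
    idx = rng.choice(n, size=batch_size, replace=False)
    # Monte Carlo: one sample z_i ~ q(z_i) per selected datapoint.
    z = m[idx] + s[idx] * rng.standard_normal(batch_size)
    # d/dtheta log N(x_i | theta * z_i, 1) = (x_i - theta * z_i) * z_i
    per_point = (x[idx] - theta * z) * z
    # Rescale the minibatch sum to estimate the full-data sum.
    return (n / batch_size) * np.sum(per_point)


rng = np.random.default_rng(0)
n = 50
x = rng.standard_normal(n)   # toy data
m = 0.5 * x                  # assumed variational means (illustrative)
s = np.full(n, 0.7)          # assumed variational std devs (illustrative)
theta = 0.3

estimates = np.array(
    [stochastic_grad(theta, x, m, s, 5, rng) for _ in range(20000)]
)
exact = exact_grad(theta, x, m, s)
mean_estimate = estimates.mean()

print("exact gradient:       ", exact)
print("mean of the estimates:", mean_estimate)
print("std of one estimate:  ", estimates.std())
```

Individual estimates are noisy, but their average should settle near the exact gradient. Leaving out the n / batch_size rescaling would bias the estimate by a constant factor.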

We saw that if the gradient estimate is unbiased, we can make SGA converge (given a suitable step-size schedule) as if we were doing ordinary, non-stochastic gradient ascent.

Sensible: we can see that the ELBO is the original log marginal likelihood minus an extra KL divergence.
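The decomposition follows from substituting log p_θ(x, z) = log p_θ(x) + log p_θ(z | x) into the ELBO. A short derivation, using the script L for the ELBO as assumed notation:

```latex
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_\phi(z)}\!\left[ \log p_\theta(x, z) - \log q_\phi(z) \right]
  = \log p_\theta(x)
    + \mathbb{E}_{q_\phi(z)}\!\left[ \log \frac{p_\theta(z \mid x)}{q_\phi(z)} \right]
  = \log p_\theta(x)
    - \mathrm{KL}\!\left( q_\phi(z) \,\|\, p_\theta(z \mid x) \right)
```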

Remember the variational approximation is fixed, so maximizing over θ pushes the model's posterior p_θ(z|x) towards the variational approximation q_φ(z). So we want both better evidence AND a posterior that matches the variational approximation we've already got. The KL is bounded below (it cannot be less than zero), so the most it can contribute is zero; there's actually a lot of pressure to maximize the log-evidence, and we will INDEED get something sensible. It's actually better to optimize this way, because the second term acts like a regularizer.