Variational Autoencoders

ilian herzi
Dec 31, 2022

A First Principles approach.

It’s useful to start from first principles and work our way up. In this article, we seek to understand where variational autoencoders (VAEs) come from and what they are trying to solve.

This article is the first in a multi-article series exploring the fundamentals of deep generative models. With advances in text-to-image generation (OpenAI’s DALL-E, Imagen, Midjourney, Stable Diffusion, HuggingFace, etc.), text-to-text (ChatGPT, Chinchilla, Flamingo), and speech-to-text (Assembly AI), it’s worth understanding where each of these pieces comes from, since many mathematical concepts inspire the design of these deep generative architectures. We’ll start with decoders and work our way up to diffusion models; then we’ll explore the fundamentals of encoders, looking at unsupervised methods like contrastive learning; then we’ll survey what’s out there and how generative models are changing the world; and finally we’ll provide code examples to help people get started, along with some of the current limitations of these models.

Specifically for decoders, we start from first principles with variational autoencoders, then move to Hierarchical Markovian Variational Autoencoders, then dive into Diffusion.

  1. Generative Models, ELBO, Variational Autoencoders (this)
  2. Hierarchical Markovian Variational Autoencoders
  3. Diffusion (TBA).

Let’s dive in.

Generative Models

We start with some dataset

$$\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}, \qquad x^{(i)} \sim p(x),$$

and we search for a model parameterized by θ that will allow us to model the true underlying distribution,

$$p_\theta(x) \approx p(x).$$
We don’t actually know the underlying structure of x: it could be a mixture of Gaussians, a Poisson random process, or even a deterministic process with added random noise. The question then becomes: what’s a good way to estimate p(x) given some number of data samples?
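
As a concrete illustration of this recipe, here is a minimal sketch (not from the article; the diagonal-Gaussian family and the synthetic data are assumptions purely for illustration) that fits a parametric model p_θ(x) by maximizing the average log-likelihood of the samples with gradient ascent:

```python
import torch

# Synthetic "unknown" data: 1000 two-dimensional samples (purely illustrative).
x = torch.randn(1000, 2) * 1.5 + 3.0

# A deliberately simple parametric family p_theta(x): a diagonal Gaussian.
mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    p_theta = torch.distributions.Normal(mu, log_sigma.exp())
    log_px = p_theta.log_prob(x).sum(dim=-1)   # log p_theta(x) for each sample
    loss = -log_px.mean()                      # maximize likelihood <=> minimize NLL
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.detach(), log_sigma.exp().detach())   # should approach mean 3.0, std 1.5
```

A VAE follows the same maximum-likelihood principle, but with a far more expressive p_θ(x) defined through a latent variable.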

Evidence Lower Bound (ELBO)

Idea: try introducing a new latent variable z that is related to x somehow. At first glance, it may not be clear why anyone would want to try something like this; however, notice that there are many ways to introduce new variables. For one, suppose that we want to encode our representation conditioned on the original input. One way to do it (depicted graphically) is as follows:

(Graphical model relating the observation x and the latent z; see Figure 1 in [1].)

So, if our encoder is parameterized by φ then we have

$$q_\phi(z \mid x),$$

an approximate posterior over the latent z given the input x.
How could we introduce this encoder into our original objective? Well, since $q_\phi(z \mid x)$ sums to 1 over z, we could try the following:

$$\log p(x) = \log p(x) \sum_z q_\phi(z \mid x) = \sum_z q_\phi(z \mid x) \log p(x). \quad (1.1)$$
Now we’re onto something, because we know how to relate two random variables using Bayes’ rule. Recall that $p(a \mid b) = \frac{p(a, b)}{p(b)}$, which implies that we can write

$$p(x) = \frac{p(x, z)}{p(z \mid x)}$$
and substitute this value into 1.1. Now 1.1 becomes

$$\log p(x) = \sum_z q_\phi(z \mid x) \log \frac{p(x, z)}{p(z \mid x)}. \quad (1.2)$$
Note: the summation can be rewritten as an expectation over our conditional encoder distribution, $\mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p(x, z)}{p(z \mid x)}\right]$.
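
Expectations like this one appear throughout the derivation, and in practice they are approximated by Monte Carlo sampling. Here is a quick hypothetical sketch (the encoder parameters and the function f are made up for illustration):

```python
import torch

# Hypothetical encoder output for a single x: a diagonal Gaussian q_phi(z | x).
mu = torch.tensor([0.5, -1.0])
sigma = torch.tensor([0.8, 0.3])
q = torch.distributions.Normal(mu, sigma)

def f(z):
    # Stand-in for any function of z, e.g. the log-ratio inside equation 1.2.
    return (z ** 2).sum(dim=-1)

z = q.sample((10_000,))     # draw K samples z_1, ..., z_K from q_phi(z | x)
estimate = f(z).mean()      # E_q[f(z)] ≈ (1/K) * sum_k f(z_k)
print(estimate)
```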

Notice that what we’re comparing in the expectation doesn’t really make sense: what we really want to do is somehow relate our model of the encoder (φ) to our model of the generative distribution (θ). Let’s try reintroducing our encoder q_φ(z | x) into 1.2, by multiplying and dividing inside the logarithm, and see what happens:

$$\log p(x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] + \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p(z \mid x)}\right] \quad (1.3)$$
Let’s reflect on what we’ve just done. We’ve just rewritten

$$\log p(x)$$

into two terms that introduce a latent variable z over an encoding distribution. To the astute reader, the second term on the RHS is just the KL divergence between the encoding distribution and the true encoding posterior,

$$D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z \mid x)\right).$$
Sadly, we don’t actually know what that posterior is beforehand (because if we did, we wouldn’t need to model an encoder). We do know, though, that the KL divergence is nonnegative, so equation 1.3 is lower bounded by:

$$\log p(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] + \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p(z)}{q_\phi(z \mid x)}\right] \quad (1.4)$$

where we used p(x, z) = p_θ(x | z) p(z).
Note that the last term on the RHS of 1.4 can be written as the negative of a KL divergence,

$$\mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p(z)}{q_\phi(z \mid x)}\right] = -\,D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right).$$
To people who’ve read about variational autoencoders before, this should look familiar. The first term on the RHS of 1.4 is the reconstruction term. The second term on the RHS biases our encoder toward some prior we have on the latent space z, such as matching our encoder to a normal distribution with mean 0 and an identity covariance matrix. What 1.4 is saying is that if we encode x using our encoder parameterized by φ, then on average the likelihood of decoding x back from the sampled values of z should be high, which makes sense! If we knew the true underlying distribution and

$$\log p(x)$$

were the same value as the bound in 1.4, then the second term on the RHS of 1.3 would be 0, and our encoder would be estimating the true posterior perfectly! What we’ve just derived is called the evidence lower bound (ELBO), where the gap is the distributional difference between our encoder and the true underlying posterior.
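
A quick way to convince yourself of the decomposition in 1.3 and the bound in 1.4 is to enumerate everything exactly on a tiny discrete model. This toy joint distribution and encoder are made up purely for illustration:

```python
import numpy as np

# A tiny joint distribution p(x, z): 2 values of x (rows), 3 values of z (columns).
p_xz = np.array([[0.10, 0.25, 0.05],
                 [0.20, 0.15, 0.25]])
x = 0
p_x = p_xz[x].sum()                        # p(x), by marginalizing out z
p_z_given_x = p_xz[x] / p_x                # true posterior p(z | x)

# An arbitrary, imperfect encoder q(z | x).
q = np.array([0.5, 0.3, 0.2])

elbo = np.sum(q * np.log(p_xz[x] / q))     # E_q[log p(x, z) / q(z | x)]
kl = np.sum(q * np.log(q / p_z_given_x))   # D_KL(q(z | x) || p(z | x))

print(np.log(p_x), elbo + kl)              # equal: log p(x) = ELBO + KL (eq. 1.3)
print(np.log(p_x) >= elbo)                 # True: the ELBO is a lower bound (eq. 1.4)
```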

VAE Algorithm

So far we haven’t discussed specific distributions at all; the above is fully general. Now suppose that we want to impose some structure onto our encoder: specifically, we want q_φ(z | x) to be a normal distribution.


The VAE algorithm assumes this structure and tries to jointly optimize the following objective over both sets of parameters,

$$\arg\max_{\phi,\,\theta}\; \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right), \quad (1.7)$$

where

$$q_\phi(z \mid x) = \mathcal{N}\!\left(z;\, \mu_\phi(x),\, \sigma_\phi^2(x)\, I\right), \qquad p(z) = \mathcal{N}(0, I),$$

and samples of z are drawn via the reparameterization $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
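
One convenient consequence of this Gaussian choice (a standard result, included here for completeness rather than taken from the article) is that the KL term in 1.7 has a closed form, so only the reconstruction term needs to be estimated by sampling:

$$D_{\mathrm{KL}}\!\left(\mathcal{N}\!\left(\mu_\phi(x),\, \sigma_\phi^2(x)\, I\right) \,\big\|\, \mathcal{N}(0, I)\right) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_{\phi,j}^2(x) + \sigma_{\phi,j}^2(x) - 1 - \log \sigma_{\phi,j}^2(x) \right).$$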
A few notes:

  1. The sampling of z is made a deterministic function of the encoder’s outputs plus external noise (the reparameterization trick) so that gradients can flow through the entire architecture.
  2. This is a variation of the EM algorithm where both the E and M steps occur concurrently. If we sampled z directly from the latent distribution, the gradient from the decoding step would never reach the encoder parameters φ.
  3. Because the loss is an expectation over a distribution, we’d actually need to do Monte Carlo sampling of one or more z_i values (a minimal training sketch follows these notes).
  4. As written, it’s not clear that we’re predicting a probability when optimizing 1.7. Most VAEs are interested in reconstructing x and skip this step altogether, but we could impose that the output is a normal distribution over x conditioned on z and produce a probability if we wanted to.
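
Here is that minimal sketch in PyTorch. The layer sizes, the single hidden layer, and the unit-variance Gaussian decoder (which turns the reconstruction term into a squared error, as in note 4) are assumptions for illustration rather than anything prescribed above:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mu_phi(x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log sigma_phi(x)^2
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # mean of p_theta(x | z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                  # note 1: z is a deterministic
        z = mu + (0.5 * logvar).exp() * eps         # function of (x, eps), so
        x_hat = self.dec(z)                         # gradients reach phi
        return x_hat, mu, logvar

def negative_elbo(x, x_hat, mu, logvar):
    # Reconstruction term: log p_theta(x | z) for a unit-variance Gaussian decoder
    # (up to an additive constant), estimated with one Monte Carlo sample (note 3).
    recon = -0.5 * ((x - x_hat) ** 2).sum(dim=-1)
    # Closed-form KL(q_phi(z | x) || N(0, I)) for a diagonal Gaussian encoder.
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=-1)
    return (kl - recon).mean()                      # minimize the negative of 1.7

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)                             # stand-in batch of data
x_hat, mu, logvar = model(x)
loss = negative_elbo(x, x_hat, mu, logvar)
opt.zero_grad()
loss.backward()
opt.step()
```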

I hope that this is helpful and I look forward to continuing our journey through the land of generative models.

References:

[1] Calvin Luo. “Understanding Diffusion Models: A Unified Perspective.” arXiv:2208.11970, 2022. https://arxiv.org/abs/2208.11970

