In this post, I’m reviewing “Auto-Encoding Variational Bayes”, better known as the Variational Autoencoder (VAE) paper. Kingma and Welling introduced a practical way to train generative models with continuous latent variables using approximate Bayesian inference.

The central challenge is training models where the true posterior over latent variables is intractable. By combining stochastic gradient methods with the reparameterization trick, VAE makes this kind of model trainable.


Key concepts

Variational Autoencoder (VAE)

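VAE is a generative model built around an encoder-decoder structure, with one important twist: latent variables are probabilistic rather than deterministic. Instead of mapping an input to a single latent point, the encoder outputs the parameters of an approximate posterior distribution over the latent variables (in the paper's main example, the mean and variance of a diagonal Gaussian), so the model represents uncertainty explicitly.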

Evidence Lower Bound (ELBO)

The ELBO is the training objective of a VAE. It is a lower bound on the log-likelihood of the data and has two terms: an expected reconstruction log-likelihood, and a KL divergence between the approximate posterior and the prior. Maximizing it rewards accurate reconstruction while pulling the approximate posterior toward the chosen prior, usually a standard Gaussian.
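For reference, here is the two-term form of the objective for a single data point x, written in the paper's notation: q_φ(z|x) is the approximate posterior (encoder), p_θ(x|z) is the likelihood (decoder), and p_θ(z) is the prior.

```latex
\[
\mathcal{L}(\theta, \phi; x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]}_{\text{reconstruction}}
  \;-\;
  \underbrace{D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p_\theta(z)\bigr)}_{\text{regularization}}
\]
```

The first term is large when decoded samples explain x well; the second is small when the approximate posterior stays close to the prior.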

Reparameterization trick

The paper introduces a “reparameterization trick,” which rewrites a sample from the approximate posterior as a deterministic, differentiable function of the distribution's parameters and an independent noise variable. Because the randomness now enters only through the noise, gradients can flow through the sampling step to the parameters, making end-to-end optimization via stochastic gradient descent possible.
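As a minimal sketch of the idea for a diagonal Gaussian, the sample can be written as z = mu + sigma * eps with eps drawn from a standard normal. The function name `sample_latent` and the choice to parameterize the variance by its logarithm (`log_var`) are my own assumptions for illustration, not something the paper prescribes.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps elementwise, with eps ~ N(0, 1).

    mu and log_var are lists of floats (hypothetical encoder outputs).
    All randomness comes from eps; mu and sigma enter deterministically.
    """
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # standard deviation from log-variance
        eps = rng.gauss(0.0, 1.0)    # noise independent of the parameters
        z.append(m + sigma * eps)
    return z

rng = random.Random(0)  # fixed seed so the example is repeatable
z = sample_latent([0.0, 1.0], [0.0, -2.0], rng)
print(z)
```

In a real model, mu and log_var would come from the encoder network and an autodiff framework would differentiate through this expression.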


What I learned

The essence of ELBO

Initially, ELBO felt abstract, and the paper didn’t clearly show how it emerged mathematically. After revisiting it and consulting a few references, I realized ELBO comes directly from Bayesian inference: replace an intractable posterior with a simpler approximate distribution and derive a lower bound on the log evidence. The gap between the bound and the true log evidence is the KL divergence between the approximate posterior and the true posterior, so the bound is tight exactly when the approximation matches the true posterior.
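The derivation I found easiest to follow is sketched here in the paper's notation (q_φ(z|x) for the approximate posterior, p_θ for the generative model). It uses only the identity p_θ(x) = p_θ(x, z) / p_θ(z|x) and the fact that a KL divergence is non-negative.

```latex
\[
\begin{aligned}
\log p_\theta(x)
  &= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{p_\theta(z \mid x)}\right] \\
  &= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]
   + \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] \\
  &= \mathcal{L}(\theta, \phi; x)
   + D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\bigr) \\
  &\ge \mathcal{L}(\theta, \phi; x).
\end{aligned}
\]
```

Writing the joint as p_θ(x, z) = p_θ(x|z) p_θ(z) inside the first expectation splits the bound into a reconstruction term and a KL term against the prior.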

KL divergence as a regularizer

One core insight I gained is the intuitive role of KL divergence. In the ELBO, the KL term measures how far the approximate posterior for each input deviates from the chosen prior distribution. Penalizing it discourages the encoder from packing arbitrary amounts of information into the latent code and keeps the latent space structured.
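When the approximate posterior is a diagonal Gaussian and the prior is a standard Gaussian, this KL term has a closed form, so it needs no sampling. A small sketch follows; the function name `kl_to_standard_normal` and the log-variance parameterization are assumptions I picked for illustration.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Uses the closed form 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1).
    """
    total = 0.0
    for m, lv in zip(mu, log_var):
        total += m * m + math.exp(lv) - lv - 1.0
    return 0.5 * total

# Zero when the posterior equals the prior, positive otherwise.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
print(kl_to_standard_normal([1.0, -2.0], [0.5, -1.0]))
```

The penalty grows as the mean moves away from zero or as the variance moves away from one in either direction, which is the sense in which it keeps the latent codes near the prior.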

Why the reparameterization trick matters

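The reparameterization trick is crucial. Sampling a latent variable directly from a distribution is not a differentiable operation with respect to that distribution's parameters, so backpropagation cannot pass through it. Alternative gradient estimators exist, but the paper notes that the naive Monte Carlo estimator has very high variance, which makes it impractical. Reparameterization separates the randomness from the deterministic parameters, enabling low-variance, end-to-end training through backpropagation.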
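To convince myself that gradients really do pass through the sampling step, a toy check helps. The sketch below assumes a one-dimensional latent z = mu + sigma * eps and a made-up objective f(z) = z**2 (not from the paper), for which E[f(z)] = mu**2 + sigma**2 and the exact gradients are 2 * mu and 2 * sigma. The chain rule is applied by hand per sample; the function name `reparam_gradients` is hypothetical.

```python
import random

def reparam_gradients(mu, sigma, n_samples, rng):
    """Monte Carlo estimate of the gradients of E[z**2] w.r.t. mu and sigma,
    using the reparameterization z = mu + sigma * eps, eps ~ N(0, 1)."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        df_dz = 2.0 * z               # derivative of the toy objective z**2
        grad_mu += df_dz * 1.0        # dz/dmu = 1
        grad_sigma += df_dz * eps     # dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_gradients(1.5, 0.5, 20000, random.Random(0))
# The estimates should land close to the exact values 2 * mu and 2 * sigma.
print(g_mu, g_sigma)
```

An autodiff framework does the same bookkeeping automatically once the sample is expressed as a function of mu, sigma, and eps.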

Intuition behind probabilistic latent spaces

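Traditional autoencoders map each input deterministically to a single point in latent space, and nothing constrains what the decoder does between those points, which limits their generative capabilities. Because a VAE encodes each input as a distribution and pulls those distributions toward a shared prior, neighboring regions of the latent space decode to similar outputs, allowing for meaningful interpolation and generation.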
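Interpolation in this setting usually means walking along a path between two latent codes and decoding each point. The sketch below covers only the latent-space half, using simple linear interpolation; the helper name `interpolate_latents` is hypothetical, and the trained decoder that would turn each point into an output is assumed rather than shown.

```python
def interpolate_latents(z_start, z_end, steps):
    """Return `steps` evenly spaced latent vectors from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # goes from 0.0 to 1.0 inclusive
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

path = interpolate_latents([0.0, 1.0], [2.0, -1.0], 5)
for z in path:
    print(z)  # each z would be passed to a trained decoder
```

Whether the decoded path looks smooth depends on the trained model; the structure imposed by the KL term is what makes that plausible.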

Complexity in mathematical derivations

The derivations reminded me how math-heavy early deep-learning work was: Bayesian inference identities, integrals, and careful rearrangements to produce a usable training objective (ELBO). That context helped the rest of the paper click.

Relation to EM and Bayesian methods

The paper also highlighted connections to the Expectation-Maximization (EM) algorithm and Bayesian methods. EM needs the exact posterior over latent variables in its E-step, which is intractable for the models considered here. A VAE replaces that exact posterior with an approximate one produced by a neural network and optimizes the encoder and decoder jointly with stochastic gradients instead of alternating iterative updates. Understanding this relation helped me place VAE more clearly in the broader history of machine learning.


Summary

“Auto-Encoding Variational Bayes” blends Bayesian inference with neural networks. Through ELBO maximization, KL divergence regularization, and the reparameterization trick, VAEs make probabilistic latent-variable models trainable with gradient descent.

Despite the mathematical complexity, the intuition is clear: make latent representations probabilistic, then train the model to reconstruct data while keeping the latent space structured.