In this post, I review “Neural Discrete Representation Learning”, commonly known as VQ-VAE. The paper introduces a generative model that combines vector quantization with Variational Autoencoders (VAEs).

Its core idea is to replace the continuous latent space typically used in VAEs with discrete embeddings. This helps address posterior collapse, where a powerful decoder learns to ignore the latent code, so the encoder stops contributing meaningfully.

The discrete latent space makes this model particularly suited to domains naturally represented by discrete data, like speech, language, and structured visual information.


Key concepts

Discrete latent variables

Instead of using continuous distributions, VQ-VAE encodes data into discrete latent variables selected from a fixed-size embedding dictionary (the codebook) that is learned jointly with the encoder and decoder. This discretization helps prevent posterior collapse by forcing the encoder to produce meaningful, constrained representations.

Vector quantization (VQ)

Vector quantization maps each encoder output to its nearest embedding (by Euclidean distance) in a learned dictionary. The nearest-neighbor lookup has no useful gradient, so the authors use a “straight-through” estimator, effectively copying gradients from decoder inputs back to encoder outputs.
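To make the lookup concrete, here is a minimal NumPy sketch of the forward pass only. The function name `quantize` and the array shapes are my own choices for illustration, not taken from the paper's code.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output to its nearest codebook vector.

    z_e:      (N, D) array of encoder outputs (hypothetical shape)
    codebook: (K, D) array of embedding vectors
    Returns (indices, z_q): integer codes of shape (N,) and the
    selected embeddings of shape (N, D).
    """
    # Squared Euclidean distance between every output and every embedding: (N, K)
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Toy example: three 2-D encoder outputs, a codebook with three entries
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z_e = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]])
indices, z_q = quantize(z_e, codebook)
print(indices)  # integer code chosen for each encoder output
```

The decoder only ever sees `z_q`, so everything it knows about the input has to pass through the integer codes.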

Commitment loss

To keep encoder outputs from drifting away from the codebook, VQ-VAE adds a commitment loss. This term, weighted by a hyperparameter β, penalizes the distance between each encoder output and its assigned embedding vector. That keeps the encoder’s outputs from growing without bound, which stabilizes training and improves the learned representations.
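For reference, the paper's training objective has three terms. Here sg[·] is the stop-gradient operator, z_e(x) is the encoder output, e is the selected codebook vector, and L_rec is my shorthand for the reconstruction loss:

```latex
\mathcal{L} = \mathcal{L}_{\text{rec}}
  + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2
  + \beta \, \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2
```

The second term pulls the codebook vector toward the encoder output. The third is the commitment loss, which pulls the encoder output toward the codebook vector. The paper uses β = 0.25.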

What I learned

Posterior collapse is a bigger deal than I thought

Initially, I underestimated posterior collapse. I thought that if the decoder reconstructed well, the encoder must be doing its job. But that is not necessarily true. If the decoder is too powerful, it can bypass the encoder entirely, undermining the point of an autoencoder. VQ-VAE addresses this directly through discretization.

Constraining the latent space can help

Imposing discrete constraints on latent representations can improve what the model learns. I expected constraints to hurt performance, but VQ-VAE shows that limiting flexibility can prevent the model from “getting lost” and improve representation quality.

The gradient copying trick

VQ-VAE’s training includes copying decoder input gradients directly to encoder outputs, which feels ad hoc. Despite my initial skepticism, this works well. Sometimes the straightforward trick is good enough.
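To see how little the trick involves, here is a hand-written backward step in NumPy. It is a sketch under my own assumptions: the function name, the argument names, and the idea of passing in the decoder's gradient as a plain array are all hypothetical. A real implementation would get the same effect from an autodiff framework's stop-gradient operation.

```python
import numpy as np

def encoder_grad_straight_through(z_e, codebook, grad_wrt_z_q, beta=0.25):
    """Gradient that reaches the encoder output z_e.

    z_e:          (N, D) encoder outputs
    codebook:     (K, D) embedding vectors
    grad_wrt_z_q: (N, D) gradient of the reconstruction loss with
                  respect to the decoder input z_q
    beta:         commitment weight (0.25 is the value used in the paper)
    """
    # Forward pass: nearest-neighbor lookup
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    z_q = codebook[dists.argmin(axis=1)]

    # Straight-through step: the argmin has no useful gradient, so the
    # decoder-input gradient is copied to the encoder output unchanged.
    grad_z_e = grad_wrt_z_q.copy()

    # The commitment term beta * ||z_e - sg[e]||^2 adds 2 * beta * (z_e - e).
    grad_z_e += 2.0 * beta * (z_e - z_q)
    return grad_z_e
```

The copy is the whole estimator. The quantization error z_q - z_e is simply ignored in the backward pass, which is why it felt ad hoc to me.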

Tokenizing latent space could lead to new applications

By tokenizing the latent space, VQ-VAE makes it possible to use transformer architectures on latent representations. Since transformers work well with discrete token sequences, VQ-VAE’s discrete embeddings suggest a way to process continuous modalities as if they were language-like sequences.
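As a rough illustration of what “tokenizing” means here, the sketch below turns a spatial grid of encoder outputs into a flat sequence of integer codes and back. The function names, the (H, W, D) layout, and the raster-scan ordering are assumptions I picked for the example.

```python
import numpy as np

def latents_to_tokens(z_e_grid, codebook):
    """Turn an (H, W, D) grid of encoder outputs into H*W integer tokens."""
    h, w, d = z_e_grid.shape
    flat = z_e_grid.reshape(-1, d)  # raster-scan order: row by row
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)     # shape (H*W,), values in [0, K)

def tokens_to_latents(tokens, codebook, h, w):
    """Look the tokens up again to rebuild the (H, W, D) decoder input."""
    return codebook[tokens].reshape(h, w, -1)

# Toy example: a 2x2 grid of 3-D latents and a 3-entry codebook
codebook = np.eye(3)
grid = np.array([[[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]],
                 [[0.1, 0.7, 0.1], [0.0, 1.0, 0.0]]])
tokens = latents_to_tokens(grid, codebook)
decoder_input = tokens_to_latents(tokens, codebook, 2, 2)
```

A sequence model trained on `tokens` would treat the K codebook indices as its vocabulary, much like word pieces in a language model.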

Discrete representations fit some data types naturally

I’ve seen a hypothesis that VQ-VAE performs particularly well on data with an underlying discrete structure, like language, or speech, which is often described as a sequence of symbols such as phonemes. This makes intuitive sense, but it is still an open question whether discrete latent spaces universally beat continuous ones across domains.


Summary

VQ-VAE introduces a straightforward method to discretize latent representations in autoencoders. By using discrete embeddings, it addresses posterior collapse and opens paths to token-based model architectures.

The bigger question is when discrete latent spaces are actually better than continuous ones. VQ-VAE makes a strong case that they are worth taking seriously.