In this post, I’ll talk about the paper “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”. The paper introduces NeRF, a method for representing a complex scene as a continuous neural field and synthesizing new views of it from a sparse set of input images with known camera poses.

NeRF uses a simple architecture: a fully connected neural network (MLP) that takes a continuous 5D input (a 3D position plus a 2D viewing direction) and outputs a color and a volume density. To synthesize a previously unseen view, NeRF queries the network at points along each camera ray and composites the results into a pixel color with differentiable volume rendering.


Key concepts

Neural Radiance Field (NeRF)

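A NeRF is essentially a neural network that maps a 5D coordinate, a 3D spatial location (x, y, z) plus a viewing direction (θ, φ), to two outputs: a volume density σ and an RGB color. The density describes how strongly a point blocks light (loosely, its opacity) and depends only on the location, while the color can also depend on the viewing direction, which lets the model reproduce view-dependent effects such as specular highlights. The whole scene is stored in the weights of one compact network rather than in a dense voxel grid or an explicit 3D model.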
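To make the mapping concrete, here is a minimal NumPy sketch of such a network with random, untrained weights. The class name `TinyRadianceField`, the helper `angles_to_unit_vector`, the layer sizes, and the angle convention are my own assumptions for illustration; the network in the paper is much larger and also applies positional encoding to its inputs. What the sketch tries to show is the structure: density comes from position alone, and color comes from position features plus viewing direction.

```python
import numpy as np

def angles_to_unit_vector(theta, phi):
    """Convert a viewing direction (theta = polar, phi = azimuth) to a 3D unit vector."""
    return np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=-1)

class TinyRadianceField:
    """Toy, untrained stand-in for the NeRF MLP (sizes are illustrative only)."""

    def __init__(self, hidden=64, seed=0):
        rng = np.random.default_rng(seed)

        def layer(n_in, n_out):
            w = rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out))
            return w, np.zeros(n_out)

        self.trunk = [layer(3, hidden), layer(hidden, hidden)]
        self.sigma_head = layer(hidden, 1)
        self.color_hidden = layer(hidden + 3, hidden // 2)
        self.rgb_head = layer(hidden // 2, 3)

    def __call__(self, xyz, view_dir):
        # Position-only trunk: the density never sees the viewing direction.
        h = xyz
        for w, b in self.trunk:
            h = np.maximum(h @ w + b, 0.0)
        w, b = self.sigma_head
        sigma = np.maximum(h @ w + b, 0.0)[..., 0]   # ReLU keeps density non-negative

        # Color branch: position features concatenated with the viewing direction.
        w, b = self.color_hidden
        hc = np.maximum(np.concatenate([h, view_dir], axis=-1) @ w + b, 0.0)
        w, b = self.rgb_head
        rgb = 1.0 / (1.0 + np.exp(-(hc @ w + b)))    # sigmoid keeps RGB in [0, 1]
        return sigma, rgb

field = TinyRadianceField()
points = np.full((4, 3), 0.3)
dirs = angles_to_unit_vector(np.full(4, 0.5), np.full(4, 1.0))
sigma, rgb = field(points, dirs)   # sigma has shape (4,), rgb has shape (4, 3)
```

Querying the same points with a different direction changes only `rgb`, never `sigma`, which is the property that keeps the geometry consistent across views.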

Volume rendering

To synthesize novel views, NeRF uses classical volume rendering techniques. A camera ray is cast through each pixel, the network is queried at sampled points along the ray, and the resulting densities and colors are composited into a single pixel color. Every step of this process is differentiable, so the network can learn directly from images without explicit 3D geometry supervision.
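The compositing step for one ray can be sketched in a few lines of NumPy. This follows the discrete quadrature form of the volume rendering integral, where each sample gets an alpha value 1 - exp(-σ·δ) and is attenuated by the transmittance of everything in front of it. The function name `composite_ray` and the very long final interval are my own choices for the sketch, not something taken from the post.

```python
import numpy as np

def composite_ray(sigma, rgb, t):
    """Composite samples along one ray into a pixel color.

    sigma: (N,) non-negative densities, rgb: (N, 3) colors,
    t: (N,) sample depths along the ray in increasing order.
    """
    # Distance between adjacent samples; the last interval is treated as
    # effectively infinite (an assumption made for this sketch).
    delta = np.diff(t, append=t[-1] + 1e10)
    # Probability that the ray is absorbed within each interval.
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance: probability that the ray reaches sample i unabsorbed.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha[:-1])])
    weights = trans * alpha
    color = (weights[:, None] * rgb).sum(axis=0)
    return color, weights

# Empty space followed by a dense red surface with blue behind it.
t = np.linspace(2.0, 6.0, 5)
sigma = np.array([0.0, 0.0, 50.0, 50.0, 50.0])
rgb = np.array([[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]], dtype=float)
color, weights = composite_ray(sigma, rgb, t)   # color is almost exactly pure red
```

Because the result is just sums, products, and exponentials of the network outputs, gradients of a pixel loss flow back to every sampled point, which is what makes training from images possible. The `weights` array is also what the hierarchical sampling step reuses.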

Positional encoding

Feeding raw spatial coordinates directly into the network produces blurry results that miss fine detail. The authors therefore apply a positional encoding, which maps each input coordinate into a higher-dimensional space using sines and cosines of increasing frequency, to help the network learn high-frequency variations in geometry and appearance.
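Here is a small NumPy sketch of that encoding. Each coordinate p is expanded into sin(2^k π p) and cos(2^k π p) for k = 0 … L-1. The paper uses L = 10 for positions and L = 4 for viewing directions; the function name `positional_encoding` and the exact ordering of the output features are my own choices.

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Encode each coordinate of p with sin/cos at num_freqs octaves.

    p: array of shape (..., D). Returns an array of shape (..., D * 2 * num_freqs).
    """
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # pi, 2*pi, 4*pi, ...
    angles = p[..., None] * freqs                         # (..., D, L)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)   # (..., D, L, 2)
    return enc.reshape(*p.shape[:-1], -1)

xyz = np.random.default_rng(0).uniform(-1.0, 1.0, size=(5, 3))
encoded = positional_encoding(xyz, num_freqs=10)   # shape (5, 60)
```

With L = 10, two points that are very close together in (x, y, z) still differ noticeably in the highest-frequency features, which gives the MLP something to work with when fitting sharp edges and textures.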

Hierarchical sampling (coarse and fine networks)

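To make rendering efficient, NeRF uses a hierarchical sampling strategy with two networks. A “coarse” network is first evaluated at a modest number of points spread along each ray. The compositing weights from that pass indicate which parts of the ray contribute to the pixel, and additional points are drawn preferentially from those parts. A second “fine” network is then evaluated on the combined set of points to produce the final color. This two-stage approach spends most of the computation on visible content instead of empty or occluded space, which improves both efficiency and quality.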
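The resampling step is inverse transform sampling from a piecewise-constant distribution. Below is a minimal NumPy sketch under my own assumptions: the function name `sample_fine_depths`, the small constant added to the weights, and the use of bin edges as inputs are illustrative choices, and real implementations differ in details.

```python
import numpy as np

def sample_fine_depths(bin_edges, weights, n_fine, rng):
    """Draw n_fine depths along a ray, favoring bins with large coarse weights.

    bin_edges: (N + 1,) increasing depths, weights: (N,) non-negative
    compositing weights from the coarse pass, one per bin.
    """
    w = weights + 1e-5                      # avoid an all-zero distribution
    pdf = w / w.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])           # (N + 1,)
    u = rng.uniform(size=n_fine)
    # Find the bin whose CDF interval contains each u.
    idx = np.searchsorted(cdf, u, side="right") - 1
    idx = np.clip(idx, 0, len(weights) - 1)
    # Place the sample inside the bin in proportion to where u falls.
    frac = np.clip((u - cdf[idx]) / pdf[idx], 0.0, 1.0)
    depths = bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])
    return np.sort(depths)

rng = np.random.default_rng(0)
edges = np.linspace(2.0, 6.0, 9)            # 8 coarse bins between depth 2 and 6
coarse_weights = np.zeros(8)
coarse_weights[3] = 1.0                     # the coarse pass found a surface in bin 3
fine_depths = sample_fine_depths(edges, coarse_weights, 128, rng)
```

In this example nearly all of the 128 new depths land between 3.5 and 4.0, the bin where the coarse pass saw the surface. The paper uses 64 coarse and 128 fine samples per ray and evaluates the fine network on all 192 points.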


What I learned

The surprising power of simple MLPs

Initially, I assumed NeRF would require a complex architecture. Surprisingly, a plain fully connected network (MLP) was enough. This challenged my assumption that complex scenes need complicated models. NeRF gets its expressive power from the input encoding and the sampling strategy rather than from an elaborate architecture.

Positional encoding makes a huge difference

At first glance, positional encoding seemed like a minor tweak. But without it, the model struggles to capture high-frequency details like textures and sharp edges. This was counterintuitive, since I thought neural networks naturally handled continuous inputs well. In practice, networks tend to learn smooth, low-frequency functions first, and positional encoding acts like a cheat sheet that hands the network the higher frequencies it would otherwise struggle to find.

Hierarchical sampling: efficiency through bias

Hierarchical sampling was something I wouldn’t have thought of myself. Instead of sampling every ray uniformly, NeRF first samples broadly to find out where the ray actually hits something. It then places more samples in the regions that the coarse pass suggests contribute most to the final pixel color. Biasing the samples toward the regions that matter avoids wasting network evaluations on empty space.

Overfitting as a feature, not a bug

One realization was that NeRF intentionally overfits to a specific scene. Unlike models aimed at generalization, NeRF trains a separate, specialized network for each scene. That felt unconventional, but it makes sense because NeRF’s goal is not generalization across scenes. It is photorealistic rendering of one specific scene.

Why this matters for interpolation and rendering quality

NeRF interpolates well between views, producing realistic novel perspectives even when the input viewpoints are limited. Because it learns a continuous function of position and direction instead of storing discrete samples, moving the camera a little changes the rendered image smoothly.

Connections with VAEs and GANs

Thinking about VAEs and GANs, I realized NeRF shares a broader idea with them: learning a continuous representation that can be interpolated. VAEs explicitly enforce a structured latent distribution, and GANs rely on adversarial learning. NeRF takes a more direct route by feeding spatial coordinates straight into the network, so the continuity lives in scene space rather than in a learned latent space, but it still captures continuous structure that allows interpolation.


Summary

NeRF shows how a neural network can represent an entire scene, learned from images alone, with the help of positional encoding and differentiable volume rendering. By intentionally overfitting to one scene and using hierarchical sampling, it achieves photorealistic results without an explicit geometric model.

The big lesson for me is that simple design choices can completely change what neural models are useful for. NeRF makes 3D view synthesis feel surprisingly natural for neural networks.