NeRF - Review
In this post I’ll talk about the paper “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”. This paper introduces NeRF, a method to represent complex scenes as continuous neural fields, enabling high-quality view synthesis from sparse images.
NeRF uses a simple yet effective architecture: a fully-connected neural network (MLP) that takes continuous 5D inputs (3D coordinates plus viewing angles) and outputs both color and volume density. This setup allows synthesizing previously unseen views simply by querying the network and performing differentiable volume rendering along camera rays.
Key Concepts
Neural Radiance Field (NeRF)
A NeRF is essentially a neural network that maps a 5D coordinate (3D spatial location (x, y, z)
plus viewing direction (θ, φ)
) to two outputs: volume density (opacity) and RGB color. The beauty lies in representing an entire scene within a compact neural network rather than explicitly storing a dense voxel grid or using complex 3D models.
Volume Rendering
To synthesize novel views, NeRF employs classical volume rendering techniques. Camera rays traverse the scene, accumulate opacity and color values from sampled points, and project these values into an image. Crucially, this process is differentiable, allowing the network to learn directly from input images without any explicit 3D geometry supervision.
Positional Encoding
Interestingly, directly feeding spatial coordinates into a neural network doesn’t work well for capturing fine details. The authors cleverly use positional encoding, mapping coordinates into higher-dimensional spaces using sinusoidal functions, helping the network learn high-frequency variations in geometry and appearance.
Hierarchical Sampling (Coarse & Fine Networks)
To make the rendering efficient, NeRF uses a hierarchical sampling strategy. Initially, it samples points coarsely to estimate areas of importance, then densely samples those areas with a second “fine” network. This two-stage approach ensures computational efficiency and improved quality simultaneously.
Key Takeaways (What I Learned)
The Surprising Power of Simple MLPs
Initially, I assumed NeRF would require complex architectures. Surprisingly, a plain fully-connected network (MLP) was sufficient. This challenged my assumption that complex scenes need complicated models; NeRF elegantly achieves complexity through smart input encoding and sampling strategies rather than architectural complexity.
Positional Encoding Makes a Huge Difference
At first glance, positional encoding seemed like just a minor tweak. But it’s crucial. Without positional encoding, the model struggles to capture high-frequency details like textures and sharp edges. This was counterintuitive, as I originally thought that neural networks naturally handle continuous inputs well. Positional encoding acts like giving the network a “cheat sheet” for frequencies it should pay attention to.
Hierarchical Sampling – Efficiency through Bias
The hierarchical sampling was something I wouldn’t have thought of myself. Instead of uniformly sampling every point, NeRF first broadly samples (“coarse”) to identify important regions. Then, based on these initial guesses, it strategically places more samples in regions with higher density (or significance). It’s a smart trick: effectively biasing sampling toward regions that matter, vastly improving computational efficiency.
Overfitting as a Feature, Not a Bug
One of the most interesting realizations was that NeRF intentionally overfits to a specific scene. Unlike traditional deep learning models aiming for generalization, NeRF constructs a specialized network for each scene, optimizing solely for accuracy within that context. It felt unconventional but makes sense because NeRF’s goal isn’t generalization—it’s to achieve photorealistic rendering for a given scene.
Why This Matters for Interpolation and Rendering Quality
NeRF’s method naturally leads to excellent interpolation between views, providing realistic novel perspectives even from limited viewpoints. Because it learns a continuous representation instead of discrete samples, it effortlessly generates smooth transitions between views. It’s elegant, simple, yet incredibly powerful.
Comparisons and Connections with VAEs and GANs
Reflecting on my earlier reviews (VAE and GAN), I realized NeRF shares a fundamental idea: continuous latent representations. While VAEs explicitly enforce structured latent distributions and GANs rely on adversarial learning, NeRF takes a more straightforward approach by embedding spatial coordinates directly into neural networks. However, it still captures continuous structures that allow interpolation, something VAE explicitly constructs and GAN achieves implicitly.
Summary & Final Thoughts
NeRF brilliantly demonstrates how neural networks can represent entire scenes using only a few images and positional encodings. By intentionally overfitting and employing hierarchical sampling, it achieves stunning photorealism without explicit geometric models.
It’s intriguing how simple design choices can dramatically shift the approach—and effectiveness—of neural models in tasks like 3D view synthesis. This simplicity and elegance are precisely why NeRF has quickly become foundational in neural rendering research.