Evaluating Rendering Techniques: A Comparative Study of Instant-NGP, LERF, Nerfacto, and TensoRF
EN.520.665.01.FA23 Course Project

  • Juyang Bai
  • Yongjae Lee
  • Zhaoliang Zhang

  • Johns Hopkins University

Introduction


Neural Radiance Fields (NeRF) mark a significant milestone in 3D visualization and rendering. Their primary advantage lies in the ability to capture intricate details and complex light interactions, such as reflections and shadows, which are often challenging for traditional methods. This capability opens new doors for creating highly realistic virtual environments that closely approximate real-world scenes. Moreover, recent NeRF variants have greatly improved computational efficiency and scalability, making the approach a practical tool for 3D content creation. NeRF's applications extend across various industries, including virtual and augmented reality, architectural visualization, autonomous vehicle navigation, and digital preservation of historical sites. In the entertainment industry, for instance, NeRF can change the way visual effects are created, offering filmmakers tools to craft more immersive and detailed virtual worlds. In cultural heritage, NeRF offers the potential to digitally reconstruct and preserve historical landmarks with a level of detail previously unattainable.


Methods


Instant-NGP

Müller et al. [3] introduce the idea of combining a multiresolution hash encoding scheme with neural graphics primitives. This combination addresses key challenges in the field, such as the need for speed, efficiency, and high fidelity when rendering complex scenes. The multiresolution aspect of the encoding enables a fine-grained and adaptive representation of graphical elements, facilitating near-instant access to and manipulation of graphics data. The work makes several significant contributions. First, it offers a comprehensive description of the multiresolution hash encoding technique, shown in Figure 1, illustrating its efficiency in compressing and retrieving graphical information. Second, it demonstrates the seamless integration of this encoding with neural graphics primitives, highlighting its effectiveness across various graphics tasks. Third, it provides empirical evidence of the approach's superior rendering speed, quality, and resource efficiency through extensive experiments and comparisons. Moreover, by significantly reducing computational load while maintaining high-quality outputs, this work opens new avenues in real-time graphics applications, including gaming, virtual reality, and interactive design. This advancement is not only a technical achievement; it also has broader implications for how we interact with and perceive digital environments.

Figure 1: Illustration of the multiresolution hash encoding in 2D.
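To make the lookup concrete, the sketch below gives a minimal PyTorch version of a 2D multiresolution hash-grid encoding in the spirit of Figure 1. It is illustrative rather than the paper's optimized CUDA implementation: the class name, the default level and feature counts, and the use of nn.Embedding as the hash table are our own simplifications, while the prime-multiplied XOR spatial hash follows the paper.

    import torch
    import torch.nn as nn

    class HashGridEncoding2D(nn.Module):
        """Minimal 2D multiresolution hash encoding (illustrative simplification)."""

        PRIMES = torch.tensor([1, 2654435761])  # spatial-hash primes, one per dimension

        def __init__(self, num_levels=16, features_per_level=2,
                     log2_table_size=19, base_res=16, finest_res=2048):
            super().__init__()
            self.table_size = 2 ** log2_table_size
            # Per-level growth factor so resolutions span [base_res, finest_res].
            growth = (finest_res / base_res) ** (1.0 / max(num_levels - 1, 1))
            self.resolutions = [int(base_res * growth ** l) for l in range(num_levels)]
            # One learnable hash table per resolution level.
            self.tables = nn.ModuleList(
                nn.Embedding(self.table_size, features_per_level) for _ in range(num_levels))
            for t in self.tables:
                nn.init.uniform_(t.weight, -1e-4, 1e-4)

        def _hash(self, coords):
            # XOR the integer grid coordinates multiplied by large primes, then mod the table size.
            h = (coords * self.PRIMES.to(coords.device)).unbind(-1)
            return (h[0] ^ h[1]) % self.table_size

        def forward(self, x):
            # x: (N, 2) points in [0, 1]^2
            feats = []
            for level, res in enumerate(self.resolutions):
                pos = x * res
                lo = pos.floor().long()
                frac = pos - lo.float()
                # Gather features at the 4 surrounding corners and interpolate bilinearly.
                corners = [lo + lo.new_tensor(o) for o in ((0, 0), (1, 0), (0, 1), (1, 1))]
                w = [(1 - frac[:, 0]) * (1 - frac[:, 1]), frac[:, 0] * (1 - frac[:, 1]),
                     (1 - frac[:, 0]) * frac[:, 1], frac[:, 0] * frac[:, 1]]
                f = sum(wi.unsqueeze(-1) * self.tables[level](self._hash(c))
                        for wi, c in zip(w, corners))
                feats.append(f)
            # Concatenated multi-level features are then fed to a small MLP (not shown).
            return torch.cat(feats, dim=-1)

    enc = HashGridEncoding2D()
    print(enc(torch.rand(4, 2)).shape)  # torch.Size([4, 32])

In the actual method, the concatenated features are consumed by a very small fused MLP, which is what keeps both training and rendering fast.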

LERF

The advancement of Neural Radiance Fields (NeRFs) has significantly enhanced the photorealistic digital representation of intricate 3D scenes. However, interpreting and interacting with these 3D scenes remains challenging because the representations are devoid of semantic context or meaning. Kerr et al. [2] introduce Language Embedded Radiance Fields (LERF), a novel method that integrates natural language understanding into NeRFs. LERF grounds language embeddings from models such as CLIP into the NeRF volume, enabling open-ended language queries in 3D. This approach not only facilitates intuitive interaction with 3D scenes using natural language but also enhances the semantic understanding of these scenes at multiple scales. LERF achieves this by rendering CLIP embeddings along training rays and enforcing multi-view consistency. The paper demonstrates how LERF can generate 3D relevancy maps for a wide array of language prompts in real time, opening new possibilities in fields such as robotics, vision-language model understanding, and 3D scene interaction. This integration of language and 3D scene understanding represents a significant step toward making digital environments more accessible and interactive through natural language.

Figure 2: LERF optimization. Left: LERF represents a field of 3D volumes, parameterized by position (x, y, z) and scale s (orange cube). To render a CLIP embedding along a ray, the field is sampled and averaged according to NeRF's volume rendering weights. The physical scale corresponds to an image scale s_img via projective geometry. Right: A multi-scale feature pyramid of CLIP embeddings is pre-computed over the training views; during training, this pyramid is interpolated with s_img and the ray's pixel location to obtain CLIP supervision. The CLIP loss maximizes cosine similarity, and the other outputs are supervised with mean squared error using standard per-pixel rendering.
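As a rough illustration of the optimization in Figure 2, the sketch below volume-renders per-sample language embeddings along a ray with standard NeRF weights and scores them against pre-computed CLIP supervision with a cosine-similarity loss. The function names and tensor shapes are hypothetical; LERF's full pipeline additionally conditions the language field on scale.

    import torch
    import torch.nn.functional as F

    def render_clip_embedding(clip_feats, densities, deltas):
        """Volume-render per-sample CLIP embeddings along one ray (illustrative only).

        clip_feats: (S, D) language embeddings predicted at S ray samples
        densities:  (S,)   volume densities at those samples
        deltas:     (S,)   distances between consecutive samples
        """
        alpha = 1.0 - torch.exp(-densities * deltas)            # per-sample opacity
        trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha[:-1]]), dim=0)
        weights = alpha * trans                                  # standard NeRF rendering weights
        emb = (weights.unsqueeze(-1) * clip_feats).sum(dim=0)
        return F.normalize(emb, dim=-1)                          # rendered embedding on the unit sphere

    def clip_loss(rendered_emb, target_emb):
        # Maximize cosine similarity between the rendered embedding and the CLIP supervision.
        return -F.cosine_similarity(rendered_emb, target_emb, dim=-1).mean()

    # Toy usage with random values standing in for network predictions and CLIP targets.
    S, D = 64, 512
    emb = render_clip_embedding(torch.randn(S, D), torch.rand(S), torch.full((S,), 0.01))
    target = F.normalize(torch.randn(D), dim=-1)
    print(clip_loss(emb, target).item())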

Nerfacto

Nerfacto [4] is a method designed for real data captures of static scenes in Neural Radiance Fields (NeRFs). It combines several techniques, as Figure 3 shows, including camera pose refinement, per-image appearance conditioning, proposal sampling, scene contraction, and hash encoding. This method focuses on improving the reconstruction quality of 3D scenes, especially when using data from less accurate sources like smartphone cameras. It utilizes a piecewise sampler for initial scene samples and a proposal sampler to focus on significant scene regions. The density field in Nerfacto is designed for efficient scene querying using a small fused MLP combined with hash encoding.

Figure 3: Overview of the Nerfacto pipeline.
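One ingredient worth illustrating is scene contraction, which lets the bounded hash grid cover an unbounded real capture. The sketch below is a minimal version of the mip-NeRF 360-style contraction that Nerfstudio applies to sample positions; the infinity-norm default and the helper name are our assumptions.

    import torch

    def scene_contraction(x, ord=float("inf")):
        """Map unbounded 3D points into a bounded region (mip-NeRF 360-style contraction).

        Points with norm <= 1 are left unchanged; distant points are squashed so the
        whole scene fits inside a radius-2 region that the hash grid can cover.
        """
        norm = torch.linalg.norm(x, ord=ord, dim=-1, keepdim=True).clamp_min(1e-8)
        contracted = (2.0 - 1.0 / norm) * (x / norm)
        return torch.where(norm <= 1.0, x, contracted)

    pts = torch.tensor([[0.5, 0.0, 0.0], [10.0, 0.0, 0.0], [100.0, 100.0, 0.0]])
    print(scene_contraction(pts))  # far-away points land just inside the contracted boundary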

TensoRF

As Figure 4 shows, TensoRF [1] represents the radiance field as a tensor factorized into compact vector (v) and matrix (M) factors aligned with the XYZ axes of the scene. These factors are used to compute the volume density σ and the view-dependent color c via differentiable ray marching. At each shading point x = (x, y, z), the trilinearly interpolated component values A_σ(x) and A_c(x) are efficiently estimated by linearly/bilinearly sampling the vector and matrix factors. Summing the density component values A_σ(x) yields the volume density σ. For appearance, the per-component values are concatenated into a vector ⊕_m [A_c^m(x)], multiplied by an appearance matrix B, and then decoded by a function S to regress the RGB color c.

Figure 4: Illustration of TensoRF reconstruction and rendering.
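To ground the notation, the sketch below evaluates the density part of a VM-decomposed grid at a batch of points: each of the three rank-R vector factors is paired with its complementary plane factor, the products are summed over components, and a rectifier keeps the density non-negative. Nearest-neighbour sampling is used here for brevity, whereas TensoRF linearly/bilinearly interpolates the factors; the appearance branch (matrix B and decoder S) is omitted.

    import torch

    def tensorf_density(x, vecs, mats, resolution):
        """Density from a VM-decomposed grid at normalized points x in [0, 1]^3 (sketch).

        vecs: (3, R, res)       per-axis vector factors v^X, v^Y, v^Z
        mats: (3, R, res, res)  complementary plane factors M^YZ, M^XZ, M^XY
        """
        idx = (x * (resolution - 1)).round().long().clamp(0, resolution - 1)  # (N, 3)
        axes = [(0, 1, 2), (1, 0, 2), (2, 0, 1)]  # (vector axis, plane axes) per term
        sigma = 0.0
        for k, (a, b, c) in enumerate(axes):
            v = vecs[k][:, idx[:, a]]             # (R, N) sampled vector-factor values
            m = mats[k][:, idx[:, b], idx[:, c]]  # (R, N) sampled plane-factor values
            sigma = sigma + (v * m).sum(dim=0)    # sum density components over rank and axes
        return torch.relu(sigma)                  # non-negative volume density

    R, res = 16, 128
    vecs = 0.1 * torch.randn(3, R, res)
    mats = 0.1 * torch.randn(3, R, res, res)
    print(tensorf_density(torch.rand(5, 3), vecs, mats, res).shape)  # torch.Size([5])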


Environments


Dataset

In this project, we collected our own dataset by recording a video along the bench next to Hackerman Hall; a representative frame is shown in the accompanying figure. Training a NeRF model requires images with corresponding camera poses, so we need a registration method to estimate those poses. COLMAP [5, 6] is a powerful software toolkit for Structure-from-Motion (SfM) and Multi-View Stereo (MVS). We use COLMAP to reconstruct the scene from the bench images and obtain the camera poses. Conveniently, Nerfstudio integrates COLMAP internally, so installing COLMAP in the conda environment is sufficient. The Nerfstudio command line can take a video as input; we processed the raw footage with it and generated 349 images with corresponding camera poses.
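For reference, the snippet below shows one way to inspect the poses produced by this processing step, assuming the output follows the instant-ngp-style transforms.json layout that Nerfstudio writes; the dataset path is hypothetical.

    import json
    import numpy as np

    # Hypothetical location of the processed bench capture.
    with open("data/bench/transforms.json") as f:
        meta = json.load(f)

    poses = []
    for frame in meta["frames"]:
        # 4x4 camera-to-world matrix estimated by COLMAP for each extracted image.
        poses.append(np.array(frame["transform_matrix"], dtype=np.float32))
    poses = np.stack(poses)

    print(len(poses), "registered frames")  # 349 in our capture
    print(poses[0][:3, 3])                  # camera center of the first frame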

Settings

We conduct all experiments on our own machine; each experiment runs on a single NVIDIA RTX A6000 GPU. The software environment is kept consistent: PyTorch v2.0.1 with Python v3.8.18. We train and test the four methods (Instant-NGP, LERF, Nerfacto, and TensoRF) discussed in the previous section, launching each one through the corresponding Nerfstudio command-line entry point. By default, Nerfstudio uses the Adam optimizer with a learning rate of 0.01 and an epsilon of 1.0e-15, and decays the learning rate toward 0.0001 with a cosine schedule. One batch consists of 4096 rays (pixels) randomly sampled from all images, and the maximum number of sampled points per ray is 1024. Moreover, where available, an occupancy grid is used to skip empty space and speed up training. The total number of training iterations is 30k.

For method-specific settings, Instant-NGP utilizes 4-level multi-resolution feature grids mapped to hash tables with a maximum size of 2^19; the coarsest resolution is 16, the finest resolution is 2048, and the number of features per hash entry is 8. LERF employs an appearance embedding size of 32, a feature size of 2, and two MLPs with 12 layers each. Its grid resolutions range from 16 to 128 and from 128 to 512, respectively, and the maximum size of its hash table is 2^19. Its sample-point proposal network is configured with a 16-channel MLP and 5-level hash tables with maximum resolutions of 128 and 256, respectively, and a maximum size of 2^17. LERF uses a pre-trained CLIP network, ViT-B-16, with a CLIP dimension of 512. Nerfacto uses an appearance embedding size of 32, a feature size of 2, and an MLP with 3 layers of 64 channels each; its grid resolutions range from 16 to 2048, the maximum size of its hash table is 2^19, and the configuration of its sample-point proposal network is identical to that of LERF. TensoRF employs dense feature grids whose resolution grows dynamically from 128 to 300, with growth stages at 2k, 3k, 4k, 5.5k, and 7k steps; it uses an appearance embedding size of 27, and the decomposed tensor sizes for color and density are 48 and 16, respectively.
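As a rough sketch of the default optimization settings described above (Adam with learning rate 0.01 and epsilon 1e-15, cosine decay toward 1e-4 over 30k steps), the snippet below reproduces the schedule in plain PyTorch; the stand-in model and loss are placeholders, not the Nerfstudio configuration API.

    import math
    import torch

    max_steps, lr_init, lr_final = 30_000, 1e-2, 1e-4
    model = torch.nn.Linear(32, 4)  # stand-in for a radiance-field network

    optimizer = torch.optim.Adam(model.parameters(), lr=lr_init, eps=1e-15)

    def cosine_factor(step):
        # Multiplicative LR factor: cosine interpolation from lr_init down to lr_final.
        t = min(step / max_steps, 1.0)
        lr = lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * t))
        return lr / lr_init

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=cosine_factor)

    for step in range(100):  # training-loop skeleton; a real batch is 4096 sampled rays
        optimizer.zero_grad()
        loss = model(torch.randn(4096, 32)).pow(2).mean()
        loss.backward()
        optimizer.step()
        scheduler.step()

    print(optimizer.param_groups[0]["lr"])  # learning rate after 100 of 30k steps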


Results


We compare the performance of four models, Instant-NGP, LERF, Nerfacto, and TensoRF, under different downscale factors and dataset sizes. The experimental results are reported in Tables 1 and 2. We employ seven metrics to evaluate each model's performance. 'Train Duration' refers to the time required to train the NeRF model; shorter durations are generally preferred for efficient learning of the scene representation. 'Model Size' denotes the disk space occupied by the trained model, measured in megabytes (MB), and is indicative of the model's complexity and storage requirements; smaller models are desirable, particularly for deployment in storage-limited environments. 'Memory Usage' measures the RAM utilized during training or inference, in gigabytes (GB), and is crucial for understanding computational resource needs; lower memory usage is advantageous, especially on less powerful hardware or when scaling to larger problems. 'Peak Signal-to-Noise Ratio (PSNR)' assesses the reconstruction quality of rendered images against the ground truth, with higher values signifying lower error and better image quality. 'Structural Similarity Index Measure (SSIM)' is a perceptual metric that evaluates image similarity in terms of structural information, brightness, and contrast, ranging from -1 to 1, where 1 denotes perfect similarity. 'Learned Perceptual Image Patch Similarity (LPIPS)' is an advanced metric that, unlike PSNR and SSIM, uses a neural network trained to mimic human visual perception, with lower scores indicating higher perceptual similarity to the ground truth. 'Frames Per Second (FPS)' measures the rendering speed of the models.

Under the same downscale factor and dataset size, the four models can be compared fairly. Instant-NGP excels in PSNR, SSIM, and LPIPS but requires more memory than the other models. LERF shows the poorest performance across all metrics. Nerfacto has the shortest training duration and the lowest memory usage, resulting in the highest FPS among the models, and its rendering results maintain reasonable quality. TensoRF strikes a balance, using acceptable training time and memory to achieve moderate performance across all metrics. The downscale factor influences training time and memory usage: higher downscale factors reduce both, leading to higher FPS and potentially better rendering scores in terms of PSNR, SSIM, and LPIPS. Dataset size affects training duration and memory usage as well: larger datasets require more time and memory to train, while potentially improving PSNR and SSIM.
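For completeness, PSNR is simple enough to compute directly; the helper below assumes images scaled to [0, 1], while SSIM and LPIPS are typically computed with off-the-shelf library implementations.

    import torch

    def psnr(pred, target, max_val=1.0):
        """Peak Signal-to-Noise Ratio in dB for images in [0, max_val]; higher is better."""
        mse = torch.mean((pred - target) ** 2)
        return 10.0 * torch.log10(max_val ** 2 / mse)

    # Toy check: 1% Gaussian noise on a ground-truth image gives roughly 40 dB.
    gt = torch.rand(3, 256, 256)
    noisy = (gt + 0.01 * torch.randn_like(gt)).clamp(0.0, 1.0)
    print(f"PSNR: {psnr(noisy, gt).item():.2f} dB")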

Figure 5: Comparison of Rendering Results for Four Models: Instant-NGP, LERF, Nerfacto, and TensoRF. For this comparison, we set both the downscale factor and the split factor to 1.
Figure 6: Comparison of Instant-NGP's Rendering Results Under Different Training Dataset Sizes.
Figure 7: Comparison of Rendering Results for Instant-NGP Under Various Downscale Factors.

Conclusion


We conducted a comprehensive set of experiments to compare four widely used models under various downscale factors and dataset sizes, evaluating them across seven metrics. The Instant-NGP model delivers the best rendering results but requires more computing resources. In contrast, TensoRF demands fewer resources but lags behind in rendering quality compared to the other models. Our results suggest that Nerfacto is an optimal choice for compute-limited platforms, as it strikes a balance between the performance of Instant-NGP and TensoRF, achieving satisfactory rendering results with fewer resources. In practice, the selection of a model for radiance field reconstruction should be guided by the specific requirements of the problem at hand. We hope that our comparative analysis will assist others in choosing the most suitable model for their specific needs.


References


[1] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision, pages 333–350. Springer, 2022.
[2] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19729–19739, October 2023.
[3] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
[4] Nerfstudio Team. Nerfacto, 2022.
[5] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[6] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.