LERF Replication: Bridging Language and 3D Environments Personal Project
- Juyang Bai
Introduction
Neural Radiance Fields (NeRFs) have revolutionized 3D scene reconstruction by enabling high-fidelity rendering of complex scenes from a set of 2D images. However, traditional NeRFs focus solely on modeling the geometry and appearance, lacking semantic understanding of the scene elements. LERF addresses this limitation by incorporating language embeddings into the radiance field, allowing for a unified representation that captures both visual and semantic information.
Approach
LERF (Language Embedded Radiance Fields) [1] presents a novel approach to grounding language within Neural Radiance Fields (NeRFs). The method optimizes a language field jointly with NeRF; the field takes both position and physical scale as input and outputs a CLIP vector. This addresses the challenge of mapping CLIP's global image embeddings to specific 3D points in a scene.
LERF learns a field of language embeddings over volumes centered at sample points, rather than individual points. The field is supervised using a multi-scale feature pyramid containing CLIP embeddings generated from image crops of training views. This allows the CLIP encoder to capture different scales of image context, associating the same 3D location with distinct language embeddings at different scales.
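The multi-scale supervision above can be sketched as follows. This is a minimal illustration, not the LERF implementation: `encode_crop` is a stand-in for a real CLIP image encoder (e.g. OpenCLIP's `encode_image`), and the crop geometry is simplified to square pixel windows around one point.

```python
import numpy as np

def encode_crop(crop):
    """Stand-in for a CLIP image encoder: returns a unit-norm embedding.
    A real pipeline would run the crop through OpenCLIP instead."""
    v = crop.mean(axis=(0, 1))                 # collapse HxW, keep channels
    v = np.concatenate([v, [crop.shape[0]]])   # crude stand-in scale signal
    return v / np.linalg.norm(v)

def multiscale_pyramid(image, center, scales):
    """Embed square crops of increasing size around one pixel, mimicking
    LERF's multi-scale feature pyramid of CLIP embeddings: the same point
    gets a distinct embedding at each scale."""
    h, w, _ = image.shape
    cy, cx = center
    embeddings = {}
    for s in scales:
        half = s // 2
        y0, y1 = max(0, cy - half), min(h, cy + half)
        x0, x1 = max(0, cx - half), min(w, cx + half)
        embeddings[s] = encode_crop(image[y0:y1, x0:x1])
    return embeddings

image = np.random.rand(128, 128, 3)
pyr = multiscale_pyramid(image, center=(64, 64), scales=[16, 32, 64])
```

Each training view contributes one such pyramid per sampled point, and the language field is supervised to match the embedding at the corresponding scale.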
To regularize the optimized language field, self-supervised DINO features are incorporated through a shared bottleneck. The architecture consists of two separate networks: one for feature vectors (DINO, CLIP) and another for standard NeRF outputs (color, density), both represented using a multi-resolution hashgrid.
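The two-branch layout can be illustrated with a toy forward pass. This is a schematic sketch only: the shared `hashgrid_features` function is a stand-in for a multi-resolution hashgrid (each branch has its own grid in LERF), the MLPs use random weights, and 512 is the ViT-B/16 CLIP dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights):
    """Tiny MLP: linear layers with ReLU between them."""
    for i, W in enumerate(weights):
        x = x @ W
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def hashgrid_features(xyz, dim=32):
    """Stand-in for multi-resolution hashgrid features at a 3D point."""
    return np.tanh(np.repeat(xyz, dim // 3 + 1)[:dim])

# Two separate heads, as in LERF: one for standard NeRF outputs
# (RGB + density) and one for feature vectors (CLIP, DINO).
nerf_head = [rng.normal(size=(32, 64)), rng.normal(size=(64, 4))]
lang_head = [rng.normal(size=(33, 64)), rng.normal(size=(64, 512))]

feat = hashgrid_features(np.array([0.1, 0.2, 0.3]))
rgb_sigma = mlp(feat, nerf_head)                          # (4,) color + density
clip_vec = mlp(np.concatenate([feat, [0.5]]), lang_head)  # scale appended as input
clip_vec /= np.linalg.norm(clip_vec)                      # CLIP vectors are unit-norm
```

The key design point is the separation: gradients from the language loss regularize the shared representation (via DINO through the bottleneck) without directly distorting the color/density outputs.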
During querying, LERF computes relevancy scores between rendered embeddings and query embeddings using cosine similarity and a pairwise softmax, automatically selecting an appropriate scale for each query. The relevancy score is \[ \min_i \frac{\exp(\phi_{\text{lang}} \cdot \phi_{\text{quer}})}{\exp(\phi_{\text{lang}} \cdot \phi^i_{\text{canon}}) + \exp(\phi_{\text{lang}} \cdot \phi_{\text{quer}})} \] where \(\phi_{\text{lang}}\) is the rendered language embedding, \(\phi_{\text{quer}}\) is the query embedding, and \(\phi^i_{\text{canon}}\) are the embeddings of canonical phrases. This score measures how much closer the rendered embedding is to the query embedding than to the canonical embeddings. For each query, the method also computes a scale \(s\) at which to evaluate the language field outputs \(F_{\text{lang}}\): in practice, it renders relevancy maps across 30 scales spanning 0 to 2 meters and selects the scale that yields the highest relevancy score. Querying also includes a visibility filtering step that discards samples observed by too few training views.
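The relevancy score and the per-query scale search can be written out directly. The sketch below assumes toy unit-norm embeddings and a synthetic `lang_field` that happens to align with the query near scale 1.0; in LERF the canonical phrases are fixed text prompts such as "object", "things", "stuff", and "texture".

```python
import numpy as np

def relevancy(phi_lang, phi_query, phi_canon):
    """Min over canonical embeddings of the pairwise softmax between
    query and canonical similarities, as in the formula above."""
    scores = []
    for canon in phi_canon:
        q = np.exp(phi_lang @ phi_query)
        c = np.exp(phi_lang @ canon)
        scores.append(q / (q + c))
    return min(scores)

def best_scale(lang_field, phi_query, phi_canon, n_scales=30, max_scale=2.0):
    """Evaluate the language field at scales in [0, max_scale] and keep
    the scale with the highest relevancy score."""
    scales = np.linspace(0.0, max_scale, n_scales)
    scored = [(relevancy(lang_field(s), phi_query, phi_canon), s) for s in scales]
    return max(scored)

# Toy setup: random unit-norm query and canonical embeddings.
rng = np.random.default_rng(1)
d = 8
phi_query = rng.normal(size=d); phi_query /= np.linalg.norm(phi_query)
phi_canon = [c / np.linalg.norm(c) for c in rng.normal(size=(3, d))]

def lang_field(s):
    # Synthetic field: closest to the query embedding at scale 1.0.
    v = (1 - abs(s - 1.0)) * phi_query + abs(s - 1.0) * phi_canon[0]
    return v / np.linalg.norm(v)

score, scale = best_scale(lang_field, phi_query, phi_canon)
```

A score above 0.5 means the rendered embedding sits closer to the query than to every canonical phrase, which is why 0.5 is a natural threshold for relevancy visualizations.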
This comprehensive approach enables LERF to handle a wide range of language inputs, from visual properties to abstract concepts, and allows for real-time generation of 3D relevancy maps for a broad spectrum of language prompts. The result is a flexible and powerful system for language-based interaction with 3D scenes, with potential applications in robotics, vision-language model analysis, and interactive 3D scene exploration.
Implementation
To reproduce the results in the paper, I followed the same settings as in the paper. For the CLIP model, I use the OpenCLIP ViT-B/16 model trained on the LAION-2B dataset. The hashgrid used for representing language features has 32 levels spanning resolutions 16 to 512, with a hash table size of \(2^{21}\) and a feature dimension of 8. I use the Adam optimizer for the proposal networks and fields with weight decay \(10^{-9}\), and an exponential learning rate scheduler from \(10^{-2}\) to \(10^{-3}\) over the first 5000 training steps. All models are trained for 30,000 steps on an NVIDIA RTX A6000 GPU with 40GB memory.
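The exponential learning-rate schedule above reduces to a one-line geometric interpolation. This is a sketch of the schedule as described, under the assumption that the rate is held at its final value after step 5000:

```python
def exp_lr(step, lr_init=1e-2, lr_final=1e-3, warmup=5000):
    """Exponential decay from lr_init to lr_final over `warmup` steps,
    held constant afterwards (assumed)."""
    t = min(step / warmup, 1.0)  # progress through the decay phase
    return lr_init * (lr_final / lr_init) ** t

print(exp_lr(0))      # initial rate, 1e-2
print(exp_lr(2500))   # geometric midpoint, about 3.16e-3
print(exp_lr(5000))   # final rate, 1e-3
```

Note that the midpoint is the geometric mean of the endpoints, not the arithmetic mean, which is the defining property of an exponential schedule.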
Results
I tested the capabilities of LERF on three scenes: teacup, kitchen, and espresso. The results show that it can effectively handle a wide variety of input text queries, including the kinds of natural language specifications that current open-vocabulary detection frameworks struggle with.
Conclusion
Through my reproduced experimental results, I confirm that LERF, by fusing CLIP embeddings into NeRF without region proposals or fine-tuning, enables a wide range of natural language queries in 3D scenes. However, it also has several limitations:
- The performance of LERF is sensitive to the input text queries.
- LERF can sometimes produce false positives for visually similar objects. For example, in Figure 3, the "Cup" query produces a relevancy map that includes part of the espresso machine.
References
[1] Kerr, Justin, et al. "Lerf: Language embedded radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.