LERF Replication: Bridging Language and 3D Environments
Personal Project

  • Juyang Bai


Introduction


Neural Radiance Fields (NeRFs) have revolutionized 3D scene reconstruction by enabling high-fidelity rendering of complex scenes from a set of 2D images. However, traditional NeRFs focus solely on modeling the geometry and appearance, lacking semantic understanding of the scene elements. LERF addresses this limitation by incorporating language embeddings into the radiance field, allowing for a unified representation that captures both visual and semantic information.


Approach


Figure 1: Overview of LERF.

LERF (Language Embedded Radiance Fields) [1] presents a novel approach to grounding language within Neural Radiance Fields (NeRFs). The method optimizes a language field jointly with NeRF, taking both position and physical scale as input and outputting a CLIP vector. This technique addresses the challenge of mapping CLIP's global image embeddings to specific 3D points in a scene.

LERF learns a field of language embeddings over volumes centered at sample points, rather than individual points. The field is supervised using a multi-scale feature pyramid containing CLIP embeddings generated from image crops of training views. This allows the CLIP encoder to capture different scales of image context, associating the same 3D location with distinct language embeddings at different scales.
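As a rough illustration of this supervision signal, the sketch below precomputes a pyramid of CLIP embeddings for image crops centered on a pixel at several sizes. It assumes the open_clip package; the crop sizes, pretrained tag, and function name are illustrative choices, not the official LERF preprocessing pipeline.

```python
# Minimal sketch (assumed pipeline, not the official LERF preprocessing):
# build a pyramid of CLIP embeddings for crops centered on one pixel.
import torch
import open_clip
from torchvision.transforms.functional import crop, resize, normalize

CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k"
)
model.eval()

@torch.no_grad()
def pyramid_embeddings(image, cy, cx, crop_sizes=(64, 128, 256, 512)):
    """image: uint8 tensor [3, H, W]; returns [len(crop_sizes), clip_dim] unit-norm embeddings."""
    embeds = []
    for s in crop_sizes:
        top, left = max(cy - s // 2, 0), max(cx - s // 2, 0)
        patch = crop(image, top, left, s, s)                 # crop centered (roughly) on (cy, cx)
        patch = resize(patch, [224, 224], antialias=True)    # CLIP input resolution
        patch = normalize(patch.float() / 255.0, CLIP_MEAN, CLIP_STD)
        emb = model.encode_image(patch.unsqueeze(0))
        embeds.append(emb / emb.norm(dim=-1, keepdim=True))  # unit-norm CLIP vector per scale
    return torch.cat(embeds, dim=0)
```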

To regularize the optimized language field, self-supervised DINO features are incorporated through a shared bottleneck. The architecture consists of two separate networks: one for feature vectors (DINO, CLIP) and another for standard NeRF outputs (color, density), both represented using a multi-resolution hashgrid.
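To illustrate this two-branch design, here is a minimal PyTorch sketch of the language/DINO branch, assuming the tiny-cuda-nn hashgrid encoding. The layer widths, embedding dimensions, and class name are illustrative assumptions rather than the paper's exact architecture; the color/density branch would use its own hashgrid and is omitted.

```python
# Sketch of the LERF language/DINO branch (simplified, assuming tiny-cuda-nn).
import math
import torch
import torch.nn as nn
import tinycudann as tcnn

class LERFField(nn.Module):
    def __init__(self, clip_dim=512, dino_dim=384, bottleneck_dim=256):
        super().__init__()
        # Multi-resolution hashgrid: 32 levels, resolution 16 -> 512,
        # hash table size 2^21, 8 features per level.
        self.hashgrid = tcnn.Encoding(
            n_input_dims=3,
            encoding_config={
                "otype": "HashGrid",
                "n_levels": 32,
                "n_features_per_level": 8,
                "log2_hashmap_size": 21,
                "base_resolution": 16,
                "per_level_scale": math.exp(math.log(512 / 16) / 31),
            },
        )
        feat_dim = 32 * 8
        # Shared bottleneck, regularized by DINO supervision through dino_head.
        self.bottleneck = nn.Sequential(nn.Linear(feat_dim, bottleneck_dim), nn.ReLU())
        self.dino_head = nn.Linear(bottleneck_dim, dino_dim)
        # CLIP head is conditioned on the physical scale of the query volume.
        self.clip_head = nn.Sequential(
            nn.Linear(bottleneck_dim + 1, 256), nn.ReLU(), nn.Linear(256, clip_dim)
        )

    def forward(self, positions, scales):
        """positions: [N, 3] in the scene box; scales: [N, 1] physical scale in meters."""
        h = self.bottleneck(self.hashgrid(positions).float())
        clip_emb = self.clip_head(torch.cat([h, scales], dim=-1))
        clip_emb = clip_emb / clip_emb.norm(dim=-1, keepdim=True)  # unit-norm, like CLIP
        return clip_emb, self.dino_head(h)
```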

During querying, LERF computes relevancy scores between rendered embeddings and query embeddings using cosine similarity and a pairwise softmax, automatically selecting an appropriate scale for each query. The relevancy score is \[ \min_i \frac{\exp(\phi_{\text{lang}} \cdot \phi_{\text{quer}})}{\exp(\phi_{\text{lang}} \cdot \phi^{i}_{\text{canon}}) + \exp(\phi_{\text{lang}} \cdot \phi_{\text{quer}})} \] where \(\phi_{\text{lang}}\) is the rendered language embedding, \(\phi_{\text{quer}}\) is the query embedding, and \(\phi^{i}_{\text{canon}}\) is the embedding of the \(i\)-th canonical phrase. This score measures how much closer the rendered embedding is to the query embedding than to the canonical embeddings. For each query, the method also selects the scale \(s\) at which to evaluate the language field outputs \(F_{\text{lang}}\): relevancy maps are computed across scales from 0 to 2 meters in 30 increments, and the scale yielding the highest relevancy score is chosen. The querying process also includes a visibility filtering step that discards samples observed by too few training views.
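The scale-searched relevancy computation can be sketched as follows. Here `field` and `canon_embs` are assumed placeholders: a callable returning rendered, unit-norm language embeddings for a batch of rays at a given scale, and CLIP embeddings of canonical phrases such as "object", "things", "stuff", and "texture".

```python
# Sketch of the relevancy score and scale search described above.
import torch

def relevancy(lang_emb, query_emb, canon_embs):
    """lang_emb: [N, D], query_emb: [D], canon_embs: [C, D]; all unit-normalized."""
    q = lang_emb @ query_emb                     # [N] cosine similarity to the query
    c = lang_emb @ canon_embs.T                  # [N, C] similarities to canonical phrases
    # Pairwise softmax against each canonical phrase, then take the minimum over phrases.
    scores = torch.exp(q)[:, None] / (torch.exp(c) + torch.exp(q)[:, None])
    return scores.min(dim=1).values              # [N]

def best_scale_relevancy(field, query_emb, canon_embs, n_scales=30, max_scale=2.0):
    """Evaluate the language field over scales in [0, 2] m and keep the best score per ray."""
    best = None
    for s in torch.linspace(0.0, max_scale, n_scales):
        r = relevancy(field(scale=float(s)), query_emb, canon_embs)
        best = r if best is None else torch.maximum(best, r)
    return best
```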

This comprehensive approach enables LERF to handle a wide range of language inputs, from visual properties to abstract concepts, and allows for real-time generation of 3D relevancy maps for a broad spectrum of language prompts. The result is a flexible and powerful system for language-based interaction with 3D scenes, with potential applications in robotics, vision-language model analysis, and interactive 3D scene exploration.


Implementation


To reproduce the results in the paper, I followed the same settings as the original work. For the CLIP model, I used the OpenCLIP ViT-B/16 model trained on the LAION-2B dataset. The hashgrid used for representing language features has 32 layers spanning resolutions from 16 to 512, with a hash table size of \(2^{21}\) and a feature dimension of 8. I used the Adam optimizer for the proposal networks and fields with weight decay \(10^{-9}\), and an exponential learning rate scheduler from \(10^{-2}\) to \(10^{-3}\) over the first 5000 training steps. All models were trained for 30,000 steps on an NVIDIA RTX A6000 GPU with 40GB of memory.
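A minimal PyTorch sketch of these optimization settings is shown below. This is not the training loop used for the reproduction, just a standalone illustration of the listed hyperparameters; `field_params` and `proposal_params` are hypothetical placeholders for the corresponding parameter groups.

```python
# Illustrative optimizer/scheduler setup matching the hyperparameters listed above.
import torch

def make_optimizer_and_scheduler(field_params, proposal_params,
                                 lr_init=1e-2, lr_final=1e-3, decay_steps=5000):
    optimizer = torch.optim.Adam(
        [{"params": field_params}, {"params": proposal_params}],
        lr=lr_init,
        weight_decay=1e-9,
    )
    # Exponential decay from 1e-2 to 1e-3 over the first 5000 steps, then held constant.
    gamma = (lr_final / lr_init) ** (1.0 / decay_steps)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: max(lr_final / lr_init, gamma ** step)
    )
    return optimizer, scheduler
```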


Results


I tested the capabilities of LERF on three scenes: teatime, kitchen, and espresso. The results show that it can effectively process a wide variety of input text queries, including aspects of natural language specification that current open-vocabulary detection frameworks struggle with.

Figure 2: Teatime.
From the results, we can see that LERF is sensitive to the input text queries. As Figure 2 shows, the queries "Counter", "Board", and "Table" are all intended to refer to the same object in the scene, namely the table in the room. However, the visualizations show that LERF's ability to capture it varies across these queries.
Figure 3: Kitchen.
Interestingly, Figure 3 shows that LERF can capture associations between objects: when we input "Water Boiler", both the water boiler and the stove are highlighted in the relevancy map.
Figure 4: Espresso.


Conclusion


Through our reproduced experimental results, we confirm that LERF, by fusing CLIP embeddings into NeRF without region proposals or fine-tuning, enables a wide range of natural language queries in 3D scenes. However, it also has several limitations:

  • The performance of LERF is sensitive to the input text queries.
  • LERF can sometimes produce false positives for visually similar objects. For example, in Figure 3, the "Cup" query produces a relevancy map that also includes part of the "Espresso" region.
Despite these challenges, LERF represents a significant advancement in combining language and 3D scene understanding. As a general framework supporting aligned multi-modal encoders, it offers exciting possibilities for future improvements in vision-language models and their applications in 3D environments.


References


[1] Kerr, Justin, et al. "Lerf: Language embedded radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.