Harnessing Input-Adaptive Inference for Efficient VLN

Oregon State University


International Conference on Computer Vision (ICCV) 2025
Overview of our proposed method

Highlights

🌟 Input-Adaptive VLN Framework: We introduce the first input-adaptive inference approach for VLN.
🌟 Efficiency Gains: Across 7 VLN benchmarks, we improve efficiency by 1.7–7.5× while retaining 77–89% of task success.
🌟 Visual Corruptions Evaluation: We analyze the robustness of VLN agents under real-world visual corruptions.

Motivation

We aim to accelerate VLN inference without sacrificing performance by preventing unnecessary computation ('overthinking') during inference.

🔒 Challenges

(1) Modern VLN agents use large multi-modal transformers, which are compute-intensive.
(2) Existing input-adaptive methods do not transfer effectively to VLN settings.

🔑 Solution

We introduce a novel input-adaptive VLN framework that dynamically adjusts computation during navigation to prevent model overthinking while preserving navigation quality.

Our Method

We improve the efficiency of the visual encoder by leveraging the spatial and temporal locality of views in a navigation scene.

👉 k-extension: We process only each navigable view plus its k adjacent views.
Insight: Views near navigable views contain most of the decision-relevant context, so we can skip the rest.
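As a rough illustration (not the authors' exact implementation), selecting views under k-extension in a panorama might look like the following; the 36-view layout and the `navigable` index list are assumptions:

```python
# Hypothetical sketch of k-extension view selection over a 36-view panorama.
def k_extension(navigable, k, num_views=36):
    """Return the view indices to encode: each navigable view plus its
    k neighbors on either side, wrapping around the panorama."""
    keep = set()
    for v in navigable:
        for offset in range(-k, k + 1):
            keep.add((v + offset) % num_views)
    return sorted(keep)

# Example: two navigable views with k=1 keep only 6 of the 36 views,
# so the visual encoder runs on a small fraction of the panorama.
views = k_extension(navigable=[0, 18], k=1)  # -> [0, 1, 17, 18, 19, 35]
```

All remaining views are simply never passed to the visual encoder, which is where the compute savings come from.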

👉 Adaptive early-exiting (thresholds): We early-exit adjacent views using thresholds set by their proximity to navigable views (closer = more conservative).
Insight: Lower-importance views can be processed for fewer layers without degrading performance.
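A minimal sketch of the idea, assuming a per-layer confidence estimate; the threshold schedule, `confidence_fn`, and layer interface below are illustrative assumptions, not the paper's exact design:

```python
# Hypothetical proximity-aware early-exit sketch.
def exit_threshold(dist_to_navigable, base=0.9, decay=0.1, floor=0.5):
    """Views closer to a navigable view get a higher (more conservative)
    threshold, so they pass through more encoder layers before exiting."""
    return max(base - decay * dist_to_navigable, floor)

def forward_with_early_exit(x, layers, confidence_fn, dist_to_navigable):
    """Run encoder layers until an intermediate confidence estimate
    crosses the proximity-dependent threshold, then stop."""
    thr = exit_threshold(dist_to_navigable)
    for depth, layer in enumerate(layers, start=1):
        x = layer(x)
        if confidence_fn(x) >= thr:
            return x, depth        # exited early at this depth
    return x, len(layers)          # ran the full stack
```

With this schedule, a view adjacent to a navigable view (small distance, high threshold) runs through more layers than a distant, low-importance view, matching the "closer = more conservative" rule above.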

👉 Locality-sensitive hashing (LSH): We store and re-use similar visual embeddings across steps.
Insight: Consecutive panoramas repeat content; caching avoids re-encoding near-duplicate views.
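One way such a cache could be sketched is with random-hyperplane LSH; the bit count, dimensions, and bucket scheme here are illustrative assumptions rather than the paper's exact method:

```python
import random

class LSHEmbeddingCache:
    """Sketch of a random-hyperplane LSH cache: similar view features
    hash to the same bucket, so a stored embedding is reused instead of
    re-running the visual encoder on a near-duplicate view."""

    def __init__(self, dim, num_bits=16, seed=0):
        rng = random.Random(seed)
        # One random hyperplane per hash bit.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.cache = {}

    def _hash(self, feat):
        bits = 0
        for plane in self.planes:
            dot = sum(f * p for f, p in zip(feat, plane))
            bits = (bits << 1) | (dot >= 0)  # sign of projection = 1 bit
        return bits

    def get_or_encode(self, feat, encode_fn):
        key = self._hash(feat)
        if key not in self.cache:
            self.cache[key] = encode_fn(feat)  # miss: run the encoder
        return self.cache[key]                 # hit: reuse the embedding
```

Nearby features collide in the same bucket with high probability, so as the agent revisits near-identical views across consecutive panoramas, most lookups become cache hits and skip the encoder entirely.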

Adapting to Continuous VLN

👉 Scan-only subgoal prediction (scan-only SGM): To adapt our framework to continuous VLN, we modify the subgoal prediction module to process only laser scan data.

Performance and Efficiency of Our Input-Adaptive Agents

We apply our techniques to 3 VLN agents across 7 benchmarks and achieve a strong performance–efficiency trade-off.

👉 Standard VLN (HAMT and DUET): Averaged across 6 benchmarks, we reduce GFLOPs by 2.3× with just an 11.7% drop in success rate (SR).
👉 Continuous VLN (VLN-CE-BERT): We reduce GFLOPs by 7.1–7.5× with an 8–14% drop in SR.

Robustness to Visual Corruptions

We evaluate our agent against five real-world visual corruptions and uncover a promising countermeasure that improves its resilience.

👉 Our agent is vulnerable: 12.3–31.3% SR loss with a 3.6–20.9% increase in GFLOPs.
👉 Denoising is a promising countermeasure: It recovers SR by 18% and reduces GFLOPs by 6%.

Example images demonstrating the visual corruptions we evaluate with.

BibTeX


        @inproceedings{kang2025harnessing,
          title={Harnessing Input-adaptive Inference for Efficient VLN},
          author={Kang, Dongwoo and Perincherry, Akhil and Coalson, Zachary and Gabriel, Aiden and Lee, Stefan and Hong, Sanghyun},
          booktitle={International Conference on Computer Vision},
          year={2025},
        }