Towards Versatile Embodied Navigation

October 30, 2022 · 🏛 Neural Information Processing Systems

Repo contents: .gitignore, LICENSE, README.md, assets, habitat_baselines, habitat_extensions, requirements.txt, run.py, sbatch_scripts, vienna

Authors: Hanqing Wang, Wei Liang, Luc Van Gool, Wenguan Wang
arXiv ID: 2210.16822
Category: cs.CV (Computer Vision)
Citations: 35
Venue: Neural Information Processing Systems
Repository: https://github.com/hanqingwangai/VXN (⭐ 21)

Abstract

With the emergence of varied visual navigation tasks (e.g., image-/object-/audio-goal and vision-language navigation) that specify the target in different ways, the community has made appealing advances in training specialized agents capable of handling individual navigation tasks well. Given plenty of embodied navigation tasks and task-specific solutions, we address a more fundamental question: can we learn a single powerful agent that masters not one but multiple navigation tasks concurrently? First, we propose VXN, a large-scale 3D dataset that instantiates four classic navigation tasks in standardized, continuous, and audiovisual-rich environments. Second, we propose Vienna, a versatile embodied navigation agent that simultaneously learns to perform the four navigation tasks with one model. Building upon a full-attentive architecture, Vienna formulates various navigation tasks as a unified, parse-and-query procedure: the target description, augmented with four task embeddings, is comprehensively interpreted into a set of diversified goal vectors, which are refined as the navigation progresses, and used as queries to retrieve supportive context from episodic history for decision making. This enables the reuse of knowledge across navigation tasks with varying input domains/modalities. We empirically demonstrate that, compared with learning each visual navigation task individually, our multitask agent achieves comparable or even better performance with reduced complexity.
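
The parse-and-query procedure described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation and does not use the code in the VXN repository. The function names (`parse_target`, `query_history`, `step`), the vector size, the number of goal vectors, the residual refinement rule, and the random vectors standing in for learned embeddings and encoder outputs are all assumptions made for illustration. Learned projection matrices are omitted for brevity.

```python
import math
import random

# Hypothetical sizes and task names, chosen only for this sketch.
TASKS = ["image_goal", "object_goal", "audio_goal", "vln"]
DIM = 8
NUM_GOALS = 3

rng = random.Random(0)


def rand_vec():
    """Random vector standing in for a learned embedding or encoder output."""
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


# One embedding per task (the abstract mentions four task embeddings).
task_embeddings = {task: rand_vec() for task in TASKS}
# Assumed: a fixed set of seed queries that produce the goal vectors.
goal_seeds = [rand_vec() for _ in range(NUM_GOALS)]


def parse_target(target_tokens, task):
    """Parse: turn a task-tagged target description into goal vectors."""
    tagged = [add(tok, task_embeddings[task]) for tok in target_tokens]
    return [attend(seed, tagged, tagged) for seed in goal_seeds]


def query_history(goal_vectors, history):
    """Query: each goal vector retrieves context from episodic history."""
    return [attend(goal, history, history) for goal in goal_vectors]


def step(goal_vectors, history):
    """One navigation step: retrieve context, then refine the goals."""
    contexts = query_history(goal_vectors, history)
    # Assumed refinement rule: a simple residual update.
    refined = [add(g, c) for g, c in zip(goal_vectors, contexts)]
    # Mean-pooled context as a stand-in feature for the action policy.
    feature = [sum(c[i] for c in contexts) / len(contexts) for i in range(DIM)]
    return refined, feature
```

A possible usage, again with random stand-ins for real observations: call `parse_target(tokens, "object_goal")` once at the start of an episode, then call `step(goals, history)` at every time step while appending new observation features to `history`. Because the same functions serve all four task names, only the task embedding and the target encoder differ per task, which is the kind of knowledge reuse the abstract describes.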