Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training

February 25, 2020 · Entered Twilight · 🏛 Computer Vision and Pattern Recognition

"Last commit was 5.0 years ago (≥5 year threshold)"

Evidence collected by the PWNC Scanner

Repo contents: .gitkeep, CMakeLists.txt, Doxyfile, LICENSE, README.md, connectivity, img_features, include, models, pre_training_scheme.png, preprocess, pretrain_finetune.png, scripts, src, teaser.jpg, web, webgl_imgs

Authors: Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, Jianfeng Gao
arXiv ID: 2002.10638
Category: cs.CV: Computer Vision
Cross-listed: cs.CL, cs.LG, cs.RO
Citations: 329
Venue: Computer Vision and Pattern Recognition
Repository: https://github.com/weituo12321/PREVALENT ⭐ 94
Last Checked: 2 months ago

Abstract

Learning to navigate in a visual environment following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable, and the training data on a new task is often limited. In this paper, we present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large number of image-text-action triplets in a self-supervised manner, the pre-trained model provides generic representations of visual environments and language instructions. It can be easily used as a drop-in for existing VLN frameworks, leading to the proposed agent, called Prevalent. It learns more effectively in new tasks and generalizes better in previously unseen environments. The performance is validated on three VLN tasks. On the Room-to-Room benchmark, our model improves the state of the art from 47% to 51% on success rate weighted by path length. Further, the learned representation is transferable to other VLN tasks. On two recent tasks, vision-and-dialog navigation and "Help, Anna!", the proposed Prevalent leads to significant improvement over existing methods, achieving a new state of the art.
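The headline Room-to-Room number in the abstract is reported as success rate weighted by path length (SPL). As a rough illustration of how that metric is commonly defined for navigation benchmarks, the sketch below averages, over episodes, the success indicator times the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. The function name `spl` and its parameter names are assumptions made for this example; none of it is taken from the PREVALENT repository.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by path length, averaged over episodes.

    Hypothetical helper for illustration only (not from the paper's code).
    successes[i]        -- whether episode i ended at the goal
    shortest_lengths[i] -- shortest-path distance from start to goal
    path_lengths[i]     -- length of the path the agent actually took
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if not (n == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        # Failed episodes contribute 0; guard against a zero-length episode.
        if success and denom > 0:
            total += shortest / denom
    return total / n


if __name__ == "__main__":
    # Three toy episodes: an optimal success, a success with a detour, a failure.
    score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 10.0])
    print(f"SPL = {score:.2f}")
```

Under this definition, an agent that succeeds but takes a path twice as long as necessary earns only half credit for that episode, which is why SPL is a stricter measure than plain success rate.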