Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
February 25, 2020 · Entered Twilight · Computer Vision and Pattern Recognition
"Last commit was 5.0 years ago (≥5 year threshold)"
Evidence collected by the PWNC Scanner
Repo contents: .gitkeep, CMakeLists.txt, Doxyfile, LICENSE, README.md, connectivity, img_features, include, models, pre_training_scheme.png, preprocess, pretrain_finetune.png, scripts, src, teaser.jpg, web, webgl_imgs
Authors
Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, Jianfeng Gao
arXiv ID
2002.10638
Category
cs.CV: Computer Vision
Cross-listed
cs.CL, cs.LG, cs.RO
Citations
329
Venue
Computer Vision and Pattern Recognition
Repository
https://github.com/weituo12321/PREVALENT
⭐ 94
Last Checked
2 months ago
Abstract
Learning to navigate in a visual environment following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable, and the training data on a new task is often limited. In this paper, we present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions. It can be easily used as a drop-in for existing VLN frameworks, leading to the proposed agent called Prevalent. It learns more effectively in new tasks and generalizes better in a previously unseen environment. The performance is validated on three VLN tasks. On the Room-to-Room benchmark, our model improves the state-of-the-art from 47% to 51% on success rate weighted by path length. Further, the learned representation is transferable to other VLN tasks. On two recent tasks, vision-and-dialog navigation and "Help, Anna!", the proposed Prevalent leads to significant improvement over existing methods, achieving a new state of the art.
Similar Papers
In the same crypt: Computer Vision
Old Age
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
R.I.P.
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
Old Age
SSD: Single Shot MultiBox Detector
Old Age
Squeeze-and-Excitation Networks
R.I.P.
Ghosted