IQA: Visual Question Answering in Interactive Environments

December 09, 2017 · Entered Twilight · 🏛 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

"Last commit was 7.0 years ago (≥5 year threshold)"

Evidence collected by the PWNC Scanner

Repo contents: .gitignore, LICENSE, README.md, __init__.py, constants.py, darknet_object_detection, depth_estimation_network, download_weights.sh, eval.py, game_state.py, generate_questions.py, generate_questions, graph, human_controlled_test.py, layouts, networks, qa_agents, question_embedding, question_to_text.py, questions, reinforcement_learning, requirements.txt, run_thor_tests.py, supervised, tasks.py, test_thor.py, thor_tests, train.py, utils, visualizations, vocabulary.txt

Authors: Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, Ali Farhadi
arXiv ID: 1712.03316
Category: cs.CV (Computer Vision)
Citations: 422
Venue: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Repository: https://github.com/danielgordon10/thor-iqa-cvpr-2018 (⭐ 126)
Last Checked: 2 months ago

Abstract

We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive objects (code and dataset available at https://github.com/danielgordon10/thor-iqa-cvpr-2018). IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1. For sample questions and results, please view our video: https://youtu.be/pXd3C-1jr98
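
The idea of a factorized set of controllers working at different levels of temporal abstraction can be illustrated with a toy sketch. The code below is not the HIMN implementation and does not use AI2-THOR or any code from the repository: the `ToyKitchen` world, the rule-based `planner`, and the `navigator`, `manipulator`, and `answerer` functions are all hypothetical names invented for illustration, and hand-written rules stand in for the learned controllers. It only shows the control pattern in which one high-level decision triggers a low-level controller that may run for several primitive actions.

```python
from dataclasses import dataclass


@dataclass
class ToyKitchen:
    """Hypothetical 1-D world: the agent walks along a corridor to a fridge."""
    fridge_pos: int = 3
    apple_in_fridge: bool = True
    agent_pos: int = 0
    fridge_open: bool = False
    steps: int = 0  # number of primitive actions taken so far

    def step(self, action: str) -> None:
        """Apply one primitive action."""
        self.steps += 1
        if action == "move_right":
            self.agent_pos += 1
        elif action == "open" and self.agent_pos == self.fridge_pos:
            self.fridge_open = True

    def sees_apple(self) -> bool:
        """The apple is visible only from the fridge, with the door open."""
        at_fridge = self.agent_pos == self.fridge_pos
        return at_fridge and self.fridge_open and self.apple_in_fridge


def navigator(env: ToyKitchen) -> None:
    """Low-level controller: one call may emit many primitive moves."""
    while env.agent_pos < env.fridge_pos:
        env.step("move_right")


def manipulator(env: ToyKitchen) -> None:
    """Low-level controller: interact with the object in front of the agent."""
    env.step("open")


def answerer(env: ToyKitchen) -> str:
    """Low-level controller: answer the question from what has been observed."""
    return "yes" if env.sees_apple() else "no"


def planner(env: ToyKitchen) -> str:
    """High-level controller: pick which low-level controller runs next.

    Assumption: hand-written rules replace a learned policy here.
    """
    if env.agent_pos != env.fridge_pos:
        return "navigate"
    if not env.fridge_open:
        return "interact"
    return "answer"


def run_episode(env: ToyKitchen, max_high_level_steps: int = 10):
    """Answer 'Are there any apples in the fridge?' in the toy world."""
    trace = []
    for _ in range(max_high_level_steps):
        choice = planner(env)
        trace.append(choice)
        if choice == "navigate":
            navigator(env)
        elif choice == "interact":
            manipulator(env)
        else:
            return answerer(env), trace
    return "no", trace  # fall back to "no" if the step budget runs out
```

Calling `run_episode(ToyKitchen())` makes three high-level decisions (navigate, interact, answer) while the environment records four primitive actions, which is the sense in which the planner operates at a coarser time scale than the controllers it invokes.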