Analyzing Modular Approaches for Visual Question Decomposition

November 10, 2023 · Conference on Empirical Methods in Natural Language Processing

Repo contents: .env, .gitignore, LICENSE, README.md, conda-lock.yml, environment.yml, experiments, pdm.lock, pyproject.toml, src, tango.yml

Authors: Apoorv Khandelwal, Ellie Pavlick, Chen Sun
arXiv ID: 2311.06411
Category: cs.CV (Computer Vision)
Cross-listed: cs.CL
Citations: 6
Venue: Conference on Empirical Methods in Natural Language Processing
Repository: https://github.com/brown-palm/visual-question-decomposition

Abstract

Modular neural networks without additional training have recently been shown to surpass end-to-end neural networks on challenging vision-language tasks. The latest such methods simultaneously introduce LLM-based code generation to build programs and a number of skill-specific, task-oriented modules to execute them. In this paper, we focus on ViperGPT and ask where its additional performance comes from and how much is due to the (state-of-the-art, end-to-end) BLIP-2 model it subsumes vs. additional symbolic components. To do so, we conduct a controlled study (comparing end-to-end, modular, and prompting-based methods across several VQA benchmarks). We find that ViperGPT's reported gains over BLIP-2 can be attributed to its selection of task-specific modules, and when we run ViperGPT using a more task-agnostic selection of modules, these gains go away. Additionally, ViperGPT retains much of its performance if we make prominent alterations to its selection of modules, e.g., removing or retaining only BLIP-2. Finally, we compare ViperGPT against a prompting-based decomposition strategy and find that, on some benchmarks, modular approaches benefit significantly from representing subtasks with natural language instead of code.
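To make the module-selection manipulation concrete, the following is a minimal toy sketch, not ViperGPT's actual API. All names here (`general_vqa`, `object_detector`, `program_count_muffins`, the dictionary-based "image") are hypothetical stand-ins: the modules are stubs rather than pretrained models, and the program is hand-written rather than generated by an LLM. It only illustrates the idea of running the same program under different module selections.

```python
from typing import Callable, Dict

Modules = Dict[str, Callable]

# Hypothetical stubs; a real system would call pretrained models here.
def general_vqa(image: dict, question: str) -> str:
    """Stand-in for an end-to-end VQA model (the role BLIP-2 plays in the paper)."""
    return image["answers"].get(question, "unknown")

def object_detector(image: dict, name: str) -> int:
    """Stand-in for a task-specific module: counts objects with the given name."""
    return image["objects"].count(name)

def program_count_muffins(image: dict, modules: Modules) -> str:
    """Hand-written stand-in for an LLM-generated program.

    It uses the detector when that module is available and otherwise
    falls back to the general VQA module.
    """
    if "detect" in modules:
        return str(modules["detect"](image, "muffin"))
    return modules["vqa"](image, "How many muffins are there?")

# A toy "image": a dictionary of annotations instead of pixels (assumption).
image = {
    "objects": ["muffin", "muffin", "plate"],
    "answers": {"How many muffins are there?": "2"},
}

# Two module selections: the full set, and one retaining only the VQA module.
full_modules: Modules = {"vqa": general_vqa, "detect": object_detector}
vqa_only_modules: Modules = {"vqa": general_vqa}

print(program_count_muffins(image, full_modules))
print(program_count_muffins(image, vqa_only_modules))
```

In this toy setting both selections return the same answer by construction. The stubs are written to agree, so the sketch says nothing about real accuracy; it only shows what varying the module set while holding the program fixed looks like.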