Generating Natural Questions from Images for Multimodal Assistants
November 17, 2020 Β· Declared Dead Β· π IEEE International Conference on Acoustics, Speech, and Signal Processing
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Alkesh Patel, Akanksha Bindal, Hadas Kotek, Christopher Klein, Jason Williams
arXiv ID
2012.03678
Category
cs.CV: Computer Vision
Cross-listed
cs.AI,
cs.CL,
cs.LG
Citations
10
Venue
IEEE International Conference on Acoustics, Speech, and Signal Processing
Last Checked
4 months ago
Abstract
Generating natural, diverse, and meaningful questions from images is an essential task for multimodal assistants as it confirms whether they have understood the object and scene in the images properly. The research in visual question answering (VQA) and visual question generation (VQG) is a great step. However, this research does not capture questions that a visually-abled person would ask multimodal assistants. Recently published datasets such as KB-VQA, FVQA, and OK-VQA try to collect questions that look for external knowledge which makes them appropriate for multimodal assistants. However, they still contain many obvious and common-sense questions that humans would not usually ask a digital assistant. In this paper, we provide a new benchmark dataset that contains questions generated by human annotators keeping in mind what they would ask multimodal digital assistants. Large scale annotations for several hundred thousand images are expensive and time-consuming, so we also present an effective way of automatically generating questions from unseen images. In this paper, we present an approach for generating diverse and meaningful questions that consider image content and metadata of image (e.g., location, associated keyword). We evaluate our approach using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr to show the relevance of generated questions with human-provided questions. We also measure the diversity of generated questions using generative strength and inventiveness metrics. We report new state-of-the-art results on the public and our datasets.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Computer Vision
π
π
Old Age
π
π
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
π
π
Old Age
SSD: Single Shot MultiBox Detector
π
π
Old Age
Squeeze-and-Excitation Networks
π
π
Old Age
Fast R-CNN
π
π
Old Age
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted