Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model

May 19, 2025 Β· Declared Dead Β· πŸ› Interspeech

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Yong Ren, Chenxing Li, Le Xu, Hao Gu, Duzhen Zhang, Yujie Chen, Manjie Xu, Ruibo Fu, Shan Yang, Dong Yu arXiv ID 2505.13062 Category cs.MM: Multimedia Cross-listed cs.SD, eess.AS Citations 1 Venue Interspeech Last Checked 4 months ago
Abstract
Humans can intuitively infer sounds from silent videos, but whether multimodal large language models can perform modal-mismatch reasoning without accessing target modalities remains relatively unexplored. Current text-assisted-video-to-audio (VT2A) methods excel in video foley tasks but struggle to acquire audio descriptions during inference. We introduce the task of Reasoning Audio Descriptions from Silent Videos (SVAD) to address this challenge and investigate vision-language models' (VLMs) capabilities on this task. To further enhance the VLMs' reasoning capacity for the SVAD task, we construct a CoT-AudioCaps dataset and propose a Chain-of-Thought-based supervised fine-tuning strategy. Experiments on SVAD and subsequent VT2A tasks demonstrate our method's effectiveness in two key aspects: significantly improving VLMs' modal-mismatch reasoning for SVAD and effectively addressing the challenge of acquiring audio descriptions during VT2A inference.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Multimedia

R.I.P. πŸ‘» Ghosted

Video Generation From Text

Yitong Li, Martin Renqiang Min, ... (+3 more)

cs.MM πŸ› AAAI πŸ“š 300 cites 8 years ago

Died the same way β€” πŸ‘» Ghosted