Effectively obtaining acoustic, visual and textual data from videos

September 06, 2025 · Declared Dead · 🏛 Applied Sciences

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Jorge E. León, Miguel Carrasco
arXiv ID: 2509.05786
Category: cs.MM (Multimedia)
Cross-listed: cs.SD, eess.AS
Citations: 1
Venue: Applied Sciences
Last Checked: 4 months ago

Abstract
The increasing use of machine learning models has amplified the demand for high-quality, large-scale multimodal datasets. However, the availability of such datasets, especially those combining acoustic, visual and textual data, remains limited. This paper addresses this gap by proposing a method to extract related audio-image-text observations from videos. We detail the process of selecting suitable videos, extracting relevant data pairs, and generating descriptive texts using image-to-text models. Our approach ensures a robust semantic connection between modalities, enhancing the utility of the created datasets for various applications. We also discuss the challenges encountered and propose solutions to improve data quality. The resulting datasets, publicly available, aim to support and advance research in multimodal data analysis and machine learning.
Community shame:
Not yet rated

📜 Similar Papers

In the same crypt: Multimedia

R.I.P. 👻 Ghosted

Video Generation From Text

Yitong Li, Martin Renqiang Min, ... (+3 more)

cs.MM πŸ› AAAI πŸ“š 300 cites 8 years ago

Died the same way: 👻 Ghosted