Learning to Separate Object Sounds by Watching Unlabeled Video

April 05, 2018 · Entered Twilight · 🏛 European Conference on Computer Vision

Evidence collected by the PWNC Scanner:

"No code URL or promise found in abstract"
"Code repo scraped from project page (backfill)"

Repo contents: README.md, data, models, options, test.py, train.py, util

Authors: Ruohan Gao, Rogerio Feris, Kristen Grauman
arXiv ID: 1804.01665
Category: cs.CV (Computer Vision)
Cross-listed: cs.MM, cs.SD, eess.AS
Citations: 295
Venue: European Conference on Computer Vision
Repository: https://github.com/rhgao/separating-object-sounds (⭐ 50)
Last Checked: 1 month ago

Abstract

Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results: http://vision.cs.utexas.edu/projects/separating_object_sounds/
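
The abstract describes recovering per-object audio frequency bases and then using them to guide separation. The sketch below illustrates only that final step. It assumes the bases come from non-negative matrix factorization (NMF) of a magnitude spectrogram, which the abstract does not name explicitly, and that each basis has already been assigned to an object. In the described framework that assignment would come from the learned multi-instance multi-label model; here `base_labels` is a hand-written, hypothetical stand-in, the spectrogram is random data, and `nmf` and `separate` are illustrative names rather than functions from the authors' repository.

```python
import numpy as np


def nmf(V, n_bases, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (freq x time) as W @ H.

    Uses the standard multiplicative updates for the Frobenius objective.
    W holds frequency bases (columns); H holds their activations over time.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_bases)) + eps
    H = rng.random((n_bases, n_time)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def separate(V, W, H, base_labels, eps=1e-9):
    """Split V into one magnitude spectrogram per object label.

    Each object's soft mask is the share of the full reconstruction W @ H
    that is explained by the bases assigned to that object.
    """
    full = W @ H + eps
    parts = {}
    for label in sorted(set(base_labels)):
        idx = [i for i, name in enumerate(base_labels) if name == label]
        parts[label] = V * (W[:, idx] @ H[idx, :]) / full
    return parts


# Toy input: a random non-negative "spectrogram" (64 frequency bins, 50 frames).
rng = np.random.default_rng(1)
V = rng.random((64, 50)) + 0.1

W, H = nmf(V, n_bases=4)

# Hypothetical basis-to-object assignment; a learned model would supply this.
base_labels = ["guitar", "guitar", "dog", "dog"]
parts = separate(V, W, H, base_labels)
```

Because the masks sum to (almost exactly) one in every time-frequency bin, the per-object spectrograms add back up to the mixture. Recovering waveforms would additionally require the mixture phase and an inverse STFT, which this sketch omits.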