Audio-Visual Speech Separation via Bottleneck Iterative Network

July 09, 2025 · Declared Dead · 🏛 arXiv.org

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Sidong Zhang, Shiv Shankar, Trang Nguyen, Andrea Fanelli, Madalina Fiterau
arXiv ID: 2507.07270
Category: cs.SD: Sound
Cross-listed: cs.MM, eess.AS
Citations: 0
Venue: arXiv.org
Last Checked: 4 months ago

Abstract

Integrating information from non-auditory cues can significantly improve the performance of speech-separation models. Such models often use deep modality-specific networks to obtain unimodal features, and risk being either too costly or lightweight but lacking in capacity. In this work, we present an iterative representation refinement approach called Bottleneck Iterative Network (BIN), a technique that repeatedly passes through a lightweight fusion block while bottlenecking the fused representations through fusion tokens. This improves the capacity of the model while avoiding a major increase in model size, balancing model performance against training cost. We test BIN on challenging noisy audio-visual speech separation tasks and show that our approach consistently outperforms state-of-the-art benchmark models with respect to SI-SDRi on the NTCD-TIMIT and LRS3+WHAM! datasets, while simultaneously achieving a reduction of more than 50% in training and GPU inference time across nearly all settings.
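The abstract reports results in SI-SDRi (scale-invariant signal-to-distortion ratio improvement). For readers unfamiliar with the metric, here is a minimal sketch of its commonly used definition in plain Python. It is not the authors' evaluation code, which this page could not locate; the function names, the `eps` stabiliser, and the toy signals in the usage note are assumptions made for illustration.

```python
import math


def _zero_mean(signal):
    """Return a copy of the signal with its mean removed."""
    mean = sum(signal) / len(signal)
    return [v - mean for v in signal]


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between an estimate and a clean reference.

    Both inputs are equal-length sequences of floats. `eps` is an assumed
    small constant that guards against division by zero.
    """
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)

    # Project the estimate onto the reference to get the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Everything in the estimate that is not the scaled target is distortion.
    distortion = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)
    return 10.0 * math.log10((target_energy + eps) / (distortion_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated estimate over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


def make_toy_signals(n=200):
    """Hypothetical toy data: a sinusoidal 'speaker' plus a sinusoidal interferer."""
    reference = [math.sin(0.1 * i) for i in range(n)]
    interferer = [math.cos(0.37 * i) for i in range(n)]
    mixture = [r + q for r, q in zip(reference, interferer)]
    # Pretend a separator removed 90% of the interferer's amplitude.
    estimate = [r + 0.1 * q for r, q in zip(reference, interferer)]
    return reference, mixture, estimate
```

Calling `si_sdr_improvement(estimate, mixture, reference)` on the toy signals from `make_toy_signals()` gives a positive value in dB, since the estimate contains less of the interferer than the mixture does. Because the target is obtained by projection, rescaling the estimate leaves `si_sdr` essentially unchanged, which is the property that makes the metric scale-invariant.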

📜 Similar Papers

In the same crypt — Sound

🔮 The Ethereal

WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, Sander Dieleman, ... (+7 more)

cs.SD 🏛 Speech Synthesis 📚 8.0K cites 9 years ago

R.I.P. 👻 Ghosted

Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks

Morten Kolbæk, Dong Yu, ... (+2 more)

cs.SD 🏛 IEEE/ACM TASLP 📚 763 cites 9 years ago

R.I.P. 👻 Ghosted

The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines

Jon Barker, Shinji Watanabe, ... (+2 more)

cs.SD 🏛 Interspeech 📚 714 cites 8 years ago

R.I.P. 👻 Ghosted

TasNet: time-domain audio separation network for real-time, single-channel speech separation

Yi Luo, Nima Mesgarani

cs.SD 🏛 ICASSP 📚 711 cites 8 years ago

R.I.P. 👻 Ghosted

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

Soroush Mehri, Kundan Kumar, ... (+6 more)

cs.SD 🏛 ICLR 📚 619 cites 9 years ago

R.I.P. 👻 Ghosted

MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation

Li-Chia Yang, Szu-Yu Chou, Yi-Hsuan Yang

cs.SD 🏛 ISMIR 📚 493 cites 9 years ago

Died the same way — 👻 Ghosted

R.I.P. 👻 Ghosted

Federated Learning: Strategies for Improving Communication Efficiency

Jakub Konečný, H. Brendan McMahan, ... (+4 more)

cs.LG 🏛 arXiv 📚 5.2K cites 9 years ago

R.I.P. 👻 Ghosted

In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi, Cliff Young, ... (+73 more)

cs.AR 🏛 ISCA 📚 5.1K cites 9 years ago

R.I.P. 👻 Ghosted

Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning

Hoo-Chang Shin, Holger R. Roth, ... (+7 more)

cs.CV 🏛 IEEE TMI 📚 4.9K cites 10 years ago

R.I.P. 👻 Ghosted

Explanation in Artificial Intelligence: Insights from the Social Sciences

Tim Miller

cs.AI 🏛 AI 📚 4.9K cites 9 years ago