Late Audio-Visual Fusion for In-The-Wild Speaker Diarization

November 02, 2022 · Declared Dead · 🏛 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Zexu Pan, Gordon Wichern, François G. Germain, Aswin Subramanian, Jonathan Le Roux
arXiv ID: 2211.01299
Category: eess.AS (Audio & Speech)
Cross-listed: cs.CL, cs.SD
Citations: 3
Venue: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
Last Checked: 3 months ago

Abstract
Speaker diarization is well studied for constrained audio but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers. We address this gap by proposing an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset, and propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers. The visual-centric sub-system leverages facial attributes and lip-audio synchrony for identity and speech activity estimation of on-screen speakers. Both sub-systems surpass the state of the art (SOTA) by a large margin, with the fused audio-visual system achieving a new SOTA on the AVA-AVD benchmark.
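
The abstract does not spell out how the late fusion is performed, so the following is only a generic sketch of what score-level late fusion can look like, not the authors' method. It assumes that both sub-systems emit per-frame speech-activity probabilities for the same, already matched speaker, and that the visual stream reports `None` whenever that speaker is off-screen. The function name `late_fuse`, the `audio_weight` parameter, and the 0.5 decision threshold are all hypothetical.

```python
from typing import List, Optional, Sequence


def late_fuse(
    audio_probs: Sequence[float],
    visual_probs: Sequence[Optional[float]],
    audio_weight: float = 0.5,   # hypothetical mixing weight, not from the paper
    threshold: float = 0.5,      # hypothetical decision threshold
) -> List[bool]:
    """Fuse per-frame speech-activity probabilities for one speaker.

    A visual entry of None means the speaker is off-screen in that frame,
    in which case the audio score is used on its own.
    """
    decisions = []
    for a, v in zip(audio_probs, visual_probs):
        if v is None:
            score = a  # no visual evidence: fall back to audio only
        else:
            # Weighted average of the two independent sub-system scores.
            score = audio_weight * a + (1.0 - audio_weight) * v
        decisions.append(score >= threshold)
    return decisions


if __name__ == "__main__":
    audio = [0.9, 0.2, 0.6]
    visual = [0.8, None, 0.1]
    print(late_fuse(audio, visual))
```

Falling back to the audio score when the face is not visible is one simple way to cope with the inconsistent on-screen speakers the abstract mentions. A real system would also need to match speaker identities across the two sub-systems before any scores can be combined.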
