Face-Voice Association for Audiovisual Active Speaker Detection in Egocentric Recordings

June 22, 2025 · Declared Dead · 🏛 European Signal Processing Conference

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Jason Clarke, Yoshihiko Gotoh, Stefan Goetze
arXiv ID: 2506.18055
Category: cs.MM (Multimedia)
Cross-listed: cs.SD, eess.AS
Citations: 1
Venue: European Signal Processing Conference
Last Checked: 4 months ago

Abstract

Audiovisual active speaker detection (ASD) is conventionally performed by modelling the temporal synchronisation of acoustic and visual speech cues. In egocentric recordings, however, the efficacy of synchronisation-based methods is compromised by occlusions, motion blur, and adverse acoustic conditions. In this work, a novel framework is proposed that exclusively leverages cross-modal face-voice associations to determine speaker activity. An existing face-voice association model is integrated with a transformer-based encoder that aggregates facial identity information by dynamically weighting each frame based on its visual quality. This system is then coupled with a front-end utterance segmentation method, producing a complete ASD system. This work demonstrates that the proposed system, Self-Lifting for audiovisual active speaker detection (SL-ASD), achieves performance comparable to, and in certain cases exceeding, that of parameter-intensive synchronisation-based approaches while using significantly fewer learnable parameters. This validates the feasibility of substituting strict audiovisual synchronisation modelling with flexible biometric associations in challenging egocentric scenarios.
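The quality-weighted aggregation and face-voice matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses a transformer-based encoder, whereas the sketch swaps in a plain softmax over per-frame quality scores. The function names, the quality scores, the assumption that face and voice embeddings share one space, and the 0.5 decision threshold are all hypothetical and serve only to show the idea.

```python
import math


def softmax(scores):
    """Turn arbitrary per-frame scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_face_embeddings(frame_embeddings, quality_scores):
    """Weighted average of per-frame face embeddings.

    Frames with higher (hypothetical) quality scores contribute more,
    so blurred or occluded frames are down-weighted.
    """
    weights = softmax(quality_scores)
    dim = len(frame_embeddings[0])
    return [
        sum(w * emb[i] for w, emb in zip(weights, frame_embeddings))
        for i in range(dim)
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_active_speaker(frame_embeddings, quality_scores, voice_embedding,
                      threshold=0.5):
    """Decide whether a face track matches the voice of an utterance.

    Assumes face and voice embeddings live in a shared space;
    the threshold is an illustrative value, not one from the paper.
    """
    face_identity = aggregate_face_embeddings(frame_embeddings, quality_scores)
    return cosine_similarity(face_identity, voice_embedding) >= threshold
```

In this sketch, a face track would be compared against the voice embedding of each segmented utterance, and the track is labelled active only when the association score clears the threshold. No temporal synchronisation between lip motion and audio is modelled.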