Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

October 11, 2023 ยท Entered Twilight ยท ๐Ÿ› IEEE Transactions on Cognitive and Developmental Systems

๐Ÿ’ค TWILIGHT: Eternal Rest
Repo abandoned since publication

Repo contents: .gitignore, README.md, website

Authors Xiang Hao, Jibin Wu, Jianwei Yu, Chenglin Xu, Kay Chen Tan arXiv ID 2310.07284 Category eess.AS: Audio & Speech Cross-listed cs.CL Citations 17 Venue IEEE Transactions on Cognitive and Developmental Systems Repository https://github.com/haoxiangsnr/llm-tse โญ 42 Last Checked 2 months ago
Abstract
Humans can easily isolate a single speaker from a complex acoustic environment, a capability referred to as the "Cocktail Party Effect." However, replicating this ability has been a significant challenge in the field of target speaker extraction (TSE). Traditional TSE approaches predominantly rely on voiceprints, which raise privacy concerns and face issues related to the quality and availability of enrollment samples, as well as intra-speaker variability. To address these issues, this work introduces a novel text-guided TSE paradigm named LLM-TSE. In this paradigm, a state-of-the-art large language model, LLaMA 2, processes typed text input from users to extract semantic cues. We demonstrate that textual descriptions alone can effectively serve as cues for extraction, thus addressing privacy concerns and reducing dependency on voiceprints. Furthermore, our approach offers flexibility by allowing the user to specify the extraction or suppression of a speaker and enhances robustness against intra-speaker variability by incorporating context-dependent textual information. Experimental results show competitive performance with text-based cues alone and demonstrate the effectiveness of using text as a task selector. Additionally, they achieve a new state-of-the-art when combining text-based cues with pre-registered cues. This work represents the first integration of LLMs with TSE, potentially establishing a new benchmark in solving the cocktail party problem and expanding the scope of TSE applications by providing a versatile, privacy-conscious solution.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Audio & Speech