Unsupervised Spoken Term Discovery Based on Re-clustering of Hypothesized Speech Segments with Siamese and Triplet Networks
November 28, 2020 Β· Declared Dead Β· π arXiv.org
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Man-Ling Sung, Tan Lee
arXiv ID
2011.14062
Category
eess.AS: Audio & Speech
Cross-listed
cs.CL,
cs.SD
Citations
3
Venue
arXiv.org
Last Checked
3 months ago
Abstract
Spoken term discovery from untranscribed speech audio could be achieved via a two-stage process. In the first stage, the unlabelled speech is decoded into a sequence of subword units that are learned and modelled in an unsupervised manner. In the second stage, partial sequence matching and clustering are performed on the decoded subword sequences, resulting in a set of discovered words or phrases. A limitation of this approach is that the results of subword decoding could be erroneous, and the errors would impact the subsequent steps. While Siamese/Triplet network is one approach to learn segment representations that can improve the discovery process, the challenge in spoken term discovery under a complete unsupervised scenario is that training examples are unavailable. In this paper, we propose to generate training examples from initial hypothesized sequence clusters. The Siamese/Triplet network is trained on the hypothesized examples to measure the similarity between two speech segments and hereby perform re-clustering of all hypothesized subword sequences to achieve spoken term discovery. Experimental results show that the proposed approach is effective in obtaining training examples for Siamese and Triplet networks, improving the efficacy of spoken term discovery as compared with the original two-stage method.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Audio & Speech
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
LPCNet: Improving Neural Speech Synthesis Through Linear Prediction
R.I.P.
π»
Ghosted
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
R.I.P.
π»
Ghosted
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
R.I.P.
π»
Ghosted
Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders
R.I.P.
π»
Ghosted
Utterance-level Aggregation For Speaker Recognition In The Wild
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted