Dimensionality-Aware Anomaly Detection in Learned Representations of Self-Supervised Speech Models

May 04, 2026 · Grace Period · 🏛 Interspeech 2026

Authors Sandra Arcos-Holzinger, Sarah M. Erfani, James Bailey, Sanjeev Khudanpur arXiv ID 2605.02715 Category eess.AS: Audio & Speech Cross-listed cs.CR, cs.LG Citations 0 Venue Interspeech 2026

Abstract

Self-supervised speech models (S3Ms) achieve strong downstream performance, yet their learned representations remain poorly understood under natural and adversarial perturbations. Prior studies rely on representation similarity or global dimensionality, offering limited visibility into local geometric changes. We ask: how do perturbations deform local geometry, and do these shifts track downstream automatic speech recognition (ASR) degradation? To address this, we present GRIDS, a framework using Local Intrinsic Dimensionality (LID) across layer-wise representations in WavLM and wav2vec 2.0. We find that LID increases for all low signal-to noise ratio (SNR) perturbations and diverges at high SNR: benign noise converges toward the clean profile, while adversarial inputs retain early-layer LID elevation. We show LID elevation co-occurs with increased WER, and that layer-wise LID features enable anomaly detection (AUROC 0.78-1.00), opening the door to transcript-free monitoring in S3Ms.