Edinburgh Research Archive

Robustness and explainability in automatic speech recognition and speaker verification

Abstract

In recent years, speech-based artificial intelligence (AI) systems, such as Automatic Speech Recognition (ASR) and Speaker Verification (SV), have achieved remarkable performance and are now widely used in applications such as virtual assistants, security authentication, and healthcare. This progress has been largely driven by deep learning models, which offer powerful representation learning and have substantially boosted system accuracy. However, their black-box nature has also introduced new challenges. As these systems become increasingly integrated into real-world scenarios, concerns have emerged regarding their robustness (how well they perform under adversarial or unexpected conditions) and their explainability (whether their outputs can be meaningfully interpreted and trusted). These concerns are not only of practical relevance but are also reflected in emerging regulatory frameworks, such as the EU AI Act, which emphasize the need for AI systems to be both robust and explainable. While explainability research has advanced significantly in computer vision and natural language processing (NLP), the speech domain remains comparatively underexplored. This thesis addresses this gap by investigating both robustness and explainability in the context of ASR and SV.

The first aspect we investigate is robustness. A robust model should produce consistent outputs under small, imperceptible changes to the input. To examine this property in ASR systems, we construct adversarial perturbations that sound nearly the same to human listeners but result in different transcriptions. We make use of a property of human hearing: louder sounds in a frequency band can mask softer ones occurring at the same time. By placing perturbations under these stronger parts of the signal, we create audio that is indistinguishable to humans but still misleads the ASR system. This approach exposes the fragility of ASR models under subtle changes that are inaudible to humans and serves as a tool to evaluate their robustness (a minimal code sketch of the masking constraint appears below).

Having examined robustness, we next focus on explainability. We begin by proposing XASR, a modular and explainable ASR architecture that integrates multiple post-hoc explanation methods, that is, techniques applied after model inference to interpret its predictions, such as LIME, SFL, and Causal, to analyze model predictions at the input level. These methods aim to highlight which parts of the input signal, such as specific time frames, are most responsible for the system's output. However, there is no standard ground truth for what constitutes a correct explanation. As a result, many evaluations rely on user studies, which are subjective and difficult to reproduce. This limits our ability to assess the validity of post-hoc methods.

To examine the validity of post-hoc explanations on ASR, we use a phoneme recognition model built with Kaldi and trained on the TIMIT dataset. This setup offers ground-truth frame-level phoneme alignments and a simple, controllable architecture, making it ideal for systematic analysis. We apply standard LIME and propose two speech-specific variants, LIME-WS (Window Segment) and LIME-TS (Time Segment), which restrict perturbations to a narrow temporal window around the phoneme of interest. This localized focus better reflects the temporal nature of speech. By comparing these outputs to ground-truth phoneme boundaries, we quantitatively assess whether the methods accurately highlight the input regions responsible for specific predictions (a LIME-style sketch over time segments also follows below).
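The masking constraint mentioned above can be sketched minimally: a perturbation is scaled, bin by bin in the time-frequency plane, so that it always stays a fixed margin below the signal's own magnitude and therefore hides under the louder components. Everything here is an illustrative assumption rather than the thesis's attack: the 20 dB margin, the STFT settings, and the random stand-in waveform are placeholders, and a real attack would additionally optimize the perturbation against the ASR model's loss.

```python
# Minimal sketch of a masking-based perturbation constraint (illustrative only).
import numpy as np
from scipy.signal import stft, istft

def shape_under_masker(signal, noise, fs=16000, nperseg=512, margin_db=20.0):
    """Scale `noise` per time-frequency bin so its STFT magnitude stays
    `margin_db` below the signal's magnitude, i.e. under the louder parts."""
    _, _, S = stft(signal, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)
    allowed = np.abs(S) * 10.0 ** (-margin_db / 20.0)    # per-bin budget
    scale = np.minimum(1.0, allowed / (np.abs(N) + 1e-12))
    _, shaped = istft(N * scale, fs=fs, nperseg=nperseg)
    out = np.zeros_like(signal)                          # match original length
    n = min(len(out), len(shaped))
    out[:n] = shaped[:n]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)        # stand-in for a 1 s, 16 kHz waveform
delta = shape_under_masker(x, rng.standard_normal(16000))
adv = x + delta                       # candidate adversarial input
```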
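The time-segment LIME idea admits a similarly small sketch: mask random subsets of fixed-width segments, query the recognizer for the target phoneme's score, and fit a linear surrogate whose weights rank the segments. The `phoneme_score` function, segment width, and sample count below are hypothetical placeholders (the thesis uses a Kaldi phoneme recognizer, and LIME-WS/LIME-TS restrict perturbations to a window around the phoneme of interest, which this sketch omits for brevity); with TIMIT alignments, the top-weighted segments would be compared against the ground-truth phoneme boundaries.

```python
# Minimal LIME-style explanation over time segments (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge

def lime_time_segments(x, phoneme_score, seg_len=400, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    n_seg = len(x) // seg_len
    masks = rng.integers(0, 2, size=(n_samples, n_seg))   # 1 = keep segment
    scores = np.empty(n_samples)
    for i, m in enumerate(masks):
        x_pert = x.copy()
        for j, keep in enumerate(m):
            if not keep:                                  # silence dropped segments
                x_pert[j * seg_len:(j + 1) * seg_len] = 0.0
        scores[i] = phoneme_score(x_pert)
    surrogate = Ridge(alpha=1.0).fit(masks, scores)
    return surrogate.coef_                                # importance per segment

# Toy check: a fake "recognizer" that only listens to samples 2000:2400
# (segment 5), so the explanation should rank that segment highest.
x = np.random.default_rng(1).standard_normal(4000)
fake_score = lambda w: float(np.abs(w[2000:2400]).mean())
weights = lime_time_segments(x, fake_score)
print(int(np.argmax(weights)))                            # expect 5
```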
However, such methods only analyze the input-output relationship from the outside, without revealing how the model actually processes the input internally. This limitation motivates a shift toward intrinsic explainability, that is, designing models whose internal representations are interpretable by construction, without relying on external post-hoc methods. We explore this idea by building explainable representations within the model itself.

To achieve this goal, we turn to speaker verification, which offers a more natural setting for exploring intrinsic explainability. Unlike ASR, which relies on detailed frame-level acoustic modeling, SV relies more on high-level speaker traits. This makes it more natural to connect model behavior with human-understandable attributes, such as accent, nationality, or profession. We propose a concept-based SV model in which each intermediate dimension corresponds to a human-understandable attribute supervised using annotated metadata. This structure allows the model to express verification decisions in explainable terms, such as identifying speaker similarity based on shared accent (a minimal sketch of such a concept bottleneck is given below).

Through these contributions (designing adversarial frequency-masking attacks, developing the XASR architecture with post-hoc explanation integration, validating post-hoc methods in controlled phoneme-level settings, and introducing a concept-based intrinsically explainable model for SV), this thesis lays a foundation for developing speech systems that are not only effective, but also robust, explainable, compliant with emerging regulations, and more likely to earn user trust.
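The concept-based SV design can be sketched as a concept bottleneck: an encoder feeds small per-attribute heads whose activations form the interpretable dimensions and would be supervised with cross-entropy against annotated metadata, and a verification decision can then be decomposed into per-concept similarities. All names, dimensions, and class counts below (`ConceptSV`, the attribute list) are illustrative assumptions, not the thesis's architecture.

```python
# Minimal sketch of a concept-bottleneck speaker-verification head.
import torch
import torch.nn as nn
import torch.nn.functional as F

CONCEPTS = ["accent", "nationality", "profession"]        # annotated attributes

class ConceptSV(nn.Module):
    def __init__(self, feat_dim=80, n_classes_per_concept=(8, 12, 5)):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        # One head per concept; in training each head would receive a
        # cross-entropy loss against its metadata label.
        self.heads = nn.ModuleList(
            [nn.Linear(128, c) for c in n_classes_per_concept])

    def forward(self, feats):
        h = self.encoder(feats)
        return [head(h) for head in self.heads]           # per-concept logits

def explain_pair(model, a, b):
    """Per-concept cosine similarity: which shared traits drive the match."""
    with torch.no_grad():
        za, zb = model(a), model(b)
    return {name: F.cosine_similarity(pa, pb, dim=-1).item()
            for name, pa, pb in zip(CONCEPTS, za, zb)}

model = ConceptSV()
utt_a, utt_b = torch.randn(1, 80), torch.randn(1, 80)     # stand-in features
print(explain_pair(model, utt_a, utt_b))  # e.g. a match driven by "accent"
```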
