Edinburgh Research Archive

Robustness and explainability in automatic speech recognition and speaker verification

dc.contributor.advisor
Rajan, Ajitha
dc.contributor.advisor
Bell, Peter
dc.contributor.author
Wu, Xiaoliang
dc.date.accessioned
2026-03-13T13:03:56Z
dc.date.issued
2026-07-11
dc.description.abstract
In recent years, speech-based artificial intelligence (AI) systems, such as Automatic Speech Recognition (ASR) and Speaker Verification (SV), have achieved remarkable performance and are now widely used in applications such as virtual assistants, security authentication, and healthcare. This progress has been largely driven by deep learning models, which offer powerful representation learning and have substantially boosted system accuracy. However, their black-box nature has also introduced new challenges. As these systems become increasingly integrated into real-world scenarios, concerns have emerged regarding their robustness (how well they perform under adversarial or unexpected conditions) and their explainability (whether their outputs can be meaningfully interpreted and trusted). These concerns are not only of practical relevance but are also reflected in emerging regulatory frameworks, such as the EU AI Act, which emphasize the need for AI systems to be both robust and explainable. While explainability research has advanced significantly in computer vision and natural language processing (NLP), the speech domain remains comparatively underexplored. This thesis addresses this gap by investigating both robustness and explainability in the context of ASR and SV.

The first aspect we investigate is robustness. A robust model should produce consistent outputs under small, imperceptible changes to the input. To examine this property in ASR systems, we construct adversarial perturbations that sound nearly the same to human listeners but result in different transcriptions. We exploit a property of human hearing: louder sounds in a frequency band can mask softer ones occurring at the same time. By placing perturbations under these stronger parts of the signal, we create audio that is indistinguishable to humans but still misleads the ASR system. This approach exposes the fragility of ASR models under subtle changes that are inaudible to humans and serves as a tool to evaluate their robustness.

Having examined robustness, we next focus on explainability. We begin by proposing XASR, a modular and explainable ASR architecture that integrates multiple post-hoc explanation methods (techniques applied after model inference to interpret its predictions, such as LIME, SFL, and Causal) to analyze model predictions at the input level. These methods aim to highlight which parts of the input signal, such as specific time frames, are most responsible for the system’s output. However, there is no standard ground truth for what constitutes a correct explanation. As a result, many evaluations rely on user studies, which are subjective and difficult to reproduce. This limits our ability to assess the validity of post-hoc methods.

To examine the validity of post-hoc explanations on ASR, we use a phoneme recognition model built with Kaldi and trained on the TIMIT dataset. This setup offers ground-truth frame-level phoneme alignments and a simple, controllable architecture, making it ideal for systematic analysis. We apply standard LIME and propose two speech-specific variants, LIME-WS (Window Segment) and LIME-TS (Time Segment), which restrict perturbations to a narrow temporal window around the phoneme of interest. This localized focus better reflects the temporal nature of speech. By comparing these outputs to ground-truth phoneme boundaries, we quantitatively assess whether the methods accurately highlight the input regions responsible for specific predictions.
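To make the frequency-masking idea above concrete, the following is a minimal sketch of shaping a perturbation so that, in every time-frequency bin, it stays a fixed margin below the energy of the original signal that masks it. The FFT settings, the 20 dB margin, and the use of librosa are illustrative assumptions; the attack developed in the thesis derives its constraints from psychoacoustic masking rather than this crude energy heuristic.

    # Sketch: keep an adversarial perturbation under the signal's own
    # energy per time-frequency bin (a crude stand-in for a masking curve).
    import numpy as np
    import librosa

    def masked_perturbation(audio, noise, margin_db=20.0, n_fft=512, hop=128):
        S = librosa.stft(audio, n_fft=n_fft, hop_length=hop)
        N = librosa.stft(noise, n_fft=n_fft, hop_length=hop)
        # Per-bin ceiling: the signal's magnitude attenuated by the margin.
        ceiling = np.abs(S) * 10.0 ** (-margin_db / 20.0)
        scale = np.minimum(1.0, ceiling / (np.abs(N) + 1e-9))
        shaped = librosa.istft(N * scale, hop_length=hop, length=len(audio))
        return audio + shaped

A candidate noise signal (for example, one optimized to change the transcription) would be passed through such a shaping step on each attack iteration, so the final perturbation remains hidden under the louder parts of the utterance.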
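The LIME-TS variant described above can likewise be sketched in a few lines: perturbations are confined to short segments inside the window around the target phoneme, and a linear surrogate fitted to the model’s responses scores each segment’s importance. The segment count, the sampling scheme, and the predict_phoneme_prob wrapper are hypothetical choices for illustration, not the thesis implementation.

    # Sketch of a LIME-TS-style explanation restricted to a temporal window.
    import numpy as np
    from sklearn.linear_model import Ridge

    def lime_ts(audio, predict_phoneme_prob, start, end, n_seg=10, n_samples=200):
        edges = np.linspace(start, end, n_seg + 1, dtype=int)
        masks = np.random.randint(0, 2, size=(n_samples, n_seg))  # segments on/off
        scores = np.empty(n_samples)
        for i, mask in enumerate(masks):
            perturbed = audio.copy()
            for j, keep in enumerate(mask):
                if not keep:                      # silence this segment only
                    perturbed[edges[j]:edges[j + 1]] = 0.0
            scores[i] = predict_phoneme_prob(perturbed)
        surrogate = Ridge(alpha=1.0).fit(masks, scores)
        return surrogate.coef_  # per-segment importance within the window

Because the window is anchored to the phoneme’s alignment, the returned weights can be compared directly against ground-truth phoneme boundaries, which is what enables the quantitative evaluation described above.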
However, such methods only analyze the input-output relationship from the outside, without revealing how the model actually processes the input internally. This limitation motivates a shift toward intrinsic explainability, that is, designing models whose internal representations are interpretable by construction, without relying on external post-hoc methods. We explore this idea by building explainable representations within the model itself. To achieve this goal, we turn to speaker verification, which offers a more natural setting for exploring intrinsic explainability. Unlike ASR, which relies on detailed frame-level acoustic modeling, SV depends more on high-level speaker traits. This makes it more natural to connect model behavior with human-understandable attributes, such as accent, nationality, or profession. We propose a concept-based SV model in which each intermediate dimension corresponds to a human-understandable attribute supervised using annotated metadata. This structure allows the model to express verification decisions in explainable terms, such as identifying speaker similarity based on shared accent.

Through these contributions (designing adversarial frequency-masking attacks, developing the XASR architecture with post-hoc explanation integration, validating post-hoc methods in controlled phoneme-level settings, and introducing a concept-based intrinsic explainability model for SV), this thesis lays a foundation for developing speech systems that are not only effective, but also robust, explainable, compliant with emerging regulations, and more likely to earn user trust.
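As a rough illustration of the concept-based design, the sketch below ties fixed-width slices of a speaker embedding to per-attribute classifiers, so that each slice is supervised by annotated metadata during training. The layer sizes, the 32-dimensional slice width, and the attribute set are assumptions made for illustration and do not reflect the thesis architecture.

    # Sketch: a speaker embedding whose slices are supervised as concepts.
    import torch
    import torch.nn as nn

    class ConceptSV(nn.Module):
        def __init__(self, feat_dim=80, concepts={"accent": 10, "nationality": 30}):
            super().__init__()
            # Embedding dim must equal 32 * number of concepts (here 2 * 32 = 64).
            self.encoder = nn.Sequential(
                nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 64))
            # One classifier per concept, supervising its own embedding slice.
            self.heads = nn.ModuleDict(
                {name: nn.Linear(32, n_cls) for name, n_cls in concepts.items()})

        def forward(self, feats):                   # feats: (batch, time, feat_dim)
            emb = self.encoder(feats.mean(dim=1))   # utterance-level embedding
            slices = emb.split(32, dim=-1)          # one slice per concept
            logits = {name: head(s)
                      for (name, head), s in zip(self.heads.items(), slices)}
            return emb, logits  # emb for verification, logits for concept losses

At verification time the full embedding is scored as usual, while per-slice similarities can be reported in attribute terms, for example that two utterances match largely because of a shared accent.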
dc.identifier.uri
https://era.ed.ac.uk/handle/1842/44479
dc.identifier.uri
https://doi.org/10.7488/era/6996
dc.language.iso
en
dc.publisher
The University of Edinburgh
dc.relation.references
Xiaoliang Wu, Peter Bell, and Ajitha Rajan. 2023a. “Can We Trust Explainable AI Methods on ASR? An Evaluation on Phoneme Recognition”. arXiv:2305.18011 [cs.CL] (accepted at ICASSP 2024).
dc.relation.references
Xiaoliang Wu, Peter Bell, and Ajitha Rajan. 2023b. “Explanations for automatic speech recognition”. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023).
dc.relation.references
Xiaoliang Wu and Ajitha Rajan. 2022. “Catch Me If You Can: Blackbox Adversarial Attacks on Automatic Speech Recognition using Frequency Masking”. In Proceedings of the 29th Asia-Pacific Software Engineering Conference (APSEC).
dc.relation.references
Xiaoliang Wu. 2022. “Blackbox adversarial attacks and explanations for automatic speech recognition”. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1765–1769.
dc.relation.references
Ke Liu, Shangde Gao, Xiaoliang Wu, et al. 2025. “Mat-Instructions: A large-scale inorganic material instruction dataset for large language models”. In Proceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI-25), 1–9.
dc.relation.references
Xiaoliang Wu, Chau Luu, Peter Bell, and Ajitha Rajan. 2024. “Explainable attribute-based speaker verification”. arXiv:2405.19796. URL https://arxiv.org/abs/2405.19796.
dc.subject
speech-based artificial intelligence systems
dc.subject
Automatic Speech Recognition
dc.subject
Speaker Verification
dc.subject
robustness
dc.subject
explainability
dc.subject
voice-based AI systems
dc.subject
transparency
dc.title
Robustness and explainability in automatic speech recognition and speaker verification
dc.type
Thesis
dc.type.qualificationlevel
Doctoral
dc.type.qualificationname
PhD Doctor of Philosophy

Files

Original bundle

Name:
Wu2026.pdf
Size:
5.84 MB
Format:
Adobe Portable Document Format
