Learning speaker-specific characteristics with deep neural architecture
General material designation
[Thesis]
Name of first author
Salman, Ahmad
Names of other contributors
Chen, Ke
Publication, distribution, etc.
Name of publisher, distributor, etc.
University of Manchester
Date of publication, distribution, etc.
2012
Thesis notes
Thesis details and type of degree
Thesis (Ph.D.)
Text rating
2012
Summary or abstract notes
Text of note
Robust speaker recognition (SR) has long been a focus of research. Advances in speech-aided technologies, especially biometrics, highlight the need for foolproof SR systems. However, the performance of an SR system depends critically on the quality of the speech features used to represent speaker-specific information. This research aims to extract speaker-specific information from Mel-frequency cepstral coefficients (MFCCs) using deep learning. Speech is a mixture of several information components, including linguistic, speaker-specific and emotional-state information. Robust performance in different speech-related tasks requires feature extraction for each information component. However, almost all forms of speech representation carry all of this information as a whole, which compromises the performance of SR systems. Motivated by the ability of deep architectures to solve complex problems by learning high-level, task-specific information from data, we propose a novel Deep Neural Architecture (DNA) to extract speaker-specific information (SI) from MFCCs, a popular frequency-domain representation of the speech signal. We adopt a two-stage learning strategy: unsupervised training for network initialisation, followed by regularised contrastive learning. To train the network in the second stage, we devise a contrastive loss function that discriminates between speakers on the basis of their intrinsic statistical patterns, which are distributed in the representations yielded by the deep network. This is achieved through contrastive pair-wise comparison of these representations for similar or dissimilar speakers. To improve generalisation and reduce the interference of environmental effects with the speaker-specific representation, we regularise the contrastive loss with the data reconstruction loss in a multi-objective optimisation.
A detailed study has been carried out to analyse the parameter space in training the proposed deep architecture for optimum performance. Finally, we compare the performance of our learned speaker-specific representations with several state-of-the-art techniques in speaker verification and speaker segmentation tasks. The results show that the representations acquired through the learned DNA are invariant and comparatively less sensitive to text, language and environmental variability.
Subject (topical name or phrase)
Uncontrolled subject term
Speaker Recognition ; Deep Learning
Personal name as main entry heading - (primary intellectual responsibility)