PolyU IR
 

PolyU Institutional Repository >
Electronic and Information Engineering >
EIE Theses >

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/3880

Title: Feature and model transformation techniques for robust speaker verification
Authors: Yiu, Kwok-kwong Michael
Subjects: Hong Kong Polytechnic University -- Dissertations
Automatic speech recognition
Issue Date: 2005
Publisher: The Hong Kong Polytechnic University
Abstract: Speaker verification is to verify the identity of a speaker based on his or her own voice. It has potential applications in securing remote access services such as phone-banking and mobile-commerce. While today's speaker verification systems perform reasonably well under controlled conditions, their performance is often compromised under real-world environments. In particular, variations in handset characteristics are known to be the major cause of performance degradation. This dissertation addresses the robustness issue of speaker verification systems in three different angles: speaker modeling, feature transformation, and model transformation. This dissertation begins with an investigation on the effectiveness of three kernel-based neural networks for speaker modeling. These networks include probabilistic decision-based neural networks (PDBNNs), Gaussian mixture models (GMMs), and elliptical basis function networks (EBFNs). Based on the thresholding mechanism of PDBNNs, the original training algorithm of PDBNNs was modified to make PDBNNs appropriate for speaker verification. Experimental results show that GMM- and PDBNN-based speaker models outperform the EBFN ones in both clean and noisy environments. It was also found that the modified learning algorithm of PDBNNs is able to find decision thresholds that reduce the variation in false acceptance rates, whereas the ad hoc threshold-determination approach used by the EBFNs and GMMs causes a large variation in the false acceptance rates. This property makes the performance of PDBNN-based systems more predictable.
The effect of handset variation can be suppressed by transforming clean speech models to fit the handset-distorted speech. To this end, this dissertation proposes a model-based transformation technique that combines handset-dependent model transformation and reinforced learning. Specifically, the approach transforms the clean speaker model and clean background model to fit the distorted speech by using maximum-likelihood linear regression (MLLR), which is followed by adapting the transformed models via PDBNN's reinforced learning. It was found that MLLR is able to bring the clean models to a region close to the distorted speech and that reinforced learning is a good means of fine-tuning the transformed models to enhance the distinction between client speakers and impostors. In addition to model-based approaches, handset variation can also be suppressed by feature-based approaches. Current feature-based approaches typically identify the handset being used as one of the known handsets in a handset database and use the a priori knowledge about the identified handset to modify the features. However, it will be much more practical and cost effective if handset detector-free systems are adopted. To this end, this dissertation proposes a blind compensation algorithm to handle the situation in which no a priori knowledge about the handset is available (i.e., a handset model which is not in the handset database is used). Specifically, a composite statistical model formed by the fusion of a speaker model and a background model is used to represent the characteristics of enrollment speech. Based on the difference between the claimant's speech and the composite model, a stochastic matching type of approach is proposed to transform the claimant's speech to a region close to the enrollment speech. Therefore, the algorithm can now estimate the transformation online without the necessity of detecting the handset types. Experimental results based on the 2001 NIST Speaker Recognition evaluation set show that the proposed approach achieves significant improvement in both equal error rate and minimum detection cost as compared to cepstral mean subtraction, Znorm, and short-time Gaussianization.
Degree: Ph.D., Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University, 2005.
Description: xvii, 158 leaves : ill. (some col.) ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577P EIE 2005 Yiu
Rights: All rights reserved.
Type: Thesis
URI: http://hdl.handle.net/10397/3880
Appears in Collections:EIE Theses
PolyU Electronic Theses

Files in This Item:

File Description SizeFormat
b18099191_ir.pdfFor All Users (Non-printable) 5.72 MBAdobe PDFView/Open
b18099191_link.htmFor PolyU Users 161 BHTMLView/Open



Facebook Facebook del.icio.us del.icio.us LinkedIn LinkedIn


All items in the PolyU Institutional Repository are protected by copyright, with all rights reserved, unless otherwise indicated.
No item in the PolyU IR may be reproduced for commercial or resale purposes.

 

© Pao Yue-kong Library, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Powered by DSpace (Version 1.5.2)  © MIT and HP
Feedback | Privacy Policy Statement | Copyright & Restrictions - Feedback