Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/5332
Title: Voice activity detection for NIST speaker recognition evaluations
Authors: Yu, Hon-bill
Subjects: Automatic speech recognition.
Signal processing.
Hong Kong Polytechnic University -- Dissertations
Issue Date: 2012
Publisher: The Hong Kong Polytechnic University
Abstract: Since 2008, interview-style speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has a substantially lower signal-to-noise ratio, which necessitates robust voice activity detectors (VADs). This dissertation highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/non-speech segmentation in these files. To overcome these difficulties, this dissertation proposes using speech enhancement techniques as a pre-processing step for enhancing the reliability of energy-based and statistical-model-based VADs. A decision strategy is also proposed to overcome the undesirable effects caused by impulsive signals and sinusoidal background signals. The proposed VAD is compared with five popular VADs. 1. Average-Energy (AE)-Based VAD. This is an energy-based VAD with decisions governed by a linear combination of the average magnitude of background noises and signal peaks. 2. Automatic Speech Recognition (ASR) Transcripts. In this VAD, speech/non-speech decisions are based on the ASR transcripts provided by NIST. 3. VAD in the ETSI-AMR Option 2 Coder. This VAD is part of the Adaptive Multi-Rate (AMR) codec released by the European Telecommunications Standards Institute (ETSI). 4. Statistical-Model (SM)-Based VAD. This VAD assumes that the complex frequency components of signals and noises follow a Gaussian distribution and uses likelihood-ratio tests in the frequency domain for speech/non-speech decisions. 5. Gaussian-Mixture-Model (GMM)-Based VAD. This is an extension of the statistical-model-based VAD, which considers the long-term temporal information and harmonic structure in noisy speech. These five VADs have been evaluated on the NIST 2010 dataset. The comparison of VADs leads to seven findings: 1. Noise reduction is vital for VAD under extremely low SNR; 2. Removal of the sinusoidal background noise is of primary importance, as this kind of background signal could lead to many false detections in the AE-based VAD; 3. A reliable threshold strategy is required to address the impulsive signals; 4. ASR transcripts provided by NIST do not produce accurate speech and non-speech segmentations; 5. Spectral subtraction contributes to both AE- and SM-based VADs; 6. Spectral subtraction makes better use of background spectra than the likelihood-ratio tests in the SM-based VAD; and 7. The proposed SS+AE-VAD outperforms the SM-based VAD, the GMM-based VAD, the AMR speech coder, and the ASR transcripts provided by the NIST SRE Workshop.
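Illustrative note: the abstract's combination of spectral subtraction with an average-energy decision (SS+AE) can be sketched roughly as follows. This is not the dissertation's implementation. The function names, the assumption that the first few frames contain only background noise, the spectral floor, and the weight `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def spectral_subtraction(frames, n_noise_frames=10, floor=0.01):
    """Subtract an average noise magnitude spectrum from every frame.

    Assumption: the first `n_noise_frames` frames contain noise only.
    """
    window = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    noise_mag = mag[:n_noise_frames].mean(axis=0)
    # Keep a small spectral floor so magnitudes never become negative.
    return np.maximum(mag - noise_mag, floor * mag)

def energy_vad(clean_mag, n_noise_frames=10, alpha=0.1):
    """Mark frames as speech when their average magnitude exceeds a threshold.

    The threshold is a linear combination of the background level and the
    signal peak; `alpha` is an assumed weight, not a value from the thesis.
    """
    frame_mag = clean_mag.mean(axis=1)
    background = frame_mag[:n_noise_frames].mean()
    peak = frame_mag.max()
    threshold = (1.0 - alpha) * background + alpha * peak
    return frame_mag > threshold
```

Because the threshold in this sketch depends on a single peak value, one loud click can raise it for the whole file, which is one way to read the abstract's point that impulsive signals call for a more reliable threshold strategy.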
Description: 66 leaves : ill. (some col.) ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577M EIE 2012 Yu
Rights: All rights reserved.
Type: Thesis
URI: http://hdl.handle.net/10397/5332
Appears in Collections: EIE Theses
PolyU Electronic Theses

Files in This Item:
File                Description                    Size     Format
b25073552_link.htm  For PolyU Users                162 B    HTML
b25073552_ir.pdf    For All Users (Non-printable)  1.89 MB  Adobe PDF


All items in the PolyU Institutional Repository are protected by copyright, with all rights reserved, unless otherwise indicated. No item in the PolyU IR may be reproduced for commercial or resale purposes.