Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/5332
Title: Voice activity detection for NIST speaker recognition evaluations
Authors: Yu, Hon-bill
Keywords: Automatic speech recognition.
Signal processing.
Hong Kong Polytechnic University -- Dissertations
Issue Date: 2012
Publisher: The Hong Kong Polytechnic University
Abstract: Since 2008, interview-style speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has a substantially lower signal-to-noise ratio (SNR), which necessitates robust voice activity detectors (VADs). This dissertation highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/non-speech segmentation in these files. To overcome these difficulties, this dissertation proposes using speech enhancement techniques as a pre-processing step for enhancing the reliability of energy-based and statistical-model-based VADs. A decision strategy is also proposed to overcome the undesirable effects caused by impulsive signals and sinusoidal background signals. The proposed VAD is compared with five popular VADs:
1. Average-Energy (AE)-Based VAD. This is an energy-based VAD with decisions governed by a linear combination of the average magnitudes of the background noise and the signal peaks.
2. Automatic Speech Recognition (ASR) Transcripts. In this VAD, speech/non-speech decisions are based on the ASR transcripts provided by NIST.
3. VAD in the ETSI-AMR Option 2 Coder. This VAD is part of the Adaptive Multi-Rate (AMR) codec released by the European Telecommunications Standards Institute (ETSI).
4. Statistical-Model (SM)-Based VAD. This VAD assumes that the complex frequency components of the signal and the noise follow Gaussian distributions and uses likelihood-ratio tests in the frequency domain for speech/non-speech decisions.
5. Gaussian-Mixture-Model (GMM)-Based VAD. This is an extension of the statistical-model-based VAD that considers the long-term temporal information and harmonic structure in noisy speech.
These five VADs have been evaluated on the NIST 2010 dataset. The comparison of VADs leads to seven findings:
1. Noise reduction is vital for VAD under extremely low SNR;
2. Removal of the sinusoidal background noise is of primary importance, as this kind of background signal could lead to many false detections in the AE-based VAD;
3. A reliable threshold strategy is required to address the impulsive signals;
4. ASR transcripts provided by NIST do not produce accurate speech and non-speech segmentations;
5. Spectral subtraction contributes to both AE- and SM-based VADs;
6. Spectral subtraction makes better use of background spectra than the likelihood-ratio tests in the SM-based VAD; and
7. The proposed SS+AE-VAD outperforms the SM-based VAD, the GMM-based VAD, the AMR speech coder, and the ASR transcripts provided by the NIST SRE Workshop.
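As an illustration only, the energy-based decision described for the Average-Energy (AE)-Based VAD in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the dissertation's implementation: the function names, the non-overlapping 160-sample frames, the weight `alpha`, and the way the background and peak levels are estimated (averaging the quietest and loudest 10% of frames) are all hypothetical choices made for the example.

```python
import random


def frame_average_magnitude(signal, frame_len):
    """Average absolute amplitude of each non-overlapping frame."""
    n_frames = len(signal) // frame_len
    return [
        sum(abs(x) for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def ae_vad(signal, frame_len=160, alpha=0.1, tail_fraction=0.1):
    """Return one True/False speech decision per frame.

    The threshold is a linear combination of an estimated background-noise
    level and an estimated peak level. Both estimators and the default
    parameter values are assumptions made for this sketch.
    """
    mags = frame_average_magnitude(signal, frame_len)
    if not mags:
        return []
    ranked = sorted(mags)
    k = max(1, int(len(ranked) * tail_fraction))
    noise_level = sum(ranked[:k]) / k    # mean of the quietest frames
    peak_level = sum(ranked[-k:]) / k    # mean of the loudest frames
    threshold = alpha * peak_level + (1.0 - alpha) * noise_level
    return [m > threshold for m in mags]


# Synthetic input: low-level noise, a louder burst standing in for speech,
# then low-level noise again (10 frames each).
rng = random.Random(0)
quiet = [rng.gauss(0.0, 0.01) for _ in range(1600)]
loud = [rng.gauss(0.0, 0.5) for _ in range(1600)]
decisions = ae_vad(quiet + loud + quiet)
```

A detector of this kind depends entirely on the two level estimates, which is consistent with the abstract's remarks that impulsive signals and sinusoidal background noise call for a more careful threshold strategy and for noise reduction before the decision.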
Description: 66 leaves : ill. (some col.) ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577M EIE 2012 Yu
URI: http://hdl.handle.net/10397/5332
Rights: All rights reserved.
Appears in Collections: Thesis

Files in This Item:
File                  Description                     Size      Format
b25073552_link.htm    For PolyU Users                 162 B     HTML
b25073552_ir.pdf      For All Users (Non-printable)   1.89 MB   Adobe PDF
