Title: Voice activity detection for NIST speaker recognition evaluations
Authors: Yu, Hon-bill
Degree: M.Phil.
Issue Date: 2012
Abstract: Since 2008, interview-style speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has a substantially lower signal-to-noise ratio (SNR), which necessitates robust voice activity detectors (VADs). This dissertation highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/non-speech segmentation on these files. To overcome these difficulties, it proposes using speech enhancement techniques as a pre-processing step to improve the reliability of energy-based and statistical-model-based VADs. A decision strategy is also proposed to overcome the undesirable effects caused by impulsive signals and sinusoidal background signals.
The proposed VAD is compared with five popular VADs:
1. Average-Energy (AE)-Based VAD. This is an energy-based VAD whose decisions are governed by a linear combination of the average magnitude of the background noise and the signal peaks.
2. Automatic Speech Recognition (ASR) Transcripts. In this VAD, speech/non-speech decisions are based on the ASR transcripts provided by NIST.
3. VAD in the ETSI-AMR Option 2 Coder. This VAD is part of the Adaptive Multi-Rate (AMR) codec released by the European Telecommunications Standards Institute (ETSI).
4. Statistical-Model (SM)-Based VAD. This VAD assumes that the complex frequency components of the signal and the noise follow a Gaussian distribution and uses likelihood-ratio tests in the frequency domain for speech/non-speech decisions.
5. Gaussian-Mixture-Model (GMM)-Based VAD. This is an extension of the statistical-model-based VAD that considers the long-term temporal information and harmonic structure in noisy speech.
These five VADs have been evaluated on the NIST 2010 dataset. The comparison of VADs leads to seven findings:
1. Noise reduction is vital for VAD under extremely low SNR;
2. Removal of the sinusoidal background noise is of primary importance, as this kind of background signal could lead to many false detections in the AE-based VAD;
3. A reliable threshold strategy is required to address the impulsive signals;
4. ASR transcripts provided by NIST do not produce accurate speech and non-speech segmentations;
5. Spectral subtraction contributes to both AE- and SM-based VADs;
6. Spectral subtraction makes better use of background spectra than the likelihood-ratio tests in the SM-based VAD; and
7. The proposed SS+AE-VAD outperforms the SM-based VAD, the GMM-based VAD, the AMR speech coder, and the ASR transcripts provided by the NIST SRE Workshop.
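Illustrative note: the AE-based decision described in the abstract (a threshold formed as a linear combination of the background-noise magnitude and the signal peaks) could be sketched roughly as follows. This is a minimal sketch and not the dissertation's implementation: the function names, the frame length, the weight `alpha`, and the way the noise and peak levels are estimated (`tail_fraction`) are all assumptions made for illustration, and the spectral-subtraction pre-processing step is omitted.

```python
from typing import List, Sequence


def frame_average_magnitude(samples: Sequence[float], frame_len: int) -> List[float]:
    """Average absolute amplitude of each non-overlapping frame."""
    return [
        sum(abs(x) for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def ae_vad(samples: Sequence[float], frame_len: int = 160,
           alpha: float = 0.1, tail_fraction: float = 0.1) -> List[bool]:
    """Return one speech (True) / non-speech (False) decision per frame.

    Assumed scheme: the noise level is the mean magnitude of the quietest
    frames, the peak level is the mean magnitude of the loudest frames, and
    the threshold is a weighted sum of the two.
    """
    mags = frame_average_magnitude(samples, frame_len)
    if not mags:
        return []
    ranked = sorted(mags)
    n_tail = max(1, int(len(ranked) * tail_fraction))
    noise_level = sum(ranked[:n_tail]) / n_tail   # quietest frames
    peak_level = sum(ranked[-n_tail:]) / n_tail   # loudest frames
    threshold = alpha * peak_level + (1.0 - alpha) * noise_level
    return [m > threshold for m in mags]


if __name__ == "__main__":
    # Synthetic signal: quiet, loud, quiet (constant amplitudes for clarity).
    signal = [0.01] * 320 + [0.5] * 320 + [0.01] * 160
    print(ae_vad(signal))
```

A threshold of this form follows the loudest frames, which is one way to see why the abstract calls for a separate decision strategy for impulsive signals: a single loud click raises the peak estimate and can push the threshold above genuine speech.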
Subjects: Automatic speech recognition.
Signal processing.
Hong Kong Polytechnic University -- Dissertations
Pages: 66 leaves : ill. (some col.) ; 30 cm.
Appears in Collections: Thesis

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.