Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/83128
Title: | Voice activity detection for NIST speaker recognition evaluations |
Authors: | Yu, Hon-bill |
Degree: | M.Phil. |
Issue Date: | 2012 |
Abstract: | Since 2008, interview-style speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has a substantially lower signal-to-noise ratio (SNR), which necessitates robust voice activity detectors (VADs). This dissertation highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/non-speech segmentation in these files. To overcome these difficulties, this dissertation proposes using speech enhancement techniques as a pre-processing step for enhancing the reliability of energy-based and statistical-model-based VADs. A decision strategy is also proposed to overcome the undesirable effects caused by impulsive signals and sinusoidal background signals. The proposed VAD is compared with five popular VADs:
1. Average-Energy (AE)-Based VAD. This is an energy-based VAD with decisions governed by a linear combination of the average magnitude of the background noise and the signal peaks.
2. Automatic Speech Recognition (ASR) Transcripts. In this VAD, speech/non-speech decisions are based on the ASR transcripts provided by NIST.
3. VAD in the ETSI-AMR Option 2 Coder. This VAD is part of the Adaptive Multi-Rate (AMR) codec released by the European Telecommunications Standards Institute (ETSI).
4. Statistical-Model (SM)-Based VAD. This VAD assumes that the complex frequency components of signals and noise follow a Gaussian distribution and uses likelihood-ratio tests in the frequency domain for speech/non-speech decisions.
5. Gaussian-Mixture-Model (GMM)-Based VAD. This is an extension of the statistical-model-based VAD that considers the long-term temporal information and harmonic structure in noisy speech.
These five VADs have been evaluated on the NIST 2010 dataset. The comparison of VADs leads to seven findings:
1. Noise reduction is vital for VAD under extremely low SNR;
2. Removal of the sinusoidal background noise is of primary importance, as this kind of background signal could lead to many false detections in the AE-based VAD;
3. A reliable threshold strategy is required to address the impulsive signals;
4. ASR transcripts provided by NIST do not produce accurate speech and non-speech segmentations;
5. Spectral subtraction contributes to both AE- and SM-based VADs;
6. Spectral subtraction makes better use of background spectra than the likelihood-ratio tests in the SM-based VAD; and
7. The proposed SS+AE-VAD outperforms the SM-based VAD, the GMM-based VAD, the AMR speech coder, and the ASR transcripts provided by the NIST SRE Workshop. |
Subjects: | Automatic speech recognition. Signal processing. Hong Kong Polytechnic University -- Dissertations |
Pages: | 66 leaves : ill. (some col.) ; 30 cm. |
Appears in Collections: | Thesis |
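The abstract describes the AE-based VAD as thresholding on a linear combination of the average background-noise magnitude and the signal peaks. The following is a minimal sketch of that idea only, not the thesis's implementation. The function name `ae_vad`, the weight `alpha`, the frame length, and the use of percentiles to estimate the noise and peak levels are all assumptions made for illustration.

```python
import numpy as np

def ae_vad(signal, frame_len=160, alpha=0.9, noise_pct=10, peak_pct=99):
    """Label each frame as speech (True) or non-speech (False).

    All parameter names and defaults are hypothetical; the abstract
    does not specify them.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    # Split into non-overlapping frames, dropping any trailing partial frame.
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Average magnitude per frame.
    avg_mag = np.abs(frames).mean(axis=1)
    # Assumed estimators: a low percentile for the background-noise level,
    # a high percentile for the signal-peak level.
    noise_level = np.percentile(avg_mag, noise_pct)
    peak_level = np.percentile(avg_mag, peak_pct)
    # Decision threshold: linear combination of noise level and peak level.
    threshold = alpha * noise_level + (1.0 - alpha) * peak_level
    return avg_mag > threshold
```

A threshold of this form adapts to each recording, but the abstract notes that impulsive and sinusoidal background signals can mislead it, which is why the thesis adds spectral subtraction and a dedicated decision strategy.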
Access
View full-text via https://theses.lib.polyu.edu.hk/handle/200/6515