Title: Utterance partitioning for supervector and i-vector speaker verification
Authors: Rao, Wei
Advisors: Mak, Man-wai (EIE)
Keywords: Automatic speech recognition
Issue Date: 2015
Publisher: The Hong Kong Polytechnic University
Abstract: In recent years, GMM-SVM and i-vectors with probabilistic linear discriminant analysis (PLDA) have become prominent approaches to text-independent speaker verification. The idea of GMM-SVM is to derive a GMM-supervector by stacking the mean vectors of a target-speaker-dependent, MAP-adapted GMM. The supervector is then presented to a speaker-dependent support vector machine (SVM) for scoring. A problematic issue with this approach, however, is the severe imbalance between the numbers of speaker-class and impostor-class utterances available for training the speaker-dependent SVMs. Unlike high-dimensional GMM-supervectors, i-vectors have the major advantage of representing speaker-dependent information in a low-dimensional space, which opens up opportunities for using statistical techniques such as linear discriminant analysis (LDA), within-class covariance normalization (WCCN), and PLDA to suppress channel and session variability. While these techniques have achieved state-of-the-art performance in recent NIST Speaker Recognition Evaluations (SREs), they require many training speakers, each providing a sufficient number of sessions, to train the transformation or loading matrices. Collecting such a corpus is expensive and inconvenient: in a typical training dataset, the number of speakers can be fairly large, but the number of speakers who can provide many sessions is quite limited. The lack of multiple sessions per speaker can cause numerical problems in the within-speaker scatter matrix, an issue known in the literature as the small sample-size problem.
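The GMM-supervector construction described above can be sketched as relevance-MAP adaptation of the UBM component means followed by stacking. This is a minimal illustration, not the thesis's implementation; the array shapes, the helper name, and the relevance factor of 16 are assumptions for the example.

```python
import numpy as np

def map_adapt_supervector(ubm_means, frame_posteriors, frames, relevance=16.0):
    """Sketch: MAP mean adaptation of a GMM, then supervector stacking.

    ubm_means:        (C, D) UBM component means.
    frame_posteriors: (T, C) component posteriors for each acoustic frame.
    frames:           (T, D) acoustic feature vectors of one utterance.
    """
    n = frame_posteriors.sum(axis=0)                      # (C,) soft occupation counts
    fx = frame_posteriors.T @ frames                      # (C, D) first-order statistics
    xbar = fx / np.maximum(n, 1e-10)[:, None]             # posterior-weighted frame means
    alpha = (n / (n + relevance))[:, None]                # data-dependent adaptation weights
    adapted = alpha * xbar + (1.0 - alpha) * ubm_means    # MAP-adapted component means
    return adapted.reshape(-1)                            # stack into a (C*D,) supervector
```

In a GMM-SVM system, this supervector (one per enrollment or test utterance) is what gets presented to the speaker-dependent SVM.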
Although the data imbalance problem and the small sample-size problem arise for different reasons, both can be overcome by the utterance partitioning and resampling technique proposed in this thesis. Specifically, the sequence order of the acoustic vectors in an enrollment utterance is first randomized; the randomized sequence is then partitioned into a number of segments. Each segment is used to compute a GMM-supervector or an i-vector. A desirable number of supervectors/i-vectors can be produced by repeating this randomization-and-partitioning process several times. This method is referred to as utterance partitioning with acoustic vector resampling (UP-AVR). Experiments on the NIST 2002, 2004, and 2010 SREs show that UP-AVR helps the SVM training algorithm find better decision boundaries, so that SVM scoring outperforms other speaker comparison methods such as cosine distance scoring. Furthermore, results demonstrate that UP-AVR enhances the capability of LDA and WCCN to suppress session variability, especially when the number of conversations per training speaker is limited.
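The randomize-then-partition procedure above can be sketched in a few lines. This is a schematic, assuming frames arrive as a (T, D) array; the function name, partition count, and repeat count are illustrative, and a real system would feed each returned segment to the supervector/i-vector extractor.

```python
import numpy as np

def up_avr(frames, num_partitions=4, num_repeats=3, rng=None):
    """Utterance partitioning with acoustic vector resampling (UP-AVR) sketch.

    frames: (T, D) array of acoustic feature vectors for one utterance.
    Returns a list of frame sub-sequences, one per generated segment.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    segments = [frames]                                   # keep the full utterance too
    for _ in range(num_repeats):
        shuffled = frames[rng.permutation(len(frames))]   # randomize frame order
        segments.extend(np.array_split(shuffled, num_partitions))
    return segments
```

Each repeat yields `num_partitions` new segments, so a single enrollment utterance can produce many training supervectors/i-vectors for the speaker class.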
This thesis also proposes a new channel compensation method called multi-way LDA, which uses not only the speaker labels but also the microphone labels of the training i-vectors to estimate the LDA projection matrix. The method was found to strengthen the discriminative capability of LDA and to overcome the small sample-size problem. To address the implicit use of background information in conventional PLDA scoring for i-vector speaker verification, this thesis proposes PLDA-SVM scoring, which uses empirical kernel maps to create a PLDA score space for each target speaker and trains an SVM that operates in this score space to produce verification scores. Given a test i-vector and the identity of the target speaker under test, a score vector is constructed by computing the PLDA scores of the test i-vector with respect to the target speaker's i-vectors and a set of nontarget speakers' i-vectors. As a result, the bases of the score space are divided into two parts: one defined by the target speaker's i-vectors and another defined by the nontarget speakers' i-vectors. To ensure a proper balance between the two parts, utterance partitioning is applied to create multiple target-speaker i-vectors from a single utterance or a small number of utterances. Under the new evaluation protocol introduced by NIST SRE, this thesis shows that PLDA-SVM scoring not only performs significantly better than conventional PLDA scoring and effectively utilizes the multiple enrollment utterances of target speakers, but also opens up opportunities for adopting sparse kernel machines in PLDA-based speaker verification systems. Specifically, this thesis shows that it is possible to take advantage of the empirical kernel maps by incorporating them into a more advanced kernel machine, the relevance vector machine (RVM). Experiments on the NIST 2012 SRE suggest that PLDA-RVM regression performs slightly better than PLDA-SVM after UP-AVR is applied.
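The empirical kernel map described above can be sketched as follows. This is an illustration only: `plda_score` is a stand-in (here a cosine score) for the trained PLDA log-likelihood-ratio scorer, and all names are hypothetical. The returned score vector is what the target speaker's SVM or RVM would operate on.

```python
import numpy as np

def plda_score(x, y):
    # Stand-in for the PLDA log-likelihood-ratio score between two i-vectors;
    # a real system would use trained PLDA model parameters here.
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def empirical_kernel_map(test_ivec, target_ivecs, nontarget_ivecs):
    """Map a test i-vector into the target speaker's PLDA score space.

    The first part of the score vector is defined by the target speaker's
    i-vectors (possibly produced by UP-AVR), the second by a fixed set of
    nontarget speakers' i-vectors.
    """
    scores = [plda_score(test_ivec, t) for t in target_ivecs]      # target part
    scores += [plda_score(test_ivec, n) for n in nontarget_ivecs]  # nontarget part
    return np.array(scores)
```

The balance issue mentioned in the abstract is visible here: without UP-AVR, the target part of the score vector may contain only one or two entries against hundreds of nontarget entries.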
Description: PolyU Library Call No.: [THS] LG51 .H577P EIE 2015 Rao
xxiv, 173 pages : illustrations (some color) ; 30 cm
URI: http://hdl.handle.net/10397/35110
Rights: All rights reserved.
Appears in Collections: Thesis
Files in This Item:
b28068889_link.htm (For PolyU Users, 203 B, HTML)
b28068889_ir.pdf (For All Users, Non-printable, 4.11 MB, Adobe PDF)
Citations as of Oct 15, 2018