Articulatory-feature based pronunciation modelling for high-level speaker verification

Zhang, Shixiong

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/83150

DC Field	Value	Language
dc.contributor	Department of Electronic and Information Engineering	-
dc.creator	Zhang, Shixiong	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/2898	-
dc.language.iso	English	-
dc.title	Articulatory-feature based pronunciation modelling for high-level speaker verification	-
dc.type	Thesis	-
dcterms.abstract	Speaker verification is a binary classification problem whose objective is to determine whether a test utterance was produced by a client speaker. Text-independent speaker verification systems typically extract speaker-dependent features from short-term spectra of speech signals to build speaker-dependent Gaussian mixture models (GMMs). While this short-term spectral approach can achieve reasonably good performance in controlled environments, its lack of robustness to real-world environments remains a serious problem. To improve the robustness of spectral-based systems, long-term high-level features have been investigated in recent years. Among the high-level features investigated, the use of articulatory features (AFs) for constructing conditional pronunciation models (CPMs) has been very promising. The resulting models are referred to as articulatory-feature based conditional pronunciation models, or simply AFCPMs. The drawback of AFCPMs, however, is that the pronunciation models are phoneme-dependent, meaning that they require one discrete density function for each phoneme. This dissertation demonstrates that this phoneme dependency leads to speaker models with low discriminative power, especially when the amount of training data is limited. To overcome this problem, this dissertation proposes four new techniques for articulatory-feature based pronunciation modeling. 1. Phonetic-Class Dependent AFCPM (CD-AFCPM). In this modeling technique, the density functions are conditioned on phonetic classes instead of phonemes. The phonetic classes are created from phonemes through three different mapping functions, which are obtained by (1) vector quantizing the discrete densities in the phoneme-dependent universal background models, (2) using the phone properties specified in the classical phoneme tree, and (3) a combination of (1) and (2). 2. Probabilistic Weighting Scheme. In the original CD-AFCPM, all frames are considered to be equally important during density estimation. However, frames that have a higher probability of belonging to the phonetic class being modeled should be given a greater weight. This dissertation, therefore, proposes a weighting scheme for computing the pronunciation models such that frames with a higher probability of belonging to a particular class contribute more to the model of that class. A new scoring method that uses an SVM to combine the scores generated from the phonetic-class models is also proposed. 3. Model Adaptation. Speaker verification based on high-level speaker features requires long enrolment utterances to be reliable. However, in practical speaker verification, it is common to model speakers based on a limited amount of enrolment data. To alleviate this problem, this dissertation proposes a new adaptation method for creating speaker models. The method adapts not only the phoneme-dependent background model but also the phoneme-independent speaker model. 4. Articulatory-Feature Kernels. The log-likelihood ratio scoring method in the original AFCPM does not explicitly use the discriminative information available in the training data because the target speaker models and background models are trained separately. This dissertation proposes converting the speaker models to supervectors in a high-dimensional space by stacking the discrete densities in the AFCPMs. An AF-kernel is constructed from the supervectors of target speakers, background speakers, and claimants. Then, an SVM is discriminatively trained to classify the supervectors. These four techniques have been evaluated on the NIST 2000 dataset. The evaluation leads to five findings: 1. Among the three mapping functions, the one that combines the classical phoneme tree and Euclidean distance between AFCPMs achieves the best performance; 2. Phonetic-class AFCPM achieves a significantly lower error rate than conventional AFCPM; 3. The weighting scheme leads to better speaker models and hence helps to improve verification performance; 4. The proposed adaptation method, which uses as much information as possible from the training data, significantly outperforms the classical MAP adaptation method; and 5. The proposed AF-kernel is complementary to the likelihood-ratio scoring method, and their fusion can improve verification performance.	-
dcterms.accessRights	open access	-
dcterms.educationLevel	M.Phil.	-
dcterms.extent	xiv, 115 p. : col. ill. ; 30 cm.	-
dcterms.issued	2008	-
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations.	-
dcterms.LCSH	Phonetics.	-
dcterms.LCSH	Automatic speech recognition.	-
dcterms.LCSH	Speech perception.	-
dcterms.LCSH	Biometry.	-
Appears in Collections:	Thesis
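
The probabilistic weighting scheme and the supervector stacking described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the abstract gives no formulas, so the estimator below (posterior-weighted counts, normalised per class), the function names `weighted_class_densities` and `stack_supervector`, and the `floor` smoothing parameter are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def weighted_class_densities(
    af_labels: Sequence[int],
    class_posteriors: Sequence[Sequence[float]],
    num_af_values: int,
    floor: float = 1e-6,
) -> List[List[float]]:
    """Estimate one discrete density over AF labels per phonetic class.

    Assumed scheme: frame t contributes to class c with weight
    class_posteriors[t][c], so frames that more probably belong to a
    class count more towards that class's density.
    `floor` must be positive if some class may receive no weight at all.
    """
    num_classes = len(class_posteriors[0])
    # Start every bin at `floor` so unseen AF labels keep a nonzero probability.
    counts = [[floor] * num_af_values for _ in range(num_classes)]
    for label, posteriors in zip(af_labels, class_posteriors):
        for c, weight in enumerate(posteriors):
            counts[c][label] += weight
    # Normalise each class's weighted counts into a probability distribution.
    densities = []
    for row in counts:
        total = sum(row)
        densities.append([value / total for value in row])
    return densities


def stack_supervector(densities: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the per-class densities into a single flat vector."""
    return [p for row in densities for p in row]


if __name__ == "__main__":
    # Four frames, three possible AF labels, two phonetic classes (toy data).
    labels = [0, 1, 1, 2]
    posteriors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]
    model = weighted_class_densities(labels, posteriors, num_af_values=3)
    print(stack_supervector(model))
```

Setting every posterior to a hard 0 or 1 reduces this to unweighted counting, which corresponds to treating all frames assigned to a class as equally important. The stacked vector is the kind of fixed-length input an SVM could classify, although the actual AF-kernel in the thesis may be constructed differently.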

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/2898
