Title: Statistical pattern recognition : locality preserving embeddings and ensemble of rules
Authors: Niu, Ben
Degree: Ph.D.
Issue Date: 2008
Abstract: Database Management Systems (DBMS) create index structures to help with data retrieval and analysis. Developing state-of-the-art indexing techniques is a fundamental issue in computer science and engineering. A critical problem in data indexing is the high dimensionality of the data. For low-dimensional data with only a small number of attributes, one can simply index on a key attribute in one-dimensional space. For multidimensional data with up to a few hundred attributes, spatial indexing methods such as the KD-tree can be employed to speed up searching and retrieval based on similarity metrics. However, in many real-world applications, such as time series analysis, content-based multimedia retrieval, microarray analysis, and brain electroencephalogram maps, the data we encounter can be of much higher dimensionality, with thousands of attributes. Traditional methods such as the B-tree and the KD-tree lose their effectiveness due to the curse of dimensionality. To address this difficulty, we can reduce the dimensionality by either feature extraction or feature selection. In feature extraction, we seek embedding subspaces for an optimal transform that yields a low-dimensional representation (the index features) of the original data. In feature selection, we choose only a subset of the attributes, the significant features, and use them to characterize the distribution of the data samples. Among the extraction methods, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are the most popular. They have been successfully applied to facial image databases for biometric authentication; the optimal embedding subspaces are known as Eigenfaces and Fisherfaces for PCA and LDA respectively. They have also proved effective on handwritten digit databases for postcode recognition.
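As a concrete illustration of the feature-extraction route, the PCA step described above can be sketched in a few lines of NumPy (a toy example; the data shape, number of components, and random data are arbitrary, and production systems would use optimized solvers):

```python
import numpy as np

def pca_features(X, k):
    """Project samples (rows of X) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)              # center the data
    cov = Xc.T @ Xc / (len(X) - 1)       # d x d covariance matrix
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    W = vecs[:, ::-1][:, :k]             # top-k eigenvectors ("Eigenfaces")
    return Xc @ W                        # low-dimensional index features

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))           # 100 samples, 50 attributes
Z = pca_features(X, 5)
print(Z.shape)                           # (100, 5)
```

The k-dimensional rows of `Z` can then serve as index features for similarity search in place of the original 50-dimensional samples.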
On the other hand, feature selection methods, such as the decision tree and its variants, reduce the dimensionality without transforms, thus avoiding the loss of the semantic meaning of the attributes in the problem domain. The selected features and the rules extracted from decision trees are therefore human-understandable, and they help in interpreting the processes underlying the data. Recent advances in dimensionality reduction highlight three techniques, namely, image-based projection, locality-based embedding, and the ensemble of decision trees. With image-based projection, we need not convert the 2-dimensional input images (or other samples given in matrix form) to 1-dimensional vectors before the training step. Moreover, it significantly reduces the size of the matrices in the associated eigenproblems. Based on this technique, 2-dimensional PCA (2DPCA) and LDA (2DLDA) have been developed and evaluated on a variety of facial image databases. The experimental results indicate that both computational efficiency and recognition accuracy are improved dramatically. Locality-based embedding, on the other hand, is a natural extension of manifold learning methods, which emphasize preserving the local density structure of the data during embedding. The key idea is that we can infer the global structure of the dataset by maximally preserving the patterns of the data in local neighborhoods. The extracted index features describe the local geometry of the samples, which is desirable for nearest-neighbor searches that rely mainly on local information for retrieval. Through locality-based embedding, the recently proposed Laplacianfaces method has proved to outperform PCA and LDA on facial data, and it has attracted considerable interest as an alternative for supporting high-dimensional data indexing. Another technique is the ensemble of decision trees, specifically, the CS4 algorithm.
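The image-based projection idea can be sketched as follows: in 2DPCA, images stay in matrix form, so the eigenproblem involves only an n × n image covariance matrix rather than an (mn) × (mn) one (a minimal sketch on random "images"; the sizes and number of components are illustrative):

```python
import numpy as np

def twod_pca(images, k):
    """2DPCA-style projection: each image stays an m x n matrix,
    and the eigenproblem is only n x n instead of (m*n) x (m*n)."""
    A = np.stack(images).astype(float)
    Abar = A.mean(axis=0)                # mean image
    # image covariance matrix (n x n), accumulated over centered images
    G = sum((Ai - Abar).T @ (Ai - Abar) for Ai in A) / len(A)
    vals, vecs = np.linalg.eigh(G)       # eigenvalues in ascending order
    W = vecs[:, ::-1][:, :k]             # top-k projection axes
    return [Ai @ W for Ai in A]          # each feature matrix is m x k

rng = np.random.default_rng(1)
imgs = rng.normal(size=(20, 32, 28))     # 20 toy "images", 32 x 28 each
feats = twod_pca(imgs, 4)
print(feats[0].shape)                    # (32, 4)
```

Note how `G` is 28 × 28 here, whereas vectorizing each image first would yield an 896 × 896 covariance matrix — the source of the efficiency gain the text describes.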
This method involves the construction and use of a committee of trees, instead of a single tree, for feature selection and prediction. It is attractive in bioinformatics applications as it can derive product rules that are biologically meaningful, helping biologists interpret and guide wet-lab experiments. Moreover, it has proved more accurate than the traditional decision tree method for microarray-based cancer prediction. In this work, we develop four new techniques for dimensionality reduction based on the above-mentioned methods. Firstly, we develop 2-dimensional Laplacianfaces. It is known that 2DPCA, 2DLDA and Laplacianfaces can outperform PCA and LDA. However, it is unclear whether Laplacianfaces can be further improved against 2DPCA and 2DLDA by leveraging image-based projection. We therefore develop 2D Laplacianfaces by integrating the two techniques, locality preserving and image-based projection. The algorithm is evaluated on the FERET and AR facial image databases. We find that the computational efficiency is significantly improved compared with Laplacianfaces: the training time and memory complexity are reduced from O(m² × n²) to O(m × n), where m and n are the numbers of rows and columns of a sample image. 2D Laplacianfaces is also more accurate than 2DPCA and 2DLDA, as it utilizes local information to help with recognition. Secondly, we develop unsupervised discriminant projection (UDP). In addition to local information, we also consider global information in formulating the optimization criterion. The goal is to preserve the local density of the data while maximizing the non-local (global) scatter. On facial image databases, UDP consistently outperforms Locality Preserving Projections (LPP) and PCA, and outperforms LDA when the training sample size per class is small.
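The locality-preserving criterion underlying Laplacianfaces (and, combined with image-based projection, 2D Laplacianfaces) can be sketched as a basic LPP: build a heat-kernel weighted neighborhood graph and solve the generalized eigenproblem XᵀLXw = λXᵀDXw for the smallest eigenvalues. The sketch below is a toy implementation; the neighborhood size, kernel-width heuristic, and regularization constant are illustrative choices, not the thesis's exact settings:

```python
import numpy as np
from scipy.linalg import eigh  # symmetric generalized eigensolver

def lpp(X, k, n_neighbors=5, t=None):
    """Toy Locality Preserving Projections: X is (N, d), returns (N, k)."""
    # pairwise squared Euclidean distances
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    if t is None:
        t = D2[D2 > 0].mean()            # heat-kernel width (heuristic)
    W = np.zeros_like(D2)
    for i in range(len(X)):
        nbrs = np.argsort(D2[i])[1:n_neighbors + 1]  # nearest, skip self
        W[i, nbrs] = np.exp(-D2[i, nbrs] / t)        # heat-kernel weights
    W = np.maximum(W, W.T)               # symmetrize the adjacency
    D = np.diag(W.sum(axis=1))           # degree matrix
    L = D - W                            # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-6 * np.eye(X.shape[1])  # regularize for stability
    vals, vecs = eigh(A, B)              # eigenvalues in ascending order
    return X @ vecs[:, :k]               # smallest-eigenvalue directions

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 10))
Y = lpp(X, 2)
print(Y.shape)                           # (60, 2)
```

Minimizing wᵀXᵀLXw keeps neighboring samples close after projection, which is exactly the property the abstract identifies as desirable for nearest-neighbor-based retrieval.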
On the PolyU Palmprint database, UDP achieves the top query accuracy of 99.7%, compared with 86.0%, 95.7%, and 99.0% for PCA, LDA and Laplacianfaces respectively. Thirdly, based on the idea of locality preserving, we propose another method called Mutual Neighborhood based Discriminant Projection (MNDP). Since classification errors derive mainly from samples on the class boundaries, it is important that the geometry of the class boundary be well preserved under projection. We construct mutual neighborhoods to highlight those samples that lie on the boundary and most likely contribute to the prediction errors. The features extracted from the facial and handwritten-digit databases indicate that MNDP is significantly better than PCA and LDA when the training sample size is modest. The singularity problem of LDA can also be successfully addressed. Finally, besides 2D Laplacianfaces and UDP, we also present an algorithm, adaptive CS4 (ACS4), for feature selection and prediction. ACS4 can identify the most discriminant features and grow a committee of trees for prediction. We evaluate ACS4 on a biological database for DNA methylation analysis. The number of index features for cell line classification is reduced significantly from 49 to only 2, so the computational and wet-lab costs of cancer diagnosis can be reduced by about 20 times compared with previous reports. Meanwhile, we also propose a strategy for adaptive clustering on the gene methylation database. The clustering results confirm that DNA methylation plays a dominant role in the processes of tumourigenesis and embryogenesis in human cells.
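CS4's exact rule-ranking procedure is not reproduced here, but the committee-of-trees idea it builds on can be sketched with bootstrap-sampled decision stumps and majority voting (a hypothetical toy implementation for illustration only, not the CS4 or ACS4 algorithm itself):

```python
import numpy as np

def fit_stump(X, y):
    """Exhaustively pick the (feature, threshold, polarity) with the
    highest training accuracy -- a depth-1 decision tree."""
    best_acc, best = -1.0, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = (pol * (X[:, j] - thr) > 0).astype(int)
                acc = (pred == y).mean()
                if acc > best_acc:
                    best_acc, best = acc, (j, thr, pol)
    return best

def predict_stump(stump, X):
    j, thr, pol = stump
    return (pol * (X[:, j] - thr) > 0).astype(int)

def tree_committee(X, y, n_trees=7, seed=0):
    """Grow a committee of stumps, each on a bootstrap resample."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))  # bootstrap sample
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

def predict_committee(stumps, X):
    """Majority vote over the committee's individual predictions."""
    votes = np.stack([predict_stump(s, X) for s in stumps])
    return (votes.mean(axis=0) > 0.5).astype(int)

# two well-separated toy classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
stumps = tree_committee(X, y)
acc = (predict_committee(stumps, X) == y).mean()
print(acc)
```

Each stump's (feature, threshold) pair is a human-readable rule, which mirrors the abstract's point that tree committees yield interpretable features alongside their predictions.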
Subjects: Hong Kong Polytechnic University -- Dissertations.
Pattern perception -- Statistical methods.
Principal components analysis.
Pages: 172 leaves : ill. ; 31 cm.
Appears in Collections:Thesis
