DC Field | Value | Language
dc.contributor | Department of Computing | -
dc.creator | Wong, Ka-shing Alex | -
dc.title | An evolutionary approach to discover composite features for effective text classification of small classes | -
dcterms.abstract | In real-world environments, text classification through machine learning often faces special problems caused by a small number of positive training samples and significantly skewed distributions. The overwhelming number of negative samples and their features may significantly bias the classifier learning process. In addition, features that appear only in negative samples may be irrelevant to the determination of the target class. In text classification, the difficulties caused by imbalanced data are aggravated by the large number of features available. Hence, finding a small number of good features is essential to improving the classification of small classes. Apart from the basic word tokens, composite features such as n-gram phrases and sparse phrases are possible sources of good features. They can be generated by combining word tokens to represent the co-occurrence of multiple words, and they can provide more precise information to distinguish a class. However, a major problem with this approach is the enormous number of possible combinations. This thesis studies the efficient generation of effective composite features for text classification when the target class is small. We show that this can be done by focusing on features in positive samples and by a heuristic-based exploration of the composite feature space. Experimental results in our study showed that the features in positive samples could offer performance comparable to that of the features in all samples. At the same time, by focusing on positive samples, the number of features used could be greatly reduced. This simple application of the sampling concept to feature selection offers a key to speeding up feature exploration. Furthermore, by applying several proposed techniques together with the concept of an evolutionary approach, a heuristics-based method was developed to efficiently explore the space of composite features.
The flexibility of this approach made it feasible to search for an optimal set of features in the very large space of composite features with limited resources. The effectiveness of our approach on classification, particularly small-class classification, was evaluated and compared using different classifiers and a commonly used data set. In general, our experiments showed that our approach was able to produce high-quality composite features by generating and examining a much smaller pool of features than would otherwise be possible. | -
dcterms.accessRights | open access | -
dcterms.extent | vii, 133 leaves : ill. ; 30 cm. | -
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations. | -
dcterms.LCSH | Text processing (Computer science) | -
dcterms.LCSH | Semantics -- Data processing. | -
Appears in Collections: Thesis



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.