An evolutionary approach to discover composite features for effective text classification of small classes

Wong, Ka-shing Alex

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/86042

DC Field	Value	Language
dc.contributor	Department of Computing	-
dc.creator	Wong, Ka-shing Alex	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/3096	-
dc.language.iso	English	-
dc.title	An evolutionary approach to discover composite features for effective text classification of small classes	-
dc.type	Thesis	-
dcterms.abstract	In real-world environments, text classification through machine learning often faces special problems caused by a small number of positive training samples and significantly skewed class distributions. The overwhelming number of negative samples and their features may significantly bias the classifier learning process. In addition, features that appear only in negative samples may be irrelevant to the determination of the target class. In text classification, the difficulties caused by imbalanced data are aggravated by the large number of features available. Hence, finding a small number of good features is essential to improving the classification of small classes. Apart from basic word tokens, composite features such as n-gram phrases and sparse phrases are possible sources of good features. They can be generated by combining word tokens to represent the co-occurrence of multiple words, and they can provide more precise information to distinguish a class. However, a major problem is the enormous number of possible combinations. This thesis studies the efficient generation of effective composite features for text classification when the target class is small. We show that this can be done by focusing on features in positive samples and by a heuristic-based exploration of the composite feature space. Experimental results in our study showed that the features in positive samples could offer performance comparable to that of the features in all samples. At the same time, by focusing on positive samples, the number of features used could be greatly reduced. This simple application of the sampling concept to feature selection offers a key to speeding up feature exploration. Furthermore, by applying several proposed techniques together with the concept of an evolutionary approach, a heuristics-based method was developed to efficiently explore the space of composite features.
The flexibility of this approach made it feasible to search for an optimal set of features in the very large space of composite features with limited resources. The effectiveness of our approach on classification, particularly small-class classification, was evaluated and compared using different classifiers and a commonly used data set. In general, our experiments showed that our approach was able to produce high-quality composite features by generating and examining a much smaller pool of features than would otherwise be possible.	-
dcterms.accessRights	open access	-
dcterms.educationLevel	M.Phil.	-
dcterms.extent	vii, 133 leaves : ill. ; 30 cm.	-
dcterms.issued	2008	-
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations.	-
dcterms.LCSH	Text processing (Computer science)	-
dcterms.LCSH	Semantics -- Data processing.	-
Appears in Collections:	Thesis
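
The abstract's idea of drawing composite feature candidates from positive samples only can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy corpus, the function names, and the working definition of a sparse phrase (an unordered pair of distinct words that co-occur within a short token window) are all assumptions made for illustration. The evolutionary search is not sketched, since this record does not describe it.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; the thesis may differ)."""
    return text.lower().split()


def bigrams(tokens):
    """Adjacent word pairs, i.e. 2-gram phrases."""
    return set(zip(tokens, tokens[1:]))


def sparse_pairs(tokens, window=4):
    """Unordered pairs of distinct words inside a span of `window` tokens.

    This is an assumed reading of "sparse phrase": the two words need
    not be adjacent, only close together.
    """
    pairs = set()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                pairs.add(tuple(sorted((left, right))))
    return pairs


def candidate_pool(texts, window=4):
    """Collect composite feature candidates from the given documents."""
    pool = set()
    for text in texts:
        tokens = tokenize(text)
        pool |= {("ngram",) + pair for pair in bigrams(tokens)}
        pool |= {("sparse",) + pair for pair in sparse_pairs(tokens, window)}
    return pool


# Hypothetical toy corpus: label 1 marks the small target class.
corpus = [
    ("wheat prices rise on export demand", 1),
    ("grain export quota lifts wheat futures", 1),
    ("central bank holds interest rates steady", 0),
    ("oil output cut pushes crude prices higher", 0),
    ("tech shares fall after weak earnings report", 0),
    ("new trade talks open between the two governments", 0),
]

positive_texts = [text for text, label in corpus if label == 1]
all_texts = [text for text, _ in corpus]

pos_pool = candidate_pool(positive_texts)
full_pool = candidate_pool(all_texts)

print("candidates from positive samples:", len(pos_pool))
print("candidates from all samples:", len(full_pool))
```

Because the positive documents are a subset of the corpus, the positive-only pool is always contained in the full pool, so any later search over composite features has fewer candidates to examine. Whether the smaller pool preserves classification quality is an empirical question that the thesis reports on; this sketch does not test it.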

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/3096
