Title: A genetic algorithm based approach for clustering categorical data
Authors: Lee, Ho-kei Sean
Degree: M.Phil.
Issue Date: 2006
Abstract: Given a database of records, clustering is concerned with grouping similar records into clusters based on their attribute values. Many algorithms have been proposed to address the clustering problem, but most of them were developed mainly to handle continuous-valued data. Relatively little attention has been paid to the clustering of categorical data. Given that this kind of data is very commonly collected in many applications in business, medicine, and the social sciences, it is important that an effective clustering algorithm be developed to handle such data. In this thesis, we propose such an algorithm. The algorithm is based on a simple genetic algorithm (GA) that employs a probabilistic search technique to find solutions that are optimal or near-optimal according to some performance criteria. This GA-based clustering algorithm uses an encoding scheme that can encode clustering results in chromosomes effectively. To work with this scheme, we also propose a set of genetic operators that facilitate the exchange of clustering information between chromosomes on the one hand and allow variations to be introduced on the other. For the proposed GA to work well, we have also introduced a fitness function to evaluate clustering quality. It is based on an information-theoretic measure of how much the presence of a particular attribute value supports or refutes the classification of a record into a specific cluster. The higher the fitness value of a chromosome, the better the solution it encodes. Unlike traditional algorithms, the proposed GA-based clustering algorithm has the advantage that it can automatically determine the number of clusters hidden in a dataset. The proposed algorithm has been tested with both simulated and real data; the results show that it is very promising and can have many real applications.
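Illustrative sketch (not taken from the thesis): the abstract does not specify the encoding scheme, the genetic operators, or the exact fitness measure, so everything in the Python example below is an assumption chosen only to make the general idea concrete. It assumes a label-per-record chromosome, one-point crossover, random-reset mutation, and, as a stand-in fitness, the mean normalized mutual information between cluster labels and each categorical attribute. The names `fitness`, `evolve`, `max_k`, `pop_size`, and `p_mut` are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(counts, n):
    """Shannon entropy in bits of a distribution given as raw counts."""
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def fitness(records, labels):
    """Stand-in fitness (an assumption, not the thesis's measure).

    Mean over attributes of I(C; A) / H(C, A), where C is the cluster
    label and A is the attribute. Dividing by the joint entropy
    discourages splitting the data into needlessly many clusters.
    """
    n, n_attrs = len(records), len(records[0])
    h_c = entropy(Counter(labels).values(), n)
    total = 0.0
    for j in range(n_attrs):
        column = [r[j] for r in records]
        h_a = entropy(Counter(column).values(), n)
        h_joint = entropy(Counter(zip(labels, column)).values(), n)
        if h_joint > 0:
            total += (h_c + h_a - h_joint) / h_joint
    return total / n_attrs


def evolve(records, max_k=4, pop_size=30, generations=60, p_mut=0.1, seed=0):
    """Simple GA over label-per-record chromosomes (hypothetical design)."""
    rng = random.Random(seed)
    n = len(records)

    def score(chromosome):
        return fitness(records, chromosome)

    # Each chromosome assigns one label in [0, max_k) to every record.
    pop = [[rng.randrange(max_k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        nxt = pop[:2]  # elitism: carry the two best chromosomes over unchanged
        while len(nxt) < pop_size:
            # Truncation selection: parents come from the better half.
            a, b = rng.sample(pop[: pop_size // 2], 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Random-reset mutation: occasionally reassign a record.
            child = [rng.randrange(max_k) if rng.random() < p_mut else g
                     for g in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=score)


records = [("red", "small")] * 3 + [("blue", "large")] * 3
best = evolve(records)
print(best, round(fitness(records, best), 3))
```

In this sketch the number of clusters is not fixed in advance: it is the number of distinct labels used by the best chromosome, bounded above by the hypothetical cap `max_k`.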
Subjects: Hong Kong Polytechnic University -- Dissertations
Cluster analysis -- Data processing
Pages: vii, 103 leaves : ill. ; 31 cm
Appears in Collections: Thesis
