Please use this identifier to cite or link to this item:
Title: Discovering clusters in databases using an evolutionary approach
Authors: Chung, Lap-hang Lewis
Degree: M.Phil.
Issue Date: 2000
Abstract: Data Mining is concerned with the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. The grand challenge of data mining is to collectively handle the problems imposed by the nature of real-world databases which tend to be dynamic, incomplete, redundant, noisy, and very large. Among many different problems that data mining is concerned with, the problem of discovering clusters in databases has recently received more attention. Clustering problem is concerned with the discovering of meaningful groupings of data records in a database based on their attribute values. The ability to do so can have many applications in many different areas in business and finance, computing and engineering, natural and social science, etc. Many of the existing clustering techniques were developed to handle a special type of data mining problem called spatial data mining. The databases that are involved contain continuous-valued records and the techniques that are used are, by and large, based on distance measures that can be defined in the Euclidean space. In other words, these techniques are not very useful when employed to handle mixed continuous- and discrete-valued data records. For clustering techniques that can be used to deal with mixed data, many of them use different distance measures for continuous and discrete-valued data separately. Moreover, they are not good at handling data records that are noisy and that contain missing values. They are also not able to discover clusters whose boundaries overlap. Furthermore, many of them do not make explicit the characteristics of each cluster discovered or the differences between them and this makes the result difficult to interpret and use. To overcome these problems, we propose a new clustering algorithm in this thesis. When compared with existing algorithms, it has several advantageous features. It is able to (i) handle mixed continuous- and discrete-valued data; (ii) discover overlapping clusters; (iii) perform data transformation; (iv) handle noisy and missing value; and (v) explicitly represent the characteristics of each discovered clusters. The proposed clustering algorithm is based on the use of a simple genetic algorithm (GA). By representing a cluster label as a gene, particular grouping of records is encoded in a chromosome. Once different groupings are generated, the most interesting chromosomes are then evolved using the operators of selection, crossover, and mutation. To determine how interesting the chromosomes in the whole population is, all of them are evaluated by a fitness function. The fitness function is defined in terms of a probabilistic similarity measure and can be interpreted as an objective measure of interestingness of the rules that characterize the particular grouping in a chromosome. Since the similarity measure is probabilistic, it can be defined when the data being dealt with contains noisy, dynamic, incomplete, missing, or even erroneous values. In addition, the defined measure enables us to discover overlapping clusters. The ultimate goal of the evolutionary process is, therefore, to identify the fittest chromosome by maximizing the fitness function. During the process, it should be noted that a set of rules is discovered to characterize the specific grouping encoded in a chromosome. This non-black-box approach makes it possible for the patterns underlying the databases to be made explicit. To evaluate the performance of the proposed clustering approach, we used many sets of real and simulated data in our experimentation. In addition, for comparison with existing methods, we have also introduced an evaluation criterion. The results of the experiments show that the proposed approach is able to handle real problem more effectively.
Subjects: Data mining
Hong Kong Polytechnic University -- Dissertations
Pages: vii, 123 leaves : ill. ; 30 cm
Appears in Collections:Thesis

Show full item record

Page views

Last Week
Last month
Citations as of Jun 11, 2023

Google ScholarTM


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.