Unsupervised pattern discovery for sequence and mixed attribute databases

Wu, Pak-kit

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/84119

Title:	Unsupervised pattern discovery for sequence and mixed attribute databases
Authors:	Wu, Pak-kit
Degree:	M.Phil.
Issue Date:	2011
Abstract:	The world contains a vast amount of digital information that is growing ever more rapidly, so there is a great need to reveal insights that previously remained hidden in data of mixed types, so that comprehensive information can be well structured, effectively organized and further applied to analysis, classification, interpretation, understanding and summarization. Because most data in databases come from diverse sources, much of it is not provided with explicit class information. A pattern discovery method that automatically discovers patterns and knowledge from data without relying on prior classificatory knowledge is therefore greatly needed. For a large database, how to discover statistically significant patterns and how to discretize its continuous data into interval events remain open research and practical problems. Discovering patterns from a large mixed-mode database, whose data types may be a mixture of interval-scaled, symmetric binary, asymmetric binary, categorical, ordinal or ratio-scaled, is regarded as a classification problem when the classes of the samples are given; it is solved as a discrete-data problem by discretizing the continuous data into intervals that maximize the interdependence between each continuous attribute and the class labels. However, when class information is unavailable, discovering patterns becomes difficult. To tackle these problems in an unsupervised manner, that is, as a problem of unsupervised pattern discovery, one searches for statistically significant patterns by mining the database. The proposed approach uses a probabilistic method to detect statistically significant patterns and transforms them into a relational table that represents the original data. Given a mixed-mode dataset, we partition it into a number of attribute clusters, each containing attributes that are correlated with one another. This process is known as attribute clustering.
Once all optimal attribute clusters are found, the most representative attribute, called the mode, can be identified in each attribute cluster. To deal with the discretization problem, a mode-driven discretization algorithm is introduced that treats the mode like a class label and uses it to drive the discretization of the other continuous attributes in the attribute cluster by maximizing the interdependence between each continuous attribute and the mode. With the intervals treated as discrete events, association patterns can then be discovered. If the attribute clusters obtained are crisp, significant patterns that span different clusters cannot be found. A new method of "fuzzifying" the crisp attribute clusters is introduced to detect significant patterns that overlap different fuzzy clusters. To validate the premises proposed in the thesis, extensive experiments were conducted using a number of synthetic data sets, data sets from the UCI machine learning archive and two large data sets from real-world databases. To demonstrate the usefulness of the proposed approach, the two large real-world data sets were analyzed: one from a number of meteorological surface stations and the other from a delayed coking unit in a petrochemical refinery. The patterns discovered in the weather station data reflect the local and global characteristics of the correlated meteorological parameters. The findings from the delayed coking data reveal the relationships among the large number of sensors and controllers in the coking plant facilities. These findings provide significant evidence of the usefulness and effectiveness of the proposed approaches in analyzing data to extract significant patterns and knowledge for interpretation, understanding and summarization.
Subjects:	Database management. Database searching. Pattern recognition systems. Hong Kong Polytechnic University -- Dissertations
Pages:	xi, 143 leaves : ill. ; 30 cm.
Appears in Collections:	Thesis

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/6189

