Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/84868
Title: Clustering and classification on uncertain data
Authors: Xu, Lei
Degree: Ph.D.
Issue Date: 2013
Abstract: We study the problem of mining on uncertain objects whose locations are uncertain and described by probability density functions (pdf). Clustering and classification are two important tasks in data mining. Clustering on uncertain objects is different from traditional case on certain objects. UK-meansis proposed based on K-meansbutitis time consuming. Pruning techniques are proposed to improve the efficiency of UK-means. First we analyze existing pruning algorithms and experimentally show that there exists a new bottleneck in the performance due to the overhead of pruning candidate clusters for assignment of each uncertain object in each iteration. In this thesis, we will show that by considering squared Euclidean distance, UK-means (without pruning techniques) is reduced to K-means and performs much faster than pruning algorithms, however, with some discrepancies in the clustering results due to different distance functions used. Thus, we propose Approximate UK-means to heuristically identify objects of boundary cases and re-assign them to better clusters. In addition, we propose three models for the representation of cluster representative (certain model, uncertain model and heuristic model) to calculate expected squared Euclidean distance between objects and cluster representatives. The experimental results show that our approach (Approximate UK-means) reduces the discrepancies of K-means' clustering results by taking more time than K-means.
In the case of classification on uncertain objects, some existing algorithms are hundreds or thousands times more complex than traditional ones, because an uncertain object is represented by hundreds or thousands of samples. Due to the complex representation of uncertain objects and existing algorithms, it is time consuming to classify uncertain objects. In this thesis, we propose a novel supervised UK-means algorithm to classify uncertain objects more efficiently. In supervised UK-means, we consider to select features that can capture the relevant properties of uncertain data similarly to feature selection on certain objects. We also extend supervised UK-meansto ensemble learning. We experimentally demonstrate that our proposed approaches are more efficient than existing algorithms and can attain comparatively accurate results on non-overlapping data sets. In supervised UK-means, the classes are assumed to be well separated. But the real data are usually distributed arbitrarily and the classes cannot be separated by simple linear boundaries. We propose Supervised UK-means with Multiple Subclasses (SUMS) which considers that the objects in the same class can be further divided into several groups (subclasses) within the class and tries to learn the subclass representatives to classify objects more accurately. Moreover, we propose a Bounded Supervised UK-means with Multiple Subclasses (BSUMS) to avoid over-fitting. From our experiments, Supervised UK-means with Multiple Subclasses (SUMS) and BSUMS perform better than supervised UK-means on synthetic data sets and real data sets.
Subjects: Data mining.
Cluster analysis.
Classification.
Hong Kong Polytechnic University -- Dissertations
Pages: xvi, 113 p. : ill. ; 30 cm.
Appears in Collections:Thesis

Show full item record

Page views

42
Last Week
1
Last month
Citations as of Mar 24, 2024

Google ScholarTM

Check


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.