Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/6404
Title: Clustering and classification on uncertain data
Authors: Xu, Lei
Keywords: Data mining.
Cluster analysis.
Classification.
Hong Kong Polytechnic University -- Dissertations
Issue Date: 2013
Publisher: The Hong Kong Polytechnic University
Abstract: We study the problem of mining on uncertain objects whose locations are uncertain and described by probability density functions (pdf). Clustering and classification are two important tasks in data mining. Clustering on uncertain objects is different from traditional case on certain objects. UK-meansis proposed based on K-meansbutitis time consuming. Pruning techniques are proposed to improve the efficiency of UK-means. First we analyze existing pruning algorithms and experimentally show that there exists a new bottleneck in the performance due to the overhead of pruning candidate clusters for assignment of each uncertain object in each iteration. In this thesis, we will show that by considering squared Euclidean distance, UK-means (without pruning techniques) is reduced to K-means and performs much faster than pruning algorithms, however, with some discrepancies in the clustering results due to different distance functions used. Thus, we propose Approximate UK-means to heuristically identify objects of boundary cases and re-assign them to better clusters. In addition, we propose three models for the representation of cluster representative (certain model, uncertain model and heuristic model) to calculate expected squared Euclidean distance between objects and cluster representatives. The experimental results show that our approach (Approximate UK-means) reduces the discrepancies of K-means' clustering results by taking more time than K-means.
In the case of classification on uncertain objects, some existing algorithms are hundreds or thousands times more complex than traditional ones, because an uncertain object is represented by hundreds or thousands of samples. Due to the complex representation of uncertain objects and existing algorithms, it is time consuming to classify uncertain objects. In this thesis, we propose a novel supervised UK-means algorithm to classify uncertain objects more efficiently. In supervised UK-means, we consider to select features that can capture the relevant properties of uncertain data similarly to feature selection on certain objects. We also extend supervised UK-meansto ensemble learning. We experimentally demonstrate that our proposed approaches are more efficient than existing algorithms and can attain comparatively accurate results on non-overlapping data sets. In supervised UK-means, the classes are assumed to be well separated. But the real data are usually distributed arbitrarily and the classes cannot be separated by simple linear boundaries. We propose Supervised UK-means with Multiple Subclasses (SUMS) which considers that the objects in the same class can be further divided into several groups (subclasses) within the class and tries to learn the subclass representatives to classify objects more accurately. Moreover, we propose a Bounded Supervised UK-means with Multiple Subclasses (BSUMS) to avoid over-fitting. From our experiments, Supervised UK-means with Multiple Subclasses (SUMS) and BSUMS perform better than supervised UK-means on synthetic data sets and real data sets.
Description: xvi, 113 p. : ill. ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577P COMP 2013 Xu
URI: http://hdl.handle.net/10397/6404
Rights: All rights reserved.
Appears in Collections:Thesis

Files in This Item:
File Description SizeFormat 
b26530673_link.htmFor PolyU Users203 BHTMLView/Open
b26530673_ir.pdfFor All Users (Non-printable) 1 MBAdobe PDFView/Open
Show full item record

Page view(s)

357
Last Week
10
Last month
Checked on Mar 19, 2017

Download(s)

167
Checked on Mar 19, 2017

Google ScholarTM

Check



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.