Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/35149
Title: Parallel data mining algorithms for multi-dimensional points on GPUs
Authors: Matsumoto, Takazumi
Advisors: Yiu, Man Lung (COMP)
Hung, Edward (COMP)
Chung, Korris (COMP)
Keywords: Data mining.
Data mining -- Mathematics
High performance computing
Issue Date: 2015
Publisher: The Hong Kong Polytechnic University
Abstract: Data mining tasks such as clustering, outlier detection and similarity search typically employ a series of algorithms to operate on a large set of data, making them amenable to parallelization. Thus parallelization of data mining operations such as distance computation has been extensively studied in the literature. In recent years, the use of Graphics Processing Units (GPUs) for data mining has been steadily increasing. As state-of-the-art processors now include both CPU and GPU, it is important to consider which approaches benefit from GPU processing and which do not, and apply a heterogeneous processing approach to improve the efficiency when applicable. Similarity search, also referred to as k-nearest neighbor search, is a particular application of distance computation where only to top k results (e.g., top 10) are required. It is used extensively in multimedia search, where only a small subset of possible results are used, and numerous approaches using both exact and approximate k-nearest neighbor methods have been proposed over the years. Our contribution is a new exact distance algorithm that not only outperforms competing exact GPU methods, but can compete with leading approximate GPU methods. It can also leverage heterogeneous (CPU-GPU) processing for improved efficiency. Outlier detection, also known as anomaly detection, involves detecting data points that deviate from expected patterns. It is an important part of applications such as fraud detection and system fault detection. In many real-world applications such as wireless sensor readings and weather forecasts, data are uncertain in nature. Rather than discarding uncertainty information or storing many sample readings, it is possible to instead approximate the probability distribution function (pdf) to compactly store and efficiently calculate uncertain distance when required. Existing GPU approaches are limited to working with certain data, while our contribution is a parallel method for outlier detection on uncertain data. Clustering is another common operation in data mining. Our work focuses on a particular application, label generation for web search results that automatically generates labels for related topics returned by a user's web search. This is accomplished using a fuzzy transduction-based relational model. We also contribute a GPU implementation of our proposed solution.
Description: PolyU Library Call No.: [THS] LG51 .H577P COMP 2015 Matsumoto
xvi, 176 pages :illustrations
URI: http://hdl.handle.net/10397/35149
Rights: All rights reserved.
Appears in Collections:Thesis

Files in This Item:
File Description SizeFormat 
b28157369_link.htmFor PolyU Users203 BHTMLView/Open
b28157369_ir.pdfFor All Users (Non-printable)5.56 MBAdobe PDFView/Open
Show full item record

Page view(s)

50
Last Week
4
Last month
Checked on Mar 19, 2017

Download(s)

100
Checked on Mar 19, 2017

Google ScholarTM

Check



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.