Parallel data mining algorithms for multi-dimensional points on GPUs

Matsumoto, Takazumi

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/83266

Title:	Parallel data mining algorithms for multi-dimensional points on GPUs
Authors:	Matsumoto, Takazumi
Degree:	Ph.D.
Issue Date:	2015
Abstract:	Data mining tasks such as clustering, outlier detection and similarity search typically employ a series of algorithms to operate on a large set of data, making them amenable to parallelization. Thus parallelization of data mining operations such as distance computation has been extensively studied in the literature. In recent years, the use of Graphics Processing Units (GPUs) for data mining has been steadily increasing. As state-of-the-art processors now include both CPU and GPU, it is important to consider which approaches benefit from GPU processing and which do not, and apply a heterogeneous processing approach to improve the efficiency when applicable. Similarity search, also referred to as k-nearest neighbor search, is a particular application of distance computation where only to top k results (e.g., top 10) are required. It is used extensively in multimedia search, where only a small subset of possible results are used, and numerous approaches using both exact and approximate k-nearest neighbor methods have been proposed over the years. Our contribution is a new exact distance algorithm that not only outperforms competing exact GPU methods, but can compete with leading approximate GPU methods. It can also leverage heterogeneous (CPU-GPU) processing for improved efficiency. Outlier detection, also known as anomaly detection, involves detecting data points that deviate from expected patterns. It is an important part of applications such as fraud detection and system fault detection. In many real-world applications such as wireless sensor readings and weather forecasts, data are uncertain in nature. Rather than discarding uncertainty information or storing many sample readings, it is possible to instead approximate the probability distribution function (pdf) to compactly store and efficiently calculate uncertain distance when required. Existing GPU approaches are limited to working with certain data, while our contribution is a parallel method for outlier detection on uncertain data. Clustering is another common operation in data mining. Our work focuses on a particular application, label generation for web search results that automatically generates labels for related topics returned by a user's web search. This is accomplished using a fuzzy transduction-based relational model. We also contribute a GPU implementation of our proposed solution.
Subjects:	Data mining. Data mining -- Mathematics High performance computing Hong Kong Polytechnic University -- Dissertations
Pages:	xvi, 176 pages : illustrations
Appears in Collections:	Thesis

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/8035

Show full item record

Page views

59

Last Week
0

Last month

Citations as of Apr 21, 2024

Google Scholar^TM

Check

Access

Page views

Google ScholarTM

Google Scholar^TM