Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/86635
Title: MatrixMap : programming abstraction and implementation of matrix computation for big data applications
Authors: Huangfu, Yaguang
Degree: M.Phil.
Issue Date: 2016
Abstract: Big data refers to information that exceeds the processing capacity of conventional database systems and is characterized by its volume, velocity, and variety. Big data programming requires parallel programming systems that implement parallel programming models in order to scale up with flexibility. A parallel programming model is an abstraction that expresses the application logic, defining how data is loaded into data structures and how parallel operations are performed on those structures. The computational core of many big data applications can be expressed as general matrix computations, including linear algebra operations and irregular matrix operations, and many common machine learning and graph algorithms can be implemented with matrix operations. However, existing parallel programming systems provide neither a programming abstraction nor an efficient implementation for general matrix computations. For example, data-parallel programming systems such as Spark support matrix operations only inefficiently; graph-parallel programming systems such as GraphLab are designed for graph algorithms and do not support matrix operations; and large-scale matrix computation systems such as MadLINQ are specialized for linear algebra operations and do not support irregular matrix operations.
In this thesis, we describe the design and implementation of MatrixMap, a unified and efficient data-parallel programming framework for general matrix computations. MatrixMap provides a powerful yet simple abstraction consisting of a distributed in-memory data structure, the bulk key matrix, and a computation interface defined by matrix patterns. Users can easily load data into bulk key matrices and express algorithms as parallel matrix patterns. The bulk key matrix is the fundamental data structure of MatrixMap: a scalable, constant, distributed shared-memory structure that stores vector-oriented data indexed by key and preserves data across matrix patterns. Matrix patterns are programmed with user-defined lambda functions; a mathematical matrix is the special case in which keys and values are numbers. We implement MatrixMap on a shared-nothing cluster with multi-core support. The BSP model is used to compute each pattern and to form an asynchronous computation pipeline that loads, computes, and saves data. Furthermore, we leverage sparse matrix formats and BLAS (Basic Linear Algebra Subprograms) to speed up in-memory matrix computations. MatrixMap outperforms current state-of-the-art systems by employing three key techniques: matrix patterns with lambda functions for both irregular and linear algebra matrix operations, an asynchronous computation pipeline with data-shuffling strategies optimized for specific matrix patterns, and an in-memory data structure that reuses data across iterations. Moreover, MatrixMap automatically handles the parallelization and distributed execution of programs on a large cluster. Many example applications have been implemented and tested on MatrixMap; the experimental results show that it can be 12 times faster than Spark.
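As an illustration of the programming style described above, the following is a minimal single-node C++ sketch of keyed rows processed through patterns driven by user-defined lambda functions. The names BulkKeyMatrix, MapPattern, and JoinPattern and their signatures are assumptions made for this example, not MatrixMap's actual API; in the real framework these patterns execute distributed over a shared-nothing cluster under the BSP model.

// Hypothetical single-node sketch of the "matrix pattern + lambda" abstraction.
// BulkKeyMatrix, MapPattern and JoinPattern are illustrative assumptions only.
#include <cstdio>
#include <functional>
#include <unordered_map>
#include <vector>

// A bulk key matrix: vector-oriented rows indexed by key.
using Row = std::vector<double>;
using BulkKeyMatrix = std::unordered_map<long, Row>;

// Map pattern: apply a user-defined lambda to every keyed row independently.
BulkKeyMatrix MapPattern(const BulkKeyMatrix& in,
                         const std::function<Row(long, const Row&)>& fn) {
    BulkKeyMatrix out;
    for (const auto& [key, row] : in) out[key] = fn(key, row);
    return out;
}

// Join pattern: combine rows of two matrices that share a key via a lambda;
// irregular operations (e.g., per-vertex graph updates) also fit this shape.
BulkKeyMatrix JoinPattern(const BulkKeyMatrix& a, const BulkKeyMatrix& b,
                          const std::function<Row(const Row&, const Row&)>& fn) {
    BulkKeyMatrix out;
    for (const auto& [key, rowA] : a) {
        auto it = b.find(key);
        if (it != b.end()) out[key] = fn(rowA, it->second);
    }
    return out;
}

int main() {
    // Load data into a bulk key matrix (two keyed rows of length two).
    BulkKeyMatrix m = {{1, {1.0, 2.0}}, {2, {3.0, 4.0}}};

    // A linear algebra operation expressed as a map pattern: scale each row.
    BulkKeyMatrix scaled = MapPattern(m, [](long, const Row& r) {
        Row out(r.size());
        for (size_t i = 0; i < r.size(); ++i) out[i] = 2.0 * r[i];
        return out;
    });

    // Element-wise addition expressed as a join pattern on matching keys.
    BulkKeyMatrix sum = JoinPattern(m, scaled, [](const Row& x, const Row& y) {
        Row out(x.size());
        for (size_t i = 0; i < x.size(); ++i) out[i] = x[i] + y[i];
        return out;
    });

    for (const auto& [key, row] : sum)
        std::printf("key %ld -> [%g, %g]\n", key, row[0], row[1]);
    return 0;
}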
Subjects: Parallel programming (Computer science)
Matrices.
Computer algorithms.
Hong Kong Polytechnic University -- Dissertations
Pages: x, 71 pages : color illustrations
Appears in Collections: Thesis
