Title: Efficient big data system in the cloud : resource provisioning and scheduling
Authors: Yuan, Yi
Degree: Ph.D.
Issue Date: 2014
Abstract: Big data and big data analysis have achieved tremendous popularity recently. This new generation of systems, going beyond conventional data analysis with sampled data, represents a new era in data exploration and utilization across petabyte- and zettabyte-scale datasets. First proposed by Google, MapReduce has become the de facto standard framework for parallel processing in big data applications. Nevertheless, the MapReduce framework is also criticized for its inefficient performance, and many studies have investigated different aspects of the framework to improve it. A MapReduce system consumes resources such as computation, memory, disk, and network, and how these resources are managed determines the performance of the system. In this thesis, we investigate several challenges in managing resources for MapReduce systems in the cloud. We study resource management at two levels: the cluster level and the machine level. At the cluster level, we investigate challenges in building clusters in the cloud. At the machine level, we investigate challenges in scheduling the tasks of MapReduce jobs among machines. First, we focus on the resource provisioning problem for building clusters in the cloud. Running a MapReduce system requires a cluster consisting of a set of network-connected machines, and this infrastructure requirement prevents small companies from making use of big data techniques. The cloud offers these small and medium-sized companies an alternative, and cloud providers are glad to offer this service. However, we find that running MapReduce systems in the cloud is not straightforward. MapReduce systems are both computation-intensive and I/O-intensive, which causes severe interference if the machines used to build the cluster are not carefully chosen. We investigate this interference with detailed measurements, formulate it as a resource provisioning problem, and propose a set of novel algorithms to solve the problem.
Second, we focus on task scheduling after a cluster is built. A key feature distinguishing MapReduce from previous parallel models is that it interleaves parallel and sequential computation. MapReduce divides a job into map tasks and reduce tasks, which are parallelized across the cluster. However, reduce tasks must wait until all map tasks finish, because they rely on all the intermediate data produced by the map tasks. To fully utilize the cluster's computation resources, multiple MapReduce jobs of differing importance can be scheduled together. Scheduling these tasks efficiently is complicated. We mathematically formulate this task scheduling problem and develop a 3-approximation algorithm. Comprehensive simulations and real experiments demonstrate the advantage of our approach. Third, we study data locality for task scheduling in real MapReduce systems. Data locality is very important because data migration introduces substantial network traffic. We formulate a task scheduling problem that takes data locality into account and develop an algorithm whose solution is within a constant factor of the optimal solution. We further develop a heuristic algorithm that achieves better performance. We validate the advantage of our approaches with comprehensive simulations and real experiments.
Subjects: Big data.
Database management.
Cloud computing.
Hong Kong Polytechnic University -- Dissertations
Pages: xviii, 116 p. : ill. ; 30 cm.
Appears in Collections: Thesis
