Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/78082
Title: Towards efficient analytic query processing in main-memory column-stores
Authors: Xu, Wenjian
Advisors: Lo, Eric (COMP)
Keywords: Querying (Computer science)
Computer algorithms
Issue Date: 2018
Publisher: The Hong Kong Polytechnic University
Abstract: Recently, there has been a resurgence of interest in main-memory analytic databases, driven by the large RAM capacity of modern servers and the increasing demand for real-time analytic platforms. In such databases, operations like scan, sort and join are at the heart of almost every query plan. However, current implementations of these operations have not fully leveraged the new features (e.g., SIMD, multi-core) provided by modern hardware. The goal of this dissertation is to design efficient algorithms for scan, sort and join by judiciously exploiting every bit of RAM and all the available parallelism in each processing unit.
Scan is a crucial operation since it sits closest to the underlying data in the query plan. To accelerate scans, a state-of-the-art in-memory data layout chops each value into multiple bytes and achieves early stopping by comparing high-order bytes first. As column widths are usually not multiples of a byte, the last byte in such a layout is padded with 0's, wasting memory bandwidth and computation power. To fully leverage these resources, we propose to weave a secondary index into the vacant bits (i.e., the bits originally padded with 0's), forming our new storage layout. This layout enables skip-scan, a new fast scan that supports both data skipping and early stopping without any space overhead.
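To make the early-stopping idea concrete, below is a minimal scalar sketch (no SIMD, no data skipping) of a predicate scan over a byte-chopped layout; the layout, the names (BytePlanes, scan_less_than) and the 2-byte code width are illustrative assumptions, not taken from the thesis.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Each 2-byte code is split into two byte "planes": one array holding the
    // high-order byte of every code and one array holding the low-order byte.
    struct BytePlanes {
        std::vector<uint8_t> hi;
        std::vector<uint8_t> lo;
    };

    // Evaluate "code < literal" for every row.
    std::vector<bool> scan_less_than(const BytePlanes& col, uint16_t literal) {
        const uint8_t lit_hi = literal >> 8;
        const uint8_t lit_lo = literal & 0xFF;
        std::vector<bool> result(col.hi.size());
        for (size_t i = 0; i < col.hi.size(); ++i) {
            if (col.hi[i] != lit_hi) {
                // Early stop: the high-order bytes already decide the
                // comparison, so the low-order byte is never read.
                result[i] = col.hi[i] < lit_hi;
            } else {
                result[i] = col.lo[i] < lit_lo;
            }
        }
        return result;
    }

In a SIMD implementation the same test would typically be applied to a vector of codes at a time, and a block of low-order bytes is skipped only when the high-order bytes decide the predicate for every code in that block; the proposed skip-scan layout additionally uses the index woven into the padded bits to skip blocks that cannot match at all.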
With the advent of fast scans and denormalization techniques, sorting can become the new bottleneck. Queries with multiple attributes in clauses like GROUP BY, ORDER BY, and SQL:2003 PARTITION BY are common in real workloads. When executing such queries, state-of-the-art main-memory column-stores require one round of sorting per input column. To accelerate this kind of multi-column sorting, we propose a new technique called "code massaging", which manipulates the bits across the columns so that the overall sorting time can be reduced by eliminating some rounds of sorting and/or by improving the degree of SIMD data-level parallelism.
Join remains a time-consuming operation when the denormalization overhead is too large for denormalization to be applicable. Hash joins have been studied, improved, and reexamined over decades; their major optimization direction is to partition the input so that the working set fits into the caches, improving the locality of hash probing. As an alternative, we propose to utilize a secondary index to improve hash joins without physical partitioning. Specifically, in the build phase, hash values are scattered evenly into logical partitions of the hash table; in the probe phase, the secondary index serves as a hint to re-order the probing sequence, so that the locality of hash probing is increased.
Finally, we benchmark the performance of the proposed techniques in our column-store research prototype. Extensive experiments on benchmarks and real data show that our methods offer significant performance improvements over their counterparts. In addition, our methods show decent scalability on modern multi-core CPUs.
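The two sketches below illustrate, under simplifying assumptions, the two ideas described above; neither is the thesis's actual algorithm, and all names are illustrative.

First, one way a round of sorting can be eliminated for a two-column ORDER BY: when both columns carry order-preserving fixed-width codes, the codes can be packed into one wider key and sorted in a single pass instead of two stable sorting rounds. Code massaging manipulates bits across columns in a more general way, but the packed-key sort conveys the basic saving.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // ORDER BY (a, b), assuming a fits in 32 bits, b fits in 16 bits, and the
    // codes are order-preserving. Payload columns / row ids are omitted.
    void sort_two_columns(std::vector<uint32_t>& a, std::vector<uint32_t>& b) {
        std::vector<uint64_t> keys(a.size());
        for (size_t i = 0; i < a.size(); ++i)
            keys[i] = (uint64_t(a[i]) << 16) | b[i];   // a in high bits, b in low bits
        std::sort(keys.begin(), keys.end());           // one sorting round instead of two
        for (size_t i = 0; i < keys.size(); ++i) {
            a[i] = uint32_t(keys[i] >> 16);
            b[i] = uint32_t(keys[i] & 0xFFFF);
        }
    }

Second, a sketch of probing a hash table whose bucket array is only logically partitioned. The build phase places each key in a partition-sized slice of the array; the probe phase re-orders probe rows by their target partition (the role the secondary index plays as a hint in the thesis), so consecutive probes touch one cache-sized slice of the table without any physical partitioning.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <utility>
    #include <vector>

    struct PartitionedHashTable {
        static constexpr size_t kParts = 64;          // logical partitions (illustrative)
        static constexpr size_t kSlotsPerPart = 1024;
        struct Slot { uint64_t key; size_t row; bool used; };
        std::vector<Slot> slots = std::vector<Slot>(kParts * kSlotsPerPart);

        static size_t part_of(uint64_t key) {
            return std::hash<uint64_t>{}(key) % kParts;
        }
        void insert(uint64_t key, size_t row) {        // build phase
            size_t base = part_of(key) * kSlotsPerPart;
            size_t s = (std::hash<uint64_t>{}(key) / kParts) % kSlotsPerPart;
            while (slots[base + s].used)               // linear probing; assumes
                s = (s + 1) % kSlotsPerPart;           // the partition is not full
            slots[base + s] = {key, row, true};
        }
        bool find(uint64_t key, size_t* row) const {   // probe phase
            size_t base = part_of(key) * kSlotsPerPart;
            size_t s = (std::hash<uint64_t>{}(key) / kParts) % kSlotsPerPart;
            for (size_t n = 0; n < kSlotsPerPart; ++n, s = (s + 1) % kSlotsPerPart) {
                if (!slots[base + s].used) return false;
                if (slots[base + s].key == key) { *row = slots[base + s].row; return true; }
            }
            return false;
        }
    };

    // Re-order probe rows by logical partition before probing, so that probes
    // to the same partition are issued back to back.
    std::vector<std::pair<size_t, size_t>> probe_reordered(
            const PartitionedHashTable& ht, const std::vector<uint64_t>& keys) {
        std::vector<size_t> order(keys.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(), [&](size_t x, size_t y) {
            return PartitionedHashTable::part_of(keys[x]) <
                   PartitionedHashTable::part_of(keys[y]);
        });
        std::vector<std::pair<size_t, size_t>> matches;  // (probe row, build row)
        for (size_t i : order) {
            size_t build_row;
            if (ht.find(keys[i], &build_row)) matches.emplace_back(i, build_row);
        }
        return matches;
    }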
Description: xx, 150 pages : illustrations
PolyU Library Call No.: [THS] LG51 .H577P COMP 2018 XuW
URI: http://hdl.handle.net/10397/78082
Rights: All rights reserved.
Appears in Collections: Thesis

Files in This Item:
File | Description | Size | Format
991022141358003411_link.htm | For PolyU Users | 167 B | HTML
991022141358003411_pira.pdf | For All Users (Non-printable) | 4.33 MB | Adobe PDF
