Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/93464
Title: Analytical range query acceleration
Authors: Li, Zhe
Degree: Ph.D.
Issue Date: 2022
Abstract: With the ever-increasing data amount in daily business operations, efficiently retrieving query results has gotten considerably more difficult. This is especially true for analytical range queries that need to process a large portion of data to generate analytical results. In this thesis, we identify three analytical range queries, each of which is a read-only query containing a range parameter to limit the scope of the dataset or the output result size. We provide efficient algorithms to reduce their query response time.
The first query is range query on block-based storage systems, such as HDFS and Databricks. The range parameter here could be represented by a hyper- rectangle, which is used to select the required records from the multi-dimensional dataset. Due to the large result size, the IO and data scan are the bottlenecks for such a query. To reduce the query cost, existing approaches split the dataset into small partitions to avoid unnecessary data scan. These techniques mostly rely on historical queries to determine the partition layout (i.e., split positions). However, such query-driven approaches assume the future queries are identical to the historical, which is rarely the case in practice. In this work, we fill the research gap of query-driven partitioning when future queries are different from the historical but similar in general. We formally define the similarity and propose split functions to minimize the query cost for future queries. Experimental results show that our method could be up to 70 times more efficient than the state-of-the-art.
The second query is range aggregate query in one or two dimensions, such as COUNT, SUM, MIN, and MAX. The range parameter here could be expressed as an interval (for a single dimension) or a rectangle (in two dimensions). Then a selected aggregate query is executed on the records within this range. Such aggregate queries are frequently used for both analysts (in OLAP) and various OLTP scenarios. For example, Foursquare, with more than 50 million monthly active users, helps users find the number of specific POIs (e.g., restaurants) within given regions. In this work, we investigate how to provide approximate range aggregate query results efficiently with bounded errors. We offer an index-based solution which expresses discrete points with polynomial functions, in which we can provide the best tradeoff between query response time, accuracy, and space among all competitors.
The third query is similarity search on keyword-induced point groups, where the range parameter contains a value K to restrict the result size. Such that only the most similar K point groups are returned. A keyword-induced point group is formed by geo-positions (e.g., tweet's geo-tag) sharing the same keyword (e.g., the topic in a tweet). We found that if two keyword-induced points groups are close in Hausdorff distance, their keywords are highly likely to have semantic connections. Such information is crucial to targeted marketing and recommendation. However, as the time complexity of this similarity search is proportional to the query point group's size and the dataset size, analysts are unable to retrieve query results quickly. To speed up this query, we suggest a pruning-based method. Experiments on Twitter data show that our technique is up to 6 times faster than the state-of-the-art.
Subjects: Querying (Computer science)
Information retrieval -- Data processing
Hong Kong Polytechnic University -- Dissertations
Pages: xxii, 165 pages : color illustrations
Appears in Collections:Thesis

Show full item record

Page views

34
Last Week
0
Last month
Citations as of Apr 28, 2024

Google ScholarTM

Check


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.