Analytical range query acceleration

Li, Zhe

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/93464

Title:	Analytical range query acceleration
Authors:	Li, Zhe
Degree:	Ph.D.
Issue Date:	2022
Abstract:	With the ever-increasing data amount in daily business operations, eﬃciently retrieving query results has gotten considerably more diﬃcult. This is especially true for analytical range queries that need to process a large portion of data to generate analytical results. In this thesis, we identify three analytical range queries, each of which is a read-only query containing a range parameter to limit the scope of the dataset or the output result size. We provide eﬃcient algorithms to reduce their query response time. The first query is range query on block-based storage systems, such as HDFS and Databricks. The range parameter here could be represented by a hyper- rectangle, which is used to select the required records from the multi-dimensional dataset. Due to the large result size, the IO and data scan are the bottlenecks for such a query. To reduce the query cost, existing approaches split the dataset into small partitions to avoid unnecessary data scan. These techniques mostly rely on historical queries to determine the partition layout (i.e., split positions). However, such query-driven approaches assume the future queries are identical to the historical, which is rarely the case in practice. In this work, we fill the research gap of query-driven partitioning when future queries are diﬀerent from the historical but similar in general. We formally define the similarity and propose split functions to minimize the query cost for future queries. Experimental results show that our method could be up to 70 times more eﬃcient than the state-of-the-art. The second query is range aggregate query in one or two dimensions, such as COUNT, SUM, MIN, and MAX. The range parameter here could be expressed as an interval (for a single dimension) or a rectangle (in two dimensions). Then a selected aggregate query is executed on the records within this range. Such aggregate queries are frequently used for both analysts (in OLAP) and various OLTP scenarios. For example, Foursquare, with more than 50 million monthly active users, helps users find the number of specific POIs (e.g., restaurants) within given regions. In this work, we investigate how to provide approximate range aggregate query results eﬃciently with bounded errors. We oﬀer an index-based solution which expresses discrete points with polynomial functions, in which we can provide the best tradeoﬀ between query response time, accuracy, and space among all competitors. The third query is similarity search on keyword-induced point groups, where the range parameter contains a value K to restrict the result size. Such that only the most similar K point groups are returned. A keyword-induced point group is formed by geo-positions (e.g., tweet's geo-tag) sharing the same keyword (e.g., the topic in a tweet). We found that if two keyword-induced points groups are close in Hausdorﬀ distance, their keywords are highly likely to have semantic connections. Such information is crucial to targeted marketing and recommendation. However, as the time complexity of this similarity search is proportional to the query point group's size and the dataset size, analysts are unable to retrieve query results quickly. To speed up this query, we suggest a pruning-based method. Experiments on Twitter data show that our technique is up to 6 times faster than the state-of-the-art.
Subjects:	Querying (Computer science) Information retrieval -- Data processing Hong Kong Polytechnic University -- Dissertations
Pages:	xxii, 165 pages : color illustrations
Appears in Collections:	Thesis

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/11711

Show full item record

Page views

130

Last Week
5

Last month

Citations as of Dec 7, 2025

Google Scholar^TM

Check

Access

Page views

Google ScholarTM

Google Scholar^TM