Structural similarity on XML data and its applications

Ng, Kar-leung

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/84393

Title:	Structural similarity on XML data and its applications
Authors:	Ng, Kar-leung
Degree:	Ph.D.
Issue Date:	2007
Abstract:	This dissertation addresses the detection of structural similarity among XML (Extensible Markup Language) documents from heterogeneous sources, and its applications to querying and web mining. The topic has attracted much attention, and a number of similarity measures have been proposed in recent years. Unlike most distance metrics which are based on the direct transformation between documents, a successful similarity measure should assign higher scores to documents of similar types. To address the problem, we detect and analyze the conformity of a document against a schema that governs the document structure. The goal of our study is therefore to investigate the issues involved in defining a structural measure that supports the detection of documents of similar types. (1) We first present a formal framework for defining the structural similarity of a document against a schema. We illustrate that the choice of schema language, DTD or XML Schema, does not constitute a major difference in the framework. (2) We extend the framework to compare documents without the prerequisite of a schema. Structural similarity has a wide variety of applications in automatic document processing. In the second half of the dissertation, we demonstrate its applicability to XML indexing, proximity querying and group detection using clustering. We first propose RRSi, a novel structural index designed for structure-based query lookup on heterogeneous sources of XML documents, supporting proximate query answers. The index successfully avoids the redundant processing of structurally irrelevant candidates that might show good content relevance. An optimized version of the index, oRRSi, is also developed to further reduce both space and computational complexity. To the best of our knowledge, these structural indexes are the first work supporting proximity twig queries on XML documents.
The experimental results show that RRSi- and oRRSi-based query processing significantly outperforms previously proposed techniques on XML repositories with structural heterogeneity. We then examine the applicability of structural similarity to web mining. A sitemap is a convenient navigation link system reflecting the true key structure of a website, and has become a standard website feature. Although website owners may choose to present their services or information in a variety of ways, a certain level of similarity in web structure and content is often observed for websites in the same domain, since they typically follow some evolved community standard. Clustering sitemaps by structure helps to detect groups of websites in identical domains and is complementary to link-based ranking algorithms. We examine in this dissertation how to cluster sitemaps as tree-structured documents. We introduce a new similarity measure between sitemaps that reflects their key characteristics in the scoring. Moreover, the measure supports a centroid-based clustering algorithm that avoids pair-wise comparisons and achieves a significant gain in efficiency. We implemented the proposed clustering algorithm and ran extensive experiments on real and synthetic datasets, showing its effectiveness and efficiency over other clustering algorithms based on previous similarity metrics.
Subjects:	Hong Kong Polytechnic University -- Dissertations; XML (Document markup language); Data structures (Computer science)
Pages:	xiii, 154 p. : ill. ; 30 cm.
Appears in Collections:	Thesis

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/2634
