Title: Structural similarity on XML data and its applications
Authors: Ng, Kar-leung
Keywords: Hong Kong Polytechnic University -- Dissertations
XML (Document markup language)
Data structures (Computer science)
Issue Date: 2007
Publisher: The Hong Kong Polytechnic University
Abstract: This dissertation addresses the problem of detecting the structural similarity of XML (Extensible Markup Language) documents from heterogeneous sources, and its applications to querying and web mining. The topic has attracted much attention, and a number of similarity measures have been proposed in recent years. Unlike most distance metrics, which are based on the direct transformation between documents, a successful similarity measure should be able to assign higher scores to documents of similar types. To address the problem, we detect and analyze the conformity of a document against a schema that governs the document structure. The goal of our study is therefore to investigate the issues involved in defining a structural measure that supports the detection of documents of similar types. (1) We first present a formal framework for defining the structural similarity of a document against a schema. We illustrate that the choice of schema language, DTD or XML Schema, does not constitute a major difference in the framework. (2) We extend the framework to compare documents without the prerequisite of a schema. Structural similarity has a wide variety of applications in automatic document processing. In the second half of the dissertation, we demonstrate its applicability to XML indexing, proximity querying, and group detection using clustering. We first propose RRSi, a novel structural index designed for structure-based query lookup on heterogeneous sources of XML documents, supporting proximate query answers. The index avoids the redundant processing of structurally irrelevant candidates that might show good content relevance. An optimized version of the index, oRRSi, is also developed to further reduce both space and computational complexity. To the best of our knowledge, these structural indexes are the first work supporting proximity twig queries on XML documents.
The experimental results show that RRSi- and oRRSi-based query processing significantly outperforms previously proposed techniques on XML repositories with structural heterogeneity. We then examine the applicability of structural similarity in the area of web mining. A sitemap is a convenient navigation link system reflecting the key structure of a website, and has become a standard website feature. Although website owners may choose to present their services or information in a variety of ways, a certain level of similarity in web structure and content is often observed for websites in the same domain, since they typically follow some evolved community standard. Clustering sitemaps by structure helps to detect groups of websites in identical domains and is complementary to link-based ranking algorithms. We examine in this dissertation how to cluster sitemaps as tree-structured documents. We introduce a new similarity measure between sitemaps, which reflects their key characteristics in the scoring. Moreover, the measure supports a centroid-based clustering algorithm that avoids pair-wise comparisons and thereby achieves a significant gain in efficiency. We implemented the proposed clustering algorithm and ran extensive experiments on real and synthetic datasets, showing its effectiveness and efficiency over other clustering algorithms based on previous similarity metrics.
Description: xiii, 154 p. : ill. ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577P COMP 2007 Ng
Rights: All rights reserved.
Appears in Collections: Thesis

Files in This Item:
File: b21167539_link.htm; Description: For PolyU Users; Size: 162 B; Format: HTML
File: b21167539_ir.pdf; Description: For All Users (Non-printable); Size: 1.66 MB; Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.