Title: Modeling and querying probabilistic RDFS data with correlated triples using Bayesian networks
Authors: Szeto, Chi Cheong
Keywords: RDF (Document markup language); Semantic Web; Hong Kong Polytechnic University -- Dissertations
Issue Date: 2014
Publisher: The Hong Kong Polytechnic University
Abstract: Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) data model for the Semantic Web. RDF data are RDF triples, and an RDF triple is a triple (subject, property, object). RDF Schema (RDFS) extends RDF by providing a vocabulary to describe application-specific classes and properties, class and property hierarchies, and which classes and properties are used together. RDFS reasoning leverages the vocabulary to derive additional RDF triples from the data. In recent years, probabilistic models for RDF have been proposed to better represent real-life information, which is full of uncertainties. Existing models either have limited capabilities to model correlated data or ignore the semantics of the data. We argue that being able to model correlated RDF data is necessary. First, RDF data using the RDFS vocabulary are correlated. Second, correlated data occur in practice. Hence, we introduce a probabilistic model called probabilistic RDFS (pRDFS), which encodes statistical relationships among correlated RDF triples and satisfies the RDFS semantics. Representing and performing probabilistic inference on correlated data are expensive. We use Bayesian networks to represent the correlated data and probabilistic logic sampling to perform approximate inference. Since there may exist some truth value assignments that violate the RDFS semantics, we devise a consistency checking algorithm for pRDFS. The algorithm checks that the probabilities of all inconsistent truth value assignments for the correlated RDF triples are zero. It is executed once on static data. For data that are frequently updated, we propose an incremental approach that provides fast rechecking each time the data are updated. SPARQL is a W3C query language for RDF. The pattern of a SPARQL query is a conjunction of triple patterns, and a triple pattern is an RDF triple any member of which can be replaced with a variable.
A solution to the query is the bindings of the query variables such that the query pattern matches the data or the data derived through the RDFS reasoning. We extend the query by including truth values in the triple patterns to match the uncertain data. Apart from the bindings of the query variables, an answer to the extended query includes the probability of the bindings, which is equal to the probability of the matched data. pRDFS fully specifies the probability distribution of declared data, but not derived data. A single probability value may not be able to specify the probability of the matched data containing derived data, and we show how to compute the probability bounds of the matched data in this case. Finally, we present an experimental evaluation of the running time performance of our proposed algorithms with respect to the data size, the percentage of uncertain data, the size of correlated data (by varying the number of nodes in a Bayesian network), and the complexity of the probability distributions (by varying the degree of nodes in a network). The algorithms were tested on the Berlin SPARQL Benchmark, the Lehigh University Benchmark, and random uncertain data.
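Illustrative sketch (not taken from the thesis): the abstract's two core ideas, sampling truth values from a Bayesian network over correlated triples and checking that inconsistent truth value assignments have zero probability, can be tried on a toy example. Everything below is an assumption made for illustration: the `ex:` triples, the probabilities, the function names, and the reading that an assignment is inconsistent when a triple marked false is derivable from triples marked true. The consistency check here is a brute-force enumeration, not the algorithm proposed in the thesis.

```python
import itertools
import random

# Hypothetical uncertain triples; the certain schema triple
# (ex:Student rdfs:subClassOf ex:Person) is assumed to hold.
T1 = ("ex:alice", "rdf:type", "ex:Student")
T2 = ("ex:alice", "ex:takes", "ex:course1")
T3 = ("ex:alice", "rdf:type", "ex:Person")

# Toy Bayesian network in topological order: (triple, parents, CPT).
# Each CPT maps a tuple of parent truth values to P(triple is true).
NETWORK = [
    (T1, (), {(): 0.7}),
    (T2, (T1,), {(True,): 0.9, (False,): 0.2}),
    (T3, (T1,), {(True,): 1.0, (False,): 0.5}),
]

def sample_world(rng):
    """Draw one truth value assignment, sampling each node given its parents."""
    world = {}
    for triple, parents, cpt in NETWORK:
        p_true = cpt[tuple(world[p] for p in parents)]
        world[triple] = rng.random() < p_true
    return world

def estimate(query, evidence=None, n=20000, seed=0):
    """Logic sampling: discard samples that contradict the evidence, then
    return the fraction of the remaining samples that match the query."""
    rng = random.Random(seed)
    evidence = evidence or {}
    kept = matched = 0
    for _ in range(n):
        world = sample_world(rng)
        if any(world[t] != v for t, v in evidence.items()):
            continue
        kept += 1
        matched += all(world[t] == v for t, v in query.items())
    return matched / kept if kept else float("nan")

def joint_probability(world):
    """Exact probability of a full assignment (product of CPT entries)."""
    prob = 1.0
    for triple, parents, cpt in NETWORK:
        p_true = cpt[tuple(world[p] for p in parents)]
        prob *= p_true if world[triple] else 1.0 - p_true
    return prob

def is_inconsistent(world):
    """Assumed rule: alice typed Student implies alice typed Person,
    so T1 true together with T3 false violates the RDFS semantics."""
    return world[T1] and not world[T3]

def is_consistent():
    """Brute force: every inconsistent assignment must have probability 0."""
    triples = [triple for triple, _, _ in NETWORK]
    for values in itertools.product([True, False], repeat=len(triples)):
        world = dict(zip(triples, values))
        if is_inconsistent(world) and joint_probability(world) > 0.0:
            return False
    return True
```

Under these assumed numbers, the exact probability that both T1 and T2 are true is 0.7 × 0.9 = 0.63, so `estimate({T1: True, T2: True})` should land close to that value. The enumeration in `is_consistent` grows exponentially with the number of correlated triples, which is why it only serves as a toy check here.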
Description: xv, 119 p. : ill. ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577P COMP 2014 Szeto
Rights: All rights reserved.
Appears in Collections: Thesis

Files in This Item:
File                Description                    Size     Format
b2747284x_link.htm  For PolyU Users                203 B    HTML
b2747284x_ir.pdf    For All Users (Non-printable)  1.19 MB  Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.