Title: Preprocessing frameworks for threaded discussion analysis by graphical probabilistic modeling
Authors: Sze, Chun-ming Donahue
Degree: M.Phil.
Issue Date: 2009
Abstract: User-generated content (UGC) has become the fastest growing sector of the World Wide Web. Today, one major type of massive UGC data is generated by web forums. The web forum, similar to USENET, is a bulletin board that users commonly use to exchange ideas, publish topics, or simply send replies via an HTML-based browser. Since almost all computers come with a pre-installed browser and forums can be easily accessed, the web forum has become more popular and is considered a significant contributor of UGC data. With the growing importance of such web forum data, there is an increasing and compelling need to develop techniques to help analyze these large volumes of data, for example by grouping them in a meaningful and user-friendly manner. Recently, Bayesian methods have grown from a specialist niche to the mainstream in the field of pattern recognition and machine learning. The graphical probabilistic model (GPM), which combines probability theory and graph theory, offers numerous useful properties for analyzing data by using diagrammatic representations of probability distributions under the Bayesian perspective. By using effective algorithms like Gibbs sampling, one may formulate topical problems (e.g. hot topics in a forum) as a latent variable model and obtain quality results in a tractable manner. In addition, we may also infer the relationships between different textual type variables (e.g. author, entity, word, and sentiment) in Markov random fields. One of the easiest ways to analyze a web forum is to convert a post or a thread directly into a bag-of-words (BOW) vector space representation and then apply a graphical probabilistic modeling technique, for instance latent variable modeling (for topical modeling) or Markov random fields (for non-topical modeling). However, transforming threaded text into a bag of words may lead to a serious loss of important information, making the analysis or mining process ineffective.
By using different graph models and inference techniques, we can develop a set of preprocessing frameworks to facilitate the analysis of web forum data. In topical modeling, we propose a framework for word-thread matrix formation. In order to provide a more representative bag of words for latent variable modeling, our framework is designed to measure both implicit and explicit relationships between posts and replies. It consists of two parts. In the first part, a threaded text is transformed into a directed acyclic graph (DAG) by a set of feature link generation functions. In the second part, different graph-based ranking algorithms can be applied. Our framework then extracts a list of words by weighting the importance ranking value with a traditional feature selection method. In non-topical modeling, on the other hand, we propose a distributional similarity model (DSM) to analyze the relationships between different textual type variables of a thread in Markov random fields. This model is employed to measure not only co-occurrence but also distributional similarity at the different types of distance levels commonly found in threaded text. Empirical results obtained for popular Hong Kong web forums show that the proposed methods are effective.
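The two-part topical framework described in the abstract (thread to DAG, then graph-based ranking and word weighting) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes only explicit reply-to links as the link generation function, uses PageRank by power iteration as one possible graph-based ranking algorithm, and substitutes raw term frequency for the unspecified feature selection method. All function names and the sample thread are hypothetical.

```python
from collections import Counter, defaultdict


def build_reply_graph(posts):
    """Build a DAG {post_id: [linked ids]} from explicit reply links only (assumption)."""
    graph = {p["id"]: [] for p in posts}
    for p in posts:
        parent = p.get("reply_to")
        if parent is not None:
            graph[p["id"]].append(parent)  # edge points from reply to its parent
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; stands in for any graph-based ranking."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in graph.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node (e.g. the root post): spread its rank evenly.
                share = damping * rank[v] / n
                for u in nodes:
                    new_rank[u] += share
        rank = new_rank
    return rank


def weighted_bag_of_words(posts, rank):
    """Weight each word's term frequency by the rank of the post it appears in."""
    weights = defaultdict(float)
    for p in posts:
        counts = Counter(p["text"].lower().split())
        for word, tf in counts.items():
            weights[word] += rank[p["id"]] * tf
    return dict(weights)


# Hypothetical four-post thread: posts 2 and 3 reply to 1, post 4 replies to 2.
posts = [
    {"id": 1, "reply_to": None, "text": "best phone camera"},
    {"id": 2, "reply_to": 1, "text": "camera quality matters"},
    {"id": 3, "reply_to": 1, "text": "battery matters more"},
    {"id": 4, "reply_to": 2, "text": "agree camera"},
]
graph = build_reply_graph(posts)
rank = pagerank(graph)
bow = weighted_bag_of_words(posts, rank)
top_words = sorted(bow, key=bow.get, reverse=True)[:3]
```

In this sketch, words from posts that attract many replies receive larger weights than they would in a flat bag of words, which is the kind of thread structure information the abstract says a plain BOW conversion loses.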
Subjects: Hong Kong Polytechnic University -- Dissertations.
Graphical modeling (Statistics)
User-generated content.
World Wide Web.
Pages: 103 leaves : ill. (some col.) ; 30 cm.
Appears in Collections: Thesis

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.