New document-context term weights and clustering for information retrieval

Dang, Kai-fung Edward

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/84356

Title:	New document-context term weights and clustering for information retrieval
Authors:	Dang, Kai-fung Edward
Degree:	Ph.D.
Issue Date:	2010
Abstract:	In this thesis we investigate new methods to deal with the polysemy and word mismatch problems in information retrieval (IR). We tackle polysemy by using 'document-contexts', which are text windows centred on query terms in a document. Analysis of the words in the vicinity of a query term can identify its specific meaning in the context. In IR, many of the commonly used term weights are variants of the TF-IDF form. The tradition TF-IDF weight of a term depends only on the occurrence statistics of the term itself. We have studied a novel 'context-dependent' term weight, which incorporates information based on the words found in the document-contexts of a term. These term weights are generated by a Boost and Discount (B&D) procedure, which utilizes any relevance information that is available to estimate the probability of relevance of a context. Such relevance information may come from actual relevance judgments that a user makes on a (small) number of documents, as in 'relevance feedback' (RF). The theoretical justification of our scheme to calculate the new term weights is provided by a probabilistic non-relevance decision model of IR. We present experiments in the RF setting to test the context-dependent term weights. We demonstrate that using the new term weights can yield statistically significant improvement in retrieval compared with the traditional weights. Regarding the word mismatch problem, one plausible solution is to use clustering techniques. A traditional clustering evaluation measure used in IR is the MK1, which is a score calculated for the single 'optimal cluster' that can be extracted from the clus-tering result. MK1 is appropriate if a single retrieved cluster is desired. However, in some applications it may be desirable for the retrieval results to be presented in multiple clusters according to sub-topics. For this case, we introduce a new evaluation measure, called CS, which corresponds to finding an optimal combination of clusters. We define a sub-class of CS, called CS1, applicable when the clusters are disjoint. By reformulating the optimization to a 0-1 linear fractional programming problem, we demonstrate that an exact solution of CS1 can be obtained by a linear time algorithm. We discuss how our approach can be generalized to overlapping clusters, and present greedy algorithms to obtain optimal estimates. We claim that one particular 'cost effectiveness' algorithm yields the global optimal solution for clusters that overlap only by nesting. A mathematical proof of this claim by induction is presented. We have also investigated whether clustering techniques can further improve the retrieval effectiveness in relevance feedback using context-dependent term weights. B&D utilizes information extracted from the judged documents to provide evidence of relevance or non-relevance in the unseen documents. We use clustering to seek contexts from unseen documents that are similar to those in the judged documents. In this way, additional relevance information can be obtained for B&D. Experiments on the TREC-2005 collection show that a 'clustered SVM' scheme is effective in further improving relevance feedback effectiveness as compared to standard B&D, yielding small but statistically significant improvements in MAP. Thus, this is a promising direction for further research.
Subjects:	Hong Kong Polytechnic University -- Dissertations Information retrieval
Pages:	xiii, 160 p. : ill. ; 30 cm.
Appears in Collections:	Thesis

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/5901

Show full item record

Page views

146

Last Week
0

Last month

Citations as of Jun 22, 2025

Google Scholar^TM

Check

Access

Page views

Google ScholarTM

Google Scholar^TM