PolyU IR
 


Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/4084

Title: New document-context term weights and clustering for information retrieval
Authors: Dang, Kai-fung Edward
Subjects: Hong Kong Polytechnic University -- Dissertations
Information retrieval
Issue Date: 2010
Publisher: The Hong Kong Polytechnic University
Abstract: In this thesis we investigate new methods to deal with the polysemy and word mismatch problems in information retrieval (IR). We tackle polysemy by using 'document-contexts', which are text windows centred on query terms in a document. Analysis of the words in the vicinity of a query term can identify its specific meaning in the context. In IR, many of the commonly used term weights are variants of the TF-IDF form. The traditional TF-IDF weight of a term depends only on the occurrence statistics of the term itself. We have studied a novel 'context-dependent' term weight, which incorporates information based on the words found in the document-contexts of a term. These term weights are generated by a Boost and Discount (B&D) procedure, which utilizes any relevance information that is available to estimate the probability of relevance of a context. Such relevance information may come from actual relevance judgments that a user makes on a (small) number of documents, as in 'relevance feedback' (RF). The theoretical justification of our scheme to calculate the new term weights is provided by a probabilistic non-relevance decision model of IR. We present experiments in the RF setting to test the context-dependent term weights. We demonstrate that using the new term weights can yield statistically significant improvement in retrieval compared with the traditional weights. Regarding the word mismatch problem, one plausible solution is to use clustering techniques. A traditional clustering evaluation measure used in IR is MK1, which is a score calculated for the single 'optimal cluster' that can be extracted from the clustering result. MK1 is appropriate if a single retrieved cluster is desired. However, in some applications it may be desirable for the retrieval results to be presented in multiple clusters according to sub-topics. For this case, we introduce a new evaluation measure, called CS, which corresponds to finding an optimal combination of clusters.
We define a sub-class of CS, called CS1, applicable when the clusters are disjoint. By reformulating the optimization as a 0-1 linear fractional programming problem, we demonstrate that an exact solution of CS1 can be obtained by a linear-time algorithm. We discuss how our approach can be generalized to overlapping clusters, and present greedy algorithms to obtain optimal estimates. We claim that one particular 'cost effectiveness' algorithm yields the globally optimal solution for clusters that overlap only by nesting. A mathematical proof of this claim by induction is presented. We have also investigated whether clustering techniques can further improve the retrieval effectiveness in relevance feedback using context-dependent term weights. B&D utilizes information extracted from the judged documents to provide evidence of relevance or non-relevance in the unseen documents. We use clustering to seek contexts from unseen documents that are similar to those in the judged documents. In this way, additional relevance information can be obtained for B&D. Experiments on the TREC-2005 collection show that a 'clustered SVM' scheme is effective in further improving relevance feedback effectiveness as compared to standard B&D, yielding small but statistically significant improvements in MAP. Thus, this is a promising direction for further research.
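Illustrative note (not taken from the thesis): the abstract describes choosing an optimal combination of disjoint clusters as a 0-1 linear fractional programming problem. The sketch below shows what such a problem can look like, under two loudly stated assumptions: that the score being maximised is an F-type ratio, 2 x (relevant documents selected) / (total relevant + documents selected), and that it is solved with Dinkelbach's iteration. The thesis's actual CS1 definition and its linear-time algorithm are not reproduced here, and the function name and input format are hypothetical.

```python
from fractions import Fraction


def best_cluster_combination(clusters, total_relevant):
    """Pick disjoint clusters maximising F = 2*rel / (total_relevant + size).

    ASSUMPTION: this F-type score and Dinkelbach's iteration are stand-ins
    chosen for illustration; they are not the thesis's CS1 or its algorithm.

    clusters: list of (relevant_docs_in_cluster, cluster_size) pairs.
    total_relevant: number of relevant documents overall (must be > 0).
    Returns (list of chosen cluster indices, best score as a Fraction).
    """
    lam = Fraction(0)  # current guess for the optimal ratio
    while True:
        # For a fixed lam, maximising sum x_i * (2*r_i - lam*n_i) over 0-1
        # variables separates: keep the clusters with a positive contribution.
        chosen = [i for i, (r, n) in enumerate(clusters) if 2 * r - lam * n > 0]
        rel = sum(clusters[i][0] for i in chosen)
        size = sum(clusters[i][1] for i in chosen)
        new_lam = Fraction(2 * rel, total_relevant + size)
        if new_lam == lam:  # the ratio stopped improving, so lam is optimal
            return chosen, lam
        lam = new_lam


# Four disjoint clusters given as (relevant, size); 6 relevant documents overall.
chosen, score = best_cluster_combination([(3, 4), (1, 5), (2, 2), (0, 3)], 6)
print(chosen, score)  # prints: [0, 2] 5/6
```

Exact rational arithmetic is used so that the stopping test is an exact equality rather than a floating-point tolerance. Each pass is linear in the number of clusters, but the number of passes is not bounded by a constant, so this sketch should not be read as the linear-time method the abstract refers to.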
Degree: Ph.D., Dept. of Computing, The Hong Kong Polytechnic University, 2010
Description: xiii, 160 p. : ill. ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577P COMP 2010 Dang
Rights: All rights reserved.
Type: Thesis
URI: http://hdl.handle.net/10397/4084
Appears in Collections: COMP Theses
PolyU Electronic Theses

Files in This Item:

File                 Description                     Size      Format
b23930342_ir.pdf     For All Users (Non-printable)   1.42 MB   Adobe PDF
b23930342_link.htm   For PolyU Users                 162 B     HTML





All items in the PolyU Institutional Repository are protected by copyright, with all rights reserved, unless otherwise indicated.
No item in the PolyU IR may be reproduced for commercial or resale purposes.

 

© Pao Yue-kong Library, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong