Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/3042
Title: Ai-Times : a parallel web news retrieval system
Authors: Luo, Weidong
Subjects: Hong Kong Polytechnic University -- Dissertations
News Web sites
Web search engines
Information storage and retrieval systems -- Newspapers
Issue Date: 2007
Publisher: The Hong Kong Polytechnic University
Abstract: The explosion of online information easily accessible through the Internet is a reality. As the available information increases, the inability to process, assimilate and use such large amounts of information becomes more and more apparent. Online news suffers from the same problems. Currently available web news retrieval systems struggle because web-based news retrieval requires the ability to quickly and accurately process very large amounts of data that are constantly being updated. In this thesis, we present the design and implementation of Ai-Times, a parallel web news retrieval system whose goal is to accurately retrieve and organize web news information. This version of Ai-Times introduces the following novel algorithms: an optimized crawler algorithm whose fetching speed is 6 times faster than that of a traditional crawler; a keen tag-based extraction algorithm which can extract data-rich content with minimal manual effort and which also allows data to be classified as important or not important, so that the crawler can revisit and update important data; a modified vector space model improved using query expansion and term reweighting; and, as the most valuable contribution, a modified MapReduce improved by estimating the execution time of each subtask, which is proven to reduce the number of unusual tasks and shorten the whole job execution time.
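The abstract's idea of using per-subtask execution-time estimates to shorten a job can be illustrated with a small sketch. This record does not describe the thesis's actual algorithm, so the code below substitutes a standard greedy longest-first assignment; the function names, the task ids and the estimate values are all hypothetical.

```python
import heapq


def schedule_by_estimate(estimates, n_workers):
    """Assign subtasks to workers, longest estimated run time first.

    estimates: dict of subtask id -> estimated run time (hypothetical units).
    Returns (assignment, makespan); assignment maps worker index -> task ids.
    This is a generic greedy heuristic, not the method from the thesis.
    """
    # Min-heap of (current load, worker index): the least-loaded worker pops first.
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    # Place long subtasks early so none is left to run alone at the end.
    ordered = sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
    for task, cost in ordered:
        load, w = heapq.heappop(heap)
        assignment[w].append(task)
        heapq.heappush(heap, (load + cost, w))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


def round_robin_makespan(estimates, n_workers):
    """Baseline: hand out subtasks in arrival order, ignoring the estimates."""
    loads = [0.0] * n_workers
    for i, cost in enumerate(estimates.values()):
        loads[i % n_workers] += cost
    return max(loads)


# Hypothetical estimates: two long subtasks among several short ones.
estimates = {"t1": 6, "t2": 1, "t3": 6, "t4": 1, "t5": 1, "t6": 1}
assignment, makespan = schedule_by_estimate(estimates, n_workers=2)
baseline = round_robin_makespan(estimates, n_workers=2)
print(makespan, baseline)
```

With these made-up numbers the estimate-aware assignment finishes in 8 time units, while the order-based baseline puts both long subtasks on one worker and needs 13.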
Description: x, 88 leaves : ill. ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577M COMP 2007 Luo
Rights: All rights reserved.
Type: Thesis
URI: http://hdl.handle.net/10397/3042
Appears in Collections: COMP Theses
PolyU Electronic Theses

Files in This Item:
File | Description | Size | Format
b21459344_link.htm | For PolyU Users | 162 B | HTML
b21459344_ir.pdf | For All Users (Non-printable) | 2.09 MB | Adobe PDF


All items in the PolyU Institutional Repository are protected by copyright, with all rights reserved, unless otherwise indicated. No item in the PolyU IR may be reproduced for commercial or resale purposes.