|Title: ||Advanced techniques for Chinese chunk segmentation and the similarity measure of Chinese sentences|
|Authors: ||Wang, Rongbo|
|Subjects: ||Hong Kong Polytechnic University -- Dissertations|
Chinese language -- Data processing
Chinese language -- Sentences
Chinese language -- Word formation
Chinese language -- Machine translating
|Issue Date: ||2006 |
|Publisher: ||The Hong Kong Polytechnic University|
|Abstract: ||This thesis addresses two important problems in Chinese information processing: Chinese chunk segmentation and the similarity measure of Chinese sentences. The three main contributions reported in this thesis are: (1) a novel Chinese chunk segmentation technique that combines a statistical model with correction rules generated by an error-correction mechanism; (2) a novel similarity measure of Chinese sentences that uses both the word/chunk sequences and the POS (part-of-speech) tag sequences of the sentences; and (3) the optimization of the parameters used in the combined similarity measure by applying a relevance feedback technique and a neural network model. In the first investigation, a statistical model combined with correction rules generated by an error-correction mechanism is proposed for Chinese chunk segmentation. The Chinese sentences in the training corpus were chunk-segmented manually to provide a ground truth for training the statistical model, which then produces preliminary chunk segmentation results. These preliminary results (both correctly and incorrectly segmented chunks) are used to generate a set of correction rules for refining the segmentation. The rules are generated by an error-correction mechanism that compares the preliminary segmentation result with the manually segmented result. The statistical model and the learned correction rules can then be used to perform Chinese chunk segmentation of unseen sentences.
In the second investigation, novel similarity measures of Chinese sentences are proposed that use the word/chunk sequences and POS tag sequences of the sentences. The sentence similarity measure is a very important component of example-based machine translation (EBMT). Unlike English sentences, Chinese sentences have no delimiters between words. Hence, Chinese word/chunk delimitation must be performed before a sentence similarity measure can be computed. Both the word/chunk sequence feature and the POS tag sequence feature used in our proposed similarity measures depend on word/chunk segmentation. Sentence structure information is partially reflected in the POS tag sequence. For the proposed word-sequence-matching-based (WSMB) method, we take three factors between two sentences into consideration: the number of identical word sequences, the length of each identical word sequence, and the average weighting (AW) of each identical word sequence. In computing AW, we weight every POS tag according to its importance. The POS-tag-sequence-matching-based (PTSMB) method measures the similarity of Chinese sentences in terms of their structures. If the constituents of two Chinese sentences are similar, we judge that the two sentences are similar in structure. The main idea of this similarity measure is to match the POS tags of two Chinese sentences using directed graphs. The POS weighting is also used in this process.|
In the third investigation, we propose a human-computer interaction approach, based on a relevance feedback scheme and a neural network model, to optimize the parameters used in the combined similarity measure of Chinese sentences. In the relevance feedback process, users' intentions and preferences in ranking the candidate sentences are captured and used to modify the parameters of the similarity measure. For the parameter optimization research, a web-based questionnaire was designed to collect users' feedback data. In this pioneering study, we constructed 50 groups of sentences. Each group contains one source sentence and ten candidate sentences to be retrieved. The ten candidate sentences are shown in descending order of similarity to the source sentence. If the user does not agree with the ranking produced by the computer, he or she is asked to provide a new ranking according to his or her own judgment. The new ranking is converted to a set of numerals and stored in a database for parameter optimization using a neural network model. One clear advantage of this approach is its ability to fine-tune the measure to reflect users' preferences in matching Chinese sentences. Experimental results show a visible improvement in the performance of the similarity measure. In addition to the theoretical and experimental studies of Chinese chunk segmentation and the similarity measure of Chinese sentences, we implemented both in an EBMT prototype, in which we also addressed other issues such as data structures, sentence indexing, and user-friendly interface design.
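The abstract names the three factors used by the WSMB method but gives no formula. The following is only an illustrative sketch of how such factors could be combined, not the thesis's actual measure. It assumes sentences are already segmented and POS-tagged, finds identical sequences with Python's standard `difflib.SequenceMatcher`, and scores each sequence as its length times its average POS weight. The names `POS_WEIGHT`, `DEFAULT_WEIGHT`, and `wsmb_like_similarity`, the weight values, and the normalization are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical POS weights: the abstract says tags are weighted by importance,
# but does not give the values, so these numbers are placeholders.
POS_WEIGHT = {"n": 1.0, "v": 1.0, "a": 0.8, "d": 0.5, "p": 0.3, "u": 0.2}
DEFAULT_WEIGHT = 0.5  # assumed weight for any tag not listed above


def weight(token):
    """Return the assumed weight of a (word, pos) token."""
    _, pos = token
    return POS_WEIGHT.get(pos, DEFAULT_WEIGHT)


def common_sequences(a, b):
    """Return the non-overlapping runs of identical (word, pos) tokens."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size]
            for m in matcher.get_matching_blocks() if m.size > 0]


def average_weight(seq):
    """Average POS weight (AW) of one identical sequence."""
    return sum(weight(tok) for tok in seq) / len(seq)


def wsmb_like_similarity(a, b):
    """Score two segmented, tagged sentences in the range [0, 1].

    Each identical sequence contributes length * AW; summing over the
    sequences accounts for how many there are. The Dice-style
    normalization is an assumption of this sketch.
    """
    if not a or not b:
        return 0.0
    matched = sum(len(seq) * average_weight(seq)
                  for seq in common_sequences(a, b))
    total = sum(weight(tok) for tok in a) + sum(weight(tok) for tok in b)
    return 2 * matched / total


if __name__ == "__main__":
    s1 = [("我", "r"), ("喜欢", "v"), ("苹果", "n")]
    s2 = [("我", "r"), ("喜欢", "v"), ("香蕉", "n")]
    print(round(wsmb_like_similarity(s1, s2), 2))  # prints 0.6
```

In this sketch the two example sentences share one identical sequence of length 2, so the score depends only on that sequence's weighted length relative to the weighted length of both sentences. Replacing the placeholder weights with values learned from user feedback would correspond loosely to the parameter optimization described in the third investigation.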
|Degree: ||Ph.D., Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University, 2006|
|Description: ||xviii, 156 leaves : ill. ; 30 cm.|
PolyU Library Call No.: [THS] LG51 .H577P EIE 2006 Wang
|Rights: ||All rights reserved.|
|Appears in Collections:||EIE Theses|
PolyU Electronic Theses