Advanced techniques for Chinese chunk segmentation and the similarity measure of Chinese sentences

Wang, Rongbo

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/83770

DC Field	Value	Language
dc.contributor	Department of Electronic and Information Engineering	-
dc.creator	Wang, Rongbo	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/2660	-
dc.language.iso	English	-
dc.title	Advanced techniques for Chinese chunk segmentation and the similarity measure of Chinese sentences	-
dc.type	Thesis	-
dcterms.abstract	This thesis addresses two important problems in Chinese information processing, namely Chinese chunk segmentation and the similarity measure of Chinese sentences. The three main contributions reported in this thesis are: (1) a novel Chinese chunk segmentation technique using a statistical model combined with correction rules generated using an error-correction mechanism; (2) a novel similarity measure of Chinese sentences using both word/chunk sequences and POS (Part of Speech) tag sequences of Chinese sentences; and (3) the optimization of parameters used in the combined similarity measure approach by applying a relevance feedback technique and a neural network model. In the first investigation, a statistical model combined with correction rules generated by an error-correction mechanism is proposed for Chinese chunk segmentation. The Chinese sentences in the training corpus were segmented into chunks manually to provide a ground truth for training the statistical model, which then produces preliminary chunk segmentation results. The chunk segmentation result (correctly and incorrectly segmented chunks) from the statistical model is used to generate a set of correction rules for refining the segmentation result. This set of correction rules is generated by an error-correction mechanism that compares the preliminary segmentation result with the manually segmented result. The statistical model and the learned correction rules can then be used to perform Chinese chunk segmentation of unseen sentences. In the second investigation, novel similarity measures of Chinese sentences are proposed by using word/chunk sequences and POS tag sequences of Chinese sentences. The sentence similarity measure is one of the most important components in example-based machine translation (EBMT). Unlike English sentences, Chinese sentences have no delimiter between words.
Hence, Chinese word/chunk delimitation must be performed before a sentence similarity measure can be computed. Both the word/chunk sequence feature and the POS tag sequence feature used in our proposed similarity measures are based on word/chunk segmentation. Sentence structure information is partially reflected in the POS tag sequence. For the proposed word-sequence-matching-based (WSMB) method, we take into consideration three factors between two sentences: the number of identical word sequences, the length of each identical word sequence, and the average weighting (AW) of each identical word sequence. In computing AW, we weight every POS tag according to its importance. The POS-tag-sequence-matching-based (PTSMB) method measures the similarity of Chinese sentences in terms of their structures. If the constituents in two Chinese sentences are similar, then we can judge that these two Chinese sentences are similar in structure. The main idea of this similarity measure is to match the POS tag sequences of two Chinese sentences using directed graphs. The POS weighting is also used in this process. In the third investigation, we propose a human-computer interaction approach to optimize the parameters used in the combined similarity measure of Chinese sentences, based on a relevance feedback scheme and a neural network model. In the relevance feedback process, users' intentions and preferences in ranking the candidate sentences are captured and used to modify parameters in the similarity measure. For the parameter optimization research, a web-based questionnaire was designed to collect users' feedback data. In this pioneering study, we constructed 50 groups of sentences. Each group contains one source sentence and ten candidate sentences to be retrieved. The ten candidate sentences are shown in descending order of their computed similarity to the source sentence.
A user who does not agree with the ranking produced by the computer is asked to provide a new ranking according to his or her own judgment. The new ranking is converted to a set of numbers and stored in a database for parameter optimization using a neural network model. One clear advantage of this approach is its ability to fine-tune the measure to reflect the preferences of one or more users in matching Chinese sentences. Experimental results show a visible improvement in the performance of the similarity measure. In addition to the theoretical and experimental studies in Chinese chunk segmentation and the similarity measure of Chinese sentences, we implemented both in an EBMT prototype, in which we also addressed other issues such as data structures, sentence indexing, and user-friendly interface design.	-
dcterms.accessRights	open access	-
dcterms.educationLevel	Ph.D.	-
dcterms.extent	xviii, 156 leaves : ill. ; 30 cm	-
dcterms.issued	2006	-
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations	-
dcterms.LCSH	Chinese language -- Data processing	-
dcterms.LCSH	Chinese language -- Sentences	-
dcterms.LCSH	Chinese language -- Word formation	-
dcterms.LCSH	Chinese language -- Machine translating	-
Appears in Collections:	Thesis
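
The WSMB measure summarized in the abstract combines three factors: the number of identical word sequences, the length of each sequence, and the average POS weighting (AW) of each sequence. The abstract does not give the formula, so the following Python sketch is only an illustration under stated assumptions and not the thesis's actual method. The POS weights, the `fragmentation_penalty` parameter, and the way the three factors are combined are all hypothetical. The input sentences are assumed to be already segmented and POS-tagged.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag), assumed output of a segmenter/tagger

# Hypothetical POS weights: content words count more than function words.
POS_WEIGHT: Dict[str, float] = {"n": 1.0, "v": 1.0, "a": 0.8, "r": 0.6, "u": 0.2}
DEFAULT_WEIGHT = 0.5  # assumed weight for tags not listed above


def identical_runs(a: List[Token], b: List[Token]) -> List[List[Token]]:
    """Return the identical word sequences shared by a and b, in sentence order."""
    words_a = [word for word, _ in a]
    words_b = [word for word, _ in b]
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel, which is skipped.
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size > 0]


def wsmb_similarity(a: List[Token], b: List[Token],
                    fragmentation_penalty: float = 0.1) -> float:
    """Illustrative score in [0, 1]; the combination rule is an assumption."""
    runs = identical_runs(a, b)
    if not runs:
        return 0.0
    weighted = 0.0
    for run in runs:
        # AW: mean POS weight over the words of one identical sequence.
        aw = sum(POS_WEIGHT.get(tag, DEFAULT_WEIGHT) for _, tag in run) / len(run)
        weighted += aw * len(run)  # longer, heavier sequences contribute more
    coverage = weighted / max(len(a), len(b))
    # Many short sequences score lower than one long sequence of equal coverage.
    return coverage / (1.0 + fragmentation_penalty * (len(runs) - 1))


source = [("我", "r"), ("喜欢", "v"), ("吃", "v"), ("苹果", "n")]
close = [("我", "r"), ("喜欢", "v"), ("吃", "v"), ("香蕉", "n")]
unrelated = [("他", "r"), ("讨厌", "v"), ("香蕉", "n")]

for candidate in (source, close, unrelated):
    print(wsmb_similarity(source, candidate))
```

In this sketch the sequences are found with the standard library's `difflib.SequenceMatcher`, which preserves word order. A ranking step for EBMT retrieval would sort candidate sentences by this score in descending order.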

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/2660
