Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/3769
Title: Post-processing for handwritten Chinese character recognition
Authors: Xu, Ruifeng
Keywords: Optical character recognition devices
Chinese character sets (Data processing)
Chinese characters -- Data processing
Pattern recognition systems
Optical data processing
Hong Kong Polytechnic University -- Dissertations
Issue Date: 2001
Publisher: The Hong Kong Polytechnic University
Abstract: This thesis presents post-processing techniques that improve the performance of a Handwritten Chinese Character Recognition (HCCR) system by selecting the most promising candidate characters. Aiming to remove mis-recognized and unrecognized characters from the recognition result, three post-processing approaches are studied, and their performance is evaluated and compared: one based on contextual linguistic information, one based on the characteristics of confusing characters produced by a recognizer, and one based on a hybrid approach. For the approach based on contextual linguistic information, a dictionary-based post-processing method is presented first. Its techniques, including sentence fragment detection and contextual approximate word matching for removing erroneous characters, are studied and their performance is evaluated. Post-processing techniques based on statistical language models are then proposed. A Chinese word bigram model is established and employed in HCCR post-processing to identify the most linguistically promising sentence, the one with the maximum word co-occurrence product, by selecting plausible candidate characters. To capture long-distance constraints in Chinese sentences, the word bigram model is extended to a distant word bigram model with a maximum distance of 3 prior to post-processing. The improvements achieved by these models are evaluated and compared.
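As an illustration only, the bigram-based selection described above can be sketched as a Viterbi search over the recognizer's candidate lists. This is a minimal sketch under stated assumptions, not the thesis's implementation: it scores adjacent characters rather than segmented words, and the function name `best_sequence`, the `floor` back-off value, and the toy probabilities are all hypothetical.

```python
import math


def best_sequence(candidates, bigram_logp, floor=math.log(1e-6)):
    """Pick one candidate per position so the summed bigram log-score is maximal.

    candidates  : list of candidate lists, one list per character position
    bigram_logp : dict mapping (previous, current) to a log-probability
    floor       : assumed log-score for unseen pairs (hypothetical back-off)
    """
    # score[c] = best log-score of any path that ends in candidate c
    score = {c: 0.0 for c in candidates[0]}
    back = []  # back[i][c] = best predecessor of c at position i + 1
    for position in candidates[1:]:
        new_score, pointers = {}, {}
        for cur in position:
            prev = max(
                score,
                key=lambda p: score[p] + bigram_logp.get((p, cur), floor),
            )
            new_score[cur] = score[prev] + bigram_logp.get((prev, cur), floor)
            pointers[cur] = prev
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final candidate.
    path = [max(score, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy data (assumed): the recognizer ranks a similar-looking character
# first at each position, but the bigram score prefers another pairing.
toy_candidates = [["太", "大"], ["字", "学"]]
toy_bigrams = {("大", "学"): math.log(0.2), ("太", "字"): math.log(0.001)}
print("".join(best_sequence(toy_candidates, toy_bigrams)))
```

Summing log-scores is equivalent to maximizing the product of the pairwise scores while avoiding numerical underflow on long sentences.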
To recover unrecognized characters and raise the theoretical upper limit on the improvement achievable by the approach based on contextual linguistic information, post-processing techniques based on the characteristics of confusing characters produced by the recognizer are studied. By analyzing the recognition results for the training samples, the confusing characters for each character category are collected into a confusing character set. Based on this set, a statistical Noisy-Channel model is used to identify the most promising input character for a given candidate sequence. This method proves effective in removing unrecognized characters. Treating the confusing characters as observed features of character categories, a classification algorithm based on neural networks can also be employed to identify the most plausible input character for the candidate sequence. All 3755 character categories in the GB2312-80 character set are clustered into several hundred groups by searching through the transitive closure of the similarity matrix associated with the confusing character set. A set of neural networks for these category groups is established and trained to produce a candidate that matches the input character and to adjust the confidence parameters of the candidates in a given candidate sequence. This achieves better performance than the method based on the Noisy-Channel model. A three-stage hybrid post-processing system is then built. First, the post-processing technique based on the confusing character characteristics of a recognizer appends similarly shaped characters to the candidate set. Then the dictionary-based method appends linguistically likely characters and binds the candidate characters into a word lattice. Finally, the statistical language model identifies the most promising sentence by selecting plausible words from the word lattice.
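The grouping step above can be illustrated with a small sketch. It assumes the similarity matrix has been reduced to a symmetric yes/no "confusable" relation, in which case the groups given by its transitive closure are the connected components of that relation; a union-find structure computes them. The function name `group_by_confusion`, the pair-list input format, and the toy characters are assumptions for illustration, not details taken from the thesis.

```python
from collections import defaultdict


def group_by_confusion(categories, confusion_pairs):
    """Group categories linked, directly or indirectly, by confusion pairs.

    categories      : iterable of all character categories
    confusion_pairs : iterable of (a, b) pairs the recognizer confuses;
                      both members are assumed to appear in categories
    """
    categories = list(categories)
    parent = {c: c for c in categories}

    def find(c):
        # Follow parent links to the group representative (path halving).
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Merging the two groups of every confusable pair yields the
    # equivalence classes of the relation's transitive closure.
    for a, b in confusion_pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for c in categories:
        groups[find(c)].add(c)
    return [sorted(g) for g in groups.values()]


# Toy data (assumed): 己/已 and 已/巳 are confusable, so all three share
# one group even though 己 and 巳 are never listed as a pair.
toy_categories = ["己", "已", "巳", "未", "末", "人"]
toy_pairs = [("己", "已"), ("已", "巳"), ("未", "末")]
for group in group_by_confusion(toy_categories, toy_pairs):
    print(group)
```

A category that is never confused with any other stays in a group of its own, as 人 does in the toy data.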
On average, this hybrid post-processing system achieves a 6.2% improvement in the first-candidate recognition rate with an online HCCR engine whose character recognition rate is 90% for the first candidate and 95% for the top-10 candidates. For an offline HCCR engine with original recognition rates of 81% for the first candidate and 92% for the top-10 candidates, a 12% improvement in the first-candidate recognition rate is achieved.
Description: xiii, 162 leaves : col. ill. ; 30 cm.
PolyU Library Call No.: [THS] LG51 .H577M COMP 2001 Xu
URI: http://hdl.handle.net/10397/3769
Rights: All rights reserved.
Appears in Collections: Thesis

Files in This Item:
File                 Description                     Size     Format
b15731807_link.htm   For PolyU Users                 162 B    HTML
b15731807_ir.pdf     For All Users (Non-printable)   8.4 MB   Adobe PDF