Post-processing for handwritten Chinese character recognition

Xu, Ruifeng

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/85077

Title:	Post-processing for handwritten Chinese character recognition
Authors:	Xu, Ruifeng
Degree:	M.Phil.
Issue Date:	2001
Abstract:	This thesis presents post-processing techniques that improve the performance of a Handwritten Chinese Character Recognition (HCCR) system by selecting the most promising candidate characters. To remove mis-recognized and unrecognized characters from the recognition result, three post-processing approaches are studied, and their performance is evaluated and compared: one based on contextual linguistic information, one based on the confusing-character characteristics of a recognizer, and a hybrid of the two. For the approach based on contextual linguistic information, a dictionary-based post-processing method is presented first. Its techniques, including sentence fragment detection and contextual approximate word matching for removing erroneous characters, are studied and their performance is evaluated. Post-processing techniques based on statistical language models are then proposed. A Chinese word bi-gram model is established and employed in HCCR post-processing to identify the most linguistically promising sentence, the one with the maximum word co-occurrence product, by selecting plausible candidate characters. To capture long-distance constraints in Chinese sentences, the word bi-gram model is extended to a distant word bi-gram model with a maximum distance of 3 prior to post-processing. The improvements obtained with both models are evaluated and compared. To recover unrecognized characters and raise the theoretical upper limit of improvement for the approach based on contextual linguistic information, post-processing techniques based on the characteristics of the confusing characters produced by a recognizer are studied. By analyzing the recognition results for the training samples, the confusing characters for each character category are collected and assembled into a confusing character set. 
Based on this set, a statistical noisy-channel model is used to identify the most promising input character for a given candidate sequence. This method proves effective in removing unrecognized characters. By treating the confusing characters as observed features of character categories, a classification algorithm based on neural networks can be employed to identify the most plausible input character for the candidate sequence. In total, 3755 character categories from the GB2312-80 character set are clustered into several hundred groups by searching through the transitive closure of the similarity matrix associated with the confusing character set. A group of neural networks for these category groups is established and trained to produce a candidate that matches the input character and to adjust the confidence parameters of the candidates in a given candidate sequence. This approach performs better than the one based on the noisy-channel model. A three-stage hybrid post-processing system is then built. First, the post-processing technique based on the confusing-character characteristics of a recognizer appends similar-shaped characters to the candidate set. Then the dictionary-based method appends linguistically likely characters and binds the candidate characters into a word lattice. Finally, the statistical language model identifies the most promising sentence by selecting plausible words from the word lattice. On average, this hybrid post-processing system improves the first-candidate recognition rate by 6.2% when the online HCCR engine has a character recognition rate of 90% for the first candidate and 95% for the top-10 candidates. For the offline HCCR engine, with original recognition rates of 81% and 92% for the first and top-10 candidates, a 12% improvement in the first-candidate recognition rate is achieved.
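
The bi-gram selection step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes a simplified setting in which each position holds a list of candidate tokens with recognizer log scores, and a Viterbi search picks the path that maximizes the summed recognizer and bi-gram log probabilities. The function name `best_sequence`, the `floor` value for unseen bi-grams, and all probabilities in the example are hypothetical.

```python
import math


def best_sequence(candidates, bigram_logp, floor=-10.0):
    """Viterbi search over per-position candidate lists.

    candidates:  list of lists of (token, recognizer_log_score); must be non-empty
    bigram_logp: dict mapping (previous_token, token) -> log probability
    floor:       log probability assumed for unseen bi-grams (hypothetical smoothing)
    """
    # best[i][tok] = (score of the best path ending in tok at position i, backpointer)
    best = [{tok: (score, None) for tok, score in candidates[0]}]
    for position in candidates[1:]:
        column = {}
        for tok, score in position:
            # Pick the predecessor that gives the highest path score into tok.
            prev_tok, path_score = max(
                ((p, s + bigram_logp.get((p, tok), floor))
                 for p, (s, _) in best[-1].items()),
                key=lambda item: item[1],
            )
            column[tok] = (path_score + score, prev_tok)
        best.append(column)

    # Trace back from the best final token to the first position.
    tok = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tok]
    for column in reversed(best[1:]):
        tok = column[tok][1]
        path.append(tok)
    return path[::-1]


# Toy example with made-up scores: the recognizer's first choice at
# position 0 is wrong, but the bi-gram model favors the second candidate.
candidates = [
    [("申", math.log(0.5)), ("中", math.log(0.4))],
    [("国", math.log(0.6)), ("围", math.log(0.3))],
]
bigram_logp = {("中", "国"): math.log(0.2)}

print(best_sequence(candidates, bigram_logp))  # -> ['中', '国']
```

The thesis applies the language model to a word lattice rather than to single characters; the same search would apply if each token in the sketch were a word spanning one or more character positions.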
Subjects:	Optical character recognition devices; Chinese character sets (Data processing); Chinese characters -- Data processing; Pattern recognition systems; Optical data processing; Hong Kong Polytechnic University -- Dissertations
Pages:	xiii, 162 leaves : col. ill. ; 30 cm
Appears in Collections:	Thesis

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/3750
