Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/9410
Title: Mining pinyin-to-character conversion rules from large-scale corpus : a rough set approach
Authors: Wang, X
Chen, Q
Yeung, DS
Keywords: Data mining
Natural languages
Rough set theory
Text analysis
Issue Date: 2004
Publisher: Institute of Electrical and Electronics Engineers
Source: IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics, 2004, v. 34, no. 2, p. 834-844
Journal: IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics 
Abstract: The paper introduces a rough set technique for mining Pinyin-to-character (PTC) conversion rules. It first presents a text-structuring method, applied to a free-form textual corpus, that constructs a language information table for each pinyin. Data generalization and rule extraction algorithms are then used to eliminate redundant information and extract consistent PTC conversion rules. The design of the model addresses several important issues: the long-distance dependency problem, the storage requirements of the rule base, and the consistency of the extracted rules. The performance of the extracted rules and the effects of different model parameters are evaluated experimentally. The results show that, with the smoothing method, high conversion precision (0.947) and recall (0.84) can be achieved even for rules represented directly by pinyin rather than words. A comparison with the baseline tri-gram model also shows that our method and the tri-gram language model complement each other well.
URI: http://hdl.handle.net/10397/9410
ISSN: 1083-4419
DOI: 10.1109/TSMCB.2003.817101
Appears in Collections: Journal/Magazine Article

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.