Title: Determination of context window size
Authors: Hung, KY
Luk, R 
Yeung, D
Chung, K 
Shu, W
Keywords: Text Analysis
Textual Data Mining
Association Score
Mutual Information
Issue Date: 2001
Publisher: World Scientific Publishing Co
Source: International journal of computer processing of languages, 2001, v. 14, no. 1, p. 71-80
Journal: International journal of computer processing of languages 
Abstract: Context windows are important for a variety of natural language analysis and processing tasks. A trade-off exists between task performance and the size of the context. Lucassen and Mercer used mutual information to determine the size of the context for English text. We apply the same technique to determine the context window size for Chinese text. In addition, we use the association score proposed by Church, which is directly related to the predictive ability of units in the context. To reduce the effects of spurious associations, the association score value at the N% quantile is used instead of the maximum, and association scores derived from low-frequency occurrences (i.e. <5) are discarded. A window size of 9 characters was found to be large enough for most associations between characters themselves and between words themselves. An alternative approach using the (nonparametric) lambda statistic LB is examined, which overcomes the spurious association problems and the averaging effect of mutual information. We conclude that the lambda statistic is more suitable for exhaustive contextual models (e.g. variable N-gram models), whereas the association score is more suitable for non-exhaustive contextual models (e.g. identification of collocations).
ISSN: 1793-8406
DOI: 10.1142/S0219427901000291
Appears in Collections: Journal/Magazine Article
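
Illustrative note: the procedure summarized in the abstract (association score per distance, low-frequency pairs discarded, a quantile taken instead of the maximum) might be sketched roughly as below. This is not the authors' implementation. It assumes the association score is the log-ratio log2(P(x,y) / (P(x) P(y))) in the sense of Church and Hanks, estimated from pairs in which y occurs exactly `distance` positions after x. The function names, the nearest-rank quantile rule, and the default q=0.95 (standing in for the unspecified N%) are hypothetical choices made for illustration only; the threshold of 5 comes from the abstract.

```python
import math
from collections import Counter


def association_scores(tokens, distance, min_count=5):
    """Association score log2(P(x,y) / (P(x) P(y))) for each pair (x, y)
    where y occurs exactly `distance` positions after x.

    Pairs seen fewer than `min_count` times are discarded, mirroring the
    low-frequency cut-off (<5) mentioned in the abstract.
    """
    n = len(tokens)
    n_pairs = n - distance
    unit_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[distance:]))
    scores = {}
    for (x, y), count in pair_counts.items():
        if count < min_count:
            continue  # too rare to give a reliable estimate
        p_xy = count / n_pairs
        p_x = unit_counts[x] / n
        p_y = unit_counts[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def quantile(values, q):
    """Nearest-rank quantile for q in (0, 1]; returns None for no values."""
    ordered = sorted(values)
    if not ordered:
        return None
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def score_by_distance(tokens, max_distance, q=0.95, min_count=5):
    """Quantile of the association score at each distance 1..max_distance.

    A window size could then be read off as the distance beyond which
    this value stays close to zero (an assumed stopping criterion).
    """
    return {
        d: quantile(association_scores(tokens, d, min_count).values(), q)
        for d in range(1, max_distance + 1)
    }
```

For Chinese text, `tokens` could be a list of characters or a list of segmented words, matching the two kinds of unit the abstract mentions; segmentation itself is outside this sketch.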
