Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/54859
Title: On document representation and term weights in text classification
Authors: Liu, Y
Issue Date: 2009
Publisher: Information Science Reference
Source: In M Song, & YF Wu (Eds.), Handbook of research on text and Web mining technologies, p. 1-22. Hershey, Pa.: Information Science Reference, 2009
Abstract: In automated text classification, a bag-of-words representation followed by tfidf weighting is the most popular approach to converting textual documents into numeric vectors for the induction of classifiers. In this chapter, we explore the potential of enriching the document representation with semantic information systematically discovered at the sentence level of each document. The salient semantic information is found using a frequent word sequence method. In contrast to the classic tfidf weighting scheme, we propose a probability-based term weighting scheme that directly reflects a term’s strength in representing a specific category. An experimental study based on the semantically enriched document representation and the newly proposed probability-based term weighting scheme shows a significant improvement in Fscore over the classic approach, i.e., bag-of-words plus tfidf. This study encourages us to further investigate applying the semantically enriched document representation to a wide range of text-based mining tasks.
URI: http://hdl.handle.net/10397/54859
ISBN: 9781599049908 (2 v. set : hbk.)
DOI: 10.4018/978-1-59904-990-8.ch001
Appears in Collections: Book Chapter

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.