CompLex-ZH : a new dataset for lexical complexity prediction in Mandarin and Cantonese

Qiu, L; Guo, S; Wong, TS; Chersoni, E; Lee, J; Huang, CR

doi:10.18653/v1/2024.tsar-1.3

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/114024

DC Field	Value	Language
dc.contributor	Department of Chinese and Bilingual Studies	en_US
dc.creator	Qiu, L	en_US
dc.creator	Guo, S	en_US
dc.creator	Wong, TS	en_US
dc.creator	Chersoni, E	en_US
dc.creator	Lee, J	en_US
dc.creator	Huang, CR	en_US
dc.date.accessioned	2025-07-10T01:31:44Z	-
dc.date.available	2025-07-10T01:31:44Z	-
dc.identifier.isbn	979-8-89176-176-6	en_US
dc.identifier.uri	http://hdl.handle.net/10397/114024	-
dc.description	Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), Miami, Florida, USA, 15 November 2024	en_US
dc.language.iso	en	en_US
dc.publisher	Association for Computational Linguistics	en_US
dc.rights	©2024 Association for Computational Linguistics	en_US
dc.rights	ACL materials are Copyright © 1963–2025 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License. Permission is granted to make copies for the purposes of teaching and research. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.	en_US
dc.rights	The following publication Le Qiu, Shanyue Guo, Tak-Sum Wong, Emmanuele Chersoni, John Lee, and Chu-Ren Huang. 2024. CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese. In Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), pages 20–26, Miami, Florida, USA. Association for Computational Linguistics is available at https://doi.org/10.18653/v1/2024.tsar-1.3.	en_US
dc.title	CompLex-ZH : a new dataset for lexical complexity prediction in Mandarin and Cantonese	en_US
dc.type	Conference Paper	en_US
dc.identifier.spage	20	en_US
dc.identifier.epage	26	en_US
dc.identifier.doi	10.18653/v1/2024.tsar-1.3	en_US
dcterms.abstract	The prediction of lexical complexity in context is assuming an increasing relevance in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, though, datasets annotated with complex words are available only for English and for a limited number of Western languages. In our paper, we introduce CompLex-ZH, a dataset including words annotated with complexity scores in sentential contexts for Chinese. Our data include sentences in Mandarin and Cantonese, which were selected from a variety of sources and textual genres. We provide a first evaluation with baselines combining hand-crafted and language models-based features.	en_US
dcterms.accessRights	open access	en_US
dcterms.bibliographicCitation	In Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), p. 20–26. Miami, Florida, USA: Association for Computational Linguistics, 2024	en_US
dcterms.issued	2024	-
dc.relation.ispartofbook	Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)	en_US
dc.relation.conference	Workshop on Text Simplification, Accessibility and Readability [TSAR]	en_US
dc.description.validate	202507 bcwh	en_US
dc.description.oa	Version of Record	en_US
dc.identifier.FolderNumber	a3877	-
dc.identifier.SubFormID	51498	-
dc.description.fundingSource	Others	en_US
dc.description.fundingText	Faculty of Humanities of the Hong Kong Polytechnic University	en_US
dc.description.pubStatus	Published	en_US
dc.description.oaCategory	CC	en_US
dc.relation.rdata	https://github.com/Laniqiu/CompLex-ZH	en_US
Appears in Collections:	Conference Paper

Files in This Item:

File	Description	Size	Format
2024.tsar-1.3.pdf		432.26 kB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record

Page views

174

Citations as of Feb 9, 2026

Downloads

67

Citations as of Feb 9, 2026
