Entropy-based discrimination between translated Chinese and original Chinese using data mining techniques

Liu, K; Ye, R; Zhongzhu, L; Ye, R

doi:10.1371/journal.pone.0265633

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/94267

DC Field	Value	Language
dc.contributor	Department of Chinese and Bilingual Studies	-
dc.creator	Liu, K	-
dc.creator	Ye, R	-
dc.creator	Zhongzhu, L	-
dc.creator	Ye, R	-
dc.date.accessioned	2022-08-11T02:01:31Z	-
dc.date.available	2022-08-11T02:01:31Z	-
dc.identifier.uri	http://hdl.handle.net/10397/94267	-
dc.language.iso	en	en_US
dc.publisher	Public Library of Science	en_US
dc.rights	© 2022 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.	en_US
dc.rights	The following publication Liu, K., Ye, R., Zhongzhu, L., & Ye, R. (2022). Entropy-based discrimination between translated Chinese and original Chinese using data mining techniques. PLoS one, 17(3), e0265633 is available at https://doi.org/10.1371/journal.pone.0265633	en_US
dc.title	Entropy-based discrimination between translated Chinese and original Chinese using data mining techniques	en_US
dc.type	Journal/Magazine Article	en_US
dc.identifier.volume	17	-
dc.identifier.issue	3	-
dc.identifier.doi	10.1371/journal.pone.0265633	-
dcterms.abstract	The present research reports on the use of data mining techniques for differentiating between translated and non-translated (original) Chinese based on monolingual comparable corpora. We operationalized seven entropy-based metrics, namely character, wordform unigram, wordform bigram, wordform trigram, POS (part-of-speech) unigram, POS bigram and POS trigram entropy, from two balanced Chinese comparable corpora (translated vs non-translated) for data mining and analysis. We then applied four data mining techniques, Support Vector Machines (SVMs), Linear Discriminant Analysis (LDA), Random Forest (RF) and Multilayer Perceptron (MLP), to distinguish translated Chinese from original Chinese based on these seven features. Our results show that the SVM is the most robust and effective classifier, yielding an AUC of 90.5% and an accuracy rate of 84.3%. Our results affirm the hypothesis that translational language is categorically different from original language. Our research demonstrates that combining the information-theoretic indicator of Shannon's entropy with machine learning techniques can provide a novel approach for studying translation as a unique communicative activity. This study yields new insights for corpus-based studies on the translationese phenomenon in the field of translation studies.	-
dcterms.accessRights	open access	en_US
dcterms.bibliographicCitation	PLoS one, 2022, v. 17, no. 3, e0265633	-
dcterms.isPartOf	PLoS one	-
dcterms.issued	2022	-
dc.identifier.scopus	2-s2.0-85126996894	-
dc.identifier.pmid	35324927	-
dc.identifier.eissn	1932-6203	-
dc.identifier.artn	e0265633	-
dc.description.validate	202208 bckw	-
dc.description.oa	Version of Record	en_US
dc.identifier.FolderNumber	a1531	en_US
dc.identifier.SubFormID	45351	en_US
dc.description.fundingSource	Self-funded	en_US
dc.description.pubStatus	Published	en_US
Appears in Collections:	Journal/Magazine Article
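
The entropy features named in the abstract can be illustrated with a minimal sketch. It assumes the standard Shannon formula H = -Σ p(x) log2 p(x) over the relative frequencies of n-grams; the record does not state the paper's exact estimator, logarithm base, or tokenization, so the function name `ngram_entropy` and the sample tokens below are hypothetical and not taken from the publication.

```python
import math
from collections import Counter


def ngram_entropy(tokens, n=1):
    """Shannon entropy (in bits) of the n-gram distribution of a sequence.

    Assumption: plain maximum-likelihood estimate, H = -sum(p * log2(p)),
    where p is the relative frequency of each distinct n-gram.
    """
    # Slide a window of width n over the sequence to collect n-grams.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    total = len(ngrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical segmented text and POS tags, for illustration only.
words = ["我", "喜欢", "读", "书", "我", "喜欢", "写", "字"]
pos_tags = ["r", "v", "v", "n", "r", "v", "v", "n"]

# Seven features mirroring those listed in the abstract: character entropy,
# wordform uni/bi/trigram entropy, and POS uni/bi/trigram entropy.
features = [ngram_entropy(list("".join(words)), 1)]
features += [ngram_entropy(words, n) for n in (1, 2, 3)]
features += [ngram_entropy(pos_tags, n) for n in (1, 2, 3)]
```

A feature vector of this kind, computed per text, is what a classifier such as an SVM would take as input; the classification step itself is omitted here because it depends on a third-party library and on settings the record does not describe.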

Files in This Item:

File	Description	Size	Format
journal.pone.0265633.pdf		1.05 MB	Adobe PDF	View/Open

Open Access Information

Status	open access
File Version	Version of Record

Usage and Citation Statistics

Metric	Count	As of
Page views	46 (last week: 1)	May 12, 2024
Downloads	31	May 12, 2024
Scopus citations	7	May 16, 2024
Web of Science citations	5	May 16, 2024