Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/55630
Title: Crowdsourcing method in empirical linguistic research : Chinese studies using mechanical turk-based experimentation
Authors: Wang, Shichang
Keywords: Computational linguistics -- Research.
Human computation.
Data mining.
Issue Date: 2016
Publisher: The Hong Kong Polytechnic University
Abstract: Empirical linguistic research is driven by linguistic data. However, linguistic data collection, be it corpus annotation, transcription of scripture and audio material, surveys, or psycholinguistic experiments, has proved to be very time- and resource-intensive. As a result, linguistic researchers frequently have to compromise on their data: instead of large-scale linguistic data, they use small-scale data; when recruiting subjects for surveys or psycholinguistic experiments, instead of random sampling, they use convenience sampling (recruiting subjects on the basis of proximity, ease of access, and willingness to participate). They typically draw only on college students, a rather homogeneous subject pool, and even with convenience sampling they usually cannot obtain very large samples. Since linguistic data is the foundation of empirical linguistic research, compromises on the data may undermine the whole research project. In short, linguistic data has become the bottleneck of empirical linguistic research. To solve this problem, we need a more efficient and economical data collection method. In recent years, crowdsourcing, that is, outsourcing tasks to crowds through an open call over the Internet, has become a promising new method of linguistic data collection that can break this bottleneck.
This dissertation reports our work on applying the crowdsourcing method, especially Mechanical Turk-based linguistic experimentation (Mechanical Turk being a primary form of crowdsourcing), to empirical linguistic research. We have three interrelated general goals, which concern methodology, language resources, and linguistic theory respectively: (1) to explore Mechanical Turk-based linguistic experimentation, (2) to build useful linguistic datasets using Mechanical Turk-based experiments, and (3) to investigate some theoretical linguistic issues using the data collected. This dissertation consists of three studies. Study one is a pilot study on Mechanical Turk-based linguistic experimentation that lays the methodological foundation for our research. We reviewed the literature on Mechanical Turk-based experimentation, analyzed platform usability, conducted a pilot experiment, proposed a general framework for Mechanical Turk-based experiments, and discussed data quality control methods. Study two first created a very large semantic transparency dataset of Chinese nominal compounds using Mechanical Turk-based experiments. This dataset contains overall and constituent semantic transparency ratings for about 1,200 disyllabic Chinese nominal compounds. We also conducted a semantic transparency rating experiment using the traditional laboratory-based method, which enabled us to further evaluate Mechanical Turk-based experimentation by comparing the data collected in the Mechanical Turk-based experiment with the data collected in the laboratory-based experiment. Based on the semantic transparency dataset we created, we explored the uncertainty of semantic transparency judgments among raters and the effect of a compound's semantic head on semantic transparency ratings. Study three first created a large manual Chinese word segmentation dataset using Mechanical Turk-based experiments. This dataset contains 152 long Chinese sentences selected mainly from the Sinica corpus; each sentence was segmented manually by more than 120 online subjects. The dataset was then used to investigate the effect of semantic transparency on word intuition and the measurement of the word intuition of Chinese speakers.
Description: PolyU Library Call No.: [THS] LG51 .H577P CBS 2016 Wang
xv, 294 pages : illustrations
URI: http://hdl.handle.net/10397/55630
Rights: All rights reserved.
Appears in Collections: Thesis

Files in This Item:
File                 Description                     Size      Format
b29041417_link.htm   For PolyU Users                 208 B     HTML
b29041417_ira.pdf    For All Users (Non-printable)   6.67 MB   Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.