Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/84115
Title: Crowdsourcing method in empirical linguistic research : Chinese studies using Mechanical Turk-based experimentation
Authors: Wang, Shichang
Degree: Ph.D.
Issue Date: 2016
Abstract: Empirical linguistic research is driven by linguistic data. However, linguistic data collection, be it corpus annotation, scripture and audio material transcription, surveys, or psycholinguistic experiments, has proved to be very time- and resource-intensive. As a result, linguistic researchers frequently have to compromise on their data: instead of large-scale linguistic data, they have to use small-scale data; and when recruiting subjects for surveys or psycholinguistic experiments, instead of random sampling, they have to use convenience sampling (recruiting subjects on the basis of proximity, ease of access, and willingness to participate). They typically use only college students as the subject pool, which is rather homogeneous; and even with convenience sampling, they usually cannot obtain very large samples. Since linguistic data is the foundation of empirical linguistic research, compromises on linguistic data may corrupt the whole research project. In short, linguistic data has become the bottleneck of empirical linguistic research. To solve this problem, we need a more efficient and economical data collection method. In recent years, crowdsourcing, which means outsourcing tasks to crowds in the form of an open call via the Internet, has become a promising new method of linguistic data collection that could break this bottleneck.
This dissertation reports our work on exploring the application of the crowdsourcing method, especially Mechanical Turk-based linguistic experimentation (Mechanical Turk is a primary form of crowdsourcing), in empirical linguistic research. We have three related general goals, which concern methodology, language resources, and linguistic theory respectively: (1) to explore Mechanical Turk-based linguistic experimentation, (2) to build useful linguistic datasets using Mechanical Turk-based experiments, and (3) to investigate some theoretical linguistic issues using the data collected. This dissertation consists of three studies. Study one is a pilot study on Mechanical Turk-based linguistic experimentation that lays a methodological foundation for our research. We reviewed the literature on Mechanical Turk-based experimentation, analyzed platform usability, conducted a pilot experiment, proposed a general framework for Mechanical Turk-based experiments, and discussed data quality control methods. Study two first created a very large semantic transparency dataset of Chinese nominal compounds using Mechanical Turk-based experiments. This dataset contains the overall and constituent semantic transparency rating data of about 1,200 disyllabic Chinese nominal compounds. We also conducted a semantic transparency rating experiment using the traditional laboratory-based method, which enabled us to further evaluate Mechanical Turk-based experimentation by comparing the data collected by the Mechanical Turk-based experiment with the data collected by the laboratory-based experiment. Based on the semantic transparency dataset we created, we explored the uncertainty of semantic transparency judgments among raters and the effect of the semantic head of a compound on semantic transparency ratings. Study three first created a large manual Chinese word segmentation dataset using Mechanical Turk-based experiments. This dataset contains 152 long Chinese sentences selected mainly from the Sinica corpus; each sentence was segmented manually by more than 120 online subjects. This dataset is then used to investigate the effect of semantic transparency on word intuition and the measurement of the word intuition of Chinese speakers.
Subjects: Computational linguistics -- Research.
Human computation.
Data mining.
Hong Kong Polytechnic University -- Dissertations
Pages: xv, 294 pages : illustrations
Appears in Collections: Thesis

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.