title	layout	img	img_link	caption	active_tab
NAACL 2015 Tutorial on Crowdsourcing for NLP	tutorial	turk-engraving-detail	http://en.wikipedia.org/wiki/The_Turk	An engraving of the Mechanical Turk, the 18th century chess-playing automaton	main_page

Tutorial : Crowdsourcing for NLP

Date and Time : Sunday, May 31 at 9am

Instructors : Chris Callison-Burch, Lyle Ungar, and Ellie Pavlick

Abstract : Applying crowdsourcing to scientific problems is a hot research area, with over 10,000 publications in the past five years. Platforms such as Amazon’s Mechanical Turk and CrowdFlower give researchers easy access to large numbers of workers. The crowd’s vast supply of inexpensive, intelligent labor lets people attack problems that were previously impractical, and it opens the way to detailed scientific inquiry into social, psychological, economic, and linguistic phenomena through massive samples of human-annotated data. We introduce crowdsourcing and describe how it is being used in both industry and academia. Crowdsourcing is valuable to computational linguists both (a) as a source of labeled training data for use in machine learning and (b) as a means of collecting computational social science data that link language use to underlying beliefs and behavior. We present case studies for both categories: (a) collecting labeled data for natural language processing tasks such as word sense disambiguation and machine translation, and (b) collecting experimental data in the context of psychology, e.g., finding how word use varies with age, sex, personality, health, and happiness.

We will also cover tools and techniques for crowdsourcing. Collecting crowdsourced data effectively requires careful attention to the collection process: selecting appropriately qualified workers, giving clear instructions that non-experts can understand, and performing quality control on the results to eliminate spammers who complete tasks randomly or carelessly in order to collect the small financial reward. We will introduce different crowdsourcing platforms, review privacy and institutional review board issues, and provide rules of thumb for cost and time estimates. Crowdsourced data also has a particular structure that raises issues in statistical analysis; we describe some of the key methods to address these issues.
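
As a rough illustration of the quality-control step described above, the sketch below filters out workers who do poorly on a handful of items with known answers, then takes a majority vote over the remaining judgments. It is not taken from the tutorial materials: the function names, the `(worker, item, label)` tuple layout, the 0.7 accuracy threshold, and the toy data are all assumptions chosen for the example.

```python
from collections import Counter, defaultdict


def worker_accuracy(judgments, gold):
    """Return each worker's fraction of correct labels on gold items."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker, item, label in judgments:
        if item in gold:
            seen[worker] += 1
            correct[worker] += int(label == gold[item])
    return {w: correct[w] / seen[w] for w in seen}


def aggregate(judgments, gold, min_accuracy=0.7):
    """Majority vote per non-gold item, using only trusted workers.

    Workers who never saw a gold item have no accuracy estimate and are
    excluded. Ties go to the label that was seen first.
    The 0.7 threshold is an illustrative assumption, not a recommendation.
    """
    accuracy = worker_accuracy(judgments, gold)
    trusted = {w for w, acc in accuracy.items() if acc >= min_accuracy}
    votes = defaultdict(Counter)
    for worker, item, label in judgments:
        if worker in trusted and item not in gold:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


# Hypothetical toy data: (worker id, item id, label).
judgments = [
    ("w1", "g1", "pos"), ("w1", "g2", "neg"), ("w1", "s1", "pos"), ("w1", "s2", "neg"),
    ("w2", "g1", "pos"), ("w2", "g2", "neg"), ("w2", "s1", "pos"), ("w2", "s2", "neg"),
    ("w3", "g1", "neg"), ("w3", "g2", "pos"), ("w3", "s1", "neg"), ("w3", "s2", "pos"),
]
gold = {"g1": "pos", "g2": "neg"}  # items with known answers

labels = aggregate(judgments, gold)
print(labels)
```

Here worker `w3` misses both gold items and is dropped, so only the judgments from `w1` and `w2` are counted. Real deployments often use more elaborate schemes, such as weighting votes by estimated worker reliability.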

Here is a [video recording of our tutorial at the NAACL 2015 conference](http://techtalks.tv/talks/crowdsourcing-for-nlp/61562/).
