Rumbling around in my head for some time has been this question: can linguistics take advantage of powerful prediction markets to further our research goals?
It's not clear to me what predictions linguists could compete over, so this remains an open question. However, having just stumbled onto an Amazon.com service designed to harness the power of crowdsourcing called Mechanical Turk (HT: Complex Systems Blog), I'm tempted to believe this somewhat related idea could quickly prove useful for completing large-scale annotation projects (something I've posted about before), despite the potential for lousy annotations.
The point of crowdsourcing is to complete tasks that are difficult for computers, but easy for humans. For example, here are five tasks currently being listed:
1. Create an image that looks like another image.
2. Extract Meeting Date Information from Websites
3. Your task is to identify your 3 best items for the lists you're presented with.
4. Describe the sport and athlete's race and gender on Sports Illustrated covers
5. 2 pictures to look at and quickly rate subjectively
It should be easy enough to crowdsource annotation tasks (e.g., create a web site that people can log in to from anywhere, serving the data with an easy-to-use interface for tagging). "Alas!", says you, "surely the poor quality of the annotations would make this approach hopeless!"
Would it?
Recently, Breck Baldwin over at the LingPipe blog discussed the problems of inter-annotator agreement (gasp! there's inter-annotator DIS-agreement even between hip geniuses like Baldwin and Carpenter? Yes ... sigh ... yes there is). However (here's where the genius part comes in), he concluded that if you're primarily in the business of recall (i.e., making sure the net you cast catches all the fish in the sea, even if you also pick up some hubcaps along the way), then the reliability of annotators is not a critical concern. Let's let Breck explain:
The problem is in estimating what truth is given somewhat unreliable annotators. Assuming that Bob and I make independent errors, and after adjudication (we both looked at where we differed and decided what the real errors were), we figured that each of us would miss 5% (1/20) of the abstract-to-gene mappings. If we take the union of our annotations, we end up with 0.25% missed mentions (1/400) by multiplying our recall errors (1/20 * 1/20); this assumes independence of errors, a big assumption.
Now we have a much better upper limit that is in the 99% range, and more importantly, a perspective on how to accumulate a recall gold standard. Basically we should take annotations from all remotely qualified annotators and not worry about it. We know that is going to push down our precision (accuracy) but we are not in that business anyway.
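To make Breck's arithmetic concrete, here's a minimal sketch (in Python, with made-up miss rates) of how recall behaves when you pool annotations from independent annotators. The independence assumption is doing all the work, exactly as he warns, and the 20%-miss-rate crowd in the second example is pure invention on my part.

```python
# Minimal sketch: recall of the union of annotations from annotators who
# each miss mentions independently at a known rate (a big assumption).

def union_recall(miss_rates):
    """Recall of the pooled annotations, assuming independent misses."""
    missed_by_everyone = 1.0
    for rate in miss_rates:
        missed_by_everyone *= rate  # a mention is lost only if every annotator misses it
    return 1.0 - missed_by_everyone

# Two careful annotators who each miss 5% of mentions:
print(union_recall([0.05, 0.05]))   # 0.9975, i.e. 1/400 (0.25%) missed
# Ten hypothetical crowdsourced annotators who each miss 20%:
print(union_recall([0.20] * 10))    # ~0.9999999, recall error of 0.2**10
```

The point of the second call is just that even sloppy annotators push the pooled recall error down geometrically, which is why "take annotations from all remotely qualified annotators" is not as reckless as it sounds.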
Unless I've misunderstood Baldwin's post (I'm just a lousy linguist, mind you, not a genius, hehe), the major issue is estimating the error rate of a set of crowdsourced raters. Couldn't a bit of sampling do this nicely? If you restricted the annotators to, say, grad students in linguistics and related fields, the threshold of "remotely qualified" should be met, and there are plenty of grad students floating around the world.
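If I'm reading him right, one cheap way to decide who counts as "remotely qualified" is to adjudicate a small random sample yourself and estimate each rater's miss rate against it. The sketch below is my own illustration, not anything from the LingPipe post: the (abstract, gene) pairs, the 50-item sample size, and the 0.5 miss-rate cutoff are all invented for the example.

```python
# Hedged sketch: estimate a crowdsourced rater's miss rate from a small
# adjudicated gold sample, then apply a (made-up) qualification threshold.
import random

def estimate_miss_rate(rater_annotations, gold_sample):
    """Fraction of gold mentions in the sample that the rater missed."""
    missed = sum(1 for mention in gold_sample if mention not in rater_annotations)
    return missed / len(gold_sample)

def remotely_qualified(rater_annotations, gold, sample_size=50, max_miss=0.5, seed=0):
    """Keep a rater whose estimated miss rate on the adjudicated sample is tolerable."""
    random.seed(seed)
    gold_sample = random.sample(sorted(gold), min(sample_size, len(gold)))
    return estimate_miss_rate(rater_annotations, gold_sample) <= max_miss

# Toy data: gold mentions as (abstract, gene) pairs, one rater who missed TP53.
gold = {("a1", "BRCA1"), ("a1", "TP53"), ("a2", "EGFR"), ("a3", "MYC")}
rater = {("a1", "BRCA1"), ("a2", "EGFR"), ("a3", "MYC")}
print(estimate_miss_rate(rater, sorted(gold)))  # 0.25
print(remotely_qualified(rater, gold))          # True under the 0.5 cutoff
```

Grad-student raters and a chaperone spot-checking the sample would slot straight into this: the chaperone's adjudicated sample is the gold standard, and anyone above the miss-rate cutoff simply isn't pooled.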
This approach strikes me as related to the recent revelations that Wikipedia, Digg, and other groups that try to take advantage of web democracy/crowd wisdom are actually functioning best when they have a small group of "moderators" or "chaperones" (read Chris Wilson's article on this topic here).
So, take a large group of raters scattered around the whole wide world, give them the task and technology to complete potentially huge amounts of annotations quickly, chaperone their results just a bit, and voilà, large scale annotation projects made easy.
You're welcome, hehe.