Tuesday, March 11, 2008

On Crowdsourcing and Linguistics

Rumbling around in my head for some time has been this question: can linguistics take advantage of powerful prediction markets to further our research goals?

It's not clear to me what predictions linguists could compete over, so this remains an open question. However, I just stumbled onto Mechanical Turk, an Amazon.com service designed to harness the power of crowdsourcing (HT Complex Systems Blog), and I'm tempted to believe this somewhat related idea could be put to use very quickly to complete large-scale annotation projects (something I've posted about before), despite the potential for lousy annotations.

The point of crowdsourcing is to complete tasks that are difficult for computers, but easy for humans. For example, here are five tasks currently being listed:

1. Create an image that looks like another image.
2. Extract Meeting Date Information from Websites
3. Your task is to identify your 3 best items for the lists you're presented with.
4. Describe the sport and athlete's race and gender on Sports Illustrated covers
5. 2 pictures to look at and quickly rate subjectively

It should be easy enough to crowdsource annotation tasks (e.g., create a web site that people can log in to from anywhere, which presents the data through an easy-to-use interface for tagging). "Alas!", says you, "surely the poor quality of annotations would make this approach hopeless!"

Would it?

Recently, Breck Baldwin over at the LingPipe blog discussed the problems of inter-annotator agreement (gasp! there's inter-annotator DIS-agreement even between hip geniuses like Baldwin and Carpenter? Yes ... sigh ... yes there is). However (here's where the genius part comes in) he concluded that, if you're primarily in the business of recall (i.e., making sure the net you cast catches all the fish in the sea, even if you also pick up some hub caps along the way), then the reliability of annotators is not a critical concern. Let's let Breck explain:

The problem is in estimating what truth is given somewhat unreliable annotators. Assuming that Bob and I make independent errors and after adjudication (we both looked at where we differed and decided what the real errors were) we figured that each of us would miss 5% (1/20) of the abstract to gene mappings. If we took the union of our annotations, we end up with 0.25% missed mentions (1/400) by multiplying our recall errors (1/20*1/20)–this assumes independence of errors, a big assumption.

Now we have a much better upper limit that is in the 99% range, and more importantly, a perspective on how to accumulate a recall gold standard. Basically we should take annotations from all remotely qualified annotators and not worry about it. We know that is going to push down our precision (accuracy) but we are not in that business anyway.
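
To make the arithmetic in that passage concrete, here is a minimal sketch in Python. It leans entirely on the same big assumption Breck flags (annotators make independent errors), and the function names are my own invention for illustration, not anything from LingPipe.

```python
import math


def union_miss_rate(miss_rates):
    """Chance that EVERY annotator misses a given mention.

    Assumes the annotators' errors are independent, so the individual
    miss rates simply multiply.
    """
    return math.prod(miss_rates)


def annotators_needed(miss_rate, target_recall):
    """Smallest number of equally reliable, independent annotators whose
    unioned annotations reach target_recall (hypothetical helper)."""
    if not 0.0 < miss_rate < 1.0:
        raise ValueError("miss_rate must be strictly between 0 and 1")
    if not 0.0 <= target_recall < 1.0:
        raise ValueError("target_recall must be in [0, 1)")
    n, pooled_miss = 1, miss_rate
    while 1.0 - pooled_miss < target_recall:
        n += 1
        pooled_miss *= miss_rate
    return n


# Two annotators who each miss 1 mention in 20:
pooled = union_miss_rate([0.05, 0.05])
print(f"{pooled:.4f}")                  # 0.0025, i.e. 1 in 400
print(annotators_needed(0.05, 0.99))    # 2
print(annotators_needed(0.05, 0.9999))  # 4
```

If errors are correlated (everyone trips over the same hard cases), the pooled miss rate will be higher than this product, so treat these numbers as a best case.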

Unless I've misunderstood Baldwin's post (I'm just a lousy linguist mind you, not a genius, hehe) then the major issue is adjudicating the error rate of a set of crowdsourced raters. Couldn't a bit of sampling do this nicely? If you restricted the annotators to, say, grad students in linguistics and related fields, the threshold of "remotely qualified" should be met, and there are plenty of grad students floating around the world.
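
Here is one rough way the sampling idea could go, sketched in Python. Everything in it is an assumption on my part: that you have a small adjudicated ("gold") set of mention IDs to check a rater against, that a random sample of that set is representative, and that a Wilson score interval is an acceptable way to put error bars on the estimate. The function names are hypothetical.

```python
import math
import random


def wilson_interval(misses, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("need at least one sampled item")
    p = misses / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def estimate_miss_rate(annotated, gold, sample_size, seed=0):
    """Estimate how often a rater misses a true mention.

    annotated: set of mention IDs the rater marked
    gold: set of adjudicated true mention IDs
    Only a random sample of the gold set is checked.
    """
    rng = random.Random(seed)
    sample = rng.sample(sorted(gold), min(sample_size, len(gold)))
    misses = sum(1 for mention in sample if mention not in annotated)
    return misses / len(sample), wilson_interval(misses, len(sample))


# Toy data: 100 true mentions, the rater found all but the last five.
gold = set(range(100))
annotated = set(range(95))
rate, (low, high) = estimate_miss_rate(annotated, gold, sample_size=100)
print(f"miss rate {rate:.2f}, interval ({low:.3f}, {high:.3f})")
```

The interval matters as much as the point estimate: with only a handful of sampled items it will be wide, which tells you to adjudicate more before trusting a rater's number.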

This approach strikes me as related to the recent revelations that Wikipedia and Digg and other groups that try to take advantage of web democracy/crowd wisdom are actually functioning best when they have a small group of "moderators" or "chaperones" (read Chris Wilson's article on this topic here).

So, take a large group of raters scattered around the whole wide world, give them the task and technology to complete potentially huge amounts of annotations quickly, chaperone their results just a bit, and voilà, large-scale annotation projects made easy.

You're welcome, hehe.


lingpipe-blog.com said...

With enough annotators and enough independence of annotation mistakes, you should be able to converge to a consistent annotation.

There are two empirical issues. One, do you get enough independence of annotation? Two, is the truth really out there? That is, with all the adjudication in the world after the fact, can we ever catch up with the long tail of hard cases?

There are some fundamental problems already, such as the phrase "Bob Dylans", which refers to multiple actors playing the "same" real character under different aspects in a recent movie. How do we annotate that?

And what about that pesky phosphorescence gene that was spliced into the mouse genome to support genomics experiments? It's originally a jellyfish gene, but jellyfish aren't an organism in Entrez. Do we consider it a mouse gene even if it's been engineered into mice?

I'm primarily interested in trying to get a statistical handle on just how many annotators we'll need to be confident that unioning their answers gives us high recall. I think the problem is that you're going to need a lot more annotators to also get high precision, and I'm not sure it'll converge, but it's what I'd like to estimate by measuring correlations of mistakes.

Leon Peshkin at Harvard, who we're working with on our NIH grant, is pursuing exactly the crowdsourcing model you propose to annotate biomedical texts for relations among genes and between genes and diseases.

Chris said...

Thanks for the comment. I'll have to look into Peshkin's work. Thanks!