
Tuesday, March 11, 2008

On Crowdsourcing and Linguistics

Rumbling around in my head for some time has been this question: can linguistics take advantage of powerful prediction markets to further our research goals?


It's not clear to me what predictions linguists could compete over, so this remains an open question. However, having just stumbled onto Mechanical Turk, an Amazon.com service designed to harness the power of crowdsourcing (HT Complex Systems Blog), I'm tempted to believe this somewhat related idea could quickly prove useful for completing large-scale annotation projects (something I've posted about before), despite the potential for lousy annotations.

The point of crowdsourcing is to complete tasks that are difficult for computers, but easy for humans. For example, here are five tasks currently being listed:

1. Create an image that looks like another image.
2. Extract Meeting Date Information from Websites
3. Your task is to identify your 3 best items for the lists you're presented with.
4. Describe the sport and athlete's race and gender on Sports Illustrated covers
5. 2 pictures to look at and quickly rate subjectively

It should be easy enough to crowdsource annotation tasks (e.g., create a web site people can log in to from anywhere which contains the data with an easy-to-use interface for tagging). "Alas!", says you, "surely the poor quality of annotations would make this approach hopeless!"

Would it?

Recently, Breck Baldwin over at the LingPipe blog discussed the problems of inter-annotator agreement (gasp! there's inter-annotator DIS-agreement even between hip geniuses like Baldwin and Carpenter? Yes ... sigh ... yes there is). However (here's where the genius part comes in), he concluded that if you're primarily in the business of recall (i.e., making sure the net you cast catches all the fish in the sea, even if you also pick up some hubcaps along the way), then the reliability of annotators is not a critical concern. Let's let Breck explain:

The problem is in estimating what truth is given somewhat unreliable annotators. Assuming that Bob and I make independent errors and after adjudication (we both looked at where we differed and decided what the real errors were) we figured that each of us would miss 5% (1/20) of the abstract to gene mappings. If we took the union of our annotations, we end up with 0.25% missed mentions (1/400) by multiplying our recall errors (1/20 * 1/20); this assumes independence of errors, a big assumption.

Now we have a much better upper limit that is in the 99% range, and more importantly, a perspective on how to accumulate a recall gold standard. Basically we should take annotations from all remotely qualified annotators and not worry about it. We know that is going to push down our precision (accuracy) but we are not in that business anyway.

Unless I've misunderstood Baldwin's post (I'm just a lousy linguist mind you, not a genius, hehe), the major issue is estimating the error rate of a set of crowdsourced raters. Couldn't a bit of sampling do this nicely? If you restricted the annotators to, say, grad students in linguistics and related fields, the threshold of "remotely qualified" should be met, and there are plenty of grad students floating around the world.
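Here's a quick back-of-the-envelope sketch of that arithmetic and of the sampling idea, in Python (not Baldwin's code). The per-annotator miss rates are made-up numbers, and the independence assumption is the same big assumption he flags; the point is just how fast the recall of the union climbs as you pile on remotely qualified annotators.

```python
# Sketch of the union-of-annotators recall arithmetic quoted above.
# Assumes each annotator misses a mention independently at the given rate;
# all rates below are made-up numbers for illustration.
from functools import reduce

def union_miss_rate(miss_rates):
    """Chance that *every* annotator misses a mention, i.e. the miss rate
    of the union of all their annotations (assuming independent errors)."""
    return reduce(lambda acc, r: acc * r, miss_rates, 1.0)

def union_recall(miss_rates):
    return 1.0 - union_miss_rate(miss_rates)

def estimated_miss_rate(sample_gold, sample_annotated):
    """Estimate one annotator's miss rate from a small adjudicated sample:
    the fraction of gold mentions the annotator failed to mark."""
    missed = sum(1 for m in sample_gold if m not in sample_annotated)
    return missed / len(sample_gold)

# Two careful annotators who each miss 5% (1/20) of mentions:
print(union_miss_rate([0.05, 0.05]))  # 0.0025, i.e. 1 in 400 missed
print(union_recall([0.05, 0.05]))     # 0.9975

# Ten "remotely qualified" annotators who each miss a sloppy 20%:
print(union_recall([0.20] * 10))      # ~0.9999999
```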

This approach strikes me as related to the recent revelations that Wikipedia, Digg, and other groups that try to take advantage of web democracy/crowd wisdom actually function best when they have a small group of "moderators" or "chaperones" (read Chris Wilson's article on this topic here).

So, take a large group of raters scattered around the whole wide world, give them the task and technology to complete potentially huge amounts of annotations quickly, chaperone their results just a bit, and voilà, large scale annotation projects made easy.
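For what it's worth, the "web site people can log in to from anywhere" part is the easy bit. Below is a minimal sketch of such a tagging interface, assuming Flask purely as an illustration (any web framework would do); the example sentences, the free-text tag field, and the in-memory tag store are all placeholders, and a real project would need login, persistent storage, and a task-specific tag set.

```python
# Minimal sketch of a crowdsourced tagging site (illustration only).
# Assumes Flask; ITEMS and TAGS are placeholder, in-memory stand-ins
# for a real item queue and a real database.
from flask import Flask, request, redirect

app = Flask(__name__)

ITEMS = [
    "The cat sat on the mat.",
    "Colorless green ideas sleep furiously.",
]
TAGS = {}  # item index -> list of (annotator, tag) pairs

@app.route("/item/<int:idx>", methods=["GET"])
def show_item(idx):
    # Show one item with a bare-bones tagging form.
    return f"""
        <p>{ITEMS[idx]}</p>
        <form method="post">
            <input name="annotator" placeholder="your name">
            <input name="tag" placeholder="tag">
            <button type="submit">save</button>
        </form>
    """

@app.route("/item/<int:idx>", methods=["POST"])
def save_tag(idx):
    # Record the tag and send the annotator on to the next item.
    TAGS.setdefault(idx, []).append(
        (request.form["annotator"], request.form["tag"])
    )
    return redirect(f"/item/{(idx + 1) % len(ITEMS)}")

if __name__ == "__main__":
    app.run(debug=True)
```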

You're welcome, hehe.

Wednesday, March 3, 2010

Oldest Example of Written English Discovered

No, not quite. The title of this post comes from a Digg link which linked to this article. The writing is dated at around 500 years old, which couldn't possibly be the "oldest example of written English", could it? The Huntington Library has the Ellesmere Chaucer, a manuscript c. 1405, so that's got it beat by 100 years already, and I haven't even bothered to look for Old English manuscripts. The claim in the title is quite different from the claim in the original article, which begins with this:

What is believed to be the first ever example of English written in a British church has been discovered. Problem is, no-one can read it.

This just means there's a lot of Latin written in English churches. The cool part is that they're crowdsourcing the interpretation.

If anyone thinks they can identify any further letters from the enhanced photographs, please contact us via the Salisbury Cathedral website. The basic questions of what exactly the words are and why the text was written on the cathedral wall remain unanswered. It would be wonderful for us to solve the mystery (link added).

Go on, give it a shot.


Looks like the original lyrics to Judas Priest's Better by You Better Than Me to me.

Wednesday, November 18, 2009

Crowdsourcing Annotation

(image from Phrase Detectives)

Thanks to the LingPipe blog here, I discovered an online annotation game called Phrase Detectives, designed to encourage people to contribute to the creation of hand-annotated corpora by making a game of it. It was created by the University of Essex, School of Computer Science and Electronic Engineering. Of course, they have a wiki, Anawiki. I'm not crazy about the cutesy cartoon mascot (they've given it a name: Sherlink Holmes. Ugh. I guess Annie would be a bit too obvious?). I've wondered aloud about this kind of thing before, so I'm glad to see it coming to fruition.

I haven't started playing the game yet, but I'm looking forward to it. For now, here is the project description:

The ability to make progress in Computational Linguistics depends on the availability of large annotated corpora, but creating such corpora by hand annotation is very expensive and time consuming; in practice, it is unfeasible to think of annotating more than one million words.

However, the success of Wikipedia and other projects shows that another approach might be possible: take advantage of the willingness of Web users to contribute to collaborative resource creation. AnaWiki is a recently started project that will develop tools to allow and encourage large numbers of volunteers over the Web to collaborate in the creation of semantically annotated corpora (in the first instance, of a corpus annotated with information about anaphora).

Cheers.
