Monday, February 11, 2008

The Perils of Semantic Annotation

One of the most challenging tasks a linguist can engage in is annotating natural language text for semantics. It is simultaneously interesting, tedious and tricky, which makes it altogether maddening. We perform this task for a variety of reasons: sometimes to create training data for learning algorithms (a big topic of discussion at last year's NAACL HLT), sometimes to explicate the semantics of events, as the FrameNet project does. Part of my dissertation is very FrameNet-like, so I do a lot of annotating (I will save my bile-filled hateful remarks about the general crappiness of annotator apps for another post).

Generally speaking, the annotator's task is to read naturally occurring sentences, then identify and tag the semantic roles of the participants involved in the particular event represented by the sentence. It would be easy if all of English were composed of sentences like "Bobby kicked the ball"; that would be sweet. "Bobby" is an AGENT, "the ball" is a PATIENT. Done. Let's move on. But that's not how real language works, is it?
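For the programmers in the audience, here is roughly what the easy case looks like as data. This is only a minimal sketch in Python: the RoleSpan class and the tag helper are names I made up for illustration, not the format of FrameNet or any real annotator app, and it assumes each phrase occurs exactly once in the sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleSpan:
    """One tagged participant: a role label plus character offsets."""
    role: str
    start: int  # inclusive character offset into the sentence
    end: int    # exclusive character offset into the sentence


def tag(sentence: str, phrase: str, role: str) -> RoleSpan:
    """Tag the first occurrence of phrase with role (hypothetical helper).

    Raises ValueError if the phrase does not occur in the sentence.
    """
    start = sentence.index(phrase)
    return RoleSpan(role, start, start + len(phrase))


sentence = "Bobby kicked the ball"
annotation = [
    tag(sentence, "Bobby", "AGENT"),
    tag(sentence, "the ball", "PATIENT"),
]

for span in annotation:
    print(span.role, repr(sentence[span.start:span.end]))
```

Storing offsets rather than copies of the words keeps the annotation tied to the exact stretch of text, which matters once the same word shows up twice in a sentence.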

In any case, I have been annotating sentences involving the verb "exclude" recently and I find it's a particularly challenging set. The BNC "exclude" sentence below was difficult to annotate because it is not clear who or what the participants in the exclude event are:

The new Minister for Health, Dr Noel Browne, a dedicated reformer of the health services and much concerned in particular with the eradication of tuberculosis in Ireland, modified the earlier bill to exclude the compulsion elements.

At first, I thought "Dr Noel Browne" was the agent doing the excluding, but then I realized it was the bill which excluded. But which bill? I concluded that "the earlier bill" is NOT participating in the exclude event because, logically, it must be the version of the bill that came AFTER the earlier one which did the excluding. So this requires a presupposed later bill, one that the sentence never mentions. Should I annotate the good Dr. as the agent, or leave this participant alone? (FrameNet's annotator app has the ability to mark an unexpressed element, and I believe this is exactly why, but I don't use their app.) Also, it's not clear whether the "to" means "in order to", as a purpose statement. Is the bill explicitly, directly excluding, or was that simply the intent of the changes? If it's indirect, that makes Dr. Browne a better candidate for the agent of exclusion.
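To make the dilemma concrete, here are the two competing analyses of the "exclude" event written out as data. Again, this is just a sketch in Python under my own assumptions: the Participant class and the analysis names are invented for illustration, the PATIENT label for "the compulsion elements" is a guess rather than an official role name, and a text of None is simply my stand-in for a participant that is understood but never expressed in the sentence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Participant:
    """A role in the event; text is None when the filler is unexpressed."""
    role: str
    text: Optional[str]

    @property
    def expressed(self) -> bool:
        return self.text is not None


# Two hypothetical analyses of the same "exclude" event.
analyses = {
    # Reading 1: "to" is a purpose clause, so the Minister does the excluding.
    "minister_as_agent": [
        Participant("AGENT", "Dr Noel Browne"),
        Participant("PATIENT", "the compulsion elements"),
    ],
    # Reading 2: the modified (later) bill excludes, but it is never mentioned.
    "implied_later_bill": [
        Participant("AGENT", None),
        Participant("PATIENT", "the compulsion elements"),
    ],
}

for name, participants in analyses.items():
    for p in participants:
        filler = p.text if p.expressed else "(unexpressed)"
        print(f"{name}: {p.role} = {filler}")
```

Keeping both readings side by side, rather than forcing one, at least records that the sentence is ambiguous instead of hiding the choice inside a single tag.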


