Wednesday, January 26, 2011

do we need parsed corpora?

Maybe not, according to Google: For many tasks, words and word combinations provide all the representational machinery we need to learn from text...invariably, simple models and a lot of data trump more elaborate models based on less data.

I've been wondering about this very issue for 5 years or so. When I first started collecting parsed BNC data for my defunct dissertation, I needed sentences involving various verbs and prepositions, but the examples I found were often of the wrong structural type because of preposition attachment ambiguity. I used Tgrep2 queries to find proper examples, but even then there were false positives, so I did some error correction. One of the more interesting discoveries I made was a relationship between a verb's role in its semantic class and its error rate.

I was trying to find a way to objectively define core members of a semantic verb class and peripheral members. I had a pretty good intuition about which were which, but I wanted to get beyond intuition (yes yes, it's all very Beth Levin).

For example, one of the objective clues for barrier verbs (a class of negative verbs encoding obstruction, like prevent, ban, exclude, etc) was the unusual role of the preposition from in sentences like these:
  • She prevented them from entering the pub.
  • He banned them from the pub.
  • They were excluded from the pub.
The preposition from is usually used to mark sources (He drove here from Buffalo) but in these sentences it's acting much more like a complementizer. This is fairly unique to barrier verbs and I felt it was distinctive of the verb class, so I wanted a bunch of examples. Because I needed to exclude examples involving old-fashioned source from, I used a Tgrep2 search that required the PP to be in a particular relationship to the verb (the BNC parse was a bit odd as I recall, and required some gymnastics).

Again, I had a lot of false positives even with Tgrep2 so I did some manual error analysis and discovered that certain verbs had very low error rates while others had very high rates and the difference coincided nicely with my intuition about which verbs were core members of the class and which were peripheral: core members like prevent had very low error rates. This means that when prevent is followed by a from-PP, it's almost always the complementizer from; obvious to adults, the meaning of a barrier verb doesn't easily include source (necessary for old-fashioned from), but how would a kid learn that? If I ban you from the pub, how does a kid know the pub is NOT where you started (source) but rather the opposite, it's where you're not allowed to end up (goal)? Cool little learning problem, I thought ... and with a data set other than frikkin dative (which Pinker and Levin have, let's face it, done to death).

I assumed there was something central to the meaning of the verb class that caused this special use of from. Then it occurred to me, if this is true, why do I need the parse? Imagine I ignore structure, take all sentences where from follows a relevant verb, then sample for false positives. That should give me basically the same thing.

I became increasingly fascinated with this methodology. I was now interested in how I was studying language, not what I was studying. And that led me to ask whether or not the parse info was all that valuable for other linguistic studies? But then I realized that when big news stories start getting old, the media always, always starts reporting on themselves, on how the news gets made ... I didn't like where I was heading ...

...and then I got a job and that was that ...

HT: Melodye


Margaret said...

Wow, great blog. Working does tend to move us from our initial goal, but that is not always a bad thing. Obviously, your move to work solved some "problem" you probably couldn't have articulated, but you knew how to solve it. Did it kill your parents for you to walk away from (note the word from) your PhD?

Chris said...

Margaret, thanks for the comment. Nahhh, my parents are hippies. They approach my entire life with the phrase "well, as long as you're happy..."

NLPers: How would you characterize your linguistics background?

That was the poll question my hero Professor Emily Bender posed on Twitter March 30th. 573 tweets later, a truly epic thread had been cre...