oh snap! daume talkin trash 'bout "stupid" penn tree bank

Hal Daume at his excellent NLPers blog is wondering aloud about parsing algorithms doing "real" syntax:

One thing that stands in our way, of course, is the stupid Penn Treebank, which was annotated only with very simple transformations (mostly noun phrase movements) and not really "deep" transformations as most Chomskyan linguists would recognize them [emphasis added].

Oh no he di'nt!

[UPDATE: hal responds thoughtfully in the comments and properly corrects my misunderstandings of his post.]

It's certainly fair to say that the Penn Treebank is not annotated for everything. Sure. But show me the perfect resource and I'll let you throw all the stones you want. More to the point, once you get beyond deciding what the basic chunks are (NPs, VPs, PPs, etc.), there's little agreement on what is and what is not a "real" syntactic thing. In order to annotate anything above this level, you have to choose a theoretical camp to park your tent in. You have to take sides. Daume is happy to be a Chomskyan. He's taken his side. Good for him.

In order to annotate Daume's beloved deep transformations, one must first admit such things exist. I do not. And if Daume started annotating the Penn Treebank with such things, I wouldn't care. I would argue he is wasting his time chasing unicorns.

Daume may believe that Chomskyan theory is "real" syntax, but I do not. Nor do most linguists (if you surveyed all linguists throughout the world, yes, I do believe a majority would disagree with the statement "I believe in Chomskyan deep structure").

UPDATE: Daume's comments and his responses are well worth reading.


hal said...

Hrmmm. I think you're misreading me, or maybe I wasn't really clear.

(a) Really I didn't mean that the Treebank stood in the way, but rather that researchers have essentially decided that PTB Syntax = the end-all be-all definition of syntax. I don't think any linguist (either side of the fence) would agree with that. Sure, agreement beyond what is there is hard/impossible.

(b) "Daume is happy to be a Chomskyan" -- OMG where did you get that? "Real" was in quotes for a reason, I "snickered" the whole time. I think everything should work from MDL. I definitely don't agree with a lot of Chomskyan linguists (I would self define more on the LFG side), but I think there is _something_ to the idea of movement. I don't think anyone would deny that PP fronting exists. Regardless of what you think about "deep structure".

(c) Or do you actually think that PP fronting is a unicorn? Or pro-drop? Is there really no relationship between the sentences "John ate an apple" and "An apple was eaten by John" and "What did John eat?"

I guess just because I don't think that Chomskyan linguistics is right, it doesn't mean that there aren't some useful ideas in there.

Chris said...

hal, always happy to take criticism, I am the "lousy linguist" remember, haha...

Yes, I failed to interpret your meaning with the quotes, my bad, sorry about that. A rush to judgment on my part, based entirely in my own theoretical biases (a double bad on me), again, I apologize.

I see now that I was unnecessarily combative in my response, the result of rash posting and years in a functional linguistics department where Chomsky was viewed as the "enemy".

Good point about PP fronting and such. Is it a unicorn? Well, surely there are sentences that start with PPs, but are we sure they are "fronted"? Or, are they simply what they are? I suspect extreme constructionists would say there is no "fronting" (take Croft for example) but rather, it is a construction unto itself. So what should the annotation be?

I should think on this more, but I'll say this: having been on the laborious side of annotating myself (wherein I spent many many hours annotating corpora for various reasons, mostly NE tagging) I recall a never-ending debate on guidelines (i.e., what counts as an X?) that made me loath to believe in the ease of creating annotations that were non-theoretical. At some point, you're faced with a decision that flat-out calls for a theoretical guideline.

