I've been wondering about this very issue for 5 years or so. When I first started collecting parsed BNC data for my defunct dissertation, I needed sentences involving various verbs and prepositions, but the examples I found were often of the wrong structural type because of preposition attachment ambiguity. I used Tgrep2 queries to find proper examples, but even then there were false positives, so I did some error correction. One of the more interesting discoveries I made was a relationship between a verb's role in its semantic class and its error rate.
I was trying to find a way to objectively define core members of a semantic verb class and peripheral members. I had a pretty good intuition about which were which, but I wanted to get beyond intuition (yes yes, it's all very Beth Levin).
For example, one of the objective clues for barrier verbs (a class of negative verbs encoding obstruction, like prevent, ban, exclude, etc.) was the unusual role of the preposition from in sentences like these:
- She prevented them from entering the pub.
- He banned them from the pub.
- They were excluded from the pub.
Again, I had a lot of false positives even with Tgrep2, so I did some manual error analysis. I discovered that certain verbs had very low error rates while others had very high rates, and the difference coincided nicely with my intuition about which verbs were core members of the class and which were peripheral: core members like prevent had very low error rates. This means that when prevent is followed by a from-PP, it's almost always the complementizer from. It's obvious to adults that the meaning of a barrier verb doesn't easily include source (necessary for old-fashioned from), but how would a kid learn that? If I ban you from the pub, how does a kid know the pub is NOT where you started (source) but rather the opposite: it's where you're not allowed to end up (goal)? Cool little learning problem, I thought ... and with a data set other than frikkin dative (which Pinker and Levin have, let's face it, done to death).
I assumed there was something central to the meaning of the verb class that caused this special use of from. Then it occurred to me, if this is true, why do I need the parse? Imagine I ignore structure, take all sentences where from follows a relevant verb, then sample for false positives. That should give me basically the same thing.
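The structure-free version of the method can be sketched roughly as follows. This is a minimal illustration, not the code used for the BNC work: the verb-form table, the six-token window, and all function names are assumptions made up for the example, and the false-positive judgments still have to come from a human checking the sampled sentences.

```python
import random
import re
from collections import defaultdict

# Hypothetical lemma -> surface-form table; a real study would use a lemmatizer.
VERB_FORMS = {
    "prevent": {"prevent", "prevents", "prevented", "preventing"},
    "ban": {"ban", "bans", "banned", "banning"},
    "exclude": {"exclude", "excludes", "excluded", "excluding"},
}
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in VERB_FORMS.items() for form in forms
}


def find_candidates(sentences, window=6):
    """Return (lemma, sentence) pairs where 'from' occurs within
    `window` tokens after a listed verb form. No parse is consulted."""
    hits = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, tok in enumerate(tokens):
            lemma = FORM_TO_LEMMA.get(tok)
            if lemma and "from" in tokens[i + 1 : i + 1 + window]:
                hits.append((lemma, sentence))
                break  # count each sentence at most once
    return hits


def sample_per_verb(hits, k, seed=0):
    """Draw up to k candidate sentences per verb for manual checking."""
    rng = random.Random(seed)
    by_verb = defaultdict(list)
    for lemma, sentence in hits:
        by_verb[lemma].append(sentence)
    return {
        lemma: rng.sample(sents, min(k, len(sents)))
        for lemma, sents in by_verb.items()
    }


def error_rates(judgments):
    """judgments: (lemma, is_false_positive) pairs from manual annotation."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for lemma, is_false_positive in judgments:
        totals[lemma] += 1
        errors[lemma] += int(is_false_positive)
    return {lemma: errors[lemma] / totals[lemma] for lemma in totals}
```

Run on a sentence like "The ban from last year ended.", `find_candidates` happily returns a hit for ban even though ban is a noun there. That is exactly the kind of false positive the sampling step is meant to count, and the per-verb rates from `error_rates` are what would then be compared against intuitions about core and peripheral class members.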
I became increasingly fascinated with this methodology. I was now interested in how I was studying language, not what I was studying. And that led me to ask whether the parse info was all that valuable for other linguistic studies. But then I realized that when big news stories start getting old, the media always, always start reporting on themselves, on how the news gets made ... I didn't like where I was heading ...
...and then I got a job and that was that ...