Thursday, August 22, 2013

corpus data 1: barrier verb frequencies

This is the eighth in a series of posts detailing data and analysis from my not-quite-entirely-completely-achieved linguistics dissertation (list of previous posts here).

Recall that if an entity wants to achieve a certain outcome, yet is impeded by some force, this situation can be encoded by a barrier verb in English, such as prevent, ban, protect.

Corpus Data
All data was extracted from The British National Corpus in roughly 2007 (yeah yeah, I could re-do this ... someday). Below are four tables representing the co-occurrence percentages of the most frequent verbs in each of the four categories for which I extracted barrier verb data.

Recall that barrier verbs can occur in one of four full syntactic templates* (S = clause, or an ING verb) which I call the Barrier Verb Construction (BVC):
  • A: verb X from S — prevent bad guys from stealing the TV.
  • B: verb X from NP — exclude students from the auditorium.
  • C: verb X against S — guard against getting athlete's foot.
  • D: verb X against NP — defend yourself against the police.
Without getting into the greasy details, the BVC data below was extracted from a parsed version of the British National Corpus, so it involved more than mere word frequencies (it required specific syntactic relationships to hold in tree structures). The numbers are sorted by the percentage of total occurrences (this equals the total BVC occurrences divided by the total frequency of each verb as reported by Adam Kilgarriff).
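For readers who like to see the labeling scheme spelled out, the four templates reduce to two choices: the preposition (from or against) and whether the complement is a clause/ING verb or a plain NP. The sketch below only encodes that labeling; it is not the tree-based extraction described above, and the function name `bvc_type` and its parameters are made up for illustration.

```python
def bvc_type(preposition, complement_is_clause):
    """Return the template label (A-D) for a preposition/complement pair.

    Hypothetical helper: it assumes the preposition and the complement kind
    have already been identified (e.g. from a parse tree). Returns None when
    the preposition is not one of the two used in the templates.
    """
    templates = {
        ("from", True): "A",      # verb X from S
        ("from", False): "B",     # verb X from NP
        ("against", True): "C",   # verb X against S
        ("against", False): "D",  # verb X against NP
    }
    return templates.get((preposition.lower(), complement_is_clause))


# "prevent bad guys from stealing the TV": from + ING verb
print(bvc_type("from", True))
# "defend yourself against the police": against + NP
print(bvc_type("against", False))
```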

How to read the table: the verb prevent had a total frequency of occurrence of 10286 according to the Kilgarriff data. I found 2152 correct occurrences of prevent in type A of the BVC. I interpret this to mean that about 21% (2152 / 10286) of all instances of the verb prevent (and its morphological variants) within the BNC occur within type A of the Barrier Verb Construction (i.e., with a from ING complement). On the other hand, the word suppress occurred 1311 times overall, but only two of those occurrences were in the BVC (i.e., with a from ING complement), which works out to well under 1%.
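The arithmetic behind those percentages is just BVC occurrences divided by total verb frequency, then sorted from highest to lowest. Here is a minimal sketch using only the two verbs quoted above; the function name `bvc_share` and the dictionary layout are my own assumptions for illustration, not the scripts used for the dissertation.

```python
def bvc_share(bvc_count, total_frequency):
    """Return BVC occurrences as a percentage of a verb's total frequency."""
    if total_frequency <= 0:
        raise ValueError("total_frequency must be positive")
    return 100.0 * bvc_count / total_frequency


# (type A BVC occurrences, total frequency) as quoted in this post.
counts = {
    "suppress": (2, 1311),
    "prevent": (2152, 10286),
}

# Sort verbs by percentage of total occurrences, highest first.
rows = sorted(
    ((verb, bvc_share(bvc, total)) for verb, (bvc, total) in counts.items()),
    key=lambda row: row[1],
    reverse=True,
)

for verb, share in rows:
    print(f"{verb}: {share:.2f}%")
```

Run on these two verbs, prevent comes out at roughly 21% and suppress at a small fraction of 1%, matching the reading given above.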

There is much more to be said about these stats. I offer this as a tantalizing morsel. To be continued...

*These four basic construction types do not include passives or sentences where there is only an implied complement.
