NPR ran a story today called Google Book Tool Tracks Cultural Change With Words. It's about "the biggest collection of words ever assembled"*: Google's 500-billion-word corpus, drawn from the books they've scanned. But here's the catch: many of those books are copyrighted, so Google pulled a trick that goes back to the very beginnings of computational linguistics. They present the words as an unordered set, or bag o' words:
Many of these books are covered by copyright, and publishers aren't letting people read them online. But the new database gets around that problem: It's just a collection of words and phrases, stripped of all context except the date in which they appeared.
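The idea is easy to see in a few lines of Python. This is just a minimal sketch of the general bag-of-words technique (not Google's actual pipeline): a text is reduced to a multiset of word counts, so the counts survive but the order, and with it most of the readable context, is thrown away.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Reduce a text to an unordered multiset of lowercase word counts."""
    return Counter(text.lower().split())

bag = bag_of_words("the cat sat on the mat")
# Word order is gone; only the counts survive.
print(bag["the"])  # 2
```

Note that any permutation of the same words produces an identical bag, which is exactly why the representation sidesteps the "reading the book online" problem: you can count with it, but you can't reconstruct the prose.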
I first learned about this technique back in 1999 in an intro to computational linguistics course (bit of trivia: we were using an incomplete pre-print of Jurafsky and Martin; as I recall, the discourse chapter consisted entirely of one page that read "21 Computational Discourse: write something here...") and I remember being appalled at its crass simplicity. I mean, how dare those idiot engineers reduce language to simple lists of words? How dare they try to use simple word lists to discover important facts about language and devise important linguistic tools?
It took less than a week for me to change my tune. The fact is, the bag o' words technique is remarkably powerful and useful. No, it doesn't solve every problem in one fell swoop, but it solves a hell of a lot more than I could have predicted as a naive 2nd-year linguistics grad student. For example:
Irregular verbs are used as a model of grammatical evolution. For each verb, researchers plotted the usage frequency of its irregular past-tense forms in red ("throve/thriven") and the usage frequency of its regularized past-tense form in blue ("thrived"). Virtually every irregular verb turns up in a regularized form from time to time, but the more frequently a verb is used, the more rarely it is regularized.
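This kind of plot falls straight out of dated word counts. Assuming rows in the style of the released ngram files (ngram, year, match count), here's a toy sketch of computing the regularized share of a verb's past tense per year; the counts below are invented for illustration:

```python
from collections import defaultdict

# Toy rows in the style of the ngram data: (ngram, year, match_count).
# These counts are made up purely to illustrate the computation.
rows = [
    ("thrived", 1900, 120), ("throve", 1900, 300),
    ("thrived", 1950, 400), ("throve", 1950, 150),
    ("thrived", 2000, 900), ("throve", 2000,  60),
]

per_year = defaultdict(lambda: {"regular": 0, "irregular": 0})
for ngram, year, count in rows:
    form = "regular" if ngram == "thrived" else "irregular"
    per_year[year][form] += count

# Fraction of past-tense uses that are regularized, by year.
for year in sorted(per_year):
    c = per_year[year]
    share = c["regular"] / (c["regular"] + c["irregular"])
    print(year, round(share, 2))
```

Nothing here needs sentence context at all: dated word counts alone are enough to watch "throve" lose ground to "thrived" over a century, which is the whole point of the bag-of-words trick.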
Google Labs lets you play with the tool here (hehe).
*Not sure where this claim originated, but Google has already released a 1-trillion-word corpus via the LDC, the Web 1T 5-gram Version 1.