The Lousy Linguist: how NOT to interpret ngrams

Saturday, December 18, 2010

how NOT to interpret ngrams

Andrew Sullivan has predictably misunderstood the value of Google's Ngram Viewer. He spent all day yesterday posting trite and simplistic misinterpretations of the data. For example:

The concept of ideology is a relatively recent one because the word "ideology" has become more frequent recently (this is almost certainly false).
Jesus "wins" (his word, not mine) against the Beatles because the word "Jesus" is more frequent.

I like the Ngram Viewer, but simply plotting word frequencies against each other to determine something about culture or concepts is a very weak technique that leads to massive misinterpretations. We've seen this recently with things like counting the number of times President Obama uses pronouns in his speeches. I discussed the failings of simple word counts as a technique in an earlier post. To sum up:

We don't know what causes word frequencies.
We don't know what the effects of word frequencies are.
There are good alternatives.
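
To make the point concrete, here is a minimal sketch of the kind of number a frequency plot is built from: a word's share of all the tokens in a given year. The function name `relative_frequency` and the two-"year" toy corpus are invented for illustration; this is not Google's code or data. All the calculation yields is a ratio of counts. Nothing in it says why a word was used, or what its use did.

```python
from collections import Counter


def relative_frequency(tokens_by_year, word):
    """Return {year: share of that year's tokens that equal `word`}.

    `tokens_by_year` maps a year to a list of word tokens.
    Matching is case-insensitive.
    """
    result = {}
    for year, tokens in tokens_by_year.items():
        counts = Counter(t.lower() for t in tokens)
        total = sum(counts.values())
        # Guard against an empty year so we never divide by zero.
        result[year] = counts[word.lower()] / total if total else 0.0
    return result


# Invented toy corpus, for illustration only (not real data).
toy_corpus = {
    1900: "the war ended and the peace began".split(),
    1950: "ideology ideology and the war".split(),
}

freq = relative_frequency(toy_corpus, "ideology")
for year in sorted(freq):
    print(year, round(freq[year], 3))
```

A rising curve here means only that the word makes up a larger share of the tokens. Whether the concept behind it is new, popular, or "winning" is a separate question that the counts cannot answer on their own.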

2 comments:

Unknown said...: We know *some* of the effects of word frequencies - things like they're more likely to be the landing spot of phonological or semantic drift speech errors.

Cool thing? http://ngrams.googlelabs.com/graph?content=love%2Cwar&year_start=1800&year_end=2000&corpus=0&smoothing=3 "War" gets a big bump from 1914-1920 and then again at 1939-1945. How cool is that??; December 18, 2010 at 12:54 PM
Chris said...: Fair point, but let's be honest: the average non-linguist knows nothing about the relationship between lexical frequency and speech errors, and the Google tool tells them nothing about that.; December 18, 2010 at 1:16 PM
