Saturday, December 18, 2010

how NOT to interpret ngrams

Andrew Sullivan has predictably misunderstood the value of Google's Ngram Viewer. He spent all day yesterday posting trite and simplistic misinterpretations of the data. For example,
I like the Ngram Viewer, but simply plotting the frequencies of words against each other to determine something about culture or concepts is a very weak technique that leads to massive misinterpretations, as we've seen recently with things like counting the number of times President Obama uses pronouns in his speeches. I discussed the failings of simple word counts as a technique here. To sum up,
  • We don't know what causes word frequencies.
  • We don't know what the effects of word frequencies are.
  • There are good alternatives.

2 comments:

allisons said...

We know *some* of the effects of word frequencies - things like they're more likely to be the landing spot of phonological or semantic drift speech errors.

Cool thing? http://ngrams.googlelabs.com/graph?content=love%2Cwar&year_start=1800&year_end=2000&corpus=0&smoothing=3 "War" gets a big bump from 1914-1920 and then again at 1939-1945. How cool is that??

Chris said...

Fair point, but let's be honest: the average non-linguist knows nothing about the relationship between lexical frequency and speech errors, and the Google tool tells them nothing about that.