Saturday, December 18, 2010

how NOT to interpret ngrams

Andrew Sullivan has predictably misunderstood the value of Google's Ngram Viewer. He spent all day yesterday posting trite and simplistic misinterpretations of the data. For example,
I like the Ngram Viewer, but simply plotting the frequencies of words against each other to determine something about culture or concepts is a very weak technique. It leads to massive misinterpretations, as we've seen recently with things like counting the number of times President Obama uses pronouns in his speeches. I discussed the failings of simple word counts as a technique here. To sum up:
  • We don't know what causes word frequencies.
  • We don't know what the effects of word frequencies are.
  • There are good alternatives.


allisons said...

We know *some* of the effects of word frequencies - things like they're more likely to be the landing spot of phonological or semantic drift speech errors.

Cool thing? "War" gets a big bump from 1914-1920 and then again at 1939-1945. How cool is that??

Chris said...

Fair point, but let's be honest, the average non-linguist knows nothing about the relationship between lexical frequency and speech errors, and the Google tool tells them nothing about that.
