Saturday, December 18, 2010

how NOT to interpret ngrams

Andrew Sullivan has predictably misunderstood the value of Google's Ngram Viewer. He spent all day yesterday posting trite and simplistic misinterpretations of the data. For example,
I like the Ngram Viewer, but simply plotting the frequencies of words against each other to determine something about culture or concepts is a very weak technique. It leads to massive misinterpretations, as we've seen recently with things like counting the number of times President Obama uses pronouns in his speeches. I discussed the failings of simple word counts as a technique here. To sum up:
  • We don't know what causes word frequencies.
  • We don't know what the effects of word frequencies are.
  • There are good alternatives.


allisons said...

We know *some* of the effects of word frequencies - things like they're more likely to be the landing spot of phonological or semantic drift speech errors.

Cool thing? "War" gets a big bump from 1914-1920 and then again at 1939-1945. How cool is that??

Chris said...

Fair point, but let's be honest: the average non-linguist knows nothing about the relationship between lexical frequency and speech errors, and the Google tool tells them nothing about that.
