Sunday, December 19, 2010

ngram roundup

It's not difficult to find glee and excitement surrounding Google's new Ngram Viewer. Hyperbolic praise is whirling around the innerwebz like mad. As an antidote, and as a nod to the role skepticism should play in our contemporary society, I present a brief roundup of criticisms:

Geoffrey Nunberg:
...there are still a fair number of misdated works, and there's no way to restrict a query by genre or topic. But in the end, the most important consequence of the Science paper, and of allowing public access to the data, is that it puts "culturomics" into conversational play.

Mark Davies:
Google Books can't use wildcards to search for parts of words. For example, try searching for freak* out (all forms: freak_, freaked, freaking, etc) or even a simple search like teenager* ... if Google Books doesn't know about part of speech tags or variant forms of a word, then how can it look at change in grammar? ... To use collocates with Google Books, you would have to manually download thousands or millions of hits to your hard drive, and then use another program to look for and categorize the collocates.
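To make Davies's point concrete, here is a minimal sketch of what a wildcard search plus collocate count looks like once you have plain text on your own hard drive. Everything in it is my own illustration, not anything Google or Davies provides: the function names `wildcard_to_regex` and `collocates`, the `window` parameter, and the toy sample sentence are all assumptions, and it only handles single-word patterns like freak*, not two-word patterns like freak* out.

```python
import re
from collections import Counter


def wildcard_to_regex(pattern):
    """Compile a single-word pattern such as 'freak*' into a regex.

    Hypothetical helper: '*' stands for any run of word characters.
    """
    escaped = re.escape(pattern).replace(r"\*", r"\w*")
    return re.compile(escaped, re.IGNORECASE)


def collocates(text, pattern, window=2):
    """Count the words that occur within `window` tokens of any match."""
    tokens = re.findall(r"\w+", text.lower())
    node = wildcard_to_regex(pattern)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if node.fullmatch(tok):
            start = max(0, i - window)
            # Neighbours on both sides, excluding the matched word itself.
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts


# Toy sample text, invented for illustration only.
sample = "She freaked out. He was freaking out again. Teenagers freak out often."
print(collocates(sample, "freak*", window=1).most_common(3))
```

On a real corpus you would also want part-of-speech tags and lemmas, which is exactly the information Davies says the Ngram data lacks.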

Mark Liberman:
The Science paper says that "Culturomics is the application of high-throughput data collection and analysis to the study of human culture".  But as long as the historical text corpus itself remains behind a veil at Google Books, then "culturomics" will be restricted to a very small corner of that definition, unless and until the scholarly community can reproduce an open version of the underlying collection of historical texts.

David Crystal:
...this is just a collection of books - no newspapers, magazines, advertisements, or other orthographic places where culture resides. No websites, blogs, social networking sites. No spoken language, of course, so over 90 percent of the daily linguistic usage of the world isn't here... The approach, in other words, shows trends but can't interpret or explain them. It can't handle ambiguity or idiomaticity...

The Binder Blog:
The value of the Ngrams Viewer rests on a bold conceit: that the number of times a word is used at certain periods of time has some kind of relationship to the culture of the time. For example, the fact that the word “slavery” peaks around 1860 suggests that people in 1860 had a lot to say about slavery. Another spike around the 1970s meshes nicely with the Civil Rights Movement. Well, that’s sort of interesting. However, I didn’t need ngrams to tell me that a lot of people were writing about slavery in 1860. These data are broad but not deep, which makes them relatively useless to most humanities majors interested in intensive study.

The one positive comment that I think bears repeating concerns the role this fun little tool might play in sparking the imagination of young students interested in what technology can do for the humanities.

Geoffrey Nunberg:
Whatever misgivings scholars may have about the larger enterprise, the data will be a lot of fun to play around with. And for some—especially students, I imagine—it will be a kind of gateway drug that leads to more-serious involvement in quantitative research.


Oliver Mason said...

Well, 'slavery' in the 1860s is a kind of obvious thing, of course. But that shows that we can use the system to find things we know are there. More interesting is to come across things we wouldn't have predicted in advance - and we can then be reasonably confident because there are examples that show it works.

Chris said...

Fair point, but finding things you cannot predict is difficult work. Not sure the Ngram Viewer makes that process easier in any systematic way.
