Monday, March 24, 2008

Google Linguistics

Erin made the following well-taken point in a comment to this earlier post:

This appeal to the authority of Google is troublesome in linguistics, since we often refer to Google results for evidence for hypotheses about usage. That is documents indexed by Google as a data source, rather than its search results as authoritative figure, of course, but this may not be obvious to the average Joe. :\

I have used Google repeatedly to find instances of constructions that I could not find using standard corpus linguistics methods with hand-compiled corpora like the BNC. Typically I’m looking for any instance at all, just to prove that people really do say the thing I’m claiming is possible. For example, I needed to find examples of passivized complements embedded under 60 different barrier verbs, following this pattern:

a. I banned John from being examined by the doctor.
b. I banned John from getting examined by the doctor.

Many of the verbs I wanted to search for are low frequency in the BNC (e.g., barricade, derail, hamper), so the likelihood of finding examples of passivized complements using, say, a Tgrep2 search is low. So I ventured into the scary land of Google Linguistics. I used the search queries “verbed * from being” and “verbed * from getting”, where “verbed” stands for the past-tense form of each verb. Within a short time, I had multiple examples for most of the verbs I was looking for. I can’t imagine performing this task more efficiently with any other tool. Google worked really well under those circumstances.
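To make the query pattern concrete, here is a minimal sketch of how the query strings could be generated for a list of verbs. I did the searches by hand, so this is only an illustration: the names `PAST_FORMS`, `AUXILIARIES`, and `build_queries` are made up for the example, the past-tense forms are typed in by hand rather than derived automatically, and only the four verbs mentioned above are listed, not the full set of 60.

```python
# Past-tense forms are listed by hand (an assumption of this sketch);
# only the four verbs named in the post appear here, not all 60.
PAST_FORMS = ["banned", "barricaded", "derailed", "hampered"]

# The two passive auxiliaries from examples (a) and (b).
AUXILIARIES = ["being", "getting"]


def build_queries(past_forms, auxiliaries=AUXILIARIES):
    """Return quoted phrase queries of the form "<verbed> * from <aux>".

    The asterisk is the search engine's wildcard, standing in for the
    object noun phrase (e.g. "John").
    """
    return [
        f'"{verb} * from {aux}"'
        for verb in past_forms
        for aux in auxiliaries
    ]


if __name__ == "__main__":
    for query in build_queries(PAST_FORMS):
        print(query)
```

Each printed line can be pasted into the search box as-is. The results still have to be read by hand to confirm that the complement really is a passive.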

Let me note, though, that I have not used Google hit counts or page counts to derive any statistics about frequency of occurrence. When I do this sort of thing, I use common sense to decide whether a result comes from a native speaker, and I often skim the page to see if there are any obvious ESL errors. I also rely on my own intuition about the acceptability of a usage (by pure coincidence, Peter Ludlow from U. Toronto will be here in Buffalo this week giving a talk on the role of linguistic intuitions).

One of the more thorough discussions of the use of search engines in linguistics research is Adam Kilgarriff’s “Googleology is bad science”, a squib from Computational Linguistics (2007, vol. 33, no. 1).

He writes that the web is attractive to linguists because it is “enormous, free, immediately available, and largely linguistic”. But he points out four major flaws:

1. search engines do not lemmatise or part-of-speech tag
2. search syntax is limited
3. there are constraints on numbers of queries and numbers of hits per query
4. search hits are for pages, not for instances.
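The fourth point is easy to see with a toy example. The sketch below is mine, not Kilgarriff’s: it assumes you already have the text of some pages in hand, and it uses a rough regular expression (one to three words between the verb and “from”, an arbitrary choice) as a stand-in for the wildcard query. The function name `page_and_instance_counts` and the sample pages are invented for illustration.

```python
import re


def page_and_instance_counts(pages, past_form):
    """Return (pages containing the pattern, total instances of the pattern).

    The pattern approximates the query "<verbed> * from being/getting",
    allowing 1 to 3 words where the wildcard would be (an assumption).
    """
    pattern = re.compile(
        rf"\b{re.escape(past_form)}\s+(?:\w+\s+){{1,3}}from\s+(?:being|getting)\b",
        re.IGNORECASE,
    )
    per_page = [len(pattern.findall(text)) for text in pages]
    page_hits = sum(1 for n in per_page if n > 0)
    instance_hits = sum(per_page)
    return page_hits, instance_hits


if __name__ == "__main__":
    # Invented sample "pages", for illustration only.
    pages = [
        "I banned John from being examined. They banned the kids from getting seen.",
        "She banned him from being there.",
        "Nothing relevant on this page.",
    ]
    print(page_and_instance_counts(pages, "banned"))
```

On these three sample pages, two pages match but the pattern occurs three times, so a page count understates the number of instances. A search engine reports only the first kind of number.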

Kilgarriff offers this alternative: “work like the search engines, downloading and indexing substantial proportions of the web, but to do so transparently, giving reliable figures, and supporting language researchers’ queries”.

The squib goes on to detail how we might go about doing that in a principled way. It’s well worth the read.
