Thursday, December 24, 2009

Google Basterds

Since Liberman at LL just re-confirmed "the observation that Google counts no longer have even order-of-magnitude comparative validity in matters of usage (if they ever did)," I thought I'd pass along my own latest discovery: Google double quotes are not as restrictive in queries as they're claimed to be. From Google's support page:

Phrase search ("")
By putting double quotes around a set of words, you are telling Google to consider the exact words in that exact order without any change. Google already uses the order and the fact that the words are together as a very strong signal and will stray from it only for a good reason, so quotes are usually unnecessary. By insisting on phrase search you might be missing good results accidentally. For example, a search for [ "Alexander Bell" ] (with quotes) will miss the pages that refer to Alexander G. Bell.

But this is not what it seems...

This is a classic recall vs. precision issue, right? If you care about recall, you want to return ALL correct matches, even if you also return other stuff (inclusive). If you care about precision, you want to make sure that each result returned is correct, even if this means you miss some correct matches (restrictive). Read more here. (Psst, note that the very issue Liberman posted about is alive and well here: I tend to say "recall and precision" while it's quite common, perhaps more so, to say "precision and recall".)
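To make the two terms concrete, here is a minimal sketch in Python. The document IDs and the two result sets are made up for illustration; nothing here comes from an actual search engine.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned results.

    precision = correct results returned / all results returned
    recall    = correct results returned / all correct results that exist
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four documents really match what we are looking for.
relevant = {"doc1", "doc2", "doc3", "doc4"}

# An inclusive search finds all four, plus four wrong ones.
inclusive = {"doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7", "doc8"}

# A restrictive search returns only two results, both correct.
restrictive = {"doc1", "doc2"}

print(precision_recall(inclusive, relevant))    # (0.5, 1.0)
print(precision_recall(restrictive, relevant))  # (1.0, 0.5)
```

The inclusive search wins on recall and the restrictive one wins on precision, which is the trade-off described above.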

Google, like most search engines, allows us to put double quotes around a query to make it highly restrictive. In theory, this should mean that a query with no quotes around it always returns at least as many matches as the exact same query with quotes, and usually more. The quoted query matches should be a subset of the unquoted query matches. Got it? If I'm wrong on this, let me know, but that's my assumption.
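Here is a toy sketch of that assumption in Python. It assumes an unquoted query means "the page contains all of these words, in any order" and a quoted query means "the page contains these words consecutively, in this order." That is my simplification, not a description of how Google actually works, and the four "pages" are invented.

```python
import re


def tokens(text):
    """Lowercase the text and split it into words (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def unquoted_matches(query, docs):
    """Indexes of docs containing every query word, in any order."""
    wanted = set(tokens(query))
    return {i for i, doc in enumerate(docs) if wanted <= set(tokens(doc))}


def quoted_matches(query, docs):
    """Indexes of docs containing the query words consecutively, in order."""
    phrase = tokens(query)
    n = len(phrase)
    found = set()
    for i, doc in enumerate(docs):
        words = tokens(doc)
        if any(words[j:j + n] == phrase for j in range(len(words) - n + 1)):
            found.add(i)
    return found


# Hypothetical mini-corpus.
docs = [
    "He said I'll be back and left.",
    "I'll be right back after lunch.",
    "Back when I'll be older, maybe.",
    "Nothing relevant here.",
]

quoted = quoted_matches("I'll be back", docs)
unquoted = unquoted_matches("I'll be back", docs)

print(sorted(quoted))      # [0]
print(sorted(unquoted))    # [0, 1, 2]
print(quoted <= unquoted)  # True: every quoted match is also an unquoted match
```

Under these definitions the quoted count can never exceed the unquoted count, because any page with the exact phrase necessarily contains all of its words.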

Yesterday I wanted to know if some of the better quotes from Tarantino's recent movie Inglourious Basterds were being picked up in general usage yet, so I Googled some of them and looked at their search result estimates. As a sort of baseline, I decided to Google some famous lines from film history, to see how many hits famous lines generally get. However, some of the lines are similar to common phrases (e.g., "I'll be back" vs "I'll be right back"). To account for this, I put those lines in double quotes, to restrict the returns to exact matches. Being a semi-trained researcher, I realized that I should go back and put all lines in double quotes and try to compare apples to apples. Then I discovered something weird. In some cases, the more restrictive, double-quoted query returned more hits than the unquoted query. A lot more. And the results have stood up through repetition. For example:

Gone With The Wind
about 797,000 for "Frankly, my dear, I don't give a damn!"
about 163,000 for Frankly, my dear, I don't give a damn!

Taxi Driver
about 17,500,000 for "You talkin' to me?"
about 7,450,000 for You talkin' to me?
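To put a number on how far off these are, here is a quick Python check using only the four estimates listed above. The helper name is my own; it flags any query whose quoted count exceeds its unquoted count, which should never happen if quoted matches are a subset of unquoted matches.

```python
# (quoted estimate, unquoted estimate) as reported above.
counts = {
    "Frankly, my dear, I don't give a damn!": (797_000, 163_000),
    "You talkin' to me?": (17_500_000, 7_450_000),
}


def subset_violations(counts):
    """Return {query: quoted/unquoted ratio} where quoted > unquoted."""
    return {
        query: quoted / unquoted
        for query, (quoted, unquoted) in counts.items()
        if quoted > unquoted
    }


for query, ratio in subset_violations(counts).items():
    print(f"{query}: quoted estimate is {ratio:.1f}x the unquoted estimate")
```

Both lines are flagged: the quoted estimate is roughly 4.9 times the unquoted one for the Gone With The Wind line and roughly 2.3 times for the Taxi Driver line.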

Maybe I just don't get what the double quotes are doing. And Google doesn't make money from helping linguists study language; they make money from pairing ads with search queries, and bully for them. I'm a capitalist at heart. I don't begrudge anyone making a buck, especially a bunch of seriously smart Stanford PhDs. But still, it's disappointing that such a powerful engine as Google's isn't more useful to the research community. I should re-read Adam Kilgarriff's "Googleology is Bad Science."


Anonymous said...

Hey Lousy Linguist,

maybe this post is relevant to you. I also noticed very strange behavior in quoted Google counts. Some counts were off by a factor of 100,000! In all cases, though, I found that the bug disappeared as soon as I clicked on the "Next" button to see any of the supposed 100,000+ hits ... which then suddenly turned out to be only about 4-20.

Chris said...

hlplab, great link, thanks! No doubt it's official, Google is NOT reliable for ling work. I meant to contrast these results with WebCorp, just haven't gotten around to it.

And it's nice to see a Ra Cha Cha lab with a blog! I'll add it to my blog roll.

Lonehermit said...

Google double quotes have never been reliable. In Latvian, double quotes can sometimes be used to filter out letters with diacritics, but sometimes that doesn't work ("ejam mājās" vs. "ejam majas"). The behavior is not predictable and can even be location dependent: your country, language settings, even your ISP can have a strange influence on your results.

I don't think that they store all exact strings in their database. Probably the string A B C is stored as "A B", "B C", etc., and some clever algorithm figures it out as long as the permissible number of CPU cycles is not exceeded.
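The guess in this comment can be sketched in a few lines of Python. This is purely an illustration of the commenter's hypothesis, not a description of Google's index: the function names and the three tiny "documents" are invented. The index stores only adjacent word pairs, and a phrase query is approximated by asking for documents that contain every pair in the phrase.

```python
from collections import defaultdict


def bigrams(words):
    """Adjacent word pairs: ['a', 'b', 'c'] -> [('a', 'b'), ('b', 'c')]."""
    return list(zip(words, words[1:]))


def build_index(docs):
    """Map each adjacent word pair to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in bigrams(text.lower().split()):
            index[pair].add(doc_id)
    return index


def approx_phrase_search(index, phrase):
    """Doc IDs containing every word pair of the phrase.

    The pairs are not required to be contiguous in the document, so this
    can overcount. Phrases shorter than two words return an empty set.
    """
    pairs = bigrams(phrase.lower().split())
    if not pairs:
        return set()
    result = None
    for pair in pairs:
        ids = index.get(pair, set())
        result = set(ids) if result is None else result & ids
    return result


# Hypothetical documents.
docs = {1: "a b c", 2: "a b x b c", 3: "c b a"}
index = build_index(docs)

print(sorted(approx_phrase_search(index, "a b c")))  # [1, 2]
```

Document 2 never contains "a b c" as a phrase, yet it is returned because it contains both "a b" and "b c". If something along these lines were happening without a final verification pass, quoted counts could be inflated, though this is speculation.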
