Since Liberman at LL just re-confirmed "the observation that Google counts no longer have even order-of-magnitude comparative validity in matters of usage (if they ever did)," I thought I'd pass along my own latest discovery: Google double quotes are not as restrictive in queries as they're claimed to be. From Google's support page:
Phrase search ("")
By putting double quotes around a set of words, you are telling Google to consider the exact words in that exact order without any change. Google already uses the order and the fact that the words are together as a very strong signal and will stray from it only for a good reason, so quotes are usually unnecessary. By insisting on phrase search you might be missing good results accidentally. For example, a search for [ "Alexander Bell" ] (with quotes) will miss the pages that refer to Alexander G. Bell.
But this is not what it seems...
This is a classic recall vs precision issue, right? If you care about recall, you want to return ALL matches, even if you also return other stuff (inclusive). If you care about precision, you want to make sure that each return is correct with no errors, even if this means you miss some correct matches (restrictive). Read more here. (psst, note the very issue Liberman posted about is alive and well here. I tend to say "recall and precision" while it's quite common, perhaps more so, to say "precision and recall").
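To make the trade-off concrete, here is a minimal sketch of how the two numbers are usually computed. The document IDs and the `precision_recall` helper are made up for illustration; nothing here comes from Google.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned items
    measured against the set of truly relevant items."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # correct returns
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: four documents really match the query.
relevant = {"d1", "d2", "d3", "d4"}

# An inclusive search returns everything relevant plus some junk.
inclusive = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

# A restrictive search returns only sure things and misses some.
restrictive = {"d1", "d2"}

inclusive_pr = precision_recall(inclusive, relevant)      # (0.5, 1.0)
restrictive_pr = precision_recall(restrictive, relevant)  # (1.0, 0.5)
print("inclusive:  ", inclusive_pr)
print("restrictive:", restrictive_pr)
```

The inclusive search gets perfect recall at the cost of precision, and the restrictive one does the reverse.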
Google, like most search engines, allows us to put double quotes around a query to make it highly restrictive. In theory, this should mean that a query with no quotes around it should always return at least as many matches as the exact same query with quotes, and usually more. The quoted query matches should be a subset of the unquoted query matches. Got it? If I'm wrong on this, let me know, but that's my assumption.
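Here is a toy sketch of that assumption, using Google's own "Alexander Bell" example. It assumes an unquoted query matches any document containing all of the query words in any position, and a quoted query matches only documents containing the words adjacent and in order. That is a simplification for illustration, not a description of how Google actually works; the documents and function names are invented.

```python
import re


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def unquoted_match(query, doc):
    # Assumption: every query word appears somewhere in the document.
    doc_tokens = set(tokens(doc))
    return all(word in doc_tokens for word in tokens(query))


def quoted_match(query, doc):
    # Assumption: the query words appear adjacent and in the same order.
    q, d = tokens(query), tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# Hypothetical mini-corpus.
docs = [
    "Alexander Bell patented the telephone",
    "Alexander G. Bell patented the telephone",
    "Bell Labs hired Alexander last year",
    "Nothing to see here",
]

query = "Alexander Bell"
quoted = {i for i, doc in enumerate(docs) if quoted_match(query, doc)}
unquoted = {i for i, doc in enumerate(docs) if unquoted_match(query, doc)}

print("quoted:  ", sorted(quoted))
print("unquoted:", sorted(unquoted))
print("quoted is a subset of unquoted:", quoted <= unquoted)
```

Under these assumptions the quoted hit count can never exceed the unquoted one, which is exactly the expectation the numbers below violate.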
Yesterday I wanted to know if some of the better quotes from Tarantino's recent movie Inglourious Basterds were being picked up in general usage yet, so I Googled some of them and looked at their search results estimates. As a sort of baseline, I decided to Google some famous lines from film history, to see how many hits famous lines generally get. However, some of the lines are similar to common phrases (e.g., "I'll be back" vs "I'll be right back"). To account for this, I put those lines in double quotes, to restrict the returns to exact matches. Being a semi-trained researcher, I realized that I should go back and put all lines in double quotes and try to compare apples to apples. Then I discovered something weird. In some cases, the more restrictive, double-quoted query returned more hits than the unquoted query. A lot more. And the results have stood up through repetition. For example:
Gone With The Wind
about 797,000 for "Frankly, my dear, I don't give a damn!"
about 163,000 for Frankly, my dear, I don't give a damn!
Taxi Driver
about 17,500,000 for "You talkin' to me?"
about 7,450,000 for You talkin' to me?
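For a sense of scale, here is a small sketch that checks the counts above against the subset expectation (quoted hits should never exceed unquoted hits). The numbers are the estimates reported above; the variable names are just for illustration.

```python
# (quoted_count, unquoted_count) as reported above
counts = {
    "Frankly, my dear, I don't give a damn!": (797_000, 163_000),
    "You talkin' to me?": (17_500_000, 7_450_000),
}

violations = []
for line, (quoted, unquoted) in counts.items():
    ratio = quoted / unquoted
    if quoted > unquoted:
        # The quoted estimate is larger, contradicting the subset assumption.
        violations.append((line, ratio))
    print(f"{line!r}: quoted/unquoted = {ratio:.1f}")

print(f"{len(violations)} of {len(counts)} lines violate the expectation")
```

Both lines fail the check: the quoted estimate is several times larger than the unquoted one for the first line and more than double for the second.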
Maybe I just don't get what the double quotes are doing. And Google doesn't make money from helping linguists study language; they make money from pairing ads with search queries, and bully for them. I'm a capitalist at heart. I don't begrudge anyone making a buck, especially a bunch of seriously smart Stanford PhDs. But still, it's disappointing that such a powerful engine as Google's isn't more useful to the research community. I should re-read Adam Kilgarriff's "Googleology is Bad Science."