Thursday, December 30, 2010

soccer vs. football

Too late for The World Cup, but thanks to Stan Carey at Sentence First, I only just now discovered that we Yanks are not the only English speakers who use soccer to refer to, ya know, that game where you can't touch the ball with your hands (tennis? no... the one that Ronaldo plays). In fact, there are about 74 million OTHER English speakers in this world who use soccer to refer to Ronaldo's game too. Add the USA's 308 million, and it is almost certainly the case that more English speakers use soccer than football. With that, I say thppppt to the English...


(image from Wikipedia)

UPDATE [3:38PM eastern]: reader vp points out the following passage from the same Wikipedia article the image came from: several official publications of the English Football Association have the word "soccer" in the title. Simon Kuper and Stefan Szymanski write that soccer was the most common name for the game in Britain from the 1890s until the 1970s, and suggest that the decline of the word soccer in the UK, and the incorrect perception that it is an Americanism, were linked to awareness of the North American Soccer League in the 1970s.

plagiarism and n-grams

Big media plagiarism is once again in the news as ESPN has suspended an on-air host for plagiarizing three sentences from a newspaper columnist. The host has admitted the plagiarism*, issued an apology, and asked for forgiveness.

The multiple and confusing ethical standards for plagiarism have been the subject of several LL posts (recently here), and this led me to wonder what counts as plagiarism in the first place. Clearly a three-sentence, 45-word passage, almost word-for-word identical with another, in the same semantic domain with the same referents, is a case of plagiarism. But what about a 20-word passage? 10 words? 4**?

Many short phrases are highly frequent, right? You couldn't felicitously accuse me of plagiarism for using the phrase "I am going..." could you? Even though there can be no doubt that someone else before me used it first. Yes, I know you can find guidelines for plagiarism in college student handbooks and such. I dealt with those for years when I taught college writing courses (and I recall flunking at least three students for plagiarism, but those were whole papers, really stupid stuff).

But I wonder, now that we have a 500-million-word corpus available to us, couldn't we simply compare all n-grams to discover how likely it is that any given 5-gram is repeated? I'd prefer to do this up to 20-grams and beyond, but wouldn't we predict that there comes a point at which the likelihood that a particular phrase was plagiarized (given that we had found two alike) would be based solely on the general likelihood that n-grams of that size are repeated? The situation would be this: you discover that a particular 11-word passage has an identical twin from 2 years ago. Without bothering to look into whether or not the author had access to the previous work, you simply look up the likelihood that any 11-gram is repeated and discover that there is a 0.0002% chance that a phrase that long will be repeated.

With some effort, you could then derive predictions for near identical passages (using WordNet and similar resources)....
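The back-of-envelope study I'm imagining is easy to prototype. Here's a minimal sketch (the corpus below is a toy example I made up; a real study would run this over something like the 500-million-word corpus) showing how the repeat rate for n-grams collapses as n grows:

```python
from collections import Counter

def ngram_repeat_rate(tokens, n):
    """Fraction of distinct n-grams that occur more than once."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not grams:
        return 0.0
    repeated = sum(1 for count in grams.values() if count > 1)
    return repeated / len(grams)

# Toy "corpus" for illustration only.
corpus = ("the cat sat on the mat and the cat sat on the rug "
          "while the dog sat on the mat").split()

for n in (2, 3, 5):
    print(n, round(ngram_repeat_rate(corpus, n), 3))
```

Even on this tiny sample, bigrams recur often while 5-grams almost never do, which is the intuition behind treating a long shared n-gram as evidence of copying.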

..just thinking out loud...


*I am ignorant of the role ESPN's producers play in the writing of on-air speeches, but the quote seems clearly to have been written on a teleprompter at the time of speaking, which means someone else was involved, even if unwittingly. Nonetheless, the host is taking the fall willingly.

**Excluding obviously famous phrases like Ich bin ein Berliner.

does asbestos really mean 'unquenchable'?

Yes, at least etymologically. The Online Etymology Dictionary explains it this way:
...from O.Fr. abeste, from L. asbestos "quicklime" (which "burns" when cold water is poured on it), from Gk. asbestos, lit. "inextinguishable," from a- "not" + sbestos, verbal adj. from sbennynai "to quench," from PIE base *(s)gwes- "to quench, extinguish" (cf. Lith. gestu "to go out," O.C.S. gaso, Hittite kishtari "is being put out") (emphasis added).

Like people, every word has lived its own peculiar and unique life. Riffing on my post below regarding words that have the opposite meaning of their etymology, my friend Andy (who did graduate work in Classics, and hence, actually reads Greek) challenged me to help him understand why the word asbestos, whose etymology literally means 'unquenchable', is used today to refer to a substance that cannot burn.

With some Googling, I found this (PDF):  "First mention of asbestos appeared in the Greek text On Stones, written by Theophrastus, one of Aristotle’s students. Theophrastus referred to a substance that resembled rotten wood and burned (right) without being harmed when doused with oil."

So, Ol' Theophrastus kept pouring oil onto this stuff, but it never burned; the stuff was never quenched by oil or fire. Hence, it was unquenchable. That's my story and I'm sticking to it (for now).

Andy did some follow-up of his own and provides the following:
Yes, that's one of the more likely explanations. In my research I came across the use of asbestos as permanent wicks in lamps, but never noted the bit about being unquenchable with oil.  That Theophrastos citation really belongs in the dictionary entry below, as it's the only cite that explains the meaning under A.  The lexicon below is massively comprehensive (if you couldn't tell) so it's odd they missed Theo.

The other possible explanation is II. or "unslaked lime", as quick lime burns underwater.  This was a key component in later "Greek fire", but so far I haven't been able to find any ancient source that cites an unquenchable substance (Greek Fire dates to 500 AD, white phosphorus, which also burns underwater, dates to 1600 AD, and sodium, which explodes on contact with water, dates to 1800 AD).

If I had the time and language skill I used to have, I would search my CD of all Greek text up to 600 AD for cites of asbestos and then comb thru them, but that would be a day's worth of work. I'm pleased that we got close to the meaning in online research, and I'm not sure that looking up every instance of asbestos would change anything.

Andy also provided the following reference:


A. unquenchable, inextinguishable, φλόξ Il. l.c.; not quenched, πῦρ ἄ. D.H.3.67, Plu.Num.9; κλέος Od.4.584; γέλως Il.1.599; βοή 11.50; ἐργμάτων ἀκτὶς καλῶν ἄ. αἰεί Pi.I.4(3).42; ἄ. πόρος ὠκεανοῦ ocean's ceaseless flow, A.Pr.532 (lyr.); πῦρ, of hell, Ev.Marc.9.43.
II. as Subst., ἄσβεστος (sc. τίτανος), ἡ, unslaked lime, Dsc.5.115, Plu.Sert.17, Eum.16; ἄ. κονία Lyc. ap. Orib.8.25.16.
2. a mineral or gem, Plin.HN37.146. ἀσβεστώδης: tofus, Gloss.


Henry George Liddell and Robert Scott, A Greek-English Lexicon, revised and augmented throughout by Sir Henry Stuart Jones, with the assistance of Roderick McKenzie. Oxford: Clarendon Press, 1940.

Wednesday, December 29, 2010

etymologists, unite!

A buddy wrote me an interesting question (to which I did not have an answer):

It's been driving me crazy: is there a term of art for when the etymological root of a word is the opposite of the word's modern meaning? For example, asbestos means "an unquenchable fire"; philander means "a lover of men"; etc. Cheers, A.

Anyone know this?

dialects map

Extremely detailed North American English Dialects, Based on Pronunciation Patterns. The site could use a bit of a web re-design ... looks circa 1999. Anyone care to offer free web design help to clean up this otherwise useful resource a little?

Tuesday, December 28, 2010

not any or not one??

The NYT's recent grammar blog post The Number of None brings up an interesting question: is none semantically closer to not any or not one? And what should its morphosyntactic agreement be, singular or plural?

The Times takes the not any, plural position, but I am inclined to disagree based on my intuition about substitution. Below are the two sentences the Times uses to illustrate:
  • None of the interim employers or temporary agencies have contributed to a 401(k)
  • None of the works have gained a foothold in the seasonal repertory.
Now, here are the same sentences with the substitutions and my personal acceptability ratings (where * means mildly unacceptable/not sure and ** means completely unacceptable).
  • Not one of the interim employers or temporary agencies has contributed to a 401(k)
  • Not one of the works has gained a foothold in the seasonal repertory.
  • **Not any of the interim employers or temporary agencies have contributed to a 401(k)
  • **Not any of the works have gained a foothold in the seasonal repertory.
  • Not one of the interim employers or temporary agencies have contributed to a 401(k)
  • Not one of the works have gained a foothold in the seasonal repertory.
  • **Not any of the interim employers or temporary agencies has contributed to a 401(k)
  • **Not any of the works has gained a foothold in the seasonal repertory.
The above ratings suggest that I make no distinction in acceptability between none has and none have. But wait, there's more. Let's remove the lengthy PP and see how this pans out:
  • *Not one of them has contributed to a 401(k)
  • *Not one of them has gained a foothold in the seasonal repertory.
  • **Not any of them have contributed to a 401(k)
  • **Not any of them have gained a foothold in the seasonal repertory.
  • Not one of them have contributed to a 401(k)
  • Not one of them have gained a foothold in the seasonal repertory.
  • **Not any of them has contributed to a 401(k)
  • **Not any of them has gained a foothold in the seasonal repertory.
I seem to slightly prefer the singular reading when the word none is close to the verb but with a plural noun heading the PP. But this is not true if we delete the PP altogether:
  • Not one has contributed to a 401(k)
  • Not one has gained a foothold in the seasonal repertory.
  • *Not one have contributed to a 401(k)
  • *Not one have gained a foothold in the seasonal repertory.
It would appear I have an incoherent grammar (surely this is true, as I believe all grammars are, in some way, incoherent. As Sapir said, all grammars leak). But there's at least one other factor muddying the linguistic waters: the fact that one also acts as a pronoun, as in one does one's duty. When acting as a pronoun, it takes 3rd person singular agreement, as in one has to do one's duty (think he has to do his duty), not *one have to do one's duty. It may be that this pronoun agreement is interfering with my reading when one occurs right next to the verb. Also, I did this pretty fast, so I wouldn't be surprised if I change my mind by COB...

Of course, how could I resist:


I believe I got the full paradigm:
  • not one of them has
  • not one of them have
  • not any of them has
  • not any of them have
  • none of them has
  • none of them have
  • none has
  • none have
It appears as though none have had a hell of a start to the 19th century, but got killed off along with the buffalo.

refudiate, the word that won't die

Thanks in no small measure to the Oxford University Press naming refudiate its Word Of The Year plus The Daily Dish rekindling its favorite topic, we have a new round of he-said-she-said to deal with. Made famous by Sarah Palin this past summer (see Liberman's original post here, and others here), it is yet again the object of speculation as to why Palin used the form to begin with.

Palin herself poured fuel on this fire two days ago by tweeting that it was a typo. Liberman thinks that explanation didn't hold water the first time around because she first said it aloud on teevee: the original example [on teevee] wasn't a slip of the tongue, but a symptom of the fact that Ms. Palin had a blend of repudiate and refute as a well-established entry in her mental lexicon [note added].

Why the fuss? There's nothing particularly interesting or telling about the linguistic blending of repudiate and refute. Everyone does this kind of thing now and again and sometimes it sticks. Some people like to beat up on public figures any time they can, so something like this is a target. But the more serious speculation is that the Palin Camp's public responses expose something important about Sarah Palin's inner circle and consultation. I'll leave it to the political pundits to fight that one out.

For now,

the linguistics of brand names

The Neurocritic reviews evidence for the whopping increase in drug brand names beginning with the letters z and x starting in 1986 and quotes the conclusion of the study's authors:


Reflecting their infrequent occurrence in English words, x and z count for 8 and 10 points in Scrabble, the highest values (along with j and q) in the game. So names that contain them are likely to seem special and be memorable. “If you meet them in running text, they stand out,” is the way one industry insider explained. Generally, they are also easy to pronounce.

The last point about being easy to pronounce is basically nonsense, so forgive them that, but their basic point that infrequent sounds are more memorable is basically a restatement of Zipf's Law and may have some truth to it.

I can tell you this, there are entire companies that charge high fees to help manufacturers develop brand names (see here for a discussion of what brand name developers do). I worked at one of them ever so briefly and I found there to be a mix of legitimate linguistics and voodoo linguistics mixed together in the "research" they prepared for their customers. I also found a resistance to serious linguistics for two reasons: 1) the customers didn't like science (I'm not joking; this was a serious obstacle) and 2) serious linguistics took too long and didn't come to firm conclusions. Typically, we were asked to initiate, perform, and complete linguistic research on brand names in a matter of weeks.

Ultimately, though, it was my conclusion that a product's name simply was not that crucial to its success, which hinged on the manufacturer's overall marketing strategy more than on the name. Think about Google vs. Microsoft. So, the rise in z and x drug names is a fad based more in the board room than in the marketplace.

non-linguistic CAPTCHA

David Bradley, writing at sciencetech, reports on a new face-based CAPTCHA process, quoting the team that created it, "Unlike a text-based CAPTCHA, a major benefit of the proposed image-based face detection CAPTCHA is that it does not have any language barriers..."

I guess it never really occurred to me that there would be language barriers in CAPTCHAs because so many of the strings are in fact nonsense words, but I guess language-specific phonotactics are helpful (often the identity of a single letter is quite ambiguous).

Monday, December 27, 2010

history of writing tech

American Scientist has a review of a new book on the history of writing technologies with a focus on how computers fit in. A BETTER PENCIL: Readers, Writers, and the Digital Revolution by Dennis Baron.

Money quote:

Will this shift in the technology of writing and reading be a positive development in human culture? Will it promote literacy, or impair it? Baron takes a moderate position on these questions. On the one hand, he acknowledges that the computer offers remarkable opportunities for self-expression and communication (at least for those of us in the wealthier parts of the world). Suddenly, we can all be published authors, and we all have access to the writings—or if nothing else the Twitterings—of millions of other authors. On the other hand, much of what these new channels of communication bring us is mere noise and distraction, and we may lose touch with more serious kinds of reading and writing. (Another recent book—The Shallows, by Nicholas Carr—argues this point strenuously.) Baron remarks: “That position incorrectly assumes that when we’re not online we throw ourselves into high-culture mode, reading Tolstoi spelled with an i and writing sestinas and villanelles instead of shopping lists.”

another lingo toy...

I love free online lingo toys like BYU's Corpora and Google's Ngram Viewer, and now there's a new one: The Human Speechome Project from MIT "provides a look into the most complete record of a single child's speech development ever created. The data has been organized to show the age of the child when he spoke each of his first 400 words." It's profiled in Forbes here. And they provide a nifty interactive graph to sort the data:

Sunday, December 26, 2010

true grit

I posted recently about the phrase "bust a cap" occurring in the original 1969 John Wayne movie True Grit. I got a chance to see the new Coen Bros version and my reactions are worth airing...or not, you decide...

First, it turns out the phrase true grit has a storied history in English letters:


But this review is destined to be of the non-linguistic kind...

I also had the chance to re-watch the original John Wayne version just a couple days before watching the new one. While this may be a bit unfair, since it means the recent version is asked to live up to the original in some ways, it is nonetheless instructive (insofar as it does NOT). I hereby forgive the Coen Bros for not watching the original again in preparation for their version. Surely that would have scuttled their project.

Let me make it clear that the individual performances in the Coen Bros movie alone make it worth watching. Each actor is given great opportunity to breathe life into their character, and I respect the Coen Bros for allowing that. They are truly dedicated to the fine craft of acting and I enjoyed watching their version of True Grit. Frankly, I could watch Jeff Bridges eat oatmeal and be amazed at how weirdly and wonderfully he did it.

Nonetheless, my primary complaint is devastating: the new Coen Bros version lacks the basic narrative structure and emotional depth that made the original so fundamentally enjoyable and satisfying. For the record, I have never read the novel, so I have no clue what it says and the Coen Bros based their new version entirely on that. However, I can say that one of the most deeply satisfying elements of the John Wayne movie is the development of the relationships that evolve between the child Mattie Ross, the drunken but courageous Rooster Cogburn, and the goofy, but basically decent La Boeuf. Throughout the original movie, those three characters find a way to forge a sort of dysfunctional, yet basically good and meaningful family unit between them. This family unit is completely absent from the new version. And I missed it.

One of the most touching and important moments of the original movie involves Rooster finally opening up to Mattie about his past and his wife and son while the two sit and wait for Ned Pepper's gang to arrive. This scene reveals Rooster's humanity and deeply emotional character. It is this scene that helps forge a familial bond, almost like an uncle/niece relationship, between Rooster and Mattie. And this deep relationship is played out for the rest of the movie. Developing this scene during a crucial moment of patience and waiting is pure narrative brilliance. Yet, the Coen Bros took this and turned it into camp and parody. The lines about his wife and son are basically thrown away in a drunken mumbling as his horse barely manages to contain his heavy frame while they trod along meaninglessly. What should be a deeply emotional connection forged in a tense moment of expectation becomes slapstick and meaningless. Why throw this away?

I would need a copy of the new film to point out all of the moments lacking narrative continuity, but here are a few to suffice:

Late in both movies, Mattie stumbles upon her nemesis Tom Chaney while gathering water from a river. In the original film, the proximity of Ned Pepper's gang is made clear and ominous. The likelihood that she would find trouble while going for water is made plain. But in the new version, it plays out like some wildly random coincidence. The ending of both movies requires these events to take place, but the original movie at least gives us some reasons behind the events, not just chaos and random nothingness.

Ned Pepper is a critical character in the story. In the original movie, the truly great actor Robert Duvall is given the chance to give the man some decency and honor. He is a killer, yes, but he also saves Mattie's life, despite claiming to be willing to end it. In fact, it is Ned Pepper, more than anyone else (in the original), who keeps Mattie alive (until the snake-hole scene at least). Robert Duvall was given the opportunity to create a Ned Pepper who is full and complex. In the Coen Bros version, the actor Barry Pepper (seriously, no joke, that's his name, weird right?) is merely a grubby and dirty (really seriously dirty, nasty dirty, disgustingly dirty...) killer. The pathos of Ned Pepper is gone.

By far, the most iconic moment of the original movie is the scene where Rooster takes the reins of his horse in his mouth and single-handedly draws down against four armed opponents. This is one of the greatest moments of American Western lore, involving the single greatest actor of American Western mythology. It is truly a moment of cinematic greatness. Leading up to this, Rooster describes a previous moment in his storied life much like this one (earlier in both films), and it forms a crucial part of his legend and character. When the ultimate moment arrives in the original version, it is a moment of destiny, built up by the dialogue and scenes that have come before it. But in the Coen Bros version, the whole raison d'etre has been obscured by mumbling and misdirection. It's almost as if this were every bit as random as everything else that came before it. You may well argue that randomness and chaos are in fact the Coen Bros' raison d'etre, and I can't argue against that. Fair enough. But then, why bother making a movie about a story for which destiny and courage are so crucial a factor? Without the great inevitable showdown of Rooster's grit against the despots' manpower, well, why make this movie at all? If you believe in pure chaos, fine, make No Country For Old Men over and over, got it. That makes sense. That's coherent. But why take this novel and make a movie? If your primary goal as movie makers is to take previous material well loved by the public and trash it for your own philosophical gain, that's just pure douchebaggery, so screw you Joel and Ethan.

Saturday, December 25, 2010

Choose Your Own Career in Linguistics

Trey Jones* at Speculative Grammarian invites y'all to play his cute, and yet somewhat depressing, game: Choose Your Own Career in Linguistics.

As a service to our young and impressionable readers who are considering pursuing a career in linguistics, Speculative Grammarian is pleased to provide the following Gedankenexperiment to help you understand the possibilities and consequences of doing so. For our old and bitter readers who are too far along in their careers to have any real hope of changing the eventual outcome, we provide the following as a cruel reminder of what might have been.


Let the adventure begin...

*hehe, he used to work at Cycorp, hehe...

Thursday, December 23, 2010

bustin' a cap

Watching the original True Grit on teevee and what do I hear? Ned Pepper (Robert Duvall) says something to the effect of "I ain't never busted a cap in no girl before." I thought only contemporary gangsta movies and rap lyrics used that phrase (and yes, I did find some examples of bust(-ed) a cap using the Ngram Viewer).

Wednesday, December 22, 2010

my bad, global edition

Manute Bol is often credited with coining the phrase my bad (see here and here, or here for alternate hypotheses). It has apparently made the jump, in some way, to international usage; it's just not clear to me how.

While watching The Girl Who played with Fire again last night, I noticed Lisbeth says something that is translated as my bad, but what she actually says is in Swedish, of course.

(screen shot from Netflix)

To my non-Swedish-speaking ears, it sounds like she says mitt viel, which would mean something closer to my very, if Google Translate is any help. Google translates my bad into Swedish as mitt dåliga (dåliga appears to be a literal translation of bad). I'm pretty sure that's not what she said, but I'd have to re-listen to be sure.

So, the linguistic questions are these:
  1. What does she say in Swedish?
  2. What is the history of the Swedish phrase?
  3. Is my bad the best English translation (given its history in slang and in pop culture)?

Tuesday, December 21, 2010

i know your email address...so what?

Cory Doctorow over at Boing Boing makes the bold claim that there's no compelling evidence that obscuring your email address online using techniques like john DOT smith at host DOT com actually reduces the amount of spam you receive. As long as his spam filters are catching the spam effectively, he doesn't mind sharing his email address with the world.
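For what it's worth, that style of obfuscation is trivial for a harvester to undo, which rather supports Doctorow's point. Here's a minimal sketch (the patterns and function name are mine, not anything Doctorow describes):

```python
import re

def deobfuscate(text):
    """Undo the common 'john DOT smith at host DOT com' obfuscation."""
    # Replace a whitespace-delimited 'at' with '@'.
    text = re.sub(r"\s+at\s+", "@", text, flags=re.IGNORECASE)
    # Replace a standalone 'dot' (and surrounding spaces) with '.'.
    text = re.sub(r"\s*\bdot\b\s*", ".", text, flags=re.IGNORECASE)
    return text

print(deobfuscate("john DOT smith at host DOT com"))
```

Two regexes are all it takes, so the technique mostly inconveniences humans, not spambots.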

Are you willing to follow his lead?

half a million language deaths?

Lera Boroditsky's recent concluding statement in The Economist's debate about how language shapes thought states: "At the moment we have good linguistic descriptions of only about 10% of the world's existing languages (and we know even less about the half a million or so languages that have existed in the past)" (emphasis added).

In my previous post on language death here, I used the number 100,000 to estimate how many languages have previously existed and related it favorably to David Crystal's reasonable guesstimate of 64,000 to 140,000. I'm just curious to know where Boroditsky got the half-million number. I've managed to come up with a few references to this 500,000 figure, but they call it a "radical estimate" (e.g., see here).

My hunch is that this is yet another example of Boroditsky's profound-problem. She has a tendency to call modest results profound when they are not. She is, I suspect, a tad prone to hyperbole.

Monday, December 20, 2010

language and thought votes

On the eve of the conclusion to Mark Liberman and Lera Boroditsky's debate at The Economist, there are two vote totals that are interesting to compare.

The obvious one is the lopsided result so far on the main question: Do you agree with the motion? Here, Boroditsky has a 77%-23% advantage. However, if you mouse over each day's vote, it tells you how many yes's have switched to no and vice versa. The totals there are nearly the exact opposite: by a 5-1 margin, yes's have switched to no. You are free to interpret this as you wish.

Unfortunately I don't see any raw totals for the number of people voting, so it's anyone's guess what proportion of votes the 6 changes represent (likely, a very small percentage).

digg's c**ktail

[UPDATE below]

I couldn't help but notice a story on Digg: Images of alcoholic drinks under the microscope from vodka c**ktails to pina colada. I checked the original Daily Mail story and saw that the word cocktail was not censored. I looked for other instances of cocktail on Digg's site and found that all instances look censored, except when the string c-o-c-k occurs in a user name, as the image below demonstrates:


This appears to be a candidate for unnecessary censorship. I sent an email to Digg asking them if this is intentional censorship or an inside joke within the site. I'll report any response (don't hold your breath).
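This looks like the classic naive substring filter at work (the so-called Scunthorpe problem). Here's a minimal sketch of how such a filter produces exactly the c**ktail output, and how matching whole words only avoids the false positive (the blocklist and function names are hypothetical, not Digg's actual code):

```python
import re

# Hypothetical one-entry blocklist for illustration.
BLOCKLIST = ["cock"]

def star_out(match):
    # Keep first and last letters, star out the middle: cock -> c**k.
    word = match.group(0)
    return word[0] + "*" * (len(word) - 2) + word[-1]

def naive_censor(text):
    # Substring matching: catches the blocklisted string inside longer words.
    for word in BLOCKLIST:
        text = re.sub(re.escape(word), star_out, text, flags=re.IGNORECASE)
    return text

def word_boundary_censor(text):
    # Whole-word matching: leaves "cocktail" alone.
    for word in BLOCKLIST:
        text = re.sub(rf"\b{re.escape(word)}\b", star_out, text, flags=re.IGNORECASE)
    return text

print(naive_censor("vodka cocktails"))          # the substring gets starred out
print(word_boundary_censor("vodka cocktails"))  # left intact
```

The substring version reproduces Digg's "c**ktails"; a word-boundary anchor is the standard one-line fix, though it then misses deliberate misspellings, which is why these filters are a perennial trade-off.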

[UPDATE: 3:01 Eastern]

Digg support did in fact reply, noting that it was a function of a profanity filter that can be turned off:

Hello ,
You see that because you have the profanity filter enabled.  To disable it just log in and go to:
http://digg.com/settings/preferences
--Digg Support

Sunday, December 19, 2010

the linguistics of the simpsons

The magnificent and admirable snowclone X is the Y of Z made a surprise and instructive appearance on The Simpsons tonight*:

Marge -- Don't worry Lisa, you could still go to McGill, it's the Harvard of Canada.
Lisa --Anything that is the something of the something isn't the anything of anything...

Too true, Lisa, too true. It's never good to be the shadow of something else.

*This appears to have been a repeat of the 10-10-2010 episode MoneyBART (a nice allusion to Moneyball, btw).

ngram roundup

It's not difficult to find glee and excitement surrounding Google's new Ngram Viewer. Hyperbolic praise is whirling around the innerwebz like mad. As an antidote, and a nod to the role skepticism should play in our contemporary society, I present a brief roundup of criticisms:

Geoffrey Nunberg:
...there are still a fair number of misdated works, and there's no way to restrict a query by genre or topic. But in the end, the most important consequence of the Science paper, and of allowing public access to the data, is that it puts "culturomics" into conversational play.

Mark Davies:
Google Books can't use wildcards to search for parts of words. For example, try searching for freak* out (all forms: freak_, freaked, freaking, etc) or even a simple search like teenager* ... if Google Books doesn't know about part of speech tags or variant forms of a word, then how can it look at change in grammar? ... To use collocates with Google Books, you would have to manually download thousands or millions of hits to your hard drive, and then use another program to look for and categorize the collocates.

Mark Liberman:
The Science paper says that "Culturomics is the application of high-throughput data collection and analysis to the study of human culture".  But as long as the historical text corpus itself remains behind a veil at Google Books, then "culturomics" will be restricted to a very small corner of that definition, unless and until the scholarly community can reproduce an open version of the underlying collection of historical texts.

David Crystal:
...this is just a collection of books - no newspapers, magazines, advertisements, or other orthographic places where culture resides. No websites, blogs, social networking sites. No spoken language, of course, so over 90 percent of the daily linguistic usage of the world isn't here...The approach, in other words, shows trends but can't interpret or explain them. It can't handle ambiguity or idiomaticity..

The Binder Blog:
The value of the Ngrams Viewer rests on a bold conceit: that the number of times a word is used at certain periods of time has some kind of relationship to the culture of the time. For example, the fact that the word “slavery” peaks around 1860 suggests that people in 1860 had a lot to say about slavery. Another spike around the 1970s meshes nicely with the Civil Rights Movement. Well, that’s sort of interesting. However, I didn’t need ngrams to tell me that a lot of people were writing about slavery in 1860. These data are broad but not deep, which makes them relatively useless to most humanities majors interested in intensive study.


The one positive comment that I think bears repeating is the role this fun little tool might play in sparking the imagination of young students interested in the role technology can play in the humanities.

Geoffrey Nunberg:
Whatever misgivings scholars may have about the larger enterprise, the data will be a lot of fun to play around with. And for some—especially students, I imagine—it will be a kind of gateway drug that leads to more-serious involvement in quantitative research.

Saturday, December 18, 2010

how NOT to interpret ngrams

Andrew Sullivan has predictably misunderstood the value of Google's Ngram Viewer. He spent all day yesterday posting trite and simplistic misinterpretations of the data.

I like the Ngram Viewer, but simply plotting the frequency of words against each other to determine something about culture or concepts is a very weak technique that leads to massive misinterpretations, as we've seen recently with things like counting the number of times President Obama uses pronouns in his speeches. I discussed the failings of simple word counts as a technique here. To sum up,
  • We don't know what causes word frequencies.
  • We don't know what the effects of word frequencies are.
  • There are good alternatives.
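The first two caveats are easy to make concrete. Here's a toy sketch (invented mini-corpora of my own, nothing like Google's actual pipeline) of all that an ngram viewer really computes: a raw count divided by total tokens per time slice, and nothing more. The numbers say nothing about why a word was used or what its use caused.

```python
from collections import Counter

# Hypothetical mini-corpora: one tiny "book" per year (invented text).
corpus_by_year = {
    1990: "war and peace and war".split(),
    2000: "peace talks and more peace talks".split(),
}

def relative_frequency(word, year):
    """Count of `word` divided by total tokens for that year.

    This is the entire computation behind a frequency plot --
    it cannot tell you what caused the change or what it means."""
    tokens = corpus_by_year[year]
    return Counter(tokens)[word] / len(tokens)

print(relative_frequency("peace", 1990))  # 0.2 (1 of 5 tokens)
print(relative_frequency("peace", 2000))  # 2 of 6 tokens, about 0.33
```

The frequency of "peace" rose, but the toy data alone can't tell us whether people loved peace more, feared war more, or just started naming products "Peace".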

Friday, December 17, 2010

magical machine translation

This is un-fucking-believable:



The future has arrived.

HT kottke

ngram or n-gram?

The hottest story of the day is clearly Google's Ngram Viewer. It's all over blogs, Twitter, and even the MSM. But why did Google call it the Ngram Viewer and not the N-gram Viewer?

The hyphenated form is more common in the NLP industry and in general search results (by a 10-1 margin at that). Nunberg's LL post and Languagehat's post both prefer n-gram when speaking about the tokens themselves and only use Ngram when referencing Google's named product. Even Google's own people used n-gram in a blog post here.
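For readers wondering what the tokens themselves are, whatever the spelling: an n-gram is just a contiguous run of n tokens. A minimal sketch of the standard definition (nothing Google-specific here):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

A 1-gram is a single word, a 2-gram (bigram) a word pair, and so on; the viewer plots how often a given n-gram occurs per year.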

You gotta wonder what kind of branding process Google went through to decide on ngram (they are notoriously conscious about that kind of thing). The popularity of this story also demonstrates how much more media savvy Google is: Microsoft has almost exactly the same tool, but no one knows about it. See here. The difference is that Microsoft didn't link its tool to studying culture and history or give us a nifty online toy to play with, so it comes off as duller than it perhaps deserves.

Also, note Microsoft uses N-gram ... frikkin Microsoft.

Thursday, December 16, 2010

google has a huge tool

NPR ran a story today called Google Book Tool Tracks Cultural Change With Words. It's about "the biggest collection of words ever assembled*": Google's 500-billion-word corpus drawn from the books they've scanned. But here's the catch: many of those books are copyrighted, so Google pulled a trick that goes back to the very beginnings of computational linguistics, presenting the words as an unordered set, or bag o' words:

Many of these books are covered by copyright, and publishers aren't letting people read them online. But the new database gets around that problem: It's just a collection of words and phrases, stripped of all context except the date in which they appeared.
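The bag o' words move is easy to sketch with the standard library: throw away word order entirely and keep only counts (toy sentence of my own, not Google's data):

```python
from collections import Counter

text = ("how dare those idiot engineers reduce language down "
        "to simple lists of words how dare they")

# A bag of words is just a multiset: token -> count, order discarded.
# That's exactly what lets Google publish counts without publishing books.
bag = Counter(text.split())
print(bag["how"], bag["dare"])  # 2 2
```

From a bag you can recover frequencies and dates but never the original sentences, which is the whole copyright point.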

I first learned about this technique back in 1999 in an intro to computational linguistics course (bit of trivia: we were using an incomplete pre-print of Martin and Jurafsky; as I recall, the discourse chapter was composed entirely of one page that read 21 Computational Discourse write something here...) and I remember being appalled at its crass simplicity. I mean, how dare those idiot engineers reduce language down to simple lists of words. How dare they try to use simple word lists to discover important facts about language and devise important linguistic tools.

It took less than a week for me to change my tune. The fact is, the bag o' words technique is remarkably powerful and useful. No, it doesn't solve all problems in one swoop, but it solves a hell of a lot more than I could possibly have predicted as a naive 2nd year linguistics grad student. For example:


Irregular verbs are used as a model of grammatical evolution. For each verb, researchers plotted the usage frequency of its irregular form in red ("thrived"), and the usage frequency of its regular past-tense form in blue ("throve/thriven"). Virtually all irregular verbs are found from time to time used in a regular form, but those used more often tend to be used in a regular way more rarely.

Google labs lets you play with its tool here (hehe).

*Not sure where this claim originated, but Google has already released a 1 trillion word corpus via LDC, the Web 1T 5-gram Version 1.
From Stephen Fry's twitter feed:

Just had an fMRI scan at UCL (part of BBC doc on language I'm making). Had to play Just A Minute while being scanned. Fun.

A BBC documentary about language? Ugh...I don't think even the talents of Stephen Fry can save that one.

Wednesday, December 15, 2010

harvard jumps the linguistic shark

Harvard Business Review editor Julia Kirby adds to the mountain of pseudo-scientific bullshit filling the innerwebz by taking the modest results of a small study (about the fact that mimicking accents helps sentence comprehension) and jumping to the wild and unfounded conclusion that salespeople should start faking accents. It would make a great Monty Python skit, but it's a sad blog post from an editor of a prestigious business magazine. Money quote:

But this study suggests another possibility. Perhaps part of why mirroring and matching works is not because of how it operates on the prospect in a sales conversation, but how it operates on the salesperson. When we switch into another person's mode, however superficially, perhaps our brains are triggered to do so on a deeper level, and we become more able to receive the information that person is trying to convey.

We all know the key to empathy is to walk a mile in another's shoes. That can never literally be done, especially in brief sales encounters. But at least we can put on their brogues.

sigh...

Tuesday, December 14, 2010

so you want to study linguistics?

Recently, a reader asked me for advice about studying linguistics. She is an undergraduate in the USA at a college that does not offer a BA in linguistics and she likes math and language, particularly historical linguistics. I've posted advice to students before here, but this new request was a particularly interesting variation. What do you do if you're a smart 20 year old at a school that does not quite offer what you want? What follows is an edited version of the email I sent back:

I must begin with a warning: academic linguistics is a small field; there is precious little room for mediocrity. There are two kinds of academic linguists: the top 15% and the unemployed.

With that said, if your school doesn't offer linguistics as a degree, then I suggest psychology (the experimental, lab-based kind) or computer science. Get hands-on experience in lab settings where you are collecting and analyzing data. Learn basic scientific method. Both psychology and computer science can offer that. Computational linguistics is a hot field with lots of opportunities in all sub-fields of linguistics. Plus, they can get jobs, hehe. High paying jobs! Computational linguists are one of the few who can get jobs outside of academia, but the truth is most industry CL jobs are really programming jobs where your programming skills are the real reason you get a job; your Natural Language Processing (NLP) skills are little more than icing on the cake. The industry is really looking for engineers with some NLP experience, not linguists with some programming skills.

There's nothing wrong with majoring in math (I definitely think all 21st Century linguists should study math), though I think knowing stats is preferable, and that's really a separate field. There is some controversy regarding whether linear algebra or calculus is better for linguistics (see here, especially the comments), but I really do think stats is key.

Studying biology or genetics is a possibility (neurolinguistics is a hot field). Liberman posted about genetics and linguistics here.

Probably the single best thing you can do for yourself right now is work your way through the NLTK book. It will teach you basic concepts and basic tools, and it's completely free! You could also start learning R, a great statistics-oriented language that many linguists are using these days.

You could also work your way through Tarski's World because basic logic is a sound foundation for all disciplines.

If you want a serious challenge, get your hands on the late Partha Niyogi's The Computational Nature of Language Learning and Evolution. He passed away recently, far too young; he was a rising star and a pioneer in using mathematical models to understand linguistics.

If you're interested in cognitive science and linguistics, I suggest regularly reading the Child's Play blog, written by two Stanford cognitive science grad students.

My general advice to any undergrad is simple: don't sweat your undergrad too much; it's the least important part of your education. Just get it done, regardless of which major you choose, and move on to the good stuff in grad school.

Monday, December 13, 2010

a debate!

From The Economist: This house believes that the language we speak shapes how we think. Discuss...
A new drink from the SPECULATIVE GRAMMARIAN:

The Psycholinguist

wine (any kind: color is not a dependent variable in this study)
several glasses
1 stopwatch

Pour the wine into a glass while whining about how no one has properly modeled the process of wine pouring. Observe the wine under controlled conditions for an hour. Present a wordy but content-less paper to an international conference on what wine might look like in infants. Rerun the analysis in a different glass in case the receptor affects the nature of the process. Wait another hour. Drink the wine. Drink more wine. Fall onto the floor drunk, bumping your head on a pipe on the way down. Write an even less coherent paper on the effects of head bumping on linguistic processing. Gain professorship.

pimp grammar

There's a pimp's handwritten business plan floating around the interwebz. While the soundness of its basic logic cannot be denied ("Treat This Pimpin Like it's a Business" indeed), the former writing teacher in me could not help but pull out the old red pen and make a few suggestions. But here's the thing: it's a fact of contemporary college education that most writing teachers are loath to outright criticize or correct their students (they're paying tuition after all). You see, outside of the Ivy League, most college writing teachers are faced with whole classrooms filled with pimps like Keep It Pimpin', and they're our bread and butter (we can't all be blessed with students like the Winkelvi, can we?). As a result, we are careful to word our feedback delicately, so as not to offend the senses of the ones who pad our, admittedly thin, paychecks*.


*Absolute truth: I taught college level research writing courses for the whopping total price of $1250/semester. The MOST I ever got paid for teaching a college level course was $2800. In the (modified) words of my literary hero DJay:

You know it's hard out here for a [rhetoric & writing instructor].
When he tryin to get this money for the rent.
For the Cadillacs and gas money spent
Because a whole lotta [students] talkin [nonsense].

HT kottke.

Thursday, December 9, 2010

a brief history of stanford linguistics dissertations


The above image comes from the Stanford Dissertation Browser and is centered on Linguistics. This tool performs some kind of textual analysis of Stanford dissertations: every dissertation is taken as a weighted mixture of a unigram language model associated with every Stanford department. This lets us infer that, say, dissertation X is 60% computer science, 20% physics, and so on... Essentially, the visualization shows word overlap between departments, measured by letting the dissertations in one department borrow words from another department.

Thus, the image above suggests that Linguistics borrows more words from Computer Science, Education, and Psychology than it does from other disciplines. What was most interesting was using the Back button to create a moving picture of dissertation language over the last 15 years. You'll see a lot of bouncing back and forth. Stats makes a couple of jumps here and there.
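The "weighted mixture of unigram language models" idea can be sketched crudely: score a dissertation's words under each department's unigram model and normalize. The two-department models and three-word "dissertation" below are entirely made up, and the browser's real estimation procedure is surely more sophisticated than this.

```python
import math

# Hypothetical unigram models: word -> probability, one per department.
models = {
    "CS":   {"algorithm": 0.5, "parser": 0.3, "vowel": 0.2},
    "Ling": {"algorithm": 0.1, "parser": 0.3, "vowel": 0.6},
}

def mixture_weights(tokens):
    """Likelihood of the token bag under each department model,
    normalized to sum to 1 (a posterior under a uniform prior)."""
    likelihood = {d: math.prod(m[t] for t in tokens)
                  for d, m in models.items()}
    total = sum(likelihood.values())
    return {d: v / total for d, v in likelihood.items()}

weights = mixture_weights(["parser", "vowel", "vowel"])
print(weights)  # Ling dominates: roughly 90% Ling, 10% CS
```

A vowel-heavy dissertation scores far higher under the Linguistics model, so its mixture weight tilts that way; that tilt, aggregated over whole departments, is what the browser visualizes as "borrowing".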

HT Razib Khan

Wednesday, December 8, 2010

the baffling linguistics of job postings

While Googling around for other things, I caught this odd fish contained within a job posting for an Account Manager:

DISCLAIMER: ... Linguistics used herein may use First Person Singular and First Person Plural grammatical person construction for and with the meaning of Third Person Singular and Third Person Plural references. We reserves the right to amend and change responsibilities to meet business and organizational needs as necessary (emphasis added).

If I understand this correctly, the boldfaced passage says that the authors are allowing themselves to use constructions like "we walks..." and "we talks..."

But, if you look at the uses of "we" within the text of the actual job posting, nowhere do they actually do this, EXCEPT in the disclaimer itself. I find this baffling. What is the purpose of this? Simply to allow them to write "We reserves..."? I Googled the sentence and found it popping up in all kinds of job postings, and the same thing holds: the only time a posting invokes its self-appointed right to this grammatical modification is within the disclaimer itself. It appears to be boiler-plate job-speak of some kind. I'm remarkably freaked out by this.

Thursday, December 2, 2010

how to spot an academic con artist

If you've been to college, you were taught how to scrutinize research sources at some point. Let's test your skills, shall we? Imagine you run across a popularized article and the author promotes his own expertise using the following:
  • "Ph.D" after his name.
  • Referencing his multiple books.
  • Noting his academic appointments.
  • You look at his personal web page list of publications and you see dozens of articles and books going back several decades.
Must be an expert, right? Must be legit, right? This is what I saw for John Medina, Ph.D., author of the HuffPo article 'Parentese': Can Speaking To Your Baby This Way Make Her Smarter? But I quickly became suspicious about this man's credentials. Why? Let's look more closely at those bullet points:
  • "Ph.D" after his name.
  • I recall a professor once saying something like "Once you've been to grad school, everyone you know has a Ph.D. It's just not that special." This may sound elitist, but the truth is, most people with Ph.Ds don't use the alphabet to promote themselves. They use their body of work. I'm almost always suspicious of people who promote themselves using their degrees. Plus, nowhere on his own site does he list a CV or even where he got his Ph.D. I had to find this at the UW web page, listing "PhD, Molecular Biology, Washington State University, 1988", an impressive degree, no doubt, but why hide it? It has become common practice for serious academics to provide their full CV on their web page. Medina fails to follow this practice.
  • Referencing his multiple books.
  • All of his books are aimed at non-academics. There's nothing wrong with trying to explain your expertise to a lay audience, but at some point you should also be trying to explain your expertise to other experts.
  • Noting his academic appointments.
  • Here, Medina does seem to have some impressive qualifications. He is an "Affiliate Professor, Bioengineering" at The University of Washington, as well as director of the Brain Center for Applied Learning Research at Seattle Pacific University (which, as far as I can tell, is a house and has exactly two members, the director and his assistant).
  • You look at his personal web page list of publications and you see dozens of articles and books going back several decades.
  • This is the most suspicious by far. Yes he lists dozens of publications, but almost all of them are short, 2-4 page articles IN THE SAME MAGAZINE, Psychiatric Times, a dubious looking magazine at best. The only others are in the equally dubious looking Geriatric Times. His publications page does list a REFEREED PAPERS section with some more legitimate academic articles, but he's second or third author on almost all.
Add to this the fact that his recommendations seem to be little more than common sense (i.e., talk to your kids more...no duh!). I have no problem with someone making money off their education, but this seems to be an example of trying to con people into believing he has more to say than he really does simply because of the letters P-h-D after his name.

Despite writing this, I don't feel terribly comfortable casting aspersions on someone who may indeed be a serious, legitimate academic. If I have made mistakes in this critique, I will apologize. But then again, it is incumbent upon Medina to do a better job of representing his credentials. And it is incumbent upon us as lay readers (hey, I ain't no molecular biologist either) to scrutinize supposed experts who are asking us to pay for their expertise (in the form of book prices and speaking fees).

Wednesday, December 1, 2010

94,000 language deaths!

History's only Emmy-nominated linguist* K. David Harrison answers questions over at The Johnson blog about language death, his favorite topic. He repeats what he's been saying for the last few years about language death, and he generally makes good points; however, he says two things worth responding to:
  1. "The human knowledge base is eroding as we lose languages"
  2. "...bilingualism strengthens the brain"
The first one is a vague and complicated claim often promoted by language-deathers** and the second is a goofy metaphor (at best). Let's walk through the reasons why these statements should not be a part of a serious discussion of language death:

The human knowledge base is eroding as we lose languages
My primary critique of this claim is that it's just not clear what it really means. In what way does a language uniquely encode information? Harrison provides a few simple examples, mostly lexical items that show us how a particular language foregrounded particular features, and the argument is that this tells us something about that culture's perceptions of what was important to them. This is probably true to some extent, but honestly, we still do not understand language well enough to truly understand what lexical features tell us about a culture. This is hyperbole at best. But this is NOT an argument against language death per se, it's just a fact.

So what if we lose some facts about a culture's perceptions of the world? Let's assume there are 6000 languages alive today. How many have already died? We don't know. For a rough estimate, let's draw an analogy and ask: how many humans have ever lived? A few years ago, the Population Reference Bureau did a "semi-scientific" guesstimate of this question and determined that less than 6% of all people who had ever lived were still alive in 2002. If we assume that languages come and go at a pace that correlates with populations, then we can assume that the current 6000 living languages are about 6% of the total number of languages that ever existed. That means the total number of languages that have ever existed is around 100,000***. This means we've already lost 94,000 languages that were never documented. 94,000 language deaths. 94,000 lost knowledge bases. Oh, the horror, the horror!
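The back-of-the-envelope arithmetic, spelled out (remember, the 6% figure comes from the PRB analogy about human populations, not from any data about languages):

```python
living = 6000            # rough count of languages alive today
fraction_alive = 0.06    # by analogy with the PRB population guesstimate

ever = round(living / fraction_alive)  # total languages that ever existed
dead = ever - living                   # undocumented language deaths
print(ever, dead)  # 100000 94000
```

The whole estimate stands or falls on that borrowed 6%, which is exactly why footnote *** below checks it against Crystal's independent range of 64,000 to 140,000.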

Exactly how bad off should we currently be if Harrison is correct about the ill effects of language death now that we know we've lost 94,000 languages? Are we really that bad off? Clearly the answer is no, we're not that bad off. If losing 94,000 languages has not caused grave danger to humanity, why would losing another 3,000?

Yes, I agree that all languages have unique linguistic properties that are worth studying in themselves. But just because we find interesting data in every language does NOT mean we should stop language death per se. We need a broader understanding of the system of language interaction and language evolution, otherwise stopping language death may be as irresponsible as causing language death. Genetics blogger Razib Khan has made a compelling argument that "high linguistic diversity is not conducive to economic growth, social cooperation, and amity." This is just one speculative claim, but at least it's a voice on the other side of this issue.


bilingualism strengthens the brain
This is just goofy phrasing. He's referencing important neurolinguistic research, so why trivialize it by using such patently absurd language?



*I actually don't know this to be true, definitively.
**Ooooh, I'm being a little caustic there, hehe.
***This estimate is remarkably similar to the ones David Crystal discusses in his book Language Death. In that book, he says anywhere from 64,000 to 140,000 is a reasonable guesstimate. My 100,000 splits that damn near down the middle.
