Thursday, December 31, 2009

Ambiguous Hookers & Psycho Sheep Wrestlers

'Tis the season for lists, and this one caught my eye: 50 Funniest Headlines Of 2009 (HT Daily Dish). I expected more of them to be linguistically interesting, but few were. Instead, there are a lot of tasered grammas and schoolboy sex jokes. Nonetheless, there are a few whose humor lies in the linguistic structure of the headline. Personal fav: #9 Nutt faces sack. Here are the others by linguistic category:

Lexical Ambiguity
#4. Hooker Named Lay Person Of The Year
#7. Pittsburgh Police Want To See Junk In Your Trunk
#23. Facebook Forms Board To Lick Molesters
#38. Courtney Love Banned From Using Hole
#44. Hooker Named Indoor Athlete Of The Year.

Garden Path
#6. Trooper Fired After Hat Fib Wants Back In

Pseudo-Garden Path
#31. Sheep Wrestlers Feared Psycho

#30. Church Kids Raid Panty's For Foodbank Supplies (note: bonus misuse of apostrophe).

#21. Winter Storm Closes Schools Across P.E.I., N.S

Wednesday, December 30, 2009

That's why they call it money.

How much are NLP start-ups worth? About $100 million. That's about how much Nuance just paid for SpinVox, and that's about how much Microsoft paid for Powerset a year and a half ago. From TechCrunch:

SpinVox, a London-based technology startup that transcribes voicemails to text so that they can be more easily digitized, searched, and manipulated, has been acquired by speech recognition company Nuance for $102.5 million.

Loyal perusers of The Linguist List's job board should be familiar with all of those companies. But don't let that price tag fool you: SpinVox also had $200 million in investment, so somebody's still waiting to get paid. (Disclaimer: yes, I understand that valuation is complicated and this coincidence in price tags means nothing, just funnin'.)

Tuesday, December 29, 2009

Proud Brother

My sister Lori, a long time pre-school teacher throughout Northern California who now owns her own preschool in Orland CA (and who has big plans to be a huge success as a children's book author someday) has started her own blog. And it's about time. She has the soul of a blogger.

Behold! The Tweet King!

Has Twitter made us all better computational linguists? Their 140 character limit forces us all to think in terms of characters (including whitespaces) rather than the slippery notion of words. Betcha tweetheads understand the concept of offset better than the average 1st year linguist.
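For anyone who hasn't met offsets yet, here is a minimal Python sketch of the idea. The example string and variable names are my own illustrations, and it assumes the simplest possible counting (one Python character = one character toward the limit), which may not match exactly how Twitter counts.

```python
# Hypothetical example text; any string would do.
tweet = "Has Twitter made us all better computational linguists?"

# Length in characters counts every space and punctuation mark.
char_count = len(tweet)

# Length in "words" depends entirely on how you choose to split.
word_count = len(tweet.split())

# The offset of a substring is the index of its first character.
start = tweet.find("Twitter")
end = start + len("Twitter")  # end offset is exclusive

print(char_count, word_count, (start, end), tweet[start:end])
```

The (start, end) pair is all you need to recover the substring with a slice, which is why offsets are so handy for annotation work.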

Manning on NLP

Freely available: The complete set of 18 lectures from Stanford Professor Christopher Manning's Natural Language Processing course. With a nifty web player that allows you to take notes on the video.

CS224N - Natural Language Processing.

Excellent description of topics so you can pick and choose your lecture. 

Theory of Meaning

Podcasts of most of the lectures from Professor of Philosophy John Campbell's Theory of Meaning course at Cal.

Philosophy 135 - Theory of Meaning

Unfortunately the site doesn't list what topics each podcast covers, so it's a bit of a gamble. Just open one and have a listen (some video available as well).

More Experiments!

I love online demos and live experiments because they give non-experts a user friendly, non-intimidating way to see some of the bread and butter tools of contemporary linguistics. Thanks to the excellent blog from the Human Language Processing (HLP) lab at the University of Rochester, I've discovered a few more to pass on (I have a list to your left under Call For Participation).
  • Alex Drummond's self-paced reading demo (this is a common experimental paradigm within psycholinguistics).
English: Pre-Test Questionnaire
English Experiment
English Experiment: Acceptability Judgments Only

Čestina: Experiment

Russian: Pre-Test Questionnaire
Russian Experiment

Monday, December 28, 2009

was was syntax

After re-reading my post below I had a moment of syntactic beguilement at my own use of the was that what X was that X construction:  Our group consensus was that what he really meant was that a linguistic topic was "interesting" if it helped him make his argument.

I imagined four relevant sentences; two grammatical/acceptable and two ungrammatical/unacceptable according to my own most excellent judgment*.  My challenge to you, dear reader, is to explain why sentences (1-2) are grammatical/acceptable and why sentences (3-4) are ungrammatical/unacceptable.
  1. our consensus was that what he really meant was X
  2. our consensus was he really meant (that) X
  3. *our consensus was what he really meant (that) X
  4. *our consensus was he really meant was X
 *having been schooled by a prominent typologist, I avoid using the term "ungrammatical" in any strict sense, grammaticality judgments being such slippery things.


Neuroskeptic has pledged to avoid the word interesting in his blog posts because it begets intellectual laziness:

Sadly it's easier to just call something interesting than to explain why it is. Partly this is because "interesting" (or "fascinating", "thought-provoking", "intriguing", "notable" etc.) is just one word, and it's easier to write one word than a sentence. More important is the fact that you probably don't know why you're interested by something until you do some thinking about it.

Reading this, I couldn't help but be reminded of a conversation between two of my academic advisers, quite early in my graduate linguistics studies, about Chomsky's use of the word "interesting" in that he tended to use it as an insult.  We had formed a reading group one summer to discuss The Minimalist Program and discovered that Chomsky would boldly proclaim that one topic was "interesting" while another was not, seemingly by fiat, with little or no explanation. Our group consensus was that what he really meant was that a linguistic topic was "interesting" if it helped him make his argument; it was "uninteresting" if it did not (we came to the same conclusion about his notion of "narrow syntax", btw; this wiki page lists a variety of other criticisms).

Saturday, December 26, 2009

Nine as Narcissism Porn

a rare non-linguistics post...

I just saw the much ballyhooed film Nine. Comparisons with Moulin Rouge and Chicago are unavoidable (especially since the director did Chicago also). It compares favorably with Chicago in style and substance. That was not a compliment. It is a movie dedicated to style over substance. As Marion Cotillard's character exclaims "style is the new content." This is unfortunately true (also a nice example of X is the new Y). And as the spurned wife of the lead, she should know. However, Moulin Rouge is far superior.

But first, let me be kind and detail the movie's considerable strengths:
  • It is GORGEOUS. Not only are all the actors beautiful, but the film techniques will make the average NYU film student scribble furiously in a notebook...or iPhone app, whatever. This movie is an editor's dream come true. The juxtaposition of scenes, the rapid camera lens switches, the luxurious theatrical song & dance numbers, and oh my, the colors!...well, they'll make your head swirl with awe at the magic that only film can convey. This is a technically brilliant film. Screw Avatar. These editors and cinematographers deserve Oscars. Don't wait, give it to them now.
  • I rather liked the cute dance scene homage to Goldie Hawn's 60s image by daughter Kate Hudson. Nice bit of meta-Hollywood there.
  • Two words: Sophia Loren.
Now, on to the critique:
  • This movie displays a strange nostalgia for 1960s Italian misogyny. Really? Why? The misogyny is masked by brilliant film techniques/tricks, but it's there. Throughout this entire film, women are little more than a prop to a jackass' journey. 
  • Nine = a man is defined by the women in his life...starting with his mother...c'mon, Freud is sooooo dead.
  • To quote Gertrude Stein, there is no there there. This movie is not deep. It only pretends to be. There is little story worth watching. A self-indulgent, arrogant narcissist who makes garishly bad movies gets to revel in his own image while the world fawns around him. The movie fails in its attempt to expose this narcissistic orgy (weak plot twists at the end with the wife and producer; Cotillard says in the end "I can see now it is hopeless") and fails to redeem the character himself (weak self-realization at the end). This is the worst kind of artistic self-indulgence where the art is supposed to be redemptive. No, it's not. In the same way a crack addict cannot be redeemed by smoking more crack, an arrogant narcissistic artist cannot be redeemed by making yet another movie. There is no real redemption in this movie. There is just voyeurism. It deserves its own 70 min Phantom Menace take down (as I suspect Avatar does too...haven't seen it, ain't gonna).
  • It celebrates lechery while superficially condemning it. Make no mistake, this film exalts this man's lechery. Make this film with ugly people, it's genius; with beautiful people, it's porn.  The beautiful actors and film tricks mask its utter depravity.
  • It has to be said:  Nicole Kidman is a botoxed cartoon. I cannot take her seriously as an actress.
PS: Okay, gotta throw in a little linguistics. Should my title be "narcissism porn" or "narcissistic porn"? The distinction requires me to decide what exactly I think the porn is about.

Thursday, December 24, 2009

Google Basterds

Since Liberman at LL just re-confirmed "the observation that Google counts no longer have even order-of-magnitude comparative validity in matters of usage (if they ever did)," I thought I'd pass along my own latest discovery: Google double quotes are not as restrictive in queries as they're claimed to be. From Google's support page:

Phrase search ("")
By putting double quotes around a set of words, you are telling Google to consider the exact words in that exact order without any change. Google already uses the order and the fact that the words are together as a very strong signal and will stray from it only for a good reason, so quotes are usually unnecessary. By insisting on phrase search you might be missing good results accidentally. For example, a search for [ "Alexander Bell" ] (with quotes) will miss the pages that refer to Alexander G. Bell.

But this is not what it seems...

Brain Farts

The title alone made this worth reading: Anatomy of A Brain Fart.

Money quote: The latest research seems to indicate that brain farts are a unique type of cognitive mistake. Unlike errors caused by lack of information or experience, or by distractions, brain farts are innate. They have a predictable neural pattern that emerges up to 30 seconds before they happen. When you are absorbed in inward-focused thinking such as daydreaming, a collection of brain regions jointly called the default mode network (DMN) starts furiously popping away. Neuroscientists don’t agree on exactly which parts of the brain compose this network, but they now believe it is one of the busiest neurological systems.

Eskimo Gibberish?

(image from vintage_ads)

Recently, kottke posted this vintage ad featuring two "Eskimos" with one saying, and I quote, "Kripik igloo sop frofu torky."  Commenter bluebear2 at vintage_ads notes that none of those words appear in the online Inuktitut Dictionary. I would be surprised and impressed if this was anything but gibberish, but I know next to nothing about the Eskimo-Aleut family of languages. I Googled the sentence (to use that word lightly) and found nothing, of course. Just thought I'd throw it out there. Any experts out there care to confirm the obvious?

Wednesday, December 23, 2009

Vision Affects Language Processing

Does watching a leaf fall help you process the sentence the leaf is falling down? Apparently, no, it hurts. It slows you down. Cognitive Daily reviews research supporting this conclusion. Money quote:

...people take longer to process sentences that match the movement of an animation than they do to process sentences that don't match it. Kaschak's team reasons that we must be using the same region of the brain to process the motion itself as we do to process the language describing that motion.

Tuesday, December 22, 2009

How Many Linguists Are There?

The Independent recently published an article about the language documentation efforts of Mark Turin and his colleagues at The World Oral Literature Project. In the article, Turin was lamenting the large number of undocumented languages (a fair lament) and was quoted as saying this:

There are more linguists in universities around the world than there are spoken languages – but most of them aren't working on this issue. To me it's amazing that in this day and age, we still have an entirely incomplete image of the world's linguistic diversity. People do PhDs on the apostrophe in French, yet we still don't know how many languages are spoken.

I found this passage remarkably agitating. I appreciate Turin's passion for language documentation and I support language documentation efforts, but there are two claims in this passage (one explicit, one implicit) that I object to:

Regex Dictionary

Nice one! A web-based dictionary you can search with regular expressions (HT MetaFilter). From the site's introduction page:

The Regex Dictionary is a searchable online dictionary, based on The American Heritage Dictionary of the English Language, 4th edition, that returns matches based on strings —defined here as a series of characters and metacharacters— rather than on whole words, while optionally grouping results by their part of speech. For example, a search for "cat" will return any words that include the string "cat", optionally grouped according to grammatical category:

    * Adjectives: catastrophic, delicate, eye-catching, etc.
    * Adverbs: marcato, staccato, etc.
    * Nouns: scat, category, vacation, etc.
    * Verbs: cater, complicate, etc.

In other words, the Regex Dictionary searches for words based on how they are spelled; it can find:

    * adjectives ending in ly (197; ex.: homely)
    * words ending in the suffix ship (89)
          o Adjectives (1, midship)
          o Nouns (80; ex.: membership)
          o Suffixes (1, -ship)
          o Verbs (6; ex.: worship)
    * words, not counting proper nouns, that have six consecutive consonants, including y (79; ex.: strychnine)
    * words, not counting proper nouns, that have six consecutive consonants, not counting y (2; ex.: latchstring)
    * words of 12 or more letters that consist entirely of alternate consonants and vowels (45; ex.: legitimatize)
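To give a feel for what such searches look like under the hood, here is a rough Python sketch of a few of them. It is not the site's actual implementation: the word list is a tiny hypothetical stand-in for the dictionary, the pattern names are mine, there is no part-of-speech information, and for the alternating-letters search I assume y counts as a consonant.

```python
import re

# Hypothetical mini word list (lowercase letters only); the real site searches
# the American Heritage Dictionary, which is not bundled here.
WORDS = ["membership", "worship", "shipment", "strychnine", "latchstring",
         "legitimatize", "category", "homely"]

CONS_Y = "b-df-hj-np-tv-z"      # consonants, counting y as a consonant
CONS_NO_Y = "b-df-hj-np-tv-xz"  # consonants, leaving y out

PATTERNS = {
    # the string "cat" anywhere in the word
    "contains_cat": r"cat",
    # words ending in "ship"
    "ends_in_ship": r"ship$",
    # six consecutive consonants, including y
    "six_consonants_with_y": rf"[{CONS_Y}]{{6}}",
    # six consecutive consonants, not counting y
    "six_consonants_no_y": rf"[{CONS_NO_Y}]{{6}}",
    # 12+ letters that strictly alternate consonant and vowel
    "alternating_12_plus":
        rf"^(?=.{{12,}}$)[{CONS_Y}]?(?:[aeiou][{CONS_Y}])*[aeiou]?$",
}

def search(name: str) -> list[str]:
    """Return the words in WORDS that match the named pattern."""
    pattern = re.compile(PATTERNS[name])
    return [w for w in WORDS if pattern.search(w)]

for name in PATTERNS:
    print(name, search(name))
```

Note how the only difference between the two six-consonant searches is whether y sits inside the character class, which is exactly why strychnine shows up in one list and not the other.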

a deeply frustrating pursuit

Neuroblogger Jonah Lehrer has a new article about the value of failure in science and how it can lead to discovery. A nice, if somewhat light, read:  Accept Defeat: The Neuroscience of Screwing Up. The basic point is that our brains have two somewhat competing processes, one for perceiving errors (the “Oh shit!” circuit) and one for deleting irrelevant stuff (the Delete key). If the delete key wins, important discoveries are ignored (something like that).

Money quote:

While the scientific process is typically seen as a lonely pursuit — researchers solve problems by themselves — Dunbar found that most new scientific ideas emerged from lab meetings, those weekly sessions in which people publicly present their data. Interestingly, the most important element of the lab meeting wasn’t the presentation — it was the debate that followed. Dunbar observed that the skeptical (and sometimes heated) questions asked during a group session frequently triggered breakthroughs, as the scientists were forced to reconsider data they’d previously ignored. The new theory was a product of spontaneous conversation, not solitude; a single bracing query was enough to turn scientists into temporary outsiders, able to look anew at their own work.

Sunday, December 20, 2009


Do some words grab your attention more than others because of their semantic content? If I want to get the attention of 12 screaming kids, would I be better off yelling "SEX!" or "EGGPLANT!"

This was the topic (kinda) of a study recently reviewed by the excellent Cognitive Daily blog: Huang, Y., Baddeley, A., & Young, A. (2008). Attentional capture by emotional stimuli is modulated by semantic processing. Journal of Experimental Psychology: Human Perception and Performance, 34 (2), 328-339 DOI: 10.1037/0096-1523.34.2.328.

The study used an interesting methodology: rapid serial visual presentation, or RSVP, which involves showing participants a random stream of stimuli, flashing by one every tenth of a second. Whiz bang! That's a lot of flashing. Let Cognitive Daily explain:

Typically if you're asked to spot two items in an RSVP presentation, you'll miss the second one if it occurs between about 2/10 and 4/10 of a second after the first one, but not sooner or later. This phenomenon is called Attentional Blink -- a blind spot caused by the temporary distraction of seeing the first item... Their streams were simply random strings of letters and digits, with two words embedded in each stream. Then they asked students to look for words naming fruit as they flashed by. If a fruit word appeared, it was always the second word in a stream. The key was in the first word: half the time, this first word was a neutral word like bus, vest, bowl, tool, elbow, or tower, and half the time it was an emotional word like rape, grief, torture, failure or morgue. So a sequence might look like this:

  • JW34KA
  • QPLX12
  • MC15KW
  • 083FLB
  • S21L0C
  • DJW09S
  • 3LW8Z9
  • XOWL01
And so on. The first word acts as a distractor: the students are looking for fruit words, but this is always a non-fruit word. The question is, are emotional words more distracting?

The result of the experiments was ...

Thursday, December 17, 2009

Analogy as the Core of Cognition

Here is a YouTube of a February 6, 2009 Stanford University Presidential Lecture by Douglas Hofstadter, one of the most interesting cognitive science/artificial intelligence thinkers of our lifetime:

In this Presidential Lecture, cognitive scientist Douglas Hofstadter examines the role and contributions of analogy in cognition, using a variety of analogies to illustrate his points.

Wednesday, December 16, 2009

The Snowclone Cometh

(image from On The Scene)

Just bought the box set of seasons 1-4 of the sublime comedy It's Always Sunny In Philadelphia. Catching up on episodes missed, I just watched the 2008 season 4 finale "The Nightman Cometh" (episode #13, 45). When I saw the title as the episode began, I was struck by this thought: it might be the case that Eugene O'Neill's 1939 play "The Iceman Cometh" is the single most mimicked play title in history. Can you think of a play title that has more homages than this one?

Then I wondered, is this a snowclone? A snowclone is a linguistic construction, like a cliché, with a somewhat rigid syntactic pattern and a somewhat recognizable meaning, but one that allows substitutions. A classic example is "X is the new Y" like "gray is the new black" or "knitting is the new yoga."

The Snowclone database lists two primary criteria for inclusion (these should be taken to be neither necessary nor sufficient; rather, they are a guide):
  • high number of Google hits
  • significant variation
So, I Googled the query "the *man cometh" and found about 3,390,000 hits. No small number that (oooh, that construction might also be a snowclone...).

The first page of Google hits alone shows 9 variations out of 12 hits. That's a lot of variation.
  • The Meatman Cometh
  • The Tax Man Cometh
  • The Monkey Man Cometh
  • The Dark Man Cometh
  • The Repo Man Cometh
  • The Yogurt Man Cometh
  • The H-Man Cometh
  • The ad man cometh
  • The Con Man Cometh
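If you wanted to pull out the substituted part of each variation automatically, a regular expression can stand in for the wildcard. This is only a sketch: the pattern is my own rough approximation of the query, not how Google interprets `*`, and the `slot` helper is a made-up name.

```python
import re

# Rough regex analogue of the query: "the", a variable slot, an optional
# space or hyphen, then "man cometh". Case is ignored.
SNOWCLONE = re.compile(r"the (?P<slot>.+?)[ -]?man cometh", re.IGNORECASE)

def slot(title: str):
    """Return the substituted material in a title, or None if it doesn't fit."""
    match = SNOWCLONE.fullmatch(title.strip())
    return match.group("slot") if match else None

titles = [
    "The Meatman Cometh", "The Tax Man Cometh", "The Monkey Man Cometh",
    "The Dark Man Cometh", "The Repo Man Cometh", "The Yogurt Man Cometh",
    "The H-Man Cometh", "The ad man cometh", "The Con Man Cometh",
]

for title in titles:
    print(title, "->", slot(title))
```

The rigid part of the construction lives in the literal text of the pattern, and the variation lives in the named group.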
Like many snowclones, I suspect that the users of this construction rarely know of its origin. I skimmed the first 10 pages of Google hits and found that almost NONE referenced the original play. Might this be history's most successful snowclone?

As a side note, the writers of It's Always Sunny In Philadelphia chose their homage wisely. Anyone who watches even a few episodes will note the clear synchronicity with this Wikipedia description of O'Neill's play: "It expresses the playwright's disillusionment with the American ideals of success and aspiration, and suggests that much of human behavior is driven by bitterness, envy and revenge."

Just FYI, if you haven't purchased your Hanukkah/Christmas/Kwanzaa/Festivus gift Kitten Mittons yet, I believe operators are standing by. Finally, there's an elegant, comfortable mitton, for kats! Meeeeoowww!

How good is your language sense?

The use of the web for language experiments is growing (see my Call for Participation links to the right) and the use of games to facilitate experiments makes the whole process fun for the subjects/users/participants (whatever the word du jour is for the people actually taking the experiment).

The site Games With Words, run by Joshua Hartshorne, a graduate student in Psychology at Harvard University, is a great example of this. They have two games running right now:

...and coming soon

Here's a new site that aims to crowdsource translation (HT Boing Boing): Translated by humans.

What's going on here?

It's called collaborative translation. To make it simple, people help each other translate interesting foreign language texts into their native language. It's mostly blog posts, magazine articles, short stories and another materials licensed for free redistribution.

Monday, December 14, 2009

Without The Hats

Ingrid at Language on the Move blog reports that the Student Council at Zayed University in the UAE is conducting a poll to see which languages students want to see offered, and Korean is winning (HT Research Blogging).

Rarely do language students get an actual say in institutional offerings and a current polling initiative by the Student Council at Zayed University is therefore the more exciting. This internal poll has been running for a couple of days and I can’t take my eyes of it: for a sociolinguist this is like Melbourne Cup Day without the hats!

Needless to say, I had to google Melbourne Cup Day.

Sunday, December 13, 2009

On Linguistic Fingerprinting

Can an author's writing style be defined by the frequency of unique words in their writings? According to physicist Sebastian Bernhardsson, the answer is yes. He found a couple of interesting facts: 1) the more we write, the more we repeat words and 2) the rate of repetition (or rate of change) seems to be unique to individual authors (creating a "linguistic fingerprint"... literally his words, not mine). Let me walk through his claims and findings, just a bit.

Bernhardsson et al. are in press with a corpus linguistics study which compared rates of unique words between short and long form writing (short stories vs. novels vs. corpora). I stumbled on to this research earlier this week when a BBC News title caught my eye: Rare words 'author's fingerprint': Analyses of classic authors' works provide a way to "linguistically fingerprint" them, researchers say.

The idea of linguistically fingerprinting authors has been around for a while. In some ways it acted as a loss leader decades ago, piquing interest in the use of corpora and statistical methods to study language, and now there is even a whole journal called Literary and Linguistic Computing. Plus, there is an established practice of forensic linguistics where linguistic methods are used to establish authorship of critical legal documents.

However, Bernhardsson makes a bold claim. He claims that the process of writing (a cognitively complex process) can be described as the process of pulling chunks out of a large meta-book which shows the same statistical regularities as an author's real work (he hedges on this a bit, of course). I always shiver when I run across a non-linguist jumping head first into linguistics making bold claims like this, but I also recognize that Bernhardsson and his co-authors are pretty smart folks so I gave them the benefit of the doubt and skimmed one of their two available papers (freely available here).
  • The meta book and size-dependent properties of written language. Authors: Sebastian Bernhardsson, Luis Enrique Correa da Rocha, Petter Minnhagen. New Journal of Physics (2009), accepted.
First, I concentrated on the first section because the paper goes into a different direction that was not necessary for me to cover (and had lots of scary algorithms; it is Sunday and I do want to watch football, hehe). What they did was count the number of words in a text, then count the number of unique words (this is a classic type/token distinction). Here's what they found:

When the length of a text is increased, the number of different words is also increased. However, the average usage of a specific word is not constant, but increases as well. That is, we tend to repeat the words more when writing a longer text. One might argue that this is because we have a limited vocabulary and when writing more words the probability to repeat an old word increases. But, at the same time, a contradictory argument could be that the scenery and plot, described for example in a novel, are often broader in a longer text, leading to a wider use of ones vocabulary. There is probably some truth in both statements but the empirical data seem to suggest that the dependence of N (types) on M (tokens) reflects a more general property of an authors language. (my emphasis and additions).

First, let's make sure we get what the authors did. We have to use words more than once, right? I've already repeated the word "we" in just the last two sentences. And we repeat words like "the" and "of" all the time. We have to. So there are types of words, like "the", but there is also the number of times those words get repeated (tokens). It's pretty straightforward to simply count the total number of words in a story, then count the total number of types of words, thus giving us a ratio. For example, let's say we have a short story by Author X with 1000 words in it (= tokens). Then we count how many times each word is repeated and we find that there are only 250 unique words (= types). This means there is a ratio of 1000/250, or 100/25 (for comparison's sake I'm using this ratio). This means that only 25% of the words are unique, which also means that, on average, a word is repeated 4 times in this story.

Now let's take a novel by Author X with 100,000 words (= tokens). After counting repetitions we find it has 11,000 unique words. Our token/type ratio = 100,000/11,000, or 100/11. This means that only 11% of the words are unique, which means, on average, a word gets repeated about 9 times. That's higher than in the short story. Words are being repeated more in the novel. Now let's imagine we take all of Author X's written work, put it together into a single corpus and repeat the process and discover that the ratio is 100/7 (on average, a word gets repeated about 14 times).
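The arithmetic for the imaginary Author X can be checked with a few lines of Python. This is just a sketch of my own made-up example, not anything from the paper; the function name is hypothetical.

```python
def type_token_stats(tokens: int, types: int) -> tuple[float, float]:
    """Return (share of words that are unique, average uses per unique word)."""
    return types / tokens, tokens / types

# The made-up counts for Author X from the paragraphs above.
examples = [("short story", 1_000, 250), ("novel", 100_000, 11_000)]

for label, tokens, types in examples:
    unique_share, avg_uses = type_token_stats(tokens, types)
    print(f"{label}: {unique_share:.0%} unique, about {avg_uses:.1f} uses per word")
```

The two numbers are simply reciprocals of each other, so a falling share of unique words and a rising repetition rate are the same observation seen from two sides.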

UPDATE: whoa, my maths was off a bit the first time I did this. That'll teach me to write a blog post while watching Indy crush Denver. Sorry, eh.

This is what the authors found: "The curve shows a decreasing rate of adding new words which means that N grows slower than linear (α less than 1)."

They discovered something potentially even more interesting: the rate of change between these ratios is unique to each author. Here is their graph from the article (H = Thomas Hardy, M = Herman Melville, and L = D.H. Lawrence):
FIG. 1: The number of different words, N, as a function of the total number of words, M, for the authors Hardy, Melville and Lawrence. The data represents a collection of books by each author. The inset shows the exponent α = ln N / ln M as a function of M for each author.

Their conclusions about the meta-book and linguistic fingerprint:

These findings lead us towards the meta book concept : The writing of a text can be described by a process where the author pulls a piece of text out of a large mother book (the meta book) and puts it down on paper. This meta book is an imaginary infinite book which gives a representation of the word frequency characteristics of everything that a certain author could ever think of writing. This has nothing to do with semantics and the actual meaning of what is written, but rather to the extent of the vocabulary, the level and type of education and the personal preferences of an author. The fact that people have such different backgrounds, together with the seemingly different behavior of the function N(M) for the different authors, opens up for the speculation that every person has its own and unique meta book, in which case it can be seen as a fingerprint of an author. (my emphasis)

They are quick to point out that this finding says nothing about the semantic content of the writings. So what does it say? I admit I was having a hard time seeing any conclusion about cognition or the writing process. Even while finding this methodology interesting, I'm just not at all sure what it really says about the human brain and language, if anything at all. The speculation that "every person has their own unique meta book" is bold. Unfortunately, it is also almost entirely untestable. Keep in mind that this research had zero psycholinguistic component. They were just counting words on pages. I'd caution against drawing any conclusion about the human language system based solely on this work. (I should note that I skipped one of the most interesting findings, that the section of work doesn't matter, simply the size. Meaning, they took random chunks from their corpora and found the same patterns, if I understood that part correctly.) Which raises the question: why is this being published in a physics journal? It's being published in The New Journal of Physics and a quick perusal of the articles from previous editions doesn't show anything remotely similar to this work (no surprise).

I'm a fan of corpus linguistics, but I'm also a fan of caution. I'm not convinced any conclusions about the psycholinguistics of the complex writing process can be drawn from this work. Not as yet. But interesting, nonetheless.

FYI: it's easy enough to fact check some of these results using freely available tools, namely KWIC Concordance. This tool will take any text and count the total tokens and number of repeats for us. I did this for Melville's Bartleby, the Scrivener and Moby Dick. I got text versions of each from Project Gutenberg, then ran the wordlist function within KWIC and here are my results:

Bartleby, the Scrivener
Total Tokens: 18111
Total Types: 3462
Type-Token Ratio: 0.191155

Moby Dick
Total Tokens: 221912
Total Types: 17354
Type-Token Ratio: 0.078202

Bartleby = 0.191155
Moby Dick = 0.078202

Yep, the short story Bartleby has proportionally more unique words than the longer Moby Dick. FYI, this is a weak test simply because the tokens are not stemmed, meaning morphological variants are treated as different words. I don't know if this is consistent with Bernhardsson's methodology or not.
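If you'd rather not install anything, the same check can be roughed out in Python. Big caveat: the tokenizer below is my own naive assumption (lowercase, runs of letters and apostrophes, no stemming), so it will not necessarily reproduce KWIC's counts, and I make no claim it matches Bernhardsson's method. The function names are hypothetical. The second half just plugs my KWIC counts from above into the type-token ratio and the exponent ln N / ln M from the figure caption.

```python
import math
import re

def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Naive, unstemmed count: (total tokens M, distinct types N)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(tokens), len(set(tokens))

def type_token_ratio(tokens: int, types: int) -> float:
    return types / tokens

def exponent(tokens: int, types: int) -> float:
    """ln N / ln M, the quantity plotted in the inset of their Fig. 1."""
    return math.log(types) / math.log(tokens)

# Toy sentence just to show the counting step.
print(count_tokens_and_types("the leaf is falling down and the leaf is gone"))

# Counts reported by KWIC above, as (tokens, types).
kwic_counts = {"Bartleby": (18111, 3462), "Moby Dick": (221912, 17354)}

for title, (m, n) in kwic_counts.items():
    print(f"{title}: TTR = {type_token_ratio(m, n):.6f}, "
          f"ln N / ln M = {exponent(m, n):.3f}")
```

On these two texts the exponent comes out smaller for the longer book, which is at least in the same direction as the sub-linear growth the paper describes, though two data points prove nothing.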

Saturday, December 12, 2009

Thursday, December 10, 2009

Monkey ThreatDown!

Following up on Mr. Verb's coverage of Zuberbühler's monkeys go boom boom in the banana patch story (see here), Stephen Colbert issued a threat down against the primates:

(video: The Colbert Report, Monkey ThreatDown - Holes & Banana Too High)

However, my own two-year-old war on Colbert continues. I shall not rest!

Wednesday, December 9, 2009

Boom Boom Syntax

Mr. Verb has a post up about yet another NYT article on animal language that does a poor job of reporting the facts:

I've been wondering about what syntax really is and how we would show it exists since reading this in the NYT this morning. It reports work by Klaus Zuberbühler and others arguing that Campbell's monkeys (cute critters, see pic) in Ivory Coast not only have some sound-meaning correspondences (boom boom mean 'come here once', krak means 'leopard', etc.), but that they have what they're calling inflectional morphology, a suffix -oo, which sounds like an auditory evidential — indicating you've heard but not seen something.

As Mr. Verb points out, the original scholarly article is not yet available so we are unable to fact check this one...yet.

Tuesday, December 8, 2009

Scooping Language Log

Looks like I've managed to scoop Language Log authors twice in the last couple weeks.
  • Ben Zimmer's Dec 4 NYT article on the alien language in Avatar here. My Nov 26 post on the same topic here.
  • Chris Potts' Dec 5 LL post on David Foster Wallace's grammar test here. My Dec 4 post on the same topic here.
...pats self on back.

ooops, forgot to carry the one

(image from MIT)

MIT has launched a well-funded re-think of Artificial Intelligence principles called The Mind Machine Project, which they're calling a "do-over."

"MMP group members span five generations of artificial-intelligence research, Gershenfeld says. Representing the first generation is Marvin Minsky, professor of media arts and sciences and computer science and engineering emeritus, who has been a leader in the field since its inception. Ford Professor of Engineering Patrick Winston of the Computer Science and Artificial Intelligence Laboratory is one of the second-generation researchers, and Gershenfeld himself represents the third generation. Ed Boyden, a Media Lab assistant professor and leader of the Synthetic Neurobiology Group, was a student of Gershenfeld and thus represents the fourth generation. And the fifth generation includes David Dalrymple, one of the youngest students ever at MIT, where he started graduate school at the age of 14, and Peter Schmidt-Nielsen, a home-schooled prodigy who, though he never took a computer science class, at 15 is taking a leading role in developing design tools for the new software."

Monday, December 7, 2009

The Noughties

The BBC is sponsoring a contest (with no prize, they had to stop doing that, hehe) to come up with the single word that best sums up the 2000s. Some of their suggestions:


My suggestion: meh

Sunday, December 6, 2009

Which One Does Shakira Speak?

I thought I was going to have another Full Liberman on my hands (haven't finished the last one yet) but thankfully the article provocatively titled Do You Know Your 'Love Language'? doesn't really have much at all to do with language, and nothing to do with linguistics. In this case, the word language is used as a metaphor for behaviors associated with personal relationships. It's common to use language in this way, but I'd be happier with, say, the semiotics of love, or something like that.

Saturday, December 5, 2009

Outsourcing Fact Checking

Paul Spinrad guest blogs at boingboing and floats the idea of outsourcing fact checking (I'll support any proposal whatsoever that improves the fact checking process, believe me) but he adds the notion of, in essence, crowd sourcing linguistic annotation:

Now, what if these fact-checkers didn't just vet and correct the text? While they dig into the logic and accuracy of everything, as usual, they could also use some simple application to diagram the sentences and disambiguate the semantics into a machine-friendly representation. Just a little extra clicking, and they could bind all the pronouns to their antecedents, and select from a dropdown box to specify whether an instance of the string "Prince" refers to the musician Prince or to Erik Prince-- the president of XE, the company formerly known as Blackwater-- within an article that for whatever reason mentions both of them.

I have zero interest in diagramming sentences, mind you (because it's a dated and frankly messy pseudo-logical method of representing the syntax of a sentence), but there is a good idea at the core. While it's true that the web has given us greater access to large corpora, these corpora remain unstructured text. I'd like to see larger parsed corpora available (like the BNC).

With minimal training, editors and fact checkers could be utilized to mark up text with simple phrase boundaries and labels (this is a NP, this is a VP) as well as PP attachment ambiguity and co-reference, etc. There would be messiness in this approach too, but Breck Baldwin has noted that this can be done effectively (for recall, at least) and the major issue is adjudicating the error rate of a set of crowd-sourced raters (see my previous post here and Baldwin's original post here). A little sampling could adjudicate nicely.
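To make the sampling idea concrete, here is a minimal sketch of one way it could work: pull a random sample of crowd-sourced labels, have a trusted annotator adjudicate just those, and estimate the overall error rate with a confidence interval. Everything here is my own assumption for illustration (the function names, the `adjudicate` callback standing in for the expert, the toy labels, and the choice of a Wilson score interval); it is not Baldwin's method.

```python
import math
import random

def wilson_interval(errors, n, z=1.96):
    """Approximate 95% Wilson score interval for an error proportion."""
    if n <= 0:
        raise ValueError("need at least one sampled item")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def estimate_error_rate(crowd_labels, adjudicate, sample_size, seed=0):
    """Estimate the crowd's error rate from a random sample.

    crowd_labels: dict mapping item id -> label given by the crowd.
    adjudicate:   hypothetical callback returning the expert's label for
                  an item id; only the sampled items are ever checked.
    """
    rng = random.Random(seed)
    sampled = rng.sample(sorted(crowd_labels), sample_size)
    errors = sum(1 for item in sampled if crowd_labels[item] != adjudicate(item))
    return errors / sample_size, wilson_interval(errors, sample_size)

# Toy data: 1000 phrases that are all really NPs; the crowd mislabels every 10th.
crowd = {i: ("VP" if i % 10 == 0 else "NP") for i in range(1000)}
rate, (low, high) = estimate_error_rate(crowd, lambda item: "NP", sample_size=100)
print(f"estimated error rate: {rate:.2f} (95% interval {low:.2f} to {high:.2f})")
```

The point of the design is that the expert only looks at the sample (100 items here), not the whole annotated corpus.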

Unsolved Problems in Linguistics

(pic from the Donders Institute)

Just discovered this page called Unsolved problems in linguistics. It's a rather incomplete list, but a start. This is the sort of topic that could easily form the core of a very interesting conference debate. Linguistics remains a wide open field with competing theories and emerging methodologies, and the big questions remain dark and murky. However, this page claims that the origin of language is the major unsolved problem. I definitely disagree. The main goal of linguistics, as I would state it, is to figure out how language works in the brain (hence, that is our major unsolved problem). From that, most other questions can be answered (btw, see The Language Guy's take down of a recent report regarding the word most here). As our understanding of the brain improves, so will our understanding of language. I don't dispute that understanding the origin of language could be of use, but it is hardly the center of the linguistics world (I realize that Derek Bickerton might disagree).

NOTE: After Googling the phrase "Unsolved Problems in Linguistics" I found a number of other sites dedicated to the same topic, including a Wikipedia page; however, there is clear plagiarism/borrowing going on somewhere as there is word for word similarity between these sites; not sure who's cutting and pasting from whom. But you need only go to one site to see the same stuff.

Friday, December 4, 2009

The Naked Vulnerability Of His Sentences

The grammatically whimsical author of Infinite Jest, the late David Foster Wallace, was, apparently, a prescriptivist. Blogger Amy McDaniel at HTMLGIANT recently posted what she claims is a "complete text of a worksheet from his class" (HT kottke) which is, basically, a grammar test which begins with the following admonition:


Feel free to take the test yourself here, or to troll the answers folks are giving.

Personal fav:

2. I’d cringe at the naked vulnerability of his sentences left wandering around without periods and the ambiguity of his uncrossed “t”s.

UPDATE: it's always nice to scoop Language Log. A day late and a dollar short, Chris Potts posts about the DFW grammar test here. Psst, my post title is wayyyyy more cleverer. thhhpppt!

UPDATE 2: More LL on DFW and his prescriptivist bent here.

UPDATE 3: Looks like scooping LL is becoming a habit for me (pats self on back).

Google Words

TechCrunch reviews Google's newish dictionary app here (Google's dictionary has been lurking around for a while, but now it gets its own page here). I did a quick comparison of Google's & Merriam-Webster's entries for inappropriate and found they were remarkably different in scope. Google returns a lot more data (plus they provide links to other web definitions, which seemed to mostly be WordNet links). I prefer Google's phonetic guide as it seems to be straight IPA (although their transcription of -pro- as pr'oʊ seems odd to me as it suggests a diphthong when I think they're just indicating rounding, but I never was much of a phoneticist, so no biggie).

I was particularly impressed to see some constructional patterns listed in Google entries (e.g., |'it' v-link ADJ to-inf| representing something like 'it is inappropriate to yell').

However, Merriam-Webster still wins on historical data, minimal as it is.

Paul Reubens on Rails

kottke was in a goofy mood recently and started a twitter game whereby users come up with blends of celebrity names and online apps. Some of them are pretty good. Personal favs:
  • daniel craigslist
  • Gwyneth Paypaltrow
  • Sid
  • Katrina and the (google) Waves (I'm a sucker for '80s retro)
  • Ali G(mail)
  • Bing crosby
  • Simon and Garflickr
  • Michael J FireFox
  • Ben Afflickr
  • Black IP's
See more at #webappcelebs.

Thursday, December 3, 2009

Lexical Decision Tasks

(screen grab from University of Essex demo)

UPDATE (September 14, 2017): the Uni Essex link is dead. Try this one from PsyToolkit instead.

Just found this online demo of a classic lexical decision experiment from the University of Essex here. Some images on the page don't seem to load, but the experiment runs just fine. It's a nice example of a simple psycholinguistics methodology that is commonly used in many experiments.

I'll let the good folks at Essex explain the task:

One of the key methods of investigating the processes involved in reading is the lexical decision task. Any model of reading needs to explain how a particular word can be selected from many similarly featured items (known collectively as the neighbourhood). Neighbourhood size is a measure of the orthographic similarity between words (Coltheart et al., 1977). If a target word is orthographically similar to many words, then the target word is said to have a large neighbourhood (e.g. the word sell has many neighbours such as tell, well, bell, yell and sill). A target word which is orthographically similar to few words is described as having a small neighbourhood (e.g. deny only has the neighbours defy and dent). In lexical decision tasks, Andrews (1989) found that words from large neighbourhoods elicit quicker responses than words from small neighbourhoods. This finding has been observed in a number of studies (e.g. Laxon et al., 1992; Scheerer, 1987). The facilitatory effect of neighbourhood size suggests that presentation of a target word results in activation of all the lexical entries which are similar to the target, and this local activation somehow speeds up target access. However, the precise nature of this facilitatory effect is a matter of continuing debate.
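For the curious, neighbourhood size is easy to compute yourself. Here's a small sketch that assumes the usual reading of the measure (a neighbour is a word of the same length that differs from the target in exactly one letter position). The function names and the tiny lexicon are made up for illustration; a real count would need a full word list.

```python
def is_neighbour(a, b):
    """True if a and b have the same length and differ in exactly one letter."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def neighbourhood(target, lexicon):
    """Return the sorted orthographic neighbours of target found in lexicon."""
    return sorted(word for word in set(lexicon) if is_neighbour(target, word))

# Toy lexicon built from the examples in the Essex description, plus two
# distractors ("sells" and "deal") that should not count as neighbours.
lexicon = ["sell", "tell", "well", "bell", "yell", "sill",
           "deny", "defy", "dent", "sells", "deal"]

for target in ("sell", "deny"):
    found = neighbourhood(target, lexicon)
    print(target, len(found), found)
```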

Now go enjoy the demo!

BTW, check out these other online psycholinguistics experiments here.

Wednesday, December 2, 2009

Thinking Words (part 1)

(image from

I’d like to present a brief lesson in contemporary linguistic research with the goal of showing that we live in a marvelous age of quick and ready research tools freely available to even the most humble of internet users. Hence, a little effort goes a long way. My point is that when we make claims about language usage (and by "we" I mostly mean those of us who present our claims about language to the public via the interwebz) we need not make such claims based on our intuitions and emotions; rather, we can perform a little due diligence in a way that linguistic pontificators of the past simply could not. And bully for us.

My subject for today’s Full Liberman is this classic example of language mavenry from Prospect magazine: Words that think for us by Edward Skidelsky, lecturer in philosophy at Exeter University (HT Arts and Letters Daily). In this article, Skidelsky laments the following “linguistic shift”:

No words are more typical of our moral culture than “inappropriate” and “unacceptable.” They seem bland, gentle even, yet they carry the full force of official power. When you hear them, you feel that you are being tied up with little pieces of soft string. Inappropriate and unacceptable began their modern careers in the 1980s as part of the jargon of political correctness. They have more or less replaced a number of older, more exact terms: coarse, tactless, vulgar, lewd. They encompass most of what would formerly have been called “improper” or “indecent.”…“Inappropriate” and “unacceptable” are the catchwords of a moralism that dare not speak its name. They hide all measure of righteous fury behind the mask of bureaucratic neutrality. For the sake of our own humanity, we should strike them from our vocabulary.

UPDATE: A very lively discussion of the meaning of the words in question (something I largely ignore) has broken out on Language Log here.

This article makes four testable linguistic claims:
  1. The words inappropriate and unacceptable have increased in frequency over the last couple decades.
  2. This frequency increase is due to replacing other words: coarse, tactless, vulgar, lewd, improper, and indecent.
  3. These other words are “older”
  4. These other words are “more exact”
With a little investigation using entirely freely available online linguistics tools, we can easily fact check each of these claims. In the interest of time, I'll answer the first two together.

First and Second -- Has the frequency of inappropriate and unacceptable increased since the 1980s? & have they replaced the other words?

In order to quickly get some data, I took this to mean the frequency of the first two words has increased while the frequency of the other words has decreased since the 1980s (is this an unfair interpretation? In any case, that’s how I operationalized my methodology). Thanks to Mark Davies' excellent resource, the TIME Corpus of American English (100 million words, 1923-2006, requires registration, but it's free) we can quickly get a snapshot of the frequency of each word’s usage for the last 9 decades (not bad, huh? Thanks Mark!!).

Caveat: raw frequency is a poor data point by itself. What we really need is a way to compare apples to apples and oranges to oranges, and the problem we have is different sized corpora for each decade. Fear not, Davies does this work for us. His handy dandy interface allows us to report frequency per million, thus giving us comparable frequencies across different decades.
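The arithmetic behind frequency per million is simple enough to sketch. The counts and corpus sizes below are invented purely for illustration (they are not TIME Corpus figures), and `per_million` is just a name I picked.

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus size must be positive")
    return raw_count * 1_000_000 / corpus_size

# Hypothetical numbers: decade -> (raw count of a word, total words that decade).
decades = {
    "1950s": (30, 15_000_000),
    "2000s": (32, 6_400_000),
}

for decade, (count, size) in decades.items():
    print(f"{decade}: raw = {count}, per million = {per_million(count, size):.2f}")
```

In this made-up case the raw counts look nearly identical, but once you account for the smaller corpus in the later decade the word turns out to be more than twice as frequent, which is exactly why raw counts alone mislead.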

Using the TIME corpus, I looked up the frequency per million of each word per decade. Then I entered that data into a spreadsheet. I used Excel 2007 to create a line graph of these frequencies.

Here's the relevant data:

And here's the graph:

UPDATE (2hrs after original post): original graph was confusing (same graph, just confusing labels) so I fixed it.

What this shows us is that both inappropriate and unacceptable do in fact show a rise in frequency (consistent with Skidelsky's claim), but starting in the 1960s, not 1980s. However, unacceptable shows a more recent dramatic decline, which is inconsistent with his claim. Lewd actually made a bit of a comeback in the 1990s (thank you Mr. Clinton?), but has since dropped back (it's a bit of a jumpy word, isn't it?). The other words do seem to be falling off in usage, consistent with Skidelsky's claim. So the picture is not quite what Skidelsky thinks it is, though he does seem to be on to something.

UPDATE: See myl's plot of this same data (but grouping the words as Skidelsky does) here which suggests that "'coarse', 'tactless', 'vulgar' etc. declined until WWII and then stayed about the same, perhaps with an additional decline in past decade; while 'inappropriate' and 'unacceptable' rose gradually from the 1930s to 1970 or so, and then leveled off. " The plot does suggest that we could view the two groups as having roughly inverted frequency, somewhat conforming to Skidelsky's hunch.

Third -- Are these other words “older”?

Unfortunately, as I am no longer affiliated with a university, I have no access to the OED (I’ve decided not to pay the $295 for their individual subscription. Condemn me if you must). If anyone would care to look those up and post them in comments, I’d be happy to update. Most of these words have multiple senses and the question is, when did the most relevant sense enter usage? For that, the OED is most valuable. Again, you can do that work for me, or send me a check for $295.

However, a simple search of the Merriam Webster online dictionary gives us a quick answer:

unacceptable = 15th century
inappropriate = 1804
coarse = 14th century
tactless = circa 1847
vulgar = 14th century
lewd = 14th century
improper = 15th century
indecent = circa 1587

This data suggests these eight words fall into roughly two groups:

A -- words that entered the language around the 19th century
  • Set A = inappropriate, tactless
B -- words that entered the language around the 14th-16th centuries
  • Set B = unacceptable, coarse, vulgar, lewd, improper, indecent
This grouping does not conform to Skidelsky’s assumption that inappropriate & unacceptable fall together in a newer class and the others in an older class.

UPDATE: much thanks to commenter panoptical who provides the following OED dates which appear to largely confirm the Merriam Webster dates, with the notable exception of lewd which dates back to Old English it seems...does have a certain Beowulf ring to it, doesn't it?

unacceptable: 1483
inappropriate: 1804
coarse: 1424
tactless: 1847
vulgar: 1391
lewd: c890
improper: 1531
indecent: 1563
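To see the grouping at a glance, the OED dates above can be bucketed by century with a few lines of Python. This is just a sketch: the function names are mine, and I've treated "c890" as the year 890 for the purposes of bucketing.

```python
from collections import defaultdict

# First-attestation years as supplied by the commenter (lewd's "c890" taken as 890).
first_attested = {
    "unacceptable": 1483,
    "inappropriate": 1804,
    "coarse": 1424,
    "tactless": 1847,
    "vulgar": 1391,
    "lewd": 890,
    "improper": 1531,
    "indecent": 1563,
}

def century(year):
    """Return the ordinal century for a year, e.g. 1483 -> 15."""
    return (year - 1) // 100 + 1

def group_by_century(dates):
    """Map each century to the alphabetized words first attested in it."""
    groups = defaultdict(list)
    for word, year in dates.items():
        groups[century(year)].append(word)
    return {c: sorted(words) for c, words in sorted(groups.items())}

for c, words in group_by_century(first_attested).items():
    print(f"century {c}: {', '.join(words)}")
```

Only inappropriate and tactless land in the 19th century; unacceptable sits with the older words, which is the same mismatch with Skidelsky's grouping noted above.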

Fourth -- Are the other words "more exact"?

Finding a way to empirically test this is a challenge I will take up in a later post (you can see Wordnet coming, can't you?). It will require teasing apart senses and relationships between senses (oh my, I wish I had the OED right now...).
