Collaborative Corpus Linguistics

Using Pinax and Django For Collaborative Corpus Linguistics from James Tauber on Vimeo.

I haven't seen this video yet (it's almost an hour long and using my current connection would take 3 days to buffer), but it seems interesting:

After introducing Django and Pinax, this talk discusses Pinax-based tools the speaker is developing to help with web-based collaboration on corpus annotation with applications from lexicography to morphology to syntax to discourse analysis.

The speaker is James Tauber. According to his web page, he is currently on leave from a PhD in linguistics at the University of Essex, where he was researching the inflectional morphology of Ancient Greek.

Monday, August 30, 2010

map of nationality suffixes

(image from Linglish.net)

A commenter on Liberman's recent post Ask Language Log: Adjectives from country names? linked to this excellent discussion of nationality suffixes: So many nationality suffixes which contained the map above. Well done.

Linguist List Jobs?

I can't be alone in thinking the new Linguist List Jobs page is horribly designed, can I? I mean, it's soooooo 1998, right?

Sunday, August 29, 2010

reports of the death of the printed OED have been slightly exaggerated

Just read a short article suggesting the OED illuminati are at least cognizant of the impending doom of the printed version of the OED. I vote yes (no one cares). I always hated that little magnifying glass. *fyi, I haven't figured out yet how to copy and paste links using droid x, so you'll have to surf for the article on your own. (Copying is easy, pasting not so much)

Saturday, August 28, 2010

language of the fishers

National Geographic is getting much Twitter buzz for their recent article "Lost" Language Found on Back of 400-Year-Old Letter. The title alone evokes Indiana Jones and The Da Vinci Code, so people naturally get all jazzed. But note the clever use of scare quotes in the title. What was actually discovered was an apparent translation of base ten number names from a language, likely one of two known only by the mention of their names in contemporary texts: Quingnam and Pescadora—"language of the fishers".

I found this quote interesting:

"Even though [the letter] doesn't tell us a whole lot, it does tell us about a language that is very different from anything we've ever known—and it suggests that there may be a lot more out there," said project leader Jeffrey Quilter, an archaeologist at Harvard’s Peabody Museum of Archaeology and Ethnology. (emphasis added).

Ummm, if the letter doesn't tell us a whole lot, how do we know this lost language is "very different from anything we've ever known"?

Friday, August 27, 2010

The Top Ten NLPers!!

There's a LIST!!!

From Dr. Jochen L. Leidner’s Blog,

For the area Natural Language and Speech, the all-time most high impact researchers are:

Robert Mercer
Fernando Pereira
Kenneth Church
Vincent Della Pietra
Aravind Joshi
Mitchell Marcus
Hermann Ney
Peter Brown
Michael Collins
Stephen Della Pietra

Congratulations, your awards will be mailed at a later date.

Tuesday, August 17, 2010

I just don't trust the guy...

The events surrounding Marc Hauser and his lab over the last week have been fascinating to follow and interesting to wonder about from afar, but the lasting impact didn't really hit home until I read this post: What Are The Origins of Number Representation?

The Thoughtful Animal is a good science blog written by a smart grad student at USC. And that particular post is an excellent summary of interesting research on human cognition. But here's the thing, it includes a review of one of Hauser's papers. Though not one of the three papers targeted by the investigation, as far as I know, I felt suspicious. I couldn't help it. Perhaps I should be less judgmental, but the truth is, I just don't trust the guy. I do not claim that's a good thing. Again, perhaps I should be more objective, but there is an uncontrollable emotional aspect to the effect of academic misconduct. Credibility is gone.

This is the inevitable effect of that kind of scandal (and I do think that's an appropriate word to use here) that Hauser is now involved in. I can't read any reference to his work sans suspicion (especially work involving qualitative judgments about the behavior of cotton-top tamarin monkeys). I could read the original paper, of course, but that wouldn't allay the deep suspicion because I wouldn't be able to review the evidence myself. However, if Hauser, and all scholars, would publish their raw data online (as Liberman has called for here) I could make a better informed judgment as to the credibility of that particular paper and its conclusions. Unfortunately that option simply isn't available to me ... yet.

Sunday, August 15, 2010

I De-ed her...

Been reading Wallace Stegner's 1976 National Book Award winning novel The Spectator Bird for my book club (must. finish. monday...). The author makes some cute observations about the incoherence of the Danish glottal stop (famous amongst tortured first year linguistics grad students who often are assigned to wrestle with its distributive intricacies). But the linguistic observations are not limited to phonology.

As part of the narrative, the main character, a literary agent named Joe Allston, travels to Denmark and meets a proper, if minor, Danish aristocrat Astrid Wredel-Krarup. While reviewing his journals about the time with her, he makes the following declaration to his wife Ruth:

Ruth: I called her by her first name after the first day or so.

Joe: Well, I didn't after two or three months. Ever. She called me Mr. Allston and I gave her back the full business. When I tried Danish, I didn't Du her, I De-ed her.

It's a cute illustration of how people negotiate linguistic forms, especially politeness forms. This reminded me of a former house-mate of mine who moved to DC after a year teaching ESL in France. He took a summer sublet with an older, very proper French-speaking Belgian lady, and he told me he didn't use the informal tu with her until late in the summer, after she explicitly told him it was okay to use it.

Saturday, August 14, 2010

Eliot Ness takes down Hauser?

It's a general rule of bad journalism that when a story begins to slow down, it's a journalist's duty to just make things up to squeeze out one or two more articles. As far as I know, there has been no new information released about the Marc Hauser story since it broke four days ago here. Yet yesterday the NYT writer Nicholas Wade wrote "Marc Hauser’s academic career was soaring when suddenly, three years ago, Harvard authorities raided his laboratory and confiscated computers and records" (emphasis added).

Raided? Really? They raided his lab? With shotguns and sledge hammers like Eliot Ness? Wade's article simply rehashes what we already know, so it's not clear why the NYT is giving him more and more column inches to fill. Wade is quickly making a cottage industry out of repetition of the Hauser story without adding much if any value (see here for an even more vicious critique of Wade). Why does he believe there was a "raid" on Hauser's lab? It's not clear. The only indication is a quote later in the article from Michael Tomasello who said “Three years ago when Marc was in Australia, the university came in and seized his hard drives and videos because some students in his lab said, ‘Enough is enough.’ They said this was a pattern and they had specific evidence” (emphasis added).

Tomasello used the word "seize", Wade used the word "raid." To me those words have very different connotations. Seizing a hard drive can be as mild as some guy walking into a lab with a Starbucks in one hand and a stack of ungraded freshman essays in the other and saying to a couple of exhausted grad students "hey guys, I need to take your hard drives, can you take a break for a minute?" A raid, on the other hand, involves shotguns and tear gas.

Friday, August 13, 2010

Harvard responds

Harvard released a statement yesterday concerning the Marc Hauser retractions (plural, it includes three papers):

Harvard has always taken seriously its obligation to maintain the integrity of the scientific record. The University has rigorous systems in place to evaluate concerns about scientific work by Harvard faculty members. Those procedures were employed in Dr. Hauser's situation. As a result of that process, and in accordance with standard practice, Harvard has taken steps to ensure that the scientific record is corrected in relation to three articles co-authored by Dr. Hauser. While Dr. Hauser (or in one instance, his colleague) were directed to explain the issues with these articles to the academic journals that published those papers, the University has also welcomed specific questions from the editors involved. We will continue to assist the editors in this process.

In these types of cases, Harvard follows federal requirements for investigating alleged research misconduct and reports its findings, as required, to the appropriate federal funding agencies, which conduct their own review. At the conclusion of the federal investigatory process, in cases where the government concludes scientific misconduct occurred, the federal agency makes those findings publicly available.

Still no indication of what the actual misconduct was and why it took three years to investigate. Overall reactions around the web seem to be of three kinds:

1. yeah, we kinda suspected this all along.
2. we don't know enough yet to judge.
3. this is the sausage making inherent in the scientific process.

Thursday, August 12, 2010

marc hauser reactions

The social media world is abuzz with reactions to the Marc Hauser story.  From the blogs--

Mark Liberman:
Like many other linguists, Geoff and I have felt from the beginning that the results of Hauser's monkey experiments were of dubious relevance to the evolution of speech and language. Now we're forced to question whether there were any reliable results at all.

Drug Monkey:
What I am worried about in this type of coverage is the conflation of a failure to replicate a study with the absence of evidence (per the retraction blaming a trainee) with scientific debate over the interpretation of data. The mere failure of an investigation to be able to replicate a prior one is not in and of itself evidence of scientific misconduct. Scientific findings, legitimate ones, can be difficult or impossible to replicate for many reasons and even if we criticize the credulity, scientific rigor or methods of the original finding, it is not misconduct.

John Hawks:
The problem of subjective data is not unique to Hauser's work but is systemic in the field of primate cognition. It reminds me of some discussion in Jeremy Taylor's recent book Not a Chimp: The Hunt to Find the Genes that Make Us Human. There's the issue of whether experiments are designed clearly enough to yield conclusions. Then there's the second issue of whether observations are replicable, or whether they result only from somewhat "wishful" researchers. Such experiments often get heightened scrutiny, but rarely is there clear misconduct. That makes this a really shocking case.

Razib Khan:
Hauser is a prominent public intellectual...Obviously problems in some aspects of his work doesn’t necessarily invalidate all his findings, but it doesn’t look good for his credibility. This sort of incident points to the importance of trust within the culture of science. Collaborators and researchers who cited his results are scrambling to make sense of it all.

Open Parachute:
...we should recognise that we are seeing one of the methods science has for self correction. The science community treats deliberate distortion of evidence, poor record keeping and biased interpretation of results very seriously.

There are going to be people who use this news to attack science. But we should ask them if they are prepared to submit their beliefs, ideology or claims to such scrutiny? And are they willing to be disciplined if an investigation finds that they have made distorted or false claims?

Art Markman:
I find cases like this both frustrating and reassuring at the same time.

The frustrating part of cases of misconduct is fairly obvious. As a scientist, all I really have is the integrity of my data. Theories are nice, of course. We create theories to help us to explain patterns of data. But, really, theories are most useful because they help us to develop new questions that we can ask that will help us to collect new data. [...] At the same time, cases of misconduct are reassuring. Science is remarkably self-correcting. When we publish papers in scientific journals, we organize our papers in a way that reflects the ideals laid out by Francis Bacon. We give enough of the details about our methods that someone else could repeat the study we are presenting. We present details about the analysis of our data. After a paper is published, authors often make their data available to others who want to do additional analyses of the work.

David Dobbs:
To me the allegations, vague as they are, don’t quite rise to shocking yet, though I may be missing something. (Please point it out if so.) But that confusion underlines how important it is, methinks, for Harvard to spell out just what is being looked at here. Not just Hauser but a lot of people who have drawn on, contributed to, or worked parallel to his work are hanging in the wind here.

FYI: The author of the original story follows up today with this: Harvard is urged to detail inquiry.

Wednesday, August 11, 2010

what a PhD looks like...

a pimple...

...and I remain happily ABD...

See The illustrated guide to a Ph.D. for full set of images and discussion.

Marc Hauser on leave???

An odd development in the language evolution saga. Harvard professor Marc Hauser has been forced on leave over an apparent academic misconduct issue regarding a paper he published involving rule learning in monkeys. Hauser is well known within language evolution circles. He co-authored with Chomsky the (in)famous paper The Faculty of Language: What Is It, Who Has It, and How Did It Evolve? which argued that

...a distinction should be made between the faculty of language in the broad sense (FLB) and in the narrow sense (FLN). FLB includes a sensory-motor system, a conceptual-intentional system, and the computational mechanisms for recursion, providing the capacity to generate an infinite range of expressions from a finite set of elements. We hypothesize that FLN only includes recursion and is the only uniquely human component of the faculty of language. We further argue that FLN may have evolved for reasons other than language...

This prompted a response from Pinker & Jackendoff: The faculty of language: what’s special about it? which countered Hauser & Chomsky & Fitch saying "The approach ... is sufficiently problematic that it cannot be used to support claims about evolution."

Boston.com reports that Hauser "is taking a year-long leave after a lengthy internal investigation found evidence of scientific misconduct in his laboratory. The findings have resulted in the retraction of an influential study that he led. “MH accepts responsibility for the error,” says the retraction of the study on whether monkeys learn rules, which was published in 2002 in the journal Cognition. Two other journals say they have been notified of concerns in papers on which Hauser is listed as one of the main authors."

Commenters at Razib Khan's Gene Expression post about this fisk the Boston.com article a bit and are worth reading over there. Nonetheless, it's a strange turn of events for language evolution.

Liberman has a nice analysis of Hauser's work here.

ant synonyms and linguistics envy

A cute analogy: Similar molecules which differ slightly in chain length cause similar behavioral reactions in ants. Therefore, similar chemicals are like lexical synonyms in human language. This is a rough paraphrase of the brief post Chemical Ant Language Has Synonyms.

But is the analogy valid?

The blog was referring to a study that investigated what appeared to be a pretty straightforward stimulus-response reaction. Ants were exposed to a variety of chemicals which differed minimally and their reactions were recorded. Upon first pass, it appears as though no sort of cognitive processing occurred in the ants (I cannot speak with any authority on the state of cognitive processing in ants, but I'm guessing it's limited at best). The blog post author took the synonym analogy straight from the original study Deciphering the Chemical Basis of Nestmate Recognition (full citation below). From the abstract:

This study contributes to our understanding of the chemical basis of nestmate recognition by showing that, similar to spoken language, the chemical language of social insects contains “synonyms,” chemicals that differ in structure, but not meaning (emphasis added).

Linguists have been justly accused of having both physics envy and biology envy for our tendency to borrow concepts from those fields to help understand linguistic processes. This, however, may be a case of linguistics envy. The use of language as a metaphor for anything remotely communicative is all too familiar to many of us and typically wrong. And the public's love of animal language stories fuels the fire.

Clearly the findings are interesting to the extent that they show a certain categorical response. Apparently ants respond to a set of chemicals in a similar way and this set of chemicals might be loosely compared to a set of synonyms like run, jog, trot, scurry, scamper, sprint, etc. But the most interesting thing about lexical synonyms is that they DO differ in meaning and distributional properties. Even if the differences are nuanced, they are real. Their semantics are related, but it's the differences that are the object of linguistic inquiry. So, if the ant response is to be a viable analogy to lexical synonyms, we're going to have to see that each chemical variant produces a similar but interestingly different response in ants.

Now, what might be a closer linguistic analogy is that of phonemes. Here we have a well-understood phenomenon whereby a set of similar but interestingly different sounds are perceived as belonging to a single class. There is also the interesting categorical perception phenomenon where slight differences in sounds can be perceived as whole category differences, not unlike molecules of different chain length causing a similar reaction in ants (I think).

Wilgenburg, E., Sulc, R., Shea, K., & Tsutsui, N. (2010). Deciphering the Chemical Basis of Nestmate Recognition. Journal of Chemical Ecology, 36(7), 751-758. DOI: 10.1007/s10886-010-9812-4

Monday, August 9, 2010

[having sex] shatner and taboo vocabulary

Liberman bait:

William Shatner, of all people, stands at the center of television's latest moral battleground.

He's the cantankerous lead character in a new CBS sitcom, "(Bleep) My Dad Says," that is scheduled to air on Thursday nights. Rather than "bleep," the title uses a series of symbols that suggest the expletive included in the book title on which the series is based.

The Parents Television Council last week sent letters to 340 companies that advertise frequently on TV urging them to stay away from the show unless the name is changed. The group argues that the title is indecent.

"Parents really do care about profanity when their kids are watching TV," said PTC President Tim Winter. "All parents? No, but something like 80 or 90 percent of parents. Putting an expletive in the title of a show is crossing new territory, and we can't allow that to happen on our watch" (emphasis & link added).

Note, however, that CBS did not go fully arbitrary with their symbols. There is more than a little iconicity between $#! and Shit. The dollar sign clearly evokes a capital S and the pound sign # evokes an H (though more of a capital H than lower case h). Would The Parents Television Council be happy were it called '%!@#* My Dad Says'?

UPDATE: Ben Zimmer alerted me to the fact that this is old news, see here and here for previous discussions. [having sex] Pullum!

wurfing and polkadodge

According to a recent story in The Telegraph titled Secret vault of words rejected by the Oxford English Dictionary uncovered:

Millions of "non words" which failed to make the dictionary lie unused in a vault owned by the Oxford University Press. [...] These words were recently submitted for use in the Oxford English Dictionary (OED) but will remain dormant unless they enter common parlance in the future. Graphic designer Luke Ngakane, 22, uncovered hundreds of 'non words' as part of a project for Kingston University, London. He said: ''I was fascinated when I read that the Oxford University Press has a vault where all their failed words, which didn't make the dictionary, are kept. ''This storeroom contains millions of words and some of them date back hundreds of years. ''It's a very hush, hush vault and I really struggled to find out information about it because it is so secretive.

The OED Illuminati have all the power!

The story provides a small dictionary of non words at the end. Personal fav: "earworm", a catchy tune that frequently gets stuck in your head. ewwwww...

UPDATE: Ben Zimmer debunks this story here. Alas, there is no OED Illuminati...or is that just what Ben wants us to think?

the linguistics of love

A recent tweet from CursorTN does a nice bit of frame semantic analysis: the expression for being irrationally in love *should* be "heels over head in love." Think about it.

Hmmmm, yes, yes, this seems correct. What the hell does 'head over heels' mean anyway? I'm head over heels right now and I'm sitting at a computer typing!

Presumably there is a romantic attraction frame (can't find anything like this at FrameNet, might have missed it) whereby being in love upsets your natural state. If your natural physical state is standing upright, then you naturally are 'head over heels.' Hence, when you are in love, your natural state is upended and you become 'heels over head.' And yet this is not the phrase in use.

A little googling and I found a few websites which have discussed this before, but only one gives us some historical background:

The Phrase Finder: 'Head over heels' is a good example of how language can communicate meaning even when it makes no literal sense. After all, our head is normally over our heels. The phrase originated in the 14th century as 'heels over head', meaning doing a cartwheel or somersault.

Now can we figure out how the reversal occurred?

Sunday, August 8, 2010

Sunday math quiz

What's wrong with describing a quantity as x meters or h feet? Answer in the first chapter of this free MIT course Street-Fighting Math.

From a special post titled The evolution of language at The Emporia Gazette (KS): "Much of all modern languages in Europe evolved from Latin and Greek." sigh...

NLP Book

Alias-i has just released a draft version of a book based on their NLP suite LingPipe:

Our goal is to produce something with a little more breadth and depth and much more narrative structure than the current LingPipe tutorials. Something that a relative Java and natural language processing novice could work through from beginning to end, coming out with a fairly comprehensive knowledge of LingPipe and a good overview of some aspects of natural language processing.


Friday, August 6, 2010

the linguistics of tetris

The University of Edinburgh has a new language experiment/game up and running:

Tetris Experiment

In Tetris, you must place falling blocks to score points.  If there is no more room to place blocks, you lose!
Steer the current block using the left and right arrows.  Rotate the block with the up arrow and fast-drop it using the down arrow.

You will play one of two versions of the game:
Tetris – Whole horizontal rows are removed.
Coltris – if more than 4 blocks of the same colour are touching, those blocks are removed.

A word will appear with each new block. This is the name of the next block in Tetro – a strange and ancient language. Your task is to learn the names of the blocks in Tetro. You will be presented with each block and the name for it in Tetro in a Training round. These will appear in the top right window. You will be tested on your knowledge in a Test round! First, you will play Tetris or Coltris for 2 minutes, then you will be trained on Tetro, then do a short test. You'll play the game for 2 more minutes, then be trained again and you'll do a long test.

HT: The Adventures Of Auck.

Thursday, August 5, 2010

just to be clear...

(personal pic taken outside of Huachuca City, AZ)

Yep, sometimes even the simplest linguistic structure requires a little indexical help.

Tuesday, August 3, 2010

discourse is that thing...

Hal Daume, NLPer extraordinaire, waxes poetic on discourse and concludes that interpretation is abduction, where the purpose of discourse is to give you an interpretation about whatever is not in the sentence itself, given that you assume the sentence must be coherent. Money quote: what do we have to assume about the world to make this discourse coherent?

It was a modest post by a very smart man, but it all seemed a little too Grice-lite to me. But Daume gets bonus points for the Jerry Hobbs shout out. Not enough linguists read Hobbs.

Snowclone bait: isn't 'Xer extraordinaire' a snowclone? Time to email Erin...

Monday, August 2, 2010

son of of bitch, the weather is pickled

This is an awesome video of a Korean language professional teaching Korean speakers how to use swear words in English. It's so good, it's pickled.

(HT kottke)

swallowing the whorfian pill

The lingo world is ablaze with references to the recent WSJ article Lost in Translation by Stanford Professor Lera Boroditsky. Tweeters have been linking to it like mad and both Language Log (also here) and the Economist's Johnson blog have discussed it (I posted about Professor Boroditsky's work before here and my position mirrors Johnson's: interesting but not ground breaking and a bit overblown). I suspect that lay audiences love Whorfian stuff much like they love peevologists and no word for x discussions. A combination of cultural attitudes and fundamental misunderstanding about how language actually works leads to popular lingo-topics that make most linguists yawn or roll their eyes.

Along these lines, I stumbled across what appears to be a legitimate, well-meaning, but perhaps somewhat misguided research project: The who/which project. From the project web page:

One of the underlying causes of ecological destruction is the separation between humans and other animals. When nonhuman animals are treated according to balance sheets rather than their own nature, the result can be not only a life of misery for the animals concerned, but environmental devastation.  This research project looks at one area of language which both reflects and contributes to the gulf between humans and other animals: the pronouns who and which. Who, we are told by some but not all dictionaries and grammar books, refers exclusively to humans, which to nonhuman animals, plants and things. This project has begun by investigating the use of the pronouns who and which (and perhaps related topics later), starting with what dictionaries and grammar books prescribe and describe. Beyond this, the hope is that the project could contribute to efforts to bridge the gap between humans and other animals (emphasis added).

If I understand correctly, this project is going to try to analyze written rules for who/which in hopes of discovering some causal link to animal cruelty. This is founded on the belief that using a different pronoun to refer to animals causes us to think about them differently. I realize I'm exaggerating the project's claims a bit and I'll ask you to forgive me that because I want to lay bare the underlying assumptions. There is a Whorfian hypothesis underlying this project's mission. Unfortunately, their methodology is far too superficial and simplistic to yield anything useful, I suspect. And this comes full circle back to Boroditsky in that even a Stanford professor's well designed empirical research on the Whorfian hypothesis gets misunderstood and overblown. Poorly designed projects are destined for a Full Liberman.

NLPers: How would you characterize your linguistics background?

That was the poll question my hero Professor Emily Bender posed on Twitter March 30th. 573 tweets later, a truly epic thread had been cre...