Friday, February 21, 2014

RIP Charles Fillmore

I never met Charles Fillmore, but he had a deep influence on my linguistics education. When I was a graduate student in linguistics at SUNY Buffalo, we only half-jokingly called it Berkeley East, because half the faculty had been trained at Berkeley and the department had a *perspective* on linguistics that was undeniably colored by Berkeley theory. Charles Fillmore was a hero at SUNY Buffalo, and it was hard to take a class that didn't reference his work. His work on constructions and frame semantics was the underpinning of my interest in verb classes and prepositions.

I can't offer any unique thoughts on the man, so I'll simply point to some folks around the web who have offered theirs:

A Roundup of Reactions

Paul Kay - Charles J. Fillmore
The magnitude of Fillmore’s contributions to linguistics can hardly be exaggerated

George Lakoff - He Figured Out How Framing Works
He discovered that we think, largely unconsciously, in terms of conceptual frames — mental structures that organize our thought. Further, he found that every word is mentally defined in terms of frame structures.

Dominik Lukes - Linguistics According to Fillmore
Charles J Fillmore who was a towering figure among linguists without writing a single book. In my mind, he changed the face of linguistics three times with just three articles (one of them co-authored).

UC Berkeley - Linguistics Department
He was a gifted teacher, a beloved mentor, a treasured colleague and friend, and one of the great linguists of the last half-century.

Arnold Zwicky - Chuck Fillmore
...with a link to a wonderful video he made about his career in 2012.

Friday, January 31, 2014

The SOTU and Reading Level

Evan Fleischer wrote a cheeky little bit about the reading level of the SOTU over at Esquire: Is the State of the Union Getting Dumber?

It was triggered by this graph in The Guardian:

Evan emailed me and several other linguists to get some reactions. He quotes me, Ben Zimmer, and Angus B. Grieve-Smith. We generally agreed that the trend noted by the graph probably has more to do with changes in who the speech is for than with any change in intelligence level.

It's a fun little read.
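For the curious, reading-level scores of the sort plotted in the Guardian graph are typically Flesch-Kincaid grade levels, and the formula is simple enough to sketch in a few lines. This is a rough sketch with a crude vowel-group syllable heuristic; I'm assuming, not confirming, that the Guardian used Flesch-Kincaid:

```python
import re

def syllables(word):
    # Crude heuristic: count runs of vowels (overcounts silent e's).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllable_count = sum(syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllable_count / len(words)
            - 15.59)
```

Short, punchy sentences score at a low grade level; long sentences full of polysyllabic words score high. That's all the graph is measuring, which is why "who the speech is for" matters more than "how smart the speaker is."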

Tuesday, January 28, 2014

Anticipating the SOTU

In anticipation of President Obama's 2014 State of the Union speech tonight, and the inevitable bullshit word frequency analysis to follow, I am re-posting my reaction to the 2010 SOTU, in the hope that maybe, just maybe, some political pundit might be slightly less stupid than they were last year ... sigh ... here's to hope.

BTW, Mark Liberman has been on top of the SOTU story for a while now. Here's his latest.

(cropped image from Huffington Post)

It has long been a grand temptation to use simple word frequency* counts to judge a person's mental state. As with Freudian slips, there is an assumption that this will give us a glimpse into what a person "really" believes and feels, deep inside. This trend came and went within linguistics when digital corpora were first being compiled and analyzed several decades ago. Linguists quickly realized that this was a bogus methodology when they discovered that many (most) claims or hypotheses based solely on a person's simple word frequency data were easily refuted upon deeper inspection. Nonetheless, the message about the weakness of this technique never quite reached the outside world, and word counts continue to be cited, even by reputable people, as a window into the mind of an individual. Geoff Nunberg recently railed against the practice here: The I's Don't Have It.

The latest victim of this scam is one of the blogging world's most respected statisticians, Nate Silver, who performed a word frequency experiment on a variety of U.S. presidential State of the Union speeches going back to 1962 HERE. I have a lot of respect for Silver, but I believe he's off the mark on this one. Silver leads into his analysis by talking about his own pleasant surprise at the fact that the speech demonstrated "an awareness of the difficult situation in which the President now finds himself." Then he justifies his linguistic analysis by stating that "subjective evaluations of Presidential speeches are notoriously useless. So let's instead attempt something a bit more rigorous, which is a word frequency analysis..." He explains his methodology this way:

To investigate, we'll compare the President's speech to the State of the Union addresses delivered by each president since John F. Kennedy in 1962 in advance of their respective midterm elections. We'll also look at the address that Obama delivered -- not technically a State of the Union -- to the Congress in February, 2009. I've highlighted a total of about 70 buzzwords from these speeches, which are broken down into six categories. The numbers you see below reflect the number of times that each President used the term in his State of the Union address.
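In code terms, that methodology amounts to nothing more than keyword counting, something like the sketch below (the buzzword list here is an invented placeholder, not Silver's actual list). Notice that a counter like this has no notion of context, negation, or irony; every occurrence counts the same:

```python
import re
from collections import Counter

# Placeholder buzzwords for illustration -- NOT Silver's actual 70-word list.
BUZZWORDS = ["jobs", "economy", "terror", "health"]

def buzzword_counts(speech_text):
    """Count how often each buzzword appears in a speech, case-insensitively."""
    words = re.findall(r"[a-z']+", speech_text.lower())
    counts = Counter(words)
    return {w: counts[w] for w in BUZZWORDS}
```

"We will cut jobs" and "we will create jobs" both increment the same counter, which is precisely the kind of problem deeper inspection reveals.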

The comparisons and analysis he reports are bogus and at least as "subjective" as his original intuition. Here's why:

Sunday, January 12, 2014

Causation in Verbal Semantics

Causation is a major area of study within linguistic semantics. There is a thorough wiki page on the Causative that provides a good overview. Also, unsurprisingly, Beth Levin has written a nice discussion of the issues in these LSA 09 notes: Lexical Semantics of Verbs III: Causal Approaches to Lexical Semantic Representation.

To list the troubles with defining causation would fill a dissertation, so I won't bother here. Often, semanticists are interested in argument realization (see Levin's notes above). But there are deeper issues with causality that often go unaddressed. The deepest of all: what the hell is causality?

To this point, I ran across an old draft of a grad school buddy's qualifying paper on causation. It's just a draft, and it's old, but it has a nice section that tries to outline the constitutive criteria for causation*. I have since lost touch with this guy (I'll call him "BB"), but I think this list of criteria is good food for thought for anyone interested in causation. I post these as discussion points only. And if BB sees this, give me a buzz :-)

First, here's a taste of the range of causative types, taken from the wiki page on Causation (don't be fooled by these English examples; the issues permeate all languages. Causation is tough):

  • The vase broke — autonomous event (non-causative).
  • The vase broke from a ball’s rolling into it — resulting-event causation.
  • A ball’s rolling into it broke the vase — causing-event causation.
  • A ball broke the vase — instrument causation.
  • I broke the vase in rolling a ball into it — author causation (unintended).
  • I broke the vase by rolling a ball into it — agent causation (intended).
  • My arm broke when I fell — undergoer situation (non-causative).
  • I walked to the store — self-agentive causation.
  • I sent him to the store — caused agency (inductive causation).

BB's Nine Criteria for the treatment of causation (c. 2002)
  1. Change of state. The caused event must denote a change of state.
  2. Causers must be events. The causer A cannot simply be an individual but must be an event.
  3. Argument sharing. The causing event must contain the causee in its representation.
  4. Impingement. There must be a clear indication of impingement between the causer and the causee such that the causer impinges on the causee.
  5. Occurrence condition. The caused event must occur.
  6. Co-occurrence condition. The occurrence of the caused event must be conditional with the occurrence of the causing event, that is, the caused event can only take place if the causing event takes place.
  7. Non-co-occurrence condition. The non-occurrence of the caused event must be conditional with the non-occurrence of the causing event; that is, the caused event does not take place if the causing event does not take place.
  8. Directness of causation. It must be apparent when indirect causation is allowable for causality in lexical items.
  9. Spatiotemporal equivalence. The causing event and the caused event must have an equivalent time and place.

BTW, I recall objecting to #5 "the caused event must occur" because of negative causative verbs like prevent (feel free to read my previous post on these kinds of verbs). I don't know how or if he addressed that in his final version.

* There's so much literature on causation, it would take years to review it all to see if anyone else has done such a thing at quite such a level (many authors mention criteria, but not quite as exhaustively). I wouldn't be surprised if there is a better variation out there, and I'm happy to post it if someone wants to point it out to me.

Monday, December 16, 2013

Porn for Linguists!

Finally! One thing in linguistics Len Talmy, Paul Postal, Noam Chomsky, and Joan Bresnan can agree on: At $12.99, The Speculative Grammarian Essential Guide to Linguistics is a modest last minute Christmas gift that takes less effort to purchase than red sweaters with white fluffy trim, yay! Nerdy uncles around the world thank you!

At 10,700 single spaced pages, 9 point font, Vera Sans Bold, this thin volume is a reminder of why my dissertation never quite fulfilled its promise, or never quite filled 50 pages, for that matter (can you say Ay Bee Dee, boys and girls?).

This volume of linguistic paraphernalia appears to be an elaborate sting designed to con some otherwise reputable institution into bestowing a commemorative matchbook cover on Trey Jones, a linguist best known for not being Terry Jones.

Out of kindness to the editors, I will refrain from discussing their shocking decision, vis-à-vis two white spaces after a period or one (I'll leave it to you, dear reader, to judge the depth of their depravity on your own). As to their policy regarding the Oxford comma, scandalous!

Am I paranoid, or was the blank page four a none-too-subtle homage to covert logical form? Obvious Chomskyan propaganda, I was disgusted.

'Tis not without its charms, though. A personal fave: Kean Kaufmann's cartoon depiction of when Daniel Jones discovered history's first cardinal vowel by plucking it, virginal and innocent, from his perfectly formed vowel space:

The volume also contains some rarely discussed dark moments in linguistics history, such as the catastrophic linguistic consequences of the 2004–5 NHL lockout on Canadian language production. So many "ehs" lost in time, like teardrops in the rain...

Rumor has it that Steven Pinker saw the book and immediately cried out, "Jones? TREY Jones? That guy owes me money!"

There are worse things you can do than spend $12.99 on pure linguistics fun.

Sunday, December 15, 2013

Why Big Data Needs Big Humanities

There's a new book out using Google's Ngrams and Wikipedia to discover the historical significance of people, places, and things: Who is Bigger? I have only taken a cursory glance at the web page, but it doesn't take a genius to see that the results look deeply biased, and it's no surprise why.

The two data sets they used, Wikipedia and Google's Ngrams, are both deeply biased towards recent, Western data. Wikipedia authors and editors are famously biased towards young, white, Western males. It's no surprise then that the results on the web page are obviously biased towards recent, Western people, places and things (not uniquely so, to be clear, but the bias is obvious imho).

The most glaring example is the complete absence of Genghis Khan from any of the lists. Khan is undeniably one of the most influential humans ever to have lived. In the book Destiny Disrupted: A History of the World Through Islamic Eyes, author Tamim Ansary referred to Khan as the Islamic world's Hitler. But he died in 1227 and mostly influenced what we in the West call the East.

Another example is the appearance of the two most recent US presidents, George W. Bush and Barack Obama, in the top ten of the top fifty most influential things in history. Surely this is a pure recency effect. How can this be taken seriously as historical analysis?

Perhaps these biases are discussed in the book's methodology discussion, I don't know. Again, this is my first impression based on the web page. But it speaks to a point I blogged earlier in response to a dust-up between CMU computer scientists and UT Austin grad students:

"NLP engineers are good at finding data and working with it, but often bad at interpreting it. I don't mean they're bad at interpreting the results of complex analysis performed on data. I mean they are often bad at understanding the nature of their data to begin with. I think the most important argument the UT Austin team make against the CMU team is this (important point underlined and boldfaced just in case you're stupid):
By focusing on cinematic archetypes, Bamman et al.’s research misses the really exciting potential of their data. Studying Wikipedia entries gives us access into the ways that people talk about film, exploring both general patterns of discourse and points of unexpected divergence.
In other words, the CMU team didn't truly understand what their data was. They didn't get data about Personas or Stereotypes in film. Rather, they got data about how a particular group of people talk about a topic. This is a well known issue in humanities studies of all kinds, but it's much less understood in sciences and engineering, as far as I can tell."

One of the CMU team members responded with the fair point that they were developing a methodology first and foremost, and their conference paper was focused on that. I agree with that point. But it does not apply to the Who is Bigger project, primarily because it is a full-length book that claims explicitly to be an application of computational methods to "measure historical significance". That is a bold claim.

To their credit, the authors say they use their method to study "the underrepresentation of women in the historical record," but that doesn't seem to be their main point. As the UT Austin grad students suggested above, the cultural nature of the data is the main story, not a charming subplot. Can you acknowledge the cultural inadequacies of a data set at the same time you use it for cultural analysis? That strikes me as unwise.

I acknowledge again that this is a first impression based on a web site.

UPDATE: Cass Sunstein wrote a thorough debunking of the project's methodology a few weeks ago, concluding that the authors "have produced a pretty wacky book, one that offers an important warning about the misuses of quantification."

Monday, December 2, 2013

Dictionary of American Regional English

One of the most useful things any research program in any field can do is provide a resource to other researchers. The Dictionary of American Regional English is a rich linguistic resource decades in the making, and it is now available online.

Here's a description from the project's About page:

The Dictionary of American Regional English (DARE) is a multi-volume reference work that documents words, phrases, and pronunciations that vary from one place to another place across the United States. 
Challenging the popular notion that our language has been "homogenized" by the media and our mobile population, DARE demonstrates that there are many thousands of differences that characterize the dialect regions of the U.S. 
DARE is based on face-to-face interviews carried out in all 50 states between 1965 and 1970 and on a comprehensive collection of written materials (diaries, letters, novels, histories, biographies, newspapers, government documents, etc.) that cover our history from the colonial period to the present. 
The entries in DARE include regional pronunciations, variant forms, some etymologies, and regional and social distributions of the words and phrases.
A striking feature of DARE is its inclusion in the text of the Dictionary of selected maps that show where words were found in the 1,002 communities investigated during the fieldwork.

Wednesday, October 16, 2013

Weka data mining and the power of the masses

I recently completed the five-hour Weka Data Mining MOOC, and I was very impressed. I beta tested the first week last March and was enthusiastic. My enthusiasm was warranted.

The core idea is not to teach data mining per se, but rather to teach the user-friendly GUI that makes data mining a simple matter of button clicks. It's a WYSIWYG approach to data analysis that could tip data mining past the point where everyone gets to play. For example, below is the Weka GUI with their sample diabetes data displayed:

Below is the same data set after the decision tree classifier J48 has been run (with default parameters).

This took me all of 45 seconds with zero programming (I'll agree with you that 73.8% accuracy is meh, if you'll agree with me that 45 seconds and default parameters is hella rad).

To be clear, the course is not a data mining course per se. Rather, it's a tutorial about the GUI. It shows you how to click buttons in order to load data sets, choose features, and run various learning algorithms like decision trees, Naive Bayes, logistic regression, etc. What it does not do is teach you how these algorithms work (with the minor exception of a nice decision tree video). More than anything else, this MOOC shows you how valuable Weka is for rapid prototyping. With this tool, you could run a dozen algorithms with a dozen feature variations over a data set in minutes. With ZERO programming!
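For readers who do want a peek under the hood, here's the simplest member of the decision-tree family, a one-level "decision stump" (far cruder than Weka's J48, and the data below is made up, not the Weka diabetes set). The idea is the same: find the single feature threshold that best separates the classes:

```python
# Toy data: (plasma glucose level, has diabetes?) pairs -- invented values,
# loosely echoing the kind of attribute in Weka's sample diabetes data.
data = [(85, False), (90, False), (110, False), (140, True),
        (150, True), (165, True), (120, False), (155, True)]

def best_stump(points):
    """Try each observed value as a threshold; keep the one that
    classifies the most training points correctly as 'x >= threshold'."""
    best_thresh, best_correct = None, -1
    for thresh, _ in points:
        correct = sum((x >= thresh) == label for x, label in points)
        if correct > best_correct:
            best_thresh, best_correct = thresh, correct
    return best_thresh, best_correct

thresh, correct = best_stump(data)
```

J48 does this recursively over many attributes with information-gain scoring and pruning, but the button-click workflow in the GUI is driving exactly this kind of search.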

I cannot stress enough how powerful this idea is. For those of you who don't appreciate how much more culturally powerful Microsoft Word is than LaTeX, you may not appreciate this power. It's the power of the masses. LaTeX does not have the power of the masses. Python does not have the power of the masses. But Weka has the potential to bring data mining to high school students, English majors, hipsters, unemployed copy writers, etc. Weka has made me more excited about the future of data mining than any other single tool.

Sunday, September 22, 2013

Ask Ziggy What?

Every now and then a new company or tech tool blips on my radar and piques my curiosity. Recently, I ran across Ask Ziggy, an NLP start-up based in Sacramento, California. No, that's not a typo; they really are based in Sacramento (well, technically Rocklin, the Arlington of Sacramento).

This piqued my curiosity first and foremost because it's about 90 minutes from where I grew up in the Sacramento Valley, an area well known as a hotbed of dry summer dust, but not known as a hotbed of NLP start-ups. Then again, one of the new hotbeds of tech is Austin, Texas. So hey, if it can happen in Austin, why can't it happen in Sacramento?

Before I go on, let me make it clear that I do not work for Ask Ziggy in any way and this is not a sponsored blog in any way. These thoughts are entirely my own. This is my personal blog and all content is my own and reflects my honest, personal opinions.

As I flipped through Ask Ziggy's web pages, four things occurred to me:
  1. "Ask Ziggy" as a brand is eerily reminiscent of "Ask Jeeves".
  2. Their core goal is making it easier for app developers to use NLP speech science.
  3. They have received $5 million in VC funding.
  4. Is this the start of a Sacramento NLP community?
1) Ask Jeeves: Most folks in the NLP community recall Ask Jeeves, a question answering search engine from the 1990s that was going to revolutionize search. Unfortunately, Google revolutionized search way better than they did, and Ask Jeeves was forced into a "business cycle" of layoffs, booms, layoffs, booms. Today, they're best known for that annoying Yahoo! toolbar extension.

2) Making Speech Science Easy: Since Ask Ziggy is currently in "private beta," I'm actually not exactly sure what they do, but it seems like they empower an app developer to allow a user to make relatively unconstrained natural language voice commands, and their NLP technology magically "figures out" what action is appropriate (given the app's basic goals and functionality). So, maybe a music app could allow a user to speak aloud "I wonder what MIA's new song sounds like?" and Ask Ziggy's tech figures out that that's equivalent to the action command [PLAY MIA New Song].
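I have no idea what Ask Ziggy actually does internally, but the naive baseline for this kind of utterance-to-action mapping, the thing their NLP presumably improves on, is a keyword-based intent matcher. Everything in this sketch (intents, patterns, the music-app framing) is invented for illustration:

```python
import re

# Hypothetical intents for a music app -- patterns and action names are made up.
INTENTS = [
    (r"\b(play|hear|listen|sounds? like)\b", "PLAY"),
    (r"\b(pause|stop|quiet)\b", "PAUSE"),
]

def parse_command(utterance):
    """Map a free-form voice command to an app action via keyword patterns."""
    text = utterance.lower()
    for pattern, action in INTENTS:
        if re.search(pattern, text):
            return action
    return "UNKNOWN"
```

A matcher like this maps "I wonder what MIA's new song sounds like?" to PLAY, but it falls over on anything phrased outside its pattern list, which is exactly the Siri-style brittleness a real NLP layer would need to fix.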

If that's true, then that would be awesome. It is a common complaint against Siri that it doesn't "understand" a lot of commands. Maybe Ask Ziggy is applying some bleeding edge NLP, informed by contemporary psycholinguistics, to bridge the gap. Dunno. It's not clear what their special sauce is from their promotional materials, but I like the idea of relieving average app developers of the burden of learning speech science just to add voice activation to their app.

3) Five Million Dollars! Maybe I'm jaded at this point, but $5 million in VC funding is a drop in the bucket in serious NLP development-land. $5 million buys maybe 2-3 years for a modest-sized group, maybe 5 years for a really small group. They received this funding near the end of 2012, and it's now near the end of 2013. They'd be lucky to have $3.5 million left, with the clock ticking. It's great to get VC funding, but it's greater to get customers. What is their plan for 2015? That's the money year, as far as I can tell.

4) Sacramento is the New Google? It's great to see Sacramento developing a tech community, especially in NLP. Unlike the energy industry, the computer tech industry doesn't need natural resources nearby, so it's not tied to geography like coal, oil, or natural gas. Any two-bit town can become a tech powerhouse (I'm looking at you, Redmond, Washington). Any community of practice fosters creativity and innovation. There is no a priori reason that Sacramento could not become a new generator of NLP technologies and innovation. It only requires the techies in that area to know each other, meet regularly, be open-minded, and ... oh yeah, have access to that $5 million in VC capital. That helps too.

Best of luck, Ziggy.