Wednesday, May 22, 2013

VerbCorner - crowd sourcing and verb meaning

Josh, a postdoc at Harvard, has initiated an online game called VerbCorner in order to crowd source the study of the meaning of verbs. How often do you and I, the little people, get a chance to contribute to Harvard quality linguistic research? Well, apparently quite a lot these days. Research is for the masses!

Here's Josh's explanation:
Dictionaries have existed for centuries, but scientists still haven't worked out the exact meanings for most words. At VerbCorner, we are trying to work out what verbs mean. Rather than try to work out the definition of a word all at once, we have broken the problem into a series of tasks. Each task has a fanciful backstory -- which we hope you enjoy! -- but at its heart, each task is asking about a specific component of meaning that scientists suspect makes up one of the building blocks of verb meaning.

Ultimately, we hope to probe dozens of aspects of the meaning of thousands of verbs. This is a massive project, which is why we need your help! We will be sharing the results of this project freely with scientists and the public alike, and we expect it to make a valuable contribution to linguistics, psychology, and computer science.
Being a verb meaning kinda guy myself, I'm very interested to see how this all plays out (literally and figuratively). My [defunct] dissertation was on verb semantics and Talmy's force dynamics. I'm really curious to see if Josh has included any Force Dynamics into this game.

Now, go play!

Tuesday, May 21, 2013

David Brooks, Word Classes, and Google Ngrams

David Brooks waxes poetic about word frequencies and the good old days in today's NYT: What Our Words Tell Us.

Update: Before reading my own most excellent original post below, here are two well respected linguists who fisk Brooks' article as well:

Robin Lakoff: What Our Words Don’t Tell Us. Money Quote:
It is hardly respectable scholarship to jump to the conclusion that changes in word frequency necessarily indicate changes in topics under discussion (new words may replace familiar ones but have similar meanings), and even if they do, it is very dubious – ethically questionable, you might say – to jump from there to the conclusion that these changes signify deep societal changes in the direction of moral decline, unless writers are prepared to make explicit and be prepared to defend their understanding of “morality” and “decline.” Social science is still, happily, distinguishable from theology.

Mark Liberman: Ngram morality. Money quote:
David Brooks doesn't mention this ideological and temporal inconsistency in his sources. In general, as I've noted in discussions of his earlier columns, his "unparalleled ability to shape an intellectually interesting idea into the rhetorical arc of an 800-word op-ed piece" crucially depends on skillful editing — or revision — of his raw materials into a form that fits his theme.
My Original Post
Brooks cherry-picks three recent Google Ngram analyses (by non-linguists) and provides paper-thin summaries of their findings, then concludes that America has lost its moral core. These analyses all depend crucially on the creation of word categories like “individualistic words” and “moral terms”. These are not quite synonyms*, but they require that the words in each class bear some semantic link to one another. This raises the question: Are these groupings natural? Is there something psychologically real about them?

Linguists care about word classes quite a bit (computational linguists even more so). There are ways of constructing naturalistic sets of words. However, Brooks says nothing about how these studies performed their categorizations, so I thought I would post a quick review as it's important in judging the validity of the results.

Twenge et al
The first study by Twenge et al (which he doesn’t link to, but I do below) followed a scientifically reasonable path to create their word sets. They asked 53 Mechanical Turk participants to generate “words characteristic of individualism and communalism.” Then, they had a different set of 55 Mechanical Turk participants rate those words on a 7-point Likert scale. The top 20 words were then used as their search set. FYI, here are their two sets:

Individualistic
independent, individual, individually, unique, uniqueness, self, independence, oneself, soloist, identity, personalized, solo, solitary, personalize, loner, standout, single, personal, sole, and singularity
Communal
communal, community, commune, unity, communitarian, united, teamwork, team, collective, village, tribe, collectivization, group, collectivism, everyone, family, share, socialism, tribal, and union


UPDATE: For more on Twenge, commenter "unknown" helpfully suggests these Language Log posts:
Textual narcissism by Liberman
Textual narcissism, replication 2 by Liberman
It's all about who? by Liberman

Kesebir and Kesebir
Kesebir and Kesebir did 2 studies. In study one, they took ten words they found as synonyms of “virtue” in an unnamed thesaurus and searched Google’s Ngram for those words. Here are the ten: character, conscience, decency, dignity, ethics, morality, rectitude, righteousness, uprightness, and virtue.

In their second study, they constructed a set of 80 virtue words taken from websites about virtue in literature (e.g., honesty, patience, honor) then asked participants to rate each one as No = -1, Perhaps = 0, and Yes = 1. Then they took the 50 words with the highest averaged rating and searched Ngrams for frequency.
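For the curious, the "rate, average, keep the top k" step that both Twenge et al and the Kesebirs describe can be sketched in a few lines of Python. This is only my illustration of the general idea, not the authors' code: the function name top_rated_words and the scores below are made up, and only the three example words and the -1/0/1 scale come from the description above.

```python
from statistics import mean

def top_rated_words(ratings, k):
    """Return the k words with the highest mean rating.

    ratings maps each candidate word to the list of scores raters gave it.
    Ties are broken alphabetically so the result is deterministic.
    """
    averaged = {word: mean(scores) for word, scores in ratings.items()}
    ranked = sorted(averaged, key=lambda w: (-averaged[w], w))
    return ranked[:k]

# Invented scores on the No = -1 / Perhaps = 0 / Yes = 1 scale (not real data)
toy_ratings = {
    "honesty": [1, 1, 1, 0],
    "patience": [1, 0, 1, 0],
    "honor": [1, 1, 0, 1],
}

print(top_rated_words(toy_ratings, 2))
```

Note that nothing in this step checks whether raters agree with each other; a word can make the cut on a high average even when individual ratings are all over the place.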

Klein
Klein unapologetically gives no motivation for his word sets whatsoever. A “very casual paper” indeed.

The Problem
While I respect the attempt of the first two sets of authors to add some psychological reality to their linguistic categories, they fall for the same naïve assumption that plagued linguistics for hundreds of years: that people's conscious judgement of meta-linguistics is valid. For example, syntacticians discovered the folly of grammaticality judgments. I have been involved recently in a number of Mechanical Turk ratings tasks and we're finding that it is very difficult to get consistent ratings. I believe the same issue is at play here. Plus, ratings can easily be affected by context like surrounding text, yet none is given in these tasks. It's not clear what it means to rate isolated words. Word semantics by their very nature are contextual.

UPDATE: Commenter Arjan rightly brought up the great acceptability debates. One could claim that I am unfairly dismissing grammaticality judgments. And one could claim that I am not. The good folks at MIT's Tedlab have posted a few excellent resources on multiple sides of the controversy. Look under the 2010 heading on this page.

Words are not thought. These studies seem to be a variation on the “No word for X” syndrome (see here for a recent rant). Certain types of words may be used more or less frequently over some time-scale (like one century), but that doesn’t necessarily mean that we are thinking differently over that time-scale.

Unlike Brooks, I’ll link to the actual papers (all free, but the second two require email registration):

Increases in Individualistic Words and Phrases in American Books, 1960–2008. Jean M. Twenge, W. Keith Campbell and Brittany Gentile

The Cultural Salience of Moral Character and Virtue Declined in Twentieth Century America. Kesebir and Kesebir

Ngrams of the Great Transformations. Daniel B. Klein

UPDATE: *Rumor has it that WordNet has copyrighted the term "synset", so I'm being careful to avoid their cease and desist letter. Anyone know if there's truth to this rumor?

Saturday, May 18, 2013

Book Reviews

A quick (very self-serving) link fest. Here are the cognitive linguistics related book reviews I've written:

1. Adam's Tongue: How Humans Made Language, How Language Made Humans. By Derek Bickerton.

2. Louder Than Words: The New Science of How the Mind Makes Meaning. By Benjamin K. Bergen.

3. Through the Language Glass: Why the World Looks Different in Other Languages. By Guy Deutscher.

Thursday, May 16, 2013

heard tell 'bout them linguistic constructions yonder

Randomly, my mind wandered onto an older English slang phrase, "heard tell", which is culturally associated with rural and working-class speech. It means something like "I heard about X." Before I get to the interesting role of prepositions, here are some examples:

1881
a) When you say that you knew about, do you mean that you have heard tell about other things?
1900
a) Maria and me concluded that we had struck one o' them gamblin' places we'd heard tell about, and I tell ye we got out in a hurry!
b) I never heard tell of it until I was told by Justice Bolte about it.
1901
a) "I niver heard tell that you had an owl in your parlor chimney," said he, sort o' suspicious-like.
b) And we had a gentleman in our county that perhaps most of you have heard tell of,
1904
a) I asked him if he had ever heard tell of a house they called the House of Shaws.
b) "Never heard tell of him," said John William, making spectacles of his burnished bores, and looking through them into the sunlight.


Originally, I selfishly assumed "heard tell" was an American English slang construction, but upon a little Google Ngram sleuthing, I discovered it is a common English construction.

American English


British English


All


Though details differ, the general pattern is clear: A general rise in frequency throughout the late 1800s, peaking at the turn of the century, a general decline throughout the 20th century, then a leveling off around the mid 1970s. The American English usage was a lot more unpredictable, a bit choppy, but generally follows the same pattern.

But what interests me most is the red line in all of the above graphs. That's the one that plots the frequency of the tri-gram "heard tell of". This stung me a bit because my brilliant native speaker intuition strongly preferred "heard tell about", but in this I am in the minority.

For the graph impaired: What the red line in the above graphs tells us is that when English speakers have used "heard tell", they most frequently follow it with "of" even though they have a semantically similar choice available, namely "about" (and even "that").

I don't have a semantic analysis of the difference between prepositions (though I don't doubt an interesting one could be engineered), but after I consoled my wounded linguistic pride, I then realized that the construction with the preposition "of" strongly tracks the overall pattern. The bigram search "heard tell" is a more general and inclusive one, hence its results include all of "heard tell of" results as well. If you imagine taking away everything underneath the red line, there would not be much left, less than half to be sure. This means that "heard tell of" accounts for more than half of all instances of "heard tell".
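The "take away everything underneath the red line" reasoning is just a share calculation: count every "heard tell", count what follows it, and divide. Here is a rough Python sketch of that arithmetic. To be clear about what is assumed: the function name continuation_shares is mine, and the input is only fragments of the seven example sentences quoted earlier in this post, not the Google Ngram counts behind the graphs, so the numbers it produces say nothing about the real distribution.

```python
import re
from collections import Counter

def continuation_shares(sentences, bigram=("heard", "tell")):
    """Count which word follows the bigram and return each follower's share.

    Occurrences of the bigram at the very end of a sentence have no
    follower and are not counted.
    """
    followers = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i in range(len(tokens) - 2):
            if (tokens[i], tokens[i + 1]) == bigram:
                followers[tokens[i + 2]] += 1
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in followers.items()}

# Fragments of the example sentences quoted above (a toy sample, not a corpus)
examples = [
    "do you mean that you have heard tell about other things?",
    "them gamblin' places we'd heard tell about, and I tell ye",
    "I never heard tell of it until I was told",
    "I niver heard tell that you had an owl",
    "perhaps most of you have heard tell of",
    "if he had ever heard tell of",
    "Never heard tell of him",
]

shares = continuation_shares(examples)
for word, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"heard tell {word}: {share:.0%}")
```

With real data you would feed in the yearly bigram and trigram counts instead of sentences, but the division is the same.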

I don't know why "of" became so tightly associated with the "heard tell" construction, but it struck me as a nice example of how construction semantics are not necessarily compositional, intuitive, or even necessarily coherent. I wonder if these three choices "heard tell of/about/that" have regional variances? We will need a more nuanced corpus to tease that out.

Monday, May 13, 2013

Pullum’s NLP Lament: More Sleight of Hand Than Fact

My first reading of both of Pullum’s recent NLP posts (one and two) interpreted them to be hostile, an attack on a whole field (see my first response here). Upon closer reading, I see Pullum chooses his words carefully and it is less of an attack and more of a lament. He laments that the high-minded goals of early NLP (to create machines that process language like humans do) have not been reached, and more to the point, that commercial pressures have distracted the field from pursuing those original goals, hence they are now neglected. And he’s right about this to some extent.

But, he’s also taking the commonly used term "natural language processing" and insisting that it NOT refer to what 99% of people who use the term use it for, but rather only a very narrow interpretation consisting of something like "computer systems that mimic human language processing." This is fundamentally unfair.
In the 1980s I was convinced that computers would soon be able to simulate the basics of what (I hope) you are doing right now: processing sentences and determining their meanings.
I feel Pullum is moving the goal posts on us when he says “there is, to my knowledge, no available system for unaided machine answering of free-form questions via general syntactic and semantic analysis” [my emphasis]. Pullum’s agenda appears to be to create a straw-man NLP world where NLP techniques are only admirable if they mimic human processing. And this is unfair for two reasons.

One: Getting a machine to process language like humans is an interesting goal, but it is not necessarily a useful goal. Getting a machine to provide human-like output (regardless of how it gets there) is a more valuable enterprise.

Two: A general syntactic and semantic analysis of human language DOES. NOT. EXIST. To draw back the curtain hiding Pullum’s unfair illusion, I ask Pullum to explain exactly how HUMANS process his first example sentence:
Which UK papers are not part of the Murdoch empire?
Perhaps the most frustrating part of Pullum’s analysis so far is that he fails to point the blame where it more deservedly belongs: at linguists themselves. How dare Pullum complain that engineers at Google don’t create algorithms that follow "general syntactic and semantic analysis" when you could make the claim against linguists that they have failed to provide the world with a unified "general syntactic and semantic analysis" to begin with!

Ask Noam Chomsky, Ivan Sag, Robert van Valin, and Adele Goldberg to provide a general syntactic and semantic analysis of Pullum’s sentence and you will get four vastly different responses. Don’t blame Google for THAT! While commercial vendors may be overly-focused on practical solutions, it is at least as true that academic linguists are overly-focused on theory. Academic linguists rarely produce the sort of syntactic and semantic analyses that are useful (or even comprehensible … let alone UNIFIED!) to anyone outside of a small group of devotees of their pet theory. Pullum is well known to be a fierce critic of such linguistic theory solipsism, but that view is wholly unrepresented in this series of posts.

In his more recent post, Pullum insists again that commercial NLP is tied to keyword searching, but this remains naïve. Pullum does his readers a disservice by glossing over the now almost 70 years of research on information theory underpinning much of contemporary NLP.

Also, Pullum unfairly puts Google search at the center of the NLP world as if that alone represents the wide array of tools and techniques that exist right now. This is more propaganda than fact. He does a disservice by not reviewing the immense value of ngram techniques, dependency parsers, WordNet, topic models, etc.

When he laments that Google search doesn’t "rely on artificial intelligence, it relies on your intelligence", Pullum also fails to relate the lessons of Cyc Corp and the Semantic Web community which have spent hundreds of millions of dollars and decades trying to develop smart artificial intelligence approaches with comparatively little success (compared to the epic scale success of Google et al). In this, Pullum stacks the deck. He laments the failure of NLP to include AI without reviewing the failure of AI to enhance NLP.

I actually agree that business goals (like those of Google) have steered NLP in certain directions away from the goal of mimicking human language, but to dismiss this enterprise as a failure is unfair. It may be that NLP does not mimic humans, but until [we] linguists provide engineers with a unified account of human language, we can hardly complain that they go looking elsewhere for inspiration.

And for the record, there does exist exactly the kind of NLP work that attempts to incorporate more human-style understanding (for example, this). But boy, it ain’t easy, so don’t hold your breath Geoff.

If Geoff has some free time in June, I recommend he attend The 1st Workshop on Metaphor in NLP 2013.

Saturday, May 11, 2013

Pullum thinks there are no NLP products???

Famed linguist Geoffrey Pullum has a recent Chronicle of Higher Education post about NLP: Why Are We Still Waiting for Natural Language Processing? As a linguist, I deeply respect Geoff Pullum's reputation for fierce skepticism, but this recent post borders on the ornery old man syndrome.

First of all, Powerset didn't die when Microsoft bought them. Their technology is part of Bing search*. That's not death. Powerset technology is used by millions of people today, whereas before it was used by 3 guys in a SoMA cubicle. And to call Bing "a plain old keyword-based search engine" is a bit naïve.

Also, Pullum's claim that there are "absolutely no commercial NLP products" is flat bonkers. There are thousands of commercially viable and profitable NLP products. Just ask Clarabridge, Nuance, or BBN.

I'll grant that Pullum is somewhat correct that question answering hasn't matched the expectations it raised in the 1990s, but it's much more sophisticated than he lets on. How does Pullum not even mention Siri or the host of Android competitors? Yes the results are hit-or-miss, but they exist.

As a [somewhat former] linguist, I don't see the fact that NLP hasn't yet managed to mirror natural language as a reason to lament. Rather, I celebrate that it exposes just how complex natural language is and the fact that the sheer computing power the likes of Google, Apple, and Microsoft can throw at it still ain't enough.

What I would like to see is tech companies hiring more *real* linguists. During the first NLP boom of the 90s, companies hired many linguists (my first NLP job was at an early Q and A start-up). Then, after the bust and with the rise of statistical machine learning, tech companies now hire engineers almost exclusively (except for contract jobs annotating data). I'm seeing more and more engineers learning some linguistics and getting jobs, whereas I suspect we'd be better off the other way around.

Anyhoo, NLP is alive and well Geoff. Geesh...

PS - I know Pullum is well aware of everything I've pointed out. He's ginning up the crowd for his series of posts about where NLP went wrong (which I'm looking forward to). But, he runs the risk of leading naïve readers down a false path. There ARE people who have no clue about all the great stuff NLP has done in the last 30 years and after reading Pullum's article, they'll think that's a fair assessment of the state-of-the-art, when it is not.

*UPDATE (5/12/13): I may have overstated this. A little birdie tells me that "not much Powerset technology" was actually incorporated into Bing. Disappointing, but I don't think this undermines my main point that Pullum mis-represents the state of commercial Q and A tech.

Friday, April 12, 2013

The birth of a metaphor

A Twitter exchange with Ben Zimmer over the metaphorical use of the phrase "pause button" in the new TV show The Americans (set in 1981) led me to think about how metaphors begin their lives. I didn't watch the episode in question, but apparently several viewers noticed that the show used the phrase "pause button" metaphorically to mean something like to put a romantic relationship on hold.

Ben tweeted this fact as a likely anachronism, presumably because the technology of pause buttons was too young in 1981 to have likely jumped to metaphorical use by then. I was not the only one who immediately took to Google Ngrams to start testing this hypothesis. In the end, Tweeter @Manganpaper found a good example from 1981 from some kind of self-help book.

But what interests me is an example I found from 1987:
Consumers have pushed the "pause" button on sales of video-cassette recorders, for years in the fast-forward mode.
Ben reluctantly conceded the example:

I'd have to review my historical linguistics books, but I don't think words necessarily shift their meanings radically all at once. I believe they can take on characteristics of associated meanings slowly, thus widening or narrowing their meaning as their linguistic environment unfolds. Eventually, a word can come to mean something quite radically different than it originally meant. I see no reason that the life of a metaphor could not follow a similar trajectory. Ben objected to the fact that the 1987 use of "pause button" I linked to was semantically linked to the literal use of actual pause buttons because it dealt with the conceptual space of VCR sales. But my hunch is that this is how many metaphors start their lives, making small conceptual leaps, not big ones. I could be wrong though. The sad truth is that finding good empirical data for the life span of metaphors is extremely difficult. The fact is that even with the awe inspiring large natural language data sets currently available in many languages, studying a linguistically high level data type like metaphor remains out of reach of most NLP techniques.

But this is why our NLP blood boils. There are miles to go before we sleep...

Monday, February 18, 2013

So You Want To Be A Text Analyst? Get Your Hands Dirty.

Yet again, I find myself responding to posts on a Text Analytics discussion board and I'd like to broadcast my response to a wider audience. After posting a link to my review of IBM's Text Analytics platform, someone posted this request: I need the techniques and methods, how to extract a knowledge from a text!!

First, I want to thank this questioner for the double exclamation points, because if there had been only one, I would have ignored the question as banal and unworthy of a response.

Second, I feel the need to invoke the code of Unfrozen Caveman Lawyer. I'm just a linguist. I'm frightened by the counting machines and the blinking lights. When I see a gradient descent algorithm, I think 'Oh no! Will it converge on a global minimum more quickly if features have similarly scaled values?' I don't know. Because I'm just a linguist -- that's the way I think.

My point being that the tools and techniques underlying text analytics are the math and coding part and they require a special effort to learn. That's why IBM's software is so enticing. They put the math and coding stuff under the hood.

If you want to learn the math and coding stuff, I highly recommend the following:

Third, learning the algorithms requires walking through the math with data. It takes several months of regular effort to gain competency, but once accomplished, the rewards are well worth it. There are lots of free data sets these days, like the Enron Corpus.

Fourth, it is a fair question about the actual techniques, but the actual extraction techniques are far more complicated than any one blog post can address. This is why people complete degrees in computational linguistics or data science. There are many complicated issues and algorithms to gain competency in and that takes time. It's like learning golf or poker. Learning the theory ain't good enough. You have to get your hands dirty. This was my main point about IBM's platform. It does a lot of the work for you, under the hood.

The harder question is this: What does it mean to extract info from text? Business Intelligence is a profitable sector, but what counts as BI extracted from text? Sentiment? Topics? Every business must answer this for themselves. I did a brief review of books on Amazon under the search business intelligence and I was overwhelmed by the deluge of empty jargon. That's not to say that there isn't important stuff there, but the people who write books about it are probably not the right people to learn from. I know it can be frustrating for outsiders to a technical field listening to insiders throw buzzwords about without understanding the basics. I am reminded that California's Lieutenant Governor Gavin Newsom was a guest on Colbert recently to talk about reinventing government in a digital age (video). What transpired was a ten minute stroll down bullshit lane. Newsom spewed forth the most inane and banal set of memorized talking points which made him look like the most out-of-touch, kiss-your-baby, shake-your-hand, eat-your-gramma's-homemade-pie, do-whatever-it-takes-to-get-your-vote slimy politician since Pappy O'Daniel. And Colbert called Newsom out on his bullshit too. Good for you Stephen.

Yet I'm still befuddled by the idea of "business intelligence" and those books did me no good. Right now, I think that anything that helps someone make an extra dollar of profit = business intelligence. And there are definitely many extra dollars of profit left lying around in language data.

Friday, February 15, 2013

Crazy Question - The Primacy of Nouns or Verbs?

A question was posted recently in a Text Analytics group discussion on a well known social networking site (sorry, not Facebook). I posted an answer and thought it was worth broadcasting to a wider audience, but I recognize the murky new technology ethics involved. It is a closed networking site, though it has lots of open access for non-members. I don't want to steal the thunder of the original poster who asked the question in all earnestness. And I gave my full answer within the confines of that site, but it is MY answer, after all. I feel I own it. And it is a question and answer interaction I think a lot of non-linguists might benefit from and I have the means to distribute it beyond the semi-walled garden of the original site. Imma post this modified version* and let you, dear reader, decide the ethics for yourselves**.

Original Question
I have a crazy question: In language, which is formed first, a noun or a verb? I think it is a noun. We know 'Google' as a noun from when it started as a brand name. Now we use it as a verb. When a language evolves, naming should happen first, right?  Naming of actions and entities. That itself is noun. After that, we are defining different forms of verbs.
My Answer
As a linguist, I'd have to disambiguate the question before beginning an answer. There are (at least) three variants of the question. There's no simple answer to any of these questions, but getting the question right is often the best starting point:
  1. In the contemporary evolution of a new language (e.g., pidgins), what parts of speech (POS) form first?
  2. In the development of language in a child, which POS is learned/utilized first?
  3. In the brain, which POS is the base or most salient form of ambiguous words like "Google"? 
To begin an answer, it's important not to confuse the cognitive act of labeling events in the world (the gavagai problem) with labeling parts of speech (POS) like Noun/Verb/Adjective. POS are used by linguists to identify how words behave in the grammatical structure of a sentence. POS tell us what structural rules a word follows in a particular grammatical structure, not what a word refers to in the world. POS are syntactic objects, not semantic ones. Once this distinction is clear, then issues are made more plain.

I’ll use the English “Google” example from the original question to illustrate (but let’s be aware that brand names like Google, Kleenex, and Xerox have their own weird, unique linguistic life, so this is not a perfect example).

When “Google” is used as a noun in English, it can function as the Subject of a sentence. For example, “Google rolled out a new service today.” As a noun, it can be counted and take plural morphology and count determiners like “one” and “two”. For example, “There are not two Googles, there is only one Google.” As a verb, it can take tense morphology like past tense -ed. For example, “I googled around for a new phone yesterday.”

Contrast this purely syntactic analysis with how words are learned by children. When a child is at the one word stage, she may use a single word to label a whole series of events and objects (known as holophrasis). For example, imagine playing with a one year old by picking her up, twirling her around, then setting her down and she giggles. After you set her down, she holds her arms up to you and says “up.” What she wants is for you to go through the whole series of events again. She is not using that one word to refer to one discrete object in the world. She is referring to a holistic series of events.

In that situation, what POS is “up”? Is it a preposition? A verb? A noun? It’s none of the above. It is simply not appropriate in a linguistics sense to give the word “up” any POS under these circumstances because it is not functioning with the grammar of a structured sentence.

To return to the original question: what the questioner is calling naming is not the same thing as the POS “noun”. Labeling [events and objects in the world] and labeling POS are fundamentally different, though there is some rough but buggy correlation in some languages, but it's all very messy.

As to what comes first in the evolution of language, that’s a deeply complicated topic with no clear answers. We have little to no direct evidence for how languages evolved originally. This is not to say that there aren't some very smart theories. For a readable lay introduction, I recommend the book “Adam’s Tongue” by Derek Bickerton (I do not endorse the conclusions in that book, but I do recommend it as a good, readable intro to the issues of language evolution for the lay reader). For more detailed analysis and up-to-date discussions of language evolution I can highly recommend the group blog Replicated Typo.

That's my first pass attempt at answering the question as originally posed. I'm happy to accept dissents, revisions, updates, addenda, rude noises, porn links, kitten pictures, and hot stock picks.

*I naturally re-worded and added clarification to my hastily composed original, so this version is its own unique product.
**I cleaned up the original question for clarity and brevity.

Thursday, February 14, 2013

IBM SPSS Text Analytics - STAS vs. Word Clouds

This is the fourth and final post in a series about IBM's SPSS Text Analytics platform STAS (first intro post here, second on a linguist's perspective here, third on Shakespeare's sonnets here).

As I wrote in my third post, I want to run a bake off between the word frequency analyses of the President's State Of The Union (SOTU) speech last night and STAS's more in-depth tools. I am no fan of word clouds and simplistic word frequency counts (see my discussion here or Mark Liberman's discussion of word count abuse in political punditry here), but I'm trying to put myself into the shoes of a novice with no NLP, text analytics, or linguistics background. Someone who wants a quick and simple way to analyze language in an objective but meaningful way.

First, let's look at a word cloud of President Obama's 2013 SOTU (text here; note I deleted all instances of the string "(Applause.)"). I used the free online service Wordle to create this word cloud:


Jobs, America, people, new, work, now, get, like... Huh? A pretty incoherent representation of the speech. Plus, it failed to stem, so "American" and "Americans" are treated as different. Imagine I didn't tell you which president's SOTU this was or which year, could you use this word cloud to make a guess with any kind of certainty? It looks like pretty generic SOTU stuff. Could have been Carter 78, Reagan 85, Bush HW 91, Clinton 95, Bush W 2001, Obama 2007.
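For readers who want to see what sits underneath a word cloud, here is a minimal Python sketch of a word frequency count that drops stop words and folds simple plurals together, so "American" and "Americans" land in the same bucket. Everything here is an assumption for illustration: I have no idea how Wordle is implemented, the stop list is a tiny hand-picked one, crude_stem is a stand-in for a real stemmer, and the sample sentence is invented rather than taken from the speech.

```python
import re
from collections import Counter

# A tiny illustrative stop list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for", "we", "our"}

def crude_stem(word):
    """Strip a final 's' from longer words. A stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def word_frequencies(text):
    """Lowercase, tokenize on letters, drop stop words, then count crude stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

# Invented sample text, not a quote from the speech
sample = "Americans work. The American people work for new jobs, and jobs are for Americans."
frequencies = word_frequencies(sample)
print(frequencies.most_common(3))
```

The crude stemmer will happily mangle words like "this" into... well, it is protected here by the length check, but real text needs a real stemmer or lemmatizer. The point is only that stemming is a deliberate preprocessing step, and a tool that skips it will split counts across word forms.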

There are other word count options for the lay reader. For example, the UK site WriteWords provides a nifty free online frequency counter for any text. Using that tool, it's easy to get the full word count, but without stop words deleted, it's even less useful than the word cloud above. Here are the top 100 most frequent words in SOTU 2013:


However, the WriteWords online tool offers something a little more useful, an ngram counter for finding strings of words (they call it a Phrase Frequency Counter) which allows anyone to find ngrams up to 10-grams. I'm giving them a kudos for offering this nifty tool. I used it to discover the frequent bigrams, trigrams, and finally I used it to discover the longest ngram that Obama repeated.
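A phrase frequency counter of this kind is simple to sketch. The Python below counts n-grams of a given length and then searches downward from 10-grams for the longest one that repeats. It is my own rough illustration, not how WriteWords does it; the function names are hypothetical and the sample sentence is invented.

```python
import re
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def longest_repeated_ngram(tokens, max_n=10):
    """Return the longest n-gram (n <= max_n) that occurs at least twice, or None."""
    for n in range(min(max_n, len(tokens)), 0, -1):
        repeated = [gram for gram, count in ngram_counts(tokens, n).items() if count > 1]
        if repeated:
            return repeated[0]
    return None

# Invented sample text, not a quote from the speech
sample = "We can do this, and we can do that."
tokens = re.findall(r"[a-z']+", sample.lower())

print(ngram_counts(tokens, 2).most_common(2))
print(longest_repeated_ngram(tokens))
```

Because the tokenizer throws away punctuation, n-grams here can run straight across sentence boundaries, which is one easy source of junk in longer n-gram lists.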


You'll notice there's some noise in the trigrams. The unigrams "than" and "in" got caught up in the trigrams list. They were not the only ones, just the only ones in the top 25. When I ran 4-grams and 5-grams, the unigram noise got worse (Noise. Whaddayagonnado?). This just goes to reinforce the point made by @heatherfro on Twitter recently: "75% of text analysis is text prep RT @Adam_Crymble Identifying and Fixing Transcription Errors in Large Corpora". No matter how you slice it, humans need to prep their documents before NLP processing. We may never solve that problem.
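The ngram counting described above can be sketched in a few lines of Python. This is an illustration only, not how WriteWords works (I have no idea what causes its unigram noise). The function name `ngram_counts` is my own, and the sketch ignores sentence boundaries, so an ngram can straddle two sentences.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count every run of exactly n consecutive words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n staggered copies of the token list yields sliding windows
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(w) for w in windows)

if __name__ == "__main__":
    sample = "We can do this. We can do that."
    for gram, count in ngram_counts(sample, 3).most_common(2):
        print(count, gram)
```

Because every window has exactly n words, a trigram list built this way cannot contain stray unigrams, though it will still inherit whatever junk is in the input text.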

Does IBM's STAS offer anything better?

In terms of text prep, hell no. In fact, I almost gave up this series of posts because the pre-processing text prep was such a pain. This is because STAS expects language data to be ingested via cells in a spreadsheet. This required me to dump the SOTU speech into a cell. But the text was too large for a single cell (the spreadsheet accepted it, but STAS refused to open it), so I had to break it up into paragraphs, one per cell. Ugh! Those are too many minutes of my life I'll never get back.
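In hindsight, the one-paragraph-per-cell chore could be scripted. Below is a minimal Python sketch under some loud assumptions: that paragraphs in the transcript are separated by blank lines, that a one-column CSV is an acceptable stand-in for the spreadsheet (I did not test what STAS accepts), and that the column header `text` is arbitrary. It also strips the "(Applause.)" markers I deleted by hand.

```python
import csv
import io
import re

def paragraphs(text):
    """Split on blank lines, drop "(Applause.)" markers and empty chunks."""
    text = text.replace("(Applause.)", "")
    parts = re.split(r"\n\s*\n", text)
    # Collapse internal line breaks and extra spaces within each paragraph
    return [" ".join(p.split()) for p in parts if p.strip()]

def write_paragraph_csv(text, out, header="text"):
    """Write one paragraph per row to an open text file-like object."""
    writer = csv.writer(out)
    writer.writerow([header])  # hypothetical column name
    for p in paragraphs(text):
        writer.writerow([p])

if __name__ == "__main__":
    speech = "First paragraph,\nsecond line. (Applause.)\n\nSecond paragraph.\n"
    buf = io.StringIO()
    write_paragraph_csv(speech, buf)
    print(buf.getvalue())
```

To write a real file, open it with `open(path, "w", newline="")` and pass the handle as `out`.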

But on to the bake off. First, to compare apples to apples, it's only fair that I use STAS as is, with its default parameters*. With a little effort, STAS would undoubtedly yield better results than those I discuss below. But I don't have the ability to tune Wordle or WriteWords, so I shouldn't tune STAS either. Nonetheless, even with minimal effort, STAS provides several more nuanced analyses than the previous tools.

For example, as we saw with Shakespeare's sonnets in the third post in this series, STAS auto-clusters words and phrases by synonym class, which it calls concepts. While the average person can generally intuit synonyms, doing so by hand is cumbersome for a text as long as a SOTU speech, or for 5,000 tweets, or 1,000 customer responses to a survey. Having this done for you automatically is a nice feature. One of the interesting concepts that STAS clustered was problem, containing the terms problem, issue, struggling, complicated, hard (remember, for STAS, a concept is a set of synonyms with the most salient term used as the concept name). Instead of simply displaying raw frequencies, STAS provides us with some organization and color coding, helping us see patterns that might not be intuitively obvious at first blush.
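To make the idea of a concept concrete, here is a toy Python sketch that rolls word counts up into a concept using a hand-built synonym table. This is emphatically not how STAS builds its clusters (that is internal to the product). The `CONCEPTS` table simply hard-codes the one cluster mentioned above, and the function name is my own.

```python
import re
from collections import Counter, defaultdict

# Hand-built table: concept name -> member terms (copied from the STAS
# "problem" cluster described above; everything else here is assumed).
CONCEPTS = {
    "problem": {"problem", "issue", "struggling", "complicated", "hard"},
}

def concept_counts(text, concepts=CONCEPTS):
    """Return (hits per concept, per-term hits within each concept)."""
    term_to_concept = {t: c for c, terms in concepts.items() for t in terms}
    totals = Counter()
    members = defaultdict(Counter)
    for tok in re.findall(r"[a-z']+", text.lower()):
        concept = term_to_concept.get(tok)
        if concept is not None:
            totals[concept] += 1
            members[concept][tok] += 1
    return totals, members

if __name__ == "__main__":
    sample = "This is a hard problem. The issue is complicated, and it is hard."
    totals, members = concept_counts(sample)
    print(totals["problem"], dict(members["problem"]))
```

The hard part, of course, is building the synonym table automatically, which is exactly the work a tool like STAS does for you.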


For those who are seasoned NLPers, Digital Humanists, or academics, this may seem trivial. But it's important to come at this from the perspective of the average employee of a firm who has neither the time nor the inclination to learn R, Python, or any other superior data analysis tool set. The sort of person who is intrigued by word clouds would be much better served by STAS. I'm convinced that's true.

But now we get to something STAS provides that is much harder to find in a user friendly tool: sentiment analysis. Sentiment analysis has exploded in recent years and is now one of the hottest industry NLP applications since spell checkers. I have a particular soft spot in my heart for sentiment analysis because I went to a graduate school that can be said to be one of its birthplaces: SUNY Buffalo. Lost in the hustle and bustle of modern day SA is the history of Jan Wiebe's seminal 1990 dissertation Recognizing Subjective Sentences: A Computational Investigation of Narrative Text in the CS department at SUNY Buffalo. I first heard her give a talk at Buffalo on emerging sentiment analysis techniques in the late 90s (damn! that looks weird now that I typed that, but yeah, it was the frikkin 90s. I guess I'm old. I feel 17. Is that normal?).

Anyhoo... STAS can provide automatic sentiment analysis of phrases. For example, using its default parameters, it found 67 positive terms and 57 negative terms. Note that it associates sentiment with Types, which are higher order groupings like Locations, Organizations, People (common information extraction stuff). Some of its positive terms were thanks, enjoy, work hard, great, solve, interests, reasonable, worth. Here's the screenshot. Note the lower left pane lists the set of types STAS clustered and the right pane lists the actual paragraphs from the SOTU.


Another interesting sentiment analysis result is the type "PositiveBudget". If you look at that lower left pane, you'll see that "Budget" is a type by itself (think of this as a set of budget-related entities) and STAS provides a subset of budget items that have a positive sentiment attached: economic, free, affordable. The screenshot below shows some of these items:


I have only scratched the surface of the capabilities of STAS and my 14 day trial ends tomorrow, so this is my farewell to this tool. I am disappointed I didn't have more time to really dig deep. I think it's a well designed UI with some serious NLP behind it.

The take home is this: if you want a lot of people to use your product, you better make it usable for a lot of people. Obscure APIs with dense or non-existent documentation are fine for the hacker set, but the average person is not gonna learn R, Python, Scala, MapReduce, or Hadoop. Most professionals didn't want to learn SPSS in the first place; they sure as hell don't want to learn something else. Don't punish your customers just because they don't want to learn your pet framework. STAS wins because it doesn't punish its customers. It helps them.

NOTES
*not only does this make the comparison fair, but it cuts down on my learning curve requirements :-)

**Let me reiterate: I do not work for IBM and this is not a sponsored blog in any way. These thoughts are entirely my own. I once worked for IBM briefly over 5 years ago and I still get the occasional IBM recruiter contacting me about opportunities, but this is my personal blog and all content is my own and reflects my honest, personal opinions.