As I wrote in my third post, I want to run a bake off between the word frequency analyses of the President's State Of The Union (SOTU) speech last night and STAS's more in-depth tools. I am no fan of word clouds and simplistic word frequency counts (see my discussion here or Mark Liberman's discussion of word count abuse in political punditry here), but I'm trying to put myself into the shoes of a novice with no NLP, text analytics, or linguistics background. Someone who wants a quick and simple way to analyze language in an objective but meaningful way.
First, let's look at a word cloud of President Obama's 2013 SOTU (text here; note I deleted all instances of the string "(Applause.)"). I used the free online service Wordle to create this word cloud:
Jobs, America, people, new, work, now, get, like... Huh? A pretty incoherent representation of the speech. Plus, it failed to stem, so "American" and "Americans" are treated as different. Imagine I didn't tell you which president's SOTU this was or which year, could you use this word cloud to make a guess with any kind of certainty? It looks like pretty generic SOTU stuff. Could have been Carter 78, Reagan 85, Bush HW 91, Clinton 95, Bush W 2001, Obama 2007.
There are other word count options for the lay reader. For example, the UK site WriteWords provides a nifty free online frequency counter for any text. Using that tool, it's easy to get the full word count, but without stop words deleted, it's even less useful than the word cloud above. Here are the top 100 most frequent words in SOTU 2013:
However, the WriteWords online tool offers something a little more useful, an ngram counter for finding strings of words (they call it a Phrase Frequency Counter) which allows anyone to find ngrams up to 10-grams. I'm giving them a kudos for offering this nifty tool. I used it to discover the frequent bigrams, trigrams, and finally I used it to discover the longest ngram that Obama repeated.
You'll notice there's some noise in the trigrams. The unigrams "than" and "in" got caught up in the trigrams list. They were not the only ones, just the only ones in the top 25. When I ran 4-grams and 5-grams, the unigram noise got worse (Noise. Whaddayagonnado?). This just goes to reinforce the point made by @heatherfro on Twitter recently: 75% of text analysis is text prep RT @Adam_Crymble Identifying and Fixing Transcription Errors in Large Corpora. No matter how you slice it, humans need to prep their documents before NLP processing. We may never solve that problem.
Does IBM's STAS offer anything better?
In terms of text prep, hell no. In fact, I almost gave up this series of posts because the pre-processing text prep was such a pain. This is because STAS expects language data to be ingested via cells in a spreadsheet. This required me to dump the SOTU speech into a cell. But the text was too large for a single cell (the spreadsheet accepted it, but STAS refused to open it), so I had to break it up into paragraphs, one per cell. Ugh! Those are too many minutes of my life I'll never get back.
But on to the bake off. First to compare apples to apples, it's only fair I use STAS as is, with their default parameters*. With a little effort, STAS would undoubtedly yield better results than those I discuss below. But I don't have the ability to tune Wordle or WriteWords, so I shouldn't tune STAS either. Nonetheless, even with minimal effort, STAS provides several more nuanced analyses than the previous tools.
For example, like we saw with Shakespeare's sonnets in the third post in this series, STAS auto clusters words and phrases by synonym class, which it calls concepts. While the average person can generally intuit synonyms, it can be cumbersome for a text as long as a SOTU speech, or for 5,000 tweets, or 1,000 customer responses to a survey. Having this done for you automatically is a nice feature. One of the interesting concepts that STAS clustered was problem, containing the terms problem, issue, struggling, complicated, hard (remember, for STAS, a concept is a set of synonyms with the most salient term used as the concept name). Instead of simply displaying raw frequencies, STAS provides us with some organization and color coding, helping to see patterns that might not be intuitively obvious at first blush.
For those who are seasoned NLPers, Digital Humanists, academics, this may seem trivial. It's important to come at this from the perspective of the average employee of a firm who has neither the time nor inclination to learn R, Python, or any other superior data analysis tool set. The sort of person who is intrigued by word clouds would be much better served by STAS. I'm convinced that's true.
But now we get to something STAS provides that is much harder to find in a user friendly tool: sentiment analysis. Sentiment analysis has exploded in recent years and is now one of the hottest industry NLP applications since spell checkers. I have a particular soft spot in my heart for sentiment analysis because I went to a graduate school that can be said to be one of its birthplaces: SUNY Buffalo. Lost in the hustle and bustle of modern day SA is the history of Jan Wiebe's seminal 1990 dissertation Recognizing Subjective Sentences: A Computational Investigation of Narrative Text in the CS department at SUNY Buffalo. I first heard her give a talk at Buffalo on emerging sentiment analysis techniques in the late 90s (damn! that looks weird now that I typed that, but yeah, it was the frikkin 90s. I guess I'm old. I feel 17. Is that normal?).
Anyhoo... STAS can provide automatic sentiment analysis of phrases. For example, using its default parameters, it found 67 positive terms and 57 negative terms. Note that it associates sentiment with Types, which are higher order groupings like Locations, Organizations, People (common information extraction stuff). Some of its positive terms were thanks, enjoy, work hard, great, solve, interests, reasonable, worth. Here's the screenshot. Note the lower left pane lists the set of types STAS clustered and the right pane lists the actual paragraphs from the SOTU.
Another interesting sentiment analysis result is the type "PositiveBudget". if you look at that lower left pane, you'll see that "Budget" is a type by itself (think of this as a set of budget related entities) and STAS provides a subset of budget items that have a positive sentiment attached: economic, free, affordable. The screenshot below shows some of these items:
I have only scratched the surface of the capabilities of STAS and my 14 day trial ends tomorrow, so this is my farewell to this tool. I am disappointed I didn't have more time to really dig deep. I think it's a well designed UI with some serious NLP behind it.
The take home is this: if you want a lot of people to use your product, you better make it usable for a lot of people. Obscure APIs with dense or non-existent documentation is fine for the hacker set, but the average person is not gonna learn R, Python, Scala, MapReduce, or Hadoop. Most professionals didn't want to learn SPSS in the first place, they sure as hell don't want to learn something else. Don't punish your customers just because they don't want to learn your pet framework. STAS wins because it doesn't punish its customers. It helps them.
*not only does this make the comparison fair, but it cuts down on my learning curve requirements :-)
**Let me reiterate: I do not work for IBM and this is not a sponsored blog in any way. These thoughts are entirely my own. I once worked for IBM briefly over 5 years ago and I still get the occasional IBM recruiter contacting me about opportunities, but this is my personal blog and all content is my own and reflects my honest, personal opinions.