Wednesday, February 13, 2013

IBM SPSS Text Analytics: A Shakespearean Stage

This is the third in a series of posts about IBM's SPSS Text Analytics platform, STAS (first post here, second here). These tests were run Tuesday evening.

Yet again, my workaday schedule was a bit light on free time, so I didn't get to dig as deeply as I had wanted to (that 14-day free trial is tick-tocking away, a low, dull, quick sound, such as a watch makes when enveloped in cotton, but this only increases my fury, as the beating of a drum stimulates the soldier into courage).

With tonight's SOTU speech, I of course want to use my last day (tomorrow) to run a bake-off between the inevitable word-frequency analyses that will pop up afterwards and STAS's more in-depth tools.

So, for tonight, I went back to my literary roots and performed a simple Digital Humanities analysis of Shakespeare's 154 Sonnets, using the free Project Gutenberg edition of the Bard's Sonnets. I had to do a little document pre-processing, of course (okay, I had to do A LOT of pre-processing). I've already noted that STAS requires unstructured language data to be ingested via cells in a spreadsheet, so I pasted each sonnet into its own cell and then ran STAS's automatic NLP tools. The processing took all of 90 seconds.
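If you want to skip the pasting, a short script can do the cell-splitting for you. Here's a minimal Python sketch of the kind of pre-processing I mean; the filename and the Roman-numeral heuristic are my assumptions about the Gutenberg file's layout, not anything STAS requires:

```python
import csv
import re

# Split the Project Gutenberg plain-text Sonnets into one sonnet per row.
# Assumptions: each sonnet is preceded by a Roman numeral on its own line,
# and the Gutenberg header/footer have already been trimmed off.
ROMAN = re.compile(r"^\s*[IVXLC]+\s*$")

sonnets, current = [], []
with open("sonnets.txt", encoding="utf-8") as f:
    for line in f:
        if ROMAN.match(line):            # a new sonnet starts here
            if current:
                sonnets.append(" ".join(current))
            current = []
        elif line.strip():
            current.append(line.strip())
    if current:
        sonnets.append(" ".join(current))

# STAS wants one text per spreadsheet cell, so write one sonnet per row.
with open("sonnets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sonnet_number", "text"])
    for i, text in enumerate(sonnets, start=1):
        writer.writerow([i, text])
```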

What this gave me was a set of what IBM calls "concepts" and "types." I take "concepts" to be rough synonym sets, with the most frequent lexeme used as the exemplar of the concept. For example, STAS identified a concept it called "excellent" containing 40 linguistic items, including "great", "best", and "perfection" (see image below).
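In spirit, the exemplar-picking seems to work something like the toy sketch below. To be clear, this is my guess at the idea, not IBM's actual algorithm, and the grouping here is hand-specified just to show how a frequency-based exemplar falls out:

```python
from collections import Counter

# Toy illustration of a "concept": a group of related lexemes labeled by
# its most frequent member. The grouping itself is hand-made here; STAS
# derives its groupings with its own NLP, which it doesn't expose to me.
tokens = ("excellent excellent excellent great great best perfection").split()
counts = Counter(tokens)

concept_members = ["excellent", "great", "best", "perfection"]
exemplar = max(concept_members, key=lambda w: counts[w])
total = sum(counts[w] for w in concept_members)

print(f"concept '{exemplar}': {total} items, members={concept_members}")
# -> concept 'excellent': 7 items, members=['excellent', 'great', 'best', 'perfection']
```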


So far, I'm pretty impressed. Remember, STAS only took about 90 seconds of processing to produce all this. And this isn't half of what it did.

While I'm impressed, I saw some clear bad apples. For example, STAS generated a concept called "art", but the set of linguistic items it included in that concept is overly tied to the literal string a-r-t (see below):

However, to give the IBM crew the credit they deserve, they never claim the default parameters are right for every project. They clearly state that the categorization process is iterative, requiring human-in-the-loop intervention to get the most value out of the tool. I simply let the default parameters run and looked at what popped out. STAS provides many ways to manually intervene in the process and steer the output in positive directions.

In addition to providing concepts (synonyms), STAS analyzes "types", which I take to be higher-order groupings like entities. For example, it will identify Organizations, Dates, Persons, and Locations. These types are well known to the information extraction (IE) set; they're the bread and butter of IE.
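For readers outside the IE world, this is the same job open-source named-entity recognizers do. A quick analogue using spaCy (emphatically not STAS, and assuming you have the small English model installed) looks like this:

```python
import spacy

# An open-source analogue of type extraction: spaCy's named-entity
# recognizer assigns labels like ORG, PERSON, DATE, GPE to text spans.
# Illustrative only; STAS's "types" come from its own engine.
nlp = spacy.load("en_core_web_sm")
doc = nlp("IBM released SPSS Text Analytics in Armonk on Tuesday, "
          "and William Shakespeare was unavailable for comment.")

for ent in doc.ents:
    print(f"{ent.text:25} -> {ent.label_}")
# Typical output: IBM -> ORG, Armonk -> GPE, Tuesday -> DATE,
# William Shakespeare -> PERSON
```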

For example, STAS identified a type it called "budget" with items like "pay", "loan", and "fortune". See the screenshot below for examples.

Another interesting example of a type that STAS identified in 90 seconds is "eyes", including "bright eyes", "eyes delight" and "far eyes".

The "types" are not typical types that IE pros are used to dealing with, but I suspect that's a function of the Shakespeare corpora I used. I previously ran some tweets through it and the types were more typical, like Microsoft and San Francisco and such.

I haven't delved deep into STAS's sentiment analysis toolkit, but it does provide a variety of ways to analyze the sentiment expressed in natural language. For example, the image below shows some of the positive sentiment words it identified.

Keep in mind that the more powerful tools it provides (which I haven't played with yet) allow querying the language data for things like Food + Positive, to capture positive opinions about food in a particular Shakespeare play or scene.
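Under the hood, a query like Food + Positive is really just an intersection over the categorized texts. Here's a rough sketch of the idea with made-up term lists (mine, not STAS's categories, which it builds and lets you refine interactively):

```python
import re

# Toy version of a "Food + Positive" query: keep only the texts that
# contain at least one food term AND at least one positive-sentiment term.
# The term lists below are invented for illustration.
FOOD = {"feast", "banquet", "sweet", "honey", "wine"}
POSITIVE = {"love", "delight", "fair", "joy", "excellent"}

def matches(text, food=FOOD, positive=POSITIVE):
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & food) and bool(words & positive)

texts = [
    "A sweet banquet and much delight",
    "A bitter feast, endured in silence",
    "Fair love, with no supper in sight",
]

for t in texts:
    print(matches(t), "-", t)
# True  - A sweet banquet and much delight
# False - A bitter feast, endured in silence
# False - Fair love, with no supper in sight
```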

With that, I'm truly looking forward to pitting STAS against the SOTU word-count illiterati that will cloud the airwaves.
