Wednesday, February 6, 2013

IBM's SPSS Text Analytics - A Linguist's Perspective

This is the second in a series of posts about IBM's SPSS Text Analytics platform STAS (first post here). I haven't had much time to do anything more than review the documentation, but I must reiterate that this platform is clearly not for serious "big data" scientists. It is not a Formula One race car, but rather a Ford Model T. But let's not forget that the Model T introduced the car to the average workingman and, in doing so, ushered in a revolution in productivity. I'm not sure STAS will usher in a revolution, but I am even more impressed with this toolkit today simply because it has managed to package a variety of legit NLP tools into a user interface that is intuitive and easy to use. This is not to say it is perfect; nothing is. But I cannot think of any other software that provides this much basic NLP functionality in such a user-friendly interface. I'm surprised by how impressed I am*; I am a skeptic by nature. I'll tell you this: STAS is a damned better tool than frikkin Wordle! If all the people who go bonkers for mindless, uninformative word clouds could use STAS for an hour or two, they'd give up word clouds forever.

More than anything else, I'm impressed with its documentation. The STAS User's Guide (PDF) is 240 pages long and remarkably well written, with examples aplenty. I'm sure we've all grown tired of the poor-to-nonexistent documentation that accompanies much NLP software. The STAS Guide seems to have actually been written by teachers, people who want the average person to learn how to use this tool properly. Quick note: I have often happily promoted the use of NLTK on this blog, and still do, but that is a free tool set. STAS is not free. In fact, it's really damned expensive. But what you get for all that money is a tool that average employees will actually use.

After all this gushing I feel compelled to swear to you: this is NOT a paid endorsement. I have no connection to IBM (other than the one I've previously explained, repeated below*).

So, exactly what text analytics can you do with STAS? At the very least, you can do all of the following with little effort:

  • Tokenize linguistic input (they call this "componentize").
  • Build stop word lists.
  • Build specialized word lists (they call these "libraries").
  • Lemmatize (or "stem") content words (they call this "de-inflecting").
  • Auto-cluster linguistic terms à la simple topic modeling (they call these "categories").
  • Manually adjust topic categories as desired.
  • Auto-chunk linguistic input (they call these "terms").
  • Search for ngram chunks with Boolean and Kleene operators.
  • Search for collocates within a co-occurrence window.
  • Build custom clusters based on ngrams (up to trigrams).
  • Auto-assign sentiment to linguistic chunks.
  • Manually "fix" sentiment labels.
  • Auto-translate content into English.
  • Visualize data.

Many, if not most, of these things can be done automatically, but the tool also allows considerable manual review and revision.
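For the curious, several of the operations above have straightforward open-source analogues. Here's a minimal pure-Python sketch, emphatically not STAS's actual implementation, of tokenizing ("componentizing"), stop-word filtering, crude suffix stemming ("de-inflecting"), and bigram extraction. The stop-word list and suffix rules are my own toy assumptions; a real system would use something like NLTK's Porter stemmer:

```python
import re
from collections import Counter

def tokenize(text):
    # "Componentize": lowercase and split on non-alphabetic characters
    return re.findall(r"[a-z]+", text.lower())

# Toy stop-word list; real lists run to hundreds of entries
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "that"}

def stem(word):
    # Crude suffix stripping ("de-inflecting"); real stemmers are far smarter
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + ("y" if suffix == "ies" else "")
    return word

def bigrams(tokens):
    # Adjacent-pair ngrams (STAS supports up to trigrams)
    return list(zip(tokens, tokens[1:]))

text = ("The opportunities for text analytics are growing, "
        "and opportunity favors the prepared analyst.")
tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
stems = [stem(t) for t in tokens]
print(Counter(stems).most_common(3))  # "opportunity" tops the list with count 2
print(bigrams(tokens)[:2])
```

Note how "opportunities" and "opportunity" collapse to a single stem, which is exactly the kind of grouping the STAS documentation describes.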

My 14-day trial is tick-tocking away. I can't do much with an hour or two in the evenings, but I hope to spend this coming Saturday constructing a set of 100 or so test documents (single sentences, most likely) that will put STAS through its NLP paces, so to speak. Until then, I want to mention a couple of little linguistic quibbles:

  • The documentation consistently misuses the term "synonym". They use it to mean two words that share a root like "opportunities" and "opportunity" (p. 114). Their basic point is fine (that these two words, after stemming, can be grouped together semantically), but there's no reason to use the word "synonym" for this.
  • I haven't found a discussion of how they chunk their terms. They explicitly state that, post-chunking, the "terms/chunks" are treated as bags of words, but how do they parse their chunks to begin with?
  • There is a short discussion of "semantic networks" which sounds an awful lot like WordNet to me, but no mention of WordNet is made. 
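To illustrate the first quibble: stemming conflates inflectional variants of a single lexeme, while true synonymy links distinct lexemes and requires a lexical resource (WordNet being the obvious candidate). A toy sketch of the distinction, with a crude one-rule stemmer and a hand-built synonym table of my own invention:

```python
def stem(word):
    # Inflectional grouping, as in the documentation's example:
    # "opportunities" -> "opportunity". This is stemming, not synonymy.
    return word[:-3] + "y" if word.endswith("ies") else word

# Genuine synonymy links *different* lexemes; it needs a lexical
# resource, not a stemmer. This tiny table stands in for WordNet.
SYNONYMS = {"chance": "opportunity", "prospect": "opportunity"}

def normalize(word):
    stemmed = stem(word)
    return SYNONYMS.get(stemmed, stemmed)

print(normalize("opportunities"))  # opportunity (via stemming)
print(normalize("chance"))         # opportunity (via genuine synonymy)
```

No amount of suffix stripping will ever map "chance" onto "opportunity"; that mapping comes only from the synonym table, which is why calling stemmed variants "synonyms" muddies the terminology.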

*Let me reiterate: I do not work for IBM and this is not a sponsored blog in any way. These thoughts are entirely my own. I once worked for IBM briefly over 5 years ago and I still get the occasional IBM recruiter contacting me about opportunities, but this is my personal blog and all content is my own and reflects my honest, personal opinions.

