Monday, February 4, 2013

IBM SPSS Text Analytics - Any Good? Yes.

I recently discovered that IBM bought SPSS a few years ago and is now providing a Text Analytics package called IBM SPSS Text Analytics for Surveys (producing the acronym STAS, which is either a stats package or an STD). I thought I'd take it out for a test drive, so I downloaded a 14-day trial version. Before using it, I reviewed these excellent tutorials: Analytics Blog and RTI's SurveyPost blog.

For data I could have used their sample data, but I decided to download my Twitter archive. Unfortunately, this caused me some pre-processing hassle. You see, STAS is technically designed for survey data: it expects unstructured language to be in the form of comment responses to questions, and it expects those comments to be stored in single cells within a column in a spreadsheet. (I see no reason in principle why it couldn't be used for analyzing any unstructured data. You just have to package the data in a format SPSS will accept, namely a spreadsheet with the unstructured data all in one column.)

Also, STAS does not directly ingest CSV files or OpenOffice ODS files. Apparently it only accepts inputs of four types: its own file type, Excel, ODBC, and what they call “Data Collection”, which I haven't investigated.

Once you open a file, you are asked to drag-and-drop the name of the column containing your language data into an "Open Ended Text" box (refer to Analytics Blog for screenshots). While I appreciate the simplicity of the drag-and-drop functionality, my Twitter data had tokens separated into separate columns (which I thought was weird. Let me do my own tokenization, please!). STAS' functional choice means I needed to pre-process my data files: I had to merge the many token columns into a single column. Document pre-processing is common in language analysis, but STAS is supposed to be a platform that's easy for non-engineers to use. These file ingest and pre-processing steps are tedious and uninteresting, and they are exactly why most people get frustrated. These things can be automated, and a platform like STAS ought to be doing this for me.

Also, it seems to only ingest a single file at a time. My Twitter data came to me separated by month, so I have 38 files. I can manually merge them, but that's more work for me. There's really no reason STAS can't let me select multiple identically formatted files and merge them if necessary.
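For anyone stuck on the same two chores (collapsing token columns into one text column, and stacking a pile of monthly files), here is a minimal Python sketch of how I'd script it. It is only a sketch under assumptions: it assumes the monthly files are CSVs with identical headers, and the function names and the `source_file` and `text` column names are made up for illustration. It writes a CSV, which STAS won't take directly, so you would still open the result in Excel and save it as a workbook.

```python
import csv
from pathlib import Path


def merge_token_columns(row, token_columns):
    """Join the non-empty token cells of one row into a single string."""
    cells = ((row.get(c) or "").strip() for c in token_columns)
    return " ".join(cell for cell in cells if cell)


def combine_files(paths, token_columns, out_path, text_column="text"):
    """Stack identically formatted CSV files into one file with one text column.

    Returns the number of non-empty text rows written.
    """
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        # Keep the source file name so each row can be traced back to its month.
        writer.writerow(["source_file", text_column])
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    text = merge_token_columns(row, token_columns)
                    if text:  # skip rows whose token cells are all empty
                        writer.writerow([Path(path).name, text])
                        written += 1
    return written
```

Calling `combine_files(sorted(Path("archive").glob("*.csv")), ["tok1", "tok2"], "merged.csv")` would do all the months in one pass; the folder and column names there are placeholders for whatever your archive actually uses.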

I was surprised and impressed that the software immediately offered me an opportunity to translate non-English comments with a single click. Simple and easy. Quality is what it is with MT. Don't blame STAS if it's a crappy translation. No matter how you slice it, it's a great function. Kudos.

I was super impressed that it will crawl the data and suggest code categories like key concepts. This is essentially topic modeling (though not as sophisticated as something like LDA; the User Guide has a whole chapter devoted to describing the details, but I haven't had time to dig in yet). Color-coded clusters of concepts are a very nice feature. Colors seem to refer to entity types (Person, Org, etc.). You can collapse all concepts into just the key exemplars of each cluster. There are also several nice filtering options to help you understand what your data is centered around. Here's a screenshot of my final output:


I can see key concept frequencies and filter by that. That's nice. Next steps: Can I see simple word frequencies? Ngrams?
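Until I find out whether STAS can do that, plain word and n-gram counts are easy enough to get outside the tool. A rough Python sketch, with a deliberately crude tokenizer of my own (it lowercases everything and throws away punctuation, so hashtags and @mentions lose their symbols); the function names are just for illustration:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(texts, n=1):
    """Count n-grams inside each text; n-grams never span two texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Zipping n staggered copies of the token list yields each n-gram as a tuple.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts
```

`ngram_counts(tweets, 1).most_common(20)` gives the top words and `ngram_counts(tweets, 2).most_common(20)` the top bigrams, where `tweets` is whatever list of strings you have loaded.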

Sentiment analysis can be done with respect to specific categories (food + positive). Pretty easy, but SPSS should mitigate lay people's over-indulgence in sentiment analysis, which is tricky and not as easy as this makes it look. This is where making something easy backfires. How can STAS encourage double checking the data? Gold Standards, sampling, etc.
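To make the double-checking idea concrete: pull a random sample of comments, label them by hand, and see how often the tool agrees. The Python sketch below assumes you can get the tool's sentiment labels out as a mapping from comment id to label; how you would export that from STAS is not something I've looked into, and the function names are mine.

```python
import random


def draw_sample(comment_ids, k, seed=0):
    """Pick up to k comment ids at random for hand labeling (seeded, so repeatable)."""
    ids = list(comment_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))


def agreement(tool_labels, gold_labels):
    """Fraction of hand-labeled comments where the tool's label matches the hand label."""
    shared = [cid for cid in gold_labels if cid in tool_labels]
    if not shared:
        raise ValueError("no overlap between tool labels and gold labels")
    hits = sum(tool_labels[cid] == gold_labels[cid] for cid in shared)
    return hits / len(shared)
```

Simple agreement like this is the bluntest possible check, but even a low number on a sample of 100 comments tells you not to trust the pretty charts yet.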

No doubt, this is easy to use. An academic has the luxury of ignoring people who don't want to learn command line tools or programming languages, but the businessman does not. There's a ton of language data out there owned by thousands of companies and those companies are never going to get their regular employees to learn R just to analyze it. For them, STAS is a legitimate tool that will actually allow the average employee to dig into unstructured data. That's a win.

*In the interest of full disclosure: I do not work for IBM and this is not a sponsored blog in any way. These thoughts are entirely my own. I once worked for IBM briefly over 5 years ago and I still get the occasional IBM recruiter contacting me about opportunities, but this is my personal blog and all content is my own and reflects my honest, personal opinions.

7 comments:

Anonymous said...

I would have to say that I was extremely disappointed with this product. It is very limiting, slow and buggy. It doesn't offer any text preprocessing capabilities, it only allows for boolean weighting, its sentiment analysis is very basic, its category creation requires a lot of manual work, its regex support is a joke and on top of all that, it quite often fails to detect the words you are interested in when you give it a lot of data. From my experience, the most it could handle was 50,000 obs. Although, I do understand that it's only intended to be used for surveys and the sample size hardly ever gets that large.

Unknown said...

A lot of companies nowadays are coming up with various customer loyalty programs to ensure bigger profits for their companies. This may seem to be quite a worn idea already for a customer loyalty program but people, no matter how wealthy they are, actually enjoy getting freebies every now and then. spss statistical analysis

Unknown said...

Any alternative product?

rima said...

@Anonymous : so what is your suggestion??!
I found it could be useful, but I am still struggling with regex support. It would be nice if we could have a chat together.

Leo said...

I've used knime and R for this sort of thing and really like it.
R has a stack of packages (tm, wordcloud, etc.) that help and knime has a bunch of useful nodes AND embeds R if you just want to call the R functions.

Imagine knime is SPSS for the load and transformation of data.
Then you've got a choice about whether to use knime node to do the rest or embed a bit of R to do it.

Google it, and spend a bit of time reading the white papers. I hope you find it worthwhile.

Dan said...

I'm stunned to see that your profile says that you are a Cognitive Linguist working at IBM, when this article says you do not work for IBM. An easy error to catch, with or without NLP.

Chris said...

Dan, you may be equally stunned to learn that I was not working at IBM in 2013 when I wrote the post.
