I recently discovered that IBM bought SPSS a few years ago and is now providing a text analytics package called IBM SPSS Text Analytics for Surveys (producing the acronym STAS, which is either a stats package or an STD). I thought I'd take it out for a test drive, so I downloaded a 14-day trial version. Before using it, I reviewed these excellent tutorials: Analytics Blog and RTI's SurveyPost blog.
For data I could have used their sample data, but I decided to download my Twitter archive. Unfortunately, this caused me some pre-processing hassle. You see, STAS is technically designed for survey data: it expects unstructured language to be in the form of comment responses to questions, and it expects those comments to be stored in single cells within a column in a spreadsheet. (I see no reason in principle why it couldn't be used for analyzing any unstructured data. You just have to package the data in a format SPSS will accept, namely a spreadsheet with the unstructured data all in one column.)
Also, STAS does not directly ingest CSV files or OpenOffice ODS files. Apparently it only accepts inputs of four types: its own file type, Excel, ODBC, and what they call “Data Collection”, which I haven't investigated.
Once you open a file, you are asked to drag and drop the column name containing your language data into an "Open Ended Text" box (refer to Analytics Blog for screenshots). While I appreciate the simplicity of the drag-and-drop functionality, my Twitter data had tokens split into separate columns (which I thought was weird. Let me do my own tokenization, please!). STAS' functional choice meant I needed to pre-process my data files: I had to merge the many token columns into a single column. Document pre-processing is common in language analysis, but STAS is supposed to be a platform that is easy for non-engineers to use. These file ingest and pre-processing steps are tedious and uninteresting, and they are exactly why most people get frustrated. These things can be automated, and a platform like STAS ought to be doing this for me.
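For what it's worth, the column merge itself is only a few lines of Python. This is a minimal sketch, not what STAS does internally. It assumes the archive has been exported as CSV and that the token columns share a prefix such as `token_`; that prefix, the `text` output column, and the function names are all hypothetical. The result is still a CSV, so it would need to be saved as an Excel file before STAS will take it.

```python
import csv

def merge_token_columns(rows, token_columns, sep=" "):
    """Join the non-empty token cells of each row into one string."""
    merged = []
    for row in rows:
        # Missing or None cells are treated as empty strings.
        tokens = [(row.get(col) or "").strip() for col in token_columns]
        merged.append(sep.join(t for t in tokens if t))
    return merged

def merge_file(in_path, out_path, token_prefix="token_", text_column="text"):
    """Read a CSV, merge its token columns, write a one-column CSV.

    token_prefix and text_column are assumptions about the file layout.
    """
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        token_columns = [c for c in (reader.fieldnames or [])
                         if c.startswith(token_prefix)]
        rows = list(reader)
    texts = merge_token_columns(rows, token_columns)
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow([text_column])
        writer.writerows([t] for t in texts)
```

Calling `merge_file("tweets_2013_01.csv", "tweets_2013_01_merged.csv")` (hypothetical file names) would produce one row per tweet with all of its tokens in a single cell.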
Also, it seems to ingest only a single file at a time. My Twitter data came to me separated by month, so I have 38 files. I can manually merge them, but that's more work for me. There's really no reason STAS can't let me select multiple identically formatted files and merge them if necessary.
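The multi-file merge is just as mechanical. Here is a rough sketch, assuming the monthly files are CSVs with identical headers; the function name and file paths are hypothetical. It refuses to merge files whose headers differ, since silently misaligned columns would be worse than an error.

```python
import csv

def concat_csv_files(paths, out_path):
    """Concatenate CSV files that share one header.

    Returns the number of data rows written.
    """
    header = None
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip completely empty files
                if header is None:
                    # First non-empty file defines the header.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    count += 1
    return count
```

With the standard `glob` module, something like `concat_csv_files(sorted(glob.glob("tweets/*.csv")), "all_tweets.csv")` would cover all 38 monthly files in one call (the directory name is made up).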
I was surprised and impressed that the software immediately offered me an opportunity to translate non-English comments with a single click. Simple and easy. Quality is what it is with MT. Don't blame STAS if it's a crappy translation. No matter how you slice it, it's a great function. Kudos.
I was super impressed that it will crawl the data and suggest code categories like key concepts. This is essentially topic modeling (though not as sophisticated as something like LDA; the User Guide has a whole chapter devoted to describing the details, but I haven't had time to dig in yet). Color-coded clusters of concepts are a very nice function. Colors seem to refer to entity types (Person, Org, etc.). You can collapse all concepts into just the key exemplars of each cluster. There are also several nice filtering options to help you understand what your data is centered around. Here's a screenshot of my final output:
I can see key concept frequencies and filter by that. That's nice. Next steps: Can I see simple word frequencies? Ngrams?
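To be concrete about what I mean by word frequencies and n-grams, here is a minimal Python sketch of the kind of counts I'd want to see. It has nothing to do with how STAS computes anything; the regex tokenizer is a deliberately crude assumption, and the function names are my own.

```python
import re
from collections import Counter

def tokenize(text):
    """Crude tokenizer: lowercase runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngram_counts(texts, n=1):
    """Count n-grams (as tuples of tokens) across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # range() is empty when a text has fewer than n tokens.
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts
```

`ngram_counts(tweets, 1).most_common(20)` would give the top words and `ngram_counts(tweets, 2).most_common(20)` the top bigrams, where `tweets` is a list of strings.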
Sentiment analysis can be done with respect to specific categories (food + positive). Pretty easy, but SPSS should mitigate lay people's over-indulgence in sentiment analysis, which is tricky and not as easy as this makes it look. This is where making something easy backfires. How can STAS encourage double-checking the data? Gold standards, sampling, etc.
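The sampling check I have in mind is simple: pull a random sample of comments, label them by hand, and see how often the tool's label matches. A minimal sketch in Python, assuming you can export the tool's labels as a plain list; the function names and the raw-agreement measure are my own choices, not anything STAS provides.

```python
import random

def sample_for_review(items, k, seed=0):
    """Draw up to k items at random; a fixed seed makes the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))

def agreement(tool_labels, human_labels):
    """Fraction of items where the tool's label equals the hand label."""
    if len(tool_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not tool_labels:
        raise ValueError("no labels to compare")
    matches = sum(t == h for t, h in zip(tool_labels, human_labels))
    return matches / len(tool_labels)
```

Raw agreement is the bluntest possible measure (it ignores chance agreement, for one), but even a low number on a sample of 100 would tell you not to trust the dashboard yet.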
No doubt, this is easy to use. An academic has the luxury of ignoring people who don't want to learn command line tools or programming languages, but the businessman does not. There's a ton of language data out there owned by thousands of companies and those companies are never going to get their regular employees to learn R just to analyze it. For them, STAS is a legitimate tool that will actually allow the average employee to dig into unstructured data. That's a win.
*In the interest of full disclosure: I do not work for IBM and this is not a sponsored blog in any way. These thoughts are entirely my own. I once worked for IBM briefly over 5 years ago and I still get the occasional IBM recruiter contacting me about opportunities, but this is my personal blog and all content is my own and reflects my honest, personal opinions.