Monday, February 18, 2013

So You Want To Be A Text Analyst? Get Your Hands Dirty.

Yet again, I find myself responding to posts on a Text Analytics discussion board and I'd like to broadcast my response to a wider audience. After posting a link to my review of IBM's Text Analytics platform, someone posted this request: "I need the techniques and methods, how to extract a knowledge from a text!!"

First, I want to thank this questioner for the double exclamation points, because if there had been only one, I would have ignored the question as banal and unworthy of a response.

Second, I feel the need to invoke the code of Unfrozen Caveman Lawyer. I'm just a linguist. I'm frightened by the counting machines and the blinking lights. When I see a gradient descent algorithm, I think 'Oh no! Will it converge on a global minimum more quickly if features have similarly scaled values?' I don't know. Because I'm just a linguist -- that's the way I think.

My point being that the tools and techniques underlying text analytics are the math and coding part and they require a special effort to learn. That's why IBM's software is so enticing. They put the math and coding stuff under the hood.

If you want to learn the math and coding stuff, I highly recommend the following:

Third, learning the algorithms requires walking through the math with data. It takes several months of regular effort to gain competency, but once accomplished, the rewards are well worth it. There are lots of free data sets these days, like the Enron Corpus.
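Getting your hands dirty can start as small as answering the gradient descent question from my caveman routine above. Here's a minimal sketch of my own (the data, learning rate rule, and tolerances are all just illustrative assumptions, not anything from a textbook or toolkit) comparing how many steps plain gradient descent needs on a least-squares problem with raw versus standardized features:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Two features on wildly different scales.
X_raw = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 100, n)])
y = X_raw @ np.array([2.0, 0.5]) + rng.normal(0, 1, n)

def gd_steps(X, y, tol=1e-4, max_iter=500_000):
    """Plain gradient descent on a least-squares loss; return steps until the gradient is tiny."""
    w = np.zeros(X.shape[1])
    # Step size taken from the largest eigenvalue of X'X/n so both runs are stable.
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X / len(y)).max()
    for step in range(1, max_iter + 1):
        grad = X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return step
        w -= lr * grad
    return max_iter

X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
print("raw features:   ", gd_steps(X_raw, y), "steps")
print("scaled features:", gd_steps(X_std, y), "steps")
```

Run it and the badly scaled version grinds through orders of magnitude more steps. That's the kind of walking-through-the-math-with-data I mean.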

Fourth, it is a fair question about the actual techniques, but the actual extraction techniques are far more complicated than any one blog post can address. This is why people complete degrees in computational linguistics or data science. There are many complicated issues and algorithms to gain competency in and that takes time. It's like learning golf or poker. Learning the theory ain't good enough. You have to get your hands dirty. This was my main point about IBM's platform. It does a lot of the work for you, under the hood.

The harder question is this: What does it mean to extract info from text? Business Intelligence is a profitable sector, but what counts as BI extracted from text? Sentiment? Topics? Every business must answer this for themselves. I did a brief review of books on Amazon under the search "business intelligence" and I was overwhelmed by the deluge of empty jargon. That's not to say that there isn't important stuff there, but the people who write books about it are probably not the right people to learn from. I know it can be frustrating for outsiders to a technical field to listen to insiders throwing buzzwords around without understanding the basics.

I am reminded that California's Lieutenant Governor Gavin Newsom was a guest on Colbert recently to talk about reinventing government in a digital age (video). What transpired was a ten minute stroll down bullshit lane. Newsom spewed forth the most inane and banal set of memorized talking points, which made him look like the most out-of-touch, kiss-your-baby, shake-your-hand, eat-your-gramma's-homemade-pie, do-whatever-it-takes-to-get-your-vote slimy politician since Pappy O'Daniel. And Colbert called Newsom out on his bullshit too. Good for you, Stephen.

Yet I'm still befuddled by the idea of "business intelligence" and those books did me no good. Right now, I think anything that helps someone make an extra dollar of profit = business intelligence. And there are definitely many extra dollars of profit left lying around in language data.

Friday, February 15, 2013

Crazy Question - The Primacy of Nouns or Verbs?

A question was posted recently in a Text Analytics group discussion on a well known social networking site (sorry, not Facebook). I posted an answer and thought it was worth broadcasting to a wider audience, but I recognize the murky new technology ethics involved. It is a closed networking site, though it has lots of open access for non-members. I don't want to steal the thunder of the original poster who asked the question in all earnestness. And I gave my full answer within the confines of that site, but it is MY answer, after all. I feel I own it. And it is a question and answer interaction I think a lot of non-linguists might benefit from and I have the means to distribute it beyond the semi-walled garden of the original site. Imma post this modified version* and let you, dear reader, decide the ethics for yourselves**.

Original Question
I have a crazy question: In language, which is formed first, a noun or a verb? I think it is a noun. We know 'Google' as a noun from when it started as a brand name. Now we use it as a verb. When a language evolves, naming should happen first, right? Naming of actions and entities. That itself is a noun. After that, we define different forms of verbs.
My Answer
As a linguist, I'd have to disambiguate the question before beginning an answer. There are (at least) three variants of the question. There's no simple answer to any of them, but getting the question right is often the best starting point:
  1. In the contemporary evolution of a new language (e.g., pidgins), what parts of speech (POS) form first?
  2. In the development of language in a child, which POS is learned/utilized first?
  3. In the brain, which POS is the base or most salient form of ambiguous words like "Google"? 
To begin an answer, it's important not to confuse the cognitive act of labeling events in the world (the gavagai problem) with labeling parts of speech (POS) like Noun/Verb/Adjective. Linguists use POS to identify how words behave in the grammatical structure of a sentence. POS tell us what structural rules a word follows in a particular sentence, not what the word refers to in the world. POS are syntactic objects, not semantic ones. Once this distinction is clear, the issues become plainer.

I’ll use the English “Google” example from the original question to illustrate (but let’s be aware that brand names like Google, Kleenex, and Xerox have their own weird, unique linguistic life, so this is not a perfect example).

When “Google” is used as a noun in English, it can function as the Subject of a sentence. For example, “Google rolled out a new service today.” As a noun, it can be counted and take plural morphology and count determiners like “one” and “two”. For example, “There are not two Googles, there is only one Google.” As a verb, it can take tense morphology like past tense -ed. For example, “I googled around for a new phone yesterday.”
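If you want to see this syntactic behavior reflected in an actual tool, here's a tiny sketch of my own (not part of my original answer) using NLTK's off-the-shelf POS tagger; it assumes NLTK is installed along with its tokenizer and tagger models:

```python
import nltk

sentences = [
    "Google rolled out a new service today.",        # noun-like behavior
    "I googled around for a new phone yesterday.",   # verb-like behavior
]
for s in sentences:
    print(nltk.pos_tag(nltk.word_tokenize(s)))
# A reasonable tagger should give 'Google' a noun tag (NNP) in the first sentence
# and 'googled' a verb tag (VBD) in the second: the tag tracks how the word behaves
# in the sentence, not what it refers to in the world.
```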

Contrast this purely syntactic analysis with how words are learned by children. When a child is at the one word stage, she may use a single word to label a whole series of events and objects (known as holophrasis). For example, imagine playing with a one year old by picking her up, twirling her around, then setting her down and she giggles. After you set her down, she holds her arms up to you and says “up.” What she wants is for you to go through the whole series of events again. She is not using that one word to refer to one discrete object in the world. She is referring to a holistic series of events.

In that situation, what POS is “up”? Is it a preposition? A verb? A noun? It’s none of the above. It is simply not appropriate, in a linguistic sense, to give the word “up” any POS under these circumstances because it is not functioning within the grammar of a structured sentence.

To return to the original question: what the questioner is calling naming is not the same thing as the POS “noun”. Labeling events and objects in the world and labeling POS are fundamentally different. There is a rough but imperfect correlation between the two in some languages, but it's all very messy.

As to what comes first in the evolution of language, that’s a deeply complicated topic with no clear answers. We have little to no direct evidence for how languages evolved originally. This is not to say that there aren't some very smart theories. For a readable lay introduction, I recommend the book “Adam’s Tongue” by Derek Bickerton (I do not endorse the conclusions in that book, but I do recommend it as a good, readable intro to the issues of language evolution for the lay reader). For more detailed analysis and up-to-date discussions of language evolution I can highly recommend the group blog Replicated Typo.

That's my first pass attempt at answering the question as originally posed. I'm happy to accept dissents, revisions, updates, addenda, rude noises, porn links, kitten pictures, and hot stock picks.

*I naturally re-worded and added clarification to my hastily composed original, so this version is its own unique product.
**I cleaned up the original question for clarity and brevity.

Thursday, February 14, 2013

IBM SPSS Text Analytics - STAS vs. Word Clouds

This is the fourth and final post in a series about IBM's SPSS Text Analytics platform STAS (first intro post here, second on a linguist's perspective here, third on Shakespeare's sonnets here).

As I wrote in my third post, I want to run a bake off between the word frequency analyses of the President's State Of The Union (SOTU) speech last night and STAS's more in-depth tools. I am no fan of word clouds and simplistic word frequency counts (see my discussion here or Mark Liberman's discussion of word count abuse in political punditry here), but I'm trying to put myself into the shoes of a novice with no NLP, text analytics, or linguistics background. Someone who wants a quick and simple way to analyze language in an objective but meaningful way.

First, let's look at a word cloud of President Obama's 2013 SOTU (text here; note I deleted all instances of the string "(Applause.)"). I used the free online service Wordle to create this word cloud:


Jobs, America, people, new, work, now, get, like... Huh? A pretty incoherent representation of the speech. Plus, it failed to stem, so "American" and "Americans" are treated as different words. Imagine I hadn't told you which president's SOTU this was or which year: could you use this word cloud to make a guess with any kind of certainty? It looks like pretty generic SOTU stuff. Could have been Carter 78, Reagan 85, Bush HW 91, Clinton 95, Bush W 2001, Obama 2010.
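For the curious, doing the same frequency count yourself, but with stop words removed and a stemmer applied so that "American" and "Americans" collapse together, is only a few lines. A sketch of mine, assuming Python with NLTK (plus its 'punkt' and 'stopwords' data) and the speech saved locally under the hypothetical filename 'sotu2013.txt':

```python
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = open("sotu2013.txt").read().replace("(Applause.)", "").lower()
tokens = [t for t in nltk.word_tokenize(text) if t.isalpha()]   # drop punctuation
stops = set(stopwords.words("english"))
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens if t not in stops]
print(Counter(stems).most_common(25))
```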

There are other word count options for the lay reader. For example, the UK site WriteWords provides a nifty free online frequency counter for any text. Using that tool, it's easy to get the full word count, but without stop words deleted, it's even less useful than the word cloud above. Here are the top 100 most frequent words in SOTU 2013:


However, the WriteWords online tool offers something a little more useful: an ngram counter for finding strings of words (they call it a Phrase Frequency Counter), which lets anyone find ngrams up to 10-grams. Kudos to them for offering this nifty tool. I used it to find the frequent bigrams and trigrams, and finally the longest ngram that Obama repeated.


You'll notice there's some noise in the trigrams. The unigrams "than" and "in" got caught up in the trigrams list. They were not the only ones, just the only ones in the top 25. When I ran 4-grams and 5-grams, the unigram noise got worse (Noise. Whaddayagonnado?). This reinforces the point made by @heatherfro on Twitter recently: "75% of text analysis is text prep RT @Adam_Crymble Identifying and Fixing Transcription Errors in Large Corpora". No matter how you slice it, humans need to prep their documents before NLP processing. We may never solve that problem.
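Rolling your own ngram counts is also not much code, and it avoids the unigram noise. Another sketch of mine, reusing the same hypothetical 'sotu2013.txt' file as above: count bigrams and trigrams, then walk n upward to find the longest ngram that repeats.

```python
from collections import Counter
import nltk

text = open("sotu2013.txt").read().replace("(Applause.)", "").lower()
tokens = [t for t in nltk.word_tokenize(text) if t.isalpha()]

def ngram_counts(tokens, n):
    # Counts every run of n consecutive tokens.
    return Counter(zip(*[tokens[i:] for i in range(n)]))

print(ngram_counts(tokens, 2).most_common(10))   # frequent bigrams
print(ngram_counts(tokens, 3).most_common(10))   # frequent trigrams

# Longest repeated ngram: increase n until nothing occurs twice.
n, longest = 2, None
while True:
    counts = ngram_counts(tokens, n)
    repeats = [g for g, c in counts.items() if c > 1]
    if not repeats:
        break
    longest = max(repeats, key=counts.get)
    n += 1
print("longest repeated ngram:", " ".join(longest))
```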

Does IBM's STAS offer anything better?

In terms of text prep, hell no. In fact, I almost gave up this series of posts because the text prep was such a pain. This is because STAS expects language data to be ingested via cells in a spreadsheet, which required me to dump the SOTU speech into a cell. But the text was too large for a single cell (the spreadsheet accepted it, but STAS refused to open it), so I had to break it up into paragraphs, one per cell. Ugh! Those are minutes of my life I'll never get back.
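For what it's worth, once you give in to the one-paragraph-per-cell requirement, the prep itself is a few lines of Python. A sketch of my own (assuming the openpyxl package is available and the same hypothetical 'sotu2013.txt' file):

```python
from openpyxl import Workbook

# One paragraph per cell, in a single column: the shape STAS expects.
paragraphs = [p.strip() for p in open("sotu2013.txt").read().split("\n\n") if p.strip()]

wb = Workbook()
ws = wb.active
ws.append(["response"])          # header row for the text column STAS will ingest
for p in paragraphs:
    ws.append([p])
wb.save("sotu2013_paragraphs.xlsx")
```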

But on to the bake off. First, to compare apples to apples, it's only fair to use STAS as is, with its default parameters*. With a little effort, STAS would undoubtedly yield better results than those I discuss below. But I don't have the ability to tune Wordle or WriteWords, so I shouldn't tune STAS either. Nonetheless, even with minimal effort, STAS provides several more nuanced analyses than the previous tools.

For example, as we saw with Shakespeare's sonnets in the third post in this series, STAS auto clusters words and phrases by synonym class, which it calls concepts. While the average person can generally intuit synonyms, doing so is cumbersome for a text as long as a SOTU speech, or for 5,000 tweets, or 1,000 customer responses to a survey. Having this done for you automatically is a nice feature. One of the interesting concepts that STAS clustered was "problem", containing the terms problem, issue, struggling, complicated, hard (remember, for STAS, a concept is a set of synonyms with the most salient term used as the concept name). Instead of simply displaying raw frequencies, STAS provides some organization and color coding, helping us see patterns that might not be obvious at first blush.


For seasoned NLPers, Digital Humanists, and academics, this may seem trivial. But it's important to come at this from the perspective of the average employee of a firm who has neither the time nor the inclination to learn R, Python, or any other superior data analysis tool set. The sort of person who is intrigued by word clouds would be much better served by STAS. I'm convinced that's true.

But now we get to something STAS provides that is much harder to find in a user friendly tool: sentiment analysis. Sentiment analysis has exploded in recent years and is now one of the hottest industry NLP applications since spell checkers. I have a particular soft spot in my heart for sentiment analysis because I went to a graduate school that can be said to be one of its birthplaces: SUNY Buffalo. Lost in the hustle and bustle of modern day SA is the history of Jan Wiebe's seminal 1990 dissertation Recognizing Subjective Sentences: A Computational Investigation of Narrative Text in the CS department at SUNY Buffalo. I first heard her give a talk at Buffalo on emerging sentiment analysis techniques in the late 90s (damn! that looks weird now that I typed that, but yeah, it was the frikkin 90s. I guess I'm old. I feel 17. Is that normal?).

Anyhoo... STAS can provide automatic sentiment analysis of phrases. For example, using its default parameters, it found 67 positive terms and 57 negative terms. Note that it associates sentiment with Types, which are higher order groupings like Locations, Organizations, People (common information extraction stuff). Some of its positive terms were thanks, enjoy, work hard, great, solve, interests, reasonable, worth. Here's the screenshot. Note the lower left pane lists the set of types STAS clustered and the right pane lists the actual paragraphs from the SOTU.


Another interesting sentiment analysis result is the type "PositiveBudget". If you look at that lower left pane, you'll see that "Budget" is a type by itself (think of this as a set of budget related entities) and STAS provides a subset of budget items that have a positive sentiment attached: economic, free, affordable. The screenshot below shows some of these items:


I have only scratched the surface of the capabilities of STAS and my 14 day trial ends tomorrow, so this is my farewell to this tool. I am disappointed I didn't have more time to really dig deep. I think it's a well designed UI with some serious NLP behind it.

The take home is this: if you want a lot of people to use your product, you better make it usable for a lot of people. Obscure APIs with dense or non-existent documentation are fine for the hacker set, but the average person is not gonna learn R, Python, Scala, MapReduce, or Hadoop. Most professionals didn't want to learn SPSS in the first place; they sure as hell don't want to learn something else. Don't punish your customers just because they don't want to learn your pet framework. STAS wins because it doesn't punish its customers. It helps them.

NOTES
*not only does this make the comparison fair, but it cuts down on my learning curve requirements :-)

**Let me reiterate: I do not work for IBM and this is not a sponsored blog in any way. These thoughts are entirely my own. I once worked for IBM briefly over 5 years ago and I still get the occasional IBM recruiter contacting me about opportunities, but this is my personal blog and all content is my own and reflects my honest, personal opinions.

Wednesday, February 13, 2013

IBM SPSS Text Analytics: A Shakespearean Stage

This is the third in a series of posts about IBM's SPSS Text Analytics platform STAS (first post here, second here). These tests were performed Tuesday eve.

Yet again, my work-a-day schedule was a bit light on free time, so I didn't get to dig as deep as I had wanted to (that 14 day free trial is tic-toc-ing away, a low, dull, quick sound, such as a watch makes when enveloped in cotton, but this only increases my fury, as the beating of a drum stimulates the soldier into courage).

With tonight's SOTU speech, I of course want to use my last day (tomorrow) to run a bake off between the inevitable word frequency analyses that will pop up and STAS's more in-depth tools.

So, for tonight, I went back to my literary roots and performed a sort of simple Digital Humanities analysis of Shakespeare's 154 Sonnets. I used the free Project Gutenberg version of the Bard's Sonnets. I had to do a little document pre-processing, of course (okay, I had to do A LOT of pre-processing). I've already noted that STAS requires unstructured language data to be ingested via cells in a spreadsheet, so I pasted each sonnet into its own cell, then ran STAS' automatic NLP tools. The processing took all of 90 seconds.
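If you're wondering what "A LOT of pre-processing" looks like, it's roughly this, sketched from memory with assumptions about the Gutenberg file's layout (namely that each sonnet is headed by a roman numeral on its own line, and that the file is saved locally under the hypothetical name 'shakespeare_sonnets.txt'):

```python
import re

raw = open("shakespeare_sonnets.txt").read()
# Split wherever a line consists solely of a roman numeral (the sonnet number).
sonnets = [s.strip() for s in re.split(r"\n\s*[IVXLC]+\s*\n", raw) if s.strip()]
print(len(sonnets))   # should come out to 154 once the Gutenberg header/footer is trimmed
# Each entry then goes into its own spreadsheet cell, one sonnet per row.
```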

What this gave me was a set of what IBM calls "concepts" and "types." I take "concepts" to be roughly synonyms, with the most frequent lexeme used as the exemplar of the concept. For example, STAS identified a concept it called "excellent" with 40 linguistic items including "great", "best", and "perfection" (see image below).


So far, I'm pretty impressed. Remember, STAS only took about 90 seconds of processing to produce all this. And this isn't half of what it did.

While I'm impressed, I saw some clear bad apples. For example, STAS generated a concept called "art", but the set of linguistic items it included in that concept is overly tied to the literal string a-r-t; see below:

However, to give the IBM crew the credit they deserve, they never claim the default parameters are right for every project. They clearly state that the categorization process is iterative, requiring human-in-the-loop intervention to maximize the value of this tool. I simply let the default parameters run and looked at what popped out. STAS provides many ways to manually intervene in the process and affect the output in positive ways.

In addition to providing concepts (synonyms), STAS analyzes "types", which I take to be higher order groupings like entities. For example, it will identify Organizations, Dates, Persons, and Locations. These types are well known to the information extraction (IE) set. This is the bread and butter of IE.

For example, STAS identified a type it called "budget" with items like "pay", "loan", and "fortune". See the screenshot below for examples.

Another interesting example of a type that STAS identified in 90 seconds is "eyes", including "bright eyes", "eyes delight" and "far eyes".

The "types" are not typical types that IE pros are used to dealing with, but I suspect that's a function of the Shakespeare corpora I used. I previously ran some tweets through it and the types were more typical, like Microsoft and San Francisco and such.

I haven't delved deep into STAS's sentiment analysis toolkit, but it does provide a variety of ways of analyzing the sentiment expressed within natural language. For example, below shows some of the positive sentiment words it identified.

Keep in mind that the more powerful tools it provides (which I haven't played with yet) allow querying language data for things like Food + Positive, to capture positive opinion about food in a particular Shakespeare play or scene.

With that, I'm truly looking forward to pitting STAS against the SOTU word count illiterati that will cloud the airwaves.

Friday, February 8, 2013

Black Swan Linguistics: ITF-DF?

Is there such a thing as Inverse Term Frequency-Document Frequency as a metric? The logic would be similar to TF-IDF.

For TF-IDF, if a word is rare in a large document set, but common in three documents, you can infer that the word is probably relevant to the topic of those three documents. For example, if the word "oncologist" is rare in a corpus of 1 million documents, but occurs frequently in three particular documents, you can infer that those three documents are probably about the topic of cancer. The inverse frequency of "oncologist" tells you something about the shared topic of those documents.
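In code, that intuition is only a couple of lines. A generic sketch (mine; real libraries use slightly different smoothing and normalization schemes):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """doc is a list of tokens; corpus is a list of such token lists."""
    tf = Counter(doc)[term] / len(doc)                 # how common the term is in this document
    df = sum(1 for d in corpus if term in d)           # how many documents contain it at all
    idf = math.log(len(corpus) / (1 + df))             # rarer across the corpus -> higher weight
    return tf * idf
```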

But does the opposite tell us something too? If a word is used frequently in a lot of documents in a large corpus, but infrequently in one subset of documents, except in one local sub-subset in which it occurs highly frequently (e.g., one page of a novel), can we infer something about those sub-subset passages? For example, imagine this were true: "Hunter Thompson rarely used the word 'lovely', but he uses it four times on one page of Hell's Angels, which really tells us something about the tone of that particular passage." That "lovely" page would be a sub-subset of the Hunter Thompson subset.

I would call it something like a Black Swan usage.

My hunch is this could be done with an approach similar to the Divergence from Randomness model (which I can only claim to understand at a gross intuitive level, not an in-the-weeds algorithmic level). My hunch is that Black Swans are not quite the same as simply diverging from randomness, because you need the average within-document occurrence rate over the whole corpus, the average within-document occurrence rate over the subset (the author), and the Black Swan occurrence rate over the sub-subset (among other variables, I suppose).
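Lacking a real technique, here's the back-of-the-envelope version of what I mean, just to make those three rates concrete. This is entirely my own sketch, not an established metric, and the scoring formula at the end is one crude guess among many possible:

```python
def rate(tokens, word):
    # Per-token occurrence rate of a word in one document.
    return tokens.count(word) / max(len(tokens), 1)

def black_swan_score(corpus, author_doc_ids, passage_tokens, word):
    """corpus: dict of doc id -> token list; author_doc_ids: the author's subset of doc ids."""
    corpus_rate = sum(rate(toks, word) for toks in corpus.values()) / len(corpus)
    author_rate = sum(rate(corpus[d], word) for d in author_doc_ids) / len(author_doc_ids)
    passage_rate = rate(passage_tokens, word)
    eps = 1e-9
    # High when the word is common corpus-wide, rare for this author overall,
    # yet spikes inside this one passage -- the "lovely" page in Hell's Angels.
    return (corpus_rate * passage_rate) / (author_rate + eps) ** 2
```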

If anyone knows of techniques to do this, please lemme know.

Ahhhh, Friday night, red wine, and goofing on corpus linguistics algorithms....

Thursday, February 7, 2013

Louder Than Words - Book Review Part 3: Ch 5-7

I've finished the next chunk of Ben Bergen's embodied cognitive linguistics book Louder Than Words, putting me just about 2/3rds of the way through. I'll be putting together a single, overview review that is more linear than these interim posts, but I want to share a few thoughts before then.

  • This middle section of the book finally gets deeper into the weeds of cognitive linguistics and is more satisfying to me than the earlier, more intro stuff.
  • He does a nice job of making the case that constructions like "She Verbed the box to her friend" can add semantics beyond what the verb adds. Good for him. I like a good construction.
  • He tends to go on and on describing one experimental paradigm after another, factory-like. Even for a methodology geek like me, it gets a bit tiresome. It's easy to lose the big picture.
  • He cites a very tightly connected set of researchers. They all agree with each other and pat each other on the back. The Chomskyans are famous for this, and it is not a lead I recommend following. Bergen never really addresses serious critics. He does play the devil's advocate game, but only as a segue into his next presentation of 16 experimental methodologies, one after another (occasionally he gives us a crudely drawn picture or Excel graph).
  • Chapter 7 switches gears a bit because he begins to discuss ways in which highly specialized experiences might affect cognitive processes. Do hockey players process input differently than non-hockey players because of their hockey experience? He presents data suggesting that they do (though he's quick to point out that this is all very preliminary). This struck me as laying the groundwork for the inevitable neo-Whorfian, linguistic relativism argument that language affects thought. I have blogged about this myself, with some constructive skepticism. Bergen has worked with Lera Boroditsky, the queen of neo-Whorfianism, so it's easy to predict that he was gonna get around to that sooner or later.
My main critique is that Bergen has set himself up for an audience nightmare. Who is he writing for? Sometimes he writes for me, a person with advanced cognitive linguistics training who loves experimental methodologies. At other times, he's talking to my 90 year old grandmother. Every now and again, he whispers an aside to 1980s pop culture aficionados. The problem is that none of us is satisfied. I grant that this topic is inherently difficult to write about because it blends detailed scientific methodology with freaky, unexpected mental behaviors. But that is Bergen's challenge. He asked for it.



Wednesday, February 6, 2013

IBM's SPSS Text Analytics - A Linguist's Perspective

This is the second in a series of posts about IBM's SPSS Text Analytics platform STAS (first post here). I haven't had much time to do anything more than review the documentation, but I must reiterate that this platform is clearly not for serious "big data" scientists. It is not a Formula One race car, but rather a Ford Model T. But let's not forget that the Model T introduced the car to the average workingman, and as such it ushered in a revolution in productivity. I'm not sure STAS will usher in a revolution, but I am even more impressed with this toolkit today simply because it has managed to package a variety of legit NLP tools into a user interface that is intuitive and easy to use. This is not to say it is perfect; nothing is. But I cannot think of any software that provides this much basic NLP functionality in such a user friendly interface. I'm surprised by how impressed I am*. I am a skeptic by nature. I'll tell you this: STAS is a damned better tool than frikkin Wordle! If all the people who go bonkers for mindless, uninformative word clouds could use STAS for an hour or two, they'd give up word clouds forever.

More than anything else, I'm impressed with its documentation. The STAS User's Guide (PDF) is 240 pages long and remarkably well written, with examples aplenty. I'm sure we've all grown tired of the poor-to-nonexistent documentation of much NLP software. The STAS Guide seems to actually be written by teachers, people who want the average person to learn how to use this tool properly. Quick note: I have often happily promoted the use of NLTK on this blog, and still do, but that is a free tool set. STAS is not free. In fact, it's really damned expensive. But what you get for all that money is a tool that average employees will actually use.

After all this gushing I feel compelled to swear to you: this is NOT a paid endorsement. I have no connection to IBM (other than the one I've previously explained, repeated below*).

So, exactly what text analytics can you do with STAS? At the very least, you can do all of the following with little effort:

  • Tokenize linguistic input (they call this "componentize").
  • Build stop word lists.
  • Build specialized word lists (they call these "libraries").
  • Lemmatize (or "stem") content words (they call this "de-inflecting").
  • Auto cluster linguistic terms a la simple topic modeling (they call these "categories").
  • Manually adjust topic categories as desired.
  • Auto-chunk linguistic input (they call these "terms").
  • Search for ngram chunks with Boolean and Kleene operators.
  • Search for collocates within a window of co-occurrences.
  • Build custom clusters based on ngrams (up to trigrams).
  • Auto-assign sentiment to linguistic chunks.
  • Manually "fix" sentiment labels.
  • Auto-translate content into English.
  • Visualize data.

Many of these things can be done automatically, but STAS also allows considerable manual review and revision.
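For readers who do want the command-line route, a rough NLTK equivalent of a handful of those bullet points might look like the sketch below. This is my own sketch, not anything STAS does internally; the input filename is hypothetical, and NLTK's tokenizer, stopword, and collocation modules (with their data packages) are assumed to be installed.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = open("survey_responses.txt").read()                               # hypothetical input
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]    # "componentize"
stops = set(stopwords.words("english"))                                  # a stop word list
content = [t for t in tokens if t not in stops]
stems = [PorterStemmer().stem(t) for t in content]                       # "de-inflect"
trigrams = list(nltk.ngrams(content, 3))                                 # ngram "terms"

# Collocates within a window of co-occurrence, ranked by PMI.
finder = BigramCollocationFinder.from_words(content, window_size=5)
print(finder.nbest(BigramAssocMeasures().pmi, 10))
```

That handful of lines is exactly the kind of thing STAS hides under the hood, which is its whole selling point.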

My 14 day trial is tic toc-ing away. I can't do much with an hour or two in the evenings, but I hope to spend this coming Saturday constructing a set of 100 or so test documents (single sentences, most likely) that will put STAS through its NLP paces, so to speak. Until then, I want to mention a couple of little linguistic quibbles.

  • The documentation consistently misuses the term "synonym". They use it to mean two words that share a root like "opportunities" and "opportunity" (p. 114). Their basic point is fine (that these two words, after stemming, can be grouped together semantically), but there's no reason to use the word "synonym" for this.
  • I haven't found a discussion of how they chunk their terms. They explicitly state that, post-chunking, the "terms/chunks" are treated as bags of words, but how do they parse their chunks to begin with?
  • There is a short discussion of "semantic networks" which sounds an awful lot like WordNet to me, but no mention of WordNet is made (a quick example of the kind of lookup they seem to be describing is sketched below).
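Purely for illustration, and explicitly not a claim about how STAS works internally, this is the sort of WordNet lookup I have in mind, via NLTK (assuming the 'wordnet' data package is downloaded):

```python
from nltk.corpus import wordnet as wn

# Each synset is a cluster of near-synonyms for one sense of the word.
for synset in wn.synsets("opportunity"):
    print(synset.name(), synset.lemma_names())
```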


*Let me reiterate: I do not work for IBM and this is not a sponsored blog in any way. These thoughts are entirely my own. I once worked for IBM briefly over 5 years ago and I still get the occasional IBM recruiter contacting me about opportunities, but this is my personal blog and all content is my own and reflects my honest, personal opinions.

Monday, February 4, 2013

IBM SPSS Text Analytics - Any Good? Yes.

I recently discovered that IBM bought SPSS a few years ago and is now providing a Text Analytics package called IBM SPSS Text Analytics for Surveys (producing the acronym STAS, which is either a stats package or an STD). I thought I'd take it out for a test drive, so I downloaded a 14 day trial version. Before using it, I reviewed these excellent tutorials: Analytics Blog and RTI's SurveyPost blog.

For data I could have used their sample data, but I decided to download my Twitter archive. Unfortunately, this caused me some pre-processing hassle. You see, STAS is technically designed for survey data: it expects unstructured language to be in the form of comment responses to questions, and it expects those comments to be stored in single cells within a column in a spreadsheet. (I see no reason in principle why it couldn't be used for analyzing any unstructured data. You just have to package the data in a format SPSS will accept, namely a spreadsheet with the unstructured data all in one column.)

Also, STAS does not directly ingest CSV files or Open Office ODC files. Apparently it only accepts inputs of four types: its own file type, Excel, ODBC, and what they call “Data Collection”, which I haven't investigated.

Once you open a file, you are asked to drag-and-drop the column name containing your language data into an "Open Ended Text" box (refer to Analytics Blog for screen shots). While I appreciate the simplicity of the drag and drop functionality, my Twitter data had tokens separated into separate columns (which I thought was weird. Let me do my own tokenization, please!). STAS's design choice means I needed to pre-process my data files: I had to merge the many token columns into a single column of text. Document pre-processing is common in language analysis, but STAS is supposed to be a platform that is easy for non-engineers to use. These file ingest and pre-processing steps are tedious and uninteresting, and exactly why most people get frustrated. These things can be automated, and it is a platform like STAS that ought to be doing this for me.

Also, it seems to only ingest a single file at a time. My Twitter data came to me separated by month, so I have 38 files. I can manually merge them, but that's more work for me. There's really no reason STAS can't let me select multiple identically formatted files and then merge them itself.
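For the record, the merge itself is the sort of thing a few lines of pandas handles. A sketch of my own; the folder layout and column names here are assumptions about my Twitter export, not anything STAS provides:

```python
import glob
import pandas as pd

# Concatenate the monthly export files into one frame.
frames = [pd.read_csv(path) for path in sorted(glob.glob("tweets/*.csv"))]
df = pd.concat(frames, ignore_index=True)

# Collapse the per-token columns back into one text column for STAS.
token_cols = [c for c in df.columns if c.startswith("token_")]   # hypothetical column names
df["text"] = (df[token_cols].fillna("").astype(str)
              .apply(lambda row: " ".join(row).strip(), axis=1))

df[["text"]].to_excel("tweets_for_stas.xlsx", index=False)       # Excel is a format STAS accepts
```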

I was surprised and impressed that the software immediately offered me an opportunity to translate non-English comments with a single click. Simple and easy. Quality is what it is with MT; don't blame STAS if it's a crappy translation. No matter how you slice it, it's a great function. Kudos.

I was super impressed that it will crawl the data and suggest code categories like key concepts. This is essentially topic modeling (though not as sophisticated as something like LDA; the User Guide has a whole chapter devoted to describing the details, but I haven't had time to dig in yet). Color coded clusters of concepts are a very nice feature. Colors seem to refer to entity types (Person, Org, etc.). You can collapse all concepts into just the key exemplars of each cluster. There are also several nice filtering options to help you understand what your data is centered around. Here's a screenshot of my final output:


I can see key concept frequencies and filter by that. That's nice. Next steps: Can I see simple word frequencies? Ngrams?

Sentiment analysis can be done with respect to specific categories (food + positive). Pretty easy, but SPSS should mitigate lay people's over-indulgence in sentiment analysis, which is tricky and not as easy as this makes it look. This is where making something easy backfires. How can STAS encourage double-checking the data? Gold standards, sampling, etc.

No doubt, this is easy to use. An academic has the luxury of ignoring people who don't want to learn command line tools or programming languages, but the businessman does not. There's a ton of language data out there owned by thousands of companies and those companies are never going to get their regular employees to learn R just to analyze it. For them, STAS is a legitimate tool that will actually allow the average employee to dig into unstructured data. That's a win.

*In the interest of full disclosure: I do not work for IBM and this is not a sponsored blog in any way. These thoughts are entirely my own. I once worked for IBM briefly over 5 years ago and I still get the occasional IBM recruiter contacting me about opportunities, but this is my personal blog and all content is my own and reflects my honest, personal opinions.
