Monday, February 18, 2013

So You Want To Be A Text Analyst? Get Your Hands Dirty.

Yet again, I find myself responding to posts on a Text Analytics discussion board and I'd like to broadcast my response to a wider audience. After I posted a link to my review of IBM's Text Analytics platform, someone posted this request: "I need the techniques and methods, how to extract a knowledge from a text!!"

First, I want to thank this questioner for the double exclamation points, because if there had been only one, I would have ignored the question as banal and unworthy of a response.

Second, I feel the need to invoke the code of Unfrozen Caveman Lawyer. I'm just a linguist. I'm frightened by the counting machines and the blinking lights. When I see a gradient descent algorithm, I think 'Oh no! Will it converge on a global minimum more quickly if features have similarly scaled values?' I don't know. Because I'm just a linguist -- that's the way I think.

My point is that the tools and techniques underlying text analytics are the math and coding part, and they take a special effort to learn. That's why IBM's software is so enticing. It puts the math and coding stuff under the hood.

If you want to learn the math and coding stuff, I highly recommend the following:

Third, learning the algorithms requires walking through the math with data. It takes several months of regular effort to gain competency, but once accomplished, the rewards are well worth it. There are lots of free data sets these days, like the Enron Corpus.
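
To give a taste of what "walking through the math with data" looks like, here is a minimal sketch of the caveman's gradient descent question. Everything in it is an assumption of mine for illustration: the toy data (a hypothetical document-length feature and a hypothetical capital-letter ratio), the step-size rule (one over the largest curvature of the loss), and the stopping tolerance. It is not taken from IBM's platform or from any particular course.

```python
import numpy as np

def gradient_descent(X, y, tol=1e-6, max_steps=20_000):
    """Batch gradient descent for least-squares regression.

    Returns the weights and the number of gradient evaluations used.
    The step size is 1 / (largest eigenvalue of X^T X / n), a safe
    choice assumed here so both runs are compared on equal terms.
    """
    n = len(y)
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(X.shape[1])
    for step in range(1, max_steps + 1):
        grad = X.T @ (X @ w - y) / n
        if np.linalg.norm(grad) < tol:
            return w, step
        w -= lr * grad
    return w, max_steps  # gave up: step budget exhausted

def standardize(col):
    """Rescale a feature to mean 0 and standard deviation 1."""
    return (col - col.mean()) / col.std()

# Made-up data: two features on wildly different scales.
rng = np.random.default_rng(0)
n = 200
doc_length = rng.uniform(50, 1000, n)   # hypothetical: tokens per document
caps_ratio = rng.uniform(0, 1, n)       # hypothetical: share of capital letters
y = 2.0 + 0.01 * doc_length + 3.0 * caps_ratio + rng.normal(0, 0.1, n)

ones = np.ones(n)  # intercept column
X_raw = np.column_stack([ones, doc_length, caps_ratio])
X_scaled = np.column_stack([ones, standardize(doc_length), standardize(caps_ratio)])

_, raw_steps = gradient_descent(X_raw, y)
_, scaled_steps = gradient_descent(X_scaled, y)

print("steps with raw features:   ", raw_steps)
print("steps with scaled features:", scaled_steps)
```

On this toy data the standardized run should stop long before the raw run does. Since least squares has a single minimum, the only thing scaling changes here is how fast you get there, which is the caveman's question in miniature.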

Fourth, the question about actual techniques is a fair one, but the extraction techniques themselves are far more complicated than any one blog post can address. This is why people complete degrees in computational linguistics or data science. There are many complicated issues and algorithms to master, and that takes time. It's like learning golf or poker. Learning the theory ain't good enough. You have to get your hands dirty. This was my main point about IBM's platform. It does a lot of the work for you, under the hood.

The harder question is this: What does it mean to extract info from text? Business Intelligence is a profitable sector, but what counts as BI extracted from text? Sentiment? Topics? Every business must answer this for itself. I did a brief review of books on Amazon under the search "business intelligence" and I was overwhelmed by the deluge of empty jargon. That's not to say that there isn't important stuff there, but the people who write books about it are probably not the right people to learn from.

I know it can be frustrating for outsiders to a technical field listening to insiders throw buzzwords about without understanding the basics. I am reminded that California's Lieutenant Governor Gavin Newsom was a guest on Colbert recently to talk about reinventing government in a digital age (video). What transpired was a ten-minute stroll down bullshit lane. Newsom spewed forth the most inane and banal set of memorized talking points, which made him look like the most out-of-touch, kiss-your-baby, shake-your-hand, eat-your-gramma's-homemade-pie, do-whatever-it-takes-to-get-your-vote slimy politician since Pappy O'Daniel. And Colbert called Newsom out on his bullshit too. Good for you, Stephen.

Yet I'm still befuddled by the idea of "business intelligence" and those books did me no good. Right now, I think that anything that helps someone make an extra dollar of profit = business intelligence. And there are definitely many extra dollars of profit left lying around in language data.


Matías Guzmán said...

Is there any textbook you could recommend where a linguist could learn the math? I've tried with some general math books, but they get boring because they are not intended for linguists, not even social scientists. I have programming experience with Java, Python and R, but my mathematical skills are not that great. Thanks a lot.

Arthur said...

Another useful post with good recommendations, thanks.

Nitin said...

Great post! Loved the Caveman bit.

Given all that - the purveyors of vacuous books on text and linguistic processing who are making $$ are showing a lot of ... well "business intelligence" :-)

Chris said...

Matías, I would recommend starting with Coursera's online class Machine Learning, taught by Andrew Ng. I was very impressed with how well it covered the math behind ML algorithms (mostly it's an intro to linear algebra) and Ng relates some of the materials back to language data.

Chris said...

Arthur & Nitin: thanks for the positive feedback. There's clearly a growing number of people who want to learn about this stuff and who are looking for resources.