Sunday, January 27, 2013

Free Online NLP Resources - NLTK Still Rules Them All

I recently received an email from a US undergraduate interested in tools and resources for NLP, particularly free tagged corpora. Luckily, the NLP field has matured into an open-access-friendly crowd, so there are lots of resources freely available. Maybe too many. To be honest, too many search hits are a pain. Newbies aren't looking for ridiculously long lists of resources that they have to pick through, exactly BECAUSE they're newbies! They don't know how to choose between them. And all too often, experienced NLPers will simply push their pet language or resource not because it's appropriate for newbies, but because it's the expert's pet.

So my unsolicited teachable moment #333256: give newbies/students recommendations that are appropriate for them, not appropriate for you.

For example, with all due respect, no newbie NLPer should go anywhere near the Stanford NLP Annotated List of Resources. I'm the first to admit that's a GREAT list of resources. No argument from me. But most of those resources require at least basic familiarity with NLP before starting (most require more).

For true newbies, the Natural Language Toolkit (NLTK) remains my preferred option. Its excellent teaching book, tutorials, packaged corpora and data, and solid documentation make it the reigning king of NLP intro tools. Plus, it's a mature enough toolkit to be used for more extensive projects. Hard to go wrong.

FWIW, this post is not a paid endorsement of any kind. I have no professional or personal relationship with anyone involved in the NLTK. I follow several people involved with the project on Twitter. That's as close to personal involvement as I get. This post is not meant as a commercial advertisement, but rather as my own personal opinion.

2 comments:

Matías Guzmán said...

I started with the NLTK, but I've moved to Java and R since I'm not THAT interested in NLP, but rather Corpus Linguistics. But yeah, the NLTK is great for beginners.

Chris said...

Matías: yes, I hear a lot of that. NLTK is a great starter, and mature enough to handle some serious projects. But R is like a black hole: it just seems to be drawing all things statistical into it.