Wednesday, November 21, 2007

YIKES! or The New Information Extraction

The term information extraction may be taking on a whole new meaning in the greater world, quite different from what computational linguists would have it mean. As someone working in the field of NLP, I think of information extraction in line with the Wikipedia definition:

information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured machine-readable documents.

But my colleague pointed out a whole new meaning to me a couple of weeks ago, the day after an episode of the NBC sitcom My Name Is Earl aired (11/1/2007: Our Other Cops Is On!). Thanks to the wonders of The Internets, I managed to find a reference to the sitcom’s usage at TV

Information extraction in a post-9/11 world involves delving into the nether regions of suspected terrorists....

In other words: TORTURE! The law of unintended consequences has brought the world of NLP and the so-called War on Terror into sudden intersection (yes, there are "other" intersections... shhhhhhh, we don't talk about those). Perhaps the term IE is obsolete in CL anyway. Wikipedia describes it as a subfield of IR. Manning & Schütze’s new book on the topic is called Introduction to Information Retrieval, not Introduction to Information Extraction. They define IR, at the link above, essentially as finding material that satisfies information needs (note: I'm not quoting directly because the book is not yet out).

Quibbling over names and labels of subfields is often entertaining, but it’s ultimately a fruitless endeavor. I defer to Manning & Schütze on all things NLP. Information Retrieval it is.

