Tuesday, September 25, 2007

Daume on POS tagging

Hal Daume over at his natural language processing blog makes a damned interesting claim (and his commenters basically agree):

Proposition: mark-up is always a bad idea.

That is: we should never be marking up data in ways that it's not "naturally" marked up. For instance, part-of-speech tagged data does not exist naturally. Parallel French-English data does. The crux of the argument is that if something is not a task that anyone performs naturally, then it's not a task worth computationalizing.

His point seems to be that humans naturally translate texts, so that’s worth “computationalizing” (great word, BTW), but humans do not naturally POS tag, so why bother?

Okay, but is that premise true? Is it really the case that humans do not POS tag when processing language? I think it’s fair to say that humans naturally categorize linguistic input, and some of this categorization could be likened to POS tagging. I’m going to need to brush up on my rusty psycholinguistics and make a more substantive post on this later.
