― Samuel Beckett, Waiting for Godot
Geoffrey Pullum posted a third lament about the current state of NLP: Speech Recognition vs. Language Processing. Here are his first two:
One: Why Are We Still Waiting for Natural Language Processing?
Two: Keyword Search, Plus a Little Magic.
I have responded twice.
One: Pullum thinks there are no NLP products???
Two: Pullum’s NLP Lament: More Sleight of Hand Than Fact.
My Third Response
The more I read Pullum’s three laments, the more I keep asking myself, “exactly what is Pullum complaining about and who is he aiming his complaints at?”
As far as I can tell, Pullum is complaining that commercial forces have lured researchers away from creating his dream of a human-mimicking android like 2001’s HAL 9000 or Star Trek’s Data.
This is like saying we’re still “waiting for NASA” because they failed to give us moon houses and jet packs! Is Pullum similarly unmoved by the Mars Curiosity rover? C’mon Geoff, it’s got a frikkin laser on its head!
He’s aiming his complaints at people who know nothing about linguistics or NLP (an easy audience to convince with straw men and misrepresentation).
“...we are still waiting for natural language processing (NLP).”
Who? Who is still waiting? I’m not waiting. I’m jumping head first into the ocean of NLP tools available right now. Who’s waiting?
“...some companies run systems that enable you to hold a conversation with a machine. But that doesn’t involve NLP, i.e. syntactic and semantic analysis of sentences.”
This is a rhetorical sleight of hand because he is about to stack the deck and compare one petite tool to his grand Platonic ideal. Pullum continues to use a straw man definition of NLP that 99% of people who use the term do NOT agree with. He is wrong to insinuate that contemporary NLP cannot perform “syntactic and semantic analysis of sentences.” Of course it can. At the very least it exists in the form of POS taggers, chunkers, semantic role labeling, dependency parsers, etc. The fact that most VUI tools do not employ these extra processing components is mostly a function of optimization, not ontological failure. He dismisses this as merely "dialog design", but it's what gets products working for real consumers in the here and now. Pullum also unreasonably demotes phonetics as if it were not part of linguistics. There are many NLP tools related to speech recognition, which is where his third post goes. His punching bag for this argument is Automatic Speech Recognition.
By doing this, he creates a new straw man. What he actually describes is closer to what industry calls Voice User Interface (VUI). The distinction is non-trivial because VUI is a limited special case of ASR, not the whole kit and caboodle. Yes, there are VUI systems which are designed to nudge users to provide responses within a limited predictable range, but there are also far more sophisticated ASR systems (like Nuance’s Dragon). These systems can produce text transcripts of voice that can then easily be ingested into any number of syntactic and semantic NLP processing tools. Ignoring them is journalistic malfeasance. Pretending they don’t exist is bonkers.
“Labeling noise bursts is the goal [of VUIs], not linguistically based understanding.”
This is true, but it’s not the whole picture. It’s true that VUIs are primarily trying to categorize noise bursts, but that’s the first step in the human language comprehension system too. It’s true that humans use some top-down context for predicting the likelihood of words in a continuous speech stream, but there’s plenty of bottom-up processing that is little more than “labeling noise bursts” (one of my favorite examples of this is Voice Onset Time for classifying speech segments). In focusing on this, VUIs are simply choosing one small part of the great human language puzzle to address.
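To make the Voice Onset Time point concrete, here is a minimal sketch of what "labeling noise bursts" can look like when the only cue is VOT. The 25 ms boundary and the measurements are assumptions chosen for illustration; real category boundaries shift with place of articulation, speaker, and speaking rate, and the function name is hypothetical.

```python
# Toy "noise burst" labeler: classify an English stop consonant as voiced or
# voiceless from its Voice Onset Time (VOT) alone.

# Assumed category boundary in milliseconds (illustrative, not a measured value).
VOT_BOUNDARY_MS = 25.0


def label_stop(vot_ms: float) -> str:
    """Return 'voiced' for short-lag VOT and 'voiceless' for long-lag VOT."""
    return "voiceless" if vot_ms >= VOT_BOUNDARY_MS else "voiced"


# Hypothetical VOT measurements (ms) for the word-initial stops in these words.
measurements = {"bat": 8.0, "pat": 62.0, "bin": 12.0, "pin": 55.0}

labels = {word: label_stop(vot) for word, vot in measurements.items()}
for word, label in labels.items():
    print(f"{word}: initial stop labeled {label}")
```

Nothing in this sketch "understands" anything, yet a threshold on a single acoustic measurement is a plausible first approximation of a step that human listeners also perform bottom-up.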
“Current ASR systems cannot reliably identify arbitrary sentences from continuous speech input. This is partly because such subtle contrasts are involved. The ear of an expert native speaker can detect subtle differences between detect wholly nitrate or holy night rate, but ASR systems will have trouble.”
Pullum plays a little sleight-of-hand trick here as he switches from talking about word segmentation to sentence breaking. These are two different tasks. Yes, human beings are very good at word segmentation and yes, ASR is mediocre, but ASR is better than he suggests and humans are not infallible word segmenters. He overstates his premise (as pointed out by his very first commenter). So, when he says that “The ear of an expert native speaker can detect subtle differences between detect wholly nitrate or holy night rate, but ASR systems will have trouble” he’s only kinda right. In fact, plenty of “expert native speakers” would have trouble segmenting those two phrases if spoken in isolation and a well-trained ASR system could very well segment those phrases successfully.
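A small sketch shows why these two phrases are hard for anyone, human or machine, when heard in isolation. It assumes a toy lexicon with simplified ARPAbet-style transcriptions (stress marks dropped, "holy" and "wholly" treated as homophones) and enumerates every way of carving one phoneme string into known words. The lexicon and function name are illustrative assumptions, not part of any real ASR system.

```python
# Toy pronunciation lexicon: phoneme tuples mapped to the words they can spell.
# Transcriptions are simplified ARPAbet-style assumptions for illustration.
LEXICON = {
    ("HH", "OW", "L", "IY"): ["holy", "wholly"],
    ("N", "AY", "T"): ["night"],
    ("R", "EY", "T"): ["rate"],
    ("N", "AY", "T", "R", "EY", "T"): ["nitrate"],
}


def segmentations(phones):
    """Return every way to split the phoneme sequence into lexicon words."""
    if not phones:
        return [[]]  # one way to segment nothing: the empty segmentation
    results = []
    for end in range(1, len(phones) + 1):
        # Try each prefix as a word, then segment whatever remains.
        for word in LEXICON.get(tuple(phones[:end]), []):
            for rest in segmentations(phones[end:]):
                results.append([word] + rest)
    return results


stream = ["HH", "OW", "L", "IY", "N", "AY", "T", "R", "EY", "T"]
parses = [" ".join(words) for words in segmentations(stream)]
for parse in parses:
    print(parse)
```

At the level of phoneme symbols the string is fully ambiguous: both "wholly nitrate" and "holy night rate" come out as valid parses. Whatever tells them apart has to come from fine acoustic detail or from context, which is exactly the kind of evidence a trained recognizer can also exploit.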
Having said all that, I agree with Pullum’s underlying point that human language comprehension is mysteriously complex and intertwined with a host of non-linguistic processes like logic and memory (making "linguistically based understanding" very challenging indeed). But this is not really a fair indictment of contemporary NLP. Yes, NLPers typically narrow their focus in order to build working tools that solve one small part of the great language processing puzzle, but put those tools together in a pipeline and you can create some pretty impressive functionality.
As Pullum knows, human speech comprehension involves a complex mixture of processes and is not entirely understood by linguists even today. Understanding speech comprehension is an ongoing project in linguistics, not a finished one. Once linguists have a fully specified model of speech comprehension, I’m sure the engineers at Nuance would be happy to model it computationally. But until we linguists provide that, they’re stuck kludging a solution. If linguists are going to complain about NLP’s failures, it’s *us* we should complain about.
“...the extent to which speech as such is being processed and understood (i.e., grammatically recognized and analyzed into its meaningful parts) is essentially zero.”
Zero? Really? ZERO!? Pullum was being disingenuous at best, obtuse at worst, when he wrote that. Again, his conclusion rests crucially on the straw-man comparison of one kind of limited ASR with his pie-in-the-sky fantasy of what NLP should be. This is unfair.
What Pullum refuses to tell his audience is that it is within the bounds of contemporary NLP to automatically segment a continuous human speech stream into words and then parse those words into many different grammatical and semantic categories like Parts of Speech, Subject-Verb relations, coreference, concrete nouns, verbs of motion, named entities, etc. All of this can be done by NLP tools today, right now, not in the future, by you if you have a few hours to download and learn the tools. Examples include the CMUSphinx Open Source Toolkit For Speech Recognition, Stanford NLP, and OpenNLP.
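The shape of such a pipeline can be sketched in a few lines. Every stage below is a stand-in: `recognize` returns a canned transcript where a real recognizer such as CMUSphinx would go, and the tagger is a tiny hand-written lookup table where Stanford NLP or OpenNLP would go. None of the function names or the tag table reflects those tools' actual APIs; they are assumptions made only to show how the stages chain together.

```python
from typing import List, Optional, Tuple


def recognize(audio_path: str) -> str:
    """Stand-in for an ASR engine: ignores the audio, returns a canned transcript."""
    return "the rover fired its laser at the rock"


def tokenize(transcript: str) -> List[str]:
    """Split a transcript into word tokens on whitespace."""
    return transcript.split()


# Hypothetical lookup table standing in for a trained part-of-speech tagger.
TOY_TAGS = {
    "the": "DET", "rover": "NOUN", "fired": "VERB", "its": "PRON",
    "laser": "NOUN", "at": "ADP", "rock": "NOUN",
}


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Attach a part-of-speech tag to each token ('UNK' if not in the table)."""
    return [(token, TOY_TAGS.get(token, "UNK")) for token in tokens]


def subject_verb(tagged: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Crude heuristic: pair the first noun with the first verb that follows it."""
    subject = None
    for word, pos in tagged:
        if subject is None and pos == "NOUN":
            subject = word
        elif subject is not None and pos == "VERB":
            return (subject, word)
    return None


# Speech in, grammatical analysis out: each stage feeds the next.
tagged = tag(tokenize(recognize("utterance.wav")))
print(tagged)
print(subject_verb(tagged))
```

The design point is the hand-off, not the toy stages: once a recognizer emits plain text, any downstream tagger, parser, or entity recognizer can consume it without knowing the words started out as sound.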
Pullum might complain that this NLP pipeline wouldn’t count because it wouldn’t accomplish its tasks the same way the human mind accomplishes those language tasks (and would do them more slowly). But I repeat that it is linguists who have failed to specify exactly how the human brain accomplishes those tasks.
If Pullum is still waiting for NLP, it's because he's blaming his boots for the faults of his feet.
ADDENDUM: To be clear, I respect Geoffrey Pullum quite a lot as a great linguist who has contributed (and continues to contribute) tremendous value to the field of linguistics, and language research in general. Anything I've written in my three responses to his NLP posts which might suggest otherwise is most likely the product of my uncertainty about his goals in writing these posts. I admit to feeling a bit free to employ some rhetorical flourish here and there partly because Pullum himself is quick with the lexical blade. It's fun to poke back.