I've long been keen on question answering (QA) research because my first-ever NLP gig was as a grad student intern at a now-defunct question answering start-up in 2000 (QA was all the rage during the 90s tech bubble). QA is somewhat special among NLP fields because it combines all of the others into a single, deeply complex pipeline.
When I saw this paper Tweeted by Mat Kelcey, I was excited by the title, but after reading it, I suspect the constraints of their task make it not quite applicable to commercial QA applications.
Here are some thoughts on the paper, but to be clear: these comments are my own and do not represent in any way those of my employer.
What they did:
Took question/answer pairs from a college Quiz Bowl game and trained a neural network to find answers to new questions. More to the point, "given a description of an entity, [they trained a neural net to] identify the person, place, or thing discussed".
- They used factoid questions from a game called Quiz Bowl
- Factoid questions assume small, easily identifiable answers (typically one word or maybe a short multi-word phrase)
- If you’re unfamiliar with the format of these quiz bowl games, you can play something similar at bars like Buffalo Wild Wings. You get a little device for inputting an answer and the questions are presented on TVs around the room. The *questions* are composed of 4-6 sentences, displayed one at a time. The faster you answer, the more points you get. The sentences in the question are ordered by how much information they give away. The first sentence gives very little away and is presented alone for maybe 5 seconds. If you can’t answer, the second sentence appears for 5 seconds, giving a bit more detail. If you still can’t answer, the third sentence appears, providing even more detail but worth fewer points. And so on.
- Therefore, they had large *questions* composed of 4-6 sentences, providing more and more details about the answer. This amount of information is rare (though they report results of experimental guesses after just the first sentence, I believe they still used the entire *question* paragraph for training).
- They had fixed, known answer sets to train on, plus (annotated) incorrect answers.
- They whittled down their training and test data to a small set of QA pairs that *fit* their needs (no messy data) - "451 history answers and 595 literature answers that occur on average twelve times in the corpus".
- They could not handle multi-word named entities (so they manually pre-processed their corpus to convert these into single strings).
- Their use of dependency trees instead of bag o' words was nice. As a linguist, I want to see more sophisticated linguistic information used in NLP.
- They jointly learned answer and question representations in the same vector space rather than learning them separately because "most answers are themselves words (features) in other questions (e.g., a question on World War II might mention the Battle of the Bulge and vice versa). Thus, word vectors associated with such answers can be trained in the same vector space as question text enabling us to model relationships between answers instead of assuming incorrectly that all answers are independent."
- I found their error analysis in sections 5.2 "Where the Attribute Space Helps Answer Questions" and 5.3 "Where all Models Struggle" especially thought-provoking. More published research should include these kinds of sections.
- Footnote 7 is interesting: "We tried transforming Wikipedia sentences into quiz bowl sentences by replacing answer mentions with appropriate descriptors (e.g., 'Joseph Heller' with 'this author'), but the resulting sentences suffered from a variety of grammatical issues and did not help the final result." Yep, syntax. Find-and-replace not gonna cut it.
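
On the multi-word named entity point above, here is a minimal sketch of what that kind of preprocessing might look like. This is an illustration only, not the authors' code: the entity list, the underscore-joining convention, and the `merge_entities` helper are all assumptions made for the example. The idea is simply to rewrite each known multi-word entity as one token so that a downstream model sees it as a single word.

```python
import re

def merge_entities(text, entities):
    """Rewrite each known multi-word entity as one underscore-joined token.

    Hypothetical helper: the entity list and the underscore convention
    are assumptions for illustration, not details taken from the paper.
    """
    # Handle longer entities first so a shorter, overlapping entry
    # cannot break up a longer one.
    for entity in sorted(entities, key=len, reverse=True):
        token = entity.replace(" ", "_")
        pattern = r"\b" + re.escape(entity) + r"\b"
        text = re.sub(pattern, token, text)
    return text

# Hypothetical entity list, using names mentioned in this post.
entities = ["Joseph Heller", "Battle of the Bulge", "World War II"]
sentence = "This author, Joseph Heller, set a novel during World War II."
merged = merge_entities(sentence, entities)
print(merged)
```

Note that this only works when you already know the entity list and the surface forms match exactly, which is presumably part of why this kind of clean-up ends up being manual.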