FWIW, I spent most of my time misreading the first author's name as Watson instead of Weston, which gave me a false sense of irony :-)
The basic idea is that in order to facilitate rapid development of artificial intelligence question answering applications, someone ought to create a set of standard data sets that are highly constrained as to what they test (gee, who might be the best people to create those? the authors maybe?).
Basically, these are regression tests for reasoning. Each test contains a set of 2-5 short statements (= the corpus) and a set of 1-2 questions. The point is to expose the logical requirements for answering the questions given the way the answer is encoded in the set of statements.
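To make the setup concrete, here's a minimal sketch (my own encoding, not the authors' data format) of such a test: a handful of statements, a question, and a deliberately naive baseline that answers "where is X?" by last mention.

```python
# Illustrative only: a bAbI-style test encoded as plain sentences,
# answered by a deliberately naive "last mention wins" heuristic.
def last_mention_baseline(statements, entity):
    """Answer 'where is <entity>?' by taking the final word of the
    most recent statement that mentions the entity (crude, on purpose)."""
    answer = None
    for s in statements:
        if entity in s:
            answer = s.rstrip(".").split()[-1]
    return answer

statements = [
    "Mary moved to the bathroom.",
    "John went to the hallway.",
    "Mary travelled to the office.",
]
print(last_mention_baseline(statements, "Mary"))  # -> office
```

A baseline this dumb passes single-fact lookups but fails the moment an answer requires combining statements, which is exactly what the harder tasks probe.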
They define 20 specialized tasks, all of them genuinely interesting as AI tests.
Their two simplest tests: 3.1 is a fairly straightforward factoid question-answer pairing, while 3.2 requires that a system extract, store, and connect information across sentences before it can discover the answer. Cool task. Non-trivial.
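To see why the second flavour is harder, here's a sketch (the sentences and the rule-based solver are my own, just in the style of the paper's tasks) where the answer to "where is the ball?" is never stated directly and must be derived by chaining two facts:

```python
# Illustrative only: a two-supporting-facts question. The object's
# location is inferred by chaining "John picked up the ball" with
# "John went to the garden" -- no single sentence holds the answer.
def locate_object(statements, obj):
    holder = None
    locations = {}  # person -> last known location
    for s in statements:
        words = s.rstrip(".").split()
        if "picked" in words and obj in s:
            holder = words[0]
        elif "went" in words or "moved" in words:
            locations[words[0]] = words[-1]
    # two-hop inference: the object is wherever its holder last went
    return locations.get(holder)

statements = [
    "John picked up the ball.",
    "John went to the garden.",
]
print(locate_object(statements, "ball"))  # -> garden
```

The last-mention trick that suffices for the factoid task returns "ball."-style garbage here; the system genuinely has to store and connect facts.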
3.17 is an even more challenging task, one that immediately reminded me of a textual Tarski's World.
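In that spirit, a toy sketch (my own example, not drawn from the paper) of what text-encoded spatial reasoning demands: relations have to be inverted and composed, not merely looked up.

```python
# Illustrative only: positional reasoning over stated relations.
# "The triangle is above the square" must license the unstated
# conclusion "the square is below the triangle".
INVERSE = {"above": "below", "below": "above",
           "left_of": "right_of", "right_of": "left_of"}

facts = {("triangle", "above", "square")}

def holds(a, rel, b):
    """True if the relation is stated directly or via its inverse."""
    return (a, rel, b) in facts or (b, INVERSE[rel], a) in facts

print(holds("square", "below", "triangle"))  # -> True (by inversion)
```

Even this tiny inverter goes beyond pattern matching on the surface text, which is presumably the point of the task.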
The authors admit that few of these tasks are immediately relevant to an industrial-scale question answering system like Watson, but argue that tasks like these focus research on identifying the skills needed to answer such questions.
"Our goal is to categorize different kinds of questions into skill sets, which become our tasks. Our hope is that the analysis of performance on these tasks will help expose weaknesses of current models and help motivate new algorithm designs that alleviate these weaknesses. We further envision this as a feedback loop where new tasks can then be designed in response, perhaps in an adversarial fashion, in order to break the new models."

My Big Take-Aways:
- This is not realistic QA over large data.
- They're testing the artificial recombination of facts into new understanding, not NLP or information extraction per se. It's an AI paper, after all, so ignoring contemporary approaches to information extraction is fine. But can one assume that many of these answers are discoverable in large data collections using existing NLP techniques? Their core design encodes each answer in one very limited way, whereas industrial-scale QA systems exploit large corpora in which answers are typically repeated, often many, many times, in many different forms.
- Their test data is so specific that I worry it might encourage over-fitting of solutions: solving THAT problem, not THE problem.