But the recent outing of J.K. Rowling as the one true author of a crime novel published under a pseudonym was interesting not least because the software used to out her is freely available and, as it turns out, shockingly easy to use (too easy?*). You can read how Peter Millican and Patrick Juola uncovered the truth of Rowling's authorship in various places, such as:
Rowling and "Galbraith": an authorial analysis (Language Log)
You can enjoy catching up with the rest of us who have actually been awake the last week or so. What I want to do is play. The software Millican and Juola used is called the Java Graphical Authorship Attribution Program, or JGAAP. It's freely downloadable and user-friendly. Much like I did with IBM's Text Analysis platform, I'm going to perform a few linguistic experiments with JGAAP over the next few weeks.
I downloaded the software and opened the GUI in seconds (though the initial download site was spurious, an email to the developers quickly resolved that).
I'm running this on a modest laptop: Lenovo X100e with AMD 1.6GHz processor, 2.75 usable GB RAM, 32-bit Windows 7 OS.
First, I loaded three known authors:
- Shakespeare - a single text file with all plays.
- Christopher Marlowe - a single file from Gutenberg with most works.
- Francis Bacon - two text files: The Advancement of Learning and Book of Essays.
JGAAP makes it very easy to add all kinds of linguistic and document features to check, along with classifiers to categorize them. On my first try I chose 3 or 4 Canonicizers (which normalize the text for things like white space, punctuation, and capitalization), 5 or 6 Event Drivers (n-grams, word length, POS, etc.), 1 Event Culling (Most Common = 50, which I assume means to only care about the 50 most common tri-grams, word lengths, and POS tags), and WEKA Naive Bayes as the analysis method. Sadly, this failed after about 2 minutes and gave me an error message pointing me to log files. I couldn't find any log files, but I suspect I need to muck with my memory allocation for processing this heavy.
Second, I wised up and chose sparsely: 3 Canonicizers (normalize white space, strip punctuation, unify case), 1 Event Driver (Word Ngram-3), 1 Event Culling (Most Common Events = 50), and the WEKA Naive Bayes Classifier as the analysis method.
This successfully produced results in about 2-3 minutes, though it thinks Francis Bacon wrote Shakespeare's sonnets (and really, who am I to disagree?).
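For anyone curious what that sparse configuration amounts to under the hood, here is a rough sketch in plain Python. To be clear, this is not JGAAP's code or API, and every function name here is my own invention for illustration. It also assumes a simple multinomial Naive Bayes with add-one smoothing and equal priors for each author, which is not necessarily what the WEKA classifier does internally. The toy texts at the bottom are made up, not my actual corpus.

```python
import math
import re
import string
from collections import Counter


def canonicize(text):
    """Unify case, strip punctuation, and normalize white space."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def word_trigrams(text):
    """Event driver: every run of three consecutive words."""
    words = canonicize(text).split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def most_common_events(documents, n=50):
    """Event culling: keep the n most frequent trigrams over all known texts."""
    totals = Counter()
    for text in documents:
        totals.update(word_trigrams(text))
    return [event for event, _ in totals.most_common(n)]


def train(known, vocabulary):
    """Per-author log-probabilities for each kept trigram (add-one smoothing)."""
    vocab_set = set(vocabulary)
    model = {}
    for author, texts in known.items():
        counts = Counter()
        for text in texts:
            counts.update(e for e in word_trigrams(text) if e in vocab_set)
        total = sum(counts.values()) + len(vocabulary)
        model[author] = {
            e: math.log((counts[e] + 1) / total) for e in vocabulary
        }
    return model


def attribute(model, unknown_text):
    """Return the author whose model gives the unknown text the best score."""
    events = word_trigrams(unknown_text)
    scores = {
        author: sum(log_probs[e] for e in events if e in log_probs)
        for author, log_probs in model.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy stand-ins for the known-author files (hypothetical data).
    known = {
        "Author A": ["the cat sat on the mat the cat sat on the mat"],
        "Author B": ["to be or not to be that is the question to be or not"],
    }
    all_texts = [t for texts in known.values() for t in texts]
    vocabulary = most_common_events(all_texts, n=50)
    model = train(known, vocabulary)
    print(attribute(model, "The cat sat on the mat."))
```

With only the 50 most common trigrams kept, the classifier is looking at a very thin slice of each author's style, which may go some way toward explaining a surprising verdict.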
This was but the first volley in a long battle, to be sure. But initial results are very promising. Dare I wonder if we are nearing that threshold moment when serious text analysis will require as many engineers as driving to the store requires mechanics?
*One could be forgiven for fearing that by hiding the serious intricacies of the mathematical classifiers and the more-art-than-science language models, JGAAP has put a weapon into the hands of children. I disagree (though not that strongly). My feeling is that JGAAP is to NLP what SPSS is to statistics. Serious statisticians probably just gasped in horror at the implications. But then again, serious drivers gasp in horror at the very idea of an automatic transmission. Technology made to fit the hands of the average is not as bad a thing as technical experts typically fear.
Let me pre-respond to one possible analogy: that this fear is particularly salient given the recent dust-up over bad neuroscience reporting (for example, read this). That is beside the point, in that bad science journalism is its own special illness. It doesn't bear on the health of the underlying science.