In my first run (seen above), I asked JGAAP to normalize white space, strip punctuation, and convert everything to lowercase. Then I had it run a Naive Bayes classifier on the top 50 tri-grams from the three known authors (Shakespeare, Marlowe, Bacon) and one unknown author (Shakespeare's sonnets).
Based on that sample, JGAAP came to the conclusion that Francis Bacon wrote the sonnets. We know that because it lists its guesses in order from best to worst in the left window of the above image, and Bacon is on top. That alone is cause to start tinkering with the model, but the results didn't look flat-out weird until I looked at the image again today. It lists the probability that the sonnets were written by Bacon as 1. A probability of 1 means absolute certainty. So this model, given the top 50 tri-grams, is absolutely certain that Francis Bacon wrote those sonnets ... Bullshit. A probabilistic model is never absolutely certain of anything. That's what makes it probabilistic, right?
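For what it's worth, a reported probability of exactly 1 is a classic symptom of Naive Bayes itself, not necessarily a bug in JGAAP. Per-class log-likelihoods are sums over hundreds of features, so the classes end up dozens of log-units apart, and normalizing those gaps into probabilities almost always produces a winner whose posterior rounds to 1. A minimal Python sketch (the author names and log-likelihood numbers are invented purely for illustration):

```python
import math

# Hypothetical per-class log-likelihoods for one test document.
# With many features, the sums routinely differ by 30+ log units.
log_likelihoods = {"Bacon": -1200.0, "Shakespeare": -1235.0, "Marlowe": -1250.0}

def posteriors(loglik):
    """Normalize log-likelihoods into posteriors (uniform prior, log-sum-exp)."""
    m = max(loglik.values())
    exps = {k: math.exp(v - m) for k, v in loglik.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

post = posteriors(log_likelihoods)
# A 35-log-unit gap already pushes the winner's posterior to ~1.0,
# so a displayed "1" may be rounding rather than genuine certainty.
print(post)
```

Under that reading, the bug isn't that the model is certain; it's that Naive Bayes posteriors are notoriously overconfident, and the display rounds the rest away.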
So where's the bug? Turns out, it might have been poor data management on my part. I didn't bother to sample in any kind of fair and reasonable way. Here are my corpora:
- Bacon (2 works) - 950 KB
- Marlowe (Works vol. 3) - 429 KB
- Shakespeare (all plays) - 4.4 MB
- Sonnets (unknown author) - 113 KB
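If the size mismatch really is the culprit, one straightforward fix is to truncate each known-author corpus to roughly the same byte count before handing it to JGAAP. A quick Python sketch of that idea (the target size and the word-boundary handling are my own choices, not anything JGAAP prescribes):

```python
def truncate_sample(text: str, target_bytes: int) -> str:
    """Keep roughly target_bytes of text, cutting back to a word boundary."""
    clipped = text.encode("utf-8")[:target_bytes].decode("utf-8", errors="ignore")
    return clipped.rsplit(" ", 1)[0]

# Stand-in for the oversized Shakespeare corpus (~1 MB of filler text).
big_corpus = "word " * 200_000
sample = truncate_sample(big_corpus, 429_000)  # match the smallest known corpus
print(len(sample.encode("utf-8")))
```

Whether truncation or random sampling of passages is fairer is its own question; truncation is just the simplest to reproduce.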
The Infinite Madness of Language Models
As far as I can tell, the current version does not have any obvious way of turning on error-reporting logs (though I suspect that is possible, if one had the source code). It also offers no way of printing the features it's using. I'd love to see a list of those top 50 tri-grams for each author, but as of right now, it does not appear to support that. I'll add that to my enhancement requests. However, JGAAP is fast enough that I can simply do several trial-and-error runs and compare the output. My goals are 1) to get JGAAP to guess Shakespeare as the unknown author with a high degree of certainty and 2) to figure out why it gave such a high confidence score to Bacon during round one.
Here are the results of several follow-up experiments. Mostly, I want to tune the language model - in the parlance of JGAAP, Event Drivers (linguistic features) + Event Culling (sampling) = a language model (unless I'm misunderstanding something).
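Since JGAAP won't print its features, here's roughly what I imagine the Event Driver + Event Culling pipeline does for character tri-grams. This is my reading of the terminology, not JGAAP's actual source code:

```python
from collections import Counter

def char_trigrams(text: str):
    """Event Driver (as I understand it): extract character tri-grams
    after normalizing whitespace and lowercasing."""
    text = " ".join(text.lower().split())
    return [text[i:i + 3] for i in range(len(text) - 2)]

def top_n_events(text: str, n: int = 50):
    """Event Culling (as I understand it): keep only the n most frequent events."""
    return [gram for gram, _ in Counter(char_trigrams(text)).most_common(n)]

print(top_n_events("To be, or not to be, that is the question", 5))
```

If this reading is right, then "top 50 tri-grams" means the model only ever sees each author's 50 most frequent character sequences, which would explain why small feature sets wash out the differences I expected.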
Round 2: Same specifications as Round 1. I used all of the same corpora, except I replaced Shakespeare with a sample of about 500 KB to bring it in line with the others. Then I repeated the analysis using all the same parameters. This time ... drum roll ... Bacon still wins in a landslide. JGAAP remains absolutely confident that Bacon wrote those sonnets.
Round 3: Okay. Let's expand the set of tri-grams. Same everything else as Round 2, but now I'll use the top 100 tri-grams.
D'oh! Well, it's less confident that Marlowe is involved (drunk bastard).
Round 4: For good measure, let's expand the set of tri-grams again. Same everything else as Rounds 2 and 3, but now I'll use the top 200 tri-grams.
Okay, it appears that adding more tri-grams alone gives us nothing. I feel confident dropping back down to 100. Now, I'll add one simple feature - Words (I assume this is a frequency list; again, the Event Culling will choose just the top 100 most frequent words, as well as the top 100 tri-grams, if I'm understanding this right).
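My guess at what the Words event driver amounts to, sketched in Python (the tokenization rule here is my assumption; JGAAP may well split words differently):

```python
import re
from collections import Counter

def top_words(text: str, n: int = 100):
    """Assumed 'Words' driver: lowercase word tokens, culled to the
    n most frequent (mirroring the top-100 culling setting)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w, _ in Counter(words).most_common(n)]

print(top_words("the lady doth protest too much methinks the lady doth", 3))
```

Unlike character tri-grams, a top-100 word list is dominated by function words (the, of, and, to ...), which the stylometry literature treats as strong authorship signals, so it's plausible this feature carries the weight in the next round.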
We have a winner! The top score above shows that for Words, Shakespeare finally wins (though he still loses on Ngrams, the second highlighted score). As a comparison, I threw in another feature, Rare Words.
No help. My interpretation of these results is that the feature "Words" is the best predictor of Shakespearean authorship (given this set of competing authors with these tiny corpora).
But this is a stacked-deck experiment. I know perfectly well that the "Unknown Author" is Shakespeare. I'm just playing with linguistic features until I get the result I want. The actual problem of determining unknown authorship requires far more sophisticated work than what I did here (again, read Juola's detailed explanation of what he and his team did to out J.K. Rowling).
Nonetheless, I could imagine not sleeping for several days just playing with the different combinations of features to produce different language models just to see how they move the results (mind you, I didn't play with the classifier either, which adds its own dimension of playfulness).
Herein lies the value of JGAAP. More than any other tool I have personally seen, JGAAP gives the average person an easy-to-use platform to splash around and play with language in an NLP environment. When I think about my first two experiences with JGAAP, the most salient word that jumps out at me is FUN! It's just plain fun to play around. It's fast and simple and fun. I can't say that about R, or Python, or Octave. All three of those are very powerful tool sets, but they are not fun. JGAAP is fun. It's a playground for linguists. Let me note that I beta tested a MOOC for WEKA last March and was very impressed with their interface as well (though I think JGAAP does a better job of making language modeling easy ... and that's the fun part for linguists anyway).
I am reminded of what several Twitter friends have said to me when I say that I'm a cognitive linguist: "Really! I never would have known from your Twitter feed." That's a wake-up call for me. I have been involved in NLP since roughly 2000, but my passion is definitely the blood and guts of language and linguistics. JGAAP appeals to that old linguistics fire in my belly. It makes me want to play with language again.