Thursday, August 1, 2013

in the dark heart of a language model lies madness....

This is the second in a series of post detailing experiments with the Java Graphical Authorship Attribution Program. The first post is here.

In my first run (seen above), I asked JGAAP to normalize for white space, strip punctuation, turn everything into lowercase. Then I had it run a Naive Bayes classifier on the top 50 tri-grams from the three known authors (Shakespeare, Marlowe, Bacon) and one unknown author (Shakespeare's sonnets).

Based on that sample, JGAAP came to the conclusion that Francis Bacon wrote the sonnets. We know that because it lists its guesses in order from best to worst in the left window in the above image. Bacon is on top. This alone is cause to start tinkering with the model, but the results didn't look as flat weird until I looked at the image again today. It lists the probability that the sonnets were written by Bacon as 1. A probability of 1 typically means absolute certainty. So this model, given the top 50 trigrams, is absolutely certain that Francis Bacon wrote those sonnets ... Bullshit. A probabilistic model is never absolutely certain of anything. That's what makes it probabilistic, right?

So where's the bug? Turns out, it might have been poor data management on my part. I didn't bother to sample in any kind of fair and reasonable way. Here are my corpora:

Known Authors
  • Bacon - (2 works) - 950 KB
  • Marlowe (Works vol 3) - 429 KB
  • Shakespeare (all plays) - 4.4 MB
Unknown Author
  • (Sonnets) - 113 KB
Clearly, I provided a much larger Shakespeare data set than any of the others. However, keep in mind that JGAAP only used the 50 most common tri-grams from any of these corpora (if I understand their Event Culling tool properly). Is the disparity in corpora size relevant if I'm also sampling just the top 50 tri-grams? Just how different would those tri-grams be if the corpora were equivalent? Let's find out.

The Infinite Madness of Language Models 
As far as I can tell, the current version does not have any obvious way of turning on error reporting logs (though I suspect that is possible, if one had the source code). It also offers no way of printing the features it's using. Id' love to see a list of those top 50 tri-grams for each author. But as of right now, it does not appear to support that. I'll add that to my enhancement requests. However, JGAAP is fast enough to simply run several trial-and-error runs in order to compare output. My goals are 1) get JGAAP to guess Shakespeare as the unknown author with a high degree of certainty and 2) try to figure out why it gave such a high confidence score to Bacon during round one.

Here are the results of several follow up experiments. Mostly, I want to tune the language model - in the parlance of JGAAP, Event Drivers (linguistic features) + Event Culling (sampling) = a language model (unless I'm misunderstanding something).

Round 2: Same specifications as Round one. I used the all of the same corpora, except I replaced Shakespeare with a sample of about 500 KB to bring it in line with the others. Then I repeated the analysis using all the same parameters. This time ... drum roll ... Bacon still wins in a landslide. JGAAP remains absolutely confident that Bacon wrote those sonnets.

Round 3: Okay. Let's expand the set of tri-grams. Same everything else as Round 2, but now I'll use the top 100 tri-grams.

D'oh! Well, it's less confident that Marlowe is involved (drunk bastard).

Round 4: For good measure. Let's expand the set of tri-grams. Same everything else as Rounds 2 and 3, but now I'll use the top 200 tri-grams.


Okay, it appears that adding more tri-grams alone gives us nothing. I feel confident dropping back down to 100. Now, I'll add one simple feature - Words (I assume this is a frequency list; again, the Event Culling will choose just the top 100 most frequent words, as well as the top 100 tri-grams, if I'm understanding this right).

We have a winner! The top score above shows that for Words, Shakespeare finally wins (though he still loses on Ngrams, the second highlighted score). As a comparison, I threw in another feature, Rare Words.

No help. My interpretation of these results is that the feature "Words" is the best predictor of Shakespearean authorship (given this set of competing authors with these tiny corpora).

But this is a stacked-deck experiment. I know perfectly well that the "Unknown Author" is Shakespeare. I'm just playing with linguistic features until I get the result I want. The actual problem of determining unknown authorship requires far more sophisticated work than what I did here (again, read Juola's detailed explanation of what he and his team did to out J.K. Rowling).

Nonetheless, I could imagine not sleeping for several days just playing with the different combinations of features to produce different language models just to see how they move the results (mind you, I didn't play with the classifier either, which adds its own dimension of playfulness).

Herein lies the value of JGAAP. More than any other tool I have personally seen, JGAAP gives the average person the easy-to-use platform to splash around and play with language in an NLP environment. When thinking about my first two experiences with JGAAP, the most salient word that jumps out at me is FUN! It's just plain fun to play around. It's fast and simple and fun. I can't say that about R, or Python, or Octave. All three of those are very powerful tool sets, but they are not fun. JGAAP is fun. It's a playground for linguists. Let me note that I beta tested a MOOC for WEKA last March and was very impressed with their interface as well (though I think JGAAP does a better job of making language modeling easy ... and that's the fun part for linguists anyway).

I am reminded of what several Twitter friends have said to me when I say that I'm a cognitive linguist: "Really! I never would have known by your Twitter feed." That's a wake up call for me. I have been involved in NLP since roughly 2000, but my passion is definitely the blood and guts of language and linguistics. JGAAP appeals to that old linguistics fire in my belly. It make me want to play with language again.

1 comment:

Susan Brown said...

Maybe you already know this, but there is a long-standing theory that Francis Bacon actually wrote the Shakespeare plays and sonnets. Most Shakespeare scholars disagree, but don't be too hard on your model! I actually came across your post because I'm teaching language models in my NLP class right now and I was wondering if anyone had already done the very experiment you tried.

Putting the Linguistics into Kaggle Competitions

In the spirit of Dr. Emily Bender’s NAACL blog post Putting the Linguistics in Computational Linguistics , I want to apply some of her thou...