Thursday, December 30, 2010

plagiarism and n-grams

Big media plagiarism is once again in the news as ESPN has suspended an on-air host for plagiarizing three sentences from a newspaper columnist. The on-air host has admitted the plagiarism*, issued an apology, and asked for forgiveness.

The multiple and confusing ethical standards for plagiarism have been the subject of several LL posts (recently here) and this led me to wonder about what counts as plagiarism in the first place. Clearly a three-sentence, 45-word passage, almost word-for-word identical with another, in the same semantic domain with the same referents, is a case of plagiarism. But what about a 20-word passage? 10 words? 4 words**?

Many short phrases are highly frequent, right? You couldn't felicitously accuse me of plagiarism for using the phrase "I am going..." could you? Even though there can be no doubt that someone else used it before me. Yes, I know you can find guidelines for plagiarism in college student handbooks and such. I dealt with those for years when I taught college writing courses (and I recall flunking at least three students for plagiarism, but those were whole papers, really stupid stuff).

But I wonder, now that we have a 500 million word corpus available to us, couldn't we simply compare all n-grams to discover how likely it is that any given 5-gram is repeated? I'd prefer to do this up to 20-grams and such, but wouldn't we predict that there comes a point at which the likelihood that a particular phrase was plagiarized (given that we had found two alike) would be based solely on the general likelihood that n-grams of that size are repeated? The situation would be this: you discover that a particular 11-word passage has an identical twin from 2 years ago. Without bothering to look into whether or not the author had access to the previous work, you simply look up the likelihood that any 11-gram is repeated and discover that there is, say, a 0.0002% chance that a phrase that long will be repeated.
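Here is a rough sketch of that lookup, just to make the idea concrete. It is not a tested method: the function names (ngrams, repeat_rate) are made up for illustration, the toy sentence stands in for a real corpus, and "fraction of distinct n-grams seen more than once" is only one of several ways you might define the repetition rate.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous n-token window as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def repeat_rate(tokens, n):
    """Fraction of distinct n-grams that occur more than once.

    Assumption: this is just one possible definition of "how likely
    an n-gram of this size is to be repeated".
    """
    counts = Counter(ngrams(tokens, n))
    if not counts:
        return 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / len(counts)

# Toy stand-in for a real corpus (hypothetical data).
corpus = "i am going to the store and i am going to the park".split()

for n in range(1, 7):
    print(n, repeat_rate(corpus, n))
```

On a real corpus you would expect the rate to fall off quickly as n grows, and the n at which it becomes negligible is the point where an exact match starts to look suspicious on length alone.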

With some effort, you could then derive predictions for near identical passages (using WordNet and similar resources)....
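A minimal sketch of the near-identical case, assuming you collapse synonyms to a single representative word before comparing n-grams. The SYNONYMS table here is a tiny hand-made stand-in, not WordNet itself, and the function names are hypothetical; a real version would look the classes up in WordNet or a similar resource.

```python
# Hypothetical hand-made synonym table standing in for WordNet lookups.
SYNONYMS = {"large": "big", "huge": "big", "automobile": "car"}

def normalize(tokens):
    """Lowercase each token and map it to its synonym-class representative."""
    return [SYNONYMS.get(t.lower(), t.lower()) for t in tokens]

def ngram_set(tokens, n):
    """Return the set of contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_ngrams(text_a, text_b, n):
    """N-grams the two texts share after synonym normalization."""
    a = ngram_set(normalize(text_a.split()), n)
    b = ngram_set(normalize(text_b.split()), n)
    return a & b

print(shared_ngrams("He drove a huge automobile home",
                    "She drove a big car yesterday", 3))
```

The two sentences share no 3-word stretch with identical wording after "drove a", but they do match once "huge" and "automobile" are folded into "big" and "car".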

..just thinking out loud...

*I am ignorant of the role ESPN's producers play in the writing of on-air speeches, but the quote seems clearly to have been written on a teleprompter at the time of speaking, which means someone else was involved, even if unwittingly. Nonetheless, the host is taking the fall willingly.

**Excluding obviously famous phrases like Ich bin ein Berliner.


Trochee said...

In fact, using language-modeling statistical tools from the computational linguistics community, it would be very easy to do just as you suggest; I suspect such an approach is just what and other anti-plagiarism sites use.

Chris said...

Interesting site. I couldn't help but notice that is located in downtown Oakland ... so is A spinoff?

GamesWithWords said...

Plagiarism is a bit different from copyright violations, but it might be worth looking at copyright law because folks have had to think a great deal about it, and real money is at stake.

One point of copyright is that you can copyright form, but not the truth. So, for instance, if somebody writes a science paper and you summarize it, that's within your rights (though not citing them would be bad form). This is why those historians who sued Dan Brown for stealing their ideas were laughed out of court.

As I understand it, copyright law also takes into account how many ways something could be said. You probably can't copyright the Method section of a science paper because there aren't many ways of saying, "12 native English speakers participated," and it's clear no great craft went into choosing the word order.

Getting back to this journalist: I don't know the details of the story. But imagine that what the columnist wrote was, "The President has many hard choices this year. First, there is dealing with the Republican majority in the House. Second, there is the economic crisis." That's 3 sentences. But do we really want to have the columnist directly cited for stating the obvious? And should the journalist in question have to rewrite these very simple sentences in some crazy prolix prose in order to avoid the semblance of copying?

I think these issues are very fact-specific, and it's difficult to have clearly objective rules.

Chris said...

Yes, Robert Shuy is the big name in language and the law. He has written a lot of books on this.
