The multiple and confusing ethical standards for plagiarism
Many short phrases are highly frequent, right? You couldn't felicitously accuse me of plagiarism for using the phrase "I am going...", could you, even though there can be no doubt that someone else used it before me? Yes, I know you can find guidelines for plagiarism in college student handbooks and such. I dealt with those for years when I taught college writing courses (and I recall flunking at least three students for plagiarism, but those were whole papers, really stupid stuff).
But I wonder, now that we have a 500 million word corpus available to us, couldn't we simply compare all n-grams to discover how likely it is that any given 5-gram is repeated? I'd prefer to do this up to 20-grams or so, but wouldn't we predict that there comes a point at which the likelihood that a particular phrase was plagiarized (given that we had found two alike) would be based solely on the general likelihood that n-grams of that size are repeated? The situation would be this: you discover that a particular 11-word passage has an identical twin from 2 years ago. Without bothering to look into whether or not the author had access to the previous work, you simply look up the likelihood that any 11-gram is repeated and discover that there is a 0.0002% chance that a phrase that long will be repeated. The lower that baseline, the harder it is to write the match off as coincidence.
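To make the baseline idea concrete, here is a minimal sketch of how one might estimate the repeat rate for n-grams of a given size. Everything in it is my own illustration: the names `ngrams` and `repeat_rate` are made up, the "corpus" is a toy sentence rather than 500 million words, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def repeat_rate(tokens, n):
    """Fraction of n-gram occurrences whose exact word sequence
    shows up more than once in the token list."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    if total == 0:
        # The text is shorter than n words, so there is nothing to compare.
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


# Toy stand-in for a real corpus (an assumption for illustration only).
toy_corpus = "i am going to the store and i am going to the park".split()

for n in (2, 5, 6):
    print(n, repeat_rate(toy_corpus, n))
```

Even in this tiny example the rate falls off as n grows: most of the 2-grams recur, only "i am going to the" recurs among the 5-grams, and no 6-gram recurs at all. On a real corpus you would also want to count repeats across different documents rather than within one running text, which this sketch does not attempt.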
With some effort, you could then derive predictions for near-identical passages (using WordNet and similar resources)....
..just thinking out loud...
*I am ignorant of the role ESPN's producers play in the writing of on-air speeches, but the quote seems clearly to have been written on a teleprompter at the time of speaking, which means someone else was involved, even if unwittingly. Nonetheless, the host is taking the fall willingly.
**Excluding obviously famous phrases like Ich bin ein Berliner.