Is there such a thing as an Inverse Term Frequency-Document Frequency metric? The logic would be similar to TF-IDF.
For TF-IDF, if a word is rare in a large document set, but common in three documents, you can infer that the word is probably relevant to the topic of those three documents. For example, if the word "oncologist" is rare in a corpus of 1 million documents, but occurs frequently in three particular documents, you can infer that those three documents are probably about the topic of cancer. The inverse frequency of "oncologist" tells you something about the shared topic of those documents.
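To make that intuition concrete, here is a minimal sketch of plain TF-IDF in Python. It assumes the simplest textbook weighting (relative term frequency times log(N / document frequency)). Real libraries usually add smoothing and normalization. The function name `tf_idf` and the four-document toy corpus are made up purely for illustration.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # term never occurs in the corpus
    return tf * math.log(len(corpus) / df)

# Invented toy corpus (assumption): each document is a list of tokens.
corpus = [
    "the oncologist reviewed the scan with the oncologist on call".split(),
    "the weather was mild and the game went ahead".split(),
    "the committee met and the vote was postponed".split(),
    "the recipe calls for butter and the oven must be hot".split(),
]

score_rare = tf_idf("oncologist", corpus[0], corpus)  # rare in corpus, frequent here
score_common = tf_idf("the", corpus[0], corpus)       # occurs in every document
```

In this toy setup "oncologist" gets a positive weight in the first document, while "the" scores zero because it appears in every document and log(N / N) = 0.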
But does the opposite tell us something too? Suppose a word is used frequently across a large corpus, but infrequently within one subset of documents (say, one author's work), except in one local sub-subset (e.g., one page of a novel) where it occurs very frequently. Can we infer something about that sub-subset passage? For example, imagine this were true: "Hunter Thompson rarely used the word 'lovely', but he uses it four times on one page of Hell's Angels, which really tells us something about the tone of that particular passage." That "lovely" page would be a sub-subset of the Hunter Thompson subset.
I would call it something like a Black Swan usage.
My hunch is this could be done with an approach similar to the Divergence from Randomness model (which I can only claim to understand at a gross intuitive level, not an in-the-weeds algorithmic level). But I suspect Black Swans are not quite the same as simply diverging from random, because you need three rates: the average within-document occurrence rate over the whole corpus, the average within-document occurrence rate over the subset (the author), and the Black Swan occurrence rate over the sub-subset (among other variables, I suppose).
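As a rough sketch of how those three rates might be combined, the Python below computes each one and then two simple ratios. This is not Divergence from Randomness and not an established metric. The function names, the choice to compare rates by ratio, the decision to leave the passage out of the author baseline, and the small smoothing constant are all assumptions, and the toy sentences are invented (they are not real quotations).

```python
def mean_rate(term, docs):
    """Average within-document occurrence rate of `term` over token lists."""
    rates = [doc.count(term) / len(doc) for doc in docs if doc]
    return sum(rates) / len(rates) if rates else 0.0

def black_swan_scores(term, corpus, author_docs, passage, smoothing=1e-6):
    """Hypothetical helper: the three rates plus two ratios between them.

    `smoothing` is an assumed small constant that avoids division by zero.
    """
    corpus_rate = mean_rate(term, corpus)
    author_rate = mean_rate(term, author_docs)
    local_rate = mean_rate(term, [passage])
    return {
        "corpus_rate": corpus_rate,
        "author_rate": author_rate,
        "local_rate": local_rate,
        # Below 1 means the author uses the word less than the corpus does.
        "author_vs_corpus": (author_rate + smoothing) / (corpus_rate + smoothing),
        # Well above 1 means the passage departs from the author's own habit.
        "local_vs_author": (local_rate + smoothing) / (author_rate + smoothing),
    }

# Invented toy data (assumption), tokenized by whitespace.
other_docs = [
    "a lovely day and a lovely view".split(),
    "what a lovely tune that was".split(),
    "the rain fell on the lovely garden".split(),
]
author_docs = [  # the author's other passages, excluding the one under test
    "the bikes roared down the highway all night".split(),
    "nobody slept and nobody cared about the noise".split(),
    "it was not a lovely town".split(),
]
passage = "lovely lovely lovely it was a lovely mess".split()

corpus = other_docs + author_docs + [passage]
scores = black_swan_scores("lovely", corpus, author_docs, passage)
```

Under these assumptions, a Black Swan candidate would be a passage where `author_vs_corpus` is well below 1 and `local_vs_author` is well above 1. Where to set those thresholds, and whether a ratio is even the right comparison, is an open question here.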
If anyone knows of techniques to do this, please lemme know.
Ahhhh, Friday night, red wine, and goofing on corpus linguistics algorithms....