Friday, February 8, 2013

Black Swan Linguistics: ITF-DF?

Is there such a thing as Inverse Term Frequency-Document Frequency as a metric? The logic would be similar to TF-IDF.

For TF-IDF, if a word is rare in a large document set, but common in three documents, you can infer that the word is probably relevant to the topic of those three documents. For example, if the word "oncologist" is rare in a corpus of 1 million documents, but occurs frequently in three particular documents, you can infer that those three documents are probably about the topic of cancer. The inverse frequency of "oncologist" tells you something about the shared topic of those documents.
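To make the "oncologist" intuition concrete, here is a minimal sketch of one common TF-IDF variant (term frequency times log of inverse document frequency). The function name `tf_idf` and the toy corpus are made up for illustration, and real toolkits use various smoothing and normalization choices that are not shown here.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of `term` in one tokenized document.

    `corpus` is a list of tokenized documents (lists of strings).
    This is the plain tf * log(N / df) variant, with no smoothing.
    """
    # Term frequency: share of the document's tokens that are `term`.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: number of documents containing `term` at all.
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0  # term never occurs, so there is nothing to score
    idf = math.log(len(corpus) / df)
    return tf * idf

# Hypothetical toy corpus: "oncologist" is rare overall but frequent in docs[0].
docs = [
    "the oncologist said the oncologist would call".split(),
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the weather is nice today".split(),
]

rare_word_score = tf_idf("oncologist", docs[0], docs)
common_word_score = tf_idf("the", docs[0], docs)
```

In this toy corpus "the" appears in every document, so its IDF is log(1) = 0 and its score vanishes, while "oncologist" gets a positive score in the one document where it is concentrated.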

But does the opposite tell us something too? Suppose a word is frequent across many documents in a large corpus, yet infrequent in one subset of documents, except for one local sub-subset (e.g., one page of a novel) where it occurs very frequently. Can we infer something about that sub-subset passage? For example, imagine this were true: "Hunter Thompson rarely used the word 'lovely', but he uses it four times on one page of Hell's Angels, which really tells us something about the tone of that particular passage." That "lovely" page would be a sub-subset of the Hunter Thompson subset.

I would call it something like a Black Swan usage.

My hunch is this could be done with an approach similar to the Divergence from Randomness model (which I can only claim to understand at a gross intuitive level, not an in-the-weeds algorithmic level). But I suspect Black Swans are not quite the same as simply diverging from random, because you need at least three quantities: the average within-document occurrence rate over the whole corpus, the average within-document occurrence rate over the subset (the author), and the Black Swan occurrence rate over the sub-subset (among other variables, I suppose).
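As a rough illustration of the three rates just described, the sketch below compares them with log-ratios. This is not the Divergence from Randomness model, only a hypothetical heuristic: the names `rate` and `black_swan_scores`, the smoothing constant `alpha`, and the synthetic data are all assumptions made up for the example.

```python
import math

def rate(term, passages, alpha=0.5):
    """Smoothed occurrences of `term` per token over a list of tokenized passages.

    `alpha` is an arbitrary smoothing constant that keeps rates above zero.
    """
    hits = sum(p.count(term) for p in passages)
    total = sum(len(p) for p in passages)
    return (hits + alpha) / (total + 2 * alpha)

def black_swan_scores(term, corpus, author, passage):
    """Compare corpus-level, author-level, and passage-level rates of `term`."""
    corpus_rate = rate(term, corpus)
    author_rate = rate(term, author)
    passage_rate = rate(term, [passage])
    return {
        # Negative when the author uses the word less than the corpus does.
        "author_vs_corpus": math.log(author_rate / corpus_rate),
        # Positive when the passage uses the word more than the author usually does.
        "passage_vs_author": math.log(passage_rate / author_rate),
    }

# Synthetic data: other writers use "lovely" 5 times per 100 tokens,
# the author uses it only on one page (4 times).
other_pages = [["lovely"] * 5 + ["filler"] * 95 for _ in range(10)]
author_pages = [["filler"] * 100 for _ in range(9)] + [["lovely"] * 4 + ["filler"] * 96]
corpus = other_pages + author_pages

burst = black_swan_scores("lovely", corpus, author_pages, author_pages[-1])
ordinary = black_swan_scores("lovely", corpus, author_pages, author_pages[0])
```

Under this heuristic a Black Swan candidate would be a passage where `author_vs_corpus` is clearly negative and `passage_vs_author` is clearly positive. Where to set the thresholds, and whether the passage should be excluded from the author baseline, are open choices.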

If anyone knows of techniques to do this, please lemme know.

Ahhhh, Friday night, red wine, and goofing on corpus linguistics algorithms....

3 comments:

Unknown said...

Looks like a good example of the need for hierarchical models in linguistics. We can imagine estimating means for probabilities of Hunter S Thompson producing various words, but he himself comes from a population of English speakers and so his frequencies, while distinct, are dependent on the global population. We could then zoom in and estimate probabilities for finer-grained groups like books, chapters, paragraphs, etc., which would be dependent on higher levels in the model.
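One very simplified way to picture that dependence between levels is partial pooling: each level's rate is pulled toward its parent's rate. The sketch below is only a stand-in for a real hierarchical model. The function `shrunk_rate`, the prior `strength` value, and the counts are hypothetical and chosen for illustration.

```python
def shrunk_rate(hits, tokens, parent_rate, strength=100.0):
    """Posterior-mean rate under a Beta prior centred on `parent_rate`.

    `strength` acts like a number of pseudo-tokens contributed by the parent
    level; it is an arbitrary assumption here, not a fitted value.
    """
    return (hits + strength * parent_rate) / (tokens + strength)

# Hypothetical counts for the word "lovely".
population_rate = 54 / 2000                           # all English text in the sample
author_rate = shrunk_rate(4, 1000, population_rate)   # author, pulled toward population
page_rate = shrunk_rate(4, 100, author_rate)          # one page, pulled toward author
```

With little data at a level, the estimate stays close to its parent; with more data, the level's own counts dominate. A full hierarchical model would estimate the strength of that pull from the data rather than fixing it by hand.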

Chris said...

Agreed. Kyle Wade Grove made a similar point on Twitter. I'm not well versed in hierarchical models, so imma need to auto-didacticate myself.

Pedro Santana said...

I don't know if I am quite wrong, but it seems to me that the question basically depends on the series of sub-subsets chosen. Then there is the question of whether they are chosen a priori (a given dialect or corpus, speaker, piece of discourse) or empirically, after observing actual frequencies. I imagine that some cases will not be surprising (the oncologist case) and some will be completely unexpected, which will make them more meaningful in some sense. What I do not know is how to do this (apply the empirical approach to a corpus considered as a continuum) in an efficient way; I mean, how, for a really big corpus, to discover the words whose frequency oscillates too strongly within the whole corpus. Think of the relative frequencies of one term along the corpus as a time series: a special subset will show up as a jump in the frequencies of that term. I imagine that, in any case, the point is to find differences from an otherwise homogeneous distribution. Finally, if a lack of homogeneity is observed for a number of words, there is still the task of choosing which inhomogeneities are more meaningful.
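The time-series view can be sketched with fixed-size windows and a simple z-score test for jumps. This is one naive possibility, not an efficient method for a really big corpus. The names `window_rates` and `jumps`, the window size, and the threshold `z` are assumptions for illustration.

```python
from statistics import mean, pstdev

def window_rates(tokens, term, size):
    """Relative frequency of `term` in consecutive non-overlapping windows."""
    return [
        tokens[i:i + size].count(term) / size
        for i in range(0, len(tokens) - size + 1, size)
    ]

def jumps(rates, z=2.0):
    """Indices of windows whose rate lies more than `z` std devs above the mean."""
    mu, sigma = mean(rates), pstdev(rates)
    if sigma == 0:
        return []  # perfectly homogeneous: nothing stands out
    return [i for i, r in enumerate(rates) if (r - mu) / sigma > z]

# Synthetic "corpus as a continuum": 900 tokens without the term,
# then one 100-token stretch where it appears 4 times.
tokens = ["filler"] * 900 + ["lovely"] * 4 + ["filler"] * 96

rates = window_rates(tokens, "lovely", 100)
flagged = jumps(rates)
```

Here only the last window is flagged. Choosing the window size after looking at the data is the empirical route described above, and ranking which flagged words matter most is still left open.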