The Lousy Linguist: Auto-detecting Language

Sunday, March 14, 2010

Auto-detecting Language

Why doesn't Google's translation tool automatically detect the language I paste in? This is not a terribly difficult problem to solve computationally. I suspect that if they took a bag o' trigrams (of characters, that is) from the pasted text and compared them to a corpus for each candidate language using some kind of simple tf–idf weight, they'd get a pretty high degree of accuracy. Here are some distinctive trigrams from a page on Omniglot (the underscore marks a space). Wanna guess the language based solely on these? I doubt it will be difficult. And I suspect that just one or two of these trigrams are distinctive enough to make an accurate guess.

änn
isk
a_m
är_
föd
och
vär

UPDATE: Thanks to the commenters for schooling me on this. In fact, Google DOES have a detect language function. I've been trying to find documentation on their methods but haven't had much luck. I did find this discussion of a different language detector that works rather differently from what I proposed. Rather than comparing trigrams of letters to language models, it looks up whole words in dictionaries. While I admit to the greater simplicity of this method, I think my idea is more betterer 'cause it's more linguisticy.
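
For the curious, the trigram idea can be sketched in a few lines of Python. This is a toy under loud assumptions, not how Google (or any real detector) does it: the three one-sentence samples stand in for real per-language corpora, the function names (`char_trigrams`, `build_models`, `detect`) are made up for illustration, each language counts as one "document" for the idf part, and the idf is smoothed so that trigrams shared by every language still count a little. Spaces become underscores, matching trigrams like a_m above.

```python
import math
from collections import Counter

# Toy stand-ins for real corpora (assumption: one short sample per language).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the dog chases the fox through the field",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "dann jagt der hund den fuchs durch das feld",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y luego "
               "el perro persigue al zorro por el campo",
}


def char_trigrams(text):
    """Return the character trigrams of text, writing each space as '_'."""
    normalized = "_".join(text.lower().split())
    padded = f"_{normalized}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_models(samples):
    """Build one tf-idf weighted trigram vector per language."""
    counts = {lang: Counter(char_trigrams(text)) for lang, text in samples.items()}
    n_langs = len(counts)
    # Document frequency: in how many languages does each trigram occur?
    doc_freq = Counter()
    for trigram_counts in counts.values():
        doc_freq.update(trigram_counts.keys())
    # Smoothed idf (a choice made for this sketch, not taken from the post).
    idf = {t: math.log((1 + n_langs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    models = {}
    for lang, trigram_counts in counts.items():
        total = sum(trigram_counts.values())
        models[lang] = {t: (n / total) * idf[t] for t, n in trigram_counts.items()}
    return models, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect(text, models, idf):
    """Return the language whose trigram vector is closest to the text's."""
    trigram_counts = Counter(char_trigrams(text))
    total = sum(trigram_counts.values())
    # Trigrams never seen in any sample have no idf and are ignored.
    query = {t: (n / total) * idf[t] for t, n in trigram_counts.items() if t in idf}
    return max(models, key=lambda lang: cosine(query, models[lang]))


models, idf = build_models(SAMPLES)
print(detect("the dog and the fox", models, idf))
```

With samples this tiny the guesses are only as good as the overlap with the pasted text, so a real version would want far bigger corpora. The dictionary-lookup approach mentioned above would swap the trigram vectors for per-language word lists.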

Notes on my searching:

Lots of programming language detecting tools.
Several human language detecting tools, but few discussed methodology.

6 comments:

Bridget Samuels said...: Swedish.

This is a good linguists' parlor trick. I once got a free haircut after identifying a shampoo as Hungarian.; March 14, 2010 at 4:36 PM
Chris said...: Damn! You is fast, too!; March 14, 2010 at 4:38 PM
Bridget Samuels said...: Happened to check Google Reader at exactly the right moment, I guess :-); March 14, 2010 at 4:39 PM
Rob Van Dam said...: They do, at least on http://translate.google.com. I only know that because I needed to translate a quote yesterday and did not recognize the language. Google informed me it was Danish and gave a fairly intelligible translation (although the English grammar was a bit twisted).; March 14, 2010 at 5:25 PM
J. Frankenstein Lutes said...: Yeah, at the very top of the translate from selector box is the option to detect the language. They even recognize languages that they can't translate yet, which is neat.; March 14, 2010 at 6:37 PM
Chris said...: Doh! Yep, now I see the detect language function. Thanks! I haven't found any documentation on methodology yet, but I'm sure there's stuff out there (probably the CMU folks, hehe).; March 14, 2010 at 8:43 PM

