Sunday, March 14, 2010

Auto-detecting Language

Why doesn't Google's translation tool automatically detect the language I paste in? This is not a terribly difficult problem to solve computationally. I suspect that if they took a bag o' trigrams (of characters, that is) from the pasted text and compared it to a corpus for each candidate language using some kind of simple tf–idf weight, they'd get a pretty high degree of accuracy. Here are some distinctive trigrams from a page on Omniglot (the underscore stands for a space). Wanna guess the language based solely on these? I doubt it will be difficult. And I suspect that just one or two of these trigrams are distinctive enough to make an accurate guess.
  1. änn
  2. isk
  3. a_m
  4. är_
  5. föd
  6. och
  7. vär
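For the curious, here is a minimal sketch of the trigram idea in Python. It is a toy under stated assumptions, not how Google or anyone else actually does it: the three one-sentence training samples, the function names, and the scoring rule (sum the tf–idf weights of the trigrams found in the input) are all made up for illustration.

```python
import math
from collections import Counter

# Toy training text, one short sample per language (an assumption for
# illustration; a real detector would need far larger corpora).
SAMPLES = {
    "swedish": "jag är född i sverige och min familj är svensk",
    "english": "i was born in england and my family is english",
    "german": "ich bin in deutschland geboren und meine familie ist deutsch",
}

def trigrams(text):
    """Return character trigrams, with '_' marking word boundaries."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_models(samples):
    """Map each language to a {trigram: tf-idf weight} dictionary."""
    counts = {lang: Counter(trigrams(text)) for lang, text in samples.items()}
    # Document frequency: in how many languages does each trigram occur?
    doc_freq = Counter()
    for grams in counts.values():
        doc_freq.update(grams.keys())
    n_langs = len(counts)
    models = {}
    for lang, grams in counts.items():
        total = sum(grams.values())
        # Trigrams shared by every language get idf = log(1) = 0.
        models[lang] = {
            g: (c / total) * math.log(n_langs / doc_freq[g])
            for g, c in grams.items()
        }
    return models

MODELS = build_models(SAMPLES)

def guess_language(text):
    """Pick the language whose distinctive trigrams best match the text."""
    grams = trigrams(text)
    scores = {
        lang: sum(model.get(g, 0.0) for g in grams)
        for lang, model in MODELS.items()
    }
    return max(scores, key=scores.get)

print(guess_language("hon är född och uppvuxen här"))
```

Even with samples this tiny, trigrams like är_, föd and och carry most of the weight, which is the point of the guessing game above.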
UPDATE: thanks to the commenters for schooling me on this. In fact, Google DOES have a detect language function. I've been trying to find documentation on their methods but haven't had much luck. I did find this discussion of a different language detector that works rather differently than I proposed. Rather than compare trigrams of letters to language models, it looks up whole words in dictionaries. While I admit to the greater simplicity of this method, I think my idea is more betterer 'cause it's more linguisticy.
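As a rough illustration of the whole-word approach, here is a small Python sketch. It is not the code of the detector discussed above; the tiny word lists and the function name are hypothetical stand-ins for real dictionaries.

```python
# Tiny hand-picked word lists (an assumption for illustration; a real
# tool would load full dictionaries for each language).
WORDLISTS = {
    "swedish": {"och", "är", "jag", "det", "inte", "en"},
    "english": {"and", "is", "i", "the", "not", "a"},
    "danish": {"og", "er", "jeg", "det", "ikke", "en"},
}

def guess_by_dictionary(text):
    """Pick the language whose word list contains the most input words."""
    # Naive whitespace tokenization; punctuation is not stripped.
    words = text.lower().split()
    hits = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in WORDLISTS.items()
    }
    return max(hits, key=hits.get)

print(guess_by_dictionary("jag är inte en lingvist"))
```

Closely related languages share words (det and en appear in both the Swedish and Danish lists here), so a lookup like this leans on the words that differ.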

Notes on my searching:
  1. Lots of programming language detecting tools.
  2. Several human language detecting tools, but few discussed methodology.

6 comments:

Bridget Samuels said...

Swedish.

This is a good linguists' parlor trick. I once got a free haircut after identifying a shampoo as Hungarian.

Chris said...

Damn! You is fast, too!

Bridget Samuels said...

Happened to check Google Reader at exactly the right moment, I guess :-)

Rob Van Dam said...

They do, at least on http://translate.google.com . I only know that because I needed to translate a quote yesterday and did not recognize the language. Google informed me it was Danish and gave a fairly intelligible translation (although the English grammar was a bit twisted).

J. Frankenstein Lutes said...

Yeah, at the very top of the translate from selector box is the option to detect the language. They even recognize languages that they can't translate yet, which is neat.

Chris said...

Doh! Yep, now I see the detect language function. Thanks! I haven't found any documentation on methodology yet, but I'm sure there's stuff out there (probably the CMU folks, hehe).
