Why doesn't Google's translation tool automatically detect the language I paste in? This is not a terribly difficult problem to solve computationally. I suspect that if they took a bag o' trigrams (of characters, that is) and compared it to a corpus using some kind of simple tf–idf weight, they'd get a pretty high degree of accuracy. Here are some distinctive trigrams from a page on Omniglot (the underscore stands for a space). Wanna guess the language based solely on these? I doubt it will be difficult. And I suspect that just one or two of these trigrams are distinctive enough to make an accurate guess.
- änn
- isk
- a_m
- är_
- föd
- och
- vär
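For the curious, here is a minimal sketch of the bag-o'-trigrams idea in Python. Everything in it is an assumption on my part: the tiny `SAMPLES` corpus is made up for illustration, the names `trigrams`, `build_models`, `cosine`, and `detect` are hypothetical, and comparing the vectors by cosine similarity is just one reasonable choice. It is not a description of how Google or anyone else actually does it.

```python
import math
from collections import Counter

# Toy training text, invented for this sketch. A real detector would
# need far more text per language.
SAMPLES = {
    "swedish": "jag är född i sverige och det är en svensk värld som männen känner",
    "english": "i was born in england and this is the world that the men know",
    "german": "ich bin in deutschland geboren und das ist die welt die die männer kennen",
}

def trigrams(text):
    """Count character trigrams, writing spaces as '_' (as in the list above)."""
    s = "_" + "_".join(text.lower().split()) + "_"
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def build_models(samples):
    """Return one tf-idf weighted trigram vector per language, plus the idf table."""
    counts = {lang: trigrams(text) for lang, text in samples.items()}
    n_langs = len(counts)
    # Document frequency: in how many languages does each trigram occur?
    df = Counter(g for c in counts.values() for g in c)
    # Smoothed idf, so trigrams shared by every language still get some weight.
    idf = {g: math.log((1 + n_langs) / (1 + d)) + 1 for g, d in df.items()}
    models = {}
    for lang, c in counts.items():
        total = sum(c.values())
        models[lang] = {g: (k / total) * idf[g] for g, k in c.items()}
    return models, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def detect(text, models, idf):
    """Return the language whose trigram vector is closest to the text's."""
    c = trigrams(text)
    total = sum(c.values())
    # Trigrams never seen in training get zero weight.
    query = {g: (k / total) * idf.get(g, 0.0) for g, k in c.items()}
    return max(models, key=lambda lang: cosine(query, models[lang]))

models, idf = build_models(SAMPLES)
print(detect("hon är född och uppvuxen i värmland", models, idf))
```

Even with samples this small, trigrams like `är_`, `föd`, and `och` should be enough to pull the test sentence toward Swedish, which is the point of the guessing game above.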
UPDATE: thanks to the commenters for schooling me on this. In fact, Google DOES have a detect language function. I've been trying to find documentation on their methods but haven't had much luck. I did find this discussion of a different language detector that works rather differently from what I proposed. Rather than compare trigrams of letters to language models, it looks up whole words in dictionaries. While I admit to the greater simplicity of this method, I think my idea is more betterer 'cause it's more linguisticy.
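To make the contrast concrete, the whole-word approach might look something like the sketch below. This is my own guess at the general shape, not that detector's actual code: the `WORDLISTS` are a handful of common words I picked by hand, and `detect_by_words` is a hypothetical name.

```python
# Tiny hand-picked word lists, invented for this sketch. A real tool
# would use full dictionaries.
WORDLISTS = {
    "swedish": {"och", "är", "det", "jag", "en", "som", "inte"},
    "english": {"and", "is", "the", "i", "a", "that", "not"},
    "danish": {"og", "er", "det", "jeg", "en", "som", "ikke"},
}

def detect_by_words(text, wordlists=WORDLISTS):
    """Return the language whose word list matches the most tokens."""
    tokens = text.lower().split()  # no punctuation handling in this sketch
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in wordlists.items()
    }
    return max(scores, key=scores.get)

print(detect_by_words("jag är inte hemma och det är kallt"))
```

Note how much the Swedish and Danish lists overlap (`det`, `en`, `som`). Closely related languages are where a method like this has to work hardest.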
Notes on my searching:
- Lots of programming language detecting tools.
- Several human language detecting tools, but few discussed methodology.
6 comments:
Swedish.
This is a good linguists' parlor trick. I once got a free haircut after identifying a shampoo as Hungarian.
Damn! You is fast, too!
Happened to check Google Reader at exactly the right moment, I guess :-)
They do, at least on http://translate.google.com . I only know that because I needed to translate a quote yesterday and did not recognize the language. Google informed me it was Danish and gave a fairly intelligible translation (although the English grammar was a bit twisted).
Yeah, at the very top of the translate from selector box is the option to detect the language. They even recognize languages that they can't translate yet, which is neat.
Doh! Yep, now I see the detect language function. Thanks! I haven't found any documentation on methodology yet, but I'm sure there's stuff out there (probably the CMU folks, hehe).