Wednesday, February 17, 2010

Inuktitut's Millionth Word!!

For some time now, the English-speaking linguistics world has anxiously awaited the arrival of our millionth word in English (see here and here). I have a bottle of Freixenet permanently on ice just for that wondrous day. But alas! It appears that Inuktitut has beaten my native language to the prize of all prizes. According to this story about Microsoft's Inuit language software,

More than one million words have been programmed in Inuktitut through the collaboration, about 5,000 of which are new Inuktitut words (emphasis added).

I'm a gracious loser. Congratulations, Inuktitut. See you at the two-millionth mark.


Licia said...

You can have a look at the IT terminology that was standardized by Microsoft in collaboration with local experts, governments and universities in the Microsoft Community Glossary Project site, which is part of the Language Interface Pack (LIP) project. The list of terms is the same for all LIP languages and obviously it does not include function words or the standard lexicon that make up the bulk of the million words in the localized software strings.

Chris said...

Licia, thanks for the links! I'll have a closer look tomorrow. I'm still suspicious of the whole "million words" concept, but I'm open minded, and the links you provided look like a fruitful area to peruse.

I should also say that I'm impressed with Microsoft's cross-linguistic efforts, just as I was with Rosetta Stone's here.

Licia said...

LIP is indeed a very interesting project.

BTW, a product like Office easily exceeds the “million words” mark :-) 
In this press release about Afrikaans, isiXhosa, Setswana, isiZulu and Sesotho sa Leboa, the number of words quoted for the localization of Windows and Office is four million.
You need to take into account the large number of duplicate strings; words like OK and Cancel can occur thousands of times in the same product. Just to give you an example, if you look up “shared” with Inuktitut as the target language in the Microsoft Language Portal, you’ll get about 80 strings / 200 source words and quite a bit of repetition (and you need to bear in mind that exact duplicates are removed from the search results).

Chris said...

Ahhhhh, okay, it's the old type/token distinction. I wasn't making that connection. I assumed they meant 1 million unique types. Thanks!
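For anyone who hasn't run into the type/token distinction, here is a minimal Python sketch of the difference. The strings are invented for illustration (they are not real Microsoft localization data), and `count_tokens_and_types` is a hypothetical helper that assumes simple whitespace splitting and lowercasing, which would be far too crude for a polysynthetic language like Inuktitut.

```python
from collections import Counter

# Hypothetical UI strings, invented for illustration (not real Microsoft data).
ui_strings = [
    "OK",
    "Cancel",
    "Shared folder",
    "Shared printer",
    "OK",
    "Cancel",
]


def count_tokens_and_types(strings):
    """Return (token_count, type_count) for whitespace-separated, lowercased words."""
    counts = Counter(word.lower() for s in strings for word in s.split())
    # Tokens are all word occurrences; types are the distinct words.
    return sum(counts.values()), len(counts)


tokens, types = count_tokens_and_types(ui_strings)
print(f"{tokens} tokens, {types} types")
```

Every repeated OK or Cancel adds to the token count but not to the type count, so a "million words" of localized strings can sit on top of a much smaller vocabulary.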

Anonymous said...

There's also a LIP blog and a Microsoft Language Portal blog.
