Tuesday, April 20, 2010

Word Frequency Lists

Mark Davies and company over at BYU have released quite a collection of English word frequency data HERE.

Here's a taste:

Our data is based on the only large, genre-balanced, up-to-date corpus of American English -- the 400 million word Corpus of Contemporary American English. You can be sure that the words in these lists and in this dictionary -- sorted from most to least frequent -- are really the most common ones that you will encounter in the real world.

The frequency data comes in a number of different formats:
  • An eBook containing up to the 20,000 most frequent words, along with the 20-30 most frequent collocates (nearby words) and the synonyms for each word -- which provide valuable insight into meaning and usage.
  • A printed book (from Routledge) with the top 5,000 words (including collocates) and thematic lists.
  • Lists with the top 200-300 collocates for each of the 20,000 words, giving more than 4,300,000 node word / collocate pairs.
  • Simple word lists of the top 10,000 or 20,000 words, but without collocates or synonyms.
  • A free word list -- top 5,000 words, but no collocates or synonyms.
  • N-grams: more than 155 million trigrams, which can be queried by word form, lemma, part of speech, etc.
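
For anyone unfamiliar with the terms in that list, here is a toy sketch of what a frequency list, a set of collocates, and a trigram count look like when computed from raw text. This is not how the BYU data was built: the tokenizer is deliberately naive, the sample sentence is made up, and the function names and the collocate window size are my own assumptions for illustration.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): lowercase,
    # split on whitespace, strip surrounding punctuation.
    stripped = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in stripped if w]

def frequency_list(tokens):
    # Words sorted from most to least frequent, as (word, count) pairs.
    return Counter(tokens).most_common()

def collocates(tokens, node, window=4):
    # Count words appearing within `window` tokens of each occurrence
    # of the node word. The window size here is a hypothetical default.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts

def trigrams(tokens):
    # Count every run of three consecutive tokens.
    return Counter(zip(tokens, tokens[1:], tokens[2:]))

# Made-up sample text, not drawn from any real corpus.
tokens = tokenize("The cat sat on the mat. The dog sat on the rug.")

print(frequency_list(tokens)[:3])
print(collocates(tokens, "sat", window=1).most_common())
print(trigrams(tokens).most_common(1))
```

A real corpus would also need lemmatization and part-of-speech tagging before the counts could be queried by lemma or part of speech, which this sketch leaves out.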


Jason M. Adams said...

Unfortunately, the corpus is one-sixth the size of the Google n-gram corpus but four times the cost. They have a point, though, about not needing a cluster to process it.

