The Lousy Linguist: Word Frequency Lists

Tuesday, April 20, 2010

Word Frequency Lists

Mark Davies and company over at BYU have released quite a collection of English word frequency data HERE.

Here's a taste:

Our data is based on the only large, genre-balanced, up-to-date corpus of American English -- the 400 million word Corpus of Contemporary American English. You can be sure that the words in these lists and in this dictionary -- sorted from most to least frequent -- are really the most common ones that you will encounter in the real world.

The frequency data comes in a number of different formats:

An eBook containing up to the 20,000 most frequent words, along with the 20-30 most frequent collocates (nearby words) and the synonyms for each word -- which provide valuable insight into meaning and usage.
A printed book (from Routledge) with the top 5,000 words (including collocates) and thematic lists.
Lists with the top 200-300 collocates for each of the 20,000 words, giving more than 4,300,000 node word / collocate pairs.
Simple word lists of the top 10,000 or 20,000 words, but without collocates or synonyms.
A free word list -- top 5,000 words, but no collocates or synonyms.
N-grams: more than 155 million trigrams, which can be queried by word form, lemma, part of speech, etc.
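For readers who have not worked with this kind of data, here is a rough sketch of what a frequency list, a collocate count, and a trigram count look like on a toy text. This is only an illustration and not how the BYU lists were built: the function names, the naive whitespace tokenizer, and the collocate window size are all assumptions made for the example, and there is no lemmatization or part-of-speech tagging.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): split on whitespace,
    # strip surrounding punctuation, lowercase, and drop empty strings.
    stripped = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in stripped if w]

def frequency_list(tokens, top_n=None):
    # Word forms sorted from most to least frequent.
    return Counter(tokens).most_common(top_n)

def collocates(tokens, node, window=4):
    # Count words appearing within `window` tokens on either side of `node`.
    # The default window of 4 is an arbitrary choice for illustration.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

def trigrams(tokens):
    # Count every run of three consecutive tokens.
    return Counter(zip(tokens, tokens[1:], tokens[2:]))

tokens = tokenize("The cat sat on the mat. The cat ran.")
print(frequency_list(tokens, 2))
print(collocates(tokens, "cat", window=1).most_common())
print(trigrams(tokens).most_common(3))
```

A real corpus would need proper tokenization and lemmatization before counts like these mean much, which is a large part of what the published lists provide.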

2 comments:

Jason M. Adams said (April 21, 2010 at 10:34 AM): Unfortunately, the corpus is 1/6th the size of the Google n-gram corpus but 4 times the cost. They have a point, though, about not needing a cluster to process it.
Chris said (April 22, 2010 at 6:21 PM): bummer
