Here's a taste:
Our data is based on the only large, genre-balanced, up-to-date corpus of American English -- the 400-million-word Corpus of Contemporary American English. You can be sure that the words in these lists and in this dictionary -- sorted from most to least frequent -- are really the most common ones that you will encounter in the real world.
The frequency data comes in a number of different formats:
- An eBook containing up to the 20,000 most frequent words, along with the 20-30 most frequent collocates (nearby words) and the synonyms for each word -- which provide valuable insight into meaning and usage.
- A printed book (from Routledge) with the top 5,000 words (including collocates) and thematic lists.
- Lists with the top 200-300 collocates for each of the 20,000 words, giving more than 4,300,000 node word / collocate pairs.
- Simple word lists of the top 10,000 or 20,000 words, but without collocates or synonyms.
- A free word list -- top 5,000 words, but no collocates or synonyms.
- N-grams: more than 155 million trigrams, which can be queried by word form, lemma, part of speech, etc.
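
For readers unfamiliar with the terms in that list, here is a rough Python sketch of what a word-frequency list, a collocate list, and a trigram count look like when computed from a toy sentence. It is only an illustration and not how the corpus data was actually produced; the function names, the sample sentence, and the `window` size are all assumptions made for the example.

```python
from collections import Counter


def trigram_counts(tokens):
    """Count each run of three consecutive tokens."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


def collocate_counts(tokens, node, window=4):
    """Count words within `window` tokens of each occurrence of `node`.

    The window size is an illustrative choice, not the one used by the corpus.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Skip position i itself so the node word is not its own collocate.
        counts.update(
            t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
        )
    return counts


# Toy input; a real corpus would need proper tokenization and lemmatization.
tokens = "the cat sat on the mat and the cat sat down".split()

word_freq = Counter(tokens)                             # simple word list
trigrams = trigram_counts(tokens)                       # n-grams with n = 3
collocates = collocate_counts(tokens, "cat", window=2)  # nearby words

print(word_freq.most_common(3))
print(trigrams.most_common(1))
print(collocates.most_common(3))
```

On a corpus of hundreds of millions of words the same counts would normally be precomputed and stored, which is presumably why the data is sold as ready-made lists.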
2 comments:
Unfortunately, the corpus is one-sixth the size of the Google n-gram corpus but 4 times the cost. They have a point, though, about not needing a cluster to process it.
bummer