Friday, February 15, 2008

Fancy Corpus Search Tool

I've only just now discovered Sketch Engine, an entirely online corpus search utility by Adam Kilgarriff, Pavel Rychlý, and Jan Pomikálek. It can replicate a lot of what I do with tgrep2 and Python scripts, only much faster (I mean, A LOT faster).

Its advantages: it is fast and easy to use, it covers corpora in multiple languages (and lets you add new corpora), and it provides user-friendly output.

One disadvantage is the brevity of the sketches it provides. For example, I ran a sketch of the verb "prevent" in the BNC and it returned a list of subjects and objects that occur with the verb. Sweet! This is really important stuff if you're interested in FrameNet-type semantic description (see my related post here). Unfortunately, the output maxed out at 100, which is a small sample of the 10,000+ examples.
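For a rough idea of what a subject/object tally like this involves, here is a minimal Python sketch. It is not how Sketch Engine computes its sketches. It assumes you already have (relation, verb, noun) triples pulled out of a parsed corpus, and the triples below are made-up toy data, not BNC counts. The function name `sketch` and its `limit` parameter are hypothetical, chosen for illustration.

```python
from collections import Counter, defaultdict

# Toy, hand-written (relation, verb, noun) triples. NOT real BNC data.
TRIPLES = [
    ("subject", "prevent", "law"),
    ("object", "prevent", "accident"),
    ("object", "prevent", "accident"),
    ("object", "prevent", "disease"),
    ("subject", "prevent", "police"),
    ("subject", "prevent", "law"),
    ("object", "cause", "accident"),
]


def sketch(verb, triples, limit=None):
    """Tally the nouns seen with `verb`, grouped by grammatical relation."""
    counts = defaultdict(Counter)
    for relation, head, noun in triples:
        if head == verb:
            counts[relation][noun] += 1
    # limit=None means no cap: every collocate comes back, most frequent first.
    return {rel: c.most_common(limit) for rel, c in counts.items()}


if __name__ == "__main__":
    for relation, nouns in sketch("prevent", TRIPLES).items():
        print(relation, nouns)
```

Leaving `limit` as `None` returns every collocate rather than cutting the list off, which is the part a capped output can't give you. The trade-off is that you have to do the parsing and triple extraction yourself first.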

Nonetheless, this utility goes a long way toward providing the sort of user-friendly (yet still sophisticated) online corpus query tools that I think the average non-computationally minded linguist would benefit from greatly.

I've used Mark Davies' BNC interface a lot too, and that's also an excellent, entirely online search tool. Davies provides a nice interface to a variety of corpora here.
