Wednesday, October 18, 2017

A linguist asks some questions about word vectors

I have at best a passing familiarity with word vectors, strictly from a 30,000 foot view. I've never directly used them outside a handful of toy tutorials (though they have been embedded in a variety of tools I use professionally). There is something very linguisticee about word vectors, so I'm generally happy with their use. The basic idea goes back aways, but was most famously articulated by J. R. Firth's pithy phrase "You shall know a word by the company it keeps" (c. 1957).

This is a set of questions I had when reading Chris McCormick's tutorial Word2Vec Tutorial - The Skip-Gram Model. My ideal audience is NLPers who have a good understanding of the SOTA of word vectors.
Quick review of the tutorial:
  • Key Take Away: Word Vectors = the output probabilities that relate how likely it is to find a  vocabulary word nearby a given input word.
  • Methods: ngrams for training, order doesn't matter
  • Insight: If two words have similar contexts, then the network is motivated to learn similar word vectors for the two words.
  • Why sentences?
    • When choosing word window size, word vectors respect sentence boundaries. Why?
      • By doing this, vectors are modeling edited, written language.
      • Vectors are assuming the semantics of sentences are coherent in some way.
      • Vectors are, in essence, relying on sentences to be semantically coherent. As long as they are, the method works, but when they aren’t, how does this method break down?
    • This is an assumption often broken in spoken language with dissfluencies, topic shifts, interruptions, etc
    • How else might spoken language models differ?
    • How might methods change to account for this
    • Do words from previous sentences have semantic relationships to the words in following sentences? (Yes. Because discourse.)
    • Why miss out on that information?
  • Why the middle?
    • Vectors use words “in the middle of a sentence”? Why?
    • Do words that tend to occur in the middle of sentences differ in meaningful ways than those that do not? (Possibly. E.g, verbs and prepositions rarely begin or end sentences. Information structures affects internal ordering of constituents in interesting ways).
    • Did they look at the rate of occurrences of each unique lexical entry in-the-middle and near the peripheries?
  • Synthetic Languages:
    • How well would this word vectors work for a synthetic language like Mohawk?
    • What pre-processing steps might need to be added/modified?
    • Would those modifications be enough to get similar/useful results?
  • Antonyms: Can the same method learn the difference between “ant” and “ants”?
This was a fun read.

The original tutorial:
McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from

Putting the Linguistics into Kaggle Competitions

In the spirit of Dr. Emily Bender’s NAACL blog post Putting the Linguistics in Computational Linguistics , I want to apply some of her thou...