Wednesday, October 18, 2017

A linguist asks some questions about word vectors

I have at best a passing familiarity with word vectors, strictly from a 30,000-foot view. I've never directly used them outside a handful of toy tutorials (though they have been embedded in a variety of tools I use professionally). There is something very linguistic-y about word vectors, so I'm generally happy with their use. The basic idea goes back a ways, but was most famously articulated in J. R. Firth's pithy phrase "You shall know a word by the company it keeps" (c. 1957).

This is a set of questions I had when reading Chris McCormick's tutorial Word2Vec Tutorial - The Skip-Gram Model. My ideal audience is NLPers who have a good understanding of the SOTA of word vectors.
Quick review of the tutorial:
  • Key takeaway: Word vectors = the output probabilities that tell you how likely each vocabulary word is to appear near a given input word.
  • Methods: trained on word pairs ("skip-grams") drawn from a context window; word order within the window doesn't matter (a minimal pair-extraction sketch follows this list).
  • Insight: If two words have similar contexts, then the network is motivated to learn similar word vectors for the two words.
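To make the methods bullet concrete, here is a minimal sketch (plain Python, with a toy example sentence of my own, not the tutorial's code) of how skip-gram training pairs are pulled from a context window: every word is paired with each word inside the window around it, and the position of the context word within the window is thrown away.

    # Minimal sketch of skip-gram training-pair extraction (toy example).
    # Each (center, context) pair becomes one training item; the position of
    # the context word inside the window is not recorded.

    def skipgram_pairs(sentence, window=2):
        """Return (center, context) pairs for one tokenized sentence."""
        pairs = []
        for i, center in enumerate(sentence):
            # Clip the window at the sentence boundaries.
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pairs.append((center, sentence[j]))
        return pairs

    sentence = "the quick brown fox jumps over the lazy dog".split()
    for center, context in skipgram_pairs(sentence, window=2):
        print(center, "->", context)
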
Questions
  • Why sentences?
    • When choosing the word window size, word vectors respect sentence boundaries. Why? (The sketch after this question list shows where that choice surfaces in practice.)
      • By doing this, vectors are modeling edited, written language.
      • Vectors are assuming the semantics of sentences are coherent in some way.
      • Vectors are, in essence, relying on sentences to be semantically coherent. As long as they are, the method works; when they aren't, how does the method break down?
    • This assumption is often broken in spoken language, with disfluencies, topic shifts, interruptions, etc.
    • How else might spoken language models differ?
    • How might the methods change to account for this?
    • Do words from previous sentences have semantic relationships to the words in following sentences? (Yes. Because discourse.)
    • Why miss out on that information?
  • Why the middle?
    • Vectors use words “in the middle of a sentence”? Why?
    • Do words that tend to occur in the middle of sentences differ in meaningful ways from those that do not? (Possibly. E.g., verbs and prepositions rarely begin or end sentences. Information structure affects the internal ordering of constituents in interesting ways.)
    • Did they look at how often each unique lexical entry occurs in the middle versus near the peripheries?
  • Synthetic Languages:
    • How well would these word vectors work for a synthetic language like Mohawk?
    • What pre-processing steps might need to be added/modified?
    • Would those modifications be enough to get similar/useful results?
  • Antonyms: If similar contexts yield similar vectors, can the same method still learn the difference between "ant" and "ants"? (The sketch after this list shows how close such a pair ends up.)
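Below is a minimal sketch of the sentence-boundary and "ant"/"ants" questions in practice. It assumes gensim >= 4.0 and uses a tiny, made-up corpus of my own (the sentences, window size, and epoch count are purely illustrative). gensim's Word2Vec treats each item in the list as one training unit, so context windows never cross the sentence boundaries you hand it; feeding it whole documents instead of sentences is one crude way to let context span sentences.

    # Minimal sketch, assuming gensim >= 4.0; the corpus below is a toy,
    # hypothetical one. Each element of `sentences` is one training unit,
    # so the context window is clipped at sentence boundaries.
    from gensim.models import Word2Vec

    sentences = [
        "the ant carried a crumb across the picnic table".split(),
        "the ants carried crumbs across the picnic table".split(),
        "fire ants swarmed over the fallen apple".split(),
        "an ant followed the trail back to the nest".split(),
    ]

    model = Word2Vec(
        sentences,
        vector_size=50,   # dimensionality of the word vectors
        window=2,         # context window, clipped at each sentence boundary
        min_count=1,      # keep every word in this tiny corpus
        sg=1,             # skip-gram rather than CBOW
        epochs=200,       # many passes, since the corpus is tiny
    )

    # "ant" and "ants" share near-identical contexts, so their vectors end
    # up close together -- the ant/ants question above in miniature.
    print(model.wv.similarity("ant", "ants"))
    print(model.wv.most_similar("ant", topn=3))
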
This was a fun read.


The original tutorial:
McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com

3 comments:

Glenn said...

His thoughts on grabbing the sentence are dead on to me. Context and placement aside for just a moment, the Objective Correlative alone is worth the effort. Another, which popped in my head right now -- a Japanese professor did a semantic study on all of the transcripts from every TED talk and I stumbled in there. I'm not an expert, but what keyed with me, he ignored, or at least didn't comment on. The money shot, though, was the (Laughter) written into the transcript. It was a marker, a spot-on accurate sign that the previous 2-3 lines in the context of that talk caused a physical reaction from the audience. Checking those, I noted that (Laughter) did not get used if there were only a few chuckles. So, this was gold for a novelist. And no matter how well trained an AI is, single words aren't going to convey that information.

Fred Mailhot said...

If you're interested in learning more about this, Sebastian Ruder has an outstanding series of blog posts on word embeddings: their origins, the SotA, cross-lingual embeddings, what's ahead, etc.

Start here: http://ruder.io/word-embeddings-1/index.html

Chris said...

Thanks Fred, good link. Reading it now.
