This is a set of questions I had while reading Chris McCormick's tutorial, Word2Vec Tutorial - The Skip-Gram Model. My ideal audience is NLPers with a good understanding of the SOTA in word vectors.
Quick review of the tutorial:
- Key takeaway: the network's output probabilities relate how likely it is to find each vocabulary word near a given input word; the word vectors themselves are the weights the network learns along the way.
- Method: word pairs drawn from a context window ("skip-grams") are used for training; the position of a context word within the window doesn't matter (see the sketch after this list).
- Insight: If two words have similar contexts, then the network is motivated to learn similar word vectors for the two words.
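To make the "word pairs from a window" point concrete, here is a minimal sketch (mine, not from the tutorial) of how skip-gram training pairs can be extracted from one tokenized sentence, assuming a window size of 2. Every context word is treated the same regardless of its position, which is the sense in which order doesn't matter.

```python
# Minimal sketch: extract skip-gram (input, context) training pairs
# from one tokenized sentence, assuming a window of 2 words per side.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # Context = up to `window` words on either side, within the sentence.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for pair in skipgram_pairs(sentence)[:6]:
    print(pair)
# ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...
```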
Questions
- Why sentences?
- When context windows are chosen, they respect sentence boundaries. Why? (See the sentence-boundary sketch after this list.)
- By doing this, vectors are modeling edited, written language.
- The vectors are, in essence, relying on sentences being semantically coherent. As long as they are, the method works; when they aren't, how does it break down?
- This assumption is often broken in spoken language, with disfluencies, topic shifts, interruptions, etc.
- How else might spoken language models differ?
- How might the methods change to account for this?
- Do words from previous sentences have semantic relationships to the words in following sentences? (Yes. Because discourse.)
- Why miss out on that information?
- Why the middle?
- Vectors use words “in the middle of a sentence”? Why?
- Do words that tend to occur in the middle of sentences differ in meaningful ways from those that do not? (Possibly. E.g., verbs and prepositions rarely begin or end sentences. Information structure affects the internal ordering of constituents in interesting ways.)
- Did they look at the rate of occurrence of each unique lexical entry sentence-medially versus near the peripheries? (A counting sketch follows this list.)
- Synthetic Languages:
- How well would these word vectors work for a synthetic language like Mohawk?
- What pre-processing steps might need to be added/modified?
- Would those modifications be enough to get similar/useful results?
- Antonyms: Can the same method learn the difference between “ant” and “ants”?
- Quote: “the network will likely learn similar word vectors for the words ‘ant’ and ‘ants’”
- I haven't read it yet, but these folks seem to use word vectors to learn antonyms: Word Embedding-based Antonym Detection using Thesauri and Distributional Information.
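On the "Why sentences?" question above: a minimal sketch, assuming gensim 4.x and a toy corpus, of the two ways of chunking the input. The standard Word2Vec API takes a corpus of tokenized sentences, so context windows never cross sentence boundaries; feeding each document as one long "sentence" is a crude way to let windows span boundaries and see what changes. (Similarity numbers from a corpus this small are meaningless; the point is only the chunking.)

```python
from gensim.models import Word2Vec

# Toy "document" of two sentences (a real experiment would use a real corpus).
by_sentence = [
    ["we", "adopted", "a", "cat"],
    ["she", "sleeps", "all", "day"],
]
whole_doc = [[w for sent in by_sentence for w in sent]]

# (a) Respect sentence boundaries: windows never cross them.
model_a = Word2Vec(by_sentence, vector_size=50, window=5, min_count=1, sg=1)

# (b) Ignore boundaries: "cat" and "she" can now share a context window.
model_b = Word2Vec(whole_doc, vector_size=50, window=5, min_count=1, sg=1)

print(model_a.wv.similarity("cat", "she"))
print(model_b.wv.similarity("cat", "she"))
```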
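On the "Why the middle?" question above: a quick sketch (again mine, not from the tutorial) of how one could count, for each word type in a tokenized corpus, how often it occurs at a sentence periphery (first or last token) versus sentence-medially, which is the kind of check the question asks about.

```python
from collections import Counter, defaultdict

def position_counts(sentences):
    """For each word type, count peripheral (first/last token) vs. medial occurrences."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            slot = "peripheral" if i in (0, len(tokens) - 1) else "medial"
            counts[word][slot] += 1
    return counts

corpus = [
    ["the", "dog", "barked"],
    ["over", "the", "fence", "the", "dog", "jumped"],
]
for word, c in sorted(position_counts(corpus).items()):
    print(word, dict(c))
# e.g. "the" comes out as {'peripheral': 1, 'medial': 2}
```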
The original tutorial:
McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com