Wednesday, October 18, 2017

A linguist asks some questions about word vectors

I have at best a passing familiarity with word vectors, strictly from a 30,000-foot view. I've never directly used them outside a handful of toy tutorials (though they have been embedded in a variety of tools I use professionally). There is something very linguistics-y about word vectors, so I'm generally happy with their use. The basic idea goes back a ways, but was most famously articulated in J. R. Firth's pithy phrase "You shall know a word by the company it keeps" (c. 1957).

This is a set of questions I had when reading Chris McCormick's tutorial Word2Vec Tutorial - The Skip-Gram Model. My ideal audience is NLPers who have a good understanding of the SOTA of word vectors.
Quick review of the tutorial:
  • Key Takeaway: Word vectors = the output probabilities that say how likely each vocabulary word is to appear near a given input word.
  • Methods: trained on skip-gram word pairs; word order within the context window doesn't matter
  • Insight: If two words have similar contexts, then the network is motivated to learn similar word vectors for the two words.
  • Why sentences?
    • When choosing word window size, word vectors respect sentence boundaries. Why?
      • By doing this, vectors are modeling edited, written language.
      • Vectors assume the semantics of sentences are coherent in some way. As long as they are, the method works; when they aren't, how does the method break down?
    • This is an assumption often broken in spoken language by disfluencies, topic shifts, interruptions, etc.
    • How else might spoken language models differ?
    • How might methods change to account for this?
    • Do words from previous sentences have semantic relationships to the words in following sentences? (Yes. Because discourse.)
    • Why miss out on that information?
  • Why the middle?
    • Vectors use words “in the middle of a sentence”? Why?
    • Do words that tend to occur in the middle of sentences differ in meaningful ways from those that do not? (Possibly. E.g., verbs and prepositions rarely begin or end sentences. Information structure affects the internal ordering of constituents in interesting ways.)
    • Did they look at the rate of occurrences of each unique lexical entry in-the-middle and near the peripheries?
  • Synthetic Languages:
    • How well would word vectors work for a synthetic language like Mohawk?
    • What pre-processing steps might need to be added/modified?
    • Would those modifications be enough to get similar/useful results?
  • Morphology: Can the same method learn the difference between “ant” and “ants”?
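The skip-gram pair generation the tutorial describes, and the sentence-boundary assumption questioned above, can be sketched as follows. This is a toy illustration of my own, not McCormick's code:

```python
def skipgram_pairs(sentences, window=2):
    """Generate (center, context) training pairs within each sentence.

    Pairs never cross sentence boundaries -- exactly the design
    choice the questions above poke at.
    """
    pairs = []
    for sentence in sentences:
        for i, center in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, sentence[j]))
    return pairs

sents = [["the", "cat", "sat"], ["dogs", "bark"]]
print(skipgram_pairs(sents, window=1))
```

Note that ("sat", "dogs") is never produced: the sentence boundary blocks it, so any discourse-level relationship between adjacent sentences is invisible to the model.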
This was a fun read.

The original tutorial:
McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from

Thursday, September 14, 2017

Nuts and Bolts of Applying Deep Learning (Andrew Ng)

I recently watched Andrew Ng's excellent lecture from 2016 Nuts and Bolts of Applying Deep Learning and took notes. I post them as a helpful resource for anyone who wants to watch the video.

I broke it into the following sections:
  1. End-to-End DL for rich output
  2. Buckets of DL
  3. Bias and variance
  4. Applied machine learning work flow
  5. New Era of ML
  6. Build a Unified Data Warehouse
  7. The 70/30 Split Revisited
  8. Comparing to Human Level Performance
  9. How do you define human level performance?
  10. How do you build a career in machine learning?
  11. AI is the new electricity
  1. End-to-end DL – work flow
  2. Bias and variance has changed in era of deep learning
  3. DL has been around for decades, so why does it work well now?
    • Scale of data and computation
    • Two teams
      • AI Teams
      • Systems team
      • Sit together
      • Difficult for any one human to be sufficiently expert in multiple fields
End-to-End DL for rich output
    • From first three buckets below
    • Traditional ML models output real numbers
    • End-to-end DL can output more complex things than numbers
      • Sentence captions for images
      • Speech-to-text
      • Machine translation
      • Synthesize new images (13:00)
    • End-to-End DL not the solution to everything.
      • End-to-end = having just a DL between input and output
      • Rules for when to use (13:35)
        • Old way: audio ------> phonemes --> transcript
        • New DL way: audio -----------------> transcript
      • Makes for great PR, but only works sometimes (15:31)
      • Achilles heel – need lots of labeled data
      • Maybe phonemes are just a fantasy of linguists (15:48)
      • Advantage of old non-end-to-end architecture is it allows you to manually add more information to the processing (18:16)
      • Also, for self-driving cars, no one has enough data (right now) to make end-to-end work (20:42)
    • Common problem – after first round of dev, ML not working that well, what do you do next?
      • Collect more data
      • Train longer
      • Different architecture (e.g., switch to NNs)
      • Regularization
      • Bigger model
      • More GPUs
    • Skill in ML engineer is knowing how to make these decisions (22:33)
Buckets of DL
  1. General models
    • Densely connected layers – FC
    • Sequence models – 1D (RNN, LSTM, GRU, attention)
    • Image models – 2D, 3D (conv nets)
    • Other – unsupervised, reinforcement
  2. First three buckets driving market advances
  3. But "Other" bucket is future of AI
Bias and variance – evolving
  1. Scenario: build human level speech rec system
    • Measure human level error – 1%
    • Training set error – 5%
    • Dev set error – 6%
  2. Bias = difference between human error level and your system’s
  3. TIP: For bias problems try training a bigger model (25:21)
  4. Variance (overfitting): if Human 1%, Training 2%, Dev 6%
  5. TIP: for variance, try adding regularization, early stopping, best bet = more data
  6. Both high bias and high variance: if Human 1%, Training 5%, Dev 10%
  7. “Sucks for you” (direct quote, 26:30)
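The arithmetic behind these diagnoses is simple enough to sketch in a few lines. The function name and return format are my own framing, not Ng's:

```python
def diagnose(human_err, train_err, dev_err):
    """Split total error into avoidable bias and variance,
    using human-level error as a proxy for the optimal (Bayes) rate."""
    bias = train_err - human_err      # gap between training and human level
    variance = dev_err - train_err    # gap between dev and training
    return bias, variance

# Scenario 6 from the notes: human 1%, training 5%, dev 10%
# -> high bias AND high variance
print(diagnose(0.01, 0.05, 0.10))
```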
Applied machine learning work flow
  1. Is your training error high
    • Yes
      • Bigger model
      • Train longer
      • New architecture
      • Repeat until doing well on training set
  2. Is dev error high?
    • Yes
      • Add data
      • Regularization
      • New architecture
      • Repeat until doing well on dev set
  3. Done
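The work flow above is essentially a decision rule, which can be written out as code. The fix lists come from the notes; the function itself is my own framing:

```python
def next_actions(train_err_high, dev_err_high):
    """Ng's applied-ML work flow: attack training error first,
    then dev error, then you're done."""
    if train_err_high:  # bias problem
        return ["bigger model", "train longer", "new architecture"]
    if dev_err_high:    # variance problem
        return ["add data", "regularization", "new architecture"]
    return ["done"]
```

In practice you loop: apply a fix, re-measure both errors, and call the rule again until it returns "done".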
New Era of ML
  1. We now know that whatever problem you are facing (high bias or high variance), you have at least one action you can take to correct it
  2. No longer a bias/variance trade-off (29:47)
  3. “Dumb” formula of bigger model/more data is easy to implement even for non-experts and is enough to do very well on a lot of problems (31:09)
  4. More data has led to interesting investments
    • Data synthesis - Growing area
    • Examples-
      • OCR at Baidu
      • Take random image
      • Random word
      • Type random word in Microsoft Word
      • Use random font
      • You just created training data for OCR
      • Still takes some human intervention, but lots of progress
    • Speech recognition
      • Take clean audio
      • Add random noise to background for more data
      • E.g., add car noise
      • Works remarkably well
    • NLP
      • Take ungrammatical sentences and auto-correct
      • Easy to create ungrammatical sentences programmatically
    • Video games in RL
  5. Data synthesis has a lot of limits (36:24)
    • Why not take cars from Grand Theft Auto and use that as training data for self-driving cars
    • 20 cars in video game enough to give “realistic” impression to player
    • But 20 cars is very impoverished data set for self-driving cars
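The speech-recognition recipe above (clean audio + random background noise = more data) can be sketched with plain lists standing in for audio samples. A real pipeline would operate on waveform arrays; the function and parameter names here are my own illustration:

```python
import random

def synthesize(clean, noise, n_copies=3, level=0.1, seed=0):
    """Create n_copies noisy variants of a clean signal by mixing in
    a randomly offset slice of background noise, scaled by `level`."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_copies):
        start = rng.randrange(len(noise) - len(clean) + 1)
        variants.append([c + level * noise[start + i]
                         for i, c in enumerate(clean)])
    return variants

clean = [0.0, 0.5, -0.5, 0.2]          # stand-in for clean audio
car_noise = [0.3, -0.1, 0.4, 0.0, -0.2, 0.1]  # stand-in for car noise
augmented = synthesize(clean, car_noise)
```

Each variant has the same length and underlying signal as the clean input, which is why the labels (the transcripts) can be reused for free.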
Build a Unified Data Warehouse
  1. Employees can be possessive of "their" data
  2. Baidu: it’s not your data, it’s company data
  3. Access rights can be a different issue
  4. But warehouse everything together
  5. Kaggle
The 70/30 Split Revisited
  1. In academia, common for test/train to come from same distribution
  2. But more common in industry for test and train to come from different distributions
    • E.g., speech rec at Baidu
      • Speech enabled rear view mirror (in China)
      • 50,000 hours of regular speech data
      • Data not from rear-view mirror interactions though
      • Collect another 10 hours of rear-view mirror scenario
    • What do you do with the original 50,000 hours of not-quite right data?
      • Old method would be to build a different model for each scenario
      • New era, one model for all data
      • Bad idea: split the 50,000 hours into training/dev and use the 10 hours as test. DON’T DO THIS.
      • TIP: Make sure dev and test are from same distro (boosts effectiveness)
      • Good idea: make the 50,000 hours train, split the 10 hours into dev/test
    • Dev set = problem specification
      • Me: "dev set = problem you are trying to solve"
    • Also, split off just 20 hours from 50,000 to create tiny “dev-train” set
      • this has same distro as train
  3. Mismatched train and dev sets are a problem that academia doesn’t work on much
    • some work on domain adaptation, but not much (44:53)
  4. New architecture fix = “hail mary” (48:58)
  5. Takes a long time to really grok bias/variance
    • People who really understand bias/variance deeply are able to drive rapid progress in machine learning (50:33)
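The recommended split can be sketched as a small helper. The names are mine: `general` stands in for the 50,000 hours of mismatched data, `target` for the 10 hours from the distribution you actually care about:

```python
def ng_split(general, target, dev_train_frac=0.05):
    """Train on all the mismatched 'general' data (plus a small
    dev-train slice drawn from it, for diagnosing the mismatch);
    dev and test both come from the 'target' distribution."""
    k = int(len(general) * dev_train_frac)
    half = len(target) // 2
    return {
        "train": general[k:],
        "dev_train": general[:k],  # same distro as train
        "dev": target[:half],      # same distro as test
        "test": target[half:],
    }
```

The key property (Ng's TIP above) is that `dev` and `test` come from the same distribution, while `dev_train` shares the training distribution so you can tell variance apart from train/dev mismatch.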
Common Theme – Comparing to Human Level Performance
  1. Common to achieve human level performance, then level off
  2. Why?
    • Audience: Labels come from humans
    • Audience: Researchers get satisfied with results (the laziness hypothesis)
    • Andrew: theoretical limits (aka optimal error rate, Bayes rate)
      • Some audio so bad, impossible to transcribe (phone call from a rock concert)
      • Some images so blurry, impossible to interpret
    • Humans are really good at some things, so once you surpass human level accuracy, there’s not much room left to improve (54:38)
  3. While worse than humans, still ways to improve
    • Get labels from humans
    • Error analysis
    • Estimate bias/variance effects
  4. For tasks that humans are bad at (say 30% error rate), really hard to find guidance on how to improve
How do you define human level performance?
  1. Quiz: Which is the most useful definition? (1:01:00)
    • Example: Medical image reading
      1. Typical non-doctor error - 3%
      2. Typical doctor – 1%
      3. Expert doctor – 0.7%
      4. Team of expert doctors – 0.5%
    • Answer: Team of expert doctors is best because ideally you are using human performance to proxy optimal error rate.
What can AI do? (1:06:30)
  1. Anything that a typical person can do in less than one second.
    • E.g., Perception tasks
    • Audience: if a human can do it in less than a second, you can get a lot of data
How do you build a career in machine learning? (1:11:00)
  1. Andrew says he does not have a great answer (me: but he does have a good one)
    • Taking a ML course
    • Attend DL school
    • Work on project yourself (Kaggle)
    • Mimic PhD student process
      • Read a lot of papers (20+)
      • Replicate results
    • Dirty work
      • Downloading/cleaning data
      • Re-running someone’s code
    • Don’t only do dirty work
    • PhD process + Dirty work = reliable
      • Keep it up for a year
      • Competency
AI is the new electricity (1:18:00)
  1. Transforms industry after industry
  2. Get in on the ground floor
  3. NOTE: this is the title of his follow-up talk, which has a video link at the end of the one above.

Saturday, April 8, 2017

NLPers: How would you characterize your linguistics background?

That was the poll question my hero Professor Emily Bender posed on Twitter March 30th. 573 tweets later, a truly epic thread had been created, pitting some of the most influential NLPers in the world head to head in a contest of wit and debate that was exactly what the field of NLP needed. Unfortunately, Twitter's app is hopeless at displaying the thread as a single conversation. But student Sebastian Meilke used a Chrome extension called Treeverse to put the whole thread together into a single, readable format, complete with sections!

If you are at all interested in NLP or linguistics, this is a must read: NLP/CL Twitter Megathread.

I would be remiss if I didn't note my own small role in this. Emily's poll was sparked by my own tweet, where I said I was glad the Linguistic Society of America is starting its own Society for Computational Linguistics because "the existing CL and NLP communities have gotten farther and farther off the linguistics path."

Happy reading!

Saturday, March 25, 2017

Small Towns Do Tech Startups Too

I'm participating in the Startup Weekend event in Chico, CA, sponsored by Chicostart. Great first night. The 60-second pitches were awesome. So much variety and so many interesting problems to solve. I pitched my own idea about using Watson to identify fake news, but I only got two blue stickers, boo hoo. I am very impressed with the space that hosted us: a multi-use facility that houses a lot of tech startups. Kudos for putting such a great state-of-the-art space in a small college town.

I was particularly impressed with the range of pitchers. Millennials, students, professionals, middle-aged folks, and at least one senior. Not bad for a small town with big dreams.

FYI - I'll be helping build a startup this weekend focused on smart housing devices. Will be a lot of fun.

Friday, March 24, 2017

Chico Startup Weekend

I'm looking forward to participating in Chico Startup Weekend, sponsored by Chicostart. I can't say what the weekend holds, but it's nice to know there's this kind of energy and drive in small town America. Startups ain't just for Austin and Mountain View.

Monday, March 20, 2017

Using IBM Watson Knowledge Studio to Train Machine Learning Models

Using the Free Trial version of IBM's Watson Knowledge Studio, I just annotated a text and created a machine learning model in about 3 hours without writing a single line of code. The mantra of WKS is that you don't program Watson, you teach Watson.

For demo purposes I chose to identify personal relationships in Shirley Jackson's 1948 short story The Lottery. This is a haunting story about a small village and its mindless adherence to an old and tragic tradition. I chose it because 1) it's short and 2) it has clear personal relationships like brothers, sisters, mothers, and fathers. I added a few other relations like AGENT_OF (which amounts to subjects of verbs) and X_INSIDE_Y for things like pieces of paper inside a box.
Caveat: This short story is really short: 3300 words. So I had no high hopes of getting a good model out of this. I just wanted to go through an entire machine learning work flow from gathering text data to outputting a complete model without writing a single line of code. And that's just what I did.

I spent about 30 minutes prepping the data: e.g., I broke it into 20 small snippets (to facilitate the test/train split later), and I edited some quotation issues, spelling, etc.
It uploaded into WKS in seconds (by simply dragging and dropping the files into the browser tool). I then created a Type System to include entity types such as these:

And relation types such as these:
I then annotated the 20 short documents in less than two hours (as is so often the case, I re-designed my type system several times along the way; luckily WKS allows me to do this fluidly without having to re-annotate).

Here's a shot of my entity annotations:

Here's a shot of my relation annotations:

I then used these manually annotated documents as ground truth to teach a machine learning model to recognize the relationships automatically using a set of linguistic features (character and word ngrams, parts-of-speech, syntactic parses, etc). I accepted the WKS suggested split of documents as 70/23/2:
I clicked "Run" and waited:

The model was trained and evaluated in about ten minutes. Here's how it performed on entity types:

And here's how it performed on relation types:

This is actually not bad given how sparse the data is. I mean, an F1 of .33 on X_INSIDE_Y from only 29 training examples on a first pass. I'll take it, especially since that one is not necessarily obvious from the text. Here's one example of the X_INSIDE_Y relation:

So I was able to train a model with 11 entity types and 81 relation types on a small corpus in less than three hours, start to finish, without writing a single line of code. I did not program Watson. I taught it.
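For reference, the F1 scores WKS reports combine precision and recall as their harmonic mean. A minimal sketch of the arithmetic:

```python
def f1(true_positives, false_positives, false_negatives):
    """F1 = harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

An F1 of .33 on 29 training examples means the model is finding roughly a third of the signal on a relation it was barely shown, which is why I'll take it.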

Thursday, March 9, 2017

Annotate texts and create learning models

IBM Watson has released a free trial version of their online Watson Knowledge Studio tool. This is one of the tools I'm most excited about because it brings linguistic annotation, rule writing, dictionary creation, and machine learning together in a single user-friendly interface designed for non engineers to use.

This tool allows users to annotate documents with mentions, relations, and coreference, then learn a linguistic model of their environments. With zero computer science background. I've trained subject matter experts in several fields to use the WKS and I'm genuinely impressed. I'll put together a demo and post it sometime this weekend.
