Thursday, September 14, 2017

Nuts and Bolts of Applying Deep Learning (Andrew Ng)

I recently watched Andrew Ng's excellent 2016 lecture, Nuts and Bolts of Applying Deep Learning, and took notes. I'm posting them as a resource for anyone who wants to watch the video.

 
I broke it into the following sections:
  1. End-to-End DL for rich output
  2. Buckets of DL
  3. Bias and variance
  4. Applied machine learning work flow
  5. New Era of ML
  6. Build a Unified Data Warehouse
  7. The 70/30 Split Revisited
  8. Comparing to Human Level Performance
  9. How do you define human level performance?
  10. How do you build a career in machine learning?
  11. AI is the new electricity
Intro
  1. End to end DL – work flow
  2. Bias and variance has changed in era of deep learning
  3. DL has been around for decades, so why does it work well now?
    • Scale of data and computation
    • Two teams
      • AI Teams
      • Systems team
      • Sit together
      • Difficult for any one human to be sufficiently expert in multiple fields
End-to-End DL for rich output
    • From first three buckets below
    • Traditional ML models output real numbers
    • End-to-end DL can output more complex things than numbers
      • Sentence captions for images
      • Speech-to-text
      • Machine translation
      • Synthesize new images (13:00)
    • End-to-End DL not the solution to everything.
      • End-to-end = having just a DL between input and output
      • Rules for when to use (13:35)
        • Old way: audio ------> phonemes --> transcript
        • New DL way: audio -----------------> transcript
      • Makes for great PR, but only works sometimes (15:31)
      • Achilles heel – need lots of labeled data
      • Maybe phonemes are just a fantasy of linguists (15:48)
      • Advantage of old non-end-to-end architecture is it allows you to manually add more information to the processing (18:16)
      • Also, for self-driving cars, no one has enough data (right now) to make end-to-end work (20:42)
    • Common problem – after first round of dev, ML not working that well, what do you do next?
      • Collect more data
      • Train longer
      • Different architecture (e.g., switch to NNs)
      • Regularization
      • Bigger model
      • More GPUs
    • Skill in ML engineer is knowing how to make these decisions (22:33)
Buckets of DL
  1. General models
    • Densely connected layers – FC
    • Sequence models – 1D (RNN, LSTM, GRU, attention)
    • Image models – 2D, 3D (Convo nets)
    • Other – unsupervised, reinforcement
  2. First three buckets driving market advances
  3. But "Other" bucket is future of AI
 
Bias and variance – evolving
  1. Scenario: build human level speech rec system
    • Measure human level error – 1%
    • Training set error – 5%
    • Dev set – 6%
  2. Bias = difference between human error level and your system’s training error
  3. TIP: For bias problems try training a bigger model (25:21)
  4. Variance (overfitting): if Human 1%, Training 2%, Dev 6%
  5. TIP: for variance, try adding regularization, early stopping, best bet = more data
  6. Both high bias and high variance: if Human 1%, Training 5%, Dev 10%
  7. “sucks for you” (direct quote 26:30)
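
The three scenarios above can be sketched as a small check. This is only an illustration of the arithmetic: the function name and the 2-point cutoff for calling a gap "high" are my own assumptions, not numbers from the lecture.

```python
def diagnose(human_err, train_err, dev_err, gap_threshold=0.02):
    """Label a model's problem from three error rates given as fractions (0.05 = 5%).

    gap_threshold is a hypothetical cutoff for calling a gap "high";
    the lecture works from examples, not from a fixed number.
    """
    bias = train_err - human_err    # gap between human-level and training error
    variance = dev_err - train_err  # gap between training and dev error
    problems = []
    if bias > gap_threshold:
        problems.append("high bias")
    if variance > gap_threshold:
        problems.append("high variance")
    return bias, variance, problems


# The three scenarios from the notes: (human, training, dev)
for errors in [(0.01, 0.05, 0.06), (0.01, 0.02, 0.06), (0.01, 0.05, 0.10)]:
    print(errors, diagnose(*errors)[2])
```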
Applied machine learning work flow
  1. Is your training error high
    • Yes
      • Bigger model
      • Train longer
      • New architecture
      • Repeat until doing well on training set
  2. Is dev error high?
    • Yes
      • Add data
      • Regularization
      • New architecture
      • Repeat until doing well on dev set
  3. Done
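
Read as code, the workflow is a two-question decision: fix the training error first, then the dev error. A minimal sketch, assuming you have already decided whether each error counts as "high" (the function name is hypothetical):

```python
def next_actions(train_error_high, dev_error_high):
    """Return the options the workflow suggests for the current state."""
    if train_error_high:
        # Bias problem: keep iterating until you do well on the training set.
        return ["bigger model", "train longer", "new architecture"]
    if dev_error_high:
        # Variance problem: keep iterating until you do well on the dev set.
        return ["add data", "regularization", "new architecture"]
    return []  # neither error is high: done


print(next_actions(train_error_high=True, dev_error_high=True))
print(next_actions(train_error_high=False, dev_error_high=True))
print(next_actions(train_error_high=False, dev_error_high=False))
```

Note that training error is checked first: when both are high, the sketch only returns the bias fixes, matching the order of the questions above.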
New Era of ML
  1. We now know whatever problem you are facing (high bias or high variance) you have at least one action you can take to correct
  2. No longer a bias/variance trade-off (29:47)
  3. “Dumb” formula of bigger model/more data is easy to implement even for non-experts and is enough to do very well on a lot of problems (31:09)
  4. More data has led to interesting investments
    • Data synthesis - Growing area
    • Examples-
      • OCR at Baidu
      • Take random image
      • Random word
      • Type random word in Microsoft Word
      • Use random font
      • You just created training data for OCR
      • Still takes some human intervention, but lots of progress
    • Speech recognition
      • Take clean audio
      • Add random noise to background for more data
      • E.g., add car noise
      • Works remarkably well
    • NLP
      • Take ungrammatical sentences and auto-correct
      • Easy to create ungrammatical sentences programmatically
    • Video games in RL
  5. Data synthesis has a lot of limits (36:24)
    • Why not take cars from Grand Theft Auto and use that as training data for self-driving cars
    • 20 cars in video game enough to give “realistic” impression to player
    • But 20 cars is very impoverished data set for self-driving cars
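
The speech recognition example (clean audio plus background noise) can be sketched in a few lines. Mixing at a target signal-to-noise ratio is my own assumption about how to control the amount of noise; the lecture only says to add noise such as car noise. The signals here are plain lists of samples, and the sine wave and Gaussian noise are stand-ins for real recordings.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Add background noise to a clean signal at a target signal-to-noise ratio in dB.

    Assumes both inputs are non-empty lists of floats and the noise is not all zeros.
    """
    # Loop (or truncate) the noise so it covers the whole clean clip.
    noise = [noise[i % len(noise)] for i in range(len(clean))]
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    # Scale the noise so that clean_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = math.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]


# Stand-in data: a sine wave as "clean audio", Gaussian samples as "car noise".
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [rng.gauss(0, 1) for _ in range(50)]

# One clean clip becomes several training examples at different noise levels.
augmented = [mix_at_snr(clean, noise, snr_db) for snr_db in (20, 10, 5)]
```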
Build a Unified Data Warehouse
  1. Employees can be possessive of "their" data
  2. Baidu: it’s not your data, it’s company data
  3. Access rights can be a different issue
  4. But warehouse everything together
  5. Kaggle
The 70/30 Split Revisited
  1. In academia, common for test/train to come from same distribution
  2. But more common in industry for test and train to come from different distributions
    • E.g., speech rec at Baidu
      • Speech enabled rear view mirror (in China)
      • 50,000 hours of regular speech data
      • Data not from rear-view mirror interactions though
      • Collect another 10 hours of rear-view mirror scenario
    • What do you do with the original 50,000 hours of not-quite right data?
      • Old method would be to build a different model for each scenario
      • New era, one model for all data
      • Bad idea: split the 50,000 hours into training/dev and use the 10 hours as test. DON’T DO THIS.
      • TIP: Make sure dev and test are from same distro (boosts effectiveness)
      • Good idea: make the 50,000 hours train, split the 10 hours into dev/test
    • Dev set = problem specification
      • Me: "dev set = problem you are trying to solve"
    • Also, split off just 20 hours from 50,000 to create tiny “dev-train” set
      • this has same distro as train
  3. Mismatched train and dev set is problem that academia doesn’t work on much
    • some work on domain adaptation, but not much (44:53)
  4. New architecture fix = “hail mary” (48:58)
  5. Takes a long time to really grok bias/variance
    • People who really understand bias/variance deeply are able to drive rapid progress in machine learning (50:33)
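
The "good idea" split above can be sketched as follows. The function and key names are hypothetical, and splitting the small target set 50/50 into dev and test is my own assumption; the notes only say to split it into dev/test.

```python
import random


def split_mismatched_data(general, target, dev_train_size, seed=0):
    """Build train / dev-train / dev / test sets from two data sources.

    general: the large set from a different distribution (e.g. regular speech)
    target:  the small set from the distribution you care about (e.g. rear-view mirror)
    """
    rng = random.Random(seed)
    general, target = list(general), list(target)
    rng.shuffle(general)
    rng.shuffle(target)
    half = len(target) // 2  # assumption: even dev/test split
    return {
        "dev_train": general[:dev_train_size],  # same distribution as train
        "train": general[dev_train_size:],
        "dev": target[:half],                   # dev and test share a distribution
        "test": target[half:],
    }


# Toy stand-ins: 1,000 "general" clips and 10 "mirror" clips.
general = [("general", i) for i in range(1000)]
mirror = [("mirror", i) for i in range(10)]
splits = split_mismatched_data(general, mirror, dev_train_size=20)
print({name: len(items) for name, items in splits.items()})
```

Keeping dev and test on the target data means the numbers you tune against reflect the problem you are actually trying to solve, while the dev-train set gives you a held-out check that still matches the training distribution.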
Common Theme – Comparing to Human Level Performance
  1. Common to achieve human level performance, then level off
  2. Why?
    • Audience: Labels come from humans
    • Audience: Researchers get satisfied with results (the laziness hypothesis)
    • Andrew: theoretical limits (aka optimal error rate, Bayes rate)
      • Some audio so bad, impossible to transcribe (phone call from a rock concert)
      • Some images so blurry, impossible to interpret
    • Humans are really good at some things, so once you surpass human level accuracy, there’s not much room left to improve (54:38)
  3. While worse than humans, still ways to improve
    • Get labels from humans
    • Error analysis
    • Estimate bias/variance effects
  4. For tasks that humans are bad at (say 30% error rate), really hard to find guidance on how to improve
How do you define human level performance?
  1. Quiz: Which is the most useful definition? (101:00)
    • Example: Medical image reading
      1. Typical non-doctor error - 3%
      2. Typical doctor – 1%
      3. Expert doctor – 0.7%
      4. Team of expert doctors – 0.5%
    • Answer: Team of expert doctors is best because ideally you are using human performance to proxy optimal error rate.
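
The quiz answer amounts to taking the lowest human error you can measure as the stand-in for the optimal error rate. A small sketch of that arithmetic, using the numbers from the example; the 2% training error is a made-up value for illustration, and the function names are hypothetical.

```python
human_error = {
    "typical non-doctor": 0.03,
    "typical doctor": 0.01,
    "expert doctor": 0.007,
    "team of expert doctors": 0.005,
}


def optimal_error_proxy(errors):
    """Return (name, error) for the lowest human error, used as a proxy for optimal error."""
    return min(errors.items(), key=lambda item: item[1])


def bias_vs_proxy(train_err, errors):
    """Gap between training error and the best available human-level estimate."""
    return train_err - optimal_error_proxy(errors)[1]


name, err = optimal_error_proxy(human_error)
print(name, err)
# Hypothetical training error of 2%: the remaining gap depends on which definition you pick.
print(bias_vs_proxy(0.02, human_error))
```

With a weaker definition (say, the typical doctor at 1%) the same model would look closer to done than it really is, which is why the strictest definition is the most useful one.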
What can AI do? (106:30)
  1. Anything that a typical person can do in less than one second.
    • E.g., Perception tasks
    • Audience: if a human can do it in less than a second, you can get a lot of data
How do you build a career in machine learning (111:00)
  1. Andrew says he does not have a great answer (me: but he does have a good one)
    • Taking a ML course
    • Attend DL school
    • Work on project yourself (Kaggle)
    • Mimic PhD student process
      • Read a lot of papers (20+)
      • Replicate results
    • Dirty work
      • Downloading/cleaning data
      • Re-running someone’s code
    • Don’t only do dirty work
    • PhD process + Dirty work = reliable
      • Keep it up for a year
      • Competency
AI is the new electricity (118:00)
  1. Transforms industry after industry
  2. Get in on the ground floor
  3. NOTE: this is the title of his follow-up talk, which has a video link at the end of the one above.
 
