- End-to-End DL for rich output
- Buckets of DL
- Bias and variance
- Applied machine learning work flow
- New Era of ML
- Build a Unified Data Warehouse
- The 70/30 Split Revisited
- Comparing to Human Level Performance
- How do you define human level performance?
- How do you build a career in machine learning?
- AI is the new electricity
- End-to-end DL workflow
- Bias and variance have changed in the era of deep learning
- DL has been around for decades; why does it work well now?
- Scale of data and computation
- Two teams
- AI team
- Systems team
- Sit together
- Difficult for any one human to be sufficiently expert in multiple fields
End-to-End DL for rich output
- Built from models in the first three buckets listed below
- Traditional ML models output real numbers
- End-to-end DL can output more complex things than numbers
- Sentence captions for images
- Speech-to-text
- Machine translation
- Synthesize new images (13:00)
- End-to-End DL not the solution to everything.
- End-to-end = having just a deep network between input and output
- Rules for when to use (13:35)
- Old way: audio --> phonemes --> transcript
- New DL way: audio --> transcript (contrast sketched in code below)
- Makes for great PR, but only works sometimes (15:31)
- Achilles' heel – needs lots of labeled data
- Maybe phonemes are just a fantasy of linguists (15:48)
- Advantage of the old non-end-to-end architecture: it lets you manually add more information to the processing (18:16)
- Also, for self-driving cars, no one has enough data (right now) to make end-to-end work (20:42)
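A toy sketch of the contrast above; the functions are my own placeholders, not a real speech system:

```python
def acoustic_model(audio):
    """Old pipeline stage 1: audio -> phonemes (placeholder)."""
    return ["DH", "AH"]

def phoneme_decoder(phonemes):
    """Old pipeline stage 2: phonemes -> transcript (placeholder)."""
    return "the"

def deep_net(audio):
    """End-to-end: one learned mapping, audio -> transcript (placeholder)."""
    return "the"

audio = [0.0, 0.3, -0.1]  # stand-in for a raw waveform

old_way = phoneme_decoder(acoustic_model(audio))  # audio -> phonemes -> text
new_way = deep_net(audio)                         # audio -> text directly
```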
- Common problem – after first round of dev, ML not working that well, what do you do next?
- Collect more data
- Train longer
- Different architecture (e.g., switch to NNs)
- Regularization
- Bigger model
- More GPUs
- Skill in ML engineer is knowing how to make these decisions (22:33)
Buckets of DL
- General models
- Densely connected layers – FC
- Sequence models – 1D (RNN, LSTM, GRU, attention)
- Image models – 2D, 3D (ConvNets)
- Other – unsupervised, reinforcement
- First three buckets are driving market advances (representative layers sketched in code below)
- But "Other" bucket is future of AI
Bias and variance – evolving
- Scenario: build a human-level speech recognition system
- Measure human-level error – 1%
- Training set error – 5%
- Dev set error – 6%
- Bias = difference between human-level error and your training error
- TIP: For bias problems try training a bigger model (25:21)
- Variance (overfitting): if Human 1%, Training 2%, Dev 6%
- TIP: for variance, try adding regularization, early stopping, best bet = more data
- Both high bias and high variance: if Human 1%, Training 5%, Dev 10% (all three gaps computed in the sketch below)
- “sucks for you” (direct quote 26:30)
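The arithmetic for all three scenarios as a small sketch, using human-level error as a proxy for the optimal error rate:

```python
scenarios = {                   # (human, train, dev) error rates
    "high bias":     (0.01, 0.05, 0.06),
    "high variance": (0.01, 0.02, 0.06),
    "both (ouch)":   (0.01, 0.05, 0.10),
}
for name, (human, train, dev) in scenarios.items():
    print(f"{name}: bias = {train - human:.0%}, variance = {dev - train:.0%}")
# high bias: bias = 4%, variance = 1%
# high variance: bias = 1%, variance = 4%
# both (ouch): bias = 4%, variance = 5%
```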
Applied machine learning work flow
- Is your training error high?
- Yes
- Bigger model
- Train longer
- New architecture
- Repeat until doing well on training set
- Is dev error high?
- Yes
- Add data
- Regularization
- New architecture
- Repeat until doing well on dev set
- Done (decision flow sketched in code below)
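The flow above as a small decision sketch; the error values and the 5% target are made up for illustration:

```python
def next_action(train_err, dev_err, target=0.05):
    if train_err > target:   # high bias
        return "bigger model / train longer / new architecture"
    if dev_err > target:     # high variance
        return "more data / regularization / new architecture"
    return "done"

print(next_action(0.08, 0.10))  # high training error -> attack bias first
print(next_action(0.02, 0.09))  # training fine, dev high -> attack variance
print(next_action(0.02, 0.03))  # done
```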
New Era of ML
- We now know that whatever problem you are facing (high bias or high variance), you have at least one action you can take to correct it
- No longer a bias/variance trade-off (29:47)
- “Dumb” formula of bigger model/more data is easy to implement even for non-experts and is enough to do very well on a lot of problems (31:09)
- More data has led to interesting investments
- Data synthesis – a growing area
- Examples:
- OCR at Baidu
- Take random image
- Random word
- Type random word in Microsoft Word
- Use random font
- You just created training data for OCR
- Still takes some human intervention, but lots of progress
- Speech recognition
- Take clean audio
- Add random noise to background for more data
- E.g., add car noise
- Works remarkably well (noise overlay sketched in code below)
- NLP
- Take ungrammatical sentences and auto-correct
- Easy to create ungrammatical sentences programmatically
- Video games in RL
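A minimal sketch of the speech-synthesis trick, treating waveforms as NumPy arrays; the signals and the 0.1 mixing scale are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_background(clean, noise, scale=0.1):
    """Overlay a random crop of background noise on a clean clip."""
    start = rng.integers(0, len(noise) - len(clean) + 1)
    return clean + scale * noise[start:start + len(clean)]

clean = np.sin(np.linspace(0, 440 * 2 * np.pi, 16_000))  # stand-in speech clip
car_noise = rng.normal(size=160_000)                     # stand-in car noise
augmented = add_background(clean, car_noise)             # one new training clip
```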
- Data synthesis has a lot of limits (36:24)
- Why not take cars from Grand Theft Auto and use that as training data for self-driving cars?
- 20 cars in video game enough to give “realistic” impression to player
- But 20 cars is very impoverished data set for self-driving cars
Build a Unified Data Warehouse
- Employees can be possessive of "their" data
- Baidu: it's not your data, it's the company's data
- Access rights can be a different issue
- But warehouse everything together
The 70/30 Split Revisited
- Kaggle
- In academia, it's common for test/train to come from the same distribution
- But it's more common in industry for test and train to come from different distributions
- E.g., speech recognition at Baidu
- Speech-enabled rear-view mirror (in China)
- 50,000 hours of regular speech data
- Data not from rear-view mirror interactions though
- Collect another 10 hours of rear-view mirror scenario
- What do you do with the original 50,000 hours of not-quite right data?
- Old method would be to build a different model for each scenario
- New era, one model for all data
- Bad idea: split the 50,000 hours into training/dev and use the 10 hours as test. DON'T DO THIS.
- TIP: Make sure dev and test are from same distro (boosts effectiveness)
- Good idea: make the 50,000 hours train, split the 10 hours into dev/test (split sketched in code below)
- Dev set = problem specification
- Me: "dev set = problem you are trying to solve"
- Also, split off just 20 hours from 50,000 to create tiny “dev-train” set
- this has same distro as train
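A sketch of that split, one list element per hour of audio; the helper is my own, but the numbers mirror the example above:

```python
def split(general_hours, mirror_hours, dev_train=20):
    train     = general_hours[dev_train:]   # bulk: 49,980 h of general speech
    train_dev = general_hours[:dev_train]   # tiny "dev-train", SAME distro as train
    mid       = len(mirror_hours) // 2
    dev, test = mirror_hours[:mid], mirror_hours[mid:]  # both from TARGET distro
    return train, train_dev, dev, test

train, train_dev, dev, test = split(list(range(50_000)), list(range(10)))
print(len(train), len(train_dev), len(dev), len(test))  # 49980 20 5 5
```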
- Mismatched train and dev set is problem that academia doesn’t work on much
- some work on domain adaptation, but not much (44:53)
- New architecture fix = “hail mary” (48:58)
- Takes a long time to really grok bias/variance
- People who really understand bias/variance deeply are able to drive rapid progress in machine learning (50:33)
Common Theme – Comparing to Human Level Performance
- Common to achieve human level performance, then level off
- Why?
- Audience: Labels come from humans
- Audience: Researchers get satisfied with results (the laziness hypothesis)
- Andrew: theoretical limits (aka optimal error rate, Bayes rate)
- Some audio so bad, impossible to transcribe (phone call from a rock concert)
- Some images so blurry, impossible to interpret
- Humans are really good at some things, so once you surpass human level accuracy, there’s not much room left to improve (54:38)
- While worse than humans, still ways to improve
- Get labels from humans
- Error analysis
- Estimate bias/variance effects
- For tasks that humans are bad at (say 30% error rate), really hard to find guidance on how to improve
How do you define human level performance?
- Quiz: Which is the most useful definition? (1:01:00)
- Example: Medical image reading
- Typical non-doctor error - 3%
- Typical doctor – 1%
- Expert doctor – 0.7%
- Team of expert doctors – 0.5%
- Answer: Team of expert doctors is best, because ideally you are using human-level performance as a proxy for the optimal error rate (comparison sketched in code below).
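A sketch of why the choice of proxy matters; the 0.8% training error is a made-up number for illustration:

```python
proxies = {                      # candidate "human-level" error rates
    "typical non-doctor": 0.030,
    "typical doctor":     0.010,
    "expert doctor":      0.007,
    "team of experts":    0.005,
}
train_err = 0.008                # hypothetical model error, made up

for who, err in proxies.items():
    print(f"vs {who}: avoidable bias = {train_err - err:+.1%}")
# Against the non-doctor proxy the model looks superhuman (-2.2%);
# against the team of experts there is still +0.3% of bias worth chasing.
```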
What can AI do? (1:06:30)
- Anything that a typical person can do in less than one second.
- E.g., Perception tasks
- Audience: if a human can do it in less than a second, you can get a lot of data
How do you build a career in machine learning?
- Andrew says he does not have a great answer (me: but he does have a good one)
- Take an ML course
- Attend DL school
- Work on project yourself (Kaggle)
- Mimic PhD student process
- Read a lot of papers (20+)
- Replicate results
- Dirty work
- Downloading/cleaning data
- Re-running someone’s code
- Don’t only do dirty work
- PhD process + Dirty work = reliable
- Keep it up for a year
- Competency
AI is the new electricity (1:18:00)
- Transforms industry after industry
- Get in on the ground floor
- NOTE: this is the title of his follow-up talk, which has a video link at the end of the one above.