The Lousy Linguist: September 2017

Thursday, September 14, 2017

Nuts and Bolts of Applying Deep Learning (Andrew Ng)

I recently watched Andrew Ng's excellent 2016 lecture, Nuts and Bolts of Applying Deep Learning, and took notes. I'm posting them as a resource for anyone who wants to watch the video.

I broke it into the following sections:

End-to-End DL for rich output
Buckets of DL
Bias and variance
Applied machine learning work flow
New Era of ML
Build a Unified Data Warehouse
The 70/30 Split Revisited
Comparing to Human Level Performance
How do you define human level performance?
How do you build a career in machine learning?
AI is the new electricity

Intro

End to end DL – work flow
Bias and variance have changed in the era of deep learning
DL has been around for decades, so why does it work well now?

Scale of data and computation
Two teams

AI Teams
Systems team
Sit together
Difficult for any one human to be sufficiently expert in multiple fields

End-to-End DL for rich output

From first three buckets below
Traditional ML models output real numbers
End-to-end DL can output more complex things than numbers

Sentence captions for images
Speech-to-text
Machine translation
Synthesize new images (13:00)

End-to-End DL not the solution to everything.

End-to-end = having just a DL model between input and output
Rules for when to use (13:35)

Old way: audio ------> phonemes --> transcript
New DL way: audio -----------------> transcript

Makes for great PR, but only works sometimes (15:31)
Achilles heel – need lots of labeled data
Maybe phonemes are just a fantasy of linguists (15:48)
Advantage of old non-end-to-end architecture is it allows you to manually add more information to the processing (18:16)
Also, for self-driving cars, no one has enough data (right now) to make end-to-end work (20:42)

Common problem – after first round of dev, ML not working that well, what do you do next?

Collect more data
Train longer
Different architecture (e.g., switch to NNs)
Regularization
Bigger model
More GPUs

Skill in ML engineer is knowing how to make these decisions (22:33)

Buckets of DL

General models

Densely connected layers – FC
Sequence models – 1D (RNN, LSTM, GRU, attention)
Image models – 2D, 3D (Convo nets)
Other – unsupervised, reinforcement

First three buckets driving market advances
But "Other" bucket is future of AI

Bias and variance – evolving

Scenario: build human level speech rec system

Measure human level error – 1%
Training set error – 5%
Dev set – 6%

Bias = difference between human error level and your system’s training error (here 1% vs. 5%)
TIP: For bias problems try training a bigger model (25:21)
Variance (overfitting): if Human 1%, Training 2%, Dev 6%
TIP: for variance, try adding regularization, early stopping, best bet = more data
Both high bias and high variance: if Human 1%, Training 5%, Dev 10%
“sucks for you” (direct quote 26:30)
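
The three scenarios above can be sketched as a small diagnostic. This is a minimal sketch, not code from the lecture: the function name `diagnose` and the 2-point cutoff for calling a gap "high" are assumptions chosen only so that the examples in these notes come out as labeled.

```python
def diagnose(human_err, train_err, dev_err, threshold=2.0):
    """Label a result as a bias and/or variance problem.

    All errors are percentages. `threshold` is an assumed cutoff
    (in percentage points) for when a gap counts as "high".
    """
    bias_gap = train_err - human_err      # human level -> training set
    variance_gap = dev_err - train_err    # training set -> dev set
    problems = []
    if bias_gap >= threshold:
        problems.append("high bias")
    if variance_gap >= threshold:
        problems.append("high variance")
    return bias_gap, variance_gap, problems


print(diagnose(1, 5, 6))    # (4, 1, ['high bias'])
print(diagnose(1, 2, 6))    # (1, 4, ['high variance'])
print(diagnose(1, 5, 10))   # (4, 5, ['high bias', 'high variance'])
```

In practice the cutoff depends on the task; the point is only that the two gaps are read separately.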

Applied machine learning work flow

Is your training error high?

Bigger model
Train longer
New architecture
Repeat until doing well on training set

Is dev error high?

Add data
Regularization
New architecture
Repeat until doing well on dev set

Done
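
The workflow above reads naturally as a decision function. A minimal sketch under stated assumptions: `next_actions` and `target_err` are hypothetical names, and "high" is taken to mean "above a target error you pick", which the notes do not define.

```python
def next_actions(train_err, dev_err, target_err):
    """Return the options to try on the next pass through the workflow.

    Training error is checked first; dev error only matters once the
    model is already doing well on the training set.
    """
    if train_err > target_err:
        return ["bigger model", "train longer", "new architecture"]
    if dev_err > target_err:
        return ["add data", "regularization", "new architecture"]
    return ["done"]


print(next_actions(train_err=5.0, dev_err=6.0, target_err=2.0))
print(next_actions(train_err=1.5, dev_err=6.0, target_err=2.0))
print(next_actions(train_err=1.5, dev_err=1.8, target_err=2.0))
```

The ordering matters: adding data or regularization does little while the model still cannot fit its own training set.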

New Era of ML

We now know whatever problem you are facing (high bias or high variance) you have at least one action you can take to correct
No longer a bias/variance trade-off (29:47)
“Dumb” formula of bigger model/more data is easy to implement even for non-experts and is enough to do very well on a lot of problems (31:09)
More data has led to interesting investments

Data synthesis - Growing area
Examples-

OCR at Baidu
Take random image
Random word
Type random word in Microsoft Word
Use random font
You just created training data for OCR
Still takes some human intervention, but lots of progress

Speech recognition

Take clean audio
Add random noise to background for more data
E.g., add car noise
Works remarkably well
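
The noise trick can be sketched in a few lines. This is an illustration, not Baidu's pipeline: the function names, the synthetic tone and Gaussian noise standing in for real recordings, and the choice of signal-to-noise ratios are all assumptions.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def add_noise(clean, noise, snr_db):
    """Mix `noise` into `clean` at the requested signal-to-noise ratio (dB)."""
    # Loop the noise clip if it is shorter than the clean clip.
    noise = [noise[i % len(noise)] for i in range(len(clean))]
    # Scale the noise so that rms(clean) / rms(scaled noise) matches snr_db.
    target_noise_rms = rms(clean) / (10 ** (snr_db / 20))
    scale = target_noise_rms / rms(noise)
    return [c + scale * n for c, n in zip(clean, noise)]


# Stand-ins for real recordings: a 440 Hz tone and Gaussian "car noise".
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
car_noise = [rng.gauss(0, 1) for _ in range(8000)]

# One clean clip becomes several training examples at different noise levels.
augmented = [add_noise(clean, car_noise, snr) for snr in (20, 10, 5)]
```

The transcript label of the clean clip carries over unchanged to every noisy copy, which is what makes the extra data free.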

Take ungrammatical sentences and auto-correct
Easy to create ungrammatical sentences programmatically
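
A minimal sketch of creating ungrammatical sentences programmatically. The three corruption rules (drop articles, swap adjacent words, duplicate a word) and the name `corrupt` are assumptions for illustration; the lecture does not say which rules were used. It assumes a non-empty sentence.

```python
import random

ARTICLES = {"a", "an", "the"}


def corrupt(sentence, rng):
    """Return an ungrammatical variant of `sentence` using one random edit."""
    words = sentence.split()
    edit = rng.choice(["drop_article", "swap_adjacent", "duplicate_word"])
    if edit == "drop_article":
        kept = [w for w in words if w.lower() not in ARTICLES]
        if len(kept) < len(words):
            return " ".join(kept)
        edit = "swap_adjacent"  # no article to drop, fall back to a swap
    if edit == "swap_adjacent" and len(words) >= 2:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
        return " ".join(words)
    # Otherwise duplicate one word ("the the cat").
    i = rng.randrange(len(words))
    return " ".join(words[:i + 1] + words[i:])


rng = random.Random(0)
good = "The cat sat on the mat"
# Each pair is (model input, correction target).
pairs = [(corrupt(good, rng), good) for _ in range(3)]
for bad, target in pairs:
    print(bad, "->", target)
```

The grammatical original serves as the label, so any corpus of clean text becomes training data for the auto-correct task.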

Video games in RL

Data synthesis has a lot of limits (36:24)

Why not take cars from Grand Theft Auto and use that as training data for self-driving cars?
20 cars in video game enough to give “realistic” impression to player
But 20 cars is very impoverished data set for self-driving cars

Build a Unified Data Warehouse

Employees can be possessive of "their" data
Baidu- it’s not your data, it’s company data
Access rights can be a different issue
But warehouse everything together
Kaggle

The 70/30 Split Revisited

In academia, common for test/train to come from same distribution
But more common in industry for test and train to come from different distributions

E.g., speech rec at Baidu

Speech enabled rear view mirror (in China)

50,000 hours of regular speech data

Data not from rear-view mirror interactions though

Collect another 10 hours of rear-view mirror scenario

What do you do with the original 50,000 hours of not-quite right data?

Old method would be to build a different model for each scenario

New era, one model for all data

Bad idea: split the 50,000 hours into training/dev, use the 10 hours as test. DON’T DO THIS.

TIP: Make sure dev and test are from same distro (boosts effectiveness)

Good idea: make the 50,000 hours train, split the 10 hours into dev/test

Dev set = problem specification

Me: "dev set = problem you are trying to solve"

Also, split off just 20 hours from 50,000 to create tiny “dev-train” set

this has same distro as train
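
A minimal sketch of the recommended split, treating each list item as one hour of audio. The function names and the made-up error numbers are assumptions. Reading the gap between the dev-train set and dev as a train/dev mismatch is how the extra set is typically used; these notes do not spell it out.

```python
import random


def split_hours(general, in_domain, train_dev_size, seed=0):
    """Split clips into train / train_dev / dev / test.

    `general` is the large, not-quite-right data; `in_domain` is the
    small set from the target scenario. Both are lists of clip ids.
    """
    rng = random.Random(seed)
    general = general[:]
    in_domain = in_domain[:]
    rng.shuffle(general)
    rng.shuffle(in_domain)
    half = len(in_domain) // 2
    return {
        "train": general[train_dev_size:],
        "train_dev": general[:train_dev_size],  # same distribution as train
        "dev": in_domain[:half],                # same distribution as test
        "test": in_domain[half:],
    }


def error_gaps(human, train, train_dev, dev):
    """Break dev error into bias, variance, and mismatch (percentage points)."""
    return {
        "bias": train - human,
        "variance": train_dev - train,
        "mismatch": dev - train_dev,
    }


general = [f"general_{i}" for i in range(50000)]
mirror = [f"mirror_{i}" for i in range(10)]
sets = split_hours(general, mirror, train_dev_size=20)
print({name: len(clips) for name, clips in sets.items()})

# Made-up error rates, only to show how the gaps are read.
print(error_gaps(human=1, train=2, train_dev=3, dev=8))
```

With these made-up numbers most of the dev error comes from the mismatch, not from bias or variance, so a bigger model or more regularization would not be the first thing to try.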

Mismatched train and dev set is problem that academia doesn’t work on much

some work on domain adaptation, but not much (44:53)

New architecture fix = “hail mary” (48:58)
Takes a long time to really grok bias/variance

People who really understand bias/variance deeply are able to drive rapid progress in machine learning (50:33)

Common Theme – Comparing to Human Level Performance

Common to achieve human level performance, then level off
Why?

Audience: Labels come from humans
Audience: Researchers get satisfied with results (the laziness hypothesis)
Andrew: theoretical limits (aka optimal error rate, Bayes rate)

Some audio so bad, impossible to transcribe (phone call from a rock concert)
Some images so blurry, impossible to interpret

Humans are really good at some things, so once you surpass human level accuracy, there’s not much room left to improve (54:38)

While worse than humans, still ways to improve

Get labels from humans
Error analysis
Estimate bias/variance effects

For tasks that humans are bad at (say 30% error rate), really hard to find guidance on how to improve

How do you define human level performance?

Quiz: Which is the most useful definition? (101:00)

Example: Medical image reading

Typical non-doctor error - 3%
Typical doctor – 1%
Expert doctor – 0.7%
Team of expert doctors – 0.5%

Answer: Team of expert doctors is best because ideally you are using human performance to proxy optimal error rate.

What can AI do? (106:30)

Anything that a typical person can do in less than one second.

E.g., Perception tasks
Audience: if a human can do it in less than a second, you can get a lot of data

How do you build a career in machine learning (111:00)

Andrew says he does not have a great answer (me: but he does have a good one)

Taking a ML course
Attend DL school
Work on project yourself (Kaggle)
Mimic PhD student process

Read a lot of papers (20+)
Replicate results

Dirty work

Downloading/cleaning data
Re-running someone’s code

Don’t only do dirty work
PhD process + Dirty work = reliable

Keep it up for a year
Competency

AI is the new electricity (118:00)

Transforms industry after industry
Get in on the ground floor
NOTE: this is the title of his follow-up talk, which has a video link at the end of the one above.
