Tuesday, December 26, 2017

Putting the Linguistics into Kaggle Competitions

In the spirit of Dr. Emily Bender’s NAACL blog post Putting the Linguistics in Computational Linguistics, I want to apply some of her thoughts to the data from the recently opened Kaggle competition Toxic Comment Classification Challenge.

Dr. Bender's point: not all NAACL papers need to be linguistically informative—but they should all be linguistically informed.  

Toxic Comment Challenge Correlate: not all Kaggle submissions need to be linguistically informative—but they should all be linguistically informed.

First let me say I'm really excited about this competition because A) it uses real language data, B) it's not obvious what techniques will work well (it's a novel/interesting problem), and C) it's a socially important task to get right. Kudos to Kaggle for supporting this.  


Dr. Bender identifies four steps for putting linguistics into CL:

Step 1: Know Your Data
Step 2: Describe Your Data Honestly
Step 3: Focus on Linguistic Structure, At Least Some of the Time
Step 4: Do Error Analysis
Since this is preliminary, I’ll concentrate on just steps 1 & 3.

UPDATE 12/28/2017 #2: Robert Munro has written up an excellent analysis of this data, A Step in the Right Direction for NLP. He addresses many of the points I make below. He also makes some solid suggestions for how to account for bias. Well worth a read.

UPDATE 12/28/2017 #1: My original analysis of their IAA process was mistaken. I have corrected it. Also added another comment under "Some Thoughts" at the end.

Step 1: Know Your Data

The data download
  • “large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are: toxic, severe_toxic, obscene, threat, insult, identity_hate” (note: there are a large number of comments with no ratings; presumably these are neutral)
  • Actual download for Training set is a CSV of ID, comment, rating (1 or 0) for each label. 
  • 100k talk page diffs
  • Comments come from 2001–2015
  • There are 10 judgements per diff (final rating is the average)
  • Here’s the original paper discussing this work Ex Machina: Personal Attacks Seen at Scale by Wulczyn, Thain, and Dixon.
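For anyone exploring the download, the six labels can be tallied straight from the CSV. A minimal sketch: the column names follow the competition's train.csv as I understand it, and the three rows here are invented stand-ins, not real data.

```python
import csv
import io

# Toy stand-in for the competition's train.csv. Assumed columns: id,
# comment_text, plus one 0/1 column per label (these rows are invented).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

toy_csv = io.StringIO(
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    '1,"Thanks for the edit!",0,0,0,0,0,0\n'
    '2,"You are an idiot.",1,0,0,0,1,0\n'
    '3,"I will find you.",1,0,0,1,0,0\n'
)

def label_stats(fileobj):
    """Return per-label prevalence and the fraction of all-zero ('neutral') rows."""
    rows = list(csv.DictReader(fileobj))
    n = len(rows)
    prevalence = {lab: sum(int(r[lab]) for r in rows) / n for lab in LABELS}
    neutral = sum(all(int(r[lab]) == 0 for lab in LABELS) for r in rows) / n
    return prevalence, neutral

prevalence, neutral_rate = label_stats(toy_csv)
```

Running this over the real file would confirm the observation above that a large share of comments carry no label at all.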
Who created the data?
  • Wikipedia editors
  • Largely young white men (lots of research on this)
    • 2011 WMF survey, 90% of Wikipedians are men, 9% are female, and 1% are transgender/transsexual
    • 53% are under 30 (see here for more info)
Who rated the data?
  • 3591 Crowdflower participants (you have to dig into the paper and associated repos to discover this)
  • This is a huge number of raters.
  • Looking at the demographics spreadsheet available here, most are not native English speakers. I'm surprised this was allowed.
  • Only 18% were native speakers of English
  • Only three gender options were offered (male, female, other). Only one user chose other
  • 54% are under 30
  • 65% chose male as their gender response
  • 68% chose bachelors, masters, doctorate, or professional as level education
  • I'm admittedly confused by how they obtained their final ratings because the paper discusses a slightly different task (not involving the six labels specifically), and the authors' wiki discussion of annotation and data release (here) also discusses what look like slightly different tasks. Were these six labels aggregated post hoc over different tasks using the same data? I don't see a clear example of asking human raters to give judgments for these six specific labels. I'll keep digging.
  • In general, I find the labels problematic (Dixon admitted to some of this in a discussion here).
Language varieties (in comments data)?
  • Lightly edited English, mostly North American (e.g., one comment said they "corrected" the spelling of “recognised” to “recognized”)
  • Some commenters self-identify as non-native speakers
  • Lots of spelling typos
  • Mixed punctuation usage
  • Some meta tags in comments
  • Inconsistent use of spaces before/after commas
  • Inconsistent use of caps (inconsistent for both all caps and camel case)
  • Some examples of ASCII art
  • Excessive apostrophe usage (one comment has several hundred apostrophes)
  • Comment length varies considerably
    • From five words, to hundreds of words
    • This might have a spurious effect on the human raters
    • Humans get tired. When they encounter a long comment, they may be tempted to rely solely on the first few sentences, or possibly the mere presence of some key words.
  • Some character encoding issues: â€
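That “â€” debris is classic mojibake: UTF-8 bytes re-decoded under Windows-1252. A quick sketch of how it arises, and one way to undo it when the round-trip is clean (no bytes lost):

```python
# A curly apostrophe (U+2019) encoded as UTF-8 and then wrongly decoded
# as Windows-1252 produces the familiar "â€" garbage.
original = "don’t"
garbled = original.encode("utf-8").decode("cp1252")

# If no bytes were dropped along the way, reversing the mis-decode
# recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
```

In practice some of these strings are lossy (bytes that cp1252 can't represent get mangled), so a repair like this won't always succeed on the actual comments.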

Step 3: 
Focus on Linguistic Structure, At Least Some of the Time

  • I did a quick skim of a few dozen of the neutral comments (those with 0s for all labels) and it looked like they had no swear words.
  • I fear this will lead a model to over-learn that the mere presence of a swear word means it should get a label.
  • See the excellent blog Strong Language for reasons this is not true.
  • I would throw in some extra training data that includes neutral comments with swear words.
  • Perhaps due to non-native speakers or typos, the wrong morphological variants can be found for some words (e.g., "I am not the only one who finds the article too ridiculous and trivia to be included.")
  • Lemmatizing can help (but it can also hurt)
  • Some comments lack sentence ending punctuation, leading to run ons. (e.g., "It was very constructive you are just very very stupid.")
  • Large number of grammatical errors
  • If any parsing is done (shallow or full), there could be some issues
  • A Twitter trained parser might be useful for the short comments, but not the long ones.
  • All comments are decontextualized
  • Missing the comments before reduces information (many comments are responses to something, but we don't get the something) 
  • Also missing the page topic that the comment is referring to
  • Hence, we're missing discourse and pragmatic links
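One cheap way to probe the swear-word worry above: compare lexicon hits in labeled vs. neutral comments. Everything below is a placeholder (the mini-lexicon and the comments are invented); a real check would run over the actual training CSV with a substantial profanity lexicon.

```python
# Hypothetical mini-lexicon; a real check needs a proper profanity list.
SWEAR_LEXICON = {"damn", "hell", "crap"}

def has_swear(comment, lexicon=SWEAR_LEXICON):
    """True if any (lightly normalized) token of the comment is in the lexicon."""
    tokens = {t.strip(".,!?;:'\"").lower() for t in comment.split()}
    return bool(tokens & lexicon)

# (comment, any_label) pairs; invented examples.
comments = [
    ("Damn, that was a hell of a good edit!", 0),  # neutral despite swearing
    ("You are a damn idiot.", 1),
    ("Thanks for fixing the citation.", 0),
]

neutral = [c for c, y in comments if y == 0]
swear_rate_in_neutral = sum(has_swear(c) for c in neutral) / len(neutral)
```

If the real rate comes out near zero, that's evidence the model will indeed over-learn "swear word = toxic," and the neutral-but-sweary training data suggested above becomes important.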

Some Thoughts
The data is not simply toxic comments; it’s the kind of toxic comments that young men make to other young men. And the labels reflect what young, educated, non-native-English-speaking men think counts as toxic.

My gut feeling is that the generalizability of any model trained on this data will be limited. 

My main concern though is the lack of linguistic context (under Discourse). Exactly what counts as "toxic" under these de-contextual circumstances? Would the rating be the same if the context were present? I don't know. My hunch is that at least some of these ratings would change.

Wednesday, October 18, 2017

A linguist asks some questions about word vectors

I have at best a passing familiarity with word vectors, strictly from a 30,000-foot view. I've never directly used them outside a handful of toy tutorials (though they have been embedded in a variety of tools I use professionally). There is something very linguistic-y about word vectors, so I'm generally happy with their use. The basic idea goes back a ways, but was most famously articulated in J. R. Firth's pithy phrase "You shall know a word by the company it keeps" (c. 1957).

This is a set of questions I had when reading Chris McCormick's tutorial Word2Vec Tutorial - The Skip-Gram Model. My ideal audience is NLPers who have a good understanding of the SOTA of word vectors.
Quick review of the tutorial:
  • Key Take Away: Word Vectors = the output probabilities that relate how likely it is to find a vocabulary word nearby a given input word.
  • Methods: ngrams for training, order doesn't matter
  • Insight: If two words have similar contexts, then the network is motivated to learn similar word vectors for the two words.
  • Why sentences?
    • When choosing word window size, word vectors respect sentence boundaries. Why?
      • By doing this, vectors are modeling edited, written language.
      • Vectors are assuming the semantics of sentences are coherent in some way.
      • Vectors are, in essence, relying on sentences to be semantically coherent. As long as they are, the method works, but when they aren’t, how does this method break down?
    • This is an assumption often broken in spoken language with disfluencies, topic shifts, interruptions, etc.
    • How else might spoken language models differ?
    • How might methods change to account for this?
    • Do words from previous sentences have semantic relationships to the words in following sentences? (Yes. Because discourse.)
    • Why miss out on that information?
  • Why the middle?
    • Vectors use words “in the middle of a sentence”? Why?
    • Do words that tend to occur in the middle of sentences differ in meaningful ways from those that do not? (Possibly. E.g., verbs and prepositions rarely begin or end sentences. Information structure affects internal ordering of constituents in interesting ways.)
    • Did they look at the rate of occurrences of each unique lexical entry in-the-middle and near the peripheries?
  • Synthetic Languages:
    • How well would these word vectors work for a synthetic language like Mohawk?
    • What pre-processing steps might need to be added/modified?
    • Would those modifications be enough to get similar/useful results?
  • Antonyms: Can the same method learn the difference between “ant” and “ants”?
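For concreteness, the pair-extraction step the tutorial describes can be sketched in a few lines. This is only the (center, context) enumeration with a window that respects sentence boundaries (the assumption questioned above), not the neural training itself:

```python
def skipgram_pairs(sentences, window=2):
    """Yield (center, context) pairs; windows never cross sentence boundaries."""
    pairs = []
    for sent in sentences:
        for i, center in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, sent[j]))
    return pairs

# Toy corpus: two pre-tokenized sentences.
sents = [["you", "shall", "know", "a", "word"], ["by", "its", "company"]]
pairs = skipgram_pairs(sents, window=1)
```

Note that "word" and "by" are adjacent in the running text but never form a pair, because the sentence boundary cuts the window; that is exactly the discourse information the questions above worry about losing.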
This was a fun read.

The original tutorial:
McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com

Thursday, September 14, 2017

Nuts and Bolts of Applying Deep Learning (Andrew Ng)

I recently watched Andrew Ng's excellent lecture from 2016 Nuts and Bolts of Applying Deep Learning and took notes. I post them as a helpful resource for anyone who wants to watch the video.

I broke it into the following sections
  1. End-to-End DL for rich output
  2. Buckets of DL
  3. Bias and variance
  4. Applied machine learning work flow
  5. New Era of ML
  6. Build a Unified Data Warehouse
  7. The 70/30 Split Revisited
  8. Comparing to Human Level Performance
  9. How do you define human level performance?
  10. How do you build a career in machine learning?
  11. AI is the new electricity
  1. End to end DL – work flow
  2. Bias and variance has changed in era of deep learning
  3. DL has been around for decades, so why does it work well now?
    • Scale of data and computation
    • Two teams
      • AI Teams
      • Systems team
      • Sit together
      • Difficult for any one human to be sufficiently expert in multiple fields
End-to-End DL for rich output
    • From first three buckets below
    • Traditional ML models output real numbers
    • End-to-end DL can output more complex things than numbers
      • Sentence captions for images
      • Speech-to-text
      • Machine translation
      • Synthesize new images (13:00)
    • End-to-End DL not the solution to everything.
      • End-to-end = having just a DL between input and output
      • Rules for when to use (13:35)
        • Old way: audio ------> phonemes --> transcript
        • New DL way: audio -----------------> transcript
      • Makes for great PR, but only works some times (15:31)
      • Achilles heel – need lots of labeled data
      • Maybe phonemes are just a fantasy of linguists (15:48)
      • Advantage of old non-end-to-end architecture is it allows you to manually add more information to the processing (18:16)
      • Also, for self-driving cars, no one has enough data (right now) to make end-to-end work (20:42)
    • Common problem – after first round of dev, ML not working that well, what do you do next?
      • Collect more data
      • Train longer
      • Different architecture (e.g., switch to NNs)
      • Regularization
      • Bigger model
      • More GPUs
    • Skill in ML engineer is knowing how to make these decisions (22:33)
Buckets of DL
  1. General models
    • Densely connected layers – FC
    • Sequence models – 1D (RNN, LSTM, GRU, attention)
    • Image models – 2D, 3D (ConvNets)
    • Other – unsupervised, reinforcement
  2. First three buckets driving market advances
  3. But "Other" bucket is future of AI
Bias and variance – evolving
  1. Scenario: build human level speech rec system
    • Measure human level error – 1%
    • Training set error – 5%
    • Dev set – 6%
  2. Bias = difference between human error level and your system’s
  3. TIP: For bias problems try training a bigger model (25:21)
  4. Variance (overfitting): if Human 1%, Training 2%, Dev 6%
  5. TIP: for variance, try adding regularization, early stopping, best bet = more data
  6. Both high bias and high variance: if Human 1%, Training 5%, Dev 10%
  7. “sucks for you” (direct quote 26:30)
Applied machine learning work flow
  1. Is your training error high
    • Yes
      • Bigger model
      • Train longer
      • New architecture
      • Repeat until doing well on training set
  2. Is dev error high?
    • Yes
      • Add data
      • Regularization
      • New architecture
      • Repeat until doing well on dev set
  3. Done
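The two diagnostics above can be sketched as a toy decision helper. The 2-point gap threshold is my own illustrative choice, not from the talk:

```python
def next_step(human_err, train_err, dev_err, gap=0.02):
    """Suggest an action from the bias/variance diagnostic in Ng's workflow."""
    if train_err - human_err > gap:   # training error high vs. humans -> bias
        return "bigger model / train longer / new architecture"
    if dev_err - train_err > gap:     # dev error high vs. training -> variance
        return "more data / regularization / new architecture"
    return "done"

# Ng's speech example: human 1%, training 5%, dev 6% -> a bias problem.
action = next_step(0.01, 0.05, 0.06)
```

The point of the flowchart is that each diagnosis comes with at least one concrete action, which is what the "New Era" section below picks up.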
New Era of ML
  1. We now know whatever problem you are facing (high bias or high variance) you have at least one action you can take to correct
  2. No longer a bias/variance trade-off (29:47)
  3. “Dumb” formula of bigger model/more data is easy to implement even for non-experts and is enough to do very well on a lot of problems (31:09)
  4. More data has led to interesting investments
    • Data synthesis - Growing area
    • Examples-
      • OCR at Baidu
      • Take random image
      • Random word
      • Type random word in Microsoft Word
      • Use random font
      • You just created training data for OCR
      • Still takes some human intervention, but lots of progress
    • Speech recognition
      • Take clean audio
      • Add random noise to background for more data
      • E.g., add car noise
      • Works remarkably well
    • NLP
      • Take ungrammatical sentences and auto-correct
      • Easy to create ungrammatical sentences programmatically
    • Video games in RL
  5. Data synthesis has a lot of limits (36:24)
    • Why not take cars from Grand Theft Auto and use that as training data for self-driving cars
    • 20 cars in video game enough to give “realistic” impression to player
    • But 20 cars is very impoverished data set for self-driving cars
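The speech-synthesis recipe above (clean audio plus random background noise) can be sketched as a toy function. Real pipelines operate on sampled waveform arrays; plain floats are enough to show the idea:

```python
import random

def mix_noise(clean, noise, noise_level=0.1, seed=0):
    """Overlay a random slice of `noise` onto `clean` at reduced volume."""
    rng = random.Random(seed)
    start = rng.randrange(len(noise) - len(clean) + 1)
    return [c + noise_level * n
            for c, n in zip(clean, noise[start:start + len(clean)])]

clean = [0.0, 0.5, -0.5, 0.25]   # stand-in for a clean utterance
noise = [0.1] * 100              # stand-in for, e.g., a car-noise loop
augmented = mix_noise(clean, noise)
```

Each (clean clip, noise slice, level) combination yields a new training example, which is why this multiplies the data so cheaply.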
Build a Unified Data Warehouse
  1. Employees can be possessive of "their" data
  2. Baidu: it’s not your data, it’s company data
  3. Access rights can be a different issue
  4. But warehouse everything together
  5. Kaggle
The 70/30 Split Revisited
  1. In academia, common for test/train to come from same distribution
  2. But more common in industry for test and train to come from different distributions
    • E.g., speech rec at Baidu
      • Speech enabled rear view mirror (in China)
      • 50,000 hours of regular speech data
      • Data not from rear-view mirror interactions though
      • Collect another 10 hours of rear-view mirror scenario
    • What do you do with the original 50,000 hours of not-quite right data?
      • Old method would be to build a different model for each scenario
      • New era, one model for all data
      • Bad idea: split the 50,000 hours into training/dev and use the 10 hours as test. DON’T DO THIS.
      • TIP: Make sure dev and test are from same distro (boosts effectiveness)
      • Good idea: make the 50,000 hours train, split the 10 hours into dev/test
    • Dev set = problem specification
      • Me: "dev set = problem you are trying to solve"
    • Also, split off just 20 hours from 50,000 to create tiny “dev-train” set
      • this has same distro as train
  3. Mismatched train and dev sets are a problem that academia doesn’t work on much
    • some work on domain adaptation, but not much (44:53)
  4. New architecture fix = “hail mary” (48:58)
  5. Takes a long time to really grok bias/variance
    • People who really understand bias/variance deeply are able to drive rapid progress in machine learning (50:33)
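The 50,000-hour/10-hour advice reduces to a simple splitting recipe. A sketch (the sizes, fraction, and dev-train size are illustrative):

```python
def make_splits(big_corpus, in_domain, dev_frac=0.5, devtrain_size=2):
    """Split per Ng's advice: train on the large off-distribution corpus,
    carve dev/test (same distribution!) from the small in-domain set, and
    peel a tiny 'dev-train' slice off train to diagnose distribution mismatch."""
    k = int(len(in_domain) * dev_frac)
    return {
        "train": big_corpus[devtrain_size:],
        "dev_train": big_corpus[:devtrain_size],  # same distro as train
        "dev": in_domain[:k],                     # same distro as test
        "test": in_domain[k:],
    }

# Stand-ins: 100 off-distribution examples, 10 in-domain ones.
splits = make_splits(list(range(100)), list(range(100, 110)))
```

The key property is that dev and test come from the same (target) distribution, while dev-train shares the training distribution; comparing performance on the two dev sets isolates the mismatch.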
Common Theme – Comparing to Human Level Performance
  1. Common to achieve human level performance, then level off
  2. Why?
    • Audience: Labels come from humans
    • Audience: Researchers get satisfied with results (the laziness hypothesis)
    • Andrew: theoretical limits (aka optimal error rate, Bayes rate)
      • Some audio so bad, impossible to transcribe (phone call from a rock concert)
      • Some images so blurry, impossible to interpret
    • Humans are really good at some things, so once you surpass human level accuracy, there’s not much room left to improve (54:38)
  3. While worse than humans, still ways to improve
    • Get labels from humans
    • Error analysis
    • Estimate bias/variance effects
  4. For tasks that humans are bad at (say 30% error rate), really hard to find guidance on how to improve
How do you define human level performance?
  1. Quiz: Which is the most useful definition? (101:00)
    • Example: Medical image reading
      1. Typical non-doctor error - 3%
      2. Typical doctor – 1%
      3. Expert doctor – 0.7%
      4. Team of expert doctors – 0.5%
    • Answer: Team of expert doctors is best because ideally you are using human performance to proxy optimal error rate.
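In code, the quiz answer amounts to taking the best available human error as the proxy for the optimal (Bayes) rate:

```python
# Error rates from the medical-imaging quiz above.
human_errors = {
    "typical non-doctor": 0.03,
    "typical doctor": 0.01,
    "expert doctor": 0.007,
    "team of expert doctors": 0.005,
}

# Best human performance is the closest available proxy for optimal error.
bayes_proxy = min(human_errors.values())
best_group = min(human_errors, key=human_errors.get)
```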
What can AI do? (106:30)
  1. Anything that a typical person can do in less than one second.
    • E.g., Perception tasks
    • Audience: if a human can do it in less than a second, you can get a lot of data
How do you build a career in machine learning (111:00)
  1. Andrew says he does not have a great answer (me: but he does have a good one)
    • Taking a ML course
    • Attend DL school
    • Work on project yourself (Kaggle)
    • Mimic PhD student process
      • Read a lot of papers (20+)
      • Replicate results
    • Dirty work
      • Downloading/cleaning data
      • Re-running someone’s code
    • Don’t only do dirty work
    • PhD process + Dirty work = reliable
      • Keep it up for a year
      • Competency
AI is the new electricity (118:00)
  1. Transforms industry after industry
  2. Get in on the ground floor
  3. NOTE: this is the title of his follow-up talk, which has a video link at the end of the one above.

Saturday, April 8, 2017

NLPers: How would you characterize your linguistics background?

That was the poll question my hero Professor Emily Bender posed on Twitter March 30th. 573 tweets later, a truly epic thread had been created, pitting some of the most influential NLPers in the world head to head in a contest of wit and debate that is truly what the field of NLP needed. Unfortunately, Twitter's app is hapless at displaying the thread as a single conversation. But student Sebastian Meilke used a Chrome extension called Treeverse to put the whole thread into a single, readable format, complete with sections!

If you are at all interested in NLP or linguistics, this is a must read: NLP/CL Twitter Megathread.

I would be remiss if I didn't note my own small role in this. Emily's poll was sparked by my own Tweet where I said I was glad the Linguistics Society of America is starting their own Society for Computational Linguistics because "the existing CL and NLP communities have gotten farther and farther off the linguistics path"

Happy reading

Saturday, March 25, 2017

Small Towns Do Tech Startups Too

I'm participating in the Startup Weekend event in Chico CA sponsored by Chicostart. Great first night. The 60 second pitches were awesome. So much variety and interesting problems to solve. I pitched my own idea about using Watson to identify fake news, but I only got two blue stickers, boo hoo. I am very impressed with the Build.com space that hosted us. They are a multi-use space that includes a lot of tech startups. Kudos for putting such a great state-of-the-art space in a small college town.

I was particularly impressed with the range of pitchers: Millennials, students, professionals, middle-aged folks, and at least one senior. Not bad for a small town with big dreams.

FYI - I'll be helping build a startup this weekend focused on smart housing devices. Will be a lot of fun.

Friday, March 24, 2017

Chico Startup Weekend

I'm looking forward to participating in Chico Startup Weekend, sponsored by Chicostart. I can't say what the weekend holds, but it's nice to know there's this kind of energy and drive in small town America. Startups ain't just for Austin and Mountain View.

Monday, March 20, 2017

Using IBM Watson Knowledge Studio to Train Machine Learning Models

Using the Free Trial version of IBM's Watson Knowledge Studio, I just annotated a text and created a machine learning model in about 3 hours without writing a single line of code. The mantra of WKS is that you don't program Watson, you teach Watson.

For demo purposes I chose to identify personal relationships in Shirley Jackson's 1948 short story The Lottery. This is a haunting story about a small village and its mindless adherence to an old and tragic tradition. I chose it because 1) it's short and 2) it has clear personal relationships like brothers, sisters, mothers, and fathers. I added a few other relations like AGENT_OF (which amounts to subjects of verbs) and X_INSIDE_Y for things like pieces of paper inside a box.
Caveat: This short story is really short: 3300 words. So I had no high hopes of getting a good model out of this. I just wanted to go through an entire machine learning work flow from gathering text data to outputting a complete model without writing a single line of code. And that's just what I did.

I spent about 30 minutes prepping the data. E.g., I broke it into 20 small snippets (to facilitate a test/train split later), and I also edited some quotation issues, spelling, etc.
It uploaded into WKS in seconds (by simply dragging and dropping the files into the browser tool). I then created a Type System to include entity types such as these:

And relation types such as these:
I then annotated the 20 short documents in less than two hours (as is so often the case, I re-designed my type system several times along the way; luckily WKS allows me to do this fluidly without having to re-annotate).

Here's a shot of my entity annotations:

Here's a shot of my relation annotations:

I then used these manually annotated documents as ground truth to teach a machine learning model to recognize the relationships automatically using a set of linguistic features (character and word ngrams, parts-of-speech, syntactic parses, etc). I accepted the WKS suggested split of documents as 70/23/2:
I clicked "Run" and waited:

The model was trained and evaluated in about ten minutes. Here's how it performed on entity types:

And here's how it performed on relation types:

This is actually not bad given how sparse the data is. I mean, an F1 of .33 on X_INSIDE_Y from only 29 training examples on a first pass. I'll take it, especially since that one is not necessarily obvious from the text. Here's one example of the X_INSIDE_Y relation:

So I was able to train a model with 11 entity types and 81 relation types on a small corpus in less than three hours start to finish without writing a single line of code. I did not program Watson. I taught it.
