Monday, February 17, 2020

TV Linguistics - and the fictional Princeton Linguistics department

 [reposted from 11/20/10]

I spent Thursday night on a plane so I missed 30 Rock and the most linguistics oriented sit-com episode since ... since ... um ... okay, the most linguistics oriented sitcom episode EVER! But thanks to the innerwebz, I have caught up on my TV addiction.

The set-up has Jack Donaghy being the voice of, (I BEG you to sign up, PLEASE!!!!)  a website that demonstrates the correct pronunciation of all English words. Apparently, when Jack was a poor undergrad at Princeton, he was hired by the "Linguistics Department" to pronounce every word in an English dictionary to preserve the correct pronunciation for generations to come. But they sold his readings, and hence his voice is now the voice of (as well as the first perfect microwave...).

Here is as faithful a transcript of the critical dialogue as I can muster:

Jack: Those bastards!
Liz: Who bastards?
Jack: Part of my Princeton scholarship included work for the Linguistics department. They wanted me to record every word in the dictionary to preserve the perfect American accent in case of nuclear war. Well, the cold war ended, and Princeton began selling the recordings.
Liz: So people can just buy your voice?
Jack: Ohhhh, the things it's been dragged into. Thomas the Tank Engine; Wu-Tang songs...

This must have been the glory days before the hippies took over and started "protecting" undergrads from "exploitation." Whatever...

In any case, it's understandable that this trivial tid-bit of academic minutiae blew right by most people, but it is a fact of the world we live in that Princeton University does not have a linguistics department per se. They do offer an Undergraduate Program in Linguistics in which students can "pursue a Certificate in Linguistics," but this is not an official department as far as I understand it. Jack, if he is the same age as the actor Alec Baldwin, would have been at Princeton in the late 1970s. Maybe they had a full-fledged department back then, I honestly don't know.

You can watch the episode College at NBC, or wherever else you prefer. BTW, there's an awesome ode to color perception conundrums at the end as well. It's all kinda linguisticee/cog sciencee (I never know how to add the -ee morpheme?).

Random after-point: Near the end of Thursday's episode of Community, Dean Pelton actually utilized the Shakespearean subjunctive construction Would that X were Y... He says "Would that this hoodie were a time hoodie" around the 19:20 mark (see Hamlet, would it were not so, you are my mother). Just thought that was kinda awesome.

And not for nuthin', but if you haven't seen Tina Fey's Mark Twain Prize speech, it's a gem: HERE.

TV anachronisms - The birth of a metaphor

[reposted from 4/12/13]

A Twitter exchange with Ben Zimmer over the metaphorical use of the phrase "pause button" in the new TV show The Americans (set in 1981) led me to think about how metaphors begin their lives. I didn't watch the episode in question, but apparently several viewers noticed that the show used the phrase "pause button" metaphorically to mean something like to put a romantic relationship on hold.

Ben tweeted this fact as a likely anachronism, presumably because the technology of pause buttons was too young in 1981 to have likely jumped to metaphorical use by then. I was not the only one who immediately took to Google Ngrams to start testing this hypothesis. In the end, Tweeter @Manganpaper found a good example from 1981 from some kind of self-help book.

But what interests me is an example I found from 1987:
Consumers have pushed the "pause" button on sales of video-cassette recorders, for years in the fast-forward mode.
Ben reluctantly conceded the example:

I'd have to review my historical linguistics books, but I don't think words necessarily shift their meanings radically all at once. I believe they can take on characteristics of associated meanings slowly, thus widening or narrowing their meaning as their linguistic environment unfolds. Eventually, a word can come to mean something quite radically different from what it originally meant. I see no reason that the life of a metaphor could not follow a similar trajectory. Ben objected to the fact that the 1987 use of "pause button" I linked to was semantically linked to the literal use of actual pause buttons because it dealt with the conceptual space of VCR sales. But my hunch is that this is how many metaphors start their lives, making small conceptual leaps, not big ones. I could be wrong though. The sad truth is that finding good empirical data for the life span of metaphors is extremely difficult. The fact is that even with the awe-inspiring large natural language data sets currently available in many languages, studying a linguistically high-level data type like metaphor remains out of reach of most NLP techniques.

But this is why our NLP blood boils. There are miles to go before we sleep...

Tuesday, December 26, 2017

Putting the Linguistics into Kaggle Competitions

In the spirit of Dr. Emily Bender’s NAACL blog post Putting the Linguistics in Computational Linguistics, I want to apply some of her thoughts to the data from the recently opened Kaggle competition Toxic Comment Classification Challenge.

Dr. Bender's point: not all NAACL papers need to be linguistically informative—but they should all be linguistically informed.  

Toxic Comment Challenge Correlate: not all Kaggle submissions need to be linguistically informative—but they should all be linguistically informed

First let me say I'm really excited about this competition because A) it uses real language data, B) it's not obvious what techniques will work well (it's a novel/interesting problem), and C) it's a socially important task to get right. Kudos to Kaggle for supporting this.  


Dr. Bender identifies four steps to put linguistics into CL

Step 1: Know Your Data
Step 2: Describe Your Data Honestly
Step 3: Focus on Linguistic Structure, At Least Some of the Time
Step 4: Do Error Analysis
Since this is preliminary, I’ll concentrate on just steps 1 & 3.

UPDATE 12/28/2017 #2: Robert Munro has written up an excellent analysis of this data, A Step in the Right Direction for NLP. He addresses many of the points I make below. He also makes some solid suggestions for how to account for bias. Well worth a read.

UPDATE 12/28/2017 #1: My original analysis of their IAA process was mistaken. I have corrected it. Also added another comment under "Some Thoughts" at the end.

Step 1: Know Your Data

The data download
  • “large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are: toxic, severe_toxic, obscene, threat, insult, identity_hate” (note: there are a large number of comments with no ratings; presumably these are neutral)
  • Actual download for Training set is a CSV of ID, comment, rating (1 or 0) for each label. 
  • 100k talk page diffs
  • Comments come from 2001–2015
  • There are 10 judgements per diff (final rating is the average)
  • Here’s the original paper discussing this work Ex Machina: Personal Attacks Seen at Scale by Wulczyn, Thain, and Dixon.
Who created the data?
  • Wikipedia editors
  • Largely young white men (lots of research on this)
    • 2011 WMF survey, 90% of Wikipedians are men, 9% are female, and 1% are transgender/transsexual
    • 53% are under 30 (see here for more info)
Who rated the data?
  • 3591 Crowdflower participants (you have to dig into the paper and associated repos to discover this)
  • This is a huge number of raters.
  • Looking at the demographics spreadsheet available here, most are not native English speakers. I'm surprised this was allowed
  • Only 18% were native speakers of English
  • Only three gender options were offered (male, female, other). Only one user chose other
  • 54% are under 30
  • 65% chose male as their gender response
  • 68% chose bachelors, masters, doctorate, or professional as their level of education
  • I'm admittedly confused by how they obtained their final ratings because the paper discusses a slightly different task (not involving the six labels specifically), and the authors' wiki discussion of annotation and data release (here) also discusses what look like slightly different tasks. Were these six labels aggregated post hoc over different tasks using the same data? I don't see a clear example of asking human raters to give judgments for these six specific labels. I'll keep digging.
  • In general, I find the labels problematic (Dixon admitted to some of this in a discussion here).
Language varieties (in comments data)?
  • Lightly edited English, mostly North American (e.g., one comment said they "corrected" the spelling of “recognised” to “recognized”)
  • Some commenters self-identify as non-native speakers
  • Lots of spelling typos
  • Mixed punctuation usage
  • Some meta tags in comments
  • Inconsistent use of spaces before/after commas
  • Inconsistent use of caps (inconsistent for both all caps and camel case)
  • Some examples of ASCII art
  • Excessive apostrophe usage (one comment has several hundred apostrophes)
  • Length of comments varies considerably
    • From five words, to hundreds of words
    • This might have a spurious effect on the human raters
    • Humans get tired. When they encounter a long comment, they may be tempted to rely solely on the first few sentences, or possibly the mere presence of some key words.
  • Some character encoding issues: â€
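The "â€" debris is the usual signature of text that was written as UTF-8 and then read back as Windows-1252. Whether that is exactly what happened to this dataset is an assumption, but the short sketch below shows how the pattern arises with a curly apostrophe and how it can sometimes be reversed.

```python
# The right single quotation mark (U+2019) is three bytes in UTF-8.
original = "don\u2019t"

# Decoding those UTF-8 bytes as Windows-1252 produces the familiar debris.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Reversing the two steps restores the text, provided no bytes were lost.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

The round trip only works when every byte survived. Some bytes have no Windows-1252 character at all, so real comments may need lossier cleanup.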

Step 3: Focus on Linguistic Structure, At Least Some of the Time

  • I did a quick skim of a few dozen of the neutral comments (those with 0s for all labels) and it looked like they had no swear words.
  • I fear this will lead a model to over-learn that the mere presence of a swear word means it should get a label.
  • See the excellent blog Strong Language for reasons this is not true.
  • I would throw in some extra training data that includes neutral comments with swear words.
  • Perhaps due to non-native speakers or typos, the wrong morphological variants can be found for some words (e.g., "I am not the only one who finds the article too ridiculous and trivia to be included.")
  • Lemmatizing can help (but it can also hurt)
  • Some comments lack sentence-ending punctuation, leading to run-ons. (e.g., "It was very constructive you are just very very stupid.")
  • Large number of grammatical errors
  • If any parsing is done (shallow or full), could be some issues
  • A Twitter trained parser might be useful for the short comments, but not the long ones.
  • All comments are decontextual 
  • Missing the comments before reduces information (many comments are responses to something, but we don't get the something) 
  • Also missing the page topic that the comment is referring to
  • Hence, we're missing discourse and pragmatic links
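The quick skim of neutral comments described at the top of this list could be turned into a rough count. The sketch below is only illustrative: the inline CSV is toy data, the `id` and `comment_text` column names are assumptions about the download, and `SWEAR_WORDS` is a hypothetical, deliberately tiny and mild lexicon.

```python
import csv
import io
import re

# Toy stand-in for the training CSV; the column names here are assumptions.
SAMPLE_CSV = """id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
1,"Thanks for fixing the damn citation, great work",0,0,0,0,0,0
2,"You are a damn idiot",1,0,1,0,1,0
3,"Please see the talk page before reverting",0,0,0,0,0,0
"""

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical lexicon for illustration; a real one would be far larger.
SWEAR_WORDS = {"damn", "hell", "crap"}


def is_neutral(row):
    """A comment is neutral when every label is 0."""
    return all(row[label] == "0" for label in LABELS)


def has_swear_word(text):
    """True if any token in the comment is in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in SWEAR_WORDS for token in tokens)


def neutral_swear_rate(rows):
    """Share of neutral comments that contain at least one lexicon word."""
    neutral = [row for row in rows if is_neutral(row)]
    if not neutral:
        return 0.0
    hits = sum(has_swear_word(row["comment_text"]) for row in neutral)
    return hits / len(neutral)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rate = neutral_swear_rate(rows)
print(f"{rate:.0%} of neutral comments contain a lexicon word")
```

If the rate on the real training file came out near zero, that would support the worry that a model could learn "swear word means label" as a shortcut.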

Some Thoughts
The data is not simply toxic comments, but it’s the kind of toxic comments that young men make to other young men.  And the labels are what young, educated, non-native English speaking men think counts as toxic.

My gut feeling is that the generalizability of any model trained on this data will be limited. 

My main concern though is the lack of linguistic context (under Discourse). Exactly what counts as "toxic" under these de-contextual circumstances? Would the rating be the same if the context were present? I don't know. My hunch is that at least some of these ratings would change.

Wednesday, October 18, 2017

A linguist asks some questions about word vectors

I have at best a passing familiarity with word vectors, strictly from a 30,000-foot view. I've never directly used them outside a handful of toy tutorials (though they have been embedded in a variety of tools I use professionally). There is something very linguisticee about word vectors, so I'm generally happy with their use. The basic idea goes back a ways, but was most famously articulated by J. R. Firth's pithy phrase "You shall know a word by the company it keeps" (c. 1957).

This is a set of questions I had when reading Chris McCormick's tutorial Word2Vec Tutorial - The Skip-Gram Model. My ideal audience is NLPers who have a good understanding of the SOTA of word vectors.
Quick review of the tutorial:
  • Key Take Away: Word Vectors = the output probabilities that relate how likely it is to find a vocabulary word near a given input word.
  • Methods: ngrams for training, order doesn't matter
  • Insight: If two words have similar contexts, then the network is motivated to learn similar word vectors for the two words.
  • Why sentences?
    • When choosing word window size, word vectors respect sentence boundaries. Why?
      • By doing this, vectors are modeling edited, written language.
      • Vectors are assuming the semantics of sentences are coherent in some way.
      • Vectors are, in essence, relying on sentences to be semantically coherent. As long as they are, the method works, but when they aren’t, how does this method break down?
    • This is an assumption often broken in spoken language with disfluencies, topic shifts, interruptions, etc.
    • How else might spoken language models differ?
    • How might methods change to account for this?
    • Do words from previous sentences have semantic relationships to the words in following sentences? (Yes. Because discourse.)
    • Why miss out on that information?
  • Why the middle?
    • Vectors use words “in the middle of a sentence”? Why?
    • Do words that tend to occur in the middle of sentences differ in meaningful ways from those that do not? (Possibly. E.g., verbs and prepositions rarely begin or end sentences. Information structure affects internal ordering of constituents in interesting ways).
    • Did they look at the rate of occurrences of each unique lexical entry in-the-middle and near the peripheries?
  • Synthetic Languages:
    • How well would word vectors work for a synthetic language like Mohawk?
    • What pre-processing steps might need to be added/modified?
    • Would those modifications be enough to get similar/useful results?
  • Antonyms: Can the same method learn the difference between “ant” and “ants”?
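To make the sentence-boundary and "middle of the sentence" questions concrete, here is a minimal sketch of how skip-gram training pairs can be generated with a fixed window. The function name, whitespace tokenization, and window size are illustrative assumptions, not the tutorial's actual code.

```python
def skipgram_pairs(sentences, window=2):
    """Yield (input_word, context_word) pairs, never crossing a sentence boundary."""
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, center in enumerate(tokens):
            # Clip the window at the edges of the sentence.
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    yield center, tokens[j]


corpus = [
    "the quick brown fox jumps",
    "so what happened next",
]
pairs = list(skipgram_pairs(corpus, window=2))
print(pairs[:4])
```

Two things fall out of this toy version. No pair ever links "jumps" to "so", so any discourse relation between adjacent sentences is invisible. And words at the edges of a sentence produce fewer pairs than words in the middle, simply because their window gets clipped.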
This was a fun read.

The original tutorial:
McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from

Thursday, September 14, 2017

Nuts and Bolts of Applying Deep Learning (Andrew Ng)

I recently watched Andrew Ng's excellent lecture from 2016 Nuts and Bolts of Applying Deep Learning and took notes. I post them as a helpful resource for anyone who wants to watch the video.

I broke it into the following sections
  1. End-to-End DL for rich output
  2. Buckets of DL
  3. Bias and variance
  4. Applied machine learning work flow
  5. New Era of ML
  6. Build a Unified Data Warehouse
  7. The 70/30 Split Revisited
  8. Comparing to Human Level Performance
  9. How do you define human level performance?
  10. How do you build a career in machine learning?
  11. AI is the new electricity
  1. End to end DL – work flow
  2. Bias and variance has changed in era of deep learning
  3. DL been around for decades, why do they work well now?
    • Scale of data and computation
    • Two teams
      • AI Teams
      • Systems team
      • Sit together
      • Difficult for any one human to be sufficiently expert in multiple fields
End-to-End DL for rich output
    • From first three buckets below
    • Traditional ML models output real numbers
    • End-to-end DL can output more complex things than numbers
      • Sentence captions for images
      • Speech-to-text
      • Machine translation
      • Synthesize new images (13:00)
    • End-to-End DL not the solution to everything.
      • End-to-end = having just a DL between input and output
      • Rules for when to use (13:35)
        • Old way: audio ------> phonemes --> transcript
        • New DL way: audio -----------------> transcript
      • Makes for great PR, but only works sometimes (15:31)
      • Achilles heel – need lots of labeled data
      • Maybe phonemes are just a fantasy of linguists (15:48)
      • Advantage of old non-end-to-end architecture is it allows you to manually add more information to the processing (18:16)
      • Also, for self-driving cars, no one has enough data (right now) to make end-to-end work (20:42)
    • Common problem – after first round of dev, ML not working that well, what do you do next?
      • Collect more data
      • Train longer
      • Different architecture (e.g., switch to NNs)
      • Regularization
      • Bigger model
      • More GPUs
    • Skill in ML engineer is knowing how to make these decisions (22:33)
Buckets of DL
  1. General models
    • Densely connected layers – FC
    • Sequence models – 1D (RNN, LSTM, GRU, attention)
    • Image models – 2D, 3D (Convo nets)
    • Other – unsupervised, reinforcement
  2. First three buckets driving market advances
  3. But "Other" bucket is future of AI
Bias and variance – evolving
  1. Scenario: build human level speech rec system
    • Measure human level error – 1%
    • Training set error – 5%
    • Dev set – 6%
  2. Bias = difference between human error level and your system’s training error
  3. TIP: For bias problems try training a bigger model (25:21)
  4. Variance (overfitting): if Human 1%, Training 2%, Dev 6%
  5. TIP: for variance, try adding regularization, early stopping, best bet = more data
  6. Both high bias and high variance: if Human 1%, Training 5%, Dev 10%
  7. “sucks for you” (direct quote 26:30)
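The three scenarios above reduce to two subtractions. The sketch below assumes bias is measured as training error minus human-level error and variance as dev error minus training error. The `diagnose` function and its `threshold` cutoff are illustrative choices, not something from the lecture.

```python
def diagnose(human_error, train_error, dev_error, threshold=0.02):
    """Return the bias and variance gaps plus the matching tips from the notes.

    `threshold` is an arbitrary illustrative cutoff, not a value from the lecture.
    """
    bias = train_error - human_error      # gap between human level and training set
    variance = dev_error - train_error    # gap between training set and dev set
    tips = []
    if bias > threshold:
        tips.append("high bias: try a bigger model")
    if variance > threshold:
        tips.append("high variance: try regularization, early stopping, more data")
    return bias, variance, tips


scenarios = {
    "bias": (0.01, 0.05, 0.06),
    "variance": (0.01, 0.02, 0.06),
    "both": (0.01, 0.05, 0.10),
}
for name, errors in scenarios.items():
    bias, variance, tips = diagnose(*errors)
    print(f"{name}: bias={bias:.2f}, variance={variance:.2f}, tips={tips}")
```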
Applied machine learning work flow
  1. Is your training error high?
    • Yes
      • Bigger model
      • Train longer
      • New architecture
      • Repeat until doing well on training set
  2. Is dev error high?
    • Yes
      • Add data
      • Regularization
      • New architecture
      • Repeat until doing well on dev set
  3. Done
New Era of ML
  1. We now know whatever problem you are facing (high bias or high variance) you have at least one action you can take to correct
  2. No longer a bias/variance trade-off (29:47)
  3. “Dumb” formula of bigger model/more data is easy to implement even for non-experts and is enough to do very well on a lot of problems (31:09)
  4. More data has led to interesting investments
    • Data synthesis - Growing area
    • Examples-
      • OCR at Baidu
      • Take random image
      • Random word
      • Type random word in Microsoft Word
      • Use random font
      • You just created training data for OCR
      • Still takes some human intervention, but lots of progress
    • Speech recognition
      • Take clean audio
      • Add random noise to background for more data
      • E.g., add car noise
      • Works remarkably well
    • NLP
      • Take ungrammatical sentences and auto-correct
      • Easy to create ungrammatical sentences programmatically
    • Video games in RL
  5. Data synthesis has a lot of limits (36:24)
    • Why not take cars from Grand Theft Auto and use that as training data for self-driving cars
    • 20 cars in video game enough to give “realistic” impression to player
    • But 20 cars is very impoverished data set for self-driving cars
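The NLP item above (create ungrammatical sentences programmatically, then train a model to correct them) can be sketched in a few lines. The corruption used here, swapping two adjacent words, is only one crude assumption about how such data might be made, and the function names are hypothetical.

```python
import random


def corrupt(sentence, rng):
    """Scramble a sentence by swapping two adjacent words."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def synthesize_pairs(sentences, seed=0):
    """Return (corrupted, original) training pairs."""
    rng = random.Random(seed)
    return [(corrupt(sentence, rng), sentence) for sentence in sentences]


clean = ["the cat sat on the mat", "she reads a book every night"]
for noisy, target in synthesize_pairs(clean):
    print(f"{noisy!r} -> {target!r}")
```

A swap does not always produce an ungrammatical sentence, and real learner errors look quite different from random swaps. That is the same limit the Grand Theft Auto example points at: synthetic data is cheap, but it can be an impoverished sample of the real thing.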
Build a Unified Data Warehouse
  1. Employees can be possessive of "their" data
  2. Baidu- it’s not your data, it’s company data
  3. Access rights can be a different issue
  4. But warehouse everything together
  5. Kaggle
The 70/30 Split Revisited
  1. In academia, common for test/train to come from same distribution
  2. But more common in industry for test and train to come from different distributions
    • E.g., speech rec at Baidu
      • Speech enabled rear view mirror (in China)
      • 50,000 hours of regular speech data
      • Data not from rear-view mirror interactions though
      • Collect another 10 hours of rear-view mirror scenario
    • What do you do with the original 50,000 hours of not-quite right data?
      • Old method would be to build a different model for each scenario
      • New era, one model for all data
      • Bad idea: split 50,000 into training/dev, use the 10 as test. DON’T DO THIS.
      • TIP: Make sure dev and test are from same distro (boosts effectiveness)
      • Good Idea: make 50,000 train, split the 10 into dev/test
    • Dev set = problem specification
      • Me: "dev set = problem you are trying to solve"
    • Also, split off just 20 hours from 50,000 to create tiny “dev-train” set
      • this has same distro as train
  3. Mismatched train and dev set is problem that academia doesn’t work on much
    • some work on domain adaptation, but not much (44:53)
  4. New architecture fix = “hail mary” (48:58)
  5. Takes a long time to really grok bias/variance
    • People who really understand bias/variance deeply are able to drive rapid progress in machine learning (50:33)
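The recommended split from the rear-view mirror example can be written down as a small allocation. This is a sketch under stated assumptions: the function name is hypothetical, and splitting the scarce target data evenly between dev and test is an illustrative choice, not something the notes specify.

```python
def split_hours(general_hours, target_hours, train_dev_hours=20):
    """Allocate hours of audio to each split, following the notes.

    general_hours: plentiful data from a different distribution (e.g., 50,000).
    target_hours: scarce data from the scenario you care about (e.g., 10).
    """
    dev = target_hours / 2
    test = target_hours - dev
    return {
        "train": general_hours - train_dev_hours,  # general distribution
        "train_dev": train_dev_hours,              # same distribution as train
        "dev": dev,                                # target distribution
        "test": test,                              # target distribution
    }


splits = split_hours(50_000, 10)
print(splits)
```

The point of the layout is that dev and test both come from the rear-view mirror data, so the number you tune against is the number you actually care about.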
Common Theme – Comparing to Human Level Performance
  1. Common to achieve human level performance, then level off
  2. Why?
    • Audience: Labels come from humans
    • Audience: Researchers get satisfied with results (the laziness hypothesis)
    • Andrew: theoretical limits (aka optimal error rate, Bayes rate)
      • Some audio so bad, impossible to transcribe (phone call from a rock concert)
      • Some images so blurry, impossible to interpret
    • Humans are really good at some things, so once you surpass human level accuracy, there’s not much room left to improve (54:38)
  3. While worse than humans, still ways to improve
    • Get labels from humans
    • Error analysis
    • Estimate bias/variance effects
  4. For tasks that humans are bad at (say 30% error rate), really hard to find guidance on how to improve
How do you define human level performance?
  1. Quiz: Which is the most useful definition? (101:00)
    • Example: Medical image reading
      1. Typical non-doctor error - 3%
      2. Typical doctor – 1%
      3. Expert doctor – 0.7%
      4. Team of expert doctors – 0.5%
    • Answer: Team of expert doctors is best because ideally you are using human performance to proxy optimal error rate.
What can AI do? (106:30)
  1. Anything that a typical person can do in less than one second.
    • E.g., Perception tasks
    • Audience: if a human can do it in less than a second, you can get a lot of data
How do you build a career in machine learning (111:00)
  1. Andrew says he does not have a great answer (me: but he does have a good one)
    • Taking a ML course
    • Attend DL school
    • Work on project yourself (Kaggle)
    • Mimic PhD student process
      • Read a lot of papers (20+)
      • Replicate results
    • Dirty work
      • Downloading/cleaning data
      • Re-running someone’s code
    • Don’t only do dirty work
    • PhD process + Dirty work = reliable
      • Keep it up for a year
      • Competency
AI is the new electricity (118:00)
  1. Transforms industry after industry
  2. Get in on the ground floor
  3. NOTE: this is the title of his follow-up talk, which has a video link at the end of the one above.

Saturday, April 8, 2017

NLPers: How would you characterize your linguistics background?

That was the poll question my hero Professor Emily Bender posed on Twitter March 30th. 573 tweets later, a truly epic thread had been created pitting some of the most influential NLPers in the world head to head in a contest of wit and debate that is truly what the field of NLP needed. Unfortunately, Twitter's app is hapless at displaying the thread as a single conversation. But student Sebastian Mielke used a Chrome extension called Treeverse to put together the whole thread into a single, readable format, complete with sections!

If you are at all interested in NLP or linguistics, this is a must read: NLP/CL Twitter Megathread.

I would be remiss if I didn't note my own small role in this. Emily's poll was sparked by my own Tweet where I said I was glad the Linguistic Society of America is starting their own Society for Computation in Linguistics because "the existing CL and NLP communities have gotten farther and farther off the linguistics path."

Happy reading.

Saturday, March 25, 2017

Small Towns Do Tech Startups Too

I'm participating in the Startup Weekend event in Chico CA sponsored by Chicostart. Great first night. The 60 second pitches were awesome. So much variety and interesting problems to solve. I pitched my own idea about using Watson to identify fake news, but I only got two blue stickers, boo hoo. I am very impressed with the space that hosted us. They are a multi-use space that includes a lot of tech startups. Kudos for putting such a great state-of-the-art space in a small college town.

I was particularly impressed with the range of pitchers. Millennials, students, professionals, middle-aged folks, and at least one senior. Not bad for a small town with big dreams.

FYI - I'll be helping build a startup this weekend focused on smart housing devices. Will be a lot of fun.
