Tuesday, December 26, 2017

Putting the Linguistics into Kaggle Competitions

In the spirit of Dr. Emily Bender’s NAACL blog post Putting the Linguistics in Computational Linguistics, I want to apply some of her thoughts to the data from the recently opened Kaggle competition Toxic Comment Classification Challenge.

Dr. Bender's point: not all NAACL papers need to be linguistically informative—but they should all be linguistically informed.  

Toxic Comment Challenge Correlate: not all Kaggle submissions need to be linguistically informative—but they should all be linguistically informed.

First let me say I'm really excited about this competition because A) it uses real language data, B) it's not obvious what techniques will work well (it's a novel/interesting problem), and C) it's a socially important task to get right. Kudos to Kaggle for supporting this.  


Dr. Bender identifies four steps for putting linguistics into CL:

Step 1: Know Your Data
Step 2: Describe Your Data Honestly
Step 3: Focus on Linguistic Structure, At Least Some of the Time
Step 4: Do Error Analysis
Since this is preliminary, I’ll concentrate on just steps 1 & 3.

UPDATE 12/28/2017 #2: Robert Munro has written up an excellent analysis of this data, A Step in the Right Direction for NLP. He addresses many of the points I make below. He also makes some solid suggestions for how to account for bias. Well worth a read.

UPDATE 12/28/2017 #1: My original analysis of their IAA process was mistaken. I have corrected it. Also added another comment under "Some Thoughts" at the end.

Step 1: Know Your Data

The data download
  • “large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are: toxic, severe_toxic, obscene, threat, insult, identity_hate” (note: there are a large number of comments with no ratings; presumably these are neutral)
  • Actual download for Training set is a CSV of ID, comment, rating (1 or 0) for each label. 
  • 100k talk page diffs
  • Comments come from 2001–2015
  • There are 10 judgements per diff (final rating is the average)
  • Here’s the original paper discussing this work Ex Machina: Personal Attacks Seen at Scale by Wulczyn, Thain, and Dixon.
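For anyone exploring the download, the six labels can be tallied straight from the CSV. A minimal sketch: the column names follow the competition's train.csv as I understand it, and the three rows here are invented stand-ins, not real data.

```python
import csv
import io

# Toy stand-in for the competition's train.csv. Assumed columns: id,
# comment_text, plus one 0/1 column per label (these rows are invented).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

toy_csv = io.StringIO(
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    '1,"Thanks for the edit!",0,0,0,0,0,0\n'
    '2,"You are an idiot.",1,0,0,0,1,0\n'
    '3,"I will find you.",1,0,0,1,0,0\n'
)

def label_stats(fileobj):
    """Return per-label prevalence and the fraction of all-zero ('neutral') rows."""
    rows = list(csv.DictReader(fileobj))
    n = len(rows)
    prevalence = {lab: sum(int(r[lab]) for r in rows) / n for lab in LABELS}
    neutral = sum(all(int(r[lab]) == 0 for lab in LABELS) for r in rows) / n
    return prevalence, neutral

prevalence, neutral_rate = label_stats(toy_csv)
```

Running this over the real file would confirm the observation above that a large share of comments carry no label at all.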
Who created the data?
  • Wikipedia editors
  • Largely young white men (lots of research on this)
    • 2011 WMF survey, 90% of Wikipedians are men, 9% are female, and 1% are transgender/transsexual
    • 53% are under 30 (see here for more info)
Who rated the data?
  • 3591 Crowdflower participants (you have to dig into the paper and associated repos to discover this)
  • This is a huge number of raters.
  • Looking at the demographics spreadsheet available here, most are not native English speakers. I'm surprised this was allowed.
  • Only 18% were native speakers of English
  • Only three gender options were offered (male, female, other). Only one user chose other
  • 54% are under 30
  • 65% chose male as their gender response
  • 68% chose bachelors, masters, doctorate, or professional as level education
  • I'm admittedly confused by how they obtained their final ratings because the paper discusses a slightly different task (not involving the six labels specifically), and the authors' wiki discussion of annotation and data release (here) also discusses what look like slightly different tasks. Were these six labels aggregated post hoc over different tasks using the same data? I don't see a clear example of asking human raters to give judgments for these six specific labels. I'll keep digging.
  • In general, I find the labels problematic (Dixon admitted to some of this in a discussion here).
Language varieties (in comments data)?
  • Lightly edited English, mostly North American (e.g., one comment said they "corrected" the spelling of “recognised” to “recognized”)
  • Some commenters self-identify as non-native speakers
  • Lots of spelling typos
  • Mixed punctuation usage
  • Some meta tags in comments
  • Inconsistent use of spaces before/after commas
  • Inconsistent use of caps (inconsistent for both all caps and camel case)
  • Some examples of ASCII art
  • Excessive apostrophe usage (one comment has several hundred apostrophes)
  • Comment length varies considerably
    • From five words, to hundreds of words
    • This might have a spurious effect on the human raters
    • Humans get tired. When they encounter a long comment, they may be tempted to rely solely on the first few sentences, or possibly the mere presence of some key words.
  • Some character encoding issues: â€
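That “â€” debris is classic mojibake: UTF-8 bytes re-decoded under Windows-1252. A quick sketch of how it arises, and one way to undo it when the round-trip is clean (no bytes lost):

```python
# A curly apostrophe (U+2019) encoded as UTF-8 and then wrongly decoded
# as Windows-1252 produces the familiar "â€" garbage.
original = "don’t"
garbled = original.encode("utf-8").decode("cp1252")

# If no bytes were dropped along the way, reversing the mis-decode
# recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
```

In practice some of these strings are lossy (bytes that cp1252 can't represent get mangled), so a repair like this won't always succeed on the actual comments.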

Step 3: 
Focus on Linguistic Structure, At Least Some of the Time

  • I did a quick skim of a few dozen of the neutral comments (those with 0s for all labels) and it looked like they had no swear words.
  • I fear this will lead a model to over-learn that the mere presence of a swear word means it should get a label.
  • See the excellent blog Strong Language for reasons this is not true.
  • I would throw in some extra training data that includes neutral comments with swear words.
  • Perhaps due to non-native speakers or typos, the wrong morphological variants can be found for some words (e.g., "I am not the only one who finds the article too ridiculous and trivia to be included.")
  • Lemmatizing can help (but it can also hurt)
  • Some comments lack sentence ending punctuation, leading to run ons. (e.g., "It was very constructive you are just very very stupid.")
  • Large number of grammatical errors
  • If any parsing is done (shallow or full), there could be some issues
  • A Twitter trained parser might be useful for the short comments, but not the long ones.
  • All comments are decontextualized
  • Missing the comments before reduces information (many comments are responses to something, but we don't get the something) 
  • Also missing the page topic that the comment is referring to
  • Hence, we're missing discourse and pragmatic links
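One cheap way to probe the swear-word worry above: compare lexicon hits in labeled vs. neutral comments. Everything below is a placeholder (the mini-lexicon and the comments are invented); a real check would run over the actual training CSV with a substantial profanity lexicon.

```python
# Hypothetical mini-lexicon; a real check needs a proper profanity list.
SWEAR_LEXICON = {"damn", "hell", "crap"}

def has_swear(comment, lexicon=SWEAR_LEXICON):
    """True if any (lightly normalized) token of the comment is in the lexicon."""
    tokens = {t.strip(".,!?;:'\"").lower() for t in comment.split()}
    return bool(tokens & lexicon)

# (comment, any_label) pairs; invented examples.
comments = [
    ("Damn, that was a hell of a good edit!", 0),  # neutral despite swearing
    ("You are a damn idiot.", 1),
    ("Thanks for fixing the citation.", 0),
]

neutral = [c for c, y in comments if y == 0]
swear_rate_in_neutral = sum(has_swear(c) for c in neutral) / len(neutral)
```

If the real rate comes out near zero, that's evidence the model will indeed over-learn "swear word = toxic," and the neutral-but-sweary training data suggested above becomes important.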

Some Thoughts
The data is not simply toxic comments; it’s the kind of toxic comments that young men make to other young men. And the labels reflect what young, educated, non-native-English-speaking men think counts as toxic.

My gut feeling is that the generalizability of any model trained on this data will be limited. 

My main concern though is the lack of linguistic context (under Discourse). Exactly what counts as "toxic" under these de-contextual circumstances? Would the rating be the same if the context were present? I don't know. My hunch is that at least some of these ratings would change.

Wednesday, October 18, 2017

A linguist asks some questions about word vectors

I have at best a passing familiarity with word vectors, strictly from a 30,000-foot view. I've never directly used them outside a handful of toy tutorials (though they have been embedded in a variety of tools I use professionally). There is something very linguistic-y about word vectors, so I'm generally happy with their use. The basic idea goes back a ways, but was most famously articulated in J. R. Firth's pithy phrase "You shall know a word by the company it keeps" (c. 1957).

This is a set of questions I had when reading Chris McCormick's tutorial Word2Vec Tutorial - The Skip-Gram Model. My ideal audience is NLPers who have a good understanding of the SOTA of word vectors.
Quick review of the tutorial:
  • Key Take Away: Word Vectors = the output probabilities that relate how likely it is to find a vocabulary word nearby a given input word.
  • Methods: ngrams for training, order doesn't matter
  • Insight: If two words have similar contexts, then the network is motivated to learn similar word vectors for the two words.
  • Why sentences?
    • When choosing word window size, word vectors respect sentence boundaries. Why?
      • By doing this, vectors are modeling edited, written language.
      • Vectors are assuming the semantics of sentences are coherent in some way.
      • Vectors are, in essence, relying on sentences to be semantically coherent. As long as they are, the method works, but when they aren’t, how does this method break down?
    • This is an assumption often broken in spoken language with disfluencies, topic shifts, interruptions, etc.
    • How else might spoken language models differ?
    • How might methods change to account for this?
    • Do words from previous sentences have semantic relationships to the words in following sentences? (Yes. Because discourse.)
    • Why miss out on that information?
  • Why the middle?
    • Vectors use words “in the middle of a sentence”? Why?
    • Do words that tend to occur in the middle of sentences differ in meaningful ways from those that do not? (Possibly. E.g., verbs and prepositions rarely begin or end sentences. Information structure affects internal ordering of constituents in interesting ways.)
    • Did they look at the rate of occurrences of each unique lexical entry in-the-middle and near the peripheries?
  • Synthetic Languages:
    • How well would these word vectors work for a synthetic language like Mohawk?
    • What pre-processing steps might need to be added/modified?
    • Would those modifications be enough to get similar/useful results?
  • Antonyms: Can the same method learn the difference between “ant” and “ants”?
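For concreteness, the pair-extraction step the tutorial describes can be sketched in a few lines. This is only the (center, context) enumeration with a window that respects sentence boundaries (the assumption questioned above), not the neural training itself:

```python
def skipgram_pairs(sentences, window=2):
    """Yield (center, context) pairs; windows never cross sentence boundaries."""
    pairs = []
    for sent in sentences:
        for i, center in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, sent[j]))
    return pairs

# Toy corpus: two pre-tokenized sentences.
sents = [["you", "shall", "know", "a", "word"], ["by", "its", "company"]]
pairs = skipgram_pairs(sents, window=1)
```

Note that "word" and "by" are adjacent in the running text but never form a pair, because the sentence boundary cuts the window; that is exactly the discourse information the questions above worry about losing.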
This was a fun read.

The original tutorial:
McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com

Thursday, September 14, 2017

Nuts and Bolts of Applying Deep Learning (Andrew Ng)

I recently watched Andrew Ng's excellent lecture from 2016 Nuts and Bolts of Applying Deep Learning and took notes. I post them as a helpful resource for anyone who wants to watch the video.

I broke it into the following sections
  1. End-to-End DL for rich output
  2. Buckets of DL
  3. Bias and variance
  4. Applied machine learning work flow
  5. New Era of ML
  6. Build a Unified Data Warehouse
  7. The 70/30 Split Revisited
  8. Comparing to Human Level Performance
  9. How do you define human level performance?
  10. How do you build a career in machine learning?
  11. AI is the new electricity
  1. End to end DL – work flow
  2. Bias and variance has changed in era of deep learning
  3. DL has been around for decades, so why does it work well now?
    • Scale of data and computation
    • Two teams
      • AI Teams
      • Systems team
      • Sit together
      • Difficult for any one human to be sufficiently expert in multiple fields
End-to-End DL for rich output
    • From first three buckets below
    • Traditional ML models output real numbers
    • End-to-end DL can output more complex things than numbers
      • Sentence captions for images
      • Speech-to-text
      • Machine translation
      • Synthesize new images (13:00)
    • End-to-End DL not the solution to everything.
      • End-to-end = having just a DL between input and output
      • Rules for when to use (13:35)
        • Old way: audio ------> phonemes --> transcript
        • New DL way: audio -----------------> transcript
      • Makes for great PR, but only works some times (15:31)
      • Achilles heel – need lots of labeled data
      • Maybe phonemes are just a fantasy of linguists (15:48)
      • Advantage of old non-end-to-end architecture is it allows you to manually add more information to the processing (18:16)
      • Also, for self-driving cars, no one has enough data (right now) to make end-to-end work (20:42)
    • Common problem – after first round of dev, ML not working that well, what do you do next?
      • Collect more data
      • Train longer
      • Different architecture (e.g., switch to NNs)
      • Regularization
      • Bigger model
      • More GPUs
    • Skill in ML engineer is knowing how to make these decisions (22:33)
Buckets of DL
  1. General models
    • Densely connected layers – FC
    • Sequence models – 1D (RNN, LSTM, GRU, attention)
    • Image models – 2D, 3D (ConvNets)
    • Other – unsupervised, reinforcement
  2. First three buckets driving market advances
  3. But "Other" bucket is future of AI
Bias and variance – evolving
  1. Scenario: build human level speech rec system
    • Measure human level error – 1%
    • Training set error – 5%
    • Dev set – 6%
  2. Bias = difference between human error level and your system’s
  3. TIP: For bias problems try training a bigger model (25:21)
  4. Variance (overfitting): if Human 1%, Training 2%, Dev 6%
  5. TIP: for variance, try adding regularization, early stopping, best bet = more data
  6. Both high bias and high variance: if Human 1%, Training 5%, Dev 10%
  7. “sucks for you” (direct quote 26:30)
Applied machine learning work flow
  1. Is your training error high
    • Yes
      • Bigger model
      • Train longer
      • New architecture
      • Repeat until doing well on training set
  2. Is dev error high?
    • Yes
      • Add data
      • Regularization
      • New architecture
      • Repeat until doing well on dev set
  3. Done
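The two diagnostics above can be sketched as a toy decision helper. The 2-point gap threshold is my own illustrative choice, not from the talk:

```python
def next_step(human_err, train_err, dev_err, gap=0.02):
    """Suggest an action from the bias/variance diagnostic in Ng's workflow."""
    if train_err - human_err > gap:   # training error high vs. humans -> bias
        return "bigger model / train longer / new architecture"
    if dev_err - train_err > gap:     # dev error high vs. training -> variance
        return "more data / regularization / new architecture"
    return "done"

# Ng's speech example: human 1%, training 5%, dev 6% -> a bias problem.
action = next_step(0.01, 0.05, 0.06)
```

The point of the flowchart is that each diagnosis comes with at least one concrete action, which is what the "New Era" section below picks up.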
New Era of ML
  1. We now know whatever problem you are facing (high bias or high variance) you have at least one action you can take to correct
  2. No longer a bias/variance trade-off (29:47)
  3. “Dumb” formula of bigger model/more data is easy to implement even for non-experts and is enough to do very well on a lot of problems (31:09)
  4. More data has led to interesting investments
    • Data synthesis - Growing area
    • Examples-
      • OCR at Baidu
      • Take random image
      • Random word
      • Type random word in Microsoft Word
      • Use random font
      • You just created training data for OCR
      • Still takes some human intervention, but lots of progress
    • Speech recognition
      • Take clean audio
      • Add random noise to background for more data
      • E.g., add car noise
      • Works remarkably well
    • NLP
      • Take ungrammatical sentences and auto-correct
      • Easy to create ungrammatical sentences programmatically
    • Video games in RL
  5. Data synthesis has a lot of limits (36:24)
    • Why not take cars from Grand Theft Auto and use that as training data for self-driving cars
    • 20 cars in video game enough to give “realistic” impression to player
    • But 20 cars is very impoverished data set for self-driving cars
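The speech-synthesis recipe above (clean audio plus random background noise) can be sketched as a toy function. Real pipelines operate on sampled waveform arrays; plain floats are enough to show the idea:

```python
import random

def mix_noise(clean, noise, noise_level=0.1, seed=0):
    """Overlay a random slice of `noise` onto `clean` at reduced volume."""
    rng = random.Random(seed)
    start = rng.randrange(len(noise) - len(clean) + 1)
    return [c + noise_level * n
            for c, n in zip(clean, noise[start:start + len(clean)])]

clean = [0.0, 0.5, -0.5, 0.25]   # stand-in for a clean utterance
noise = [0.1] * 100              # stand-in for, e.g., a car-noise loop
augmented = mix_noise(clean, noise)
```

Each (clean clip, noise slice, level) combination yields a new training example, which is why this multiplies the data so cheaply.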
Build a Unified Data Warehouse
  1. Employees can be possessive of "their" data
  2. Baidu: it’s not your data, it’s company data
  3. Access rights can be a different issue
  4. But warehouse everything together
  5. Kaggle
The 70/30 Split Revisited
  1. In academia, common for test/train to come from same distribution
  2. But more common in industry for test and train to come from different distributions
    • E.g., speech rec at Baidu
      • Speech enabled rear view mirror (in China)
      • 50,000 hours of regular speech data
      • Data not from rear-view mirror interactions though
      • Collect another 10 hours of rear-view mirror scenario
    • What do you do with the original 50,000 hours of not-quite right data?
      • Old method would be to build a different model for each scenario
      • New era, one model for all data
      • Bad idea: split the 50,000 hours into training/dev and use the 10 hours as test. DON’T DO THIS.
      • TIP: Make sure dev and test are from same distro (boosts effectiveness)
      • Good idea: make the 50,000 hours train, split the 10 hours into dev/test
    • Dev set = problem specification
      • Me: "dev set = problem you are trying to solve"
    • Also, split off just 20 hours from 50,000 to create tiny “dev-train” set
      • this has same distro as train
  3. Mismatched train and dev sets are a problem that academia doesn’t work on much
    • some work on domain adaptation, but not much (44:53)
  4. New architecture fix = “hail mary” (48:58)
  5. Takes a long time to really grok bias/variance
    • People who really understand bias/variance deeply are able to drive rapid progress in machine learning (50:33)
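The 50,000-hour/10-hour advice reduces to a simple splitting recipe. A sketch (the sizes, fraction, and dev-train size are illustrative):

```python
def make_splits(big_corpus, in_domain, dev_frac=0.5, devtrain_size=2):
    """Split per Ng's advice: train on the large off-distribution corpus,
    carve dev/test (same distribution!) from the small in-domain set, and
    peel a tiny 'dev-train' slice off train to diagnose distribution mismatch."""
    k = int(len(in_domain) * dev_frac)
    return {
        "train": big_corpus[devtrain_size:],
        "dev_train": big_corpus[:devtrain_size],  # same distro as train
        "dev": in_domain[:k],                     # same distro as test
        "test": in_domain[k:],
    }

# Stand-ins: 100 off-distribution examples, 10 in-domain ones.
splits = make_splits(list(range(100)), list(range(100, 110)))
```

The key property is that dev and test come from the same (target) distribution, while dev-train shares the training distribution; comparing performance on the two dev sets isolates the mismatch.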
Common Theme – Comparing to Human Level Performance
  1. Common to achieve human level performance, then level off
  2. Why?
    • Audience: Labels come from humans
    • Audience: Researchers get satisfied with results (the laziness hypothesis)
    • Andrew: theoretical limits (aka optimal error rate, Bayes rate)
      • Some audio so bad, impossible to transcribe (phone call from a rock concert)
      • Some images so blurry, impossible to interpret
    • Humans are really good at some things, so once you surpass human level accuracy, there’s not much room left to improve (54:38)
  3. While worse than humans, still ways to improve
    • Get labels from humans
    • Error analysis
    • Estimate bias/variance effects
  4. For tasks that humans are bad at (say 30% error rate), really hard to find guidance on how to improve
How do you define human level performance?
  1. Quiz: Which is the most useful definition? (101:00)
    • Example: Medical image reading
      1. Typical non-doctor error - 3%
      2. Typical doctor – 1%
      3. Expert doctor – 0.7%
      4. Team of expert doctors – 0.5%
    • Answer: Team of expert doctors is best because ideally you are using human performance to proxy optimal error rate.
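In code, the quiz answer amounts to taking the best available human error as the proxy for the optimal (Bayes) rate:

```python
# Error rates from the medical-imaging quiz above.
human_errors = {
    "typical non-doctor": 0.03,
    "typical doctor": 0.01,
    "expert doctor": 0.007,
    "team of expert doctors": 0.005,
}

# Best human performance is the closest available proxy for optimal error.
bayes_proxy = min(human_errors.values())
best_group = min(human_errors, key=human_errors.get)
```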
What can AI do? (106:30)
  1. Anything that a typical person can do in less than one second.
    • E.g., Perception tasks
    • Audience: if a human can do it in less than a second, you can get a lot of data
How do you build a career in machine learning (111:00)
  1. Andrew says he does not have a great answer (me: but he does have a good one)
    • Taking a ML course
    • Attend DL school
    • Work on project yourself (Kaggle)
    • Mimic PhD student process
      • Read a lot of papers (20+)
      • Replicate results
    • Dirty work
      • Downloading/cleaning data
      • Re-running someone’s code
    • Don’t only do dirty work
    • PhD process + Dirty work = reliable
      • Keep it up for a year
      • Competency
AI is the new electricity (118:00)
  1. Transforms industry after industry
  2. Get in on the ground floor
  3. NOTE: this is the title of his follow-up talk, which has a video link at the end of the one above.

Saturday, April 8, 2017

NLPers: How would you characterize your linguistics background?

That was the poll question my hero Professor Emily Bender posed on Twitter March 30th. 573 tweets later, a truly epic thread had been created, pitting some of the most influential NLPers in the world head to head in a contest of wit and debate that is truly what the field of NLP needed. Unfortunately, Twitter's app is hapless at displaying the thread as a single conversation. But student Sebastian Meilke used a Chrome extension called Treeverse to put the whole thread into a single, readable format, complete with sections!

If you are at all interested in NLP or linguistics, this is a must read: NLP/CL Twitter Megathread.

I would be remiss if I didn't note my own small role in this. Emily's poll was sparked by my own Tweet where I said I was glad the Linguistics Society of America is starting their own Society for Computational Linguistics because "the existing CL and NLP communities have gotten farther and farther off the linguistics path"

Happy reading

Saturday, March 25, 2017

Small Towns Do Tech Startups Too

I'm participating in the Startup Weekend event in Chico CA sponsored by Chicostart. Great first night. The 60 second pitches were awesome. So much variety and interesting problems to solve. I pitched my own idea about using Watson to identify fake news, but I only got two blue stickers, boo hoo. I am very impressed with the Build.com space that hosted us. They are a multi-use space that includes a lot of tech startups. Kudos for putting such a great state-of-the-art space in a small college town.

I was particularly impressed with the range of pitchers: Millennials, students, professionals, middle-aged folks, and at least one senior. Not bad for a small town with big dreams.

FYI - I'll be helping build a startup this weekend focused on smart housing devices. Will be a lot of fun.

Friday, March 24, 2017

Chico Startup Weekend

I'm looking forward to participating in Chico Startup Weekend, sponsored by Chicostart. I can't say what the weekend holds, but it's nice to know there's this kind of energy and drive in small town America. Startups ain't just for Austin and Mountain View.

Monday, March 20, 2017

Using IBM Watson Knowledge Studio to Train Machine Learning Models

Using the Free Trial version of IBM's Watson Knowledge Studio, I just annotated a text and created a machine learning model in about 3 hours without writing a single line of code. The mantra of WKS is that you don't program Watson, you teach Watson.

For demo purposes I chose to identify personal relationships in Shirley Jackson's 1948 short story The Lottery. This is a haunting story about a small village and its mindless adherence to an old and tragic tradition. I chose it because 1) it's short and 2) it has clear personal relationships like brothers, sisters, mothers, and fathers. I added a few other relations like AGENT_OF (which amounts to subjects of verbs) and X_INSIDE_Y for things like pieces of paper inside a box.
Caveat: This short story is really short: 3300 words. So I had no high hopes of getting a good model out of this. I just wanted to go through an entire machine learning work flow from gathering text data to outputting a complete model without writing a single line of code. And that's just what I did.

I spent about 30 minutes prepping the data. E.g., I broke it into 20 small snippets (to facilitate a test/train split later), and I also edited some quotation issues, spelling, etc.
It uploaded into WKS in seconds (by simply dragging and dropping the files into the browser tool). I then created a Type System to include entity types such as these:

And relation types such as these:
I then annotated the 20 short documents in less than two hours (as is so often the case, I re-designed my type system several times along the way; luckily WKS allows me to do this fluidly without having to re-annotate).

Here's a shot of my entity annotations:

Here's a shot of my relation annotations:

I then used these manually annotated documents as ground truth to teach a machine learning model to recognize the relationships automatically using a set of linguistic features (character and word ngrams, parts-of-speech, syntactic parses, etc). I accepted the WKS suggested split of documents as 70/23/2:
I clicked "Run" and waited:

The model was trained and evaluated in about ten minutes. Here's how it performed on entity types:

And here's how it performed on relation types:

This is actually not bad given how sparse the data is. I mean, an F1 of .33 on X_INSIDE_Y from only 29 training examples on a first pass. I'll take it, especially since that one is not necessarily obvious from the text. Here's one example of the X_INSIDE_Y relation:

So I was able to train a model with 11 entity types and 81 relation types on a small corpus in less than three hours start to finish without writing a single line of code. I did not program Watson. I taught it.
