tag:blogger.com,1999:blog-5208073967144633092024-03-13T13:16:34.482-04:00The Lousy LinguistNotes on linguistics and cognitionChrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.comBlogger657125tag:blogger.com,1999:blog-520807396714463309.post-57884783466318615712020-02-17T16:09:00.001-05:002020-02-17T16:09:24.505-05:00TV Linguistics - Pronouncify.com and the fictional Princeton Linguistics department [reposted from 11/20/10]<br />
<br />
I spent Thursday night on a plane so I missed <i>30 Rock</i> and the
most linguistics oriented sit-com episode since ... since ... um ...
okay, the most linguistics oriented sitcom episode EVER! But thanks to
the innerwebz, I have caught up on my TV addiction.<br />
<br />
The set-up has Jack Donaghy being the voice of <a href="http://www.nbc.com/30-rock/exclusives/pronouncify/">Pronouncify.com</a>, (I BEG you to sign up, PLEASE!!!!) a website that demonstrates the <i>correct </i>pronunciation
of all English words. Apparently, when Jack was a poor undergrad at
Princeton, he was hired by the "Linguistics Department" to pronounce
every word in an English dictionary to preserve the <i>correct </i>pronunciation
for generations to come. But they sold his readings, and hence his
voice is now the voice of Pronouncify.com (as well as the first perfect
microwave...).<br />
<br />
Here is as faithful a transcript of the critical dialogue as I can muster:<br />
<br />
<b><span class="Apple-style-span" style="color: red;">Jack</span></b>: <i>Those bastards</i>!<br />
<b><span class="Apple-style-span" style="color: blue;">Liz</span></b>: <i>Who bastards</i>?<br />
<b><span class="Apple-style-span" style="color: red;">Jack</span></b>: <i>Part of my Princeton scholarship included work for the <b>Linguistics department</b>.
They wanted me to record every word in the dictionary to preserve the
perfect American accent in case of nuclear war. Well, the cold war
ended, and Princeton began selling the recordings</i>.<br />
<b><span class="Apple-style-span" style="color: blue;">Liz</span></b>: <i>So people can just buy your voice</i>?<br />
<b><span class="Apple-style-span" style="color: red;">Jack</span></b>: <i>Ohhhh, the things it's been dragged into. Thomas the Tank Engine; Wu-Tang songs</i>...<br />
<br />
This
must have been the glory days before the hippies took over and started
"protecting" undergrads from "exploitation." Whatever...<br />
<br />
In
any case, it's understandable that this trivial tid-bit of academic
minutia blew right by most people, but it is a fact of the world we live
in that Princeton University does not have a linguistics department per
se. They do offer an <a href="http://www.princeton.edu/linguistics/about/">Undergraduate Program in Linguistics</a>
in which students can "pursue a Certificate in Linguistics," but this
is not an official department as far as I understand it. Jack, if he is
the same age as the actor Alec Baldwin, would have been at Princeton in the
late 1970s. Maybe they had a full-fledged department back then; I
honestly don't know.<br />
<br />
You can watch the episode <a href="http://www.nbc.com/30-rock/video/college/1261073/">College </a>at
NBC, or wherever else you prefer. BTW, there's an awesome ode to color
perception conundrums at the end as well. It's all kinda
linguisticee/cog sciencee (I never know how to add the -ee morpheme?).<br />
<br />
Random after-point: Near the end of Thursday's episode of <i>Community</i>, Dean Pelton actually utilized the Shakespearean subjunctive construction <i>Would that X were Y</i>... He says "<i>Would that this hoodie were a time hoodie</i>" around the 19:20 mark (see Hamlet, <i>would it were not so, you are my mother</i>). Just thought that was kinda awesome.<br />
<br />
And not for nuthin', but if you haven't seen Tina Fey's Mark Twain Prize speech, it's a gem: <a href="http://www.huffingtonpost.com/2010/11/16/tina-fey-mark-twain-speech_n_784178.html">HERE</a>.Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-21007549056883981532020-02-17T16:07:00.001-05:002020-02-17T17:26:20.244-05:00TV anachronisms - The birth of a metaphor[reposted from 4/12/13]<br />
<br />
A Twitter exchange with <a href="https://twitter.com/bgzimmer">Ben Zimmer</a> over the metaphorical use of the phrase "<i>pause button</i>" in the new TV show <a href="http://en.wikipedia.org/wiki/The_Americans_(2013_TV_series)">The Americans</a>
(set in 1981) led me to think about how metaphors begin their lives. I
didn't watch the episode in question, but apparently several viewers
noticed that the show used the phrase "pause button" metaphorically to
mean something like <i>to put a romantic relationship on hold</i>.<br />
<br />
Ben
tweeted this fact as a likely anachronism, presumably because
the technology of pause buttons was too young in 1981 to have
plausibly jumped to metaphorical use by then. I was not the only one
who immediately took to Google Ngrams to start testing this hypothesis.
In the end, Tweeter <a href="https://twitter.com/Manganpaper">@Manganpaper</a> found a good example from 1981 from some kind of <a href="http://books.google.com/books?id=s79X7hvRd8QC&q=%22pause+button%22">self-help book</a>.<br />
<br />
But what interests me is an example I found from 1987:<br />
<blockquote class="tr_bq">
<i>Consumers have pushed the "pause" button on sales of video-cassette recorders, for years in the fast-forward mode</i>.</blockquote>
Ben reluctantly conceded the example:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-mZuf1XOwaYc/UWeKzRxUKjI/AAAAAAAAAk0/VWM6cGyr4JU/s1600/zimmer.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="241" src="https://4.bp.blogspot.com/-mZuf1XOwaYc/UWeKzRxUKjI/AAAAAAAAAk0/VWM6cGyr4JU/s320/zimmer.jpg" width="320" /></a></div>
<br />
I'd
have to review my historical linguistics books, but I don't think
words necessarily shift their meanings radically all at once.
I believe they can take on characteristics of associated meanings
slowly, thus <a href="http://thelousylinguist.blogspot.com/2008/11/semantic-narrowing.html">widening or narrowing</a>
their meaning as their linguistic environment unfolds. Eventually, a
word can come to mean something quite radically different from what it
originally meant. I see no reason that the life of a metaphor could
not follow a similar trajectory. Ben objected to the fact that the 1987
use of "pause button" I linked to was semantically linked to the literal
use of actual pause buttons because it dealt with the conceptual space
of VCR sales. But my hunch is that this is how many metaphors start
their lives, making small conceptual leaps, not big ones. I could be
wrong though. The sad truth is that finding good empirical data for the
life span of metaphors is extremely difficult. The fact is that even
with the awe-inspiringly large natural language data
sets currently available in many languages, studying a linguistically
high-level data type like metaphor remains out of reach of most NLP
techniques.<br />
<br />
But this is why our NLP blood boils. There are miles to go before we sleep...Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com2tag:blogger.com,1999:blog-520807396714463309.post-35602927024838490362017-12-26T19:20:00.000-05:002017-12-28T16:36:05.251-05:00Putting the Linguistics into Kaggle Competitions<h3 class="graf graf--h3" name="5172">
<span style="font-size: small;"><span style="font-weight: normal;">In the spirit of Dr. Emily Bender’s NAACL blog post <a class="markup--anchor markup--p-anchor" data-href="https://naacl2018.wordpress.com/2017/12/19/putting-the-linguistics-in-computational-linguistics/" href="https://naacl2018.wordpress.com/2017/12/19/putting-the-linguistics-in-computational-linguistics/" rel="noopener" target="_blank">Putting the Linguistics in Computational Linguistics</a>, I want to apply some of her thoughts to the data from the recently opened Kaggle competition <a class="markup--anchor markup--p-anchor" data-href="https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge" href="https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge" rel="noopener" target="_blank">Toxic Comment Classification Challenge</a>.</span></span></h3>
<blockquote class="tr_bq">
<h3 class="graf graf--h3" name="5172">
<span style="font-size: small;"><span style="font-weight: normal;">Dr. Bender's point: <i>not all NAACL papers need to be linguistically informative—but they should all be linguistically informed</i>. </span></span></h3>
<h3 class="graf graf--h3" name="5172">
<span style="font-size: small;"><span style="font-weight: normal;">Toxic Comment Challenge Correlate: </span></span><span style="font-size: small;"><span style="font-weight: normal;"><i>not all Kaggle submissions need to be linguistically informative—but they should all be linguistically informed</i></span></span></h3>
</blockquote>
<h3 class="graf graf--h3" name="5172">
<span style="font-size: small;"><span style="font-weight: normal;"></span></span></h3>
<h3 class="graf graf--h3" name="5172">
<span style="font-size: small;"><span style="font-weight: normal;">First let me say I'm really excited about this competition because A) it uses real language data, B) it's not obvious what techniques will work well (it's a novel/interesting problem), and C) it's a socially important task to get right. Kudos to Kaggle for supporting this. </span></span><span style="font-size: small;"><span style="font-weight: normal;"> </span></span></h3>
<h3 class="graf graf--h3" name="5172">
<span style="font-size: small;"><span style="font-weight: normal;"> </span></span></h3>
<h3 class="graf graf--h3" name="5172">
<span style="font-size: small;"><span style="font-weight: normal;">Dr. Bender identifies four steps to put linguistics into CL</span></span></h3>
<blockquote class="tr_bq">
<span style="font-size: small;">Step 1: Know Your Data<br />Step 2: Describe Your Data Honestly<br />Step 3: Focus on Linguistic Structure, At Least Some of the Time<br />Step 4: Do Error Analysis</span></blockquote>
<div class="graf graf--p" name="ad5e">
<span style="font-size: small;">Since this is preliminary, I’ll concentrate on just steps 1 & 3</span><br />
<br />
<span style="font-size: small;"><b>UPDATE 12/28/2017 #2: </b>Robert Munro has written up an excellent analysis of this data, <a href="http://www.junglelightspeed.com/a-step-in-the-right-direction-for-nlp/" target="_blank">A Step in the Right Direction for NLP</a>. He addresses many of the points I make below. He also makes some solid suggestions for how to account for bias. Well worth a read. <b><br /></b></span><br />
<br />
<span style="font-size: small;"><b>UPDATE 12/28/2017 #1</b>: My original analysis of their IAA process was mistaken. I have corrected it. Also added another comment under "Some Thoughts" at the end.</span></div>
<h4 class="graf graf--h4" name="37cf" style="text-align: center;">
<span style="font-size: large;">Step 1: Know Your Data</span></h4>
<div class="graf graf--p" name="bda7">
<span style="font-size: small;"><b>The data download </b></span></div>
<ul>
<li><span style="font-size: small;">“large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are: toxic, severe_toxic, obscene, threat, insult, identity_hate” (note: there are a large number of comments with no ratings; presumably these are neutral)</span></li>
<li><span style="font-size: small;">Actual download for Training set is a CSV of ID, comment, rating (1 or 0) for each label. </span></li>
<li><span style="font-size: small;">100k talk page diffs</span></li>
<li><span style="font-size: small;">Comments come from 2001–2015</span></li>
<li><span style="font-size: small;">There are 10 judgements per diff (final rating is the average)</span></li>
<li><span style="font-size: small;">Here’s the original paper discussing this work <a class="markup--anchor markup--p-anchor" data-href="https://arxiv.org/pdf/1610.08914.pdf" href="https://arxiv.org/pdf/1610.08914.pdf" rel="noopener" target="_blank">Ex Machina: Personal Attacks Seen at Scale</a> by Wulczyn, Thain, and Dixon. </span></li>
</ul>
<div class="graf graf--p" name="7a8b">
<span style="font-size: small;"><b>Who created the data?</b></span></div>
<ul>
<li><span style="font-size: small;">Wikipedia editors</span></li>
<li><span style="font-size: small;">Largely young white men (lots of <a href="https://i-d.vice.com/en_us/article/gyqbmx/the-feminist-editors-fighting-for-gender-equality-on-wikipedia?utm_campaign=sharebutton" target="_blank">research</a> on this)</span></li>
<ul>
<li><span style="font-size: small;">2011 WMF survey, 90% of Wikipedians are men, 9% are female, and 1% are transgender/transsexual</span></li>
<li><span style="font-size: small;">53% are under 30 (see <a href="https://en.wikipedia.org/wiki/Wikipedia:Wikipedians#Demographics" target="_blank">here</a> for more info) </span></li>
</ul>
</ul>
<span style="font-size: small;"><b>Who rated the data?</b></span><br />
<ul>
<li><span style="font-size: small;">3591 Crowdflower participants (you have to dig into the paper and associated repos to discover this)</span></li>
<li><span style="font-size: small;">This is a huge number of raters. </span></li>
<li><span style="font-size: small;">Looking at the demographics spreadsheet available <a href="https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973" target="_blank">here</a>, most are not native English speakers. I'm surprised this was allowed</span></li>
<li><span style="font-size: small;">Only18% were native speakers of English</span></li>
<li><span style="font-size: small;">Only three gender options were offered (male, female, other). Only one user chose other</span></li>
<li><span style="font-size: small;">54% are under 30</span></li>
<li><span style="font-size: small;">65% chose male as their gender response</span></li>
<li><span style="font-size: small;">68% chose bachelors, masters, doctorate, or professional as level education</span></li>
<li><span style="font-size: small;">I'm admittedly confused by how they obtained their final ratings because the paper discusses a slightly different task (not involving the six labels specifically), and the authors' wiki discussion of annotation and data release (<a href="https://meta.wikimedia.org/wiki/Research:Detox/Data_Release#Annotations_Corpora" target="_blank">here</a>) also discusses what look like slightly different tasks. Were these six labels aggregated post hoc over different tasks using the same data? I don't see a clear example of asking human raters to give judgments for these six specific labels. I'll keep digging.</span></li>
<li><span style="font-size: small;">In general, I find the labels problematic (Dixon admitted to some of this in a discussion <a href="https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/46131" target="_blank">here</a>). </span></li>
</ul>
<div class="graf graf--p" name="f7ea">
<span style="font-size: small;"><b>Language varieties (in comments data)?</b></span></div>
<ul>
<li><span style="font-size: small;">Lightly edited English, mostly North American (e.g., one comment said they "corrected" the spelling of “recognised” to “recognized)</span></li>
<li><span style="font-size: small;">Some commenters self-identify as non-native speakers </span></li>
<li><span style="font-size: small;">Lots of spelling typos </span></li>
<li><span style="font-size: small;">Mixed punctuation usage</span></li>
<li><span style="font-size: small;">Some meta tags in comments</span></li>
<li><span style="font-size: small;">Inconsistent use of spaces before/after commas</span></li>
<li><span style="font-size: small;"><span style="font-size: small;">Inconsistent</span> use of caps (</span><span style="font-size: small;"><span style="font-size: small;">inconsistent for </span>both all caps and camel case)</span></li>
<li><span style="font-size: small;">Some examples of ACSII art</span></li>
<li><span style="font-size: small;">Excessive apostrophe usage (one comment has several hundred apostrophes)</span></li>
<li><span style="font-size: small;">Length of comments vary considerably</span></li>
<ul>
<li><span style="font-size: small;">From five words, to hundreds of words </span></li>
<li><span style="font-size: small;">This might have a spurious effect on the human raters</span></li>
<li><span style="font-size: small;">Humans get tired. When they encounter a long comment, they may be tempted to rely solely on the first few sentences, or possibly the mere presence of some key words.</span></li>
</ul>
<li><span style="font-size: small;">Some character encoding issues: †</span></li>
</ul>
<div style="text-align: center;">
<br /></div>
<div style="text-align: center;">
<span style="font-size: large;"><b>Step 3: </b></span></div>
<div style="text-align: center;">
<span style="font-size: large;"><b>Focus on Linguistic Structure, At Least Some of the Time</b></span></div>
<br />
<span style="font-size: small;"><b>Lexicon</b></span><br />
<ul>
<li><span style="font-size: small;">A did a quick skim of a few dozen of the neutral comments (those with 0s for all labels) and it looked like they had no swear words.</span></li>
<li><span style="font-size: small;">I fear this will lead a model to over-learn that the mere presence of a swear word means it should get a label.</span></li>
<li><span style="font-size: small;">See the excellent blog <a href="https://stronglang.wordpress.com/" target="_blank">Strong Language</a> for reasons this is not true. </span></li>
<li><span style="font-size: small;">I would throw in some extra training data that includes neutral comments with swear words. <b><br /></b></span></li>
</ul>
<span style="font-size: small;"><b>Morphology</b></span><br />
<ul>
<li><span style="font-size: small;">Perhaps due to non-native speakers or typos, the wrong morphological variants can be found for some words (e.g., "<i>I am not the only one who finds the article too ridiculous and <u><b>trivia</b></u> to be included</i>.")</span></li>
<li><span style="font-size: small;">Lemmatizing can help (but it can also hurt)</span></li>
</ul>
<span style="font-size: small;"><b>Syntax</b></span><br />
<ul>
<li><span style="font-size: small;">Some comments lack sentence ending punctuation, leading to run ons. (e.g., "<i>It was very constructive you are just very very stupid</i>.") </span></li>
<li><span style="font-size: small;">Large amount of grammatical errors</span></li>
<li><span style="font-size: small;">If any parsing is done (shallow or full), could be some issues </span></li>
<li><span style="font-size: small;">A Twitter trained parser might be useful for the short comments, but not the long ones.</span></li>
</ul>
<span style="font-size: small;"><b><span style="font-size: x-small;">Discourse</span></b></span><br />
<ul>
<li><span style="font-size: small;">All comments are decontextual</span><span style="font-size: small;"> </span></li>
<li><span style="font-size: small;">Missing the comments before reduces information (many comments are responses to something, but we don't get the something)</span><span style="font-size: small;"> </span></li>
<li><span style="font-size: small;">Also missing the page topic</span><span style="font-size: small;"> that the comment is referring to</span></li>
<li><span style="font-size: small;">Hence, we're missing discourse and pragmatic links </span></li>
</ul>
<span style="font-size: small;"><br /></span>
<span style="font-size: small;"><b>Some Thoughts</b></span><br />
<span style="font-size: small;">The data is not simply <i>toxic comments</i>, but it’s the kind of toxic comments that young men make to other young men. And the labels are what young, educated, non-native English speaking men think counts as toxic. </span><br />
<span style="font-size: small;"><br /></span>
<span style="font-size: small;">My gut feeling is that the generalizability of any model trained on this data will be limited. </span><br />
<br />
<span style="font-size: small;">My main concern though is the lack of linguistic context (under Discourse). Exactly what counts as "toxic" under these de-contextual circumstances? Would the rating be the same if the context were present? I don't know. My hunch is that at least some of these ratings would change.</span><br />
<span style="font-size: large;"><br /></span><span style="font-size: large;">
</span><span style="font-size: small;"><br /></span>Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com4tag:blogger.com,1999:blog-520807396714463309.post-82709480619948957792017-10-18T17:13:00.000-04:002017-10-18T20:16:40.930-04:00A linguist asks some questions about word vectorsI have at best a passing familiarity with word vectors, strictly from a 30,000 foot view. I've never directly used them outside a handful of toy tutorials (though they have been embedded in a variety of tools I use professionally). There is something very linguisticee about word vectors, so I'm generally happy with their use. The basic idea goes back aways, but was most famously articulated by J. R. Firth's pithy phrase "<em>You shall know a word by the company it keeps</em>" (c. 1957).<br />
<br />
This is a set of questions I had when reading Chris McCormick's tutorial <a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/" target="_blank">Word2Vec Tutorial - The Skip-Gram Model</a>. My ideal audience is NLPers who have a good understanding of the SOTA of word vectors. <br />
<div>
Quick review of the tutorial:</div>
<ul>
<li><strong>Key Take Away</strong>: Word Vectors = the output probabilities that relate how likely it is to find a vocabulary word nearby a given input word.</li>
<li><strong>Methods</strong>: ngrams for training, order doesn't matter</li>
<li><strong>Insight</strong>: If two words have similar contexts, then the network is motivated to learn similar word vectors for the two words.</li>
</ul>
<div>
<strong><u>Questions</u></strong></div>
<ul>
<li><strong>Why sentences?</strong></li>
<ul>
<li>When choosing word window size, word vectors respect sentence boundaries. Why?</li>
<ul>
<li>By doing this, vectors are modeling edited, written language.</li>
<li>Vectors are assuming the semantics of sentences are coherent in some way.</li>
<li>Vectors are, in essence, relying on sentences to be semantically coherent. As long as they are, the method works, but when they aren’t, how does this method break down?</li>
</ul>
<li>This is an assumption often broken in spoken language with dissfluencies, topic shifts, interruptions, etc</li>
<li>How else might spoken language models differ? </li>
<li>How might methods change to account for this</li>
<li>Do words from previous sentences have semantic relationships to the words in following sentences? (Yes. Because discourse.) </li>
<li>Why miss out on that information?</li>
</ul>
<li><strong>Why the middle?</strong></li>
<ul>
<li>Vectors use words “in the middle of a sentence”? Why? </li>
<li>Do words that tend to occur in the middle of sentences differ in meaningful ways than those that do not? (Possibly. E.g, verbs and prepositions rarely begin or end sentences. Information structures affects internal ordering of constituents in interesting ways).</li>
<li>Did they look at the rate of occurrences of each unique lexical entry in-the-middle and near the peripheries?</li>
</ul>
<li><strong>Synthetic Languages</strong>: </li>
<ul>
<li>How well would this word vectors work for a synthetic language like Mohawk?</li>
<li>What pre-processing steps might need to be added/modified?</li>
<li>Would those modifications be enough to get similar/useful results?</li>
</ul>
<li><strong>Antonyms</strong>: Can the same method learn the <u>difference</u> between “ant” and “ants”?</li>
<ul>
<li>Quote: “<em>the network will likely learn similar word vectors for the words “ant” and “ants”</em></li>
<li>I haven't read it yet, but these folks seem to use word vectors to learn antonyms: <a href="https://www.aclweb.org/anthology/N15-1100" target="_blank">Word Embedding-based Antonym Detection using Thesauri and Distributional Information</a>.</li>
</ul>
</ul>
This was a fun read. <br />
<br />
<br />
The original tutorial:<br />
McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from <a href="http://www.mccormickml.com/">http://www.mccormickml.com</a><br />
<div>
</div>
Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com3tag:blogger.com,1999:blog-520807396714463309.post-9312646126872705002017-09-14T18:47:00.004-04:002017-09-14T18:47:48.975-04:00Nuts and Bolts of Applying Deep Learning (Andrew Ng) I recently watched Andrew Ng's excellent lecture from 2016 <em>Nuts and Bolts of Applying Deep Learning</em> and took notes. I post them as a helpful resource for anyone who wants to watch the video.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe width="320" height="266" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/F1ka6a13S9I/0.jpg" src="https://www.youtube.com/embed/F1ka6a13S9I?feature=player_embedded" frameborder="0" allowfullscreen></iframe></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
I broke it into the following sections<br />
<ol>
<li>End-to-End DL for rich output</li>
<li>Buckets of DL</li>
<li>Bias and variance</li>
<li>Applied machine learning work flow</li>
<li>New Era of ML </li>
<li>Build a Unified Data Warehouse</li>
<li>The 70/30 Split Revisited</li>
<li>Comparing to Human Level Performance</li>
<li>How do you define human level performance?</li>
<li>How do you build a career in machine learning? </li>
<li>AI is the new electricity</li>
</ol>
<strong>Intro</strong><br />
<ol>
<li>End to end DL – work flow</li>
<li>Bias and variance has changed in era of deep learning</li>
<li>DL been around for decades, why do they work well now?</li>
<ul>
<li>Scale of data and computation</li>
<li>Two teams</li>
<ul>
<li>AI Teams </li>
<li>Systems team</li>
<li>Sit together</li>
<li>Difficult for any one human to be sufficiently expert in multiple fields</li>
</ul>
</ul>
</ol>
<strong>End-to-End DL for rich output</strong><br />
<ol><ul>
<li>From first three buckets below</li>
<li>Traditional ML models output real numbers</li>
<li>End-to-end DL can out put more complex things than numbers</li>
<ul>
<li>Sentence captions for images</li>
<li>Speech-to-text</li>
<li>Machine translation</li>
<li>Synthesize new images (13:00)</li>
</ul>
<li>End-to-End DL not the solution to everything.</li>
<ul>
<li>End-to-end = having just a DL between input and output</li>
<li>Rules for when to use (13:35)</li>
<ul>
<li>Old way: audio ------> phonemes --> transcript</li>
<li>New DL way: audio -----------------> transcript </li>
</ul>
<li>Makes for great PR, but only works some times (15:31)</li>
<li>Achilles heel – need lots of labeled data</li>
<li>Maybe phonemes are just a fantasy of linguists (15:48)</li>
<li>Advantage of old non-end-to-end architecture is it allows you to manually add more information to the processing (18:16)</li>
<li>Also, for self-driving cars, no one has enough data (right now) to make end-to-end work) (20:42)</li>
</ul>
<li>Common problem – after first round of dev, ML not working that well, what do you do next?</li>
<ul>
<li>Collect more data</li>
<li>Train longer</li>
<li>Different architecture (e.g., switch to NNs)</li>
<li>Regularization</li>
<li>Bigger model</li>
<li>More GPUs</li>
</ul>
<li>Skill in ML engineer is knowing how to make these decisions (22:33)</li>
</ul>
</ol>
<div>
<strong>Buckets of DL</strong></div>
<ol>
<li>General models</li>
<ul>
<li>Densely connected layers – FC</li>
<li>Sequence models – 1D (RNN, LSTM, GRU, attention)</li>
<li>Image models – 2D, 3D (Convo nets)</li>
<li>Other – unsupervised, reinforcement </li>
</ul>
<li>First three buckets driving market advances</li>
<li>But "Other" bucket is future of AI</li>
</ol>
<div>
</div>
<div>
<strong>Bias and variance – evolving</strong></div>
<ol>
<li>Scenario: build human level speech rec system</li>
<ul>
<li>Measure human level error – 1</li>
<li>Training set error – 5%</li>
<li>Dev set – 6%</li>
</ul>
<li>Bias = difference between human error level and your system’s</li>
<li>TIP: For bias problems try training a bigger model (25:21)</li>
<li>Variance (overfitting): if Human 1%, Training 2%, Dev 6%</li>
<li>TIP: for variance, try adding regularization, early stopping, best bet = more data</li>
<li>Both high bias and high variance: if Human 1%, Training 5%, Dev 10%</li>
<li>“<em>sucks for you</em>” (direct quote 26:30)</li>
</ol>
<div>
<strong>Applied machine learning work flow</strong></div>
<ol>
<li>Is your training error high</li>
<ul>
<li>Yes</li>
<ul>
<li>Bigger model</li>
<li>Train longer</li>
<li>New architecture</li>
<li>Repeat until doing well on training set</li>
</ul>
</ul>
<li>Is dev error high?</li>
<ul>
<li>Yes</li>
<ul>
<li>Add data</li>
<li>Regularization</li>
<li>New architecture</li>
<li>Repeat until doing well on training set</li>
</ul>
</ul>
<li>Done</li>
</ol>
<div>
<strong>New Era of ML</strong> </div>
<ol>
<li>We now know whatever problem you are facing (high bias or high variance) you have at least one action you can take to correct</li>
<li>No longer a bias/variance trade-off (29:47)</li>
<li>“Dumb” formula of bigger model/more data is easy to implement even for non-experts and is enough to do very well on a lot of problems (31:09)</li>
<li>More data has led to interesting investments</li>
<ul>
<li>Data synthesis - Growing area</li>
<li>Examples-</li>
<ul>
<li>OCR at Baidu</li>
<li>Take random image</li>
<li>Random word</li>
<li>Type random word in Microsoft Word</li>
<li>Use random font</li>
<li>You just created training data for OC</li>
<li>Still takes some human intervention, but lots of progress</li>
</ul>
<li>Speech recognition</li>
<ul>
<li>Take clean audio</li>
<li>Add random noise to background for more data</li>
<li>E.g., add car noise</li>
<li>Works remarkably well</li>
</ul>
<li>NLP</li>
<ul>
<li>Take ungrammatical sentences and auto-correct</li>
<li>Easy to create ungrammatical sentences programmatically</li>
</ul>
<li>Video games in RL</li>
</ul>
<li>Data synthesis has a lot of limits (36:24)</li>
<ul>
<li>Why not take cars from Grand Theft Auto and use that as training data for self-driving cars</li>
<li>20 cars in video game enough to give “realistic” impression to player</li>
<li>But 20 cars is very impoverished data set for self-driving cars</li>
</ul>
</ol>
<strong>Build a Unified Data Warehouse</strong><br />
<ol>
<li>Employees can be possessive of "their" data</li>
<li>Baidu- it’s not your data, it’s company data</li>
<li>Access rights can be a different issue</li>
<li>But warehouse everything together</li>
<li>Kaggle</li>
</ol>
<strong>The 70/30 Split Revisited</strong><br />
<ol>
<li>In academia, common for test/train to come from same distribution </li>
<li>But more ommon in industry for test and train to come from different distributions</li>
<ul>
<li>E.g., speech rec at Baid</li>
<ul>
<li>Speech enabled rear view mirror (in China)</li>
</ul>
<ul>
<li>50,000 hours of regular speech data</li>
</ul>
<ul>
<li>Data not from rear-view mirror interactions though</li>
</ul>
<ul>
<li>Collect another 10 hours of rear-view mirror scenario</li>
</ul>
<li>What do you do with the original 50,000 hours of not-quite right data?</li>
<ul>
<li>Old method would be to build a different model for each scenario</li>
</ul>
<ul>
<li>New era, one model for all data</li>
</ul>
<ul>
<li>Bad idea, split 50,000 into training/dev, use 10,000 as test. DON’T DO THIS.</li>
</ul>
<ul>
<li>TIP: Make sure dev and test are from same distro (boosts effectiveness)</li>
</ul>
<ul>
<li>Good Idea: make 50,000 train, split 10,000 into dev/test</li>
</ul>
<li>Dev set = problem specification</li>
<ul>
<li>Me: "dev set = problem you are trying to solve"</li>
</ul>
<li>Also, split off just 20 hours from 50,000 to create tiny “dev-train” set </li>
<ul>
<li>this has same distro as train</li>
</ul>
</ul>
<li>Mismatched train and dev set is problem that academia doesn’t work on much </li>
<ul>
<li>some work on domain adaptation, but not much (44:53)</li>
</ul>
<li>New architecture fix = “hail mary” (48:58)</li>
<li>Takes a long time to really grok bias/variance</li>
<ul>
<li>People who really understand bias/variance deeply are able to drive rapid progress in machine learning (50:33)</li>
</ul>
</ol>
<div>
<strong>Common Theme – Comparing to Human Level Performance</strong></div>
<ol>
<li>Common to achieve human level performance, then level off</li>
<li>Why?</li>
<ul>
<li>Audience: <em>Labels come from humans</em></li>
<li>Audience: <em>Researchers get satisfied with results</em> (the <em>laziness </em>hypothesis)</li>
<li>Andrew: theoretical limits (aka <strong><u>optimal error</u></strong> rate, Bayes rate)</li>
<ul>
<li>Some audio so bad, impossible to transcribe (phone call from a rock concert)</li>
<li>Some images so blurry, impossible to interpret</li>
</ul>
<li>Humans are really good at some things, so once you surpass human level accuracy, there’s not much room left to improve (54:38)</li>
</ul>
<li>While worse than humans, still ways to improve</li>
<ul>
<li>Get labels from humans</li>
<li>Error analysis</li>
<li>Estimate bias/variance effects</li>
</ul>
<li>For tasks that humans are bad at (say 30% error rate), really hard to find guidance on how to improve</li>
</ol>
<div>
<strong>How do you define human level performance?</strong></div>
<ol>
<li>Quiz: Which is the most useful definition? (101:000</li>
<ul>
<li>Example: Medical image reading</li>
<ol>
<li>Typical non-doctor error - 3%</li>
<li>Typical doctor – 1%</li>
<li>Expert doctor – 0.7%</li>
<li>Team of expert doctors – 0.5%</li>
</ol>
<li>Answer: Team of expert doctors is best because ideally you are using human performance to proxy <strong><u>optimal error</u></strong> rate.</li>
</ul>
</ol>
<div>
<strong>What can AI do?</strong> (106:30)</div>
<ol>
<li>Anything that a typical person can do in less than one second.</li>
<ul>
<li>E.g., Perception tasks</li>
<li>Audience: if a human can do it in less than a second, you can get a lot of data</li>
</ul>
</ol>
<strong>How do you build a career in machine learning</strong> (111:00)<br />
<ol>
<li>Andrew says he does not have a great answer (me: but he does have a good one)</li>
<ul>
<li>Taking a ML course</li>
<li>Attend DL school</li>
<li>Work on project yourself (Kaggle)</li>
<li>Mimic PhD student process</li>
<ul>
<li>Read a lot of papers (20+)</li>
<li>Replicate results</li>
</ul>
<li>Dirty work</li>
<ul>
<li>Downloading/cleaning data</li>
<li>Re-running someone’s code</li>
</ul>
<li>Don’t only do dirty work</li>
<li>PhD process + Dirty work = reliable</li>
<ul>
<li>Keep it up for a year</li>
<li>Competency </li>
</ul>
</ul>
</ol>
<div>
<strong>AI is the new electricity</strong> (118:00)</div>
<ol>
<li>Transforms industry after industry</li>
<li>Get in on the ground floor</li>
<li>NOTE: this is the title of his follow-up talk, which has a video link at the end of the one above.</li>
</ol>
<div>
</div>
Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-35332114344982199922017-04-08T18:40:00.001-04:002017-04-08T18:53:29.629-04:00NLPers: How would you characterize your linguistics background? <div dir="ltr">
That was the poll question my hero Professor Emily Bender
posed on Twitter on March 30th. 573 tweets later, a truly epic thread had
been created, pitting some of the most influential NLPers in the world
head to head in a contest of wit and debate that was exactly what the field
of NLP needed. Unfortunately, Twitter's app is hapless at displaying
the thread as a single conversation. But student
Sebastian Mielke used a Chrome extension called Treeverse to put
together the whole thread into a single, readable format, complete with
sections!</div>
<div dir="ltr">
<br />
If you are at all interested in NLP or linguistics, this is a must read: <a data-cke-saved-href="http://sjmielke.com/nlp-cl-megathread.htm" href="http://sjmielke.com/nlp-cl-megathread.htm">NLP/CL Twitter Megathread</a>.</div>
<div dir="ltr">
<br />
I
would be remiss if I didn't note my own small role in this. Emily's
poll was sparked by my own Tweet where I said I was glad the Linguistics
Society of America is starting their own Society for Computational
Linguistics because "the existing CL and NLP communities have gotten
farther and farther off the linguistics path"</div>
<div dir="ltr">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-Ttfs5KixXk8/WOlmwcZh0vI/AAAAAAAAAyw/9iX-9Ka2v5QM_kIaTKVuSjYzVd2CewccQCLcB/s1600/emilys_list.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://2.bp.blogspot.com/-Ttfs5KixXk8/WOlmwcZh0vI/AAAAAAAAAyw/9iX-9Ka2v5QM_kIaTKVuSjYzVd2CewccQCLcB/s400/emilys_list.PNG" width="307" /></a></div>
<div dir="ltr">
</div>
<div dir="ltr">
Happy reading</div>
<div dir="ltr">
<br /></div>
Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com6tag:blogger.com,1999:blog-520807396714463309.post-57489748189245939462017-03-25T11:01:00.002-04:002017-03-25T11:01:46.445-04:00Small Towns Do Tech Startups TooI'm participating in the Startup Weekend event in Chico CA sponsored by <a href="http://chicostart.com/" rel="nofollow noopener" target="_blank">Chicostart</a>.
Great first night. The 60 second pitches were awesome. So much variety
and interesting problems to solve. I pitched my own idea about using
Watson to identify fake news, but I only got two blue stickers, boo hoo. I am very impressed with the Build.com space that hosted us. They are
a multi-use space that includes a lot of tech startups. Kudos for
putting such a great state-of-the-art space in a small college town. <br />
<br />
<div class="prose" itemprop="articleBody">
I was particularly impressed with the range of pitchers: Millennials,
students, professionals, middle-aged folks, and at least one senior. Not
bad for a small town with big dreams. <br />
<div class="separator" style="clear: both; text-align: center;">
<div style="margin-left: 1em; margin-right: 1em;">
<img height="300" src="https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAA3QAAAAJGNmYTM3YTllLTdlZDUtNDZmOS1iYmU3LTIyNGM3NDE3YmFhZg.jpg" width="400" />
</div>
<br />
<div class="slate-resizable-image-embed slate-image-embed__resize-full-width" data-imgsrc="https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAA3QAAAAJGNmYTM3YTllLTdlZDUtNDZmOS1iYmU3LTIyNGM3NDE3YmFhZg.jpg">
</div>
<div class="separator" style="clear: both; text-align: center;">
<div style="margin-left: 1em; margin-right: 1em;">
<img height="300" src="https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAmMAAAAJGE1YjIwNGZhLTRlNzgtNGU4Mi1iNjk1LTNjNWVhMmY1NjhmMg.jpg" width="400" />
</div>
<br />
<div class="slate-resizable-image-embed slate-image-embed__resize-full-width" data-imgsrc="https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAmMAAAAJGE1YjIwNGZhLTRlNzgtNGU4Mi1iNjk1LTNjNWVhMmY1NjhmMg.jpg">
</div>
<div class="separator" style="clear: both; text-align: center;">
<div style="margin-left: 1em; margin-right: 1em;">
<img height="300" src="https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAptAAAAJGU2NDYyMWJhLTI5NTktNDY1MS04Mjg5LTAwN2M0NjNmZmQ1Nw.jpg" width="400" />
</div>
<br />
<div class="slate-resizable-image-embed slate-image-embed__resize-full-width" data-imgsrc="https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAptAAAAJGU2NDYyMWJhLTI5NTktNDY1MS04Mjg5LTAwN2M0NjNmZmQ1Nw.jpg">
</div>
<div style="text-align: left;">
FYI - I'll be helping build a startup this weekend focused on smart housing devices. Will be a lot of fun. </div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-zk31SwJ7FqA/WNaGC2kz87I/AAAAAAAAAyY/fgY02AreGDQNFY4st1AlTFS3qYtSU1pHQCLcB/s1600/IMG_2782.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="226" src="https://2.bp.blogspot.com/-zk31SwJ7FqA/WNaGC2kz87I/AAAAAAAAAyY/fgY02AreGDQNFY4st1AlTFS3qYtSU1pHQCLcB/s320/IMG_2782.jpg" width="320" /></a></div>
</div>
</div>
</div>
</div>
Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com1tag:blogger.com,1999:blog-520807396714463309.post-33646462973223141412017-03-24T17:02:00.001-04:002017-03-24T17:02:34.759-04:00Chico Startup WeekendI'm looking forward to participating in <a href="http://www.up.co/communities/usa/chico/startup-weekend/9954" target="_blank">Chico Startup Weekend</a>, sponsored by <a href="http://chicostart.com/" target="_blank">Chicostart</a>. I can't say what the weekend holds, but it's nice to know there's this kind of energy and drive in small town America. Startups ain't just for Austin and Mountain View. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-BFJiU3cpU-g/WNWJQPTs_gI/AAAAAAAAAxg/SjPcHMZ6ZSU0Gp_-eatxCyPQJEaxV52TwCLcB/s1600/chico_startup_weekend.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="136" src="https://4.bp.blogspot.com/-BFJiU3cpU-g/WNWJQPTs_gI/AAAAAAAAAxg/SjPcHMZ6ZSU0Gp_-eatxCyPQJEaxV52TwCLcB/s320/chico_startup_weekend.PNG" width="320" /></a></div>
Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-89152006593230072862017-03-20T01:01:00.000-04:002017-03-22T15:11:45.972-04:00Using IBM Watson Knowledge Studio to Train Machine Learning ModelsUsing the Free Trial version of IBM's <a href="https://www.ibm.com/us-en/marketplace/supervised-machine-learning" target="_blank">Watson Knowledge Studio</a>, I just annotated a text and created a machine learning model in about 3 hours without writing a single line of code. The mantra of WKS is that you don't program Watson, you teach Watson.<br />
<br />
For demo purposes I chose to identify personal relationships in Shirley Jackson's 1948 short story <a href="https://en.wikipedia.org/wiki/The_Lottery" target="_blank">The Lottery</a>. This is a haunting story about a small village and its mindless adherence to an old, and tragic tradition. I chose it because 1) it's short and 2) it has clear person relationships like brothers, sisters, mothers, and fathers. I added a few other relations like AGENT_OF (which amounts to subjects of verbs) and X_INSIDE_Y for things like pieces of paper inside a box.<br />
<blockquote class="tr_bq">
<strong><span style="color: blue;">Caveat</span></strong>: This short story is <em><strong>really</strong></em> short: 3300 words. So I had no high hopes of getting a <em><strong>good</strong></em> model out of this. I just wanted to go through an entire machine learning work flow from gathering text data to outputting a complete model <strong><u>without</u></strong> writing a single line of code. And that's just what I did. </blockquote>
<br />
<br />
<strong><u>WORK FLOW</u></strong><br />
I spent about 30 minutes prepping the data. E.g., I broke it into 20 small snippets (to facilitate a test/train split later) and edited some quotation issues, spelling, etc. <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-1q4Oa9K_AYc/WM9ek8c2iqI/AAAAAAAAAxI/R-x-zF3EOkAlqJ8Nmdb8SL6tk09873KHgCEw/s1600/documents.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://1.bp.blogspot.com/-1q4Oa9K_AYc/WM9ek8c2iqI/AAAAAAAAAxI/R-x-zF3EOkAlqJ8Nmdb8SL6tk09873KHgCEw/s320/documents.PNG" width="129" /></a></div>
It uploaded into WKS in seconds (by simply dragging and dropping the files into the browser tool). I then created a Type System to include entity types such as these:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-TEFQi4vnDXw/WM9W4Rl-spI/AAAAAAAAAv4/q42vpSPwkMs_hxzxRcXB9MLaoNTRBMtVQCLcB/s1600/entity_types.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="390" src="https://2.bp.blogspot.com/-TEFQi4vnDXw/WM9W4Rl-spI/AAAAAAAAAv4/q42vpSPwkMs_hxzxRcXB9MLaoNTRBMtVQCLcB/s400/entity_types.PNG" width="400" /></a></div>
<br />
And relation types such as these:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-fWEqK-Z7oBo/WM9XBkM1fPI/AAAAAAAAAv8/yV9u2-503moJ33frEqrjUrSpHJ-3nBr0QCLcB/s1600/relation_types.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://4.bp.blogspot.com/-fWEqK-Z7oBo/WM9XBkM1fPI/AAAAAAAAAv8/yV9u2-503moJ33frEqrjUrSpHJ-3nBr0QCLcB/s400/relation_types.PNG" width="363" /></a></div>
I then annotated the 20 short documents in less than two hours (as is so often the case, I re-designed my type system several times along the way; luckily WKS allows me to do this fluidly without having to re-annotate). <br />
<br />
Here's a shot of my entity annotations:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-uLuqd9bE1E0/WM9X7IFsdqI/AAAAAAAAAwE/JPFMe50Gc_Eii285iDivtpQMSms5nv_WQCLcB/s1600/entity_annotations.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="157" src="https://4.bp.blogspot.com/-uLuqd9bE1E0/WM9X7IFsdqI/AAAAAAAAAwE/JPFMe50Gc_Eii285iDivtpQMSms5nv_WQCLcB/s400/entity_annotations.PNG" width="400" /></a></div>
<br />
Here's a shot of my relation annotations:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-O2n1kgCyGUY/WM9YBj3dFqI/AAAAAAAAAwI/E7ZNr0O7Zd0wGOkLSH5zFBDD6vw_vbdNwCLcB/s1600/relations_annotations.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="185" src="https://3.bp.blogspot.com/-O2n1kgCyGUY/WM9YBj3dFqI/AAAAAAAAAwI/E7ZNr0O7Zd0wGOkLSH5zFBDD6vw_vbdNwCLcB/s400/relations_annotations.PNG" width="400" /></a></div>
<br />
I then used these manually annotated documents as ground truth to teach a machine learning model to recognize the relationships automatically using a set of linguistic features (character and word ngrams, parts-of-speech, syntactic parses, etc). I accepted the WKS suggested split of documents as 70/23/2:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-GXWBbpZEiLU/WM9ZBWekSLI/AAAAAAAAAwU/gBFqqv8dwBUsUuxZHjmcyTuDv2VD_Bl3wCLcB/s1600/train_test.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="135" src="https://3.bp.blogspot.com/-GXWBbpZEiLU/WM9ZBWekSLI/AAAAAAAAAwU/gBFqqv8dwBUsUuxZHjmcyTuDv2VD_Bl3wCLcB/s400/train_test.PNG" width="400" /></a></div>
I clicked "Run" and waited:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-Lb6wG1FOnUw/WM9ZZ7FSy4I/AAAAAAAAAwY/oKimx0xF8K0ubPSx0SVgSCZZjObFGu9EACLcB/s1600/model_run.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://1.bp.blogspot.com/-Lb6wG1FOnUw/WM9ZZ7FSy4I/AAAAAAAAAwY/oKimx0xF8K0ubPSx0SVgSCZZjObFGu9EACLcB/s400/model_run.PNG" width="284" /></a></div>
<br />
The model was trained and evaluated in about ten minutes. Here's how it performed on entity types:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-0xoLWyZ1FHY/WM9bpNOquAI/AAAAAAAAAwo/nvK6hCHSqh80yIj6blprAukwOr-a_mADACEw/s1600/entity_type_stats.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="273" src="https://1.bp.blogspot.com/-0xoLWyZ1FHY/WM9bpNOquAI/AAAAAAAAAwo/nvK6hCHSqh80yIj6blprAukwOr-a_mADACEw/s400/entity_type_stats.PNG" width="400" /></a></div>
<br />
And here's how it performed on relation types:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-1DcIr0BMW1I/WM9b-WepLVI/AAAAAAAAAww/SWQxxx-nq8wtIdDchU2WRM4rpJ8KK7obgCLcB/s1600/relation_types_stats.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="270" src="https://2.bp.blogspot.com/-1DcIr0BMW1I/WM9b-WepLVI/AAAAAAAAAww/SWQxxx-nq8wtIdDchU2WRM4rpJ8KK7obgCLcB/s400/relation_types_stats.PNG" width="400" /></a></div>
<br />
This is actually not bad given how sparse the data is. I mean, an F1 of .33 on X_INSIDE_Y from only 29 training examples on a first pass. I'll take it, especially since that one is not necessarily obvious from the text. Here's one example of the X_INSIDE_Y relation:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-MrvznVPTREs/WM9dm8Zo98I/AAAAAAAAAxA/czGczbk5TgYnGZAplGdpl6RRRQkZCUEEACLcB/s1600/x_inside_y_example.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="77" src="https://4.bp.blogspot.com/-MrvznVPTREs/WM9dm8Zo98I/AAAAAAAAAxA/czGczbk5TgYnGZAplGdpl6RRRQkZCUEEACLcB/s400/x_inside_y_example.PNG" width="400" /></a></div>
<br />
So I was able to train a model with 11 entity types and 81 relation types on a small corpus in less than three hours start to finish without writing a single line of code. I did not program Watson. I taught itChrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com3tag:blogger.com,1999:blog-520807396714463309.post-54859524984412401492017-03-09T10:44:00.001-05:002017-03-09T10:44:47.238-05:00Annotate texts and create learning modelsIBM Watson has released a free trial version of their online <a href="https://www.ibm.com/us-en/marketplace/supervised-machine-learning" target="_blank">Watson Knowledge Studio</a> tool. This is one of the tools I'm most excited about because it brings linguistic annotation, rule writing, dictionary creation, and machine learning together in a single user-friendly interface designed for non engineers to use. <br />
<br />
<div style="text-align: center;">
<a href="https://www.ibm.com/us-en/marketplace/supervised-machine-learning" target="_blank">Watson Knowledge Studio</a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-XcEYMRoJr24/WMF3i9H7i6I/AAAAAAAAAvM/rdLMkXjXJ6Ikpcl1pVBDZCxSEBzT3lo9gCLcB/s1600/wks.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="226" src="https://4.bp.blogspot.com/-XcEYMRoJr24/WMF3i9H7i6I/AAAAAAAAAvM/rdLMkXjXJ6Ikpcl1pVBDZCxSEBzT3lo9gCLcB/s400/wks.PNG" width="400" /></a></div>
<br />
This tool allows users to annotate documents with mentions, relations, and coreference, then learn a linguistic model of their environments. With zero computer science background. I've trained subject matter experts in several fields to use the WKS and I'm genuinely impressed. I'll put together a demo and post it sometime this weekend. Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-83272237437466045882016-10-14T14:43:00.001-04:002016-10-14T14:43:30.784-04:00one of the best preschools in chicoMy sister's preschool, <a href="https://kidsfirstpreschool.com/">Kids First Learning Center</a>, was voted one of <a href="https://kidsfirstpreschool.com/2016/10/14/voted-third-place-in-best-of-chico-chico-news-review/">the best prescools in Chico</a> in the local News and Review Best of Chico contest. Very coolChrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-9132275164824025742016-09-25T16:32:00.002-04:002016-09-25T16:32:15.180-04:00the value of play for preschool children A nice article came up in The Atlantic about the comeback of recess and play in elementary schools. My sister posted some thought about how she encourages play when she teaches her preschool kids: <a href="https://kidsfirstpreschool.com/2016/09/25/for-our-little-chico-preschool-play-and-education-are-two-sides-of-the-same-coin/">For our little Chico preschool, play and education are two sides of the same coin</a>.Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-77370579879834787172016-09-23T09:55:00.001-04:002016-09-23T10:02:02.367-04:00from preschool to scholar athleteOne of the most endearing things about being a teacher is seeing former students go on to achieve great things. My <a href="https://kidsfirstpreschool.com/2016/09/23/our-preschoolers-go-on-to-great-things/" target="_blank">sister's preschool</a> student Jack Emanuel has been named a Subway Scholar athlete. Very cool. Congratulations Jack.Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-50611731744196266132016-09-05T19:23:00.000-04:002016-09-23T09:50:53.190-04:00some thoughts on data analytics for a micro businessMy first thought as a new business owner trying to utilize data analytics is this: <b><i>how empty this all seems</i></b>. I don't need to <i>understand </i>my market right now, I need to break into it. How can I use data analytics to open doors? I don't have time for nuance right now, I need sign ups. I'm faced with one of the most basic problems in business: getting noticed. And I wanna do it on the cheap.<br />
<br />
Let me stipulate that this post relates only to a small slice of DA proper: low-cost advertising DA available from Google & Facebook.<br />
<br />
My sister just relocated her <a href="https://kidsfirstpreschool.com/" target="_blank">preschool, Kids First</a>, to a new town, Chico CA, and I'm helping her buy a domain, set up a website, and do some promotion using Google AdWords Express, Facebook ads, and some local print advertising.<br />
<br />
This is the first time I've been on <u>this</u> side of the data analytics equation: as a <u>consumer</u> of the analysis, not a producer. And my experience is telling.<br />
<br />
<i>Business model</i>: We're not a small business. We're a <b>micro business</b>. Employees = 1 (Miss Lori). We don't need to sell a million widgets or gain a million likes. The business model of our preschool remains very old fashioned: get ten kids signed up, average that head count annually, and we're successful.<br />
<br />
<i>The market</i>: Chico is a college town with 85,000 people in the city, about 200,000 in the larger area. There is a state university, a junior college, two hospitals, two high schools, two middle schools, and several elementary schools, as well as a robust farming economy (all employing professionals with kids, right?).<br />
<br />
<i>Advertising</i>: Bought a small print ad in the local weekly for four weeks ($200/wk). One-time ad in the university student paper for the first week of the semester (hoping to grab the attention of new faculty who might finger through it out of curiosity). Small Facebook ad ($50), small Google AdWords campaign ($120/month). My sister "boosted" the <a href="https://www.facebook.com/misslorisbigkidpreschool/" target="_blank">preschool's Facebook page</a> for a few days for $20. <br />
<br />
<i>Data Analytics</i>: FB ad = zero calls. FB said 6,322 people saw her boosted virtual tour video and 411 people clicked on it in just a few days (I'm suspicious of these numbers because Lori said she limited the ad to Chico. I would be shocked if that many Chicoans clicked her video in a couple days). Zero calls. I started the AdWords campaign on September 3. I limited the area to the Chico region (those 200k people) because I don't want to pay for clicks from people who are looking for preschools in San Francisco. We've gotten 4 clicks in about 48 hours. Zero calls.<br />
<br />
<a href="http://4.bp.blogspot.com/-0kBFTmAh94U/V83nI_49ztI/AAAAAAAAAuo/u31TP7k1oM0Buv0jPmaJnpp-JA64s_jsACK4B/s1600/adWordsSep5.PNG" imageanchor="1"><img border="0" height="175" src="https://4.bp.blogspot.com/-0kBFTmAh94U/V83nI_49ztI/AAAAAAAAAuo/u31TP7k1oM0Buv0jPmaJnpp-JA64s_jsACK4B/s400/adWordsSep5.PNG" width="400" /></a><br />
<br />
This is early, of course, but still, this all seems so empty because of the nature of our business. We don't need clicks or views or likes: we need parents to sign their kids up. There is a disconnect to me between the old fashioned, brick-and-mortar business reality of trying to profitably run a preschool and the virtual reality of these meager data analytics. The NLPer in me likes the fact that AdWords shows me which search words drive clicks, but the business owner in me says, where's the beef? I need a parent to pay me money. Don't care 'bout no clicks.<br />
<br />
I'm not in the business of advertising DA, but I have some appreciation for the tools and techniques. And even I'm frustrated trying to connect the dots. Or more to the point, I don't know how to use data analytics to go from clicks to phone calls (to actual sign ups).<br />
<br />
I get that using DA may prove useful in the long run, but first we need old fashioned visibility. Without that, nuance is worthless. I can only imagine how frustrated and perhaps infuriated many small business owners are at this same disconnect between their business models and the services offered. This experience has already deepened my appreciation for the customer experience in the DA ecosystem.<br />
<br />
But ultimately what I think is this: Data analytics, schmata analytics. I don't need a data scientist. I need Don Draper.Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com3tag:blogger.com,1999:blog-520807396714463309.post-37074666755274665262016-09-01T15:17:00.001-04:002016-09-01T15:17:32.754-04:00Newest and Greatest Preschool in ChicoMy sister has just re-located her preschool to Chico, CA. <a href="https://kidsfirstpreschool.com/" target="_blank">Kids First Learning Center</a>.<br />
<br />
<a href="http://1.bp.blogspot.com/-63pSwLz12dk/V8h-nWihlzI/AAAAAAAAAuU/p77qLWSjy9YIsEWgq8LNFxfLLxNyIj8lwCK4B/s1600/KF%2BSIGN%2BPIC.jpg" imageanchor="1"><img border="0" height="180" src="https://1.bp.blogspot.com/-63pSwLz12dk/V8h-nWihlzI/AAAAAAAAAuU/p77qLWSjy9YIsEWgq8LNFxfLLxNyIj8lwCK4B/s320/KF%2BSIGN%2BPIC.jpg" width="320" /></a><br />
<br />
<br />Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-1667274295488414022016-08-15T17:27:00.000-04:002016-08-15T17:27:00.185-04:00Yet Another Bad review of Suicide SquadThere are lots of reviews of Suicide Squad detailing how bad it is. This is another one. This is less a movie than a series of loosely related scenes. It had roughly three sections:<br />
<blockquote class="tr_bq">
A) <b>The set-up</b>: Character introductions. The movie begins with an amateurish method of introducing the characters that is literally one person listing their names and features followed by a flashback scene for each. Dumbest exposition structure ever. </blockquote>
<blockquote class="tr_bq">
B) <b>The Mission</b>: Load everybody up in a helicopter, give them weapons, send them along their way. This plays out as an extension of A where each person gets a montage playing with their toys of choice. </blockquote>
<blockquote class="tr_bq">
C) <b>Switcheroo</b>: Plot twist that changes the mission objective. Unfortunately, the movie fails to adequately set up this core plot point because the director spent so much time showing off weapons and Margot Robbie's arse that he forgot to have a character explain the mission in any memorable way. When the twist in the mission objective is "revealed" midway into the siege, it's more confusing than revelatory. At this point, no sane person is looking for coherence in the film anyway, so it hardly matters.</blockquote>
The directorial style can best be summed up as '<i>just keep everyone shooting guns, no one will notice the incoherence</i>'.Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-52075303678040066932016-07-11T12:53:00.000-04:002016-07-11T12:53:05.442-04:00Fun with Stanford's online demoI've long been a fan of Stanford's online parser demo, but now they've
outdone themselves with a demo page for their CoreNLP tools. Not only
does it take your text and show the parse and entities, it also lets you
develop a regex to capture your input text, including semantic regexes!<br />
<br />
This is just plain fun: <a href="http://corenlp.run/">http://corenlp.run/ </a><br />
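If you want to poke at the same annotators programmatically, the CoreNLP distribution ships with a server that speaks the same protocol as the demo page. Here is a minimal sketch in Python (it assumes you've downloaded CoreNLP and started the server locally on port 9000; the annotator list is just an example):<br />
<pre>
# Minimal sketch: query a locally running Stanford CoreNLP server.
# Start it first with something like:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
import json
import requests

text = "The Lousy Linguist parsed a sentence about Princeton linguistics."
props = {"annotators": "tokenize,ssplit,pos,ner,parse", "outputFormat": "json"}

resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data=text.encode("utf-8"),
)
doc = resp.json()
for sentence in doc["sentences"]:
    print(sentence["parse"])                    # constituency parse
    for tok in sentence["tokens"]:
        print(tok["word"], tok["pos"], tok["ner"])
</pre>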
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-qDi7p2kZH-U/V4POxiexwDI/AAAAAAAAAt4/lSjt6sWsRTgTQoOC-AsTa2RQr9HwDj20gCK4B/s1600/stanfordNLP.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="223" src="https://1.bp.blogspot.com/-qDi7p2kZH-U/V4POxiexwDI/AAAAAAAAAt4/lSjt6sWsRTgTQoOC-AsTa2RQr9HwDj20gCK4B/s320/stanfordNLP.png" width="320" /></a></div>
<br />
<br />Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-64748766890813235582016-06-19T21:04:00.000-04:002016-06-19T21:04:25.899-04:00IBM Watson at NAACL 2016There were several Twitter NLP flare-ups recently triggered by the contrast between academic NLP and industry NLP. I'm not going to re-litigate those arguments, but I will note that one IBM Watson question answering team anticipated this very thing in their current NAACL paper for the NAACL HLT 2016 Workshop on <a href="https://sites.google.com/a/colorado.edu/2016-naacl-ws-human-computer-qa/" target="_blank">Human-Computer Question Answering</a>.<br />
<br />
The paper is titled <a href="http://aclweb.org/anthology/W16-0101" target="_blank">Watson Discovery Advisor: Question-answering in an industrial setting</a>.<br />
<br />
<b>The Abstract</b><br />
<blockquote class="tr_bq">
<i>This work discusses a mix of challenges arising from Watson Discovery Advisor (WDA), an industrial strength descendant of the Watson Jeopardy! Question Answering system currently used in production in industry settings. Typical challenges include generation of appropriate training questions, adaptation to new industry domains, and iterative improvement of the system through manual error analyses</i>.</blockquote>
The paper's topic is not surprising given that four of the authors are PhDs (Charley, Graham, Allen, and Kristen). Hence, it was largely a group of fish out of water: they have an academic bent, but they wrestle daily with the real-world challenges of paying customers and very messy data.<br />
<br />
Here are five take-aways:<br />
<br />
<ol>
<li>Real-world questions and answers are far more ambiguous and domain-specific than academic training sets.</li>
<li>Domain tuning involves far more than just retraining ML models.</li>
<li>Useful error analysis requires deep dives into specific QA failures (as opposed to broad statistical generalizations).</li>
<li>Defining what counts as an error is itself embedded in the context of the customer's needs and the domain data. What counts as an error to one customer may be acceptable to another.</li>
<li>Quiz-Bowl evaluations are highly constrained, special cases of general QA, a point I made in 2014 <a href="http://thelousylinguist.blogspot.com/2014/09/neural-nets-and-question-answering.html" target="_blank">here </a>(pats self on back). The lessons learned there are of little value to the industry QA world (for now, at least).</li>
</ol>
<br />
I do hope you will read the brief paper in full (as well as the other excellent papers in the workshop).Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-14648130113364156172016-01-25T15:09:00.000-05:002016-01-26T13:33:43.553-05:00Genetic Mutation, Thick Data, and Human IntuitionThere are two stories trending heavily in my social network sites that are
seemingly unrelated, yet they share one obvious conclusion: the value of human
intuition in finding needles in big data haystacks. Reading them highlighted to me the
special role humans can still play in the emerging 21<sup>st</sup> century world of
big data.<br />
<br />
In the first story, <a href="http://www.theatlantic.com/health/archive/2016/01/genetic-mutation-patient-diagnosis-priscilla-lopes-schliep/424662/">The Patient Who Diagnosed Her Own Genetic Mutation—and an Olympic Athlete's</a>,
a woman with muscular dystrophy sees a photo of an Olympic sprinter’s bulging
muscles and thinks to herself, “she has the same condition I do.” What in the
world would cause her to think that? There is no pattern in the data that would
suggest this. The story is accompanied by a startling picture of two women who,
at first glance, look nothing alike. But once guided by the needle in the
haystack that this woman saw, a similarity is illuminated and eventually a
connection is made between two medically disparate facts that, once combined,
opened a new path of inquiry into muscle growth and dystrophy that is now a
productive area of research. Mind you, no new chemical compound was discovered.
No new technique or method was built that allowed scientists to see something that couldn’t be seen before. Nope. Nothing *new* came into being; rather, a connection was found between two things, a connection that all the world’s experts had never made before. One epiphany by a human being looking for a needle in a haystack. And she found it.<br />
<br />
In the second story, <a href="https://medium.com/ethnography-matters/why-big-data-needs-thick-data-b4b3e75e3d7#.vin62xd38">Why Big Data Needs Thick Data</a>,
an anthropologist working closely to understand the user stories of just 100
Motorola cases discovers a pattern that Motorola’s own big data efforts missed.
How? Because his case-study approach emphasized context. Money quote:<br />
<blockquote class="tr_bq">
<em>For Big Data to be analyzable, it must use normalizing, standardizing,
defining, clustering, all processes that strips the the data set of context,
meaning, and stories. Thick Data can rescue Big Data from the context-loss that
comes with the processes of making it usable.</em></blockquote>
Traditional machine learning techniques are designed to find large patterns
in big data, but those same techniques fail to address the needle in the
haystack problem. This is where human intuition truly stands
apart. Both of these articles are well worth reading in the context of
discovering the gaps in current data analysis techniques that humans must fill.<br />
<br />
UPDATE: Here's a third story making a similar point: a human being using an automatically culled dictionary noticed a misogynist tendency in the examples it provided. <a href="https://debuk.wordpress.com/2016/01/26/a-rabid-feminist-writes/" target="_blank">A rabid feminist writes</a>…<br />
<br />
And here's a fourth: <a href="https://hbr.org/2016/01/algorithms-need-managers-too?utm_campaign=harvardbiz&utm_source=twitter&utm_medium=social" target="_blank">Algorithms Need Managers, Too</a>. Money quote: "<i>Google’s hard goal of maximizing clicks on ads had led to a situation in which its algorithms, refined through feedback over time, were in effect defaming people with certain kinds of names</i>."Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com3tag:blogger.com,1999:blog-520807396714463309.post-9391398521968352682016-01-10T17:19:00.000-05:002016-01-10T17:31:49.297-05:00Advice for linguistics grad students entering industryAt the LSA mixer yesterday I had the chance to chat with a dozen or so grad students in linguistics who were interested in non-academic jobs. Here I'll note some of the recurring themes and advice I gave.<br />
<br />
<b><span style="color: blue;">The First Job</span></b><br />
<span style="color: red;"><b>Advice</b></span>: Be on the look-out and know what a good opportunity looks like.<br />
<br />
Most students were very interested in the jump. How do you make that first transition from academics to industry? In general, you need to be in the market, actively looking, actively promoting yourself as a candidate. For me, it was a random posting on The Linguist List that caught my eye. In the summer of 2004 I was a bored ABD grad student. I knew I wasn't going to be competitive for academic jobs at that point, so I checked The Linguist List job board daily. One day I saw a posting from a small consulting company. They were looking for a linguist to help them create translation complexity metrics. They listed every sub-genre in linguistics as their requirements. This told me they really didn't know what they wanted. I saw that as an opportunity because I could sweep in and help them understand what they needed. I applied and after several phone calls I was asked to create a proposal for their customer. I had a conference call to discuss the proposal (I was in shorts and a t-shirt in an empty lab during the call, but they didn't know that). Long story short, I got the job*, moved to DC and spent about two years working as a consultant on that and other government contracts. That first job was a big step in moving into industry. I had very impressive clients, a skill set that was rare in the market, and a well-defined deliverable that I could point to as a success.<br />
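If you're curious what a passage-complexity metric even looks like, the footnote at the bottom of this post describes the actual problem; below is a toy sketch of the kind of surface features such a metric might start from. This is my own illustration, not the deliverable from that contract.<br />
<pre>
# Toy illustration only: surface features a passage-complexity metric might
# start from. Not the metric actually delivered on that contract.
import re

def surface_complexity(passage):
    sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
    tokens = re.findall(r"\w+", passage.lower())
    types = set(tokens)
    return {
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        "type_token_ratio": len(types) / max(len(tokens), 1),
        "long_word_ratio": sum(len(t) > 6 for t in tokens) / max(len(tokens), 1),
    }

print(surface_complexity("The cat sat. The remarkably industrious cat contemplated sitting."))
</pre>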
<br />
<br />
<b><span style="color: blue;">Visibility</span></b><br />
<b><span style="color: red;">Advice</span></b>: Make recruiters come to you. Maintain a robust LinkedIn profile and be active on the site on a weekly basis (so that recruiters will find you).<br />
<br />
Several students wondered if LinkedIn was considered legitimate. I believe it's fair to say that within the tech and NLP world, LinkedIn is very much legit. My <a href="https://www.linkedin.com/in/christopher-phipps-13124b2" target="_blank">LinkedIn profile</a> has been crucial to being recruited for multiple jobs, two of which I accepted. Algorithms are constantly searching this site for all kinds of jobs. In fact, most of the really good jobs for linguists are not posted on job sites, but rather are filled only by recruiters. So you need strategies for waving your flag and getting them to come to you. In the DC area, there are excellent opportunities for linguists at DARPA, CASL, IARPA, NIST, MITRE and RAND, and many other FFRDCs (federally funded research and development centers), but they rarely post these to job boards. You need them to find you. A good LinkedIn page is a great way to increase your visibility.<br />
<br />
Another way to increase your visibility is to go public with your projects. You can always blog descriptions and analysis. For computer science students, a <a href="https://github.com/" target="_blank">GitHub </a>account is virtually a requirement. I think linguists should follow their lead. You most likely write little scripts anyway. Maybe an R script to do some analysis, or a Python script to extract some data. Put those up on GitHub with a little README document. That's an easy place for tech companies to see your work. Also, if you have created data sets that you can freely distribute, put those up on GitHub too. I also recommend competing in a <a href="https://www.kaggle.com/competitions" target="_blank">Kaggle </a>competition. Kaggle sponsors many machine learning competitions. They provide data, set the requirements, and post results. It's a great way both to practice a little NLP and data science and to increase your visibility (and put your Kaggle competitions on your resume!). Here are two linguistically intriguing Kaggle competitions ready for you right now: <a href="https://www.kaggle.com/kaggle/hillary-clinton-emails" target="_blank">Hillary Clinton's Emails</a> (think about the many things you could analyze in those!); <a href="https://www.kaggle.com/benhamner/nips-2015-papers" target="_blank">NIPS 2015 Papers</a> (how can a linguist characterize a bunch of machine learning papers?).<br />
<br />
If you have managed to automate a process that you once did manually (either through an R script or maybe Excel formulas), write that up in a blog post. Automating manual processes is huge in industry. You know the messy nature of language data better than anyone else, so write some blog posts describing the kind of messiness you see and what you do about it. That's gold.<br />
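To make that concrete, here's the kind of tiny, blog-post-sized script I have in mind: a cleanup pass over a messy text file. The file name and the particular normalizations are made up for illustration; the point is that a dozen lines can replace an afternoon of hand-editing.<br />
<pre>
# Illustrative only: normalize whitespace, unify curly quotes, and drop exact
# duplicate lines in a messy text file ("transcripts_raw.txt" is a made-up name).
import unicodedata

def clean_lines(path):
    seen = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = unicodedata.normalize("NFC", line)
            line = line.replace("\u201c", '"').replace("\u201d", '"')
            line = " ".join(line.split())      # collapse runs of whitespace
            if line and line not in seen:
                seen.add(line)
                yield line

for line in clean_lines("transcripts_raw.txt"):
    print(line)
</pre>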
<br />
<br />
<b><span style="color: blue;">Resume</span></b><br />
<span style="color: red;"><b>Advice</b></span>: List tools and data sets. Do you use Praat? List that. Do you use the Buckeye Corpus? List that. Make it clear that you have experience with tools and data management. Those are two areas where tech companies always have work to perform, so make it clear that you can perform that work.<br />
<br />
<br />
<br />
*FYI, here's what the deal was with that first consultant job: The FBI tests lots of people as potential translators. So, for example, they will give a native speaker of Vietnamese several passages of Vietnamese writing (one that is simple, one that is medium complex, and one that is complex); then the applicant is asked to translate the passages into English. The FBI grades each translation. The problem was that the FBI didn't have a standardized metric for what counted as a complex passage in Vietnamese (or the many, many other languages that they hire translators for). They relied on experienced translators to recommend passages from work they had done in the past. Turns out, that was a lousy way to find example passages. The actual complexity of passages was wildly uneven, and there was no consistency across languages.Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com6tag:blogger.com,1999:blog-520807396714463309.post-70190908187596088742016-01-07T18:52:00.000-05:002016-01-07T18:57:04.335-05:00LSA 2016 Evening RecommendationsWith the LSA's annual convention officially underway, I've thrown together a list of a few restaurants and bars within a short walking distance of the convention center that grad students and attendees might want to enjoy. My walking estimates assume you are standing in front of the convention center.<br />
<br />
<a href="http://www.busboysandpoets.com/" target="_blank">Busboys and Poets</a> (4 blocks west at 5th & K) - A DC Institution. You will not be forgiven if you do not make at least one pilgrimage here.<br />
<br />
<a href="http://maddystaproom.com/" target="_blank">Maddy’s Taproom</a> (4 blocks east at 13th & L) - Good beer selection.<br />
<br />
<a href="http://www.lovethebeer.com/" target="_blank">RFD Washington</a> (4 blocks south at 7th & H) - Large bottled beer selection, good draft beer selection (food ain't that great).<br />
<br />
<a href="http://churchkeydc.com/" target="_blank">Churchkey </a>(6 blocks northeast at 14th & Rhode Island) - Officially, one of the best beer rooms in the US.<br />
<br />
<a href="http://www.stansrestaurant.com/" target="_blank">Stan's Restaurant</a> (7 blocks east at L & Vermont) - Downstairs, casual. Very strong drinks. Supposedly good wings (I'm a vegetarian, so I hold no opinion).<br />
<br />
<a href="http://daikaya.com/" target="_blank">Daikaya - Ramen - Izakaya</a> (7 blocks Southwest at 6th & G) - Upstairs bar can be easier to get into sometimes. It's a popular place.<br />
<br />
<a href="https://www.teaism.com/restaurant-details-40.html" target="_blank">Teaism</a>, Penn Quarter (8 blocks south at 8th & G) - Great snack place mid-way to the National Mall. Large downstairs dining area. Great place to have some tea, a snack, and catch up on conference planning.<br />
<br />
There are, of course, lots of other places within a short walk. I recommend 14th street in general. 9th street has some good stuff, especially as you get closer to U, but it's a little sketchy of a walk.<br />
<br />
<br />
<div>
<br /></div>
Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-59532473781471074882015-11-29T11:38:00.001-05:002015-11-29T16:23:14.553-05:00online psycholinguistics demos 2015I was asked recently about an <a href="http://thelousylinguist.blogspot.com/2008/07/online-psycholinguistics-experiments.html" target="_blank">old post from 2008</a> that listed a variety of online psycholinguistics demos. All of the links are dead now, so I was asked if I knew of any updated ones. This is what I can find. Any suggestions would be welcomed.<br />
<br />
<ul>
<li>Potsdam Research Institute for Multilingualism - Links to <a href="http://www.uni-potsdam.de/prim/research/resources.html" target="_blank">Psycholinguistics Sites & Resources</a></li>
</ul>
<ul>
<li>Harvard <a href="https://implicit.harvard.edu/implicit/featuredtask.html" target="_blank">Implicit Associations Task</a>: <i>Project Implicit is a non-profit organization and international collaboration between researchers who are interested in implicit social cognition - thoughts and feelings outside of conscious awareness and control. The goal of the organization is to educate the public about hidden biases and to provide a “virtual laboratory” for collecting data on the Internet.</i></li>
</ul>
<ul>
<li><a href="https://code.google.com/p/webspr/" target="_blank">webspr </a><i>- Conduct psycholinguistic experiments (e.g. self-paced reading and speeded acceptability judgment tasks) remotely using a web interface</i></li>
</ul>
<ul>
<li><a href="http://gameswithwords.org/" target="_blank">Games With Words</a><i>: Learn about language and about yourself while advancing cutting-edge science. How good is your language sense?</i></li>
</ul>
<ul>
<li><a href="http://www.psytoolkit.org/experiment-library/ldt.html" target="_blank">Lexical Decision Task</a> demo<i>: In a lexical decision task (LDT), a participant needs to make a decision about whether combinations of letters are words or not. For example, when you see the word "GIRL", you respond "yes, this is a real English word", but when you see the letters "XLFFE" you respond "No, this is not a real English word".</i></li>
</ul>
<ul>
<li><a href="http://psych.hanover.edu/Research/exponnet.html" target="_blank">Psychological Research on the Net</a>:<i> links seem to be updated regularly, but not a lot on linguistics.</i></li>
</ul>
<ul>
<li><a href="http://www.ling.gu.se/~anders/KatPer/Applet/index.eng.html" style="font-style: normal;" target="_blank">Categorical Perception</a><span style="font-style: normal;">: </span><i>Categorical perception means that a change in some variable along a continuum is perceived, not as gradual but as instances of discrete categories. The test presented here is a classical demonstration of categorical perception for a certain type of speech-like stimuli</i>.</li>
</ul>
Paul Warren has a variety of demos at the site for his textbook "Introducing Psycholinguistics".<br />
<br />
<blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;">
<br />
<li><a href="http://www.intro2psycholing.net/resources/McGurk/McGurk1.php" target="_blank">McGurk </a>demo</li>
</blockquote>
<blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;">
<br />
<li>Various <a href="http://www.intro2psycholing.net/experiments/expt_index.php" target="_blank">other demos</a> from Warren's textbook</li>
</blockquote>
<br />
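Related: the lexical decision task linked above needs surprisingly little machinery. Here's a bare-bones command-line sketch, illustrative only, with a made-up five-item stimulus list and none of the randomized timing, practice trials, or counterbalancing a real experiment would need:<br />
<pre>
# Bare-bones lexical decision sketch: show a letter string, time the y/n response.
# Illustrative only; the stimulus list is made up and far too short for real use.
import random
import time

stimuli = [("girl", True), ("xlffe", False), ("table", True),
           ("plonk", True), ("brif", False)]
random.shuffle(stimuli)

for letters, is_word in stimuli:
    start = time.perf_counter()
    answer = input(letters.upper() + "  -- real word? (y/n): ").strip().lower()
    rt_ms = (time.perf_counter() - start) * 1000
    correct = (answer == "y") == is_word
    print("  %s, %.0f ms" % ("correct" if correct else "wrong", rt_ms))
</pre>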
<div>
<br /></div>
Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-77793681824892083882015-11-14T09:46:00.000-05:002015-11-14T09:47:18.240-05:00Google's TensorFlow and "mathematical tricks"<a href="http://tensorflow.org/" target="_blank">TensorFlow </a>is a new open source software library for machine learning distributed by Google. In some ways, this could be seen as a competitor to <a href="https://console.ng.bluemix.net/" target="_blank">BlueMix </a>(though much less user friendly). Erik Mueller, who worked on the original Watson Jeopardy system (and has a vested interest in AI with his new company <a href="http://symbolicai.com/" target="_blank">Symbolic AI</a>), just wrote a brief review of TensorFlow for Wired.<br />
<br />
<a href="http://www.wired.com/2015/11/tensorflow-alone-will-not-revolutionize-ai/" target="_blank">Google’s TensorFlow Alone Will Not Revolutionize AI</a><br />
<br />
Unfortunately, it's not really a review of TensorFlow itself; rather, it makes a general point against statistical approaches, a point I basically agree with, though the argument requires a much more comprehensive treatment.<br />
<br />
Some good quotes from the article:<br />
<br />
<ul>
<li>"I think [TensorFlow] will focus our attention on experimenting with mathematical tricks, rather than on understanding human thought processes."</li>
<li>"I’d rather see us design AI systems that are understandable and communicative."</li>
</ul>
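For readers who haven't touched it: the "mathematical tricks" in question are mostly automatic differentiation plus gradient descent. Here is a toy sketch of what that looks like, written against the current TensorFlow Python API (which has changed a great deal since the 2015 release this post is about):<br />
<pre>
# Toy sketch of what TensorFlow automates: fit y = 3x by gradient descent.
# Written against the modern API, not the 2015-era graph/session API.
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0, 4.0])
y = tf.constant([3.0, 6.0, 9.0, 12.0])
w = tf.Variable(0.0)

for step in range(200):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((w * x - y) ** 2)
    grad = tape.gradient(loss, w)
    w.assign_sub(0.05 * grad)

print(float(w))   # approaches 3.0
</pre>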
Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com0tag:blogger.com,1999:blog-520807396714463309.post-38259589841801646642015-06-10T21:32:00.000-04:002015-06-10T21:36:15.864-04:00The Language Myth - Book ReviewLinguistics professor <a href="http://www.vyvevans.net/" target="_blank">Vyvyan Evans</a> recently published a new book that has at least <a href="http://facultyoflanguage.blogspot.com/2015/05/my-hopefully-last-ever-post-on-vyvyan.html" target="_blank">one group of linguists</a> in a state of frenzy: <a href="http://www.amazon.com/Language-Myth-Why-Not-Instinct/dp/1107619750/ref=sr_1_3?s=books&ie=UTF8&qid=1397238752&sr=1-3&keywords=Vyvyan+Evans" target="_blank">The Language Myth: Why language is not an instinct</a>. The book's blurb sums up its content:<br />
<blockquote class="tr_bq">
<i>Some scientists have argued that language is innate, a type of unique human 'instinct' pre-programmed in us from birth. In this book, Vyvyan Evans argues that this received wisdom is, in fact, a myth. Debunking the notion of a language 'instinct', Evans demonstrates that language is related to other animal forms of communication; that languages exhibit staggering diversity; that we learn our mother tongue drawing on general properties and abilities of the human mind, rather than an inborn 'universal' grammar; that language is not autonomous but is closely related to other aspects of our mental lives; and that, ultimately, language and the mind reflect and draw upon the way we interact with others in the world</i>. </blockquote>
<div>
Evans grounds his motivation in the idea that there are a variety of false claims about how language works ("myths") deeply rooted in our culture's background knowledge as well as explicated in introductory textbooks. He goes further to claim that these false claims have been pushed by a small number of pre-eminent scholars whose fame and influence have caused them to be taken more seriously than they deserve on their face.</div>
<div>
<br /></div>
<div>
By all rights, I should be a good audience for this book. I was trained as a linguist in a department that was openly hostile to the language instinct doctrine that this book argues against (see <a href="http://thelousylinguist.blogspot.com/2015/05/the-language-myth-preliminary-thoughts.html" target="_blank">my post</a> about that experience). </div>
<div>
<br /></div>
<div>
The book is organized by two principles. First, each chapter starts by stating one false claim and providing a description of why it was proposed as an explanation of how language works. Second, each chapter then deconstructs the myth into component claims and shoots holes in each one. </div>
<div>
<br /></div>
<div>
<b>The Good</b></div>
<div>
Evans does a service to the lay audience by pointing out that deep divisions exist within the field of linguistics. Too often non-experts assume a technical field is homogeneous and everyone agrees on the basic theories. This is simply not true of linguistics. </div>
<div>
<br /></div>
<div>
Evans also does a service to his audience by stepping through the logic of refutation. His point-counterpoint style can be detailed at times, but I appreciate a book that doesn't treat its readers like third graders (I'm looking at you, Gladwell).</div>
<div>
<br /></div>
<div>
For me, the standout chapter was 5: <i>Is language a distinct module in the mind</i>. This chapter is devoted to neurolinguistics and here Evans is at his sharpest when leading the reader through his point-counterpoint about brain regions and functionality.</div>
<div>
<br /></div>
<div>
<b>The Bad</b></div>
<div>
Evans fails to do justice to the myths he debunks. He was accused of creating straw men (and addresses this somewhat in the introduction), but ultimately I have to agree. Evans does not provide a fair description of arguments like <a href="http://en.wikipedia.org/wiki/Poverty_of_the_stimulus" target="_blank">poverty of the stimulus</a>. </div>
<div>
<br /></div>
<div>
Evans quickly shows his bias and directly attacks just two people: Noam Chomsky and Steven Pinker (and to a lesser extent, Jerry Fodor). Evans wants to debunk general notions that have crept into the general public's background beliefs about language, but what he really does is rail against two guys. And worse, he often devolves into a detailed point-counterpoint with just one book, Pinker's 1994 <a href="http://www.amazon.com/The-Language-Instinct-Mind-Creates/dp/0061336467" target="_blank">The Language Instinct</a>. Any reader unfamiliar with that book will quickly get drowned by arguments against claims they never encountered. As an exercise, I would recommend Evans rewrite this book without a single reference to Chomsky, Pinker, or Fodor. I suspect the result would be a more effective piece of writing. </div>
<div>
<br /></div>
<div>
Lest some Chomskyean take this review wrongly, let me be clear: I think Chomsky is broadly wrong and Evans is broadly right. But even though I believe Pinker is wrong and Evans is right, I find Pinker a far superior writer and seller of ideas. And that is a serious problem. </div>
<div>
<br /></div>
<div>
Evans would have been better off throwing away the anti-Chomsky rants and simply writing his view of how language works. A book on its own terms. Instead he comes across as your drunk uncle at Christmas who can't stop complaining about how the ref in a high school football game 20 years ago screwed him over with a bad call. This might actually be true, but get over it. </div>
<div>
<br /></div>
<div>
I feel Evans has taken on too much. Each myth is worth a small book itself to debunk properly. This is partly what leads to the straw man arguments. Efficiency. A non-straw man version of Evans' book would be 3000 pages long and only appeal to the three people in the world who know enough detail about both Chomsky and functionalist theory to properly understand all that detail. So I *get* why Evans chose this style. I just think Pinker is better at it. Ultimately Evans alienates his lay audience by ranting about people they don't know and arguments they are unfamiliar with. </div>
<div>
<br /></div>
<div>
One complaint about details: he can be disingenuous with citations. On page 110 he uses the wording "<i>the most recent version of Universal Grammar</i>", but turn to the footnote on 264 and he cites publications from 1981 and 1993. In a book published in 2015, citations from '81 and '93 hardly count as recent. See also page 116, where he cites "<i>a more recent study</i>" that was actually published in 2004 (and probably conducted in 2002).</div>
<div>
<br /></div>
<div>
I don't want to be critical of a book that argues a position I align with, but I must be honest. This book just doesn't cut it. </div>
<div>
<br /></div>
<div>
<br /></div>
Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com3tag:blogger.com,1999:blog-520807396714463309.post-29528205786753937772015-05-17T21:50:00.000-04:002015-05-20T17:47:46.635-04:00The Language Myth - Preliminary ThoughtsI started reading <a href="http://www.amazon.com/The-Language-Myth-Why-Instinct/dp/1107619750" target="_blank">The Language Myth: Why Language Is Not an Instinct </a>by Vyvyan Evans. This book argues that Noam Chomsky is wrong about the basic nature of language. The book has sparked controversy, and more words have probably been published in blogs and tweets in response than are contained in the actual book.<br />
<br />
I'm two chapters in, but before I begin posting my review, I wanted to do a post on academic sub-culture, specifically the one I was trained in. I did my (not quite completed) PhD in linguistics at SUNY Buffalo from 1998 to 2004. The students only half-jokingly called it Berkeley East because, at the time, about half the faculty had been trained at Berkeley (and several others were closely affiliated in research), and Berkeley is one of the great strongholds of anti-Chomsky sentiment. Buffalo was clearly a "functionalist" school (though no one ever really knew what that meant, functionalism never really being a field, more a culture).<br />
<br />
In any case, we were clearly, undeniably, virulently anti-Chomsky. And that's the culture I want to describe, to provide some sense of how different the associations with the name "Chomsky" are for me (and, I suspect, for Evans) than for non-linguists and for non-Chomskyan linguists.<br />
<br />
So what was it like to be a grad student in a functionalist linguistics department, with respect to Noam Chomsky?<br />
<br />
[SPOILER ALERT - inflammatory language below. Most of this post is intended to represent a <i>thought climate</i> within functionalist linguistics, not factual evidence]<br />
<br />
I never quite drank the functionalist Kool-Aid (nor the Chomskyean Kool-Aid either, to be clear); nonetheless I remain endowed with a healthy dose of Chomsky skepticism.<br />
<br />
Here is how I remember the general critique of Chomsky echoed in the halls of SUNY Buffalo linguistics (this is my memory of ten+ years ago, not intended to be a technical critique; this is meant to give the impression of what the culture of a functionalist department felt like).<br />
<br />
<b>The Presence of Chomsky</b><br />
<br />
<ul>
<li>First, we didn't talk about Chomsky much, he was peripheral. What little we said about him was typically mocking and belittling (grad students, ya know).</li>
<li>The syntax courses, however, were designed to teach Chomsky's theories for half a semester, then each instructor was given the second half to teach whatever alternative theory they wanted. For my Syntax I course, we used one of Andrew Radford's Minimalism textbooks (then RRG for the second half). For my Syntax II, we used Elizabeth Cowper's GB textbook (then what Matthew Dryer called "Basic Theory", which I always preferred above all else).</li>
<li>We had a summer reading group for years. One summer we read Chomsky’s The Minimalist Program because we felt responsible for understanding the paradigm (we wanted to try to understand the *other*). The group included two senior faculty, both with serious syntax background. </li>
</ul>
<br />
<b>The Perception of Chomsky </b><br />
(amongst my cohort, this is what my professors, my fellow grad students, and I thought about the guy. Whether we were accurate or not is another thing)<br />
<br />
<ul>
<li>Noam Chomsky is a likable man, for those who get to meet him in person.</li>
<li>Chomsky did linguistics a great service by taking the field in the general direction of hard science.</li>
</ul>
<div>
However,</div>
<ul>
<li>Chomsky's ideas have never been accepted by a majority of linguists, if you include semanticists, discourse analysts, sociolinguists, international linguists, psycholinguists, anthropological linguists, historical linguists, field linguists, philologists, etc. Outside of American syntacticians, Chomsky is a footnote, a non-factor.</li>
<li>Many of his fiercest critics were former students or colleagues.</li>
<li>Chomsky radically changes his theories every ten years or so, simply ignoring his previous claims when they're proved wrong.</li>
<li>Chomsky has never made a serious attempt to understand other theories or engage in linguistic debate; he lives in a cocoon.</li>
<li>He bases major theoretical mechanisms on scant evidence, often obsessing over a single sentence in a language he himself has never studied, based only on evidence from an obscure source (like a grad student thesis).</li>
<li>He condescendingly dismisses most linguistic evidence (like spoken data) with the unfounded distinction between narrow syntax and broad syntax. This allows him to cherry pick data that suits him, and ignore data that refutes his claims.</li>
<li>When critiques are presented by serious linguists with evidence, the evidence is discarded as *irrelevant*, the linguists are derided as foolish amateurs, and the critiques are dismissed as naive. But rarely are the points taken as serious debate.</li>
<li>Chomsky only debates internal mechanisms of his own theories; anyone who argues using mechanisms outside of those Chomsky-internals is derided as ignorant. In other words, there is only one theoretical mechanism, only one set of theoretical terms and artifacts; only these will be recognized as *legitimate* linguistics. Anything else is ignored. </li>
<li>Chomsky doesn't engage with the wider linguistics community. </li>
<li>Chomsky expects to be taken seriously in a way that he himself would never allow anyone else to be taken seriously: lacking substantial evidence, lacking external coherence, and lacking anything approximating collegiality.</li>
<li>Oh, and Chomsky himself hasn't done serious linguistic analysis since the 80s. He has devoted most of the last 30 years to stabbing at political windmills. At most, he spends maybe 10% of his time on linguistics. </li>
</ul>
<br />
That’s the image of the man as I recall from the view of a functionalist department devoted to descriptive linguistics. Let the verbal assaults begin!!!<br />
<br />
<span style="color: red;">UPDATE</span> (May 5): This post prompted <a href="http://www.reddit.com/r/linguistics/comments/36fh2w/discussion_antigenerative_views_of_chomsky_and/" target="_blank">a spirited Reddit discussion</a>, well worth reading.Chrishttp://www.blogger.com/profile/09558846279006287148noreply@blogger.com5