Tuesday, December 26, 2017

Putting the Linguistics into Kaggle Competitions

In the spirit of Dr. Emily Bender’s NAACL blog post Putting the Linguistics in Computational Linguistics, I want to apply some of her thoughts to the data from the recently opened Kaggle competition Toxic Comment Classification Challenge.

Dr. Bender's point: not all NAACL papers need to be linguistically informative—but they should all be linguistically informed.  

Toxic Comment Challenge Correlate: not all Kaggle submissions need to be linguistically informative—but they should all be linguistically informed

First let me say I'm really excited about this competition because A) it uses real language data, B) it's not obvious what techniques will work well (it's a novel/interesting problem), and C) it's a socially important task to get right. Kudos to Kaggle for supporting this.  


Dr. Bender identifies four steps to put linguistics into CL:

Step 1: Know Your Data
Step 2: Describe Your Data Honestly
Step 3: Focus on Linguistic Structure, At Least Some of the Time
Step 4: Do Error Analysis
Since this is preliminary, I’ll concentrate on just steps 1 and 3.

UPDATE 12/28/2017 #2: Robert Munro has written up an excellent analysis of this data, A Step in the Right Direction for NLP. He addresses many of the points I make below. He also makes some solid suggestions for how to account for bias. Well worth a read.

UPDATE 12/28/2017 #1: My original analysis of their IAA process was mistaken. I have corrected it. Also added another comment under "Some Thoughts" at the end.

Step 1: Know Your Data

The data download
  • “large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are: toxic, severe_toxic, obscene, threat, insult, identity_hate” (note: there are a large number of comments with no ratings; presumably these are neutral)
  • Actual download for Training set is a CSV of ID, comment, rating (1 or 0) for each label. 
  • 100k talk page diffs
  • Comments come from 2001–2015
  • There are 10 judgements per diff (final rating is the average)
  • Here’s the original paper discussing this work Ex Machina: Personal Attacks Seen at Scale by Wulczyn, Thain, and Dixon.
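As a first pass at knowing the data, it helps to count how many comments carry each label and how many carry none. This is a minimal sketch, not anything from the competition itself: the column names (id, comment_text, and one column per label) are my assumption about the CSV layout described above, and the three sample rows are made up so the snippet runs on its own.

```python
import csv
import io

# The six labels named in the competition description.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Made-up rows in the ASSUMED layout of the training CSV
# (column names "id" and "comment_text" are assumptions).
SAMPLE_CSV = """id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
a1,"Thanks for fixing the citation.",0,0,0,0,0,0
a2,"You are an idiot.",1,0,0,0,1,0
a3,"Please see the talk page, thanks.",0,0,0,0,0,0
"""


def label_counts(rows):
    """Count comments per label, plus comments with 0s for every label."""
    counts = {label: 0 for label in LABELS}
    counts["neutral"] = 0
    for row in rows:
        flags = [int(row[label]) for label in LABELS]
        if not any(flags):
            counts["neutral"] += 1
        for label, flag in zip(LABELS, flags):
            counts[label] += flag
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = label_counts(rows)
print(counts)
```

To run this on the real download, replace the StringIO sample with an open file handle for the training CSV and adjust the column names if they differ.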
Who created the data?
  • Wikipedia editors
  • Largely young white men (lots of research on this)
    • In a 2011 WMF survey, 90% of Wikipedians identified as male, 9% as female, and 1% as transgender/transsexual
    • 53% are under 30 (see here for more info)
Who rated the data?
  • 3591 Crowdflower participants (you have to dig into the paper and associated repos to discover this)
  • This is a huge number of raters.
  • Looking at the demographics spreadsheet available here, most are not native English speakers. I'm surprised this was allowed.
  • Only 18% were native speakers of English
  • Only three gender options were offered (male, female, other). Only one user chose other
  • 54% are under 30
  • 65% chose male as their gender response
  • 68% chose bachelors, masters, doctorate, or professional as their level of education
  • I'm admittedly confused by how they obtained their final ratings because the paper discusses a slightly different task (not involving the six labels specifically), and the authors' wiki discussion of annotation and data release (here) also discusses what look like slightly different tasks. Were these six labels aggregated post hoc over different tasks using the same data? I don't see a clear example of asking human raters to give judgments for these six specific labels. I'll keep digging.
  • In general, I find the labels problematic (Dixon admitted to some of this in a discussion here).
Language varieties (in comments data)?
  • Lightly edited English, mostly North American (e.g., one comment said they "corrected" the spelling of “recognised” to “recognized”)
  • Some commenters self-identify as non-native speakers
  • Lots of spelling typos
  • Mixed punctuation usage
  • Some meta tags in comments
  • Inconsistent use of spaces before/after commas
  • Inconsistent use of caps (inconsistent for both all caps and camel case)
  • Some examples of ASCII art
  • Excessive apostrophe usage (one comment has several hundred apostrophes)
  • Comment length varies considerably
    • From five words, to hundreds of words
    • This might have a spurious effect on the human raters
    • Humans get tired. When they encounter a long comment, they may be tempted to rely solely on the first few sentences, or possibly the mere presence of some key words.
  • Some character encoding issues: â€
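Several of the surface properties listed above are easy to measure per comment, which makes it possible to check how common they really are before deciding on any normalization. The sketch below is only illustrative: the function names and the choice of statistics are mine, and the encoding repair assumes the debris comes from UTF-8 text that was wrongly decoded as Windows-1252, which I have not verified for this data.

```python
import re


def surface_profile(comment):
    """Return a few surface statistics for one comment."""
    words = comment.split()
    letters = [ch for ch in comment if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    return {
        "n_words": len(words),
        # Share of letters that are upper case (all-caps shouting scores near 1).
        "caps_ratio": upper / len(letters) if letters else 0.0,
        "n_apostrophes": comment.count("'"),
        # Whitespace directly before a comma, and a comma with no space after it.
        "space_before_comma": len(re.findall(r"\s,", comment)),
        "no_space_after_comma": len(re.findall(r",(?=\S)", comment)),
    }


def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252.

    ASSUMPTION: that is where the debris comes from. If the round trip
    fails, the text is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


profile = surface_profile("STOP it ,now,please ''''")
# "don" + a-circumflex, euro sign, trademark sign + "t": a garbled curly apostrophe.
fixed = repair_mojibake("don\u00e2\u20ac\u2122t")
print(profile)
print(fixed)
```

A truncated sequence like the one shown in the last bullet cannot be round-tripped, so this function leaves it as is; those cases would need a different cleanup.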

Step 3: Focus on Linguistic Structure, At Least Some of the Time

  • I did a quick skim of a few dozen of the neutral comments (those with 0s for all labels) and it looked like they had no swear words.
  • I fear this will lead a model to over-learn that the mere presence of a swear word means it should get a label.
  • See the excellent blog Strong Language for reasons this is not true.
  • I would throw in some extra training data that includes neutral comments with swear words.
  • Perhaps due to non-native speakers or typos, the wrong morphological variants can be found for some words (e.g., "I am not the only one who finds the article too ridiculous and trivia to be included.")
  • Lemmatizing can help (but it can also hurt)
  • Some comments lack sentence-ending punctuation, leading to run-ons (e.g., "It was very constructive you are just very very stupid.").
  • Large number of grammatical errors
  • If any parsing is done (shallow or full), these errors could cause problems
  • A Twitter trained parser might be useful for the short comments, but not the long ones.
  • All comments are decontextualized
  • Missing the comments before reduces information (many comments are responses to something, but we don't get the something) 
  • Also missing the page topic that the comment is referring to
  • Hence, we're missing discourse and pragmatic links
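My skim of the neutral comments was informal, so here is a rough way to turn the swear-word observation into a number: compare how often a lexicon word shows up in neutral comments versus labeled ones. Everything specific in this sketch is an assumption for illustration: the three-word lexicon is deliberately tiny and mild, the function names are mine, and the four example comments are made up.

```python
import re

# Hypothetical, deliberately tiny lexicon; a real check needs a fuller list.
SWEAR_WORDS = {"damn", "hell", "crap"}


def has_swear(comment, lexicon=SWEAR_WORDS):
    """True if any whole token of the comment is in the lexicon."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    return any(tok in lexicon for tok in tokens)


def swear_rate_by_group(examples):
    """Share of comments with a lexicon hit, for neutral and labeled comments.

    examples: iterable of (comment, is_labeled) pairs, where is_labeled
    is True when the comment has at least one of the six labels.
    """
    totals = {"neutral": 0, "labeled": 0}
    hits = {"neutral": 0, "labeled": 0}
    for comment, is_labeled in examples:
        group = "labeled" if is_labeled else "neutral"
        totals[group] += 1
        hits[group] += has_swear(comment)
    return {g: (hits[g] / totals[g] if totals[g] else 0.0) for g in totals}


# Made-up examples, not taken from the competition data.
examples = [
    ("Thanks, that edit was damn good.", False),
    ("Please add a source for this claim.", False),
    ("What the hell is wrong with you, idiot.", True),
    ("You are a waste of space.", True),
]
rates = swear_rate_by_group(examples)
print(rates)
```

If the neutral rate on the real training set comes out near zero while the labeled rate is high, that would support the worry that a model can score well just by spotting swear words. Matching whole tokens rather than substrings keeps a word like "hello" from counting as a hit.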

Some Thoughts
The data is not simply toxic comments, but it’s the kind of toxic comments that young men make to other young men.  And the labels are what young, educated, non-native English speaking men think counts as toxic.

My gut feeling is that the generalizability of any model trained on this data will be limited. 

My main concern though is the lack of linguistic context (under Discourse). Exactly what counts as "toxic" under these de-contextual circumstances? Would the rating be the same if the context were present? I don't know. My hunch is that at least some of these ratings would change.
