Tuesday, December 26, 2017

Putting the Linguistics into Kaggle Competitions

In the spirit of Dr. Emily Bender’s NAACL blog post Putting the Linguistics in Computational Linguistics, I want to apply some of her thoughts to the data from the recently opened Kaggle competition Toxic Comment Classification Challenge.

Dr. Bender's point: not all NAACL papers need to be linguistically informative—but they should all be linguistically informed.  

Toxic Comment Challenge Correlate: not all Kaggle submissions need to be linguistically informative—but they should all be linguistically informed.

First let me say I'm really excited about this competition because A) it uses real language data, B) it's not obvious what techniques will work well (it's a novel/interesting problem), and C) it's a socially important task to get right. Kudos to Kaggle for supporting this.  


Dr. Bender identifies four steps to put linguistics into CL

Step 1: Know Your Data
Step 2: Describe Your Data Honestly
Step 3: Focus on Linguistic Structure, At Least Some of the Time
Step 4: Do Error Analysis
Since this is preliminary, I’ll concentrate on just steps 1 & 3

UPDATE 12/28/2017 #2: Robert Munro has written up an excellent analysis of this data, A Step in the Right Direction for NLP. He addresses many of the points I make below. He also makes some solid suggestions for how to account for bias. Well worth a read.

UPDATE 12/28/2017 #1: My original analysis of their IAA process was mistaken. I have corrected it. Also added another comment under "Some Thoughts" at the end.

Step 1: Know Your Data

The data download
  • “large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are: toxic, severe_toxic, obscene, threat, insult, identity_hate” (note: there are a large number of comments with no ratings; presumably these are neutral)
  • Actual download for Training set is a CSV of ID, comment, rating (1 or 0) for each label. 
  • 100k talk page diffs
  • Comments come from 2001–2015
  • There are 10 judgements per diff (final rating is the average)
  • Here’s the original paper discussing this work Ex Machina: Personal Attacks Seen at Scale by Wulczyn, Thain, and Dixon.
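For concreteness, here's a minimal sketch of loading the training CSV and pulling out the implicitly neutral comments (all six labels zero). The two inline rows are hypothetical stand-ins for the real `train.csv`:

```python
import io
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Tiny in-memory stand-in for the competition's train.csv (made-up rows).
csv = io.StringIO(
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    '"a1","Thanks for the edit!",0,0,0,0,0,0\n'
    '"a2","You are an idiot",1,0,0,0,1,0\n'
)
train = pd.read_csv(csv)

# Comments with 0 for every label form the implicit "neutral" class.
neutral = train[(train[LABELS] == 0).all(axis=1)]
neutral_rate = len(neutral) / len(train)
print(f"{neutral_rate:.0%} neutral")  # -> 50% neutral (on this toy sample)
```

On the real data, that `neutral_rate` is where the "large number of comments with no ratings" shows up, and it's worth checking before training anything.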
Who created the data?
  • Wikipedia editors
  • Largely young white men (lots of research on this)
    • In a 2011 WMF survey, 90% of Wikipedians identified as male, 9% as female, and 1% as transgender/transsexual
    • 53% are under 30 (see here for more info)
Who rated the data?
  • 3591 Crowdflower participants (you have to dig into the paper and associated repos to discover this)
  • This is a huge number of raters.
  • Looking at the demographics spreadsheet available here, most are not native English speakers. I'm surprised this was allowed
  • Only 18% were native speakers of English
  • Only three gender options were offered (male, female, other). Only one user chose other
  • 54% are under 30
  • 65% chose male as their gender response
  • 68% chose bachelors, masters, doctorate, or professional as level of education
  • I'm admittedly confused by how they obtained their final ratings because the paper discusses a slightly different task (not involving the six labels specifically), and the authors' wiki discussion of annotation and data release (here) also discusses what look like slightly different tasks. Were these six labels aggregated post hoc over different tasks using the same data? I don't see a clear example of asking human raters to give judgments for these six specific labels. I'll keep digging.
  • In general, I find the labels problematic (Dixon admitted to some of this in a discussion here).
Language varieties (in comments data)?
  • Lightly edited English, mostly North American (e.g., one commenter said they "corrected" the spelling of “recognised” to “recognized”)
  • Some commenters self-identify as non-native speakers
  • Lots of spelling typos
  • Mixed punctuation usage
  • Some meta tags in comments
  • Inconsistent use of spaces before/after commas
  • Inconsistent use of caps (inconsistent for both all caps and camel case)
  • Some examples of ASCII art
  • Excessive apostrophe usage (one comment has several hundred apostrophes)
  • Comment length varies considerably
    • From five words to hundreds of words
    • This might have a spurious effect on the human raters
    • Humans get tired. When they encounter a long comment, they may be tempted to rely solely on the first few sentences, or possibly the mere presence of some key words.
  • Some character encoding issues: â€
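Those "â€" fragments are classic mojibake: UTF-8 bytes that were at some point decoded as cp1252/Latin-1. A sketch of a repair pass using the round-trip trick (strings that don't survive the round trip are left alone; the example string is made up):

```python
def fix_mojibake(s: str) -> str:
    """Undo UTF-8 text that was mis-decoded as cp1252 (e.g. "â€™" -> "’").

    Strings that fail the round trip are returned unchanged.
    """
    try:
        return s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

print(fix_mojibake("donâ€™t"))   # don’t
print(fix_mojibake("plain text"))  # plain text (unchanged)
```

This only handles the single-round-trip case; for messier corpora a dedicated library (e.g. ftfy) is a safer bet than hand-rolling heuristics.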

Step 3: Focus on Linguistic Structure, At Least Some of the Time

  • I did a quick skim of a few dozen of the neutral comments (those with 0s for all labels), and it looked like they had no swear words.
  • I fear this will lead a model to over-learn that the mere presence of a swear word means it should get a label.
  • See the excellent blog Strong Language for reasons this is not true.
  • I would throw in some extra training data that includes neutral comments with swear words.
  • Perhaps due to non-native speakers or typos, the wrong morphological variants can be found for some words (e.g., "I am not the only one who finds the article too ridiculous and trivia to be included.")
  • Lemmatizing can help (but it can also hurt)
  • Some comments lack sentence-ending punctuation, leading to run-ons (e.g., "It was very constructive you are just very very stupid.")
  • Large number of grammatical errors
  • If any parsing is done (shallow or full), there could be some issues
  • A Twitter-trained parser might be useful for the short comments, but not the long ones.
  • All comments are decontextualized
  • Missing the comments before reduces information (many comments are responses to something, but we don't get the something) 
  • Also missing the page topic that the comment is referring to
  • Hence, we're missing discourse and pragmatic links
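One way to sanity-check the swear-word worry is to compare how often profanity appears in labeled versus neutral comments. A toy sketch, with a hypothetical mini-lexicon and made-up comments (in practice you'd run this over the real training set with a fuller lexicon):

```python
import re

SWEARS = {"damn", "hell"}  # hypothetical mini-lexicon

def has_swear(comment: str) -> bool:
    tokens = re.findall(r"[a-z']+", comment.lower())
    return any(t in SWEARS for t in tokens)

# (comment, is_toxic) pairs -- invented examples
comments = [
    ("damn, that's a well-sourced edit", False),  # profane but not toxic
    ("go to hell you troll", True),
    ("thanks for fixing the citation", False),
]

# If neutral comments never swear, a model can over-learn
# "swear word => toxic"; these rates make the imbalance visible.
toxic_rate = sum(has_swear(c) for c, t in comments if t) / sum(t for _, t in comments)
neutral_rate = sum(has_swear(c) for c, t in comments if not t) / sum(not t for _, t in comments)
print(toxic_rate, neutral_rate)  # 1.0 0.5
```

A large gap between the two rates on the real data would support augmenting the training set with neutral-but-profane comments.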

Some Thoughts
The data is not simply toxic comments; it's the kind of toxic comments that young men make to other young men. And the labels reflect what young, educated, non-native-English-speaking men think counts as toxic.

My gut feeling is that the generalizability of any model trained on this data will be limited. 

My main concern, though, is the lack of linguistic context (under Discourse). Exactly what counts as "toxic" in these decontextualized circumstances? Would the rating be the same if the context were present? I don't know. My hunch is that at least some of these ratings would change.


Fred Mailhot said...

Thanks for doing this and for highlighting some potential pitfalls/issues. I just started working a bit on this challenge today...

Lucas said...

Thanks for this detailed analysis of the competition data!

I basically agree with everything you are writing here. :) The labelling schema is a first approximation to dig into the concept of toxicity and severe toxicity (these are themselves an approximation for things that make people leave a discussion). There are lots of challenges in trying to create a taxonomy for toxicity; any resources you know of would be warmly welcomed! A few interesting things about this first-approximation taxonomy: we noticed that there's a good deal of overlap between the categories, but also significant non-overlap; it also has pretty good coverage (most things that are toxic fall into one of the categories).

Our broader goal is to explore new kinds of user interaction using ML to make conversations better: to allow them to have a wider diversity of participation (both people and ideas), higher quality of contributions, and more empathy between the contributors. We'd like to find a set of categories that a wide diversity of people find meaningful and useful as feedback when they are writing, useful for moderation of conversation, and useful for organizing comments when reading them. You can find a talk I recently gave at Wikimedia on this subject here: https://www.youtube.com/watch?v=nMENRAkeHnQ&t=1s

I'd be super happy to discuss more, feel free to reach out on github or by email.

(Lucas Dixon)

Patrick Russum said...

First time reader and I actually found your blog through a college assignment (get that traffic!). The points you cover here are absolutely fascinating to me, particularly because they overlap with the type of undergrad research I'm doing and want to continue to do in graduate school. My main issue right now is the opposite of the types of issues that you seem to be covering here; while I am very interested in Computational Linguistics, especially sentiment analysis, I am not well-versed in the 'Computational' part. Now my career goes hand in hand with data science, so I can stumble my way through the basics of the technical know-how, but I often find myself lost. For instance, I wrote a paper where I found, through both a semantics analyzer and through regular discourse analysis, that there was quite a bit of negativity in the twitter comments of a new video announcement from a well known queer Youtuber. I would say that this would be expected, except the intention of the commentators was not to be toxic, all of them were fans and excited about the video. It was merely that the conversations often revolved around very dark topics and few of the posts refrained from mentioning at least something negative from the past or present in their lives revolving around being a transgender person. Unfortunately, when I decided to expand on this with my current research, looking at by and for and about queer news headlines, the sentiment analyzer was almost useless. My data set was too large for the user-friendly 'demo' and even when I found a subset that could be processed, it all came back as extremely neutral, despite some of the most common words being 'brutality', 'murdered', 'homeless', etc. 
This was frustrating and made me confused as to what could have made the difference and if there were a better NLP/sentiment analyzer to use that might provide more insight or a basis for comparison (note, I think part of the issue was of linguistic genre, the sentiment analyzer was trained on twitter and did well for the twitter comments, not so well for the style and form used in news discourse). This led to hours of fruitless searching for any NLP/sentiment processor that I could learn to use at least as a point of comparison. Unfortunately for me, I don't know coding languages and I was lost every time I tried to find something that I could get up and running without a basic background in coding. I know this is a very long comment, but this does seem like an issue I run into often in Corpus Linguistics, tools made by and for data scientists with a steep learning curve for Linguists with, admittedly, very little experience with computer science. I was wondering what your thoughts are? I know saying 'user-friendly' interface is easier said than done, but I do think it would be easier to get 'linguistically informed' analysis if the materials were more accessible? Also, I should just take a course in Python and get it over with, it seems to be a useful skill. I can't wait to read more of your blog!

Chris said...


Thanks for these excellent comments. You touch on several important issues, but mainly you are right: computational tools are primarily by and for computer scientists. I have to say, though, that they are largely more accessible now than they were 15 years ago. If you don't mind, I would like to post this comment to linguist Twitter and get some reaction, then create a more thoughtful post responding to your points.

For a quick pointer, I think learning Python is great, but if you want to be able to use sentiment tools with less learning curve, I recommend R. I highly recommend Julia Silge and David Robinson's excellent, free, and accessible book "Text Mining with R". The sentiment chapter is here: https://www.tidytextmining.com/sentiment.html

It will take me a couple of days, but I will respond more fully to your larger points.
