Friday, November 5, 2010

The Perils of Pretty Pictures

As is too often the case, bad NLP starts with bad linguistics.

The journalist and data visualization advocate David McCandless recently gave a TED talk, The beauty of data visualization, which included a chart showing when people in relationships break up, based on scraping 10,000 Facebook status updates for the phrases "breakup" and "broken up" (see here).

(image from The Daily Dish)

He did not go into detail about his scraping technique, so it’s not clear exactly what he scraped for*, but let’s assume he literally extracted only occurrences of those two constructions. What’s wrong with this? Well, it just seems unnatural for people to use those particular phrases to talk about breaking up. Under what conditions would someone use these constructions?
  • breakup =  a bare NP, single token
  • broken up = past participle, particle verb
I’m sure we can construct some examples, but they would be low probability, right?

My intuition is that the following are more likely ways of talking about a breakup:
  • I broke up with my boyfriend last night.
  • I dumped that asshole last night.
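
To make the worry concrete, here is a minimal Python sketch. The five status updates are invented for illustration (they are not McCandless's data), and the broader pattern is only my guess at a more inclusive search, not his actual method. It compares a literal search for the two phrases against a pattern that also covers inflected forms of "break up" and "dump".

```python
import re

# Hypothetical status updates, invented for illustration only.
statuses = [
    "I broke up with my boyfriend last night.",
    "I dumped that asshole last night.",
    "Worst breakup ever.",
    "We have broken up.",
    "Breaking up is hard to do.",
]

# Literal search for the two phrases mentioned in the talk.
literal = re.compile(r"\b(?:breakup|broken up)\b", re.IGNORECASE)

# Assumed broader pattern: inflected forms of "break up", the noun
# "breakup(s)", and forms of "dump". This is a sketch, not a vetted lexicon.
broader = re.compile(
    r"\b(?:break(?:s|ing)?|broke|broken)[ -]?up\b"
    r"|\bbreakups?\b"
    r"|\bdump(?:s|ed|ing)?\b",
    re.IGNORECASE,
)

literal_hits = [s for s in statuses if literal.search(s)]
broader_hits = [s for s in statuses if broader.search(s)]

print(len(literal_hits), "of", len(statuses), "matched by the literal search")
print(len(broader_hits), "of", len(statuses), "matched by the broader pattern")
```

On this toy list the literal search finds 2 of the 5 statuses and misses both of my "more likely" examples, while the broader pattern finds all 5. The broader pattern would of course pull in more irrelevant hits too, so it is not a fix, just a demonstration of how much the choice of search string matters.
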
McCandless seems distracted by the visualizations, as if they are the data. They are not. A visualization is only as good as the data underlying it, and I fear McCandless’ pretty charts are masking fundamentally vacuous data (like the nearly worthless Facebook data). But in the TED forum, a journalist like McCandless can sell a little snake oil and convince his audience that it’s perfume. I respect his point about relativizing data and I definitely think visualization is important, but it is not THE point of data.

This reminds me of the difference between the meaning of the term “model” in the social sciences and the hard sciences. In many cases, a social science model is little more than a visualization of concepts, masking a lack of data to support it; whereas a model in the hard sciences is almost always a computational algorithm that takes in data and spits out predictions.

*On the image of the chart, it says the searches were for "we broke up because", but McCandless says in the talk that he scraped for the phrases breakup and broken up.


Oliver Mason said...

Even assuming all inflected forms for _break up_ had been collected, what about phrases such as "we finally broke up for Easter"? Given that many FB users are school kids or students, they'd be pretty likely to talk about holidays.

Now what would be more accurate (but harder to collect) are changes in the user's relationship status. Changing from "in a relationship" to "single" is probably much more reliable. But you would need to monitor a large number of people for that. Unless there's a hidden API call that sends out notifications if that bit changes...

Chris said...

yeah, deciding exactly what to collect is critical. I also wondered how they "collected" 10,000 status comments. How do they get access to that data?

Jason M. Adams said...

I didn't watch the TED talk or anything, but I did see the graphic earlier this week. I assumed he was actually scraping the message facebook gives when someone breaks up: "Sally Sosorry is no longer in a relationship." Or whatever..

Jason M. Adams said...

.. and to finish my thought, I was trying to say that would seem to be much more accurate. But I guess it's not reliably scrapeable? (scrapable?)

Chris said...

@Jason, are those messages generated automatically when someone changes their relationship status?

Soren said...

Your point's well taken, but the analysis probably isn't as flawed as you make it out to be. Even though there are other events to which the words "break up" or "broken up" could refer, it's probably a safe assumption that the distribution of these terms is homogeneous with time; ie, the background noise of the term "break up" unrelated to relationships is pretty constant. On the other hand, it's a good assumption that relationship termination is heterogeneous with time. Thus, even though the individual numbers aren't terribly accurate, the illustration of heterogeneity is accurate. And that's the main point of data visualization. Nobody looks at a figure to get the exact numbers. The figure just illustrates the overall trend.

Chris said...

Soren, I take your point, but at this point I would need to know exactly how he extracted the data to judge its quality. Also, he only scraped 10,000 examples; that's a small sample in NLP terms.
