The journalist and data visualization advocate David McCandless recently gave a TED talk, The beauty of data visualization, which included a chart showing when people in relationships break up. The chart was based on scraping 10,000 Facebook status updates for the phrases "breakup" and "broken up" (see here).
(image from The Daily Dish)
- breakup = a bare NP, single token
- broken up = past participle, particle verb
My intuition is that the following are more likely ways of talking about a breakup:
- I broke up with my boyfriend last night.
- I dumped that asshole last night.
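To make the gap concrete, here is a minimal Python sketch that checks the two reported search phrases against the example sentences above. It is my own illustration, not McCandless's actual method, which I don't know. The case-insensitive substring match, the names `matches_search` and `BROADER`, and the sample statuses are all assumptions made for illustration.

```python
import re

# The two phrases McCandless says he searched for.
# Assumption: a simple case-insensitive substring match.
SEARCH_PHRASES = ["breakup", "broken up"]


def matches_search(status: str) -> bool:
    """Return True if the status contains either reported search phrase."""
    lowered = status.lower()
    return any(phrase in lowered for phrase in SEARCH_PHRASES)


# A hypothetical broader pattern covering inflected forms of "break up"
# (break/breaks/breaking/broke/broken, with an optional space or hyphen).
BROADER = re.compile(r"\bbr(?:eak|eaks|eaking|oke|oken)[\s-]?up\b", re.IGNORECASE)

# Made-up sample statuses; the first two are the examples from the list above.
statuses = [
    "I broke up with my boyfriend last night.",
    "I dumped that asshole last night.",
    "Worst breakup ever.",
    "We have broken up.",
]

for status in statuses:
    # Columns: hit for the reported phrases, hit for the broader pattern, text.
    print(matches_search(status), bool(BROADER.search(status)), status)
```

Under these assumptions, the reported phrases miss both example sentences. Even the broader pattern misses the second one entirely, so the choice of search string largely determines what gets counted as a breakup.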
This reminds me of the difference between the meaning of the term “model” in the social sciences and the hard sciences. In many cases, a social science model is little more than a visualization of concepts, masking a lack of data to support it; whereas a model in the hard sciences is almost always a computational algorithm that takes in data and spits out predictions.
*The chart image says the searches were for "we broke up because", but McCandless says in the talk that he scraped for the phrases "breakup" and "broken up".
7 comments:
Even assuming all inflected forms for _break up_ had been collected, what about phrases such as "we finally broke up for Easter"? Given that many FB users are school kids or students, they'd be pretty likely to talk about holidays.
Now what would be more accurate (but harder to collect) are changes in the users' relationship status. Changing from "in a relationship" to "single" is probably much more reliable. But you would need to monitor a large number of people for that. Unless there's a hidden API call that sends out notifications if that bit changes...
yeah, deciding exactly what to collect is critical. I also wondered how they "collected" 10,000 status comments. How do they get access to that data?
I didn't watch the TED talk or anything, but I did see the graphic earlier this week. I assumed he was actually scraping the message Facebook gives when someone breaks up: "Sally Sosorry is no longer in a relationship." Or whatever..
.. and to finish my thought, I was trying to say that would seem to be much more accurate. But I guess it's not reliably scrapeable? (scrapable?)
@Jason, are those messages generated automatically when someone changes their relationship status?
Your point's well taken, but the analysis probably isn't as flawed as you make it out to be. Even though there are other events to which the words "break up" or "broken up" could refer, it's probably a safe assumption that the distribution of these terms is homogeneous with time; i.e., the background noise of the term "break up" unrelated to relationships is pretty constant. On the other hand, it's a good assumption that relationship termination is heterogeneous with time. Thus, even though the individual numbers aren't terribly accurate, the illustration of heterogeneity is accurate. And that's the main point of data visualization. Nobody looks at a figure to get the exact numbers. The figure just illustrates the overall trend.
Soren, I take your point, but I would need to know exactly how he extracted the data to make a judgment of its quality. Also, he only scraped 10,000 examples; that's a small sample in NLP terms.