The journalist and data visualization advocate David McCandless recently gave a TED talk, "The beauty of data visualization," which included a chart showing when people in relationships break up, based on scraping 10,000 Facebook status updates for the phrases "breakup" and "broken up" (see here).
(image from The Daily Dish)
He did not go into detail about his scraping technique, so it’s not clear exactly what he scraped for*, but let’s assume he literally extracted only occurrences of those two constructions. What’s wrong with this? Well, it just seems unnatural for people to use those particular phrases to talk about breaking up. Under what conditions would someone use these constructions?
- breakup = a bare NP, single token
- broken up = past participle, particle verb
My intuition is that the following are more likely ways of talking about a breakup:
- I broke up with my boyfriend last night.
- I dumped that asshole last night.
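To make the coverage problem concrete, here is a minimal sketch in Python. It is not McCandless's method, which he did not describe. The status updates are invented for illustration, and the names `narrow`, `broad`, and `matching` are hypothetical. It compares a literal search for the two constructions against a somewhat broader pattern that also catches inflected forms of "break up" and "dump".

```python
import re

# Hypothetical status updates, invented for illustration only.
statuses = [
    "I broke up with my boyfriend last night.",
    "I dumped that asshole last night.",
    "Worst breakup ever.",
    "We have broken up.",
    "Just got dumped.",
    "Breaking up is hard to do.",
]

# Literal search for the two constructions named in the talk.
narrow = re.compile(r"\b(?:breakup|broken up)\b", re.IGNORECASE)

# A broader (assumed, still incomplete) pattern: inflected forms of
# "break up", written with a space, a hyphen, or nothing, plus "dump".
broad = re.compile(
    r"\b(?:break(?:s|ing)?[ -]?up|broken?[ -]?up|dump(?:s|ed|ing)?)\b",
    re.IGNORECASE,
)


def matching(pattern, texts):
    """Return the texts containing at least one match for pattern."""
    return [t for t in texts if pattern.search(t)]


narrow_hits = matching(narrow, statuses)
broad_hits = matching(broad, statuses)

print(len(narrow_hits), len(broad_hits))  # 2 6
```

On this toy sample the literal search misses both of the example sentences above. The broader pattern has its own cost: "dump" will also match statuses that have nothing to do with relationships, so better recall comes at the price of precision.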
This reminds me of the difference between the meaning of the term “model” in the social sciences and the hard sciences. In many cases, a social science model is little more than a visualization of concepts, masking a lack of data to support it; whereas a model in the hard sciences is almost always a computational algorithm that takes in data and spits out predictions.
*The image of the chart says the searches were for "we broke up because", but McCandless says in the talk that he scraped for the phrases "breakup" and "broken up".