Monday, January 25, 2016

Genetic Mutation, Thick Data, and Human Intuition

There are two stories trending heavily on my social network sites that are seemingly unrelated, yet they share one obvious conclusion: the value of human intuition in finding needles in big data haystacks. Reading them highlighted for me the special role humans can still play in the emerging 21st century world of big data.

In the first story, The Patient Who Diagnosed Her Own Genetic Mutation—and an Olympic Athlete's, a woman with muscular dystrophy sees a photo of an Olympic sprinter’s bulging muscles and thinks to herself, “she has the same condition I do.” What in the world would cause her to think that? There is no pattern in the data that would suggest this. The story is accompanied by a startling picture of two women who, at first glance, look nothing alike. But once guided by the needle in the haystack that this woman saw, a similarity is illuminated and eventually a connection is made between two medically disparate facts that, once combined, opened a new path of inquiry into muscle growth and dystrophy that is now a productive area of research. Mind you, no new chemical compound was discovered. No new technique or method that allowed scientists to see something that couldn’t be seen before was built. Nope. Nothing *new* came into being, but rather a connection was found between two things that all the world’s experts never saw before. One epiphany by a human being looking for a needle in a haystack. And she found it.

In the second story, Why Big Data Needs Thick Data, an anthropologist working closely to understand the user stories of just 100 cases at Nokia discovers a pattern that Nokia's own big data efforts missed. How? Because her case-study approach emphasized context. Money quote:
"For Big Data to be analyzable, it must use normalizing, standardizing, defining, clustering, all processes that strip the data set of context, meaning, and stories. Thick Data can rescue Big Data from the context-loss that comes with the processes of making it usable."
Traditional machine learning techniques are designed to find large patterns in big data, but those same techniques fail to address the needle in the haystack problem. This is where humans and intuition truly stand apart. Both of these articles are well worth reading in the context of discovering the gaps in current data analysis techniques that humans must fill.

UPDATE: Here's a third story making a similar point: A rabid feminist writes. A human being using an automatically culled dictionary noticed a misogynist tendency in the examples it provided.

And here's a fourth: Algorithms Need Managers, Too. Money quote: "Google’s hard goal of maximizing clicks on ads had led to a situation in which its algorithms, refined through feedback over time, were in effect defaming people with certain kinds of names."

Sunday, January 10, 2016

Advice for linguistics grad students entering industry

At the LSA mixer yesterday I had the chance to chat with a dozen or so grad students in linguistics who were interested in non-academic jobs. Here I'll note some of the recurring themes and the advice I gave.

The First Job
Advice: Be on the look-out and know what a good opportunity looks like.

Most students were very interested in the jump. How do you make that first transition from academia to industry? In general, you need to be in the market, actively looking, actively promoting yourself as a candidate. For me, it was a random posting on The Linguist List that caught my eye. In the summer of 2004 I was a bored ABD grad student. I knew I wasn't going to be competitive for academic jobs at that point, so I checked The Linguist List job board daily. One day I saw a posting from a small consulting company. They were looking for a linguist to help them create translation complexity metrics. They listed every subfield of linguistics as their requirements. This told me they really didn't know what they wanted. I saw that as an opportunity because I could sweep in and help them understand what they needed. I applied and after several phone calls I was asked to create a proposal for their customer. I had a conference call to discuss the proposal (I was in shorts and a t-shirt in an empty lab during the call, but they didn't know that). Long story short, I got the job*, moved to DC and spent about two years working as a consultant on that and other government contracts. That first job was a big step in moving into industry. I had very impressive clients, a skill set that was rare in the market, and a well-defined deliverable that I could point to as a success.

Advice: Make recruiters come to you. Maintain a robust LinkedIn profile and be active on the site on a weekly basis (so that recruiters will find you).

Several students wondered if LinkedIn was considered legitimate. I believe it's fair to say that within the tech and NLP world, LinkedIn is very much legit. My LinkedIn profile has been crucial to being recruited for multiple jobs, two of which I accepted. Algorithms are constantly searching this site for candidates for all kinds of jobs. In fact, most of the really good jobs for linguists are not posted on job sites, but rather are filled only by recruiters. So you need strategies for waving your flag and getting them to come to you. In the DC area, there are excellent opportunities for linguists at DARPA, CASL, IARPA, NIST, MITRE and RAND, and many other FFRDCs (federally funded research and development centers), but they rarely post these to job boards. You need them to find you. A good LinkedIn page is a great way to increase your visibility.

Another way to increase your visibility is to go public with your projects. You can always blog descriptions and analysis. For computer science students, a GitHub account is virtually a requirement. I think linguists should follow their lead. You most likely write little scripts anyway. Maybe an R script to do some analysis, or a Python script to extract some data. Put those up on GitHub with a little README document. That's an easy place for tech companies to see your work. Also, if you have created data sets that you can freely distribute, put those up on GitHub too. I also recommend competing in a Kaggle competition. Kaggle sponsors many machine learning competitions. They provide data, set the requirements, and post results. It's a great way to both practice a little NLP and data science, and also increase your visibility (and put your Kaggle competitions on your resume!). Here are two linguistically intriguing Kaggle competitions ready for you right now: Hillary Clinton's Emails (think about the many things you could analyze in those!); NIPS 2015 Papers (how can a linguist characterize a bunch of machine learning papers?).

If you have managed to automate a process that you once did manually (either through an R script, or maybe Excel formulas), write that up in a blog post. Automating manual processes is huge in industry. You know the messy nature of language data better than anyone else, so write some blog posts describing the kind of messiness you see and what you do about it. That's gold.

Advice: List tools and data sets. Do you use Praat? List that. Do you use the Buckeye Corpus? List that. Make it clear that you have experience with tools and data management. Those are two areas where tech companies always have work to perform, so make it clear that you can perform that work.

*FYI, here's what the deal was with that first consulting job: The FBI tests lots of people as potential translators. So, for example, they will give a native speaker of Vietnamese several passages of Vietnamese writing (one that is simple, one that is medium complex, and one that is complex); then the applicant is asked to translate the passages into English. The FBI grades each translation. The problem was that the FBI didn't have a standardized metric for what counted as a complex passage in Vietnamese (or the many, many other languages that they hire translators for). They relied on experienced translators to recommend passages from work they had done in the past. Turns out, that was a lousy way to find example passages. The actual complexity of passages was wildly uneven, and there was no consistency across languages.

Thursday, January 7, 2016

LSA 2016 Evening Recommendations

With the LSA's annual convention officially underway, I've thrown together a list of a few restaurants and bars within a short walking distance of the convention center that grad students and attendees might want to enjoy. My walking estimates assume you are standing in front of the convention center.

Busboys and Poets (4 blocks west at 5th & K) - A DC Institution. You will not be forgiven if you do not make at least one pilgrimage here.

Maddy’s Taproom (4 blocks east at 13th & L) - Good beer selection.

RFD Washington (4 blocks south at 7th & H) - Large bottled beer selection, good draft beer selection (food ain't that great).

Churchkey (6 blocks northeast at 14th & Rhode Island) - Officially, one of the best beer rooms in the US.

Stan's Restaurant (7 blocks east at L & Vermont) - Downstairs, casual. Very strong drinks. Supposedly good wings (I'm a vegetarian, so I hold no opinion).

Daikaya - Ramen - Izakaya (7 blocks southwest at 6th & G) - Upstairs bar can be easier to get into sometimes. It's a popular place.

Teaism, Penn Quarter (8 blocks south at 8th & G) - Great snack place mid-way to the National Mall. Large downstairs dining area. Great place to have some tea, a snack, and catch up on conference planning.

There are, of course, lots of other places within a short walk. I recommend 14th Street in general. 9th Street has some good stuff, especially as you get closer to U, but the walk is a little sketchy.

TV Linguistics - and the fictional Princeton Linguistics department

 [reposted from 11/20/10] I spent Thursday night on a plane so I missed 30 Rock and the most linguistics oriented sit-com episode since ...