Sunday, May 17, 2015

The Language Myth - Preliminary Thoughts

I started reading The Language Myth: Why Language Is Not an Instinct by Vyvyan Evans. The book argues that Noam Chomsky is wrong about the basic nature of language. It has sparked controversy, and more words have probably been published in blogs and tweets in response than are contained in the book itself.

I'm two chapters in, but before I begin posting my review, I wanted to do a post on academic sub-culture, specifically the one I was trained in. I did my (not quite completed) PhD in linguistics at SUNY Buffalo in 1998-2004. The students only half-jokingly called it Berkeley East because, at the time, about half the faculty had been trained at Berkeley (and several others were closely affiliated in research), and Berkeley is one of the great strongholds of anti-Chomsky sentiment. Buffalo was clearly a "functionalist" school (though no one ever really knew what that meant; functionalism was never really a field so much as a culture).

In any case, we were clearly, undeniably, virulently anti-Chomsky. And that's the culture I want to describe, to give some sense of how different the associations with the name "Chomsky" are for me (and, I suspect, for Evans) than for non-linguists and for non-Chomskyan linguists.

So what was it like to be a grad student in a functionalist linguistics department, with respect to Noam Chomsky?

[SPOILER ALERT - inflammatory language below. Most of this post is intended to represent a thought climate within functionalist linguistics, not factual evidence]

I never quite drank the functionalist Kool-Aid (nor the Chomskyan Kool-Aid either, to be clear); nonetheless, I remain endowed with a healthy dose of Chomsky skepticism.

Here is how I remember the general critique of Chomsky that echoed in the halls of SUNY Buffalo linguistics (this is my memory of ten-plus years ago, not intended as a technical critique; it is meant to give an impression of what the culture of a functionalist department felt like).

The Presence of Chomsky

  • First, we didn't talk about Chomsky much; he was peripheral. What little we said about him was typically mocking and belittling (grad students, ya know).
  • The syntax courses, however, were designed to teach Chomsky's theories for half a semester, then each instructor was given the second half to teach whatever alternative theory they wanted. For my Syntax I course, we used one of Andrew Radford's Minimalism textbooks (then RRG for the second half). For my Syntax II, we used Elizabeth Cowper's GB textbook (then what Matthew Dryer called "Basic Theory", which I always preferred above all else).
  • We had a summer reading group for years. One summer we read Chomsky’s The Minimalist Program because we felt responsible for understanding the paradigm (we wanted to try to understand the *other*). The group included two senior faculty, both with serious syntax background. 

The Perception of Chomsky 
(this is what my professors, my fellow grad students, and I thought about the guy; whether we were accurate or not is another matter)

  • Noam Chomsky is a likable man, for those who get to meet him in person.
  • Chomsky did linguistics a great service by taking linguistics in the general direction of hard science.
  • Chomsky's ideas have never been accepted by a majority of linguists, if you include semanticists, discourse analysts, sociolinguists, linguists outside the US, psycholinguists, anthropological linguists, historical linguists, field linguists, philologists, etc. Outside of American syntacticians, Chomsky is a footnote, a non-factor.
  • Many of his fiercest critics were former students or colleagues.
  • Chomsky radically changes his theories every ten years or so, simply ignoring his previous claims when they're proved wrong.
  • Chomsky has never made a serious attempt to understand other theories or engage in linguistic debate; he lives in a cocoon.
  • He bases major theoretical mechanisms on scant evidence, often obsessing over a single sentence in a language he himself has never studied, based only on evidence from an obscure source (like a grad student thesis).
  • He condescendingly dismisses most linguistic evidence (like spoken data) with the unfounded distinction between narrow syntax and broad syntax. This allows him to cherry pick data that suits him, and ignore data that refutes his claims.
  • When critiques are presented by serious linguists with evidence, the evidence is discarded as *irrelevant*, the linguists are derided as foolish amateurs, and the critiques are dismissed as naive. But rarely are the points taken as serious debate.
  • Chomsky only debates internal mechanisms of his own theories; anyone who argues using mechanisms outside of those Chomsky-internals is derided as ignorant. In other words, there is only one theoretical mechanism, only one set of theoretical terms and artifacts; only these will be recognized as *legitimate* linguistics. Anything else is ignored. 
  • Chomsky doesn't engage with the wider linguistics community. 
  • Chomsky expects to be taken seriously in a way that he himself would never allow anyone else to be taken seriously: lacking substantial evidence, lacking external coherence, and lacking anything approximating collegiality.
  • Oh, and Chomsky himself hasn't done serious linguistic analysis since the 80s. He has devoted most of the last 30 years to tilting at political windmills. At most, he spends maybe 10% of his time on linguistics.

That’s the image of the man as I recall from the view of a functionalist department devoted to descriptive linguistics. Let the verbal assaults begin!!!

UPDATE (May 5): This post prompted a spirited Reddit discussion, well worth reading.

Thursday, March 12, 2015

Jobs with IBM Watson

IBM Watson is currently recruiting Washington, DC-area engineers for "Natural Language Processing Analyst" positions. We're looking for engineers who like to build stuff and travel. You can apply through the link, or feel free to contact me if you want more info (use the "View my complete profile" link to the right for my contact).

Here's the official posting (hint: there is wiggle room)

Job description
Ready to change the way the world works? IBM Watson uses Cognitive Computing to tackle some of humanity's most challenging problems - like revolutionizing how doctors research cancer or transforming how businesses engage with their customers. We have an exciting opportunity for a Watson Natural Language Processing Analyst responsible for rigorous analysis of system performance phases including search, evidence scoring, and machine learning.

Natural Language Processing (NLP) Analysts evaluate system performance and identify steps to drive enhancements. The role is part analyst and part developer. Analysts are required to function independently to dive deep into system components, identify areas for improvement, and devise solutions. Analysts are expected to drive test and evaluation of their solutions, and empirically identify follow-on steps to implement continuous system improvement. Natural Language Processing is an explosively dynamic field; analysts must expect ambiguity and demonstrate the ability to develop courses of action on the basis of data-driven analysis. Must be able to work independently and demonstrate initiative. Demonstrated analytical skills required; security clearance preferred but not required.

We live in a moment of remarkable change and opportunity. The convergence of data and technology is transforming industries, society and even the workplace. New roles are being created that never existed before to meet the demands of this transformation. And IBM Watson is now looking for talent in healthcare, life sciences, financial services, the public sector, and other industries to fill new roles destined to usher in the next era of cognitive computing. Embark on the journey with us at IBM Watson.
Required Technical and Professional Expertise
  • Bachelor's Degree
  • At least 2 years experience with text search engines (such as Lucene)
  • At least 2 years experience in Java development
  • Basic knowledge of Natural Language Processing
  • Basic knowledge of text analytics / information retrieval
  • Basic knowledge of unstructured data
  • Readiness to travel 50% annually
  • U.S. citizenship required
  • English: Fluent

Preferred Technical and Professional Expertise
  • Master's Degree
  • At least 5 years experience with text search engines (such as Lucene)
  • At least 5 years experience in Java development
  • At least 2 years experience in Natural Language Processing
  • At least 2 years experience in text analytics / information retrieval
  • At least 2 years experience in unstructured data
IBM is committed to creating a diverse environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.

Monday, March 2, 2015

The Linguistics behind IBM Watson

I will be talking about the linguistics behind IBM Watson's Question Answering on March 11 at the DC Natural Language Processing MeetUp. Here's the blurb:

In February 2011, IBM Watson defeated Brad Rutter and Ken Jennings in the Jeopardy! Challenge. Today, Watson is a cognitive system that enables a new partnership between people and computers that enhances and scales human expertise by providing a more natural relationship between the human and the computer. 

One part of Watson’s cognitive computing platform is Question Answering. The main objective of QA is to analyze natural language questions and present concise answers with supporting evidence, rather than a list of possibly relevant documents like internet search engines.

This talk will describe some of the natural language processing components that go into just three of the basic stages of IBM Watson’s Question Answering pipeline:

  • Question Analysis
  • Hypothesis Generation
  • Semantic Types

The NLP components that help make this happen include a full syntactic parse, entity and relationship extraction, semantic tagging, co-reference, automatic frame discovery, and many others. This talk will discuss how sophisticated linguistic resources allow Watson to achieve true question answering functionality.

Tuesday, February 24, 2015

toy data and question answering

Some first impressions of a really interesting paper on AI and Question Answering: Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks (pdf)

FWIW, I spent most of the time mis-reading the first author name as Watson, instead of Weston, assuming a false sense of irony :-)

The basic idea is that in order to facilitate rapid development of artificial intelligence question answering applications, someone ought to create a set of standard data sets that are highly constrained as to what they test (gee, who might be the best people to create those? the authors maybe?).

Basically, these are regression tests for reasoning. Each test contains a set of 2-5 short statements (= the corpus) and a set of 1-2 questions. The point is to expose the logical requirements for answering the questions given the way the answer is encoded in the set of statements.

They define 20 specialized tests. All of them are truly interesting as AI tests.
Their two simplest tests:

3.1 is a fairly straightforward factoid question-answer pairing, but 3.2 requires that a system extract, store, and connect information across sentences before discovering the answer. Cool task. Non-trivial.
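The paper's actual task text isn't reproduced here, but the shape of such a task can be sketched in a few lines. Everything below (the sentences, the entity names, and the naive solver) is illustrative, not the authors' data or code; the point is just that the answer requires chaining facts across statements:

```python
# Illustrative toy task: a tiny "corpus" of statements plus a question
# whose answer requires connecting facts across sentences.
task = {
    "statements": [
        "John is in the kitchen.",
        "John picked up the football.",
        "John went to the garden.",
    ],
    "question": "Where is the football?",
    "answer": "garden",
}

def answer_where_is(task):
    """Naive solver: track who holds each object, and where each
    holder is, by scanning the statements in order."""
    location = {}   # entity -> place
    holding = {}    # object -> holder
    for s in task["statements"]:
        words = s.rstrip(".").split()
        subj = words[0]
        if "picked up" in s or "got" in s:
            holding[words[-1]] = subj
        elif "is in" in s or "went to" in s:
            location[subj] = words[-1]
    obj = task["question"].rstrip("?").split()[-1]
    return location.get(holding.get(obj))

print(answer_where_is(task))  # -> garden
```

The hand-written solver only works because the toy world is tiny and regular, which is exactly the authors' point: each task isolates one reasoning skill rather than testing messy language understanding.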

3.17 is an even more challenging task:

This immediately reminded me of a textual Tarski's World.

The authors admit that few of these tasks are immediately relevant to an industrial-scale question answering system like Watson, but argue that tasks like these focus research on identifying the skills needed to answer such questions:
Our goal is to categorize different kinds of questions into skill sets, which become our tasks. Our hope is that the analysis of performance on these tasks will help expose weaknesses of current models and help motivate new algorithm designs that alleviate these weaknesses. We further envision this as a feedback loop where new tasks can then be designed in response, perhaps in an adversarial fashion, in order to break the new models.
My Big Take-Aways:

  • Not realistic QA from large data.
  • They're testing artificial recombination of facts into new understanding, not NLP or information extraction per se. It's an AI paper, after all, so ignoring contemporary approaches to information extraction is fine. But can one assume that many of these answers are discoverable in large data collections using existing NLP techniques? Their core design encodes each answer in one very limited way, whereas industrial-scale QA systems use large data sets in which answers are typically repeated, often many times, in many different forms across the corpora.
  • Their test data is so specific, I worried that it might encourage over-fitting of solutions: Solve THAT problem, not solve THE problem. 
Oh, BTW, I loved reading this paper, in case that wasn't clear.

Monday, September 15, 2014

can you still do linguistics without math?

A reader emailed me an interesting question that's worth giving a wider audience to:
It nearly broke my heart to hear that maths may be a required thing in linguistics, maths has pulled me back from a few opportunities in the past before linguistics, I'd been interested in engineering, marine biology, etc. I was just wondering if there was any work around, anything that would help me with linguistics that didn't require maths. Just. any advice at all, for getting into the field of linguistics with something as troubling as dyscalculia.
The reader makes a good point I hadn't thought about. I remember my phonetics teacher telling us that she often recruited students into linguistics by telling them that it's one of the few fields that teach non-mathematical data analytics. That was something that appealed to me.

I'm not familiar with dyscalculia, so I can't speak to how it impacts the study of linguistics directly. But even linguists who don't perceive themselves as "doing math" often still are, in the form of complicated measurements and such, as in phonetics and psycholinguistics. Generally, though, I think there are still many opportunities to do non-mathematical linguistics, especially in fields like sociolinguistics, language policy, and language documentation. Let us not forget that the vast majority of the world's languages remain undocumented, so we need an army of linguists to work with speakers the world over to record, analyze, and describe the lexicons, grammars, and sound systems of those languages. We also need to better understand child language acquisition, slang, pragmatic inference, and a host of other deeply important linguistic issues. It still requires a lot of good old-fashioned, non-mathematical linguistics skills to study those topics.

Unfortunately, those are woefully underpaid skills as well. One of the reasons math is taking over linguistics is simple economics: that's where the money is. Both the job market and the research grant market are trending heavily towards quantitative skills and tools, regardless of the discipline. That's just a fact we all have to deal with. I didn't go to grad school in order to work at IBM. That's just where the job is. I couldn't get hired at a university to save my life right now, but I can make twice what a professor makes at IBM. So here I am (don't get me wrong. I have the enviable position of getting paid well to work on real language problems, so I ain't complaining).

Increasingly, the value of descriptive linguistic skills lies in creating corpora that can be processed automatically with tools like AntConc. You can do a lot of corpus linguistics these days without explicit math because the software does much of the work for you. But you will still need to understand the underlying math concepts (like why true "keywords" are not simply frequency searches). For details, I can highly recommend Lancaster University's MOOC "Corpus linguistics: method, analysis, interpretation" (it's free and online right now).
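To make the keyword point concrete, here is a toy sketch of the kind of keyness statistic corpus tools compute (a Dunning-style log-likelihood comparison against a reference corpus). The counts are invented for illustration; the idea is that a word like "the" can be enormously frequent yet score near zero, because it is equally frequent everywhere:

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Dunning-style log-likelihood (G2) keyness score: how surprising
    is the word's target-corpus frequency given the reference corpus?"""
    total = freq_target + freq_ref
    expected_t = size_target * total / (size_target + size_ref)
    expected_r = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    if freq_target:
        g2 += freq_target * math.log(freq_target / expected_t)
    if freq_ref:
        g2 += freq_ref * math.log(freq_ref / expected_r)
    return 2 * g2

# A very frequent word with the same relative frequency in both corpora
# gets ~0 keyness; a modest-frequency word that is rare in the reference
# corpus scores high.
print(log_likelihood(6000, 100_000, 60_000, 1_000_000))  # everyday word: ~0
print(log_likelihood(200, 100_000, 50, 1_000_000))       # genuine keyword: large
```

So a keyword list is a ranking by surprise relative to a reference corpus, not a ranking by raw count, which is the concept worth understanding even when the software computes it for you.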

The real question is: what do you want to do with linguistics? Do you want to get a PhD and become a professor? That's a tough road (and not just in linguistics; the academic marketplace is imploding due to funding issues). There aren't many universities that hire pure descriptive linguists anymore. Those jobs do exist, but they're rare. SUNY Buffalo, Oregon, and New Mexico are three US schools that come to mind as still having descriptive field-linguistics faculty. But the list is short.

If you want to teach a language, that's the most direct route to getting a job, but you'll need the TESOL Certificate too and frankly, those tend to be low paid, part-time jobs. Hard to build a secure career off of that.

That leaves industry. There are industry jobs for non-quantitative linguists, but they're unpredictable. Marketing agencies occasionally hire linguists to do research on cross-linguistic brand names and such. Check out this old post for some examples.

I hope this helps. I recommend asking this question over at The Linguist List too because I have my own biases. It's smart to get a wide variety of perspectives.

Tuesday, September 2, 2014

neural nets and question answering

I just read A Neural Network for Factoid Question Answering by Iyyer et al. (presented at EMNLP 2014).

I've been particularly keen on question answering research for a long time because my first-ever NLP gig was as a grad student intern at a now-defunct question answering start-up in 2000 (QA was all the rage during the 90s tech bubble). QA is somewhat special among NLP fields because it combines all of the others into a single, deeply complex pipeline.

When I saw this paper Tweeted by Mat Kelcey, I was excited by the title, but after reading it, I suspect the constraints of their task make it not quite applicable to commercial QA applications.

Here are some thoughts on the paper, but to be clear: these comments are my own and do not represent in any way those of my employer.

What they did:
Took question/answer pairs from a college Quiz Bowl game and trained a neural network to find answers to new questions. More to the point, "given a description of an entity, [they trained a neural net to] identify the person, place, or thing discussed".

The downside:
  1. They used factoid questions from a game called Quiz Bowl
  2. Factoid questions assume small, easily identifiable answers (typically one word or maybe a short multi-word phrase)
  3. If you’re unfamiliar with the format of these quiz bowl games, you can play something similar at bars like Buffalo Wild Wings. You get a little device for inputting an answer and the questions are presented on TVs around the room. The *questions* are composed of 4-6 sentences, displayed one at a time. The faster you answer, the more points you get. The sentences in the question are hierarchically ordered in terms of information contained. The first sentence gives very little information away and is presented alone for maybe 5 seconds. If you can’t answer, the second sentence appears for 5 seconds giving a bit more detail. If you still can’t answer, the third sentence appears providing even more detail, but fewer points. And so on.
  4. Therefore, they had large *questions* composed of 4-6 sentences, providing more and more details about the answer. This amount of information is rare (though they report results of experimental guesses after just the first sentence, I believe they still used the entire *question* paragraph for training).
  5. They had fixed, known answer sets to train on. Plus (annotated) incorrect answers to train on.
  6. They whittled down their training and test data to a small set of QA pairs that *fit* their needs (no messy data) - "451 history answers and 595 literature answers that occur on average twelve times in the corpus".
  7. They could not handle multi-word named entities (so they manually pre-processed their corpus to convert these into single strings).
The upside:

  1. Their use of dependency trees instead of bag o' words was nice. As a linguist, I want to see more sophisticated linguistic information used in NLP.
  2. They jointly learned answer and question representations in the same vector space rather than learning them separately because "most answers are themselves words (features) in other questions (e.g., a question on World War II might mention the Battle of the Bulge and vice versa). Thus, word vectors associated with such answers can be trained in the same vector space as question text enabling us to model relationships between answers instead of assuming incorrectly that all answers are independent."
  3. I found their error analysis in sections “5.2 Where the Attribute Space Helps Answer Questions” and 5.3 "Where all Models Struggle” especially thought provoking. More published research should include these kinds of sections.
  4. Footnote 7 is interesting: "We tried transforming Wikipedia sentences into quiz bowl sentences by replacing answer mentions with appropriate descriptors (e.g., "Joseph Heller" with "this author"), but the resulting sentences suffered from a variety of grammatical issues and did not help the final result." Yep, syntax. Find-and-replace not gonna cut it.

Friday, August 1, 2014

for linguists, by linguists

The Speculative Grammarian is at it again, offering a happy hour discount on an already ridiculously inexpensive book of linguistic fun: The Speculative Grammarian Essential Guide to Linguistics.
Speculative Grammarian is the premier scholarly journal featuring research in the oft neglected field of satirical linguistics—and it is now available in book form!

a sidelong look at all that is humorous about the field. Containing over 150 articles, poems, cartoons, humorous ads and book announcements—plus a generous sprinkling of quotes, proverbs and other witticisms—the book discovers things to laugh about in most major subfields of Linguistics. It pokes good-natured fun at linguists (famous or otherwise), linguistic theory, and many aspects of language. The authors and editors are linguists who love their field, but who at the same time love to celebrate the funny aspects of Linguistics. The book invites readers to laugh along.

Sunday, June 29, 2014

Facebook "emotional contagion" Study: A Roundup of Reactions

In case you missed it, there was a dust-up around the web this weekend over a social science study involving manipulation of the Facebook news feeds of users (which might include you, if you are an English-language user). Here are three points of contention (in order of intensity):
  • Ethics - Was there informed consent?
  • Statistical significance - The effect was small, but the data large, what does this mean?
  • Linguistics - How did they define and track "emotion"?
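On the statistical significance point, a toy calculation shows why a tiny effect can still be "significant" at this scale. The numbers below are hypothetical (only the overall N of 689,003 comes from the paper); the sketch uses a plain two-sample z-test under a normal approximation:

```python
import math

def two_sample_p(mean1, mean2, sd, n1, n2):
    """Two-sided p-value for a difference in means (normal
    approximation, equal standard deviations assumed)."""
    se = sd * math.sqrt(1 / n1 + 1 / n2)
    z = (mean1 - mean2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical numbers: a 0.02-point shift in "percent positive words"
# with a standard deviation of 1.0 is a negligible effect size
# (Cohen's d = 0.02), yet with ~345k users per arm it is wildly
# "statistically significant".
p = two_sample_p(5.00, 5.02, 1.0, 344_501, 344_502)
print(p)  # far below 0.05 despite the tiny effect
```

That is the crux of the significance complaint: with N in the hundreds of thousands, a p-value tells you almost nothing about whether the effect matters.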
First, the original study itself:

Experimental evidence of massive-scale emotional contagion through social networks. Kramer et al. PNAS. Synopsis (from PNAS):
We show, via a massive (N = 689,003) experiment on Facebook, that emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness. We provide experimental evidence that emotional contagion occurs without direct interaction between people (exposure to a friend expressing an emotion is sufficient), and in the complete absence of nonverbal cues.
My two cents: We'll never see the actual language data, so the many questions this study raises are destined to be left unanswered.

The Roundup

In Defense of Facebook: If you can only read one analysis, read Tal Yarkoni's deep dive response to the study and its critics. It's worth a full read (comments too). He makes a lot of important points, including the weakness of the effect, the rather tame facts of the actual experiments, and the normalcy of manipulation (that's how life works), but for me, this take-down of the core assumptions underlying the study is the Money Quote:
the fact that users in the experimental conditions produced content with very slightly more positive or negative emotional content doesn’t mean that those users actually felt any differently. It’s entirely possible–and I would argue, even probable–that much of the effect was driven by changes in the expression of ideas or feelings that were already on users’ minds. For example, suppose I log onto Facebook intending to write a status update to the effect that I had an “awesome day today at the beach with my besties!” Now imagine that, as soon as I log in, I see in my news feed that an acquaintance’s father just passed away. I might very well think twice about posting my own message–not necessarily because the news has made me feel sad myself, but because it surely seems a bit unseemly to celebrate one’s own good fortune around people who are currently grieving. I would argue that such subtle behavioral changes, while certainly responsive to others’ emotions, shouldn’t really be considered genuine cases of emotional contagion

The Empire strikes back: Humanities professor Alan Jacobs counters Yarkoni, using language that at times verged on unhinged. Hyperbole aside, though, he takes issue with claims that the experiment was ethical simply because users signed a user agreement (that few of them ever actually read). Money Quote:
This seems to be missing the point of the complaints about Facebook’s behavior. The complaints are not “Facebook successfully manipulated users’ emotions” but rather “Facebook attempted to manipulate users’ emotions without informing them that they were being experimented on.” That’s where the ethical question lies, not with the degree of the manipulation’s success. “Who cares if that guy was shooting at you? He missed, didn’t he?” — that seems to be Yarkoni’s attitude

Facebook admits manipulating users' emotions by modifying news feeds: Across the pond, The Guardian got into the kerfuffle. Never one to miss a chance to go full metal Orwell on us, the Guardian gives us this ridiculous Money Quote with not a whiff of counter-argument:
In a series of Twitter posts, Clay Johnson, the co-founder of Blue State Digital, the firm that built and managed Barack Obama's online campaign for the presidency in 2008, said: "The Facebook 'transmission of anger' experiment is terrifying." He asked: "Could the CIA incite revolution in Sudan by pressuring Facebook to promote discontent? Should that be legal? Could Mark Zuckerberg swing an election by promoting Upworthy [a website aggregating viral content] posts two weeks beforehand? Should that be legal?"
This Clay Johnson guy is hilarious, in a dangerously stupid way. How does his bonkers ranting rate two paragraphs in a Guardian story?

Everything We Know About Facebook's Secret Mood Manipulation Experiment: The Atlantic provides a roundup of sorts, a review of the basic facts, and some much-needed sanity about the limitations of LIWC (a limited dictionary tool that, were it not for the evangelical zeal of its creator James Pennebaker, would be little more than a toy for undergrad English majors to play with). The article also provides important quotes from the study's editor, Princeton's Susan Fiske, and links to a full interview with Professor Fiske.

Emotional Contagion on Facebook? More Like Bad Research Methods: If you have time to read two and only two analyses of the Facebook study, first read Yarkoni above, then read John Grohol's excellent fisking of the (mis-)use of LIWC as tool for linguistic study. Money Quote:
much of human communication includes subtleties ... — without even delving into sarcasm, short-hand abbreviations that act as negation words, phrases that negate the previous sentence, emojis, etc. — you can’t even tell how accurate or inaccurate the resulting analysis by these researchers is. Since the LIWC 2007 ignores these subtle realities of informal human communication, so do the researchers.
Analyzing Facebook's PNAS paper on Emotional Contagion: Nitin Madnani provides an NLPer's detailed fisking of the experimental methods, with special attention paid to the flaws of LIWC (with bonus comment from Brendan O'Connor, recent CMU grad and new UMass Amherst professor). Money Quote:
Far and away, my biggest complaint is that the Facebook scientists simply used a word list to determine whether a post was positive or negative. As someone who works in natural language processing (including on the task of analyzing sentiment in documents), such a rudimentary system would be treated with extreme skepticism in our conferences and journals. There are just too many problems with the approach, e.g. negation ("I am not very happy today because ..."). From the paper, it doesn't look like the authors tried to address these problems. In short, I am skeptical whether the experiment actually measures anything useful. One way to address comments such as mine is to actually release the data to the public along with some honest error analysis about how well such a naive approach actually worked.
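A sketch makes the word-list weakness obvious. This is not LIWC itself (its dictionaries are proprietary and far larger); it is an illustrative mini-lexicon and counter of the general kind the critics describe:

```python
# Illustrative mini-lexicon; a real word-list tool has thousands of
# entries, but the counting logic is the same.
POSITIVE = {"happy", "great", "awesome", "love"}
NEGATIVE = {"sad", "terrible", "awful", "hate"}

def word_list_sentiment(post):
    """Count lexicon hits, ignoring all context -- exactly the
    weakness Grohol and Madnani point out."""
    words = post.lower().strip(".!?").split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg

# Negation flips the meaning but not the count: this clearly negative
# post is scored as positive.
print(word_list_sentiment("I am not very happy today"))  # -> 1
```

Sarcasm, abbreviations, emoji, and cross-sentence negation all fail the same way, which is why counting dictionary hits is such a shaky foundation for claims about what users actually felt.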

Facebook’s Unethical Experiment: Tal Yarkoni's article above provides a pretty thorough fisking of this Slate screed. I'll just add that Slate is never the place I'd go to for well reasoned, scientific analysis. A blow-by-blow deep dive into the last episode of Orange Is The New Black? Oh yeah, Slate has that genre down cold.

Anger Builds Over Facebook's Emotion-Manipulation Study: The site that never met a listicle it didn't love, Mashable provides a short article that fails to live up to its title. They provide little evidence that anger is building beyond screen grabs of a whopping four Twitter feeds. Note, they completely ignore the range of people supporting the study (no quotes from the authors, for example). As far as I can tell, there is no hashtag for anti-Facebook study tweets.

Facebook Manipulated User News Feeds To Create Emotional Responses: Forbes wonders aloud about the mis-use of the study by marketers. Money Quote:
What harm might flow from manipulating user timelines to create emotions?  Well, consider the controversial study published last year (not by Facebook researchers) that said companies should tailor their marketing to women based on how they felt about their appearance.  That marketing study began by examining the days and times when women felt the worst about themselves, finding that women felt most vulnerable on Mondays and felt the best about themselves on Thursdays ... The Facebook study, combined with last year’s marketing study suggests that marketers may not need to wait until Mondays or Thursdays to have an emotional impact, instead  social media companies may be able to manipulate timelines and news feeds to create emotionally fueled marketing opportunities.
You don't have to work hard to convince me that marketing professionals have a habit of half-digesting science they barely understand to try to manipulate consumers. That's par for the course in that field, as far as I can tell. Just don't know what scientists producing the original studies can do about it. Monkey's gonna throw shit. Don't blame the banana they ate.

Creepy Study Shows Facebook Can Tweak Your Moods Through ‘Emotional Contagion’. The Blaze writer Zach Noble summed up the negative reaction this way: a victory for scientific understanding with some really creepy ramifications. But I think it only seems creepy if you misunderstand the actual methods.

Final Thought: It's the bad science that creeps me out more than the questionable ethics. Facebook is data; let's use it wisely.

Friday, June 6, 2014

would you like vocal fries with that?

Actual linguist Christian DiCanio debunks a study by non-linguists about perceptions of faked vocal fry (if The Onion did linguistics parodies, surely this would be it): Vocal fry doesn't harm your career prospects, but not being yourself just might.

Money quote:
...listeners judge the female speakers with vocal fry as sounding "untrustworthy", there is a good possibility that they are simply making such a judgment based on the speaker not sounding like herself. The better lesson that one might take home instead here is that one's job prospects are harmed if you try to talk (or act) like someone who you are not.
Read the full take-down here (including bonus spectrogram!)

PS: I knew Christian briefly when he was an undergrad at SUNY Buffalo. He was talented and motivated. Now he's slumming it at some shady, slacker *university* in Connecticut. Damn waste.

Tuesday, May 27, 2014

mathematical linguistics for high school students

I received the following email this weekend:

I'm a high school junior from southern California.

For our final project in AP Calculus class, I'm doing a presentation on the connection between mathematics and linguistics, and I stumbled on your blogpost "Why Linguists Should Study Math" while researching my topic.

I was wondering if you could point me towards some resources (that are relatively easy to understand) about how math is present in and affects our written and spoken language.
Some things that I am considering are:
- the occurrences of words in our language
- how grammar uses mathematical principles
- algorithms we use to construct sentences

My [edited] response (suggestions from y'all as to better resources are much appreciated; I'll forward; I wanted to get a response out quickly because the final is presumably fast approaching):


Thanks for reaching out to me. Of course, I think you’ve chosen a good topic. There are two broad ways in which linguistics and math intersect:
  • How the human brain uses math in natural language (psycholinguistics)
  • How linguists use math to study and model languages (computational linguistics)
From your email, it appears you are mostly interested in the first. However, in contemporary linguistics, the two are fast becoming one. Most working linguists use math as a tool.

Let me address your three areas of interest with respect to how the human brain might use math to process and produce language:

The occurrences of words in our language: For the most part, this means “frequency” which really means counting. Linguists love to count. We use large corpora of texts to count words and phrases. Lancaster University in the UK is a well-known corpus linguistics school. Their web page has a lot of good introductory information (although I find it a bit clunky looking).

UPDATE: I forgot to include the one item that most directly answers the basic question: frequency effects in language. Humans are keenly aware of how often they hear words. In some way, we count words automatically; even if it's not an exact count like 75, somehow we know which words, phonemes, and syntactic structures we hear or read more often than others. This gives rise to a variety of frequency effects in language processing. This is the clearest example of how the brain uses math for language.

For example, we recognize high frequency words much faster than low frequency words. The website for Paul Warren's book "Introducing Psycholinguistics" has an online demo for a word frequency task you can walk through to see how linguists study this.
What do linguists count?
  • Words: I’m sure you’ve seen word clouds like Wordle. This is composed of simple word frequency counts. One of the most enduring facts about word counts is Zipf’s Law which says “the most frequent word [in a corpus of texts] will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.” Why would this be true? Linguists have been studying this for decades.
  • Ngrams: sets of two-word, three-word, four-word strings, etc. This helps provide more context than mere single-word frequencies. Have some fun playing around with Google’s Ngram Viewer if you haven’t already. Try plotting the change in frequency of “mathematical linguistics” and “corpus linguistics” (paste those two phrases into the search box with no quotes and only a comma separating them). Scholars are trying to use this to plot changes in culture. For example, take a look at this PDF.
  • Other: We count many other things too, like parts of speech (verbs, nouns, prepositions, etc.). We also count the co-occurrence of linguistic items that are not right next to each other. If you want to dig into more frequency fun, check out the more advanced tools at BYU. You can read more about how these tools help us study language here.
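If you want to see just how simple the counting itself is, here is a minimal Python sketch. The toy "corpus" is entirely made up for illustration; real corpus work uses millions of words, but the mechanics are the same:

```python
from collections import Counter

# A made-up toy corpus -- real corpus linguistics uses millions of words.
corpus = """
the cat sat on the mat and the dog sat on the rug
the cat and the dog saw the bird on the mat
""".split()

# Count single-word frequencies.
word_counts = Counter(corpus)

# Rank words from most to least frequent, Zipf-style.
for rank, (word, count) in enumerate(word_counts.most_common(5), start=1):
    print(rank, word, count)

# Bigrams (2-grams): pairs of adjacent words, which add context
# beyond single-word frequencies.
bigrams = Counter(zip(corpus, corpus[1:]))
print(bigrams.most_common(3))
```

Even on this silly little sample, "the" dominates the counts by a wide margin, which is the flavor of what Zipf's Law describes at scale.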

How grammar uses mathematical principles: One of the most commonly studied mathematical principles in language is statistical learning. A good example of this is transitional probabilities, which are sets of probabilities for what linguistic item might come next given a string of items (e.g., words or phonemes). For example, if you read “The author signed the _______”, you could guess what the blank word is based on the previous four words (most likely, it’s “book”). This is based on psycholinguistic tests called “Cloze tests”. Linguists have discovered that the brain tracks transitional probabilities for all kinds of linguistic items. In fact, this is one of the most robust areas of study in language acquisition. Linguists study how babies use transitional probabilities to learn language. For example, one of the most challenging problems is figuring out how babies learn to separate the continuous stream of audio noise coming into their ears into separate words, without any knowledge of what words are or what they mean. One theory is that babies quickly learn transitional probabilities of sounds that tell them where one word ends and another begins. But transitional probabilities alone are not enough. For a challenge, try reviewing this PDF.
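To make transitional probabilities concrete, here is a toy Python sketch. The three-sentence mini-corpus and the function name are my own inventions, not from any real study; it just shows how P(next word | current word) falls out of simple counting:

```python
from collections import Counter, defaultdict

# Toy corpus; real studies use large corpora or child-directed speech.
sentences = [
    "the author signed the book",
    "the author read the book",
    "the author signed the contract",
]

# Count how often each word follows each other word.
transitions = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        transitions[current][nxt] += 1

# Transitional probability:
# P(next | current) = count(current followed by next) / count(current followed by anything)
def transitional_probability(current, nxt):
    total = sum(transitions[current].values())
    return transitions[current][nxt] / total if total else 0.0

print(transitional_probability("signed", "the"))  # "signed" is always followed by "the": 1.0
print(transitional_probability("the", "book"))    # "the" is followed by "book" 2 times out of 6
```

The same logic, scaled up to syllables heard by an infant, is the core of the statistical-learning account of word segmentation.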

Algorithms we use to construct sentences: This is the most controversial area you’ve asked about. The fact is, we linguists don’t really know how the brain constructs sentences. As I mentioned above, there are models based on transitional probabilities, like Markov models, computer algorithms designed to make the same kinds of guesses we made about “book”. Markov models and Cloze tests are a good example of psycholinguistics and computational linguistics coming together. As a theoretical contrast to statistical models, there are rule-based models like formal grammars. These are not mathematical in a typical sense, but they are based on formal logic, which is the underlying foundation of mathematics. Linguistics is in the middle of a war between the formal grammar camp and the statistical grammar camp. There’s no consensus on which is the *correct* model of language. However, in the last decade or so, the statistical side seems to have gained the advantage. If you really want to dig into this war, here’s a challenging read.

Additional Reading:
Linguists who count (the comments are especially engaging; your teacher might be particularly interested in the calculus vs. algebra debate that ensues).

I hope this gets you off to a good start. Please don’t hesitate to ask for clarifications or more resources (especially let me know if you need more intro level or more advanced level; I wasn’t sure if I hit the level right or not). I’m happy to be of more assistance if I can. As a smart, dedicated student, I’m sure you’re ready to dig into ngrams and Markov models. But, as a high school junior in southern California with June fast approaching, I’m also sure you’re ready for the beach. Both are required for a healthy life of the mind.