Tuesday, May 27, 2014

mathematical linguistics for high school students

I received the following email this weekend:

I'm a high school junior from southern California.

For our final project in AP Calculus class, I'm doing a presentation on the connection between mathematics and linguistics, and I stumbled on your blogpost "Why Linguists Should Study Math" while researching my topic.

I was wondering if you could point me towards some resources (that are relatively easy to understand) about how math is present in and affects our written and spoken language.
Some things that I am considering are:
- the occurrences of words in our language
- how grammar uses mathematical principles
- algorithms we use to construct sentences

My [edited] response (suggestions from y'all as to better resources are much appreciated; I'll forward; I wanted to get a response out quickly because the final is presumably fast approaching):


Thanks for reaching out to me. Of course, I think you’ve chosen a good topic. There are two broad ways in which linguistics and math intersect:
  • How the human brain uses math in natural language (psycholinguistics)
  • How linguists use math to study and model languages (computational linguistics)
From your email, it appears you are mostly interested in #1. However, in contemporary linguistics, the two are fast becoming one. Most contemporary linguists use math as a tool.

Let me address your three areas of interest with respect to how the human brain might use math to process and produce language:

The occurrences of words in our language: For the most part, this means “frequency”, which really means counting. Linguists love to count. We use large corpora of texts to count words and phrases. Lancaster University in the UK is a well-known corpus linguistics school. Their web page has a lot of good introductory information (although I find it a bit clunky-looking).

UPDATE: I forgot to include the one item that most directly answers the basic question: frequency effects in language. Humans are very aware of how often they hear words. In some way, we count words automatically. Even if we don't keep a specific count like 75, somehow we know which words, phonemes, and syntactic structures we hear or read more often than others. This gives rise to a variety of frequency effects in language processing. This is the clearest example of how the brain uses math for language.

For example, we recognize high frequency words much faster than low frequency words. The website for Paul Warren's book "Introducing Psycholinguistics" has an online demo for a word frequency task you can walk through to see how linguists study this.
What do linguists count?
  • Words: I’m sure you’ve seen word clouds like Wordle. These are built from simple word frequency counts. One of the most enduring facts about word counts is Zipf’s Law, which says “the most frequent word [in a corpus of texts] will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.” Why would this be true? Linguists have been studying this for decades.
  • Ngrams: sets of two-word, three-word, four-word strings, etc. These provide more context than mere single word frequencies. Have some fun playing around with Google’s Ngram Viewer if you haven’t already. Try plotting the change in frequency of “mathematical linguistics” and “corpus linguistics” (paste those two phrases into the search box with no quotes and only a comma separating them). Scholars are trying to use this to plot changes in culture. For example, take a look at this PDF.
  • Other: We also count many other things, like parts of speech (verbs, nouns, prepositions, etc). We also count the co-occurrence of linguistic items that are not right next to each other. If you want to dig into more frequency fun, check out the more advanced tools at BYU. You can read more about how these tools help us study language here.
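
If you know a little Python, here is a minimal sketch of what "counting" looks like in practice. The sample sentence and the helper names `tokenize` and `ngrams` are made up for illustration; a real study would use a corpus of millions of words. With a text this tiny you won't see Zipf's Law emerge, but the rank-times-count column is the quantity that stays roughly constant in a large corpus.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy "corpus" (an assumption for illustration only).
text = "the cat sat on the mat and the dog sat on the rug"
tokens = tokenize(text)

word_counts = Counter(tokens)               # single-word frequencies
bigram_counts = Counter(ngrams(tokens, 2))  # two-word string frequencies

# Print rank, word, count, and rank * count for the most frequent words.
for rank, (word, count) in enumerate(word_counts.most_common(3), start=1):
    print(rank, word, count, rank * count)

# Print the two most frequent bigrams with their counts.
print(bigram_counts.most_common(2))
```

Swapping the toy sentence for the full text of a novel is a quick way to check how closely the counts follow Zipf's prediction.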

How grammar uses mathematical principles: One of the most commonly studied types of mathematical principle in language is statistical learning. A good example of this is transitional probabilities, which are sets of probabilities for what linguistic item might come next given a string of items (e.g., words or phonemes). For example, if you read “The author signed the _______”, you could guess what the blank word is based on the previous four words (most likely, it’s “book”). This example is modeled on psycholinguistic tests called “Cloze tests”. Linguists have discovered that the brain tracks transitional probabilities for all kinds of linguistic items. In fact, this is one of the most robust areas of study in language acquisition. Linguists study how babies use transitional probabilities to learn language. For example, one of the most challenging problems is figuring out how babies learn to separate a continuous stream of audio noise coming into their ears into separate words, without any knowledge of what words are or what they mean. One theory is that babies quickly learn transitional probabilities of sounds that tell them where one word ends and another begins. But transitional probabilities alone are not enough. For a challenge, try reviewing this PDF:
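
To make the idea of transitional probabilities concrete, here is a rough Python sketch of the word-boundary theory described above. Everything in it is an assumption for illustration: the three nonsense words, the order they are played in, the 0.9 threshold, and the function names. It estimates the probability of each syllable given the one before it, then guesses a word boundary wherever that probability dips. The toy stream is idealized (inside a word the next syllable is always the same), which is exactly why real speech needs more than this.

```python
from collections import Counter

def transitional_probabilities(items):
    """Estimate P(next item | current item) for each adjacent pair in the sequence."""
    pair_counts = Counter(zip(items, items[1:]))
    first_counts = Counter(items[:-1])  # the last item has no successor
    return {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}

def segment(items, probs, threshold):
    """Start a new word wherever the transitional probability falls below threshold."""
    words, current = [], [items[0]]
    for a, b in zip(items, items[1:]):
        if probs[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Three invented nonsense words, each made of three syllables (hypothetical data).
lexicon = {"A": ["bi", "da", "ku"], "B": ["go", "la", "bu"], "C": ["tu", "pi", "ro"]}
order = "ABCACBABCBAC"  # the words run together with no pauses between them
stream = [syllable for word in order for syllable in lexicon[word]]

probs = transitional_probabilities(stream)
print(probs[("bi", "da")])  # within a word
print(probs[("ku", "go")])  # across a word boundary
print(segment(stream, probs, threshold=0.9))
```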

Algorithms we use to construct sentences: This is the most controversial area you’ve asked about. The fact is, we linguists don’t really know how the brain constructs sentences. As I mentioned above, there are models based on transitional probabilities, like Markov models, a kind of computational model designed to make those same kinds of guesses we made about “book”. Markov models and Cloze tests are a good example of psycholinguistics and computational linguistics coming together. As a theoretical contrast to statistical models, there are rule-based models like formal grammars. These are not mathematical in a typical sense, but they are based on formal logic, which underlies much of mathematics. Linguistics is in the middle of a war between the formal grammar camp and the statistical grammar camp. There’s no consensus on which is the *correct* model of language. However, in the last decade or so, the statistical side seems to have gained the advantage. If you really want to dig into this war, here’s a challenging read.
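
For a feel of the two camps, here is a small Python sketch that puts them side by side. It is only an illustration under my own assumptions: the four training sentences, the five-rule grammar, and the names `train_markov`, `predict`, and `expand` are all invented, and neither half is a claim about how the brain actually builds sentences. The Markov half counts which word tends to follow a two-word context; the grammar half lists every sentence its rules allow, with no notion of which is more likely.

```python
from collections import Counter, defaultdict
from itertools import product

def train_markov(sentences, order=2):
    """Count which word follows each context of `order` preceding words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - order):
            context = tuple(words[i:i + order])
            model[context][words[i + order]] += 1
    return model

def predict(model, context):
    """Return the most frequent continuation of the context, or None if unseen."""
    followers = model.get(tuple(context))
    return followers.most_common(1)[0][0] if followers else None

# Statistical camp: guesses come from counts in a (made-up) training set.
sentences = [
    "the author signed the book",
    "the author signed the contract",
    "the author signed the book",
    "the reader opened the book",
]
model = train_markov(sentences, order=2)
print(predict(model, ["signed", "the"]))

# Rule-based camp: a toy formal grammar (hypothetical rules).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "N": [["author"], ["book"]],
    "V": [["signed"]],
}

def expand(symbol):
    """List every word sequence the grammar can derive from symbol."""
    if symbol not in GRAMMAR:  # a plain word, not a category
        return [[symbol]]
    results = []
    for rule in GRAMMAR[symbol]:
        for parts in product(*(expand(s) for s in rule)):
            results.append([word for part in parts for word in part])
    return results

for words in expand("S"):
    print(" ".join(words))
```

Notice that the grammar happily produces "the book signed the author": the rules say it is well formed, while the counts would say it is unlikely. That gap is a small version of what the two camps argue about.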

Additional Reading:
Linguists who count (the comments are especially engaging; your teacher might be particularly interested in the calculus vs. algebra debate that ensues).

I hope this gets you off to a good start. Please don’t hesitate to ask for clarifications or more resources (especially let me know if you need more intro level or more advanced level; I wasn’t sure if I hit the level right or not). I’m happy to be of more assistance if I can. As a smart, dedicated student, I’m sure you’re ready to dig into ngrams and Markov models. But, as a high school junior in southern California with June fast approaching, I’m also sure you’re ready for the beach. Both are required for a healthy life of the mind.


Alexa said...

It's so great to see high school students interested in comp ling!

May I suggest edit distance, also? We recently did a ling olympiad problem involving the role of edit distance in NLP programs like spell check, autocorrect, etc.

Chris said...

Alexa, yeah that's a good one. Didn't think about it. I should look at the Olympiad problem sets and see what else is there. Thanks!
