Saturday, May 31, 2008

Linguists and The Semantic Web

Via Sitemeter, I discovered that someone from Chrahnoh ('Toronto') stumbled onto my blog by Googling "How would a linguist respond to the semantic web". Having never actually posted on the topic, I nonetheless found it an intriguing question worth some follow-up.

As a primer, the semantic web is a movement, of sorts. Its goal is to make data on the web more easily processed by computers by categorizing it better. The point is to make humans do less and computers do more. This should make the web more efficient for humans because it will make finding things and doing things online easier and faster.

There are several semantic web strategies, but they mostly involve categorization, as far as I can tell. The idea is that a pre-categorized set of web pages is easier to automatically sort and process than non-categorized pages. Just like a library. If a library is composed of a pile of books on a floor, it will be difficult to find what you want. But if that library is organized alphabetically and cross-referenced every which way, it is much easier to use. So, the semantic web is an attempt by humans to über-categorize web pages. This can be done by enforcing mark-up standards like HTML, which already requires web page source code to look a certain way. It could also be accomplished by post-processing: after someone has put up a web page, a bot comes along, processes it, and then assigns some categorization/indexing (this is Google-like). We're getting into heavy philosophical territory here, the kind that befuddled the greatest minds in history including Wittgenstein, Russell, and Aristotle. There is a long and difficult history behind the idea of trying to categorize the way the world is -- ontology.

My first impression is that linguists would love this. Linguists love ontologies and rules and categorization. Yippie! Linguists would insist on a certain cognitively natural ontology, but the basic idea fits nicely into the zeitgeist of traditional linguistics.

Having said that, this lone lousy linguist has trepidations. It seems ass-backwards. Imagine I started a movement to make the world an easier place to live in, so my strategy was to walk around sticking post-it notes onto EVERYTHING. If we could just put all the necessary post-it notes onto everything in the whole world, then everyone would know what everything is just by looking at the notes. Cool idea, huh!

No, bad idea. It's a classic fool's errand. While there may be a universal ontology, no one knows what it is. More to the point, it puts effort in the wrong place. We humans don't have post-it notes on everything to look at. We have a cognitive capacity that helps us look at new things and figure out what to do with them. We all have a super-Google in our heads, developed by evolution over a million years. It's not clear how much categorization information we store, but we clearly store associations between things. But I think it is the strategies for dealing with new things that makes human cognition so powerful, not a reliance on fitting things into an ontology.

I think that's the right model for the web. Let everyone put everything online. Develop smart search technologies that can look at new, uncategorized things and figure out what to do with them right now, on the fly.



Friday, May 30, 2008

Globalization and Language

Freakonomics has a nice post asking the question "What Will Globalization Do to Languages?". They ask four language professionals (including Language Log's Mark Liberman) to respond. They have interesting views well worth the read (Liberman gives the most nuanced and linguistically savvy response, no surprise), but they all seem to agree that globalization and internet technologies are NOT going to allow English to "take over". Liberman offers this nugget of wisdom:

It’s obvious that globalized communications and popular culture will tend to homogenize local language varieties — but some varieties of English seem to be diverging more rapidly than ever.

I like John Hayden's point too:

English is a tool, just like a piece of technology. Much of the world’s economy is tied up in English-speaking countries and for that reason, English is like a cell phone provider offering the best plan. But if the dollar continues to drop, the most viable option could shift. Mexico and Korea don’t need English to communicate if Korea begins to find it profitable to learn Spanish.

Languages evolve via as-yet-unknown cognitive mechanisms. I suspect that "globalized communications and popular culture" will not change the way languages evolve. At best they will simply speed up the existing process.

Saturday, May 24, 2008

On the strengths and weaknesses of “theoretically”

Do scientists use the word “theoretically” the opposite of the way non-scientists do? Let’s see. Below, I walk through a cheap and dirty corpus linguistics method for investigating distributional differences.

This week a colleague of mine brought up an interesting point about the lay person’s use of the word “theoretically” and I thought it merited some investigation and a post. My colleague’s point sprang from a remark that a political pundit made regarding the West Virginia primary and the impact of racism. The pundit was making what is now a rather conventional observation: roughly 20% of white Clinton voters in West Virginia were willing to publicly admit that race was a factor in their voting decision (I don’t know if this is true or not, but it has been repeated in the mainstream media many times; here’s one AP example). If we assume that most people who are at least slightly racist are also at least slightly private in their racism, then we can deduce that more than 20% actually considered the race of the candidates when making their decisions, but some of them refused to admit it.

The pundit chose to express this reasoning by saying something like this (I don’t have a direct quote): ‘In WV, 20% of white Clinton voters admitted that race was a factor; theoretically, this means it’s possible that more than 20% actually considered race important’.

My colleague’s point was that this use of theoretically does not really mean the speaker is referencing some specific scientific theory which predicts a particular outcome. It means something closer to arguably or probably. I agreed and proposed that it might be acting like an inferential evidential, couching the claim in the guise of a fact derived by deductive reasoning.

This got me to wondering about how the word theoretically patterns in common usage. I decided to conduct a brief experiment comparing the words theoretically, arguably, and probably. In order to do this efficiently, I chose to use the freely available and wholly online Collins Cobuild Concordance and Collocations Sampler. This handy dandy online tool allows anyone to extract distributional facts about words quickly from a corpus of 56 million words composed of three subcorpora:

British books, ephemera, radio, newspapers, magazines (36m words)
American books, ephemera and radio (10m words)
British transcribed speech (10m words)

There are two basic tools available:

1) Corpus Concordance Sampler: provides the search word and the sentence it occurred in (well, not quite the sentence, but close enough for Saturday afternoon)

2) Collocation Sampler: provides the words that are statistically significantly associated with the search words (Mutual Information Score plus t-score of significance)
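
For readers who have not met these two scores before, here is a rough sketch of how they are commonly defined in corpus linguistics. The exact formulas the Collins sampler uses are not stated in this post, so treat the definitions below as an assumption. The corpus size (56 million) and the 8-word window come from this post; the node-word frequency of 200 and the two collocate counts are purely hypothetical, chosen to show how the scores can disagree.

```python
import math

CORPUS_SIZE = 56_000_000  # corpus size quoted in the post
SPAN = 8                  # window size assumed from the post's "within 8 words"

def expected_cooccurrences(node_freq: int, collocate_freq: int) -> float:
    """Co-occurrences expected if the two words were distributed independently."""
    return node_freq * collocate_freq * SPAN / CORPUS_SIZE

def mutual_information(joint_freq: int, expected: float) -> float:
    """MI: log2 of observed over expected; rewards strong but possibly rare pairings."""
    return math.log2(joint_freq / expected)

def t_score(joint_freq: int, expected: float) -> float:
    """t-score: observed minus expected, scaled by sqrt(observed); rewards well-attested pairings."""
    return (joint_freq - expected) / math.sqrt(joint_freq)

NODE_FREQ = 200  # hypothetical frequency of the search word

# Hypothetical collocates: one very frequent word, one rare word.
frequent = {"collocate_freq": 500_000, "joint_freq": 30}
rare = {"collocate_freq": 1_000, "joint_freq": 6}

scores = {}
for name, row in (("frequent", frequent), ("rare", rare)):
    exp = expected_cooccurrences(NODE_FREQ, row["collocate_freq"])
    scores[name] = (mutual_information(row["joint_freq"], exp),
                    t_score(row["joint_freq"], exp))
    print(name, "MI = %.2f" % scores[name][0], "t = %.2f" % scores[name][1])
```

With these made-up counts the rare collocate gets the higher MI while the frequent collocate gets the higher t-score, which is one reason the two scores tend to be reported side by side.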

Let’s start with collocates. Below I’ve pasted the ten most significant collocates for the three words under investigation. You could interpret this table as saying something like this: the pronoun it occurs within 8 words of theoretically more often than random co-occurrence would predict, and the t-score of roughly 3.5 indicates how confident we can be that the difference is not due to chance (it is a significance score, not a ratio of likelihoods). It is standard within corpus linguistics to interpret this non-random co-occurrence to mean that these two words have some semantically meaningful association (pssst, we should be careful not to over-interpret the meaning of the co-occurrence behavior of pronouns).

theoretically: Collins t-score

Collocate   Corpus Freq   Joint Freq   Significance
it          494702        33           3.555134
could       59556         12           3.027003
least       12333         8            2.717569
possible    10266         6            2.342936
be          234656        15           2.332595
can         113012        9            2.042261
was         340423        16           1.83627
should      35882         5            1.828091
less        14186         4            1.819667
is          407114        18           1.803011

arguably: Collins t-score

Collocate   Corpus Freq   Joint Freq   Significance
the         2313407       179          7.308255
most        43653         53           7.069587
is          407114        56           5.573264
best        20161         32           5.531725
greatest    2506          13           3.581149
important   13468         12           3.327601
was         340423        29           3.165729
finest      1067          7            2.631592
beautiful   4076          7            2.591662
more        94468         11           2.316599

probably: Collins t-score

Collocate   Corpus Freq   Joint Freq   Significance
would       97660         1076         26.18472
it          494702        2333         25.53529
will        111798        1092         25.52538
i           512080        2204         22.70137
think       70465         779          22.29877
you         421797        1797         20.274
ll          34908         545          20.02152
be          234656        1159         18.72308
d           43704         462          16.97461
most        43653         457          16.83863

Note that “d” is likely the contraction for “would” and “ll” is the contraction for “will”.
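
As a sanity check on how to read the Significance column, the figures for theoretically appear to be consistent with the usual collocation t-score, (observed - expected) / sqrt(observed), where the expected count is proportional to the collocate's corpus frequency. The sampler's actual formula is not documented in this post, so this is an assumption; and because the corpus frequency of theoretically itself is not given, the sketch backs the proportionality constant out of the "it" row and then tests it on the other rows.

```python
import math

# (collocate, corpus freq, joint freq, reported significance), copied from the
# "theoretically" collocate table in this post.
ROWS = [
    ("it", 494702, 33, 3.555134),
    ("could", 59556, 12, 3.027003),
    ("least", 12333, 8, 2.717569),
    ("was", 340423, 16, 1.83627),
    ("is", 407114, 18, 1.803011),
]

def t_score(joint_freq: int, expected: float) -> float:
    """Assumed formula: (observed - expected) / sqrt(observed)."""
    return (joint_freq - expected) / math.sqrt(joint_freq)

def implied_rate(corpus_freq: int, joint_freq: int, reported_t: float) -> float:
    """Invert the assumed formula to get expected co-occurrences per collocate token."""
    expected = joint_freq - reported_t * math.sqrt(joint_freq)
    return expected / corpus_freq

# Calibrate on the first row only, then predict the remaining rows.
_, it_freq, it_joint, it_t = ROWS[0]
rate = implied_rate(it_freq, it_joint, it_t)

predictions = []
for word, corpus_freq, joint_freq, reported in ROWS[1:]:
    predicted = t_score(joint_freq, rate * corpus_freq)
    predictions.append((word, predicted, reported))
    print(f"{word:6s} predicted {predicted:.4f}  reported {reported}")
```

If the assumed formula is right, the predicted and reported columns should agree closely for every row, not just the one used for calibration.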

Analysis: the fact that jumps out at me is the pervasiveness of modal verbs on the lists for theoretically and probably. They both have four modals in their top ten. In all cases, they seem to be expressing epistemic modality in order to hedge the certainty of whatever is being claimed. On the other hand, the word arguably has zero modals in its top ten, but it does have four superlatives, while theoretically has zero superlatives and probably has only one. A picture appears to be emerging.
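
The tallies in this analysis can be reproduced with a few lines of code, though the result depends on what one counts. In the sketch below the word lists are copied from the three tables, while the two category sets are my own assumptions: "possible" is counted as a modal marker alongside the modal verbs and the clitics "ll" and "d", and "least" is left out of the superlatives on the guess that it mostly reflects the fixed phrase "at least".

```python
# Top-ten collocates, copied from the three tables in this post.
TOP_TEN = {
    "theoretically": ["it", "could", "least", "possible", "be",
                      "can", "was", "should", "less", "is"],
    "arguably": ["the", "most", "is", "best", "greatest",
                 "important", "was", "finest", "beautiful", "more"],
    "probably": ["would", "it", "will", "i", "think",
                 "you", "ll", "be", "d", "most"],
}

# Assumption: modal verbs, their contracted forms, and the modal adjective "possible".
MODAL_MARKERS = {"could", "can", "should", "would", "will", "ll", "d", "possible"}

# Assumption: "least" is excluded (see the note above).
SUPERLATIVES = {"most", "best", "greatest", "finest"}

def tally(words, category):
    """Count how many of the collocates fall into the given category."""
    return sum(1 for w in words if w in category)

counts = {
    word: {"modals": tally(collocates, MODAL_MARKERS),
           "superlatives": tally(collocates, SUPERLATIVES)}
    for word, collocates in TOP_TEN.items()
}

for word, c in counts.items():
    print(f"{word:14s} modals={c['modals']} superlatives={c['superlatives']}")
```

With a stricter list of modal verbs only, theoretically would drop to three (could, can, should), so the four-and-four pattern holds only under the looser definition.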

First Pass Interpretation: All three words appear to be used as hedges. But theoretically and probably appear to pattern closely together as generic hedges, while arguably seems to be a hedge that is strongly associated with superlative claims.

It seems to me that this use of theoretically as a hedge in lay discourse is in contrast with its use in scientific discourse. I would assume scientists are more likely to use theoretically in a sentence to strengthen their claim, not weaken it. However, in non-scientific circles I would guess that scientific theory is regarded with some suspicion. When a claim is based on a theory, non-scientists are more likely than not to consider it not really true. Using the word theoretically, for the lay person, is a way to say “I’m not sure…maybe not.”

Now let’s look at the concordance. I took the concordance output and performed a little judgment task. For each output set, I created two alternative documents by replacing the target word with one of its alternatives. So the theoretically document had a theoretically_into_probably alternative and a theoretically_into_arguably alternative. I used a five-point scale to judge the synonymy of each substitution (i.e., I asked myself ‘does this sentence mean the same thing with the substitution?’):

1. clearly the same meaning
2. kind of the same meaning
3. can’t decide
4. kind of different meaning
5. clearly different meaning
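
The substitution step is easy to automate. The sketch below is not the procedure I actually used (I did it by hand); it is a minimal illustration, with a hypothetical helper named substitute, of how each concordance line could be rewritten with the alternative word while keeping sentence-initial capitalization.

```python
import re

def substitute(line: str, target: str, alternative: str) -> str:
    """Replace whole-word occurrences of target with alternative, preserving initial capitals."""
    pattern = re.compile(rf"\b{re.escape(target)}\b", re.IGNORECASE)

    def repl(match: re.Match) -> str:
        # Capitalize the replacement only when the original word was capitalized.
        if match.group(0)[0].isupper():
            return alternative.capitalize()
        return alternative

    return pattern.sub(repl, line)

# Two concordance lines taken from the theoretically sample in this post.
lines = [
    "Theoretically, underdevelopment could last forever.",
    "for me the course was very stimulating theoretically and practically.",
]

for alternative in ("probably", "arguably"):
    for line in lines:
        print(substitute(line, "theoretically", alternative))
```

Each rewritten line then gets a score from 1 to 5 on the scale above.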

The winner, so to speak, of the most similar prize seems to be arguably into probably. This is to say that I find it generally synonymous to replace the word arguably with probably. Although the reverse, probably into arguably, was pretty good too.

The cases where the theoretically into probably substitution works best seem to be the sentence-initial cases where the word is acting as an adverbial hedge over the proposition encoded by the whole clause. But these cases are few. Although these two words seemed to have similar collocates, they do not seem very similar in their actual distributions. This is driven, I think, by the existence of the semantically dissimilar technical use for theoretically. Neither probably nor arguably has this sort of clearly technical usage. Statisticians would likely use the noun probability rather than the adverb probably. I doubt logicians use arguably at all.

theoretically into probably

score  sentence
5      on the way much of the material treated theoretically probably in the lectures.
5      for me the course was very stimulating theoretically probably and practically.
4      Theoretically Probably, this means that at one point during the
4      Theoretically Probably, HMS Champion has been reduced to a lot
5      Let's say it could be done -- theoretically probably."
2      theoretically probably followed in the footsteps of the more
5      Theoretically Probably, underdevelopment could last forever. His
3      Theoretically Probably, it is based on the false notion that
5      `It" exists simply because it is theoretically probably defined,
2      The Social Democrats theoretically probably could now force the Kohl government to

theoretically into arguably

score  sentence
5      on the way much of the material treated theoretically arguably in the lectures.
5      for me the course was very stimulating theoretically arguably and practically.
2      Theoretically Arguably, this means that at one point during the
2      Theoretically Arguably, HMS Champion has been reduced to a lot
4      Let's say it could be done -- theoretically arguably."
4      Theoretically Arguably, underdevelopment could last forever.
5      probably claim that their disciplined, theoretically arguably-informed way of perceiving the social
5      Theoretically Arguably, it is based on the false notion that
5      `It" exists simply because it is theoretically arguably defined, and the ensuing model is then
4      The Social Democrats theoretically probably could now force the Kohl government to

arguably into probably

score  sentence
2      The finishes are arguably probably the finest in Tudor Court. [p] Adjacent to
2      Arguably Probably one of the finest album releases this year.
2      Orfeo is arguably probably the greatest dance work created in Britain
1      A really dark orange, arguably probably the darkest we have ever seen.
1      for instance the first, and arguably probably most important, aim to develop lively
3      Indeed, songs like `Green', arguably probably the standout track of their debut,
2      it belongs to arguably probably the toughest bunch of surfers in the world.
2      Wayne Westner, arguably probably the longest hitter on the European Tour,
2      Arguably Probably, Sheila Henry's behaviour in holding the
2      A week there is arguably probably the most interesting seven days currently

arguably into theoretically

score  sentence
5      The finishes are arguably theoretically the finest in Tudor Court.
5      Arguably Theoretically one of the finest album releases this year.
5      Orfeo is arguably theoretically the greatest dance work created in Britain
5      A really dark orange, arguably theoretically the darkest we have ever seen.
5      for instance the first, and arguably theoretically most important, aim to develop lively
5      Indeed, songs like `Green', arguably theoretically the standout track of their debut,
5      it belongs to arguably theoretically the toughest bunch of surfers in the world.
5      Wayne Westner, arguably theoretically the longest hitter on the European Tour,
5      Arguably Theoretically, Sheila Henry's behaviour in holding the
5      A week there is arguably theoretically the most interesting seven days currently

probably into arguably

score  sentence
2      your property is probably arguably worth far more than you owe on it.
5      You've probably arguably guessed that we couldn't resist using white
2      It probably arguably would, but you'd just have been postponing
4      It probably arguably sounds silly,' she adds,
3      We came to agreement at 1am, probably arguably at the highest pitch of drunkenness
4      a member and an ANC team which it said would probably arguably include Mr Mandela.
5      I'm probably arguably all wrong," he said.
4      it was the last that probably arguably accounted for their rapid answers.
2      Those in the high intelligence group are probably arguably similarly subject to difficulties
1      It was probably arguably the most comprehensive compilation of facts

probably into theoretically

score  sentence
4      your property is probably theoretically worth far more than you owe on it.
5      You've probably theoretically guessed that we couldn't resist using white
3      It probably theoretically would, but you'd just have been postponing
5      It probably theoretically sounds silly,' she adds, `but when it was
5      We came to agreement at 1am, probably theoretically at the highest pitch of drunkenness
4      a member and an ANC team which it said would probably theoretically include Mr Mandela.
5      I'm probably theoretically all wrong," he said.
5      it was the last that probably theoretically accounted for their rapid answers.
5      Those in the high intelligence group are probably theoretically similarly subject to difficulties
5      It was probably theoretically the most comprehensive compilation of facts
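
To put a number on the "winner", one simple option is to average the ten scores for each substitution; lower means closer to synonymous. The scores below are copied from the six sets above. Treating a five-point judgment scale as something you can average is itself a simplifying assumption, and with one judge and ten sentences per set this is only a rough summary.

```python
from statistics import mean

# Judgment scores (1 = clearly the same meaning, 5 = clearly different),
# copied from the six substitution sets in this post.
SCORES = {
    ("theoretically", "probably"): [5, 5, 4, 4, 5, 2, 5, 3, 5, 2],
    ("theoretically", "arguably"): [5, 5, 2, 2, 4, 4, 5, 5, 5, 4],
    ("arguably", "probably"): [2, 2, 2, 1, 1, 3, 2, 2, 2, 2],
    ("arguably", "theoretically"): [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    ("probably", "arguably"): [2, 5, 2, 4, 3, 4, 5, 4, 2, 1],
    ("probably", "theoretically"): [4, 5, 3, 5, 5, 4, 5, 5, 5, 5],
}

# Rank substitutions from most to least synonymous.
ranking = sorted(SCORES, key=lambda pair: mean(SCORES[pair]))

for original, replacement in ranking:
    avg = mean(SCORES[(original, replacement)])
    print(f"{original} into {replacement}: mean score {avg:.1f}")
```

On these averages arguably into probably comes out most similar, followed by probably into arguably, which matches my impression from reading through the sentences.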

Saturday, May 17, 2008

“hypercompetent”

In a recent article on Slate here, I ran across the following sentence:

“One of the most important figures in the presidential campaign this fall is a controversial, hypercompetent blonde.” (my emphasis)

I find the term “hypercompetent” to be a bit oxymoronic. It’s like saying hypermediocre. Is it really a compliment? Is this ultimately sexist? Was the author trying to avoid using some male-oriented word to describe an ambitious, successful woman? Why reference her hair color?

Recently, polyglot conspiracy has posted about the sexism in the current political media coverage, and this may be an example.

Saturday, May 10, 2008

"Love Means Never Having to Say ..."

There is a talented Cuban blogger named Yoani Sánchez at Generation Y. She's a wonderful writer and thoughtful blogger. The fact that she's managed to maintain her blogging life while explicitly repressed by her government (they've taken away her passport, amongst other things) is inspiring. (HT Daily Dish)

But I'm a linguist, so let's get down to business. As far as I can tell, she blogs in THREE languages!! Spanish, German, and English. Her most up-to-date posts appear to be in Spanish, so I presume this is her blog language of choice. However, as the weeks and months go by, some of her older posts appear in either English or German translations. I'm curious to know if she is translating these herself, or getting someone to translate for her? Some of the English is quite good and enjoyable (with occasional stutters, of course).

The current English post (from March 5) is on apologies. Linguists have long been interested in the apology as a speech act, of course. There are whole subfields of sociolinguistics and discourse pragmatics devoted to it.

I've long felt that my use of the casual apology has little to do with any attempt on my part to ask for forgiveness. The most common situation in which I use "I'm sorry" or "excuse me" is one where someone else has made a mistake of some sort. Imagine I'm walking into a store and someone has mistakenly used the entrance as the exit and he bumps into me. I would most likely mumble lightly, "oops, sorry". Clearly, I am not at fault, yet I issue the apology. Why?

Here is my I-haven't-read-Grice-in-years analysis: by taking blame, so to speak, I am able to quickly signal to the offender that I am not issuing blame to them. Since they know they are to blame and not me (and they know that I know, blah blah blah), they can infer via the Maxim of Quality that I must be saying something else, like an indirect speech act. Using some chain of Gricean inference, they can probably construct the interpretation that I'm really saying "no apology is necessary".

It is an easy way for me to defuse their trepidation about MY reaction. At around 6 foot 4 inches and 260lbs, I know I'm an intimidating presence. I don't want the other person to feel that their small mistake will be turned into a big one by the overreaction of some lumbering giant (actually, I'm quite quick on my feet, I was a helluva wrestler once, ya know).

So, here's 11 ways to say you're sorry (HT SenseList)

Catalan: Ho sento
Croatian: Žao mi je
Czech: Promiňte
Danish: Undskyld
Finnish: Anteeksi
Flemish: Het spijt me
Hungarian: Sajnos
Luxembourgish: Et deet mer leed
Maltese: Ma nitkellimx bil-Malti
Norwegian: Beklager
Polish: Przepraszam

Cheers.

Monday, May 5, 2008

The Perils of Planning

I just can’t get this construction to work for me:

“Such a decision would give Clinton an estimated 55 or more delegates than Obama, according to Clinton campaign operatives.”

This comes from Huffington Post contributor Thomas B. Edsall. To me, the construction “or more” must always be optional. In other words, you should always be able to delete “or more” and the sentence should mean roughly the same thing. But in this case, deleting “or more” would cause clear ungrammaticality to ensue:

*Such a decision would give Clinton an estimated 55 delegates than Obama…

Ugh!

My guess is that Edsall had a “X more Y than Z” construction planned, then decided to throw in a little “or more” for flavor, but he was faced with the catastrophic prospect of TWO more’s in ONE sentence, right next to each other! Gasp! That can’t be right, right?

*Such a decision would give Clinton an estimated 55 or more more delegates than Obama…

Well, for this speaker of Northern California English (truly, the finest of all the Englishes, as well you may know), “or more more” sounds better than the original.

Sunday, May 4, 2008

Iron Man Linguistics

I just saw Iron Man (no no, this is not another movie review ... but you can still read my Forgetting Sarah Marshall and Juno discussions). There is an interesting linguistic side-point to be made about language diaspora in Afghanistan. As the movie opens, our hero, Tony Stark, is kidnapped near Bagram Air Base in northwest Afghanistan's Parwan Province. He is held captive with one other prisoner, a local Afghan doctor named Yinsen (a name carried over from the original comic book I believe, so not particularly Afghan) who says he's from a small town named "Gulmira" (I couldn't find any real town by that name, though it seems to be a fairly common given name). Luckily for Stark, Yinsen speaks "many languages", so he's able to understand some of their captors' shouts and orders, but not all (an interesting aside: the actor who plays Yinsen, Shaun Toub, has a backstory worthy of its own screenplay).

You see, the group which has kidnapped the unfortunate pair goes undefined throughout the movie. We are largely left to draw our own conclusions about their origin, ideology, and motivation (though we get some minor clarification late in the movie). The one thing we learn about their diversity is that they speak a wide variety of languages, as Yinsen lists some of them for Stark. I don't remember the full list, but I believe they included "Arabic, Ashkun, Farsi, Pashto" amongst others. So, kudos to the screenwriters for, at the very least, scanning Ethnologue for an appropriate set of languages to list.

But there's one other language that Yinsen mentions, and it got my attention: Hungarian. A few scenes after Yinsen lists the various languages the group speaks (a list that does not include Hungarian), he and Stark are being yelled at by an unnamed thug. Stark asks Yinsen what he's saying and Yinsen says something like "I don't know. He's speaking Hungarian."

This was meant as a bit of comic relief, I believe. So the screenwriters may have chosen Hungarian at random. Perhaps any language that American audiences would perceive as unusual or unexpected would have done the trick. Perhaps it would have been even funnier if he said "I dunno, he's speaking Comanche (ba dum boom!)." I don't know, but my linguistics radar picked it up and I went searching for any connections Hungary might have with Afghanistan.

Alas, I have found few. I would have to make some serious leaps of logic to connect the dots, and I don't think the movie was going for that. The clarifying scenes late in the movie suggest that this group's motivations are largely financial, not ideological or political, so we might assume this was some random Hungarian mercenary. As far as I can tell, this is the most logically consistent interpretation (unless I've misunderstood the movie's plot or dialogue, in which case ... never mind).
