Monday, September 29, 2008

Txt Parsing

David Crystal's newest book, Txtng: The Gr8 Db8, has just been released. I'm looking forward to reading it (though I'll likely wait until the paperback is available ... I'm staunchly anti-hardback). From the book's Amazon synopsis:

"Does texting spell the end of literacy? Is there a panic in the media? David Crystal looks at the evidence. He investigates how texting began and who uses it, why and what for. He shows how to interpret its mix of pictograms, logograms, abbreviations, symbols, and wordplay, and how it works in different languages. He explores the ways similar devices have been used in different eras and discovers that the texting system of conveying sounds and meaning goes back a long way, all the way in fact to the origins of writing - and he concludes that far from hindering literacy, texting may turn out to help it."

My colleagues and I were wondering whether any NLP work is being done on parsing text messages. I haven't been able to find anything. Since there is a growing market for things like machine translation of text messages, I gotta believe somebody out there is researching this. But has anything been published?

The linguistics of texting was, in fact, the topic of my very first post on this blog.

My basic point last year was this: "I've noticed that, in the context of email and online slang/abbreviations, the character "8" is the only number or character that gets used to replace a phonological rime (a nucleus plus a coda). Most other replacements either replace whole syllables, or just consonant clusters.

For example (from Wikipedia's "List of Internet slang phrases" [note: this page no longer exists on Wikipedia so I linked to the Simple English page that copied it])

2L8 -- too late
GR8 -- great
H8 — Hate
L8R — Later (sometimes abbreviated to L8ER)
M8 — Mate
sk8/sk8r — skate/skater
W8 — Wait"
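Here is a minimal sketch of how that observation could feed a text-message normalizer, the kind of preprocessing a parser would probably need. Everything in it is an assumption made for illustration: the three rime spellings, the toy vocabulary, and the function name expand_eight are hypothetical and do not come from any published system. It only handles "8", so a form like 2L8, where "2" replaces a whole syllable, is left unresolved.

```python
# Assumed spellings of the rime that "8" stands in for (as in "late", "wait", "great").
RIME_SPELLINGS = ("ate", "ait", "eat")

# Toy vocabulary, assumed for illustration; a real system would use a full lexicon.
VOCAB = {"great", "hate", "later", "mate", "skate", "skater", "wait"}


def expand_eight(token, vocab=VOCAB):
    """Return vocabulary words produced by swapping "8" for one rime spelling."""
    lowered = token.lower()
    if "8" not in lowered:
        # Nothing to expand; keep the token only if it is already a known word.
        return [lowered] if lowered in vocab else []
    # Try each spelling of the rime and keep the candidates that are real words.
    candidates = {lowered.replace("8", rime) for rime in RIME_SPELLINGS}
    return sorted(candidates & vocab)


if __name__ == "__main__":
    for form in ["GR8", "H8", "L8R", "M8", "sk8", "sk8r", "W8"]:
        print(form, "->", expand_eight(form))
```

Because "8" swaps in for a rime rather than a whole syllable, the onset (gr-, h-, sk-) and any trailing material (the -r in L8R) pass through untouched, which is why a plain string substitution is enough here.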

I hope Crystal discusses the linguistics of text formation.


Anonymous said...

How about other languages?
The US is several years behind the EU in its adoption of texting, so it's likely that there are other variants extant.

I've seen 7 (sept, pronounced "set") used for C'est (it is) and cette in French. That's similar to the '70s and '80s contraction of cassette (the taped music format) to K7.

You need some Canadian readers...


CJ said...

There's a patent you might find interesting:

Chris said...

Moses, you make a good point about other languages. I don't have any idea how texting affects the representation of words/phrases in French (or Japanese, Quechua, Russian, etc). I hope Crystal's book (which deals mostly with European texting conventions) covers this.

But if I understand the use of 7 you mentioned, it replaces a whole syllable, not part of a syllable, right? I think 8 is still the only character I've seen that replaces part of a syllable. I make a case for why in my original post.

CJ, thanks for the link. I'll take a look.

Jason M. Adams said...

I know of work dealing with predicting texting language. I haven't seen (or been able to find) anything dealing with parsing it. [pdf]

Moses said...

It would be interesting to have a look at German, where the part-syllable "ein" is a very common word component, as in "Stein", "Mein" etc. Do they use St1 and m1 in txt-spk?
If not, why not?

Similarly, the "acht" = 8 component is common, as in "Nacht", "wacht".

In the UK at any rate, the rise of predictive text means that text-speak is becoming less common rather than more so, since with predictive text it takes more key-presses and is slower.

Chris said...

Jason, thanks for the link. I'll take a look.

Moses, those are good suggestions. Also, interesting that texting might be unduly influenced by improved NLP.

So, what is French for LOL? German for BRB?

Jason M. Adams said...

It looks like Germans use BRB, according to Wikipedia's page of net-jargon abbreviations (Netzjargon).

So it would seem that Germans largely use English acronyms? That page is a frickin gold mine of German slang, btw.

BRB appears to be an abbreviation for Brandenburg too, so I'm having trouble confirming its actual usage. I did find this chat snippet (emphasis mine):

bluepixel: naja aber n notebook hat schon etwas mehr einsatzmoeglichkeiten
bluepixel: brb - gnome booten
bluepixel: so
rebugger: Nabend!

Chris said...

Jason, nice detective work. I'd like to follow up on this, but beer and politics are dominating my life right now. Soon, I'll be sure to ruminate on the role of English txt slang in world communication.
