Saturday, May 31, 2008

Linguists and The Semantic Web

Via Sitemeter, I discovered that someone from Chrahnoh ('Toronto') stumbled onto my blog by Googling "How would a linguist respond to the semantic web". Having never actually posted on the topic, I nonetheless found it an intriguing question worth some follow-up.

As a primer, the semantic web is a movement, of sorts. Its goal is to make data on the web more easily processed by computers by categorizing it better. The point is to make humans do less and computers do more. This should make the web more efficient for humans because it will make finding things and doing things online easier and faster.

There are several semantic web strategies, but they mostly involve categorization, as far as I can tell. The idea is that a pre-categorized set of web pages is easier to automatically sort and process than non-categorized pages, just like a library. If a library is a pile of books on the floor, it will be difficult to find what you want. But if that library is organized alphabetically and cross-referenced every which way, it is much easier to use. So, the semantic web is an attempt by humans to über-categorize web pages. This can be done by enforcing mark-up standards like HTML, which already requires web page source code to look a certain way. It could also be accomplished by post-processing: after someone has put up a web page, a bot comes along, processes it, and then assigns some categorization/indexing (this is Google-like). We're getting into heavy philosophical territory here, the kind that befuddled the greatest minds in history, including Wittgenstein, Russell, and Aristotle. There is a long and difficult history behind the idea of trying to categorize the way the world is -- ontology.

My first impression is that linguists would love this. Linguists love ontologies and rules and categorization. Yippie! Linguists would insist on a certain cognitively natural ontology, but the basic idea fits nicely into the zeitgeist of traditional linguistics.

Having said that, this lone lousy linguist has trepidations. It seems ass-backwards. Imagine I started a movement to make the world an easier place to live in, so my strategy was to walk around sticking post-it notes onto EVERYTHING. If we could just put all the necessary post-it notes onto everything in the whole world, then everyone would know what everything is just by looking at the notes. Cool idea, huh!

No, bad idea. It's a classic fool's errand. While there may be a universal ontology, no one knows what it is. More to the point, it puts effort in the wrong place. We humans don't have post-it notes on everything to look at. We have a cognitive capacity that helps us look at new things and figure out what to do with them. We all have a super-Google in our heads, developed by evolution over a million years. It's not clear how much categorization information we store, but we clearly store associations between things. But I think it is the strategies for dealing with new things that make human cognition so powerful, not a reliance on fitting things into an ontology.

I think that's the right model for the web. Let everyone put everything online. Develop smart search technologies that can look at new, uncategorized things and figure out what to do with them right now, on the fly.


Chris said...

I'll be the first to criticize my own post: Isn't it the ability to categorize on the fly that makes human cognition able to deal with new things? Isn't this what neural nets do? Aren't they just categorization machines?

Yes, they are in the sense that they come to be functionally associated by recognizing input and producing a predictable output (the letters d-o-g are recognized and "categorized" as a single word). So isn't it the case that "the ability to deal with new things" and "the ability to categorize" are really the same thing? Sure. So exactly what is my critique of the semantic web?

I guess what I find most unsavory about the semantic web is any attempt to enforce a "single", fixed ontology. I'd rather see a technology that finds functional associations between things and uses those associations for current decisions, then happily abandons the associations unless there is good reason to store them (e.g., frequency).
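To make that last idea a little more concrete, here is a minimal Python sketch of "use associations now, keep only the frequent ones". Everything in it is hypothetical and for illustration only: the class name AssociationMemory, the keep_threshold parameter, and the toy observations are assumptions, not an existing technology or library.

```python
from collections import Counter
from itertools import combinations


class AssociationMemory:
    """Toy store of pairwise associations that forgets infrequent ones."""

    def __init__(self, keep_threshold=2):
        # Hypothetical cutoff: a pair must be seen this many times to survive pruning.
        self.keep_threshold = keep_threshold
        self.counts = Counter()

    def observe(self, items):
        """Record every pair of items seen together and return those pairs
        so they can be used for the current decision."""
        pairs = [frozenset(p) for p in combinations(sorted(set(items)), 2)]
        self.counts.update(pairs)
        return pairs

    def prune(self):
        """Abandon associations that have not been seen often enough."""
        self.counts = Counter(
            {pair: n for pair, n in self.counts.items() if n >= self.keep_threshold}
        )

    def associated(self, a, b):
        """True if the pair is currently stored."""
        return self.counts[frozenset((a, b))] > 0


memory = AssociationMemory(keep_threshold=2)
memory.observe(["dog", "leash", "park"])
memory.observe(["dog", "leash", "vet"])
memory.prune()  # only the dog/leash pair was seen twice, so only it is kept
```

No fixed ontology is imposed here: the only thing that persists is whatever co-occurs often enough, and the threshold is an arbitrary choice that a real system would have to tune.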

Moses said...

Well, yes, such software is under development; or at least that's the goal of many institutions and companies.
But in the interval, the imposition of standard schemas and standards such as XML makes this easier. For instance, adherence to standards such as Dublin Core allows automatic resolution of (for instance) whether a text is written by a Ms Washington, in Washington (DC or County Durham), or about Washington (person or place).
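A rough sketch of the Washington example, in Python. The field names creator, subject, and coverage are real Dublin Core elements, but the records, the titles, and the helper roles_of are made up for illustration, and using coverage as a stand-in for "the place associated with the text" is a simplifying assumption.

```python
# Hypothetical records tagged with three Dublin Core elements.
records = [
    {"title": "Collected Letters", "creator": ["Washington"], "subject": [], "coverage": []},
    {"title": "A Capital Guide", "creator": ["Smith"], "subject": [], "coverage": ["Washington"]},
    {"title": "The First President", "creator": ["Jones"], "subject": ["Washington"], "coverage": []},
]

# The element a name appears under tells a program what role the name plays.
ROLE_FIELDS = ("creator", "subject", "coverage")


def roles_of(name, record):
    """Return the metadata fields in which `name` appears for this record."""
    return [field for field in ROLE_FIELDS if name in record.get(field, [])]


# Map each title to the role(s) "Washington" plays in it.
resolved = {r["title"]: roles_of("Washington", r) for r in records}
```

The same string "Washington" appears in all three records; it is the element it sits in, not any analysis of the text, that separates author from topic from place.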

Moses said...

The other thing is that even our own recognition abilities are not that great. That's why we assign part numbers and diagrams to engineering components and why we do use filing systems. It's better to accept a partial solution available now than to wish for perfection somewhere down the line. In my opinion, anyway.

Chris said...

Certainly true. Professional engineers often have the goal of making the interim easier. It's academics who try to solve the big picture all at once.

We can't get away with pure chaos (not yet). Browsers need certain information just to display the page. If I want a word to appear boldfaced, I need a way to tell the browser to display the text in boldface (a mark-up convention), and it's preferable to have a mark-up convention that is standard so my word appears boldfaced in Explorer, Firefox, Opera, or whatever.

But I think it's also wise to keep those conventions to a minimum, whereas the semantic web seems to want to maximize the conventions.

I could be wrong, of course.
