The Lousy Linguist: September 2013

Sunday, September 22, 2013

Ask Ziggy What?

Every now and then a new company or tech tool blips on my radar and piques my curiosity. Recently, I ran across Ask Ziggy, a Sacramento, California-based NLP start-up. No, that's not a typo; they really are based in Sacramento (well, technically Rocklin, the Arlington of Sacramento).

This piqued my curiosity first and foremost because that's about 90 minutes from where I grew up in the Sacramento Valley, an area well known as a hot bed of dry summer dust, but not known as a hot bed of NLP start-ups. Then again, one of the new hot beds of tech is Austin, Texas. So hey, if it can happen in Austin, why can't it happen in Sacramento?

Before I go on, let me make it clear that I do not work for Ask Ziggy in any way and this is not a sponsored blog in any way. These thoughts are entirely my own. This is my personal blog and all content is my own and reflects my honest, personal opinions.

As I flipped through Ask Ziggy's web pages, four things occurred to me:

1) "Ask Ziggy" as a brand is eerily reminiscent of "Ask Jeeves".
2) Their core goal is making it easier for app developers to use NLP speech science.
3) They have received $5 million in VC funding.
4) Is this the start of a Sacramento NLP community?

1) Ask Jeeves: Most folks in the NLP community recall Ask Jeeves, a question answering search engine from the 1990s that was going to revolutionize search. Unfortunately, Google revolutionized search way better than they did, and Ask Jeeves was forced into a "business cycle" of layoffs, booms, layoffs, booms. Today, they're best known for that annoying Yahoo! toolbar extension.

2) Making Speech Science Easy: Since Ask Ziggy is currently in "private beta," I'm actually not exactly sure what they do, but it seems like they empower an app developer to allow a user to make relatively unconstrained natural language voice commands, and their NLP technology magically "figures out" what action is appropriate (given the app's basic goals and functionality). So, maybe a music app could allow a user to speak aloud "I wonder what MIA's new song sounds like?" and Ask Ziggy's tech figures out that that's equivalent to the action command [PLAY MIA New Song].
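To make the idea concrete, here is a toy sketch of mapping a free-form utterance to an action command. It is purely my own illustration: the rule table, the `utterance_to_action` function, and the single regex pattern are all hypothetical, and nothing here reflects how Ask Ziggy's technology actually works (which isn't public).

```python
import re

# Hypothetical rule table: each pattern maps an utterance shape to an
# action template. This is an illustration only, not Ask Ziggy's method.
RULES = [
    (re.compile(r"what (?P<artist>\w+)'s new song sounds like", re.IGNORECASE),
     "[PLAY {artist} New Song]"),
]


def utterance_to_action(utterance):
    """Return the action command for the first matching rule, or None."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Fill the template with whatever the pattern captured.
            return template.format(**match.groupdict())
    return None


print(utterance_to_action("I wonder what MIA's new song sounds like?"))
```

A hand-written rule like this is exactly what doesn't scale, which is presumably the burden a service like this would take off an app developer's hands.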

If that's true, then that would be awesome. It is a common complaint against Siri that it doesn't "understand" a lot of commands. Maybe Ask Ziggy is applying some bleeding edge NLP, informed by contemporary psycholinguistics, to bridge the gap. Dunno. It's not clear what their special sauce is from their promotional materials, but I like the idea of relieving average app developers of the burden of learning speech science just to add voice activation to their app.

3) Five Million Dollars! Maybe I'm jaded at this point, but $5 million in VC funding is a drop in the bucket in serious NLP development-land. $5 million equals maybe 2-3 years for a modest-sized group, maybe 5 years for a really small group. They received this funding near the end of 2012, and it's now near the end of 2013. They'd be lucky to have $3.5 million left, with the clock ticking. It's great to get VC funding, but it's greater to get customers. What is their plan for 2015? That's the money year, as far as I can tell.

4) Sacramento is the New Google? It's great to see Sacramento developing a tech community, especially in NLP. Unlike the energy industry, the computer tech industry doesn't need natural resources nearby, so it's not tied to geography like coal, oil, or natural gas. Any two-bit town can become a tech powerhouse (I'm looking at you, Redmond, Washington). Any community of practice fosters creativity and innovation. There is no a priori reason that Sacramento could not become a new generator of NLP technologies and innovation. It only requires the techies in that area to know each other, meet regularly, be open minded, and ... oh yeah, have access to that $5 million in venture capital, that helps too.

Best of luck, Ziggy.

Saturday, September 14, 2013

clash of publishing cultures: NLP and literary study

Language Log recently posted a clash of cultures guest post: Computational linguistics and literary scholarship. I am sympathetic to both sides (having lived in both worlds). The core issue was an NLP team asking NLP-type questions about film, and a humanities team asking humanities-type questions about data. And the two talked past each other. I believe this is largely due to two very different academic cultures, particularly with respect to the question: What counts as publishable?

The basic issue was that a group of computational linguists from CMU (David Bamman, Brendan O’Connor, and Noah A. Smith) presented a paper about automatically learning character personas from freely available movie plot summaries at this summer's Association for Computational Linguistics conference in Bulgaria (full paper here).

Unfortunately, a couple of UT Austin scholars (Hannah Alpert-Abrams from comparative lit, and Dan Garrette from computer science) thought the paper had fatal flaws with respect to literary studies and asked LL to post their reply. In particular, they felt the CMU team failed to use contemporary literary theory (or film theory), and instead relied on outdated ideas of persona. They made one other crucial complaint, that the data the CMU team used was flawed.

NLP engineers are good at finding data and working with it, but often bad at interpreting it. I don't mean they're bad at interpreting the results of complex analysis performed on data. I mean they are often bad at understanding the nature of their data to begin with. I think the most important argument the UT Austin team make against the CMU team is this (important point underlined and boldfaced just in case you're stupid):

By focusing on cinematic archetypes, Bamman et al.’s research misses the really exciting potential of their data. Studying Wikipedia entries gives us access into the ways that people talk about film, exploring both general patterns of discourse and points of unexpected divergence.

In other words, the CMU team didn't truly understand what their data was. They didn't get data about Personas or Stereotypes in film. Rather, they got data about how a particular group of people talk about a topic. This is a well known issue in humanities studies of all kinds, but it's much less understood in sciences and engineering, as far as I can tell.

To his credit, CMU team member O'Connor addressed part of this in a response by saying:

We did not try to make a contribution to contemporary literary theory. Rather, we focus on developing a computational linguistic research method of analyzing characters in stories. We hope there is a place for both the development of new research methods, as well as actual new substantive findings.

And here is where the culture clash erupts. While engineers and scientists are quite used to the idea that "proof of concept" methodology development is an acceptable topic for a refereed conference paper, it is almost unheard of in the humanities (the social sciences fall somewhere in between, and O'Connor notes this).

However, O'Connor didn't address their more substantive point that their underlying data was flawed. Again, with proof of concept papers, this is less of an issue. The UT Austin team made the point that the CMU team didn't ask questions that 'fit into academic discourse about film' (slight paraphrase). O'Connor countered that that was because they didn't even try. That was not their goal. As far as I can tell, the CMU team didn't give a hoot about the data at all. It happened to be a convenient data set that they could scrape freely and play with. If anyone has a movie plot data set that is balanced for things like gender, perspective, class, race, etc., I'm confident the CMU team would be happy to apply their process to it. But the CMU team, as represented by O'Connor's reply, runs the risk of seeming aloof (at best). Showing such blatant disregard for the goals of the very humanities scholars they're trying to develop a method for will not win them many friends in English and comparative literature departments.

O'Connor mentioned that he believed "it’s most useful to publish part of the work early and get scholarly feedback, instead of waiting for years before trying to write a “perfect” paper." While I agree with the interactive feedback notion underlying his point, I have to say that he comes across as a bit smug and arrogant by saying it in this way. He was certainly not showing much respect to the traditions within humanities by adding the snide remark about a "perfect paper." Humanities is its own academic culture, with its own traditions of what counts as publishable. Simply declaring his own academic traditions as preferable is not particularly respectful.

I also believe that the UT Austin team's response posted on Language Log was somewhat condescending and disrespectful of the CMU team (and some of the LL commenters called them out on it as well). This is a clash of academic cultures. Again, I am sympathetic to both sides. But they will continue to talk past each other until each understands the others' cultures better.

Accomplishments versus Quests

There is a much larger point to be made about the kind of personalities that engineering tends to draw versus humanities. I'm speculating, but it's been my experience that engineers tend to be driven by accomplishment. Not solving big problems, just solving any problem. They spend a few hours getting a Python script to properly scrape and format plot summaries from an online database, and that makes them happy. They accomplished something. Humanities people tend to be driven by quests. Large scale goals to answer vague and amorphous questions.

Wednesday, September 4, 2013

British English and preposition dropping with barrier verbs

This is yet another in a series of posts detailing data and analysis from my not-quite-entirely-completely-achieved linguistics dissertation (list of previous posts here).

Recall that if an entity wants to achieve a certain outcome, yet is impeded by some force, this situation can be encoded by a barrier verb in English, such as prevent, ban, protect.

preposition dropping and phrase length

Professor Katsuko Tomotsugu presented corpus data about preposition dropping and the NP (from) V‐ing construction, particularly with respect to British English and barrier verbs at this year's International Cognitive Linguistics Conference in Alberta. Here are three examples from her poster:

The ozone layer still prevents any lethal UVC radiation reaching the earth. (FBL 3222)
Closed doors stopped the fire taking over the whole building in Borough Road. (K4W 266)
This somehow inhibits copies of viral DNA being made, and is the basis of acyclovir's anti‐viral activity. (B72 593)

I had noticed this preposition dropping and did a little legwork on it as well back in 2008 (all unpublished), so I thought I would add my two cents to Tomotsugu's data. Note, there is one glaringly obvious pattern to preposition dropping that I'll make plain at the end.

To begin, my focus was different. Tomotsugu was studying causation types and preposition dropping, but I wanted to know if heaviness (length of constituent phrase in number of words) was a factor in the occurrence of barrier verb sentences that dropped the preposition. I made the assumption this phenomenon was associated with British English, so I didn't associate my BNC extraction results with origin, but I think it's clearly a British English thing.
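For concreteness, here is a minimal sketch of the heaviness measure as I describe it: the number of tokens in a constituent. The `heaviness` function and the whitespace tokenization are assumptions for illustration; the BNC has its own tokenization (multiword units like "Provided-that" count as one token there), so real counts could differ slightly.

```python
def heaviness(phrase):
    """Length of a constituent in whitespace-separated tokens (an assumption;
    the BNC's own tokenization may split or join some words differently)."""
    return len(phrase.split())


# Illustrative constituents taken from one prep drop sentence.
obj = "them"
comp = "getting damaged"

print(heaviness(obj), heaviness(comp))
```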

As I began looking into this, it seemed like object pronouns had a high rate of co-occurrence with the prep drop sentences, so I counted that too (… to prevent them getting damaged). Note that there were no pronoun complements because I only looked at sentential complements. In order to find these kinds of constructions, I had to search a parse tree (using Tgrep2) for an S complement that was sister to an object NP (with no prep in between), so there are no passives in my data. Tomotsugu notes in her poster that passives are common:

A significantly higher frequency of complements using the passive form “being __” was found in the from-less variant of prevent and stop, as well as with verbs of occurrence (happen, arise, occur) in the from‐less variant of prevent.

I simply didn't study this. Note that automatically extracting examples of the prep drop condition with Tgrep2 was tricky, so I settled on one pattern that worked and stuck with it. I may have missed others.
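Since I can't reproduce the actual Tgrep2 pattern, here is a rough Python sketch of the same idea: find an S that immediately follows its sister NP, with no preposition in between. The nested-tuple tree format, the function names, and the two toy parses are all hypothetical; this is not Tgrep2 syntax and not the query I ran.

```python
def find_np_s_sisters(tree, hits=None):
    """Collect (NP, S) pairs where S immediately follows NP under one parent.

    A tree is a tuple (label, child, child, ...); a leaf is a plain string.
    """
    if hits is None:
        hits = []
    if isinstance(tree, str):  # a word, nothing to search below it
        return hits
    children = tree[1:]
    # Check each pair of adjacent sisters.
    for left, right in zip(children, children[1:]):
        if (not isinstance(left, str) and not isinstance(right, str)
                and left[0] == "NP" and right[0] == "S"):
            hits.append((left, right))
    # Recurse into every child.
    for child in children:
        find_np_s_sisters(child, hits)
    return hits


def words(tree):
    """Flatten a tree back into its list of words."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree[1:]:
        out.extend(words(child))
    return out


# Toy parses (hand-built, not real BNC parser output).
gerund = ("S", ("VP", ("VBG", "getting"), ("VP", ("VBN", "damaged"))))
prep_drop = ("VP", ("VB", "prevent"), ("NP", ("PRP", "them")), gerund)
with_from = ("VP", ("VB", "prevent"), ("NP", ("PRP", "them")),
             ("PP", ("IN", "from"), gerund))

for np, s in find_np_s_sisters(prep_drop):
    print(words(np), words(s))
print(len(find_np_s_sisters(with_from)))
```

The adjacency check is what excludes the from variant: there the S sits inside a PP, so it is never a direct sister of the object NP.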

I found 211 examples of 'prevent X Ying', so I took 211 random samples from my 2152 original prevent from S returns as comparison and counted the heaviness of the objects and from comps. The table below presents the length of object and comp constituents occurring in the barrier verb construction with the construction prevent from S (note, there were zero valid prevent against S examples). Let me repeat my admission from the first post in this series that I am cutting and pasting much of this from chapters I wrote circa 2008. This data should be taken as suggestive only.
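Drawing an equal-sized comparison set can be sketched like this. The `comparison_sample` helper, the fixed seed, and the placeholder sentence IDs are assumptions for illustration; I no longer have the script I actually used.

```python
import random


def comparison_sample(items, k, seed=0):
    """Draw k items without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(items, k)


# Placeholder IDs standing in for the 2152 'prevent from S' returns.
from_hits = list(range(2152))
sample = comparison_sample(from_hits, 211)

print(len(sample), len(set(sample)))
```

Sampling without replacement matters here: no from sentence should be counted twice in the comparison.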

The number in the length column represents the number of tokens. The numbers in the Obj and Comp columns represent the number of sentences matching the length condition. For example, in the first row, it says that 104 ‘prevent X from Ying’ sentences had a verb object of only one token (this includes the 68 pronouns reported in orange above), whereas 178 'prevent X Ying' sentences had one word objects (of which, 160 were pronouns). On the other hand, only 4 ‘prevent X from Ying’ sentences had a verb object of 6 words, and only one ‘prevent X Ying’ sentence did.

First pass interpretation: The verb prevent is highly frequent, plus it occurs in the Barrier Verb Construction with from more frequently than other verbs do. This may account for its openness to preposition dropping (but the verb stop also allows prep dropping, even though its association with BVC from is weak).

More importantly, the prep drop sentences clearly had a bias for pronoun objects and they appear to have a bias for shorter comps too. 76% of the prep drop sentences had a pronoun object and 84% overall had a one word object. Of the 211 prep drop sentences, only 12 had objects of 3 words or more (about 6%); whereas, of the 211 sentences with a preposition, 42 did (20%).
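The counts in this paragraph are enough to recheck the percentages and to run a rough significance test on the "3 words or more" contrast. The sketch below uses a plain Pearson chi-square on a 2x2 table with no continuity correction; that choice of test is my assumption here, not something I ran in 2008, and the function names are made up for illustration.

```python
import math

N = 211  # sentences in each condition

# Counts reported in the text.
drop_pronoun, drop_one_word, drop_3plus = 160, 178, 12
from_3plus = 42


def pct(count, total=N):
    """Share of sentences as a percentage."""
    return 100.0 * count / total


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]].

    For 1 degree of freedom the p-value is erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


print(pct(drop_pronoun), pct(drop_one_word), pct(drop_3plus), pct(from_3plus))

# Rows: prep drop vs. from; columns: object of 3+ words vs. shorter object.
chi2, p = chi_square_2x2(drop_3plus, N - drop_3plus,
                         from_3plus, N - from_3plus)
print(chi2, p)
```

This only speaks to the object length contrast. Testing the complement lengths would need the per-sentence counts, which aren't reproduced here.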

In the from Y-ing sentences, direct objects are on average about 59% shorter than complements (1.93/4.7 = .41); whereas in the preposition drop sentences, objects tend to be 67% shorter (1.3/3.9 = .33). Put the other way, complements run about 2.4 times the length of objects with from and about 3 times the length without it. Is this difference significant? If it is, one could say preposition dropping is driven in part by length concerns.
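The ratios are easy to recompute. This sketch assumes the figures in parentheses are mean object length over mean complement length, in tokens; the `length_ratios` helper is just for illustration.

```python
# (mean object length, mean complement length), read off the figures above.
# Treating them as means is an assumption.
means = {
    "prevent X from Ying": (1.93, 4.7),
    "prevent X Ying": (1.3, 3.9),
}


def length_ratios(obj_mean, comp_mean):
    """Return object/complement ratio, how much shorter the object is,
    and how many times longer the complement is than the object."""
    obj_over_comp = obj_mean / comp_mean
    return obj_over_comp, 1.0 - obj_over_comp, comp_mean / obj_mean


for label, (obj_mean, comp_mean) in means.items():
    ratio, shorter_by, times_longer = length_ratios(obj_mean, comp_mean)
    print(label, round(ratio, 2), round(shorter_by, 2), round(times_longer, 2))
```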

Glaringly Obvious

And now for the glaringly obvious. Tomotsugu explicitly studied NP (from) V‐ing constructions. I did not. My Tgrep2 search extracted every S complement that was sister to an object NP (with no prep in between), regardless of POS. I believe I specified these POSs within my tgrep2 search:

VB|VBB|VBD|VBG|VBI|VBN|VBZ|VVB

But, every example I retrieved, all 211 in the prevent X S query, involved a VBG complement. Maybe my search query was bad (I can't find the actual Tgrep2 query at the moment, just a description of it within a document).
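One sanity check worth running on any such result is a tally of which tags actually came back. The sketch below treats the tag list above as a regex alternation; the `tally_tags` helper and the sample tag list are hypothetical stand-ins, since I don't have the original returns in machine-readable form.

```python
import re
from collections import Counter

# The tag alternation described above, used here as a Python regex
# (not as a Tgrep2 pattern).
POS_PATTERN = re.compile(r"VB|VBB|VBD|VBG|VBI|VBN|VBZ|VVB")


def tally_tags(tags):
    """Count only the tags that the alternation matches in full."""
    return Counter(t for t in tags if POS_PATTERN.fullmatch(t))


# Hypothetical stand-in for the complement head tags of the 211 returns.
returned_tags = ["VBG"] * 211

print(tally_tags(returned_tags))
```

If the real tally looks like this, the query allowed other verb forms but the data simply never supplied them.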

Here are some representative examples of my BNC returns:

Provided-that all the controls can be locked to prevent them getting damaged by slamming against the stops, parking the aircraft facing down wind will be safest, because then the wing is meeting the airflow at a negative angle.
Although many gliders have a spring or bungee in the circuit to reduce the snatching loads at higher speeds on the approach, this is seldom powerful enough to prevent them sucking open if they are unlocked
how can I prevent it happening again?
It is free of charge and can help to detect early signs of health problems and prevent them developing.
Even-if you decide you don't have a problem now, it makes sense to do all you can to prevent it happening in the future.
Their main concern was that independent arbitration would drag out negotiations and prevent them complying with the MMC proposals to free pubs from the tie by the deadline of November 1992.
That has not prevented them exercising a great influence on our cultural development.
He got off the mark with an uppish straight drive for four, which might have given a less myopic bowler than Malcolm a return catch, and in Malcolm 's next over, he attempted a square slash which, if he had got an edge, might have prevented him ever setting foot in India again.
“The reason that Hollywood keeps selling all its film companies to the Australians, the Japanese, and-so-on, is to prevent them falling into the hands of people from New York.”
Her employers, the Northern regional health authority, want to prevent her returning there, to end her secondment as a neo-natologist in Newcastle-upon-Tyne, and for the foreseeable future prevent her working in child abuse.
Even a nervous pull into the greenside bunker with his third shot at the par-five 18th, which was to open the door for Stewart and Olazabal, could not prevent it being Langer's day.

This deserves more work, to be sure.

Tuesday, September 3, 2013

I walk not alone through the valley of barriers

It's nice to not be alone. For years I thought I was the only one interested in barrier verbs. Happily, I have discovered several scholars who have published on this verb class recently. Here's a brief, annotated, chronological bibliography:

Landau, Idan. 2002. (Un)interpretable Neg in Comp. Linguistic Inquiry. Volume 33, Number 3, Summer. pp. 465-492. (This is a Minimalist Syntax treatment of Hebrew negation with just a short treatment of English prevent at the end). Infinitival complements to negative verbs (refrain, prevent) display a number of surprising syntax-semantics correlations. Those are traced to the operation of negative features in the Comp position. The analysis also provides insight into the recalcitrant prevent DP from V-ing construction in English.

Mair, Christian. 2002. Three changing patterns of verb complementation in Late Modern English: A real-time study based on matching text corpora. English Language and Linguistics, 6(1), 105-131. The article looks at three instances of grammatical variation in present-day standard English: the use of bare and to-infinitives with the verb help, the presence or absence of the preposition/complementizer from before -ing-complements depending on prevent, and the choice between -ing- and infinitival complements after the verbs begin and start. In all three instances, current British and American usage will be shown to differ, and these differences need to be interpreted against diachronic changes affecting Late Modern English grammar as a whole.

Baltin, Mark R. 2009. The Properties of Negative Non-finite Complements. NYU Working Papers In Linguistics, Vol. 2: Papers In Syntax, Spring. (Minimalist Syntax treatment of English from as it occurs with barrier verbs - this is a response to Landau). This paper is about the syntax and semantics of non-finite clausal complementation. By focusing on the properties of a small and comparatively neglected class of non-finite complements in English, this paper will shed light on the larger class of non-finite complements that have been the subject of much discussion, arguing that selection for complement type is semantic in nature rather than syntactic.

Tomotsugu, Katsuko. 2013. Asymmetric causation types in the competing complements of negative causative verbs: NP (from) V-ing. The 12th International Cognitive Linguistics Conference (ICLC). University of Alberta in Edmonton, Alberta, Canada. 23-28 June. ("This study focuses on the omission of the preposition from from the complements of negative causative verbs, which represent the nonrealization of a situation expressed by V-ing.")

I would be remiss if I failed to remind y'all that several verb class scholars have recognized classes similar to barrier verbs, as I pointed out in previous posts, particularly here. The list of scholars who have touched on barrier verbs throughout history is actually longer, and goes back further, than this brief list suggests. These are simply four recent examples that I have stumbled upon. Apologies to anyone who deserves to be listed here but is not, and if you know of such a person, please provide me with a citation and I'll gladly update the record.

On a side note, I asked both Baltin and Tomotsugu if they have also looked at the occurrence of against as an alternative to from with some of these verbs. Professor Tomotsugu and I have started a productive email exchange regarding the overlaps in our work. I have yet to hear from Professor Baltin.

The occurrence of against is particularly useful to establish the force dynamic properties underlying the semantics of barrier verbs because against is a preposition that means physical contact (e.g., ‘to lean against’). I didn't get a chance to discover what properties condition the occurrence of against instead of from, but I suspect there is something interesting there. I think it has to do with the complement acting as a goal-directed agent, instead of the object of the barrier verb. Maybe it's that from makes the NP2 undergoer salient and against makes the NP3 antagonist salient? Not sure yet. But, note that the verb protect is used with both from and against in the following CDC passage in nearly identical contexts:

"The single best way to protect your children from the flu is to get them vaccinated each year. The seasonal flu vaccine protects against three influenza viruses that research indicates will be most common during the season: an influenza A (H1N1) virus, an influenza A (H3N2) virus and an influenza B virus."

The role of frequency has yet to be determined, but there is clearly a difference in the frequency of from and against in both American and British English, as Google Ngrams suggests (an imperfect corpus, I know, but a good hint):

Interesting linguistics, to be sure. More to come...
