MIT CogNet, The Brain Sciences Connection (from the MIT Press)

 

How Verb Subcategorization Frequencies Are Affected By The Way You Measure Them

 Daniel Jurafsky and Douglas Roland
  
 

Abstract:
Many models of sentence processing are based on verb subcategorization probabilities (Boland 1997, Clifton et al. 1984, Ferreira & McClure 1997, Fodor 1987, Garnsey et al. 1997, Jurafsky 1996, MacDonald 1994, Mitchell & Holmes 1985, Tanenhaus et al. 1990, Trueswell et al. 1993). These probabilities can be computed from on-line corpora. But recent studies (Merlo 1994, Gibson et al. 1996) have found differences between corpus frequencies and psycholinguistic measures (sentence production/completion), and have suggested that corpora inherently reflect language production, and hence may be an inappropriate source for representations in the comprehension lexicon.

We argue that verbs do not have separate argument-structure representations for comprehension and production. Rather, verbs have unified subcategorization probabilities which can be measured by both corpora and sentence production experiments. In order for this to be true, the apparent difference in frequency (corpora/experiments) must be explainable from contextual effects.

In an extension of Roland and Jurafsky (1997), our study analyzes sentence production data (Connine et al. 1984), written discourse (the Brown and Wall Street Journal corpora from the Penn Treebank; Marcus et al. 1993), and conversational data (Switchboard; Godfrey et al. 1992). We found that the different frequencies observed in these sources result from predictable discourse properties of each source, and that these differences can be normalized out. We computed a discourse-based normalization function to map between the Brown corpus and the Connine et al. data. The same function also accounted for the differences between the Connine et al. and Wall Street Journal data.
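The frequencies compared across these sources are, at bottom, relative frequencies of subcategorization frames per verb. The following is only a minimal sketch of that estimate, not the procedure used in the study: the function name `subcat_probabilities`, the frame labels, and the toy observations are all hypothetical and are not drawn from any of the corpora above. It does not implement the normalization function, whose form is not given here.

```python
from collections import Counter, defaultdict


def subcat_probabilities(observations):
    """Estimate P(frame | verb) by relative frequency.

    `observations` is an iterable of (verb, frame) pairs, one per
    verb token whose subcategorization frame has been labeled.
    """
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    return {
        verb: {frame: n / sum(frames.values()) for frame, n in frames.items()}
        for verb, frames in counts.items()
    }


# Toy, invented observations (NOT data from any corpus in the study).
# The frame labels are illustrative assumptions.
toy = [
    ("ask", "transitive"), ("ask", "transitive"), ("ask", "transitive"),
    ("ask", "quote"),
    ("guess", "transitive"), ("guess", "intransitive"),
]

probs = subcat_probabilities(toy)
print(probs["ask"]["transitive"])  # 0.75
```

Running the same estimate separately on each source would give one probability table per source, which is the kind of per-source frequency the comparison above starts from.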

We also found that these discourse contexts affect various classes of verbs differently. For example, because sentences in discourse take place with reference to a background, verbs of communication (e.g. answer, ask, call, describe, read, say, write) are used in corpora to discuss details of the contents of communication (Brown: "Turning to the reporters, she asked, 'Did you hear her?'"), but in single-sentence production to describe the (new) act of communication itself (Connine et al.: "He asked a lot of questions at school"). The result is a decrease in the transitivity of communication verbs in the corpora.

Verbs of propositional attitude (agree, guess, know, see, understand) are typically used transitively in written corpora and single-sentence production (Connine et al.: "I guessed the right answer on the quiz."). In spoken discourse, these verbs are more likely to be used metalinguistically, with the previous discourse contribution understood as the argument of the verb (Switchboard: "I see", "I guess"). This effect also reduces the transitivity levels for these verbs in Switchboard.

Our investigation of different sources suggests that different frequencies are not the result of separate representations for comprehension and production. Rather, it suggests that the lexical entry for each verb has a unified probabilistic argument structure, which is combined with probabilistic constraints from different discourse contexts (including the null context used in single-sentence production tasks) to produce the observed frequencies.

 
 


© 2010 The MIT Press