Abstract:
Many models of sentence processing are based on verb
subcategorization probabilities (Boland 1997, Clifton et al. 1984,
Ferreira & McClure 1997, Fodor 1987, Garnsey et al. 1997,
Jurafsky 1996, MacDonald 1994, Mitchell & Holmes 1985,
Tanenhaus et al. 1990, Trueswell et al. 1993). These probabilities
can be computed from on-line corpora. But recent studies (Merlo
1994, Gibson et al. 1996) have found differences between corpus
frequencies and psycholinguistic measures (sentence
production/completion), and have suggested that corpora inherently
reflect language production, and hence may be an inappropriate
source for representations in the comprehension lexicon.
We argue that verbs do not have separate argument-structure
representations for comprehension and production. Rather, verbs
have unified subcategorization probabilities which can be measured
by both corpora and sentence production experiments. For this to
hold, the apparent frequency differences between corpora and
experiments must be explainable by contextual effects.
In an extension of Roland and Jurafsky (1997), our study analyzes
sentence production data (Connine et al. 1984), written discourse
(Brown and Wall Street Journal from Penn TreeBank - Marcus et al.
1993), and conversational data (Switchboard - Godfrey et al. 1992).
We found that the frequency differences among these sources were
a result of predictable discourse properties of each of the
sources, and that these differences can be normalized out. We
computed a discourse-based normalization function to map between
the Brown corpus and the Connine et al. data. The same function also
accounted for the differences between the Connine et al. and Wall
Street Journal data.
We also found that these discourse contexts affect various classes
of verbs differently. For example, because sentences in discourse
take place with reference to a background, verbs of communication
(e.g. answer, ask, call, describe, read, say, write) are used in
corpora to discuss details of the contents of communication (Brown:
"Turning to the reporters, she asked, 'Did you hear her?'"), but in
single sentence production to describe the (new) act of
communication itself (Connine et al.: "He asked a lot of questions
at school"). The result is a decrease in the transitivity of
communication verbs in the corpora.
Verbs of propositional attitude (agree, guess, know, see,
understand) are typically used transitively in written corpora and
single-sentence production (Connine et al.: "I guessed the right
answer on the quiz."). In spoken discourse, these verbs are more
likely to be used metalinguistically, with the previous discourse
contribution understood as the argument of the verb (Switchboard:
"I see", "I guess"). This effect also reduces the transitivity levels
for these verbs in Switchboard.
Our investigation of different sources suggests that different
frequencies are not the result of separate representations for
comprehension and production. Rather, it suggests that the lexical
entry for each verb has a unified probabilistic argument structure
which is combined with probabilistic constraints from different
discourse contexts (including the null context used in
single-sentence production tasks) to produce observed
frequencies.
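
As an illustration only, the subcategorization probabilities discussed
above can be read as relative frequencies of argument frames per verb.
The sketch below assumes that reading; the function name, the frame
labels, and the toy observations are hypothetical and are not taken
from the study's data or method.

```python
from collections import Counter


def subcat_probabilities(frames):
    """Relative-frequency estimate of P(frame | verb).

    `frames` is a list of frame labels observed for one verb in one
    source (hypothetical labels, e.g. "NP" for a direct object).
    """
    counts = Counter(frames)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {frame: n / total for frame, n in counts.items()}


# Made-up observations for a single verb, for illustration only.
observed = ["NP", "NP", "quote", "none", "NP"]
probs = subcat_probabilities(observed)
transitivity = probs.get("NP", 0.0)  # share of uses with a direct object
```

Computing the same estimate separately for each source (corpus or
production task) would give the per-source frequencies that the
abstract compares.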