| |
Abstract:
The traditional approach to designing a parser for natural
language has involved encoding the rules of a grammar for a given
language, and repetitively matching rules of the grammar against
input text. This is done until either a match has been found, in
which case the parse is successful, or until no match can been
found, in which case the parse crashes. Seeking to improve parser
performance, more recent implementations have used probabilistic
grammars, roughly equivalent to traditional linguistic grammars
except with attached probabilities (Charniak 1993, Marcus 1998).
The success of these parsers has been crucially dependent on
linearly ordered rules. Linearly ordered rules are readily encoded
for languages such as English, for which the word order is
relatively fixed. Little work, however, has been done with
languages in which the word order is much freer. Although linearly
ordered rules are viable for such languages as well, they would be
much more cumbersome to implement due to the number of rules that
would need to be defined (Karttunen 1985).
In Russian, for example, the word order of constituents is
relatively free. Each of the 12 possible renderings of (1) is to a
varying degree acceptable. Using the traditional parser approach,
separate rules for each combination would be necessary. Parsers
implemented using GPSG-like (Gazdar 1985) or HPSG-like (Pollard and
Sag 1994) formalisms can escape this problem for Russian by
encoding rules which indicate immediate dominance (ID) relations,
but do not encode linear precedence (LP). Such rules would have the
form as shown in (2), and differ from traditional grammatical rules
in that order is not dictated, rather composition without
ordering.
(1) sevodnja boris pracital etu knigu
today Boris read this:ACC book:ACC
(2) S -> Adv, V, NP[nom], NP[acc]
Implementing parsers using such formalisms would appear to solve
the problem for languages with freer word order. However, most
free-word order languages have constraints on the orders that
constituents can take (Russian is freer than most). Such
constraints on order might restrict the fronting of certain
constituents (such as the verb), or impose ordering restrictions on
other constituents within clauses or phrases. Using ID rules alone
will not successfully parse such languages, in that some
unacceptable orderings would be accepted. One would be forced to
encode a set of LP rules, indicating the acceptable structures for
the language.
An alternative approach is possible: rather than encoding all
possible output forms for a given language into a set of LP rules,
one could encode just those rules which violate the grammar. Such
rules represent an "antigrammar" for the language being parsed. ID
rules of the form shown in (2) would still be necessary, except
that subordinate to such rules might be antirules of the form shown
in (3). The antirule in (3) represents one impossible (or less
acceptable) structure for the ID rule in (2).
(3) V > Adv, NPnom, NPacc
One could use antirules in a metagrammatic fashion by building a
pre-parser which generates the set of valid rules for any given ID
rule, expressly not generating the set of antirules. Another
approach would be to treat antirules as filters to eliminate
invalid parses.
Charniak, E. (1993). Statistical language learning. Cambridge,
MA: MIT Press.
Gazdar, G., E. Klein, G. K. Pullum, and I. A. Sag (1985).
Generalized Phrase Structure Grammar. Oxford: Basil Blackwell
Publisher Ltd.
Karttunnen, L. and M. Kay (1985). Parsing in free word order
language. In Dowty, D., L. Karttunen, and A. Zwicky (eds), Studies
in Natural Language Processing. Cambridge: Cambridge University
Press.
Marcus, M. (1998). Progress in Statistical Parsing: Statistics vs.
Representations. Paper presented to the Eleventh Annual CUNY
Conference on Human Sentence Processing, Rutgers NJ, 1998.
|