Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pp. 340-345, Copenhagen, August 1996. [See the cited TR, Eisner (1996), for the much-improved final results and experimental details. Algorithmic details are in subsequent papers.]

Three New Probabilistic Models for Dependency Parsing: An Exploration

Jason M. Eisner
CIS Department, University of Pennsylvania
200 S. 33rd St., Philadelphia, PA 19104-6389, USA
jeisner@linc.cis.upenn.edu

Abstract

After presenting a novel O(n³) parsing algorithm for dependency grammar, we develop three contrasting ways to stochasticize it. We propose (a) a lexical affinity model where words struggle to modify each other, (b) a sense tagging model where words fluctuate randomly in their selectional preferences, and (c) a generative model where the speaker fleshes out each word's syntactic and conceptual structure without regard to the implications for the hearer. We also give preliminary empirical results from evaluating the three models' parsing performance on annotated Wall Street Journal training text (derived from the Penn Treebank). In these results, the generative model performs significantly better than the others, and does about equally well at assigning part-of-speech tags.

1 Introduction

In recent years, the statistical parsing community has begun to reach out for syntactic formalisms that recognize the individuality of words. Link grammars (Sleator and Temperley, 1991) and lexicalized tree-adjoining grammars (Schabes, 1992) have now received stochastic treatments. Other researchers, not wishing to abandon context-free grammar (CFG) but disillusioned with its lexical blind spot, have tried to re-parameterize stochastic CFG in context-sensitive ways (Black et al., 1992) or have augmented the formalism with lexical headwords (Magerman, 1995; Collins, 1996).

In this paper, we present a flexible probabilistic parser that simultaneously assigns both part-of-speech tags and a bare-bones dependency structure (illustrated in Figure 1). The choice of a simple syntactic structure is deliberate: we would like to ask some basic questions about where lexical relationships appear and how best to exploit

[*] This material is based upon work supported under a National Science Foundation Graduate Fellowship, and has benefited greatly from discussions with Mike Collins, Dan Melamed, Mitch Marcus and Adwait Ratnaparkhi.

them. It is useful to look into these basic questions before trying to fine-tune the performance of systems whose behavior is harder to understand.[1]

[1] Our novel parsing algorithm also rescues dependency from certain criticisms: "Dependency grammars ... are not lexical, and (as far as we know) lack a parsing algorithm of efficiency comparable to link grammars." (Lafferty et al., 1992, p. 3)

The main contribution of the work is to propose three distinct, lexicalist hypotheses about the probability space underlying sentence structure. We illustrate how each hypothesis is expressed in a dependency framework, and how each can be used to guide our parser toward its favored solution. Finally, we point to experimental results that compare the three hypotheses' parsing performance on sentences from the Wall Street Journal. The parser is trained on an annotated corpus; no hand-written grammar is required.

Figure 1: (a) A bare-bones dependency parse of "The man in the corner taught his dachshund to play golf." Each word points to a single parent, the word it modifies; the head of the sentence points to the EOS (end-of-sentence) mark. Crossing links and cycles are not allowed. (b) Constituent structure and subcategorization may be highlighted by displaying the same dependencies as a lexical tree.

2 Probabilistic Dependencies

It cannot be emphasized too strongly that a grammatical representation (dependency parses, tag sequences, phrase-structure trees) does not entail any particular probability model. In principle, one could model the distribution of dependency parses
in any number of sensible or perverse ways. The choice of the right model is not a priori obvious.

One way to build a probabilistic grammar is to specify what sequences of moves (such as shift and reduce) a parser is likely to make. It is reasonable to expect a given move to be correct about as often on test data as on training data. This is the philosophy behind stochastic CFG (Jelinek et al., 1992), "history-based" phrase-structure parsing (Black et al., 1992), and others.

However, probability models derived from parsers sometimes focus on incidental properties of the data. This may be the case for Lafferty et al. (1992)'s model for link grammar. If we were to adapt their top-down stochastic parsing strategy to the rather similar case of dependency grammar, we would find their elementary probabilities tabulating only non-intuitive aspects of the parse structure: Pr(word j is the rightmost pre-k child of word i | i is a right-spine strict descendant of one of the left children of a token of word k, or else i is the parent of k, and i precedes j precedes k).[2] While it is clearly necessary to decide whether j is a child of i, conditioning that decision as above may not reduce its test entropy as much as a more linguistically perspicuous condition would.

We believe it is fruitful to design probability models independently of the parser. In this section, we will outline the three lexicalist, linguistically perspicuous, qualitatively different models that we have developed and tested.

2.1 Model A: Bigram lexical affinities

N-gram taggers like (Church, 1988; Jelinek, 1985; Kupiec, 1992; Merialdo, 1990) take the following view of how a tagged sentence enters the world. First, a sequence of tags is generated according to a Markov process, with the random choice of each tag conditioned on the previous two tags. Second, a word is chosen conditional on each tag.

Since our sentences have links as well as tags and words, suppose that after the words are inserted, each sentence passes through a third step that looks at each pair of words and randomly decides whether to link them. For the resulting sentences to resemble real corpora, the probability that word j gets linked to word i should be lexically sensitive: it should depend on the (tag, word) pairs at both i and j.

The probability of drawing a given parsed sentence from the population may then be expressed

[2] This corresponds to Lafferty et al.'s central statistic (p. 4), Pr(W | L, R, l, r), in the case where i's parent is to the left of i. i, j, k correspond to L, W, R respectively. Owing to the particular recursive strategy the parser uses to break up the sentence, the statistic would be measured and utilized only under the condition described above.
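Model A's three-step story (a Markov tag sequence, words emitted per tag, then an independent link decision for every ordered word pair) can be sketched as below. This is an illustrative reconstruction, not the paper's implementation: the function name is ours, and the probability tables passed in are hypothetical stand-ins for corpus estimates.

```python
import itertools

def model_a_score(twords, links, p_tag, p_word, p_link):
    """Joint score of one tagged, linked sentence under model A's story.

    twords: list of (tag, word) pairs.
    links:  set of (parent, child) index pairs.
    p_tag(tag, following):  tag probability given the next two tags
                            (the Markov tag process, read right to left).
    p_word(word, tag):      word emission probability.
    p_link(linked, ti, tj): probability of the link decision for an
                            ordered pair, given both (tag, word) pairs.
    """
    score = 1.0
    # Steps 1 and 2: generate each tag from its right context, then a word.
    for i, (tag, word) in enumerate(twords):
        following = tuple(t for t, _ in twords[i + 1:i + 3])
        score *= p_tag(tag, following) * p_word(word, tag)
    # Step 3: an independent, lexically sensitive decision per ordered pair.
    for i, j in itertools.permutations(range(len(twords)), 2):
        score *= p_link((i, j) in links, twords[i], twords[j])
    return score

# Toy usage with uniform stand-in tables:
sent = [("DT", "the"), ("NN", "price")]
score = model_a_score(sent, {(1, 0)},
                      lambda tag, following: 0.5,
                      lambda word, tag: 0.5,
                      lambda linked, ti, tj: 0.9 if linked else 0.1)
# = (0.5 * 0.5)**2 for the tagged words, times 0.9 * 0.1 for the two decisions
```

Note that nothing here forbids crossing links or multiple parents; as the text explains, illegal structures are simply discarded from the population.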

the price of the stock fell    (DT NN IN DT NN VBD)

Figure 3: (a) The correct parse. (b) A common error if the model ignores arity.

as (1) in Figure 2, where the random variable L_ij ∈ {0, 1} is 1 iff word i is the parent of word j. Expression (1) assigns a probability to every possible tag-and-link-annotated string, and these probabilities sum to one. Many of the annotated strings exhibit violations such as crossing links and multiple parents, which, if they were allowed, would let all the words express their lexical preferences independently and simultaneously. We stipulate that the model discards from the population any illegal structures that it generates; they do not appear in either training or test data. Therefore, the parser described below finds the likeliest legal structure: it maximizes the lexical preferences of (1) within the few hard linguistic constraints imposed by the dependency formalism.

In practice, some generalization or "coarsening" of the conditional probabilities in (1) helps to avoid the effects of undertraining. For example, we follow standard practice (Church, 1988) in n-gram tagging by using (3) to approximate the first term in (2). Decisions about how much coarsening to do are of great practical interest, but they depend on the training corpus and may be omitted from a conceptual discussion of the model.

The model in (1) can be improved; it does not capture the fact that words have arities. For example, "the price of the stock fell" (Figure 3a) will typically be misanalyzed under this model. Since stocks often fall, stock has a greater affinity for fell than for of. Hence stock (as well as price) will end up pointing to the verb fell (Figure 3b), resulting in a double subject for fell and leaving of childless.

To capture word arities and other subcategorization facts, we must recognize that the children of a word like fell are not independent of each other. The solution is to modify (1) slightly, further conditioning L_ij on the number and/or type of children of i that already sit between i and j. This means that in the parse of Figure 3b, the link price → fell will be sensitive to the fact that fell already has a closer child tagged as a noun (NN). Specifically, the price → fell link will now be strongly disfavored in Figure 3b, since verbs rarely take two NN dependents to the left. By contrast, price → fell is unobjectionable in Figure 3a, rendering that parse more probable. (This change can be reflected in the conceptual model, by stating that the L_ij decisions are made in increasing order of link length |i − j| and are no longer independent.)

2.2 Model B: Selectional preferences

In a legal dependency parse, every word except for the head of the sentence (the EOS mark) has
Pr(words, tags, links) = Pr(words, tags) · Pr(link presences and absences | words, tags)    (1)

    ≈ ∏_{1≤i≤n} Pr(tword(i) | tword(i+1), tword(i+2)) · ∏_{1≤i,j≤n} Pr(L_ij | tword(i), tword(j))    (2)

Pr(tword(i) | tword(i+1), tword(i+2)) ≈ Pr(tag(i) | tag(i+1), tag(i+2)) · Pr(word(i) | tag(i))    (3)

Pr(words, tags, links) ∝ Pr(words, tags, preferences) = Pr(words, tags) · Pr(preferences | words, tags)
    ≈ ∏_{1≤i≤n} Pr(tword(i) | tword(i+1), tword(i+2)) · ∏_{1≤i≤n} Pr(preferences(i) | tword(i))    (4)

Pr(words, tags, links) = ∏_{1≤i≤n} ( ∏_{c = −(1+#left-kids(i)), c≠0}^{1+#right-kids(i)} Pr(tword(kid_c(i)) | tag(kid_{c−1}(i)), or kid_{c+1}(i) if c < 0; tword(i)) )    (5)

Figure 2: High-level views of model A (formulas 1-3); model B (formula 4); and model C (formula 5). If i and j are tokens, then tword(i) represents the pair (tag(i), word(i)), and L_ij ∈ {0, 1} is 1 iff i is the parent of j.

exactly one parent. Rather than having the model select a subset of the n² possible links, as in model A, and then discard the result unless each word has exactly one parent, we might restrict the model to picking out one parent per word to begin with. Model B generates a sequence of tagged words, then specifies a parent (or more precisely, a type of parent) for each word j.

Of course model A also ends up selecting a parent for each word, but its calculation plays careful politics with the set of other words that happen to appear in the sentence: word j considers both the benefit of selecting i as a parent, and the costs of spurning all the other possible parents i′. Model B takes an approach at the opposite extreme, and simply has each word blindly describe its ideal parent. For example, price in Figure 3 might insist (with some probability) that it "depend on a verb to my right." To capture arity, words probabilistically specify their ideal children as well: fell is highly likely to want only one noun to its left. The form and coarseness of such specifications is a parameter of the model.

When a word stochastically chooses one set of requirements on its parents and children, it is choosing what a link grammarian would call a disjunct (set of selectional preferences) for the word. We may thus imagine generating a Markov sequence of tagged words as before, and then independently "sense tagging" each word with a disjunct.[3] Choosing all the disjuncts does not quite specify a parse. However, if the disjuncts are sufficiently specific, it specifies at most one parse. Some sentences generated in this way are illegal because their disjuncts cannot be simultaneously satisfied; as in model A, these sentences are said to be removed from the population, and the probabilities renormalized. A likely parse is therefore one that allows a likely and consistent set of sense tags; its probability in the population is given in (4).

[3] In our implementation, the distribution over possible disjuncts is given by a pair of Markov processes, as in model C.

2.3 Model C: Recursive generation

The final model we propose is a generation model, as opposed to the comprehension models A and B (and to other comprehension models such as (Lafferty et al., 1992; Magerman, 1995; Collins, 1996)). The contrast recalls an old debate over spoken language, as to whether its properties are driven by hearers' acoustic needs (comprehension) or speakers' articulatory needs (generation). Models A and B suggest that speakers produce text in such a way that the grammatical relations can be easily decoded by a listener, given words' preferences to associate with each other and tags' preferences to follow each other. But model C says that speakers' primary goal is to flesh out the syntactic and conceptual structure for each word they utter, surrounding it with arguments, modifiers, and function words as appropriate. According to model C, speakers should not hesitate to add extra prepositional phrases to a noun, even if this lengthens some links that are ordinarily short, or leads to tagging or attachment ambiguities.

The generation process is straightforward. Each time a word i is added, it generates a Markov sequence of (tag, word) pairs to serve as its left children, and a separate sequence of (tag, word) pairs as its right children. Each Markov process, whose probabilities depend on the word i and its tag, begins in a special START state; the symbols it generates are added as i's children, from closest to farthest, until it reaches the STOP state. The process recurses for each child so generated. This is a sort of lexicalized context-free model.

Suppose that the Markov process, when generating a child, remembers just the tag of the child's most recently generated sister, if any. Then the probability of drawing a given parse from the population is (5), where kid(i, c) denotes the c-th-closest right child of word i, and where kid(i, 0) = START and kid(i, 1 + #right-kids(i)) = STOP.
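The recursive generation process can be sketched as below. This is a deliberately simplified, deterministic illustration: it conditions on the head word alone (where the model also conditions on the head's tag), and it hard-codes one choice per Markov state where the real model draws from learned distributions over (tag, word) pairs. The toy transition table is invented.

```python
START, STOP = "<START>", "<STOP>"

# (head word, side, tag of previously generated sister) -> next child or STOP.
# A hypothetical toy grammar, not the paper's learned parameters.
KIDS = {
    ("fell", "left", START): ("NN", "price"),
    ("fell", "left", "NN"): STOP,     # verbs rarely want a second NN on the left
    ("fell", "right", START): STOP,
    ("price", "left", START): ("DT", "the"),
    ("price", "left", "DT"): STOP,
    ("price", "right", START): STOP,
    ("the", "left", START): STOP,
    ("the", "right", START): STOP,
}

def generate(head):
    """Recursively flesh out `head`: run a left and a right child sequence
    from START to STOP, recursing on each child. Returns the lexical tree
    (head, left_kids, right_kids) with kids as (tag, subtree) pairs,
    closest child first."""
    tree = (head, [], [])
    for side, kids in (("left", tree[1]), ("right", tree[2])):
        prev = START
        while True:
            step = KIDS[(head, side, prev)]
            if step == STOP:
                break
            tag, word = step
            kids.append((tag, generate(word)))
            prev = tag           # the process remembers the sister's tag
    return tree
```

Each table lookup corresponds to one Pr(...) factor in (5); multiplying the probabilities of the choices made (including the STOP choices) would give the parse probability.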
Figure 4: Spans participating in the correct parse of "That dachshund over there can really play golf!" (a) has one parentless endword; its subspan (b) has two. (c < 0 indexes left children.)

This may be thought of as a non-linear trigram model, where each tagged word is generated based on the parent tagged word and a sister tag. The links in the parse serve to pick out the relevant trigrams, and are chosen to get trigrams that optimize the global tagging. That the links also happen to annotate useful semantic relations is, from this perspective, quite accidental.

Note that the revised version of model A uses probabilities Pr(link to child | child, parent, closer-children), where model C uses Pr(link to child | parent, closer-children). This is because model A assumes that the child was previously generated by a linear process, and all that is necessary is to link to it. Model C actually generates the child in the process of linking to it.

3 Bottom-Up Dependency Parsing

In this section we sketch our dependency parsing algorithm: a novel dynamic-programming method to assemble the most probable parse from the bottom up. The algorithm adds one link at a time, making it easy to multiply out the models' probability factors. It also enforces the special directionality requirements of dependency grammar, the prohibitions on cycles and multiple parents.[4]

The method used is similar to the CKY method of context-free parsing, which combines analyses of shorter substrings into analyses of progressively longer ones. Multiple analyses have the same signature if they are indistinguishable in their ability to combine with other analyses; if so, the parser discards all but the highest-scoring one. CKY requires O(n³s²) time and O(n²s) space, where n is the length of the sentence and s is an upper bound on signatures per substring.

Let us consider dependency parsing in this framework. One might guess that each substring analysis should be a lexical tree: a tagged headword plus all lexical subtrees dependent upon it. (See Figure 1b.) However, if a constituent's probabilistic behavior depends on its headword (the lexicalist hypothesis), then differently headed analyses need different signatures. There are at least k of these for a substring of length k, whence the bound s = k = Ω(n), giving a time complexity of Ω(n⁵). (Collins, 1996) uses this Ω(n⁵) algorithm directly (together with pruning).

[4] Labeled dependencies are possible, and a minor variant handles the simpler case of link grammar. Indeed, abstractly, the algorithm resembles a cleaner, bottom-up version of the top-down link grammar parser developed independently by (Lafferty et al., 1992).

We propose an alternative approach that preserves the O(n³) bound. Instead of analyzing substrings as lexical trees that will be linked together into larger lexical trees, the parser will analyze them as non-constituent spans that will be concatenated into larger spans. A span consists of ≥ 2 adjacent words; tags for all these words except possibly the last; a list of all dependency links among the words in the span; and perhaps some other information carried along in the span's signature. No cycles, multiple parents, or crossing links are allowed in the span, and each internal word of the span must have a parent in the span.

Two spans are illustrated in Figure 4. These diagrams are typical: a span of a dependency parse may consist of either a parentless endword and some of its descendants on one side (Figure 4a), or two parentless endwords, with all the right descendants of one and all the left descendants of the other (Figure 4b). The intuition is that the internal part of a span is grammatically inert: except for the endwords dachshund and play, the structure of each span is irrelevant to the span's ability to combine in future, so spans with different internal structure can compete to be the best-scoring span with a particular signature.

Figure 5: The assembly of a span c (= a + b + a covering link over the shared word i) from two smaller spans a (left subspan) and b (right subspan). Only b isn't minimal.

If span a ends on the same word i that starts span b, then the parser tries to combine the two spans by covered-concatenation (Figure 5). The two copies of word i are identified, after which a leftward or rightward covering link is optionally added between the endwords of the new span. Any dependency parse can be built up by covered-concatenation. When the parser covered-concatenates a and b, it obtains up to three new spans (leftward, rightward, and no covering link). The covered-concatenation of a and b, forming c, is barred unless it meets certain simple tests:

• a must be minimal (not itself expressible as a concatenation of narrower spans). This prevents us from assembling c in multiple ways.

• Since the overlapping word will be internal to c, it must have a parent in exactly one of a and b.
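These legality tests (together with the covering-link condition stated below) can be sketched as follows. The `Span` record here is our own reconstruction and keeps only the signature fields the tests consult; a real implementation would also carry endword tags and scores. Marking every concatenation result non-minimal follows directly from the definition above, since such a result is by construction expressible as a concatenation of narrower spans.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Span:
    left: int              # index of leftmost word
    right: int             # index of rightmost word
    left_has_parent: bool  # does the left endword have a parent in the span?
    right_has_parent: bool
    minimal: bool          # width-2 base spans are minimal; results are not

def covered_concatenate(a, b):
    """Return the legal spans obtainable by joining a and b over word a.right."""
    if a.right != b.left:
        return []          # the spans must share exactly one word
    if not a.minimal:
        return []          # test 1: a must be minimal
    # test 2: the shared word becomes internal, so it needs a parent
    # in exactly one of a and b
    if a.right_has_parent == b.left_has_parent:
        return []
    base = Span(a.left, b.right, a.left_has_parent, b.right_has_parent, False)
    results = [base]       # option 1: no covering link
    # covering-link condition: neither new endword may already have a parent
    if not (a.left_has_parent or b.right_has_parent):
        results.append(replace(base, left_has_parent=True))   # leftward link
        results.append(replace(base, right_has_parent=True))  # rightward link
    return results
```

For example, joining a span whose shared endword already has a parent on the left side with a fresh width-2 span on the right yields all three variants, while two spans that both leave the shared word parentless combine to nothing.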
∏_{k≤i<ℓ} Pr(tword(i) | tword(i+1), tword(i+2)) · ∏_{k≤i,j≤ℓ with i,j linked} Pr(i has prefs that j satisfies | tword(i), tword(j))    (6)

∏_{k≤i,j≤ℓ with i,j linked} Pr(L_ij | tword(i), tword(j), tag(next-closest-kid(i))) · ∏_{k<i<ℓ, (j<k or ℓ<j)} Pr(L_ij | tword(i), tword(j), ...)    (7)

• c must not be given a covering link if either the leftmost word of a or the rightmost word of b has a parent. (Violating this condition leads to either multiple parents or link cycles.)

Any sufficiently wide span whose left endword has a parent is a legal parse, rooted at the EOS mark (Figure 1). Note that a span's signature must specify whether its endwords have parents.

4 Bottom-Up Probabilities

Is this one parser really compatible with all three probability models? Yes, but for each model, we must provide a way to keep track of probabilities as we parse. Bear in mind that models A, B, and C do not themselves specify probabilities for all spans; intrinsically they give only probabilities for sentences.

Model C. Define each span's score to be the product of all probabilities of links within the span. (The link to i from its cth child is associated with the probability Pr(...) in (5).) When spans a and b are combined and one more link is added, it is easy to compute the resulting span's score: score(a) · score(b) · Pr(covering link).[5] When a span constitutes a parse of the whole input sentence, its score as just computed proves to be the parse probability, conditional on the tree root EOS, under model C. The highest-probability parse can therefore be built by dynamic programming, where we build and retain the highest-scoring span of each signature.

Model B. Taking the Markov process to generate (tag, word) pairs from right to left, we let (6) define the score of a span from word k to word ℓ. The first product encodes the Markovian probability that the (tag, word) pairs k through ℓ − 1 are as claimed by the span, conditional on the appearance of specific (tag, word) pairs at ℓ, ℓ + 1.[6] Again, scores can be easily updated when spans combine, and the probability of a complete parse P, divided by the total probability of all parses that succeed in satisfying lexical preferences, is just P's score.

Model A. Finally, model A is scored the same as model B, except for the second factor in (6), which is replaced by the less obvious expression in (7). As usual, scores can be constructed from the bottom up (though tword(j) in the second factor of (7) is not available to the algorithm, j being outside the span, so we back off to word(j)).

[5] The third factor depends on, e.g., kid(i, c − 1), which we recover from the span signature. Also, matters are complicated slightly by the probabilities associated with the generation of STOP.

[6] Different k-ℓ spans have scores conditioned on different hypotheses about tag(ℓ) and tag(ℓ + 1); their signatures are correspondingly different. Under model B, a k-ℓ span may not combine with an ℓ-m span whose tags violate its assumptions about ℓ and ℓ + 1.

               A     B     C     C′    X     Baseline
All tokens     90.2  90.9  90.8  90.5  91.0  79.8
Non-punc       88.9  89.8  89.6  89.3  89.8  77.1
Nouns          90.1  89.8  90.2  90.4  90.0  86.2
Lexical verbs  74.6  75.9  73.3  75.8  73.3  67.5

Table 1: Results of preliminary experiments: Percentage of tokens correctly tagged by each model.

5 Empirical Comparison

We have undertaken a careful study to compare these models' success at generalizing from training data to test data. Full results on a moderate corpus of 25,000+ tagged, dependency-annotated Wall Street Journal sentences, discussed in (Eisner, 1996), were not complete at press time. However, Tables 1-2 show pilot results for a small set of data drawn from that corpus. (The full results show substantially better performance, e.g., 93% correct tags and 87% correct parents for model C, but appear qualitatively similar.)

The pilot experiment was conducted on a subset of 4772 of the sentences comprising 93,360 words and punctuation marks. The corpus was derived by semi-automatic means from the Penn Treebank; only sentences without conjunction were available (mean length = 20, max = 68). A randomly selected set of 400 sentences was set aside for testing all models; the rest were used to estimate the model parameters. In the pilot (unlike the full experiment), the parser was instructed to "back off" from all probabilities with denominators < 10. For this reason, the models were insensitive to most lexical distinctions.

In addition to models A, B, and C, described above, the pilot experiment evaluated two other models for comparison. Model C′ was a version of model C that ignored lexical dependencies between parents and children, considering only dependencies between a parent's tag and a child's tag. This model is similar to the model used by stochastic CFG. Model X did the same n-gram tagging as models A and B (n = 2 for the preliminary experiment, rather than n = 3), but did not assign any links.

Tables 1-2 show the percentage of raw tokens that were correctly tagged by each model, as well as the proportion that were correctly attached to
               A     B     C     C′    Baseline
All tokens     75.9  72.8  78.1  66.6  47.3
Non-punc       75.0  75.4  79.2  68.8  51.1
Nouns          75.7  71.8  77.2  55.9  29.8
Lexical verbs  66.5  63.1  71.0  46.9  21.0

Table 2: Results of preliminary experiments: Percentage of tokens correctly attached to their parents by each model.

their parents. For tagging, baseline performance was measured by assigning each word in the test set its most frequent tag (if any) from the training set. The unusually low baseline performance results from a combination of a small pilot training set and a mildly extended tag set.[7] We observed that in the training set, determiners most commonly pointed to the following word, so as a parsing baseline, we linked every test determiner to the following word; likewise, we linked every test preposition to the preceding word, and so on.
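The two baselines can be sketched as below. The generalization from "determiners rightward, prepositions leftward" to a per-tag most-frequent parent offset is our own simplification of the rule scheme described above, and the tiny training corpus is invented for illustration.

```python
from collections import Counter, defaultdict

def train_baselines(corpus):
    """corpus: list of (word, tag, offset) triples, where offset is the
    signed distance from the word to its parent (+1 = following word,
    -1 = preceding word). Returns the most frequent tag per word and the
    most frequent parent offset per tag."""
    tag_counts = defaultdict(Counter)
    offset_counts = defaultdict(Counter)
    for word, tag, offset in corpus:
        tag_counts[word][tag] += 1
        offset_counts[tag][offset] += 1
    most_frequent_tag = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}
    most_frequent_offset = {t: c.most_common(1)[0][0] for t, c in offset_counts.items()}
    return most_frequent_tag, most_frequent_offset

# Invented toy training data: (word, tag, parent offset)
train = [("the", "DT", +1), ("price", "NN", +4), ("of", "IN", -1),
         ("the", "DT", +1), ("stock", "NN", -1), ("fell", "VBD", +1)]
tags, offsets = train_baselines(train)
```

At test time, each word would be tagged with `tags.get(word)` and attached to the word at `i + offsets[tag]`, mirroring the determiner and preposition rules in the text.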
The patterns in the preliminary data are striking, with verbs showing up as an area of difficulty, and with some models clearly faring better than others. The simplest and fastest model, the recursive generation model C, did easily the best job of capturing the dependency structure (Table 2). It misattached the fewest words, both overall and in each category. This suggests that subcategorization preferences (the only factor considered by model C) play a substantial role in the structure of Treebank sentences. (Indeed, the errors in model B, which performed worst across the board, were very frequently arity errors, where the desire of a child to attach to a particular parent overcame the reluctance of the parent to accept more children.)

A good deal of the parsing success of model C seems to have arisen from its knowledge of individual words, as we expected. This is shown by the vastly inferior performance of the control, model C′. On the other hand, both C and C′ were competitive with the other models at tagging. This shows that a tag can be predicted about as well from the tags of its putative parent and sibling as it can from the tags of string-adjacent words, even when there is considerable error in determining the parent and sibling.

6 Conclusions

Bare-bones dependency grammar, which requires no link labels, no grammar, and no fuss to understand, is a clean testbed for studying the lexical affinities of words. We believe that this is an important line of investigative research, one that is likely to produce both useful parsing tools and significant insights about language modeling.

As a first step in the study of lexical affinity, we asked whether there was a "natural" way to stochasticize such a simple formalism as dependency. In fact, we have now exhibited three promising types of model for this simple problem. Further, we have developed a novel parsing algorithm to compare these hypotheses, with results that so far favor the speaker-oriented model C, even in written, edited Wall Street Journal text. To our knowledge, the relative merits of speaker-oriented versus hearer-oriented probabilistic syntax models have not been investigated before.

[7] We used distinctive tags for auxiliary verbs and for words being used as noun modifiers (e.g., participles), because they have very different subcategorization frames.

References

Ezra Black, Fred Jelinek, et al. 1992. Towards history-based grammars: using richer models for probabilistic parsing. In Fifth DARPA Workshop on Speech and Natural Language, Arden Conference Center, Harriman, New York, February. cmp-lg/9405007.

Kenneth W. Church. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proc. of the 2nd Conf. on Applied Natural Language Processing, 136-148, Austin, TX. Association for Computational Linguistics, Morristown, NJ.

Michael J. Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proc. of the 34th ACL, Santa Cruz, July. cmp-lg/9605012.

Jason Eisner. 1996. An empirical comparison of probability models for dependency grammar. Technical report IRCS-96-11, University of Pennsylvania. cmp-lg/9706004.

Fred Jelinek. 1985. Markov source modeling of text generation. In J. Skwirzinski, editor, Impact of Processing Techniques on Communication, Dordrecht.

Fred Jelinek, John D. Lafferty, and Robert L. Mercer. 1992. Basic methods of probabilistic context-free grammars. In Speech Recognition and Understanding: Recent Advances, Trends, and Applications.

J. Kupiec. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6.

John Lafferty, Daniel Sleator, and Davy Temperley. 1992. Grammatical trigrams: A probabilistic model of link grammar. In Proc. of the AAAI Conf. on Probabilistic Approaches to Natural Language, October.

David Magerman. 1995. Statistical decision-tree models for parsing. In Proceedings of the 33rd ACL, Boston, MA. cmp-lg/9504030.

Igor A. Mel'čuk. 1988. Dependency Syntax: Theory and Practice. State University of New York Press.

B. Merialdo. 1990. Tagging text with a probabilistic model. In Proceedings of the IBM Natural Language ITL, pp. 161-172, Paris, France.

Yves Schabes. 1992. Stochastic lexicalized tree-adjoining grammars. In Proceedings of COLING-92, Nantes, France, July.

Daniel Sleator and Davy Temperley. 1991. Parsing English with a Link Grammar. Tech. rpt. CMU-CS-91-196, Carnegie Mellon Univ. cmp-lg/9508004.