Appears in: Carl Weir (ed.), Statistically-Based Natural Language Processing Techniques: Papers from the 1992 Workshop, pp. 20-27. Technical Report W-92-01, AAAI Press, Menlo Park, 1992.

A Probabilistic Parser and Its Application

Mark A. Jones
AT&T Bell Laboratories
600 Mountain Avenue, Rm. 2B-435
Murray Hill, NJ 07974-0636
jones@research.att.com

Jason M. Eisner
Emmanuel College, Cambridge
Cambridge CB2 3AP England
jme14@phoenix.cambridge.ac.uk

Abstract

We describe a general approach to the probabilistic parsing of context-free grammars. The method integrates context-sensitive statistical knowledge of various types (e.g., syntactic and semantic) and can be trained incrementally from a bracketed corpus. We introduce a variant of the GHR context-free recognition algorithm, and explain how to adapt it for efficient probabilistic parsing. In split-corpus testing on a real-world corpus of sentences from software testing documents, with 20 possible parses for a sentence of average length, the system finds and identifies the correct parse in 96% of the sentences for which it finds any parse, while producing only 1.03 parses per sentence for those sentences. Significantly, this success rate would be only 79% without the semantic statistics.

Introduction
In constrained domains, natural language processing can often provide leverage. At AT&T, for instance, NL technology can potentially help automate many aspects of software development. A typical example occurs in the software testing area. Here 250,000 English sentences specify the operational tests for a telephone switching system. The challenge is to extract at least the surface content of this highly referential, naturally occurring text, as a first step in automating the largely manual testing process. The sentences vary in length and complexity, ranging from short sentences such as "Station B3 goes onhook" to 50-word sentences containing parentheticals, subordinate clauses, and conjunction. Fortunately the discourse is reasonably well focused: a large but finite number of telephonic concepts enter into a limited set of logical relationships. Such focus is characteristic of many sublanguages with practical importance (e.g., medical records).
We desire to press forward to NL techniques that are robust, that do not need complete grammars in advance, and that can be trained from existing corpora of sample sentences. Our approach to this problem grew out of earlier work [Jones et al 1991] on correcting the output of optical character recognition (OCR) systems. We were amazed at how much correction was possible using only low-level statistical knowledge about English (e.g., the frequency of digrams like "pa") and about common OCR mistakes (e.g., reporting "c" for "e"). As many as 90% of incorrect words could be fixed within the telephony sublanguage domain, and 70-80% for broader samples of English. Naturally we wondered whether more sophisticated uses of statistical knowledge could aid in such tasks as the one described above. The recent literature also reflects an increasing interest in statistical training methods for many NL tasks, including parsing [Jelinek and Lafferty 1991, Magerman and Marcus 1991, Bobrow 1991, Magerman and Weir 1992, Black, Jelinek, et al 1992], part of speech tagging [Church 1988], and corpora alignment [Dagan et al 1991, Gale and Church 1991].
Simply stated, we seek to build a parser that can construct accurate syntactic and semantic analyses for the sentences of a given language. The parser should know little or nothing about the target language, save what it can discover statistically from a representative corpus of analyzed sentences. When only unanalyzed sentences are available, a practical approach is to parse a small set of sentences by hand, to get started, and then to use the parser itself as a tool to suggest analyses (or partial analyses) for further sentences. A similar "bootstrapping" approach is found in [Simmons 1990]. The precise grammatical theory we use to hand-analyze sentences should not be crucial, so long as it is applied consistently and is not unduly large.

Parsing Algorithms
Following [Graham et al 1980], we adopt the following notation. An arbitrary context-free grammar is given by G = (V, Σ, P, S), where V is the vocabulary of all symbols, Σ is the set of terminal symbols, P is the set of rewrite rules, and S is the start symbol. For an input sentence w = a_1 a_2 ... a_n, let w_{i,j} denote the substring a_{i+1} ... a_j and w_i = w_{0,i} denote the prefix of length i. We use Greek letters (α, β, ...) to denote symbol strings in V*.
Tabular dynamic programming algorithms are the methods of choice for ordinary context-free recognition [Cocke and Schwartz 1970, Earley 1970, Graham et al 1980]. Each entry t_{i,j} in a table or chart, t, holds a set of symbols or rules that match w_{i,j}. A symbol A matches w_{i,j} if A ⇒* w_{i,j}. Some of these methods use dotted rules to represent progress in matching the input. For all A → αβ in P, A → α·β is a dotted rule of G. The dotted rule A → α·β matches w_{i,j} (and hence is in the set t_{i,j}) if α ⇒* w_{i,j}.

The dynamic programming algorithms work by combining shorter derivations into longer ones. In the CKY algorithm, the grammar is in Chomsky Normal Form. The symbol A may be added to t_{j-1,j} by the lexical rule A → a_j, or to t_{i,j} by the rule A → B C, if there exist symbols B in t_{i,k} and C in t_{k,j}. In other words, CKY obeys the following invariant:

Invariant 1 (CKY): Add A to t_{i,j} if and only if A ⇒* w_{i,j}.
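To make Invariant 1 concrete, here is a minimal CKY recognizer sketch in Python (our own illustration, not from the paper; the grammar encoding is an assumption). It fills the chart bottom-up so that a symbol lands in t[i][j] exactly when it derives w_{i,j}:

    def cky_recognize(words, lexical, binary, start="S"):
        """Invariant 1: put A in t[i][j] iff A derives w_{i,j} (grammar in CNF).

        lexical: word -> set of preterminals A with a rule A -> word
        binary:  (B, C) -> set of parents A with a rule A -> B C
        """
        n = len(words)
        t = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        for j, word in enumerate(words, start=1):
            t[j - 1][j] = set(lexical.get(word, ()))          # A -> a_j
        for width in range(2, n + 1):                          # combine shorter spans
            for i in range(n - width + 1):
                j = i + width
                for k in range(i + 1, j):                      # split point
                    for B in t[i][k]:
                        for C in t[k][j]:
                            t[i][j] |= binary.get((B, C), set())
        return start in t[0][n]

    lexical = {"I": {"NP"}, "saw": {"V"}, "him": {"NP"}}
    binary = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
    print(cky_recognize("I saw him".split(), lexical, binary))   # True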
The principal drawback of the CKY method is that the algorithm finds matches that cannot lead to derivations for S. The GHR algorithm [Graham et al 1980] improves the average case performance by considering only matches that are consistent with the left context:

Invariant 2 (GHR): Add A → α·β to t_{i,j} if and only if α ⇒* w_{i,j} and S ⇒* w_i A δ for some δ ∈ V*.

In one sense, GHR seems to do as well as one could expect in an on-line recognizer that recognizes each prefix of w without lookahead. Still, the algorithm runs in time O(n^3) and space O(n^2) for arbitrary context-free grammars. Furthermore, in many applications the goal is not simply to recognize a grammatical sentence, but to find its possible parses, or the best parse. Extracting all parses from the chart can be quite expensive. Natural language constructs such as prepositional phrase attachments and noun-noun compounds can give rise to a Catalan number of parses [Winograd 1983], as in the classic sentence "I saw a man in the park with a telescope." With such inherent ambiguity, even refinements based on lookahead do not reduce the overall complexity. The only way to further improve performance is to find fewer parses: to track only those analyses that make semantic and pragmatic sense. Such an approach is not only potentially faster; it is usually more useful as well.
It is straightforward to turn the GHR recognizer into a chart parser. The chart will now store trees rather than dotted rules. Let A → τ_{i,j}·β represent a dotted tree with root A that dominates w_{i,j} (i < j) through a contiguous sequence of child subtrees, τ_{i,j}. When the context is clear, we will refer to such a tree as A_{i,j}, or more generally as h_{i,j}. Here β ∈ V* as before; when β is null the tree is called complete.

We could modify Invariants 1 and 2 to refer to dotted trees. As in CKY, we could add a tree A_{i,j} if and only if it dominated w_{i,j}. A stronger condition, similar to GHR, would further require A_{i,j} to be syntactically and semantically consistent with the left context w_i. The problem remains, however, that the notion of contextual consistency is too weak: we want analyses that are contextually probable. Even semantic consistency is not enough. Many of the readings in the example above are internally consistent but still improbable. For example, it is possible that the example describes the sawing of a man, but not likely. To effectively reduce the search space, we must restrict our attention to analyses that are probable given the joint considerations of syntax, semantics, etc. We desire to form

Invariant 3 Optimal (OPT): Add the dotted tree A_{i,j} if and only if it is dominated by the "best parse tree" Ŝ_{0,n}, defined as the most probable complete tree of the form S_{0,n}.

Of course, when we are parsing a new (unbracketed) sentence, Ŝ_{0,n} is not known ahead of time. In a strictly left-right parser, without lookahead in the input string, it is generally impossible to guarantee that we only keep trees that appear as subtrees of Ŝ_{0,n}. Nevertheless, since language is generally processed by humans from left to right in real time, it is reasonable to suspect that the left context contains enough information to severely limit nondeterminism. A first attempt might be
Invariant 4 Most Probable (MP): Add A_{i,j} if and only if

    Σ_{δ∈V*} Pr[S ⇒* w_i A_{i,j} δ]  ≥  Σ_{δ∈V*} Pr[S ⇒* w_i B_{i',j} δ]

for all dotted trees B_{i',j} that compete with A_{i,j}. A_{i,j} and B_{i',j} are said to compete if they offer about the same level of explanation (see below) and neither dominates the other. Such trees are incompatible as explanations for a_j in particular, so only one can appear as part of Ŝ_{0,n}.

The MP criterion guesses the ultimate usefulness and probability of a dotted tree by considering only its left context. The left context may of course be unhelpful: for instance, there is no context at the beginning of the sentence. Worse, the left context may be misleading. In principle there is nothing wrong with this: even humans have difficulty with misleading "garden path" sentences. The price for being right and fast most of the time is the possibility of being fooled occasionally, as with any heuristic.

Even so, MP is too strict a criterion for most domains: it throws away many plausible trees, some of which may be necessary to build the preferred parse Ŝ_{0,n} of the whole sentence. We modify MP so that instead of adding only the most likely tree in each set of competitors, it adds all trees within some fraction ε of the most likely one. Thus the parameter ε < 1 operationally determines the set of garden path sentences for the parser.
If the left context is sufficiently misleading for a given ε, then useful trees may still be discarded. But in exchange for the certainty of producing every consistent analysis, we hope to find a good (statistically speaking) parse much faster by pruning away unlikely alternatives. If the number of alternatives is bounded by some constant k in practice, we can obtain an algorithm that is O(n + ke), where e is the number of edges in the parse tree. For binary branching trees, e = 2(n - 1), and the algorithm is O(n) as desired.

Invariant 5 Reasonably Probable (RP): Add A_{i,j} if and only if

    Σ_{δ∈V*} Pr[S ⇒* w_i A_{i,j} δ]  ≥  ε · Σ_{δ∈V*} Pr[S ⇒* w_i B_{i',j} δ]

for all dotted trees B_{i',j} that compete with A_{i,j}.

An alternative approach would keep m competing rules from each set, where m is fixed. This has the advantage of guaranteeing a constant number of candidates, but the disadvantage of not adapting to the ambiguity level at each point in the sentence. The fractional method better reflects the kind of "memory load" effects seen in psycholinguistic studies of human parsing.
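As a concrete reading of the RP invariant, here is a minimal sketch in Python of the fractional (relative-threshold) pruning step; the data structures and names are ours, not the paper's. Given competing hypotheses scored by their left-context probabilities, it keeps every hypothesis within a fraction epsilon of the best one:

    def rp_prune(hypotheses, epsilon):
        """Keep each hypothesis whose probability is within a fraction
        epsilon of its best competitor (Invariant 5, RP).

        hypotheses: list of (tree, prob) pairs that compete with each other
        """
        if not hypotheses:
            return []
        best = max(prob for _, prob in hypotheses)
        threshold = epsilon * best
        return [(tree, prob) for tree, prob in hypotheses if prob >= threshold]

    # Unlike a fixed beam of m candidates, this adapts to local ambiguity:
    # it keeps one tree where the context is decisive and several where not.
    kept = rp_prune([("t1", 0.50), ("t2", 0.04), ("t3", 0.30)], epsilon=0.1)
    # kept == [("t1", 0.50), ("t3", 0.30)]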
Algorithm 1 in Appendix A describes a parser that obeys the RP invariant. The algorithm returns the set of complete trees of the form S_{0,n}. We restrict the start symbol S to be non-recursive. If necessary a distinguished start symbol (e.g., ROOT) can be added to the grammar. Trees are created in three ways. First, the trivial trees a_j assert the presence of the input symbols. Second, the parser creates some "empty trees" from the grammar, although these are not added to the chart: they have the form A → Λ_{i,i}·β, where Λ denotes a sequence of zero subtrees. Third, the parser can combine trees into larger trees using the ⊕ operator. ⊕ pastes two adjacent trees together:

    (A → τ_{i,j}·Bβ) ⊕ B_{j,k} = (A → τ_{i,j} B_{j,k}·β)

Here the first argument of ⊕ must be an incomplete tree, while the second must be a complete tree. The ⊕ operator can easily be extended to work with sets and charts:

    Q ⊕ R = {A_{i,j} ⊕ B_{j,k} | A_{i,j} ∈ Q and is incomplete, B_{j,k} ∈ R and is complete}
    t ⊕ R = (∪_{i,j} t_{i,j}) ⊕ R
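To fix ideas, here is a small Python sketch (our own encoding, not the paper's) of dotted trees and the ⊕ operator. A dotted tree records its root, its span, the children matched so far, and the symbols still expected after the dot; ⊕ moves the dot over one expected symbol by attaching an adjacent complete tree:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DottedTree:
        root: str                 # A
        i: int                    # left edge of the span w_{i,j}
        j: int                    # right edge of the span
        children: tuple = ()      # tau_{i,j}: subtrees matched so far
        expected: tuple = ()      # beta: symbols still right of the dot

        @property
        def complete(self):
            return not self.expected   # beta is null

    def paste(a, b):
        """⊕: (A -> tau . B beta) ⊕ B_{j,k} = (A -> tau B . beta), defined
        only when a is incomplete, b is complete, they are adjacent, and
        b's root is the symbol just after the dot."""
        if (not a.complete and b.complete
                and a.j == b.i and a.expected[0] == b.root):
            return DottedTree(a.root, a.i, b.j,
                              a.children + (b,), a.expected[1:])
        return None

    # VP -> V . NP over w_{1,2}, pasted with a complete NP over w_{2,4}:
    v = DottedTree("V", 1, 2, children=("saw",))
    vp = DottedTree("VP", 1, 2, children=(v,), expected=("NP",))
    np = DottedTree("NP", 2, 4, children=("a", "man"))
    full_vp = paste(vp, np)       # VP -> V NP . over w_{1,4}; now complete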
Theorem 1: No tree A_{i,j} can dominate an incomplete tree B_{i',j}.

Proof: Suppose otherwise. Then A_{i,j} dominates the completion of B_{i',j}, and in particular some rightward extension B_{i',j} ⊕ C_{j,k}, where C_{j,k} is a complete tree with j < k. It follows that A_{i,j} dominates w_{j,k}, a contradiction.

Corollary: Given any two incomplete trees A_{i,j} and B_{i',j}, neither dominates the other.

Lemma 2: In line 5 of Algorithm 1, no tree A_{i,j} ∈ N can dominate a tree B_{i',j} ∈ N − E_j.

Proof: Every tree in N − E_j has just been created on this iteration of the while loop. But all subtrees dominated by A_{i,j} were created on previous iterations.

Theorem 3: In line 5 of Algorithm 1, no tree A_{i,j} ∈ N can dominate a tree B_{i',j} ∈ N.

Proof: Either B_{i',j} ∈ N − E_j, or B_{i',j} ∈ E_j and is incomplete. The claim now follows from the results above.

The most important aspect of the tree-building process is the explicit construction and pruning of the set of competing hypotheses, N, during each iteration. It is here that the parser chooses which hypotheses to pursue. Theorem 3 states that the trees in N are in fact mutually exclusive. The algorithm is also careful to ensure that the trees in N are of roughly equal depth. Were this not the case, two equally likely deep trees might have to compete (on different iterations) with each other's subtrees. Since the shallow subtrees are always more probable than their ancestors ("the part is more likely than the whole"), this would lead to the pruning of both deep trees.

We now state a theorem regarding the parses that will be found without pruning.
Theorem 4: In the special case of ε = 0, Algorithm 1 computes precisely the derivations that could be extracted from the recognition sets of GHR up to position j in the input.

Proof: When ε = 0, only zero-probability dotted trees will be pruned. We assume that any parse tree permitted by the formal grammar G has probability > 0, as do its subtrees. Conversely, Pr[A_{i,j} → τ_{i,j}·β | w_j] > 0 means that S ⇒* w_i τ_{i,j} β_{j,k} δ is a valid derivation for some sequence of trees β_{j,k} and some string δ. Thus Pr[A_{i,j} → τ_{i,j}·β | w_j] > 0 is equivalent to the statement that S ⇒* w_i A δ for some δ ∈ V*. Hence Invariant 5 adds A_{i,j} → τ_{i,j}·β to the chart if and only if Invariant 2 adds A → α·β.
A significant advantage of the data-driven parser described by Algorithm 1 is its potential use in noisy recognition environments such as speech or OCR. In such applications, where many input hypotheses may compete, pruning is even more valuable in avoiding a combinatorial explosion.

Considerations in Probabilistic Parsing

Algorithm 1 does not specify a computation for Pr[h_{i,j} | w_j], and so leaves open several important questions:

1. What sources of knowledge (syntax, semantics, etc.) can help determine the probability of the dotted tree h_{i,j}?
2. What features of the left context w_i are relevant to this probability?
3. Given answers to the above questions, how can we compute Pr[h_{i,j} | w_j]?
4. How much training is necessary to obtain sufficiently accurate statistics?

The answers are specific to the class of languages under consideration. For natural languages, reasonable performance can require a great deal of knowledge. To correctly interpret "I saw a man in the park with a telescope," we may need to know how often telescopes are used for seeing; how often a verb takes two prepositional phrases; who is most likely to have a telescope (me, the man, or the park); and so on.

Our system uses knowledge about the empirical frequencies of syntactic and semantic forms. However, our approach is quite general and would apply without modification to other knowledge sources, whether empirical or not.

The left-context probability Pr[h_{i,j} | w_j] depends on the literal input seen so far, w_j. How are we to know this probability if w_j is a novel string? As it turns out, we can compute it in terms of the left-context probabilities of other trees already in the chart, using arbitrarily weak independence assumptions. We will need empirical values for expressions of the form

    Pr[h_{i,j} ⊕ h_{j,k} | c_i & h_{i,j} & h_{j,k}]
    Pr[a_{i+1} | c_i]

where c_i is one possible "partial interpretation" of w_i (constructed from other trees in the chart). If the language permits relatively strong independence assumptions, c_i need not be too detailed an interpretation; then we will not need too many statistics, or a large set of examples. On the other hand, if we refuse to make any independence assumptions at all, we will have to treat every string w_j as a special case, and keep separate statistics for every sentence of the language.

In the next section, we will outline the computation for a simple case where c_i contains no semantic information. In the final section we present a range of results, including cases in which syntactic and semantic information is jointly considered. For a further discussion of the syntactic and semantic representations and their probabilities, including a helpful example, see [Jones and Eisner 1992].
Note that semantic interdependence can operate across some distance in a sentence; in practice, the likelihood of h_{i,j} may depend on even the earliest words of w_j. Compare "The champagne was very bubbly" with "The hostess was very bubbly." If we are to eliminate the incongruous meaning of "bubbly" in each case, we will need c_4 (a possible interpretation of the left context w_4) to indicate whether the subject of the sentence is human.

It remains an interesting empirical question whether it is more efficient (1) to compute highly accurate probabilities, via adequately detailed representations of left context, or (2) to use broader (e.g., non-semantic) representations, and compensate for inaccuracy by allowing more local nondeterminism. It can be cheaper to evaluate individual hypotheses under (2), and psycholinguistic evidence on parallel lexical access [Tanenhaus et al 1985] may favor (2) for sublexical speech processing. On the other hand, if we permit too much nondeterminism, hypotheses proliferate and the complexity rises dramatically. Moreover, inaccurate probabilities make it difficult to choose among parses of an ambiguous sentence.

A Syntactic Probability Computation
Our parser constructs generalized syntactic and semantic representations, and so permits c_i to be as broad or as detailed as desired. Space prevents us from giving the general probability computation. Instead we sketch a simple but still useful special case that disregards semantics. Here the (mutually exclusive) descriptions c_i, where 0 ≤ i < n, will take the form "w_i is the kind of string that is followed by an NP" (or VP, etc.). We make our standard assumption that the probability of h_{i,j} may depend on c_i, but is independent of everything else about w_i.[1] In this case, c_i is a function of the single correct incomplete dotted tree ending at i: so we are assuming that nothing else about w_i is relevant.

We wish to find Pr[h_{j,l} | w_l]. Given 0 ≤ j < n, let E_j denote the subset of incomplete dotted trees in ∪_i t_{i,j}. We may assume that some member of E_j does appear in Ŝ_{0,n}. (When this assumption fails, Ŝ_{0,n} will not be found no matter how accurate our probability computation is.) The corollary to Theorem 1 then implies that exactly one tree in E_j appears in Ŝ_{0,n}. We can therefore express Pr[h_{j,l} | w_l] as Σ_{h_{i,j}∈E_j} Pr[h_{i,j} & h_{j,l} | w_l]. We cache all the summands, as well as the sum, for future use.

So we only need an expression for U = Pr[h_{i,j} & h_{j,l} | w_l]. There are three cases. If l = j+1 and h_{j,l} = a_{j+1}, we can apply Bayes' Theorem:

    U = Pr[h_{i,j} & a_{j+1} | w_{j+1}]
      = Pr[h_{i,j} & a_{j+1} | w_j & a_{j+1}]
      = Pr[h_{i,j} | w_j & a_{j+1}]
      = (Pr[a_{j+1} | h_{i,j} & w_j] · Pr[h_{i,j} | w_j]) / (Σ_{h_{i',j}∈E_j} Pr[a_{j+1} | h_{i',j} & w_j] · Pr[h_{i',j} | w_j])
      = X_1 X_2 / Σ X_1' X_2'

In the second case, where h_{j,l} = h_{j,k} ⊕ h_{k,l}, we factor as follows:

    U = Pr[h_{i,j} & (h_{j,k} ⊕ h_{k,l}) | w_l]

[1] In the language of classical statistics, we have a binary-valued random variable H that takes the value true iff the tree h_{i,j} appears in the correct parse. We may treat the unknown p.d.f. for H as determined by the parameter w_i, the preceding input. Our assumption is that c_i is a sufficient statistic for w_i.
      = Pr[h_{j,k} ⊕ h_{k,l} | h_{i,j} & h_{j,k} & h_{k,l} & w_l] · Pr[h_{j,k} & h_{k,l} | w_l] · Pr[h_{i,j} | h_{j,k} & h_{k,l} & w_l]
      = X_3 X_4 Y

Insofar as our independence assumption holds, we can prove

    Y ≈ Pr[h_{i,j} | h_{j,k} & w_k] = Pr[h_{i,j} & h_{j,k} | w_k] / Pr[h_{j,k} | w_k] = X_5 / X_6

Finally, if i < j = l and h_{j,j} = h_{j,l} is an empty tree, A → Λ_{j,j}·β (X_5 sometimes yields this form):

    U = Pr[h_{i,j} & h_{j,j} | w_j] = Pr[h_{i,j} | w_j] · Pr[h_{j,j} | h_{i,j} & w_j] = X_7 X_8

Now X_2, X_2', X_4, X_5, X_6, and X_7 are all left-context probabilities for trees (and pairs of trees) that are already in the chart. In fact, all these probabilities have already been computed and cached.

X_1, X_1', X_3, and X_8, as well as the top-down probabilities Pr[S_{0,0}], may be estimated from empirical statistics. Where h_{i,j} is the incomplete tree A → τ_{i,j}·Bβ, define c_j(h_{i,j}) to be the symbol B. Thus if h_{i,j} is correct, w_j is in fact the kind of string that is followed by a constituent of type c_j(h_{i,j}) = B. According to our independence assumption, nothing else about w_j (or h_{i,j}) matters. We therefore write

    X_1 ≈ Pr[a_{j+1} | c_j(h_{i,j})]
    X_8 ≈ Pr[h_{j,j} | c_j(h_{i,j})]

and by similar assumptions (and abuse of notation),

    X_3 ≈ Pr[⊕ | c_k(h_{j,k}) & root(h_{k,l})].

An illustration may be helpful here. Suppose A = h_{j,k} is the dotted tree VP → V_{j,k}·NP. Thus A has the property that c_k(A) = NP. Suppose further that B = h_{k,l} is some complete tree representing an NP. The above statistic for X_3 gives the likelihood that such an A will bind such a B as its next child (rather than binding some deeper NP that dominates B). Our parser computes this statistic during training, simply by looking at the sample parse trees provided to it. GHR makes use of similar facts about the language, but deduces them from the formal grammar.
this deriv ation (and
  • f
the more general v ersion) is that it p ermits eectiv e cac hing. Thanks to Ba y es' Theorem, left-con text probabilities are alw a ys written in terms
  • f
  • ther,
previously com- puted left-con text probabilities. So Pr[h j;l j w l ] can alw a ys b e found b y m ultiplying
  • r
adding together a few kno wn n um b ers|regardless
  • f
the size
  • f
the dot- ted tree h j;l . If the size
  • f
the sets E j is b
  • unded
b y a constan t, then the summations are b
  • unded
and eac h new tree can b e ev aluated in constan t time. Since at most
  • ne
mem b er
  • f
E j is actually in b S 0;n , the set ma y b e k ept small b y eectiv e pruning. Th us accurate probabilities ha v e a double pa y
  • .
They p ermit us to prune aggres- siv ely , a strategy whic h b
  • th
k eeps the c hart small and mak es it easy to estimate the probabilities
  • f
its en- tries. Status and Results Our parser serv es as a comp
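The first case of the derivation can be read as a small recurrence. Here is a minimal Python sketch (the function and variable names are ours; x1 stands for the trained statistic Pr[a_{j+1} | c_j(h)]) showing how each new left-context probability is a Bayes-rule combination of cached ones:

    def extend_with_word(E_j, word, x1, cache):
        """Case 1 of the U computation: score each incomplete tree h in E_j
        paired with the next input word (l = j + 1).

        E_j:   incomplete dotted trees ending at position j
        x1:    x1(h, word) ~ Pr[word | c_j(h)], an empirical statistic
        cache: cache[h] = Pr[h | w_j], computed on earlier iterations
        Returns new_cache[h] = Pr[h & word | w_{j+1}], by Bayes' Theorem.
        """
        scores = {h: x1(h, word) * cache[h] for h in E_j}   # X_1 * X_2 per tree
        z = sum(scores.values())                            # normalizing sum
        return {h: s / z for h, s in scores.items()}

    # Two competing incomplete trees; the word "man" strongly favors the
    # hypothesis that expects an NP next.
    cache = {"VP -> V . NP": 0.6, "VP -> V . PP": 0.4}
    x1 = lambda h, w: 0.05 if h.endswith("NP") else 0.001
    print(extend_with_word(cache.keys(), "man", x1, cache))
    # {'VP -> V . NP': ~0.987, 'VP -> V . PP': ~0.013}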
Status and Results

Our parser serves as a component of the software testing application mentioned in the introduction (for details, see [Nonnenmann and Eddy 1992] and [Jones and Eisner 1992]). It has been trained on sample parse trees for over 400 sentences in the domain. The trees use lexical tags from the Brown Corpus [Francis and Kucera 1982] and fairly traditional phrase structure labels (S, NP, etc.). Although the "telephonese" sublanguage is an unrestricted subset of English, it differs statistically from English taken as a whole. The strength of trainable systems is their ability to adapt to such naturally occurring (and evolving) sublanguages.

The training corpus contains 308 distinct lexical items which participate in 355 part of speech rules. There are 55 distinct nonterminal labels, including 35 parts of speech. Sentences range from 5 to 47 words in length (counting punctuation). The average sentence is 11 words long; the average parse tree is 9 deep with 31 nonterminal nodes.

We take our grammar to be the smallest set of symbols and context-free rules needed to write down every tree in the corpus. The corpus perplexity b(C) measures the ambiguity of any set of sentences C under a given grammar:

    log b(C) = (Σ_{S∈C} log(number of parses for S)) / (Σ_{S∈C} number of words in S)

Using GHR to parse the corpus exhaustively, we measure b(C) = 1.313. Thus a typical 11-word sentence has 1.313^11 ≈ 20 parses, only one of which is correct.[2] 10% of the sentences have more than 1400 possible parses each.

[2] [Black, Jelinek, et al 1992], who call b(C) the "parse base," report almost identical numbers for a corpus of sentences from computer manuals.
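The perplexity formula is easy to check numerically; below is a quick sketch (the corpus counts fed to the function here would come from exhaustive parsing, as in the text):

    import math

    def parse_base(parse_counts, word_counts):
        """Corpus perplexity b(C): log b(C) is total log(#parses)
        divided by total word count, per the formula above."""
        log_b = sum(math.log(p) for p in parse_counts) / sum(word_counts)
        return math.exp(log_b)

    # With b(C) = 1.313, a typical 11-word sentence has about
    # 1.313 ** 11 = 19.9 parses, matching the figure of 20 in the text.
    print(1.313 ** 11)   # ~19.9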
Our working parser uses a hand-coded translation function to construct the semantic interpretations of dotted trees. It also makes use of some formal improvements to the algorithm, which have been omitted here for space reasons.

To date, we have tried to address three questions. First, particularly when the parser has enough knowledge to generate at least one parse, can it generate and identify the correct parse? Second, how quickly is the parser accumulating the knowledge necessary to get at least one parse? Third, how effectively does the algorithm control the combinatorial proliferation of hypotheses?

Accuracy

We have done several experiments to measure the accuracy of the parser on untrained sentences. Figure 1 summarizes two such experiments, which tested the accuracy of (i) joint syntactic-semantic statistics, and (ii) syntactic statistics alone (as formulated above).
                                                  (i) joint syntax   (ii) syntax
                                                  and semantics      alone
    %(some parse found)                           81%                76%
    %(top parse correct | some parse found)       96%                79%
    #(parses/sentence | some parse found)         1.03               1.33
    %(some parse first found at ε = 10^-1)        53%                30%
    %(some parse first found at ε = 10^-2)        19%                29%
    %(some parse first found at ε = 10^-3)        9%                 17%

    Figure 1: Benefits of semantic knowledge.

To best use our rather small set of 429 bracketed sentences, we trained on each 428-sentence subset and tested on the remaining sentence. Each sentence was parsed with progressively wider search beams of ε = 10^-1, 10^-2, and 10^-3, until at least one parse was found. We scored a parse as correct only if it matched the target parse tree exactly. For example, we would disallow a parse that was in error on a part-of-speech tag, a prepositional attachment, or the internal structure of a major constituent. Some other papers have used less stringent scoring methods (e.g., [Black, Jelinek, et al 1992]), sometimes because their corpora were not fully bracketed to start with.

In experiment (i), using joint syntactic and semantic statistics, the parser correctly generated the target parse as its top choice in 96% of the cases where it found any parse. It achieved this rate while generating only 1.03 parses for each of those sentences. For 53% of the test sentences, the parser found one or more parses at the narrowest beam level, 10^-1. For 19%, the first parse was found at 10^-2. For another 9%, the first parse was found at 10^-3. For the remaining 19% of the test sentences, no parse was found by 10^-3; half of these contained unique vocabulary not seen in the training data.

The contrast between experiments (i) and (ii) indicates the importance of the statistical independence assumptions made by a language model. In experiment (ii), the left context still generated syntactic expectations that influenced the probabilities, but the parser assumed that the semantic role-filler assignments established by the left context were irrelevant. In contrast to the 96% accuracy yielded by the joint syntactic and semantic statistics of experiment (i), these syntax-only statistics picked the target parse in only 79% of the sentences with some parse. The weaker statistics in (ii) also forced the parser to use wider beams.

Knowledge Convergence

After 429 sentences in the telephony domain, the grammar is still growing: the rate of new "structural" (syntactic) rules is diminishing rapidly, but the vocabulary continues to grow significantly. In an attempt to determine how well our incomplete statistics are generalizing, we ran three experiments, based on a 3-way split, a 10-way split, and a 429-way split of the corpus. In the 10-way split, for example, we tested on each tenth after training on the other nine-tenths. The parser used joint syntactic and semantic statistics for all three experiments. The results are summarized in Figure 2.

When the parser was able to find at least one parse, its top choice was nearly always correct (96%) in each experiment. However, the chance of finding at least one parse within ε = 10^-3 increased with the amount of training. The overall success rates were 70% (3-way split), 76% (10-way split), and 77% (429-way split). In each case, a substantial fraction of the error is attributable to test sentences containing words that had not appeared in the training. The remaining error represents grammatically novel structures and/or an undertrained statistical model.

We would also like to know what the parser's accuracy will be when we are very well trained. In a fourth experiment, we trained on all 429 sentences before parsing them. Here, the parser produced only 1.02 parses/sentence and recovered the target parse 98.4% of the time. The correct parse was always ranked first whenever it was found. On the 428 sentences that had at least one parse, the parser only failed to find 10 of the 12,769 target constituents. For the present small corpus, there is apparently no substitute for having seen the test sentence itself.

The fourth experiment, unlike the other three, demonstrates the validity of the independence assumptions in our statistical model. It shows that for this corpus, the parser's generalization performance does not come at the expense of "memorization." That is, the statistics retain enough information for the parser to accurately reconstruct the training data.

Performance

For those sentences where some parse was found at beam 10^-1 during experiment (i) above, 74% of the arcs added to the chart were actually required. We call this measure the focus of the parser, since it quantifies how much of the work done was necessary. The constituent focus, or percentage of the completed constituents that were necessary, was even higher: 78%. The constituent focus fell gradually with wider beams, to 75% at 10^-2 and 65% at 10^-3.
                                                  3-way   10-way   429-way   Test on
                                                  split   split    split     training
    %(top parse correct | some parse found)       96%     96%      96%       98.6%
    %(top parse correct | no unknown words)       81%     84%      85%       98.4%
    %(top parse correct)                          70%     76%      77%       98.4%

    Figure 2: Effect of increased training on accuracy.

To remove the effect of the pruning schedule, we tried rerunning experiment (i) at a constant beam width of ε = 10^-3. Here the constituent focus was 68% when some parse was found. In other words, a correct constituent had, on average, only half a competitor within three orders of magnitude.

In that experiment, the total number of arcs generated during a parse (before pruning) had a 94% linear correlation with sentence length. The shorter half of the corpus (≤ 9 words) yielded the same regression line as the longer half. Moreover, a QQ-plot showed that the residuals were well-modeled by a (truncated) Gaussian distribution. These results suggest O(n) time and space performance on this corpus, plus a Gaussian noise factor.[3]

[3] In keeping with usual statistical practice, we discarded the two 47-word outliers; the remaining sentences were 5-24 words long. We also discarded the 88 sentences with unknown words and/or no parses found, leaving 339 sentences.

Extensibility
Hand-coded natural language systems tend to be plagued by the potential open-endedness of the knowledge required. The corresponding problem for statistical schemes is undertraining. In our task, we do not have a large set of analyzed examples, or a complete lexicon for telephonese. To help compensate, the parser can utilize hand-coded knowledge as well as statistical knowledge. The hand-coded knowledge expresses general knowledge about linguistic subtheories or classes of rules, not specific knowledge about particular rules.

As an example, consider the problem of assigning part of speech to novel words. Several sources of knowledge may help suggest the correct part of speech class: the state of the parser when it encounters the novel word, the relative closedness of the class, the morphology of the word, and orthographic conventions like capitalization. An experimental version of our parser combines various forms of evidence to assign probabilities to novel lexical rules. Using this technique, the experimental parser can take a novel sentence such as "XX YY goes ZZ" and derive syntactic and semantic representations analogous to "Station B1 goes offhook."
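The paper does not specify how these evidence sources are combined; one simple possibility is a naive-Bayes-style product of per-feature likelihoods, sketched below with invented, purely illustrative tables:

    import math

    def score_pos(features, likelihoods, priors):
        """Score candidate part-of-speech classes for a novel word by
        naive-Bayes combination of evidence (parser state, class openness,
        morphology, capitalization). A hypothetical sketch, not the
        paper's actual combination rule."""
        scores = {}
        for pos, prior in priors.items():
            logp = math.log(prior)
            for feat in features:
                logp += math.log(likelihoods[pos].get(feat, 1e-6))
            scores[pos] = logp
        return max(scores, key=scores.get), scores

    priors = {"NN": 0.5, "VB": 0.3, "JJ": 0.2}          # open classes only
    likelihoods = {
        "NN": {"capitalized": 0.4, "suffix_-es": 0.3, "after_DT": 0.5},
        "VB": {"capitalized": 0.05, "suffix_-es": 0.4, "after_DT": 0.01},
        "JJ": {"capitalized": 0.1, "suffix_-es": 0.05, "after_DT": 0.3},
    }
    best, _ = score_pos({"capitalized", "after_DT"}, likelihoods, priors)
    # best == "NN": a capitalized token after a determiner looks like a noun.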
Conclusions and Future Work

Truly robust natural language systems will require both the distributional knowledge and the general linguistic knowledge that are available to humans. Such knowledge will help these systems perform quickly and accurately, even under conditions of noise, ambiguity, or novelty. We are especially interested in bootstrapping approaches to enable a parser to learn more directly from an unbracketed corpus. We would like to combine statistical techniques with weak prior theories of syntax, semantics, and/or low-level recognition such as speech and OCR. Such theories provide an environment in which language learning can take place.

References
[Black, Jelinek, et al 1992] Black, E., Jelinek, F., Lafferty, J., Magerman, D.M., Mercer, R., and Roukos, S. "Towards History-Based Grammars: Using Richer Models for Probabilistic Parsing," Fifth DARPA Workshop on Speech and Natural Language Processing. February 1992.

[Bobrow 1991] Bobrow, R.J. "Statistical Agenda Parsing," Fourth DARPA Workshop on Speech and Natural Language Processing. February 1991.

[Cocke and Schwartz 1970] Cocke, J. and Schwartz, J.I. Programming Languages and Their Compilers. Courant Institute of Mathematical Sciences, New York University, New York, 1970.

[Church 1988] Church, K.W. "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text," Proc. of the Second Conference on Applied Natural Language Processing. Austin, Texas, 1988, pp. 136-143.

[Dagan et al 1991] Dagan, I., Itai, A., and Schwall, U. "Two Languages Are More Informative Than One," Proc. of the 29th Annual Meeting of the Association for Computational Linguistics. Berkeley, California, 1991, pp. 130-137.

[Earley 1970] Earley, J. "An Efficient Context-Free Parsing Algorithm," Communications of the ACM 13(2). February 1970, pp. 94-102.

[Francis and Kucera 1982] Francis, W. and Kucera, H. Frequency Analysis of English Usage. Houghton Mifflin, Boston, 1982.

[Gale and Church 1991] Gale, W.A. and Church, K.W. "A Program for Aligning Sentences in Bilingual Corpora," Proc. of the 29th Annual Meeting of the Association for Computational Linguistics. Berkeley, California, 1991, pp. 177-184.

[Graham et al 1980] Graham, S.L., Harrison, M.A., and Ruzzo, W.L. "An Improved Context-Free Recognizer," ACM Transactions on Programming Languages and Systems 2(3). 1980, pp. 415-463.

[Jelinek and Lafferty 1991] Jelinek, F. and Lafferty, J.D. "Computation of the Probability of Initial Substring Generation by Stochastic Context-Free Grammars," Computational Linguistics 17(3). 1991, pp. 315-323.

[Jones and Eisner 1992] Jones, M.A. and Eisner, J. "A Probabilistic Parser Applied to Software Testing Documents," Tenth National Conference on Artificial Intelligence, San Jose, CA, 1992.

[Jones et al 1991] Jones, M.A., Story, G.A., and Ballard, B.W. "Using Multiple Knowledge Sources in a Bayesian OCR Post-Processor," First International Conference on Document Analysis and Retrieval. St. Malo, France, September 1991, pp. 925-933.

[Magerman and Marcus 1991] Magerman, D.M. and Marcus, M.P. "Pearl: A Probabilistic Chart Parser," Fourth DARPA Workshop on Speech and Natural Language Processing. February 1991.

[Magerman and Weir 1992] Magerman, D.M. and Weir, C. "Probabilistic Prediction and Picky Chart Parsing," Fifth DARPA Workshop on Speech and Natural Language Processing. February 1992.

[Nonnenmann and Eddy 1992] Nonnenmann, U. and Eddy, J.K. "KITSS: A Functional Software Testing System Using a Hybrid Domain Model," Proc. of the 8th IEEE Conference on Artificial Intelligence Applications, Monterey, CA, March 1992.

[Simmons 1990] Simmons, R. and Yu, Y. "The Acquisition and Application of Context Sensitive Grammar for English," Proc. of the 29th Annual Meeting of the Association for Computational Linguistics. Berkeley, California, 1991, pp. 122-129.

[Tanenhaus et al 1985] Tanenhaus, M.K., Carlson, G.N., and Seidenberg, M.S. "Do Listeners Compute Linguistic Representations?" in Natural Language Parsing (eds. D.R. Dowty, L. Karttunen, and A.M. Zwicky). Cambridge University Press, 1985, pp. 359-408.

[Winograd 1983] Winograd, T. Language as a Cognitive Process, Volume 1: Syntax. Addison-Wesley, 1983.
Appendix A: The Parsing Algorithm

Algorithm 1. PARSE(w):
    (* create an (n+1) x (n+1) chart t = (t_{i,j}) *)
    t_{0,0} := {S → Λ_{0,0}·α | S → α is in P};
    for j := 1 to n do
        D_j := t_{j-1,j} := {a_j};
        E_j := ∅;
        while D_j ≠ ∅ do
            N := (t ⊕ D_j) ∪ (PREDICT(D_j) ⊕ D_j) ∪ E_j;
            R := PRUNE(N);
            D_j := E_j := ∅;
            for h_{i,j} ∈ R do
                t_{i,j} := t_{i,j} ∪ {h_{i,j}};
                if h_{i,j} is complete then D_j := D_j ∪ {h_{i,j}}
                else E_j := E_j ∪ {h_{i,j}}
            endfor
        endwhile
    endfor;
    return {all complete S-trees in t_{0,n}}

Function PREDICT(D):
    return {C → Λ_{i,i}·Aγ | C → Aγ is in P, some complete tree A_{i,j} is in D, and C ≠ S}

Function PRUNE(N):    (* only likely trees are kept *)
    R := ∅;
    threshold := ε · max_{h_{i,j}∈N} Pr(h_{i,j} | w_{0,j});
    for h_{i,j} in N do
        if Pr(h_{i,j} | w_{0,j}) > threshold then R := R ∪ {h_{i,j}}
    endfor;
    return R
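For readers who prefer running code, here is a compact Python rendering of the PARSE control flow, reusing DottedTree and paste from the ⊕ sketch earlier. It is our own simplification, not the paper's implementation: lexical rules are folded into a lexicon table, and prob stands in for the cached left-context probability Pr(h_{i,j} | w_{0,j}); with a constant prob and ε = 0, nothing is pruned (the regime of Theorem 4).

    def combine(trees, D):            # Q (+) R, the operator lifted to sets
        return {c for b in D for a in trees for c in [paste(a, b)] if c}

    def predict(D, P, start):         # empty trees C -> . A gamma, as in PREDICT
        return {DottedTree(lhs, b.i, b.i, (), tuple(rhs))
                for b in D
                for lhs, rhss in P.items() if lhs != start
                for rhs in rhss if rhs and rhs[0] == b.root}

    def prune(N, prob, eps):          # fractional beam, as in PRUNE
        if not N:
            return set()
        threshold = eps * max(prob(h) for h in N)
        return {h for h in N if prob(h) > threshold}

    def parse(words, P, lexicon, prob, eps=0.0, start="S"):
        chart = {DottedTree(start, 0, 0, (), tuple(rhs)) for rhs in P[start]}
        for j, word in enumerate(words, start=1):
            # trivial complete trees assert the input symbol a_j
            D = {DottedTree(pos, j - 1, j, (word,)) for pos in lexicon[word]}
            E = set()
            while D:
                N = combine(chart, D) | combine(predict(D, P, start), D) | E
                chart |= D
                R = prune(N, prob, eps)
                new = R - chart
                chart |= R
                D = {h for h in new if h.complete}       # feed completions back
                E = {h for h in R if not h.complete}     # incomplete survivors
        return {h for h in chart
                if h.root == start and h.complete and (h.i, h.j) == (0, len(words))}

    P = {"S": [("NP", "VP")], "VP": [("V", "NP")]}
    lexicon = {"I": ["NP"], "saw": ["V"], "him": ["NP"]}
    print(len(parse("I saw him".split(), P, lexicon, prob=lambda h: 1.0)))  # 1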