Multiword Expression Identification with Tree Substitution Grammars
Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning
Stanford University
EMNLP 2011
Main Idea
Use syntactic context to find multiword expressions
◮ Syntactic context → constituency parses
◮ Multiword expressions → idiomatic constructions
Which languages?
Results and analysis for French
◮ Lexicographic tradition of compiling MWE lists
◮ Annotated data!
English examples in the talk
Motivating Example: Humans get this
1. He kicked the pail.
2. He kicked the bucket.
◮ “He died.”
(Katz and Postal 1963)
Stanford parser can’t tell the difference
(S (NP He) (VP kicked (NP the pail)))
(S (NP He) (VP kicked (NP the bucket)))
What does the lexicon contain?
Single-word entries?
◮ kick : <agent, theme>
◮ die : <theme>
Multi-word entries?
◮ kick the bucket : <theme>
(S (NP He) (VP kicked (NP the bucket)))
Lexicon-Grammar: He kicked the bucket
(S (NP He) (VP died))
(S (NP He) (VP (MWV kicked the bucket)))
(Gross 1986)
MWEs in Lexicon-Grammar
Classified by global POS
Described by internal POS sequence
Flat structures!
(MWV (VBD kicked) (DT the) (NN bucket))
Of theoretical interest but...
Why do we care (in NLP)?
MWE knowledge improves:
◮ Dependency parsing (Nivre and Nilsson 2004)
◮ Constituency parsing (Arun and Keller 2005)
◮ Sentence generation (Hogan et al. 2007)
◮ Machine translation (Carpuat and Diab 2010)
◮ Shallow parsing (Korkontzelos and Manandhar 2010)
Most experiments assume high accuracy identification!
French and the French Treebank
MWEs common in French
◮ ∼5,000 multiword adverbs
Paris 7 French Treebank
◮ ∼16,000 trees
◮ 13% of tokens are MWEs
Example: (MWC (P sous) (N prétexte) (C que)) 'on the grounds that'
French Treebank: MWE types
[Bar chart: % of total MWEs by global POS: N, ADV, P, C, V, D, PRO, CL, ET, I]
Lots of nominal compounds, e.g. N–N numéro deux 'number two'
MWE Identification Evaluation
Identification is a by-product of parsing
◮ Corpus: Paris 7 French Treebank (FTB)
◮ Split: same as Crabbé and Candito (2008)
◮ Metrics: Precision and Recall
◮ Lengths ≤ 40 words
MWE Identification: Parent-Annotated PCFG
PA-PCFG: 32.6 F1
MWE Identification: n-gram methods
PA-PCFG: 32.6 F1
mwetoolkit: 34.7 F1
Standard approach in 2008 MWE Shared Task, MWE Workshops, etc.
n-gram methods: mwetoolkit
Based on surface statistics
Step 1: Lemmatize and POS tag the corpus
Step 2: Compute n-gram statistics:
◮ Maximum likelihood estimator
◮ Dice's coefficient
◮ Pointwise mutual information
◮ Student's t-score
(Ramisch, Villavicencio, and Boitet 2010)
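As a sketch of Step 2, the association measures can be computed directly from corpus counts. This minimal Python is an illustration, not mwetoolkit's actual code; the function name `bigram_scores` is an assumption:

```python
from collections import Counter
from math import log2

def bigram_scores(tokens):
    """Score each adjacent word pair with PMI and Dice's coefficient."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        p12 = c12 / (n - 1)                       # bigram MLE
        p1, p2 = unigrams[w1] / n, unigrams[w2] / n
        pmi = log2(p12 / (p1 * p2))               # pointwise mutual information
        dice = 2 * c12 / (unigrams[w1] + unigrams[w2])
        scores[(w1, w2)] = (pmi, dice)
    return scores
```

On real data these scores would be computed over lemmas and POS tags, per Step 1.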
n-gram methods: mwetoolkit
Step 3: Create n-gram feature vectors
Step 4: Train a binary classifier
Exploits the statistical idiomaticity of MWEs
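Steps 3 and 4 can be sketched with any linear classifier over the n-gram feature vectors; here is a minimal perceptron in plain Python (an illustration of the pipeline shape, not the classifier mwetoolkit actually ships):

```python
def train_perceptron(vectors, labels, epochs=50):
    """Train a simple perceptron: vectors are n-gram feature vectors
    (e.g. association scores), labels are +1 (MWE) / -1 (not MWE)."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                    # misclassified: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```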
Is statistical idiomaticity sufficient?
French multiword verbs
The tree maintains the relationship between the parts of the MWV:
(VN (MWV va) (MWADV d'ailleurs) (MWV bon train)) 'is also well underway'
Recap: French MWE Identification Baselines
PA-PCFG: 32.6 F1
mwetoolkit: 34.7 F1
Let’s build a better grammar
Better PCFGs: Manual grammar splits
Symbol refinement à la Klein and Manning (2003)
◮ Has a verbal nucleus (VN)
(COORD-hasVN (C Ou) (ADV bien) (VN doit -il ...)) 'Otherwise he must ...'
French MWE Identification: Manual Splits
PA-PCFG: 32.6 F1
mwetoolkit: 34.7 F1
Splits: 63.1 F1
MWE features: high frequency POS sequences
Capture more syntactic context?
PCFGs work well!
Larger "rules": Tree Substitution Grammars (TSG)
Relationship with Data-Oriented Parsing (DOP):
◮ Same grammar formalism (TSG)
◮ We include unlexicalized fragments
◮ Different parameter estimation
Which tree fragments do we select?
(S (NP (N He)) (VP (MWV (V kicked) (D the) (N bucket))))
Which tree fragments do we select?
Selected fragments:
(NP (N He))   (V kicked)   (MWV V (D the) (N bucket))   (S NP (VP MWV))
TSG Grammar Extraction as Tree Selection
(MWV V (D the) (N bucket))
◮ Describes MWE context
◮ Allows for inflection: kick, kicked, kicking
Dirichlet process TSG (DP-TSG)
Tree selection as non-parametric clustering¹
Labeled Chinese Restaurant process
◮ Dirichlet process (DP) prior for each non-terminal type c
Supervised case: segment the treebank
¹ Cohn, Goldwater, and Blunsom 2009; Post and Gildea 2009; O'Donnell, Tenenbaum, and Goodman 2009.
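To make the clustering metaphor concrete, here is a minimal simulation of the plain (unlabeled) Chinese Restaurant process; the DP-TSG uses a labeled variant with one process per non-terminal type, which this sketch omits:

```python
import random

def crp_assign(n_customers, alpha, rng=random.Random(0)):
    """Seat customers by the Chinese Restaurant process: customer i joins
    occupied table k with probability n_k / (i + alpha) and opens a new
    table with probability alpha / (i + alpha)."""
    tables = []          # tables[k] = number of customers at table k
    seating = []         # seating[i] = table index of customer i
    for i in range(n_customers):
        weights = tables + [alpha]        # last slot = new table
        r = rng.uniform(0, i + alpha)     # total weight is i + alpha
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(tables):
            tables.append(1)              # open a new table (cluster)
        else:
            tables[k] += 1
        seating.append(k)
    return seating, tables
```

The "rich get richer" dynamics of the table counts is what lets frequent tree fragments form large clusters.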
DP-TSG: Learning and Inference
DP base distribution from the manually-split CFG
Type-based Gibbs sampler (Liang, Jordan, and Klein 2010)
◮ Fast convergence: 400 iterations
Derivations of a TSG are a CFG forest
◮ SCFG decoder: cdec (Dyer et al. 2010)
French MWE Identification: DP-TSG
PA-PCFG: 32.6 F1
mwetoolkit: 34.7 F1
Splits: 63.1 F1
DP-TSG: 71.1 F1
DP-TSG result is a lower bound
Human-interpretable DP-TSG rules
MWN → coup de N
◮ coup de pied 'kick'
◮ coup de coeur 'favorite'
◮ coup de foudre 'love at first sight'
◮ coup de main 'help'
◮ coup de grâce 'death blow'
n-gram methods: separate feature vectors
DP-TSG errors: Overgeneration
Reference: (NP (D Le) (N marché) (AP (A national))) 'the national market'
DP-TSG: (NP (D Le) (MWN (N marché) (A national)))
MWEs are subtle; the reference annotation is sometimes inconsistent
Standard Parsing Evaluation
Same setup as MWE identification!
◮ Corpus: Paris 7 French Treebank (FTB)
◮ Split: same as Crabbé and Candito (2008)
◮ Metrics: Evalb and Leaf Ancestor
◮ Lengths ≤ 40 words
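Evalb-style labeled bracket scoring reduces to multiset overlap of (label, start, end) spans. A minimal sketch of that computation (not the actual Evalb tool, which also applies parameterized label and punctuation rules):

```python
from collections import Counter

def bracket_f1(gold, guess):
    """Labeled bracket precision/recall/F1 over (label, start, end) spans."""
    g, h = Counter(gold), Counter(guess)
    match = sum((g & h).values())       # multiset intersection of spans
    p = match / sum(h.values())
    r = match / sum(g.values())
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```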
French Parsing Evaluation: All bracketings
PA-PCFG: 67.6 Evalb F1
Splits: 75.2 Evalb F1
DP-TSG: 75.8 Evalb F1
Paper: more results (Stanford, Berkeley, etc.)
Future Directions
Syntactic context for n-gram methods
◮ Parse the corpus!
◮ Adapt lexical context measures to syntactic context
DP-TSG
◮ Better base distribution
Conclusion
Parsers work well for MWE identification
Other languages: combine treebanks with MWE lists
Non-"gold mode" parsing results for French
Code → Google: “Stanford parser”
Un grand merci. 'Thanks a lot.'
Questions?
MWE Identification Results
PA-PCFG: 32.6 F1
mwetoolkit: 34.7 F1
Splits: 63.1 F1
Berkeley: 69.6 F1
Stanford: 70.1 F1
DP-TSG: 71.1 F1
Dirichlet process TSG
DP prior for each non-terminal type c ∈ V:
θ_c | c, α_c, P0(·|c) ∼ DP(α_c, P0)
e | θ_c ∼ θ_c
Binary variable b_s for each non-terminal node in the corpus
◮ Supervised case: segment the treebank²
² Cohn, Goldwater, and Blunsom 2009; Post and Gildea 2009; O'Donnell, Tenenbaum, and Goodman 2009.
DP-TSG: Base distribution P0
Phrasal rules: P0(A⁺ → B⁻ C⁺) = pMLE(A → B C) · s_B · (1 − s_C)
pMLE is the manually-split grammar!
s_B is the stop probability
DP-TSG: Base distribution P0
Lexical insertion rules: P0(C⁺ → t) = pMLE(C → t) · p(t)
p(t) is the unigram probability of word t
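Putting the two formulas together, the P0 score of a whole elementary tree multiplies one pMLE term per rule with the stop probability s for each frontier child and (1 − s) for each child expanded inside the fragment. A sketch of that recursion; the tuple encoding of trees and all the probability values in the example are illustrative assumptions, not values from the paper:

```python
def p0_fragment(node, p_mle, stop, p_word):
    """Score an elementary tree under the base distribution P0.
    node = (label, children); children is [] for a frontier non-terminal,
    ["word"] for a lexical insertion, or a list of child nodes."""
    label, children = node
    if not children:                     # frontier: substitution site ends the fragment
        return 1.0
    if isinstance(children[0], str):     # lexical rule: P0(C+ -> t) = pMLE(C -> t) * p(t)
        return p_mle[(label, children[0])] * p_word[children[0]]
    # phrasal rule: pay pMLE, then s for each child that stops (frontier)
    # and (1 - s) for each child that continues into the fragment
    prob = p_mle[(label, tuple(c[0] for c in children))]
    for child in children:
        child_label, child_kids = child
        s = stop[child_label]
        prob *= s if not child_kids else (1 - s)
        prob *= p0_fragment(child, p_mle, stop, p_word)
    return prob
```

For the fragment (MWV V (D the) (N bucket)), the V child pays s_V (it is a frontier), while D and N pay (1 − s) plus their lexical-insertion terms.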
Tree substitution grammars
A probabilistic TSG is a 5-tuple ⟨V, Σ, R, ♦, θ⟩:
◮ c ∈ V are non-terminals
◮ ♦ ∈ V is a unique start symbol
◮ t ∈ Σ are terminals
◮ e ∈ R are elementary trees
◮ θ_{c,e} ∈ θ are parameters for each tree fragment
elementary tree == tree fragment
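The 5-tuple can be written down directly as a data structure. A minimal sketch; the class name, the bracketed-string fragment encoding, and the `is_proper` check are assumptions for illustration, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class PTSG:
    """A probabilistic TSG <V, Sigma, R, start, theta>."""
    nonterminals: set                             # V
    terminals: set                                # Sigma
    start: str                                    # the unique start symbol
    rules: dict = field(default_factory=dict)     # R with theta: root -> {fragment: prob}

    def add_fragment(self, root, fragment, prob):
        """Register elementary tree `fragment` rooted at `root` with weight theta."""
        self.rules.setdefault(root, {})[fragment] = prob

    def is_proper(self, root, tol=1e-9):
        """theta_{c,.} must sum to 1 over the fragments rooted at c."""
        return abs(sum(self.rules[root].values()) - 1.0) < tol
```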