Structured Prediction Models via the Matrix-Tree Theorem


  1. Structured Prediction Models via the Matrix-Tree Theorem. Terry Koo (maestro@csail.mit.edu), Amir Globerson (gamir@csail.mit.edu), Xavier Carreras (carreras@csail.mit.edu), Michael Collins (mcollins@csail.mit.edu). MIT Computer Science and Artificial Intelligence Laboratory.

  2. Dependency parsing. Example: * John saw Mary. Syntactic structure is represented by head-modifier dependencies.

  3. Projective vs. non-projective structures. Example: * John saw a movie today that he liked. Non-projective structures allow crossing dependencies and are frequent in languages like Czech, Dutch, etc. Non-projective parsing is maximum-spanning-tree parsing (McDonald et al., 2005).

  4. Contributions of this work. Fundamental inference algorithms that sum over possible structures:
      Model type                   Inference algorithm
      HMM                          Forward-Backward
      Graphical model              Belief Propagation
      PCFG                         Inside-Outside
      Projective dep. trees        Inside-Outside
      Non-projective dep. trees    ??
     This talk: inside-outside-style algorithms for non-projective dependency structures. An application: training log-linear and max-margin parsers. Independently-developed work: Smith and Smith (2007), McDonald and Satta (2007).

  5. Overview Background Matrix-Tree-based inference Experiments

  6. Edge-factored structured prediction. Example (token indices 0-3): * John saw Mary. A dependency tree y is a set of head-modifier dependencies (McDonald et al., 2005; Eisner, 1996). (h, m) is a dependency with feature vector f(x, h, m), and Y(x) is the set of all possible trees for sentence x. The best tree is
      y^* = \arg\max_{y \in \mathcal{Y}(x)} \sum_{(h,m) \in y} w \cdot f(x, h, m)
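
  A minimal sketch of edge-factored scoring, for illustration only: it scores a candidate tree as the sum of w · f(x, h, m) over its edges. The feature vectors, weights, and example tree below are made up, not the paper's feature set.

    # Hypothetical edge-factored scoring sketch (features and weights are made up).
    import numpy as np

    def tree_score(tree, feats, w):
        """Score a dependency tree as the sum of w . f(x, h, m) over its edges.

        tree  : list of (head, modifier) pairs, 0 denoting the root '*'
        feats : dict mapping (head, modifier) -> feature vector for this sentence
        w     : weight vector
        """
        return sum(np.dot(w, feats[(h, m)]) for (h, m) in tree)

    # Toy example for "* John saw Mary" (tokens 1..3) with random 2-d features.
    feats = {(h, m): np.random.randn(2)
             for h in range(4) for m in range(1, 4) if h != m}
    w = np.array([0.5, -0.25])
    y = [(0, 2), (2, 1), (2, 3)]   # * -> saw, saw -> John, saw -> Mary
    print(tree_score(y, feats, w))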

  7. Training log-linear dependency parsers. Given a training set {(x_i, y_i)}_{i=1}^{N}, minimize
      L(w) = \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w)

  8. Training log-linear dependency parsers. Given a training set {(x_i, y_i)}_{i=1}^{N}, minimize
      L(w) = \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w)
     Log-linear distribution over trees:
      P(y \mid x; w) = \frac{1}{Z(x; w)} \exp\Big\{ \sum_{(h,m) \in y} w \cdot f(x, h, m) \Big\}
      Z(x; w) = \sum_{y \in \mathcal{Y}(x)} \exp\Big\{ \sum_{(h,m) \in y} w \cdot f(x, h, m) \Big\}
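
  To make Z(x; w) concrete, here is a brute-force reference sketch that enumerates every multi-root dependency tree of a short sentence and sums the exponentiated edge scores. It is exponential in sentence length and intended only as a sanity check for the determinant-based computation introduced later; the (n+1) x (n+1) theta layout (row = head, column = modifier, index 0 = the root '*') is an assumption of this sketch, not the paper's notation.

    # Brute-force partition function: enumerate all acyclic head assignments.
    import itertools
    import math
    import numpy as np

    def all_trees(n):
        """Yield all multi-root dependency trees over words 1..n as tuples of
        (head, modifier) edges, where head 0 is the root symbol '*'."""
        for heads in itertools.product(range(n + 1), repeat=n):
            # heads[m-1] is the head of word m; forbid self-loops.
            if any(heads[m - 1] == m for m in range(1, n + 1)):
                continue
            # Acyclicity: every word must reach the root 0 by following heads.
            ok = True
            for m in range(1, n + 1):
                seen, cur = set(), m
                while cur != 0:
                    if cur in seen:
                        ok = False
                        break
                    seen.add(cur)
                    cur = heads[cur - 1]
                if not ok:
                    break
            if ok:
                yield tuple((heads[m - 1], m) for m in range(1, n + 1))

    def brute_force_Z(theta, n):
        """Z = sum over trees y of exp( sum of theta[h, m] over edges of y )."""
        return sum(math.exp(sum(theta[h, m] for (h, m) in tree))
                   for tree in all_trees(n))

    theta = np.random.randn(4, 4)      # 3-word sentence; 16 multi-root trees exist
    print(brute_force_Z(theta, 3))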

  9. Training log-linear dependency parsers. Gradient-based optimizers evaluate L(w) and \partial L / \partial w:
      L(w) = \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \sum_{(h,m) \in y_i} w \cdot f(x_i, h, m) + \sum_{i=1}^{N} \log Z(x_i; w)
     Main difficulty: computation of the partition functions Z(x_i; w).

  10. Training log-linear dependency parsers. Gradient-based optimizers evaluate L(w) and \partial L / \partial w:
      \frac{\partial L}{\partial w} = C w - \sum_{i=1}^{N} \sum_{(h,m) \in y_i} f(x_i, h, m) + \sum_{i=1}^{N} \sum_{h',m'} P(h' \rightarrow m' \mid x_i; w) f(x_i, h', m')
     The marginals are edge-appearance probabilities:
      P(h \rightarrow m \mid x; w) = \sum_{y \in \mathcal{Y}(x) : (h,m) \in y} P(y \mid x; w)
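
  The gradient above is the regularizer plus "expected minus observed" feature counts, so once the marginals P(h -> m | x_i; w) are available it is simple bookkeeping to assemble. The sketch below assumes hypothetical per-sentence containers for features, gold edges, and marginals; it is not the paper's implementation.

    # Assembling the log-linear gradient from precomputed edge marginals (sketch).
    import numpy as np

    def loglinear_gradient(w, C, examples):
        """examples: list of (feats, gold_edges, marginals) per sentence, where
        feats[(h, m)] is the feature vector of edge (h, m), gold_edges is the
        set of edges of the gold tree y_i, and marginals[(h, m)] holds
        P(h -> m | x_i; w)."""
        grad = C * w
        for feats, gold_edges, marginals in examples:
            for (h, m), f in feats.items():
                grad += marginals[(h, m)] * f     # expected feature counts
                if (h, m) in gold_edges:
                    grad -= f                     # observed (gold) feature counts
        return grad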

  11. Generalized log-linear inference. Vector θ with a parameter θ_{h,m} for each dependency:
      P(y \mid x; \theta) = \frac{1}{Z(x; \theta)} \exp\Big\{ \sum_{(h,m) \in y} \theta_{h,m} \Big\}
      Z(x; \theta) = \sum_{y \in \mathcal{Y}(x)} \exp\Big\{ \sum_{(h,m) \in y} \theta_{h,m} \Big\}
      P(h \rightarrow m \mid x; \theta) = \frac{1}{Z(x; \theta)} \sum_{y \in \mathcal{Y}(x) : (h,m) \in y} \exp\Big\{ \sum_{(h',m') \in y} \theta_{h',m'} \Big\}
     E.g., θ_{h,m} = w · f(x, h, m).

  12. Applications of log-linear inference. A generalized inference engine takes θ as input; different definitions of θ can be used for log-linear or max-margin training:
      w^*_{LL} = \arg\min_{w} \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w)
      w^*_{MM} = \arg\min_{w} \frac{C}{2} \|w\|^2 + \sum_{i=1}^{N} \max_{y} \big( E_{i,y} - m_{i,y}(w) \big)
     Exponentiated-gradient updates for max-margin models: Bartlett, Collins, Taskar and McAllester (2004); Globerson, Koo, Carreras and Collins (2007).

  13. Overview Background Matrix-Tree-based inference Experiments

  14. Single-root vs. multi-root structures (two example analyses of * John saw Mary). Multi-root structures allow multiple edges from *; single-root structures have exactly one edge from *. Independent adaptations of the Matrix-Tree Theorem: Smith and Smith (2007), McDonald and Satta (2007).

  15. Matrix-Tree Theorem (Tutte, 1984). Given: (1) a directed graph G, (2) edge weights θ, (3) a node r in G, a matrix L^{(r)} can be constructed whose determinant is the sum of weighted spanning trees of G rooted at r. [Figure: example graph on three nodes with edge weights 1, 2, 3, 4.]

  16-18. Matrix-Tree Theorem (Tutte, 1984), continued. On the example graph, summing the weighted spanning trees rooted at node 1 gives
      \det(L^{(1)}) = \exp\{2 + 4\} + \exp\{1 + 3\}

  19. Multi-root partition function. Example (token indices 0-3): * John saw Mary, with edge weights θ and root r = 0. det(L^{(0)}) is the non-projective multi-root partition function.

  20. Construction of L^{(0)}. L^{(0)} has a simple construction:
      off-diagonal: L^{(0)}_{h,m} = -\exp\{\theta_{h,m}\}
      on-diagonal:  L^{(0)}_{m,m} = \sum_{h'=0, h' \neq m}^{n} \exp\{\theta_{h',m}\}
     E.g., L^{(0)}_{3,3} for the sentence * John saw Mary (token indices 0-3). The determinant of L^{(0)} can be evaluated in O(n^3) time.
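
  Following this construction, a short sketch of the multi-root partition function is given below. It assumes the same (n+1) x (n+1) theta layout as the brute-force sketch above (row = head, column = modifier, index 0 = the root '*'); using slogdet instead of det is a numerical-stability choice of the sketch, not something prescribed by the slides. exp of the returned value should match the brute-force Z on small examples.

    # Multi-root partition function via the Matrix-Tree construction of L^(0).
    import numpy as np

    def multi_root_log_partition(theta):
        n = theta.shape[0] - 1
        w = np.exp(theta)                  # edge weights exp(theta[h, m])
        L0 = np.zeros((n, n))
        for m in range(1, n + 1):
            for h in range(1, n + 1):
                if h != m:
                    L0[h - 1, m - 1] = -w[h, m]            # off-diagonal
            L0[m - 1, m - 1] = sum(w[hp, m]                # on-diagonal
                                   for hp in range(n + 1) if hp != m)
        sign, logdet = np.linalg.slogdet(L0)               # O(n^3)
        return logdet

    theta = np.random.randn(4, 4)
    print(np.exp(multi_root_log_partition(theta)))         # multi-root Z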

  21. Single-root vs. multi-root structures (two example analyses of * John saw Mary). Multi-root structures allow multiple edges from *; single-root structures have exactly one edge from *. Independent adaptations of the Matrix-Tree Theorem: Smith and Smith (2007), McDonald and Satta (2007).

  22-25. Single-root partition function. A naive method for computing the single-root non-projective partition function (example, token indices 0-3: * John saw Mary): for each candidate root word k = 1, ..., n, exclude all root edges except (0, k) and compute a determinant; in the example these are the cases (0, 1), (0, 2), and (0, 3). Computing n determinants requires O(n^4) time.
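
  For concreteness, the naive procedure can be sketched as follows: excluding a root edge amounts to giving it score -inf (weight zero), and one multi-root-style determinant is computed per candidate root word. The theta layout is the same assumed in the earlier sketches.

    # Naive O(n^4) single-root partition function: n separate determinants.
    import numpy as np

    def naive_single_root_partition(theta):
        n = theta.shape[0] - 1
        total = 0.0
        for k in range(1, n + 1):
            restricted = theta.copy()
            for m in range(1, n + 1):
                if m != k:
                    restricted[0, m] = -np.inf     # exclude all root edges but (0, k)
            w = np.exp(restricted)
            L = np.zeros((n, n))
            for m in range(1, n + 1):
                for h in range(1, n + 1):
                    if h != m:
                        L[h - 1, m - 1] = -w[h, m]
                L[m - 1, m - 1] = sum(w[hp, m] for hp in range(n + 1) if hp != m)
            total += np.linalg.det(L)              # one O(n^3) determinant per k
        return total                               # n determinants: O(n^4) overall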

  26. Single-root partition function. An alternate matrix L̂ can be constructed such that det(L̂) is the single-root partition function:
      first row:                \hat{L}_{1,m} = \exp\{\theta_{0,m}\}
      other rows, on-diagonal:  \hat{L}_{m,m} = \sum_{h'=1, h' \neq m}^{n} \exp\{\theta_{h',m}\}
      other rows, off-diagonal: \hat{L}_{h,m} = -\exp\{\theta_{h,m}\}
     The single-root partition function therefore requires O(n^3) time.
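
  A sketch of the O(n^3) alternative: build L̂ exactly as described above and take a single determinant. The (n+1) x (n+1) theta layout is again an assumption of the sketch; on small examples, exp of the returned value should agree with the naive n-determinant computation.

    # Single-root partition function via one determinant of L_hat.
    import numpy as np

    def single_root_log_partition(theta):
        n = theta.shape[0] - 1
        w = np.exp(theta)
        Lhat = np.zeros((n, n))
        Lhat[0, :] = w[0, 1:]                      # first row: root-edge weights
        for h in range(2, n + 1):                  # other rows, off-diagonal
            for m in range(1, n + 1):
                if h != m:
                    Lhat[h - 1, m - 1] = -w[h, m]
        for m in range(2, n + 1):                  # other rows, on-diagonal
            Lhat[m - 1, m - 1] = sum(w[hp, m] for hp in range(1, n + 1) if hp != m)
        sign, logdet = np.linalg.slogdet(Lhat)     # O(n^3)
        return logdet

    theta = np.random.randn(4, 4)
    print(np.exp(single_root_log_partition(theta)))   # single-root Z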

  27. Non-projective marginals. The log-partition function generates the marginals:
      P(h \rightarrow m \mid x; \theta) = \frac{\partial \log Z(x; \theta)}{\partial \theta_{h,m}} = \frac{\partial \log \det(\hat{L})}{\partial \theta_{h,m}}
      \frac{\partial \log \det(\hat{L})}{\partial \theta_{h,m}} = \sum_{h',m'} \frac{\partial \log \det(\hat{L})}{\partial \hat{L}_{h',m'}} \frac{\partial \hat{L}_{h',m'}}{\partial \theta_{h,m}}
     Derivative of the log-determinant: \frac{\partial \log \det(\hat{L})}{\partial \hat{L}} = \big( \hat{L}^{-1} \big)^{T}
     Complexity is dominated by the matrix inverse, O(n^3).
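
  Because each theta[h, m] appears in at most two entries of L̂ (one diagonal entry and one off-diagonal entry), the chain-rule sum above collapses to one or two entries of the inverse. The sketch below implements that bookkeeping under the same assumed theta layout; as a sanity check, the head marginals of each modifier should sum to one.

    # Single-root edge marginals from the inverse of L_hat (sketch).
    import numpy as np

    def single_root_marginals(theta):
        n = theta.shape[0] - 1
        w = np.exp(theta)
        # Rebuild L_hat as on slide 26.
        Lhat = np.zeros((n, n))
        Lhat[0, :] = w[0, 1:]
        for h in range(2, n + 1):
            for m in range(1, n + 1):
                if h != m:
                    Lhat[h - 1, m - 1] = -w[h, m]
        for m in range(2, n + 1):
            Lhat[m - 1, m - 1] = sum(w[hp, m] for hp in range(1, n + 1) if hp != m)
        inv = np.linalg.inv(Lhat)                  # O(n^3), dominates the cost

        P = np.zeros((n + 1, n + 1))               # P[h, m] = P(h -> m | x; theta)
        for m in range(1, n + 1):
            # theta[0, m] appears only in the first-row entry Lhat[0, m-1].
            P[0, m] = w[0, m] * inv[m - 1, 0]
            for h in range(1, n + 1):
                if h == m:
                    continue
                val = 0.0
                if m > 1:                          # diagonal entry (m, m) of L_hat
                    val += w[h, m] * inv[m - 1, m - 1]
                if h > 1:                          # off-diagonal entry (h, m)
                    val -= w[h, m] * inv[m - 1, h - 1]
                P[h, m] = val
        return P

    theta = np.random.randn(4, 4)
    print(single_root_marginals(theta).sum(axis=0))    # approx. [0, 1, 1, 1]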

  28. Summary of non-projective inference. Partition function: matrix determinant, O(n^3). Marginals: matrix inverse, O(n^3). Single-root inference uses L̂; multi-root inference uses L^{(0)}.

  29. Overview Background Matrix-Tree-based inference Experiments

  30. Log-linear and max-margin training.
     Log-linear training:
      w^*_{LL} = \arg\min_{w} \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w)
     Max-margin training:
      w^*_{MM} = \arg\min_{w} \frac{C}{2} \|w\|^2 + \sum_{i=1}^{N} \max_{y} \big( E_{i,y} - m_{i,y}(w) \big)

  31. Multilingual parsing experiments Six languages from CoNLL 2006 shared task Training algorithms: averaged perceptron, log-linear models, max-margin models Projective models vs. non-projective models Single-root models vs. multi-root models

  32. Multilingual parsing experiments: Dutch (4.93% cd).
      Training     Projective   Non-Projective
      Perceptron   77.17        78.83
      Log-Linear   76.23        79.55
      Max-Margin   76.53        79.69
     Non-projective training helps on non-projective languages.

  33. Multilingual parsing experiments: Spanish (0.06% cd).
      Training     Projective   Non-Projective
      Perceptron   81.19        80.02
      Log-Linear   81.75        81.57
      Max-Margin   81.71        81.93
     Non-projective training doesn't hurt on projective languages.

  34. Multilingual parsing experiments. Results across all 6 languages (Arabic, Dutch, Japanese, Slovene, Spanish, Turkish):
      Perceptron   79.05
      Log-Linear   79.71
      Max-Margin   79.82
     Log-linear and max-margin parsers show improvement over perceptron-trained parsers. Improvements are statistically significant (sign test).

  35. Summary. Inside-outside-style inference algorithms for non-projective structures: an application of the Matrix-Tree Theorem, with inference for both multi-root and single-root structures. Empirical results: non-projective training is good for non-projective languages, and log-linear and max-margin parsers outperform perceptron parsers.

  36. Thanks! Thanks for listening!

  37. Thanks!
