Structured Prediction Models via the Matrix-Tree Theorem


  1. Structured Prediction Models via the Matrix-Tree Theorem. Terry Koo (maestro@csail.mit.edu), Amir Globerson (gamir@csail.mit.edu), Xavier Carreras (carreras@csail.mit.edu), Michael Collins (mcollins@csail.mit.edu). MIT Computer Science and Artificial Intelligence Laboratory.

  2. Dependency parsing. Example: * John saw Mary. Syntactic structure is represented by head-modifier dependencies.

  3. Projective vs. non-projective structures. Example: * John saw a movie today that he liked. Non-projective structures allow crossing dependencies and are frequent in languages like Czech, Dutch, etc. Non-projective parsing is maximum-spanning-tree parsing (McDonald et al., 2005).

  4. Contributions of this work. Fundamental inference algorithms that sum over possible structures:
      Model type                   Inference algorithm
      HMM                          Forward-Backward
      Graphical model              Belief Propagation
      PCFG                         Inside-Outside
      Projective dep. trees        Inside-Outside
      Non-projective dep. trees    ??
     This talk: inside-outside-style algorithms for non-projective dependency structures. An application: training log-linear and max-margin parsers. Independently-developed work: Smith and Smith (2007), McDonald and Satta (2007).

  5. Overview Background Matrix-Tree-based inference Experiments

  6. Edge-factored structured prediction. Example (token indices 0-3): * John saw Mary. A dependency tree y is a set of head-modifier dependencies (McDonald et al., 2005; Eisner, 1996). (h, m) is a dependency with feature vector f(x, h, m), and Y(x) is the set of all possible trees for sentence x. The best tree is
      y^* = \arg\max_{y \in \mathcal{Y}(x)} \sum_{(h,m) \in y} w \cdot f(x, h, m)
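
  A minimal sketch of edge-factored scoring, for illustration only: it scores a candidate tree as the sum of w · f(x, h, m) over its edges. The feature vectors, weights, and example tree below are made up, not the paper's feature set.

    # Hypothetical edge-factored scoring sketch (features and weights are made up).
    import numpy as np

    def tree_score(tree, feats, w):
        """Score a dependency tree as the sum of w . f(x, h, m) over its edges.

        tree  : list of (head, modifier) pairs, 0 denoting the root '*'
        feats : dict mapping (head, modifier) -> feature vector for this sentence
        w     : weight vector
        """
        return sum(np.dot(w, feats[(h, m)]) for (h, m) in tree)

    # Toy example for "* John saw Mary" (tokens 1..3) with random 2-d features.
    feats = {(h, m): np.random.randn(2)
             for h in range(4) for m in range(1, 4) if h != m}
    w = np.array([0.5, -0.25])
    y = [(0, 2), (2, 1), (2, 3)]   # * -> saw, saw -> John, saw -> Mary
    print(tree_score(y, feats, w))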

  7. Training log-linear dependency parsers. Given a training set {(x_i, y_i)}_{i=1}^{N}, minimize
      L(w) = \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w)

  8. Training log-linear dependency parsers. Given a training set {(x_i, y_i)}_{i=1}^{N}, minimize
      L(w) = \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w)
     Log-linear distribution over trees:
      P(y \mid x; w) = \frac{1}{Z(x; w)} \exp\Big\{ \sum_{(h,m) \in y} w \cdot f(x, h, m) \Big\}
      Z(x; w) = \sum_{y \in \mathcal{Y}(x)} \exp\Big\{ \sum_{(h,m) \in y} w \cdot f(x, h, m) \Big\}
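
  To make Z(x; w) concrete, here is a brute-force reference sketch that enumerates every multi-root dependency tree of a short sentence and sums the exponentiated edge scores. It is exponential in sentence length and intended only as a sanity check for the determinant-based computation introduced later; the (n+1) x (n+1) theta layout (row = head, column = modifier, index 0 = the root '*') is an assumption of this sketch, not the paper's notation.

    # Brute-force partition function: enumerate all acyclic head assignments.
    import itertools
    import math
    import numpy as np

    def all_trees(n):
        """Yield all multi-root dependency trees over words 1..n as tuples of
        (head, modifier) edges, where head 0 is the root symbol '*'."""
        for heads in itertools.product(range(n + 1), repeat=n):
            # heads[m-1] is the head of word m; forbid self-loops.
            if any(heads[m - 1] == m for m in range(1, n + 1)):
                continue
            # Acyclicity: every word must reach the root 0 by following heads.
            ok = True
            for m in range(1, n + 1):
                seen, cur = set(), m
                while cur != 0:
                    if cur in seen:
                        ok = False
                        break
                    seen.add(cur)
                    cur = heads[cur - 1]
                if not ok:
                    break
            if ok:
                yield tuple((heads[m - 1], m) for m in range(1, n + 1))

    def brute_force_Z(theta, n):
        """Z = sum over trees y of exp( sum of theta[h, m] over edges of y )."""
        return sum(math.exp(sum(theta[h, m] for (h, m) in tree))
                   for tree in all_trees(n))

    theta = np.random.randn(4, 4)      # 3-word sentence; 16 multi-root trees exist
    print(brute_force_Z(theta, 3))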

  9. Training log-linear dependency parsers. Gradient-based optimizers evaluate L(w) and \partial L / \partial w:
      L(w) = \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \sum_{(h,m) \in y_i} w \cdot f(x_i, h, m) + \sum_{i=1}^{N} \log Z(x_i; w)
     Main difficulty: computation of the partition functions Z(x_i; w).

  10. Training log-linear dependency parsers. Gradient-based optimizers evaluate L(w) and \partial L / \partial w:
      \frac{\partial L}{\partial w} = C w - \sum_{i=1}^{N} \sum_{(h,m) \in y_i} f(x_i, h, m) + \sum_{i=1}^{N} \sum_{h',m'} P(h' \rightarrow m' \mid x_i; w) f(x_i, h', m')
     The marginals are edge-appearance probabilities:
      P(h \rightarrow m \mid x; w) = \sum_{y \in \mathcal{Y}(x) : (h,m) \in y} P(y \mid x; w)
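
  The gradient above is the regularizer plus "expected minus observed" feature counts, so once the marginals P(h -> m | x_i; w) are available it is simple bookkeeping to assemble. The sketch below assumes hypothetical per-sentence containers for features, gold edges, and marginals; it is not the paper's implementation.

    # Assembling the log-linear gradient from precomputed edge marginals (sketch).
    import numpy as np

    def loglinear_gradient(w, C, examples):
        """examples: list of (feats, gold_edges, marginals) per sentence, where
        feats[(h, m)] is the feature vector of edge (h, m), gold_edges is the
        set of edges of the gold tree y_i, and marginals[(h, m)] holds
        P(h -> m | x_i; w)."""
        grad = C * w
        for feats, gold_edges, marginals in examples:
            for (h, m), f in feats.items():
                grad += marginals[(h, m)] * f     # expected feature counts
                if (h, m) in gold_edges:
                    grad -= f                     # observed (gold) feature counts
        return grad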

  11. Generalized log-linear inference. Vector θ with a parameter θ_{h,m} for each dependency:
      P(y \mid x; \theta) = \frac{1}{Z(x; \theta)} \exp\Big\{ \sum_{(h,m) \in y} \theta_{h,m} \Big\}
      Z(x; \theta) = \sum_{y \in \mathcal{Y}(x)} \exp\Big\{ \sum_{(h,m) \in y} \theta_{h,m} \Big\}
      P(h \rightarrow m \mid x; \theta) = \frac{1}{Z(x; \theta)} \sum_{y \in \mathcal{Y}(x) : (h,m) \in y} \exp\Big\{ \sum_{(h',m') \in y} \theta_{h',m'} \Big\}
     E.g., θ_{h,m} = w · f(x, h, m).

  12. Applications of log-linear inference. A generalized inference engine takes θ as input; different definitions of θ can be used for log-linear or max-margin training:
      w^*_{LL} = \arg\min_{w} \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w)
      w^*_{MM} = \arg\min_{w} \frac{C}{2} \|w\|^2 + \sum_{i=1}^{N} \max_{y} \big( E_{i,y} - m_{i,y}(w) \big)
     Exponentiated-gradient updates for max-margin models: Bartlett, Collins, Taskar and McAllester (2004); Globerson, Koo, Carreras and Collins (2007).

  13. Overview Background Matrix-Tree-based inference Experiments

  14. Single-root vs. multi-root structures (two example analyses of * John saw Mary). Multi-root structures allow multiple edges from *; single-root structures have exactly one edge from *. Independent adaptations of the Matrix-Tree Theorem: Smith and Smith (2007), McDonald and Satta (2007).

  15. Matrix-Tree Theorem (Tutte, 1984). Given: (1) a directed graph G, (2) edge weights θ, (3) a node r in G, a matrix L^{(r)} can be constructed whose determinant is the sum of weighted spanning trees of G rooted at r. [Figure: example graph on three nodes with edge weights 1, 2, 3, 4.]

  16-18. Matrix-Tree Theorem (Tutte, 1984), continued. On the example graph, summing the weighted spanning trees rooted at node 1 gives
      \det(L^{(1)}) = \exp\{2 + 4\} + \exp\{1 + 3\}

  19. Multi-root partition function. Example (token indices 0-3): * John saw Mary, with edge weights θ and root r = 0. det(L^{(0)}) is the non-projective multi-root partition function.

  20. Construction of L^{(0)}. L^{(0)} has a simple construction:
      off-diagonal: L^{(0)}_{h,m} = -\exp\{\theta_{h,m}\}
      on-diagonal:  L^{(0)}_{m,m} = \sum_{h'=0, h' \neq m}^{n} \exp\{\theta_{h',m}\}
     E.g., L^{(0)}_{3,3} for the sentence * John saw Mary (token indices 0-3). The determinant of L^{(0)} can be evaluated in O(n^3) time.
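
  Following this construction, a short sketch of the multi-root partition function is given below. It assumes the same (n+1) x (n+1) theta layout as the brute-force sketch above (row = head, column = modifier, index 0 = the root '*'); using slogdet instead of det is a numerical-stability choice of the sketch, not something prescribed by the slides. exp of the returned value should match the brute-force Z on small examples.

    # Multi-root partition function via the Matrix-Tree construction of L^(0).
    import numpy as np

    def multi_root_log_partition(theta):
        n = theta.shape[0] - 1
        w = np.exp(theta)                  # edge weights exp(theta[h, m])
        L0 = np.zeros((n, n))
        for m in range(1, n + 1):
            for h in range(1, n + 1):
                if h != m:
                    L0[h - 1, m - 1] = -w[h, m]            # off-diagonal
            L0[m - 1, m - 1] = sum(w[hp, m]                # on-diagonal
                                   for hp in range(n + 1) if hp != m)
        sign, logdet = np.linalg.slogdet(L0)               # O(n^3)
        return logdet

    theta = np.random.randn(4, 4)
    print(np.exp(multi_root_log_partition(theta)))         # multi-root Z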

  21. Single-root vs. multi-root structures (two example analyses of * John saw Mary). Multi-root structures allow multiple edges from *; single-root structures have exactly one edge from *. Independent adaptations of the Matrix-Tree Theorem: Smith and Smith (2007), McDonald and Satta (2007).

  22-25. Single-root partition function. A naive method for computing the single-root non-projective partition function (example, token indices 0-3: * John saw Mary): for each candidate root word k = 1, ..., n, exclude all root edges except (0, k) and compute a determinant; in the example these are the cases (0, 1), (0, 2), and (0, 3). Computing n determinants requires O(n^4) time.
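
  For concreteness, the naive procedure can be sketched as follows: excluding a root edge amounts to giving it score -inf (weight zero), and one multi-root-style determinant is computed per candidate root word. The theta layout is the same assumed in the earlier sketches.

    # Naive O(n^4) single-root partition function: n separate determinants.
    import numpy as np

    def naive_single_root_partition(theta):
        n = theta.shape[0] - 1
        total = 0.0
        for k in range(1, n + 1):
            restricted = theta.copy()
            for m in range(1, n + 1):
                if m != k:
                    restricted[0, m] = -np.inf     # exclude all root edges but (0, k)
            w = np.exp(restricted)
            L = np.zeros((n, n))
            for m in range(1, n + 1):
                for h in range(1, n + 1):
                    if h != m:
                        L[h - 1, m - 1] = -w[h, m]
                L[m - 1, m - 1] = sum(w[hp, m] for hp in range(n + 1) if hp != m)
            total += np.linalg.det(L)              # one O(n^3) determinant per k
        return total                               # n determinants: O(n^4) overall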

  26. Single-root partition function. An alternate matrix L̂ can be constructed such that det(L̂) is the single-root partition function:
      first row:                \hat{L}_{1,m} = \exp\{\theta_{0,m}\}
      other rows, on-diagonal:  \hat{L}_{m,m} = \sum_{h'=1, h' \neq m}^{n} \exp\{\theta_{h',m}\}
      other rows, off-diagonal: \hat{L}_{h,m} = -\exp\{\theta_{h,m}\}
     The single-root partition function therefore requires O(n^3) time.
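
  A sketch of the O(n^3) alternative: build L̂ exactly as described above and take a single determinant. The (n+1) x (n+1) theta layout is again an assumption of the sketch; on small examples, exp of the returned value should agree with the naive n-determinant computation.

    # Single-root partition function via one determinant of L_hat.
    import numpy as np

    def single_root_log_partition(theta):
        n = theta.shape[0] - 1
        w = np.exp(theta)
        Lhat = np.zeros((n, n))
        Lhat[0, :] = w[0, 1:]                      # first row: root-edge weights
        for h in range(2, n + 1):                  # other rows, off-diagonal
            for m in range(1, n + 1):
                if h != m:
                    Lhat[h - 1, m - 1] = -w[h, m]
        for m in range(2, n + 1):                  # other rows, on-diagonal
            Lhat[m - 1, m - 1] = sum(w[hp, m] for hp in range(1, n + 1) if hp != m)
        sign, logdet = np.linalg.slogdet(Lhat)     # O(n^3)
        return logdet

    theta = np.random.randn(4, 4)
    print(np.exp(single_root_log_partition(theta)))   # single-root Z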

  27. Non-projective marginals. The log-partition function generates the marginals:
      P(h \rightarrow m \mid x; \theta) = \frac{\partial \log Z(x; \theta)}{\partial \theta_{h,m}} = \frac{\partial \log \det(\hat{L})}{\partial \theta_{h,m}}
      \frac{\partial \log \det(\hat{L})}{\partial \theta_{h,m}} = \sum_{h',m'} \frac{\partial \log \det(\hat{L})}{\partial \hat{L}_{h',m'}} \frac{\partial \hat{L}_{h',m'}}{\partial \theta_{h,m}}
     Derivative of the log-determinant: \frac{\partial \log \det(\hat{L})}{\partial \hat{L}} = \big( \hat{L}^{-1} \big)^{T}
     Complexity is dominated by the matrix inverse, O(n^3).
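
  Because each theta[h, m] appears in at most two entries of L̂ (one diagonal entry and one off-diagonal entry), the chain-rule sum above collapses to one or two entries of the inverse. The sketch below implements that bookkeeping under the same assumed theta layout; as a sanity check, the head marginals of each modifier should sum to one.

    # Single-root edge marginals from the inverse of L_hat (sketch).
    import numpy as np

    def single_root_marginals(theta):
        n = theta.shape[0] - 1
        w = np.exp(theta)
        # Rebuild L_hat as on slide 26.
        Lhat = np.zeros((n, n))
        Lhat[0, :] = w[0, 1:]
        for h in range(2, n + 1):
            for m in range(1, n + 1):
                if h != m:
                    Lhat[h - 1, m - 1] = -w[h, m]
        for m in range(2, n + 1):
            Lhat[m - 1, m - 1] = sum(w[hp, m] for hp in range(1, n + 1) if hp != m)
        inv = np.linalg.inv(Lhat)                  # O(n^3), dominates the cost

        P = np.zeros((n + 1, n + 1))               # P[h, m] = P(h -> m | x; theta)
        for m in range(1, n + 1):
            # theta[0, m] appears only in the first-row entry Lhat[0, m-1].
            P[0, m] = w[0, m] * inv[m - 1, 0]
            for h in range(1, n + 1):
                if h == m:
                    continue
                val = 0.0
                if m > 1:                          # diagonal entry (m, m) of L_hat
                    val += w[h, m] * inv[m - 1, m - 1]
                if h > 1:                          # off-diagonal entry (h, m)
                    val -= w[h, m] * inv[m - 1, h - 1]
                P[h, m] = val
        return P

    theta = np.random.randn(4, 4)
    print(single_root_marginals(theta).sum(axis=0))    # approx. [0, 1, 1, 1]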

  28. Summary of non-projective inference. Partition function: matrix determinant, O(n^3). Marginals: matrix inverse, O(n^3). Single-root inference uses L̂; multi-root inference uses L^{(0)}.

  29. Overview Background Matrix-Tree-based inference Experiments

  30. Log-linear and max-margin training.
     Log-linear training:
      w^*_{LL} = \arg\min_{w} \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w)
     Max-margin training:
      w^*_{MM} = \arg\min_{w} \frac{C}{2} \|w\|^2 + \sum_{i=1}^{N} \max_{y} \big( E_{i,y} - m_{i,y}(w) \big)

  31. Multilingual parsing experiments Six languages from CoNLL 2006 shared task Training algorithms: averaged perceptron, log-linear models, max-margin models Projective models vs. non-projective models Single-root models vs. multi-root models

  32. Multilingual parsing experiments: Dutch (4.93% cd).
      Training     Projective   Non-Projective
      Perceptron   77.17        78.83
      Log-Linear   76.23        79.55
      Max-Margin   76.53        79.69
     Non-projective training helps on non-projective languages.

  33. Multilingual parsing experiments: Spanish (0.06% cd).
      Training     Projective   Non-Projective
      Perceptron   81.19        80.02
      Log-Linear   81.75        81.57
      Max-Margin   81.71        81.93
     Non-projective training doesn't hurt on projective languages.

  34. Multilingual parsing experiments. Results across all 6 languages (Arabic, Dutch, Japanese, Slovene, Spanish, Turkish):
      Perceptron   79.05
      Log-Linear   79.71
      Max-Margin   79.82
     Log-linear and max-margin parsers show improvement over perceptron-trained parsers. Improvements are statistically significant (sign test).

  35. Summary. Inside-outside-style inference algorithms for non-projective structures: an application of the Matrix-Tree Theorem, with inference for both multi-root and single-root structures. Empirical results: non-projective training is good for non-projective languages, and log-linear and max-margin parsers outperform perceptron parsers.

  36. Thanks! Thanks for listening!

  37. Thanks!
