Global Linear Models
Michael Collins, Columbia University


SLIDE 1

Global Linear Models

Michael Collins, Columbia University

SLIDE 2

Overview

- A brief review of history-based methods
- A new framework: global linear models
- Parsing problems in this framework: reranking problems
- Parameter estimation method 1: a variant of the perceptron algorithm

SLIDE 3

Techniques

- So far:
  - Smoothed estimation
  - Probabilistic context-free grammars
  - Log-linear models
  - Hidden Markov models
  - The EM algorithm
  - History-based models
- Today:
  - Global linear models

SLIDE 4

Supervised Learning in Natural Language

- General task: induce a function F from members of a set X to members of a set Y. For example:

  Problem              x ∈ X              y ∈ Y
  Parsing              sentence           parse tree
  Machine translation  French sentence    English sentence
  POS tagging          sentence           sequence of tags

- Supervised learning: we have a training set (x_i, y_i) for i = 1 . . . n

SLIDE 5

The Models so far

- Most of the models we've seen so far are history-based models:
  - We break structures down into a derivation, or sequence of decisions
  - Each decision has an associated conditional probability
  - The probability of a structure is a product of decision probabilities
  - Parameter values are estimated using variants of maximum-likelihood estimation
  - The function F : X → Y is defined as

    F(x) = argmax_y p(x, y; Θ)

    or

    F(x) = argmax_y p(y | x; Θ)

SLIDE 6

Example 1: PCFGs

- We break structures down into a derivation, or sequence of decisions:
  we have a top-down derivation, where each decision is to expand some non-terminal α with a rule α → β

- Each decision has an associated conditional probability:
  α → β has probability q(α → β)

- The probability of a structure is a product of decision probabilities:

  p(T, S) = ∏_{i=1}^{n} q(α_i → β_i)

  where α_i → β_i for i = 1 . . . n are the n rules in the tree

- Parameter values are estimated using variants of maximum-likelihood estimation:

  q(α → β) = Count(α → β) / Count(α)

- The function F : X → Y is defined as

  F(x) = argmax_y p(x, y; Θ)

  and can be computed using dynamic programming
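The maximum-likelihood estimate is just relative-frequency counting over treebank rules. A minimal sketch in Python (the tree encoding and function name are illustrative, not from the lecture):

```python
from collections import Counter

def estimate_rule_probs(trees):
    """MLE: q(alpha -> beta) = Count(alpha -> beta) / Count(alpha).
    Each tree is encoded as a list of (lhs, rhs) rules."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in trees:
        for lhs, rhs in tree:
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {(lhs, rhs): c / lhs_counts[lhs]
            for (lhs, rhs), c in rule_counts.items()}

# Toy treebank: two trees that share the rule S -> NP VP.
trees = [
    [("S", ("NP", "VP")), ("NP", ("she",)), ("VP", ("runs",))],
    [("S", ("NP", "VP")), ("NP", ("trucks",)), ("VP", ("run",))],
]
q = estimate_rule_probs(trees)
# q[("S", ("NP", "VP"))] == 1.0; q[("NP", ("she",))] == 0.5
```

By construction the estimates for a fixed left-hand side sum to one, as the counting scheme requires.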

SLIDE 7

Example 2: Log-linear Taggers

- We break structures down into a derivation, or sequence of decisions:
  for a sentence of length n we have n tagging decisions, in left-to-right order

- Each decision has an associated conditional probability:
  p(t_i | t_{i−1}, t_{i−2}, w_1 . . . w_n), where t_i is the i'th tagging decision and w_i is the i'th word

- The probability of a structure is a product of decision probabilities:

  p(t_1 . . . t_n | w_1 . . . w_n) = ∏_{i=1}^{n} p(t_i | t_{i−1}, t_{i−2}, w_1 . . . w_n)

- Parameter values are estimated using variants of maximum-likelihood estimation:
  p(t_i | t_{i−1}, t_{i−2}, w_1 . . . w_n) is estimated using a log-linear model

- The function F : X → Y is defined as

  F(x) = argmax_y p(y | x; Θ)
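The decomposition multiplies one conditional per tagging decision; in log space it is a sum. A small sketch (the `log_p` interface is a hypothetical stand-in for a trained log-linear model):

```python
import math

def sequence_log_prob(tags, words, log_p):
    """log p(t_1..t_n | w_1..w_n) = sum_i log p(t_i | t_{i-2}, t_{i-1}, w, i)."""
    padded = ["*", "*"] + list(tags)  # start symbols for the first two decisions
    return sum(log_p(t, padded[i], padded[i + 1], words, i)
               for i, t in enumerate(tags))

# Dummy model: every decision has probability 1/3 (a uniform 3-tag model).
uniform = lambda t, t2, t1, words, i: math.log(1 / 3)
lp = sequence_log_prob(["D", "N", "V"], ["the", "dog", "runs"], uniform)
# lp == 3 * log(1/3)
```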

SLIDE 8

A New Set of Techniques: Global Linear Models

Overview of today’s lecture:

- Global linear models as a framework
- Parsing problems in this framework: reranking problems
- A variant of the perceptron algorithm

SLIDE 9

Global Linear Models as a Framework

- We'll move away from history-based models:
  no notion of a "derivation", or of attaching probabilities to "decisions"

- Instead, we'll have feature vectors over entire structures: "global features"

- First piece of motivation: freedom in defining features

SLIDE 10

A Need for Flexible Features

Example 1: parallelism in coordination [Johnson et al., 1999]. Constituents with similar structure tend to be coordinated ⇒ how do we allow the parser to learn this preference?

  "Bars in New York and pubs in London"
  vs. "Bars in New York and pubs"
SLIDE 11

A Need for Flexible Features (continued)

Example 2: semantic features. We might have an ontology giving properties of various nouns/verbs ⇒ how do we allow the parser to use this information?

  "pour the cappuccino"
  vs. "pour the book"

The ontology states that "cappuccino" has the +liquid feature, and "book" does not.

SLIDE 12

Three Components of Global Linear Models

- f is a function that maps a structure (x, y) to a feature vector f(x, y) ∈ R^d

- GEN is a function that maps an input x to a set of candidates GEN(x)

- v is a parameter vector (also a member of R^d)

- Training data is used to set the value of v

SLIDE 13

Component 1: f

- f maps a candidate to a feature vector ∈ R^d
- f defines the representation of a candidate

[parse tree for "She announced a program to promote safety in trucks and vans"]

  ↓ f

  ⟨1, 0, 2, 0, 0, 15, 5⟩

SLIDE 14

Features

I A “feature” is a function on a structure, e.g.,

  h(x, y) = number of times the rule A → B C is seen in (x, y)

[two example trees (x_1, y_1) and (x_2, y_2), with h(x_1, y_1) = 1 and h(x_2, y_2) = 2]

SLIDE 15

Feature Vectors

- A set of functions h_1 . . . h_d define a feature vector

  f(x) = ⟨h_1(x), h_2(x), . . . , h_d(x)⟩

[two example trees T_1 and T_2, with f(T_1) = ⟨1, 0, 0, 3⟩ and f(T_2) = ⟨2, 0, 1, 1⟩]
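Each h_j here counts occurrences of one rule. A sketch that flattens trees to rule lists (the string encoding and the particular tracked rules are made up for illustration):

```python
def feature_vector(tree_rules, tracked):
    """f(x) = <h_1(x), ..., h_d(x)>, where h_j counts how often the
    j-th tracked rule appears in the tree."""
    return [tree_rules.count(r) for r in tracked]

tracked = ["A -> B C", "A -> B", "B -> b", "C -> c"]  # d = 4 tracked rules
t2_rules = ["A -> B C", "A -> B C", "B -> b", "C -> c"]
fv = feature_vector(t2_rules, tracked)
# fv == [2, 0, 1, 1], cf. the vector f(T_2) = <2, 0, 1, 1> on the slide
```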

SLIDE 16

Component 2: GEN

- GEN enumerates a set of candidates for a sentence

  "She announced a program to promote safety in trucks and vans"

  ↓ GEN

[six candidate parse trees, differing in coordination and PP attachment]
SLIDE 17

Component 2: GEN

- GEN enumerates a set of candidates for an input x
- Some examples of how GEN(x) can be defined:
  - Parsing: GEN(x) is the set of parses for x under a grammar
  - Any task: GEN(x) is the top N most probable parses under a history-based model
  - Tagging: GEN(x) is the set of all possible tag sequences with the same length as x
  - Translation: GEN(x) is the set of all possible English translations for the French sentence x
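For the tagging case, GEN(x) can be enumerated explicitly, although this is feasible only for short sentences since it grows as |tagset|^n (function name is my own):

```python
from itertools import product

def gen_tagging(words, tagset):
    """GEN(x) for tagging: every tag sequence with the same length as x."""
    return [list(seq) for seq in product(tagset, repeat=len(words))]

candidates = gen_tagging(["the", "dog"], ["D", "N", "V"])
# 3^2 = 9 candidate tag sequences, from ["D", "D"] to ["V", "V"]
```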

slide-18
SLIDE 18

Component 3: v

- v is a parameter vector ∈ R^d
- f and v together map a candidate to a real-valued score

[parse tree for "She announced a program to promote safety in trucks and vans"]

  ↓ f

  ⟨1, 0, 2, 0, 0, 15, 5⟩

  ↓ f · v

  ⟨1, 0, 2, 0, 0, 15, 5⟩ · ⟨1.9, -0.3, 0.2, 1.3, 0, 1.0, -2.3⟩ = 5.8
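The score is a plain inner product. A sketch reproducing the slide's computation (the two negative entries in v are an assumption: the extracted text shows them unsigned, but these signs are what reproduce the 5.8 result):

```python
def score(f_xy, v):
    """Candidate score f(x, y) . v (inner product)."""
    return sum(fi * vi for fi, vi in zip(f_xy, v))

f_xy = [1, 0, 2, 0, 0, 15, 5]
v = [1.9, -0.3, 0.2, 1.3, 0, 1.0, -2.3]  # signs assumed, see lead-in
s = score(f_xy, v)
# s == 1*1.9 + 2*0.2 + 15*1.0 + 5*(-2.3) = 5.8
```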

SLIDE 19

Putting it all Together

- X is a set of sentences, Y is a set of possible outputs (e.g. trees)
- We need to learn a function F : X → Y
- GEN, f, v define

  F(x) = argmax_{y ∈ GEN(x)} f(x, y) · v

  Choose the highest-scoring candidate as the most plausible structure.

- Given examples (x_i, y_i), how do we set v?
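Putting the three components together, decoding is an argmax over the candidate set. A sketch with GEN(x) given as an explicit list (the candidate names and feature vectors are made up):

```python
def decode(candidates, f, v):
    """F(x) = argmax_{y in GEN(x)} f(x, y) . v."""
    return max(candidates, key=lambda y: sum(a * b for a, b in zip(f(y), v)))

feats = {"y1": [1, 1, 3, 5], "y2": [2, 0, 0, 5], "y3": [0, 0, 3, 0]}
v = [1.0, 2.0, 0.5, 1.0]
best = decode(list(feats), lambda y: feats[y], v)
# scores: y1 -> 9.5, y2 -> 7.0, y3 -> 1.5, so best == "y1"
```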

SLIDE 20

She announced a program to promote safety in trucks and vans

  ↓ GEN

[six candidate parse trees]

  ↓ f (applied to each candidate)

⟨1, 1, 3, 5⟩   ⟨2, 0, 0, 5⟩   ⟨1, 0, 1, 5⟩   ⟨0, 0, 3, 0⟩   ⟨0, 1, 0, 5⟩   ⟨0, 0, 1, 5⟩

  ↓ f · v

13.6   12.2   12.1   3.3   9.4   11.1

  ↓ argmax

[the highest-scoring parse tree, with score 13.6]
SLIDE 21

Overview

- A brief review of history-based methods
- A new framework: global linear models
- Parsing problems in this framework: reranking problems
- Parameter estimation method 1: a variant of the perceptron algorithm

SLIDE 22

Reranking Approaches to Parsing

- Use a baseline parser to produce the top N parses for each sentence in training and test data:

  GEN(x) is the top N parses for x under the baseline model

- One method: use a lexicalized PCFG to generate a number of parses (in our experiments, around 25 parses on average for 40,000 training sentences, giving ≈ 1 million training parses)

- Supervision: for each x_i, take y_i to be the parse in GEN(x_i) that is "closest" to the treebank parse

SLIDE 23

The Representation f

- Each component of f could be essentially any feature over parse trees

- For example:

  f_1(x, y) = log probability of (x, y) under the baseline model

  f_2(x, y) = 1 if (x, y) includes the rule VP → PP VBD NP, 0 otherwise
SLIDE 24

From [Collins and Koo, 2005]: the following types of features were included in the model. We will use the rule VP -> PP VBD NP NP SBAR with head VBD as an example. Note that our baseline parser produces syntactic trees with headword annotations.

SLIDE 25

Rules These include all context-free rules in the tree, for example VP -> PP VBD NP NP SBAR.


SLIDE 26

Bigrams These are adjacent pairs of non-terminals to the left and right of the head. The example rule would contribute the bigrams (Right,VP,NP,NP), (Right,VP,NP,SBAR), and (Right,VP,SBAR,STOP) to the right of the head, and (Left,VP,PP,STOP) to the left of the head.
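The bigram features can be read off mechanically once the head position is known. A sketch (the function name and the sisters-plus-STOP encoding are my own; the output matches the four bigrams listed above):

```python
def bigram_features(parent, left_sisters, right_sisters):
    """Adjacent non-terminal pairs on each side of the head, with a STOP
    symbol closing each side. Sisters are listed outward from the head."""
    feats = []
    for side, sisters in (("Left", left_sisters), ("Right", right_sisters)):
        seq = sisters + ["STOP"]
        feats += [(side, parent, a, b) for a, b in zip(seq, seq[1:])]
    return feats

# VP -> PP VBD NP NP SBAR with head VBD: left sisters [PP], right [NP, NP, SBAR]
feats = bigram_features("VP", ["PP"], ["NP", "NP", "SBAR"])
# [("Left","VP","PP","STOP"), ("Right","VP","NP","NP"),
#  ("Right","VP","NP","SBAR"), ("Right","VP","SBAR","STOP")]
```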

SLIDE 27

Grandparent Rules Same as Rules, but also including the non-terminal above the rule.

[tree diagram: the rule VP -> PP VBD NP NP SBAR under its parent non-terminal S]

SLIDE 28

Two-level Rules Same as Rules, but also including the entire rule above the rule.

[tree diagram: the rule VP -> PP VBD NP NP SBAR together with the rule above it, S -> NP VP]

SLIDE 29

Overview

- A brief review of history-based methods
- A new framework: global linear models
- Parsing problems in this framework: reranking problems
- Parameter estimation method 1: a variant of the perceptron algorithm

SLIDE 30

A Variant of the Perceptron Algorithm

Inputs: training set (x_i, y_i) for i = 1 . . . n

Initialization: v = 0

Define: F(x) = argmax_{y ∈ GEN(x)} f(x, y) · v

Algorithm:
  For t = 1 . . . T, i = 1 . . . n:
    z_i = F(x_i)
    If z_i ≠ y_i: v = v + f(x_i, y_i) - f(x_i, z_i)

Output: parameters v
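The algorithm on this slide translates almost line for line into code. A minimal sketch (the toy GEN, one-hot features, and candidate names are illustrative, not from the lecture):

```python
def perceptron(train, gen, f, d, T=5):
    """Perceptron variant from the slide: decode each x_i with the current
    v; on a mistake, add f(x_i, y_i) and subtract f(x_i, z_i)."""
    v = [0.0] * d
    for _ in range(T):
        for x, y in train:
            # z_i = F(x_i) under the current parameters
            z = max(gen(x), key=lambda c: sum(a * b for a, b in zip(f(x, c), v)))
            if z != y:
                v = [vi + a - b for vi, a, b in zip(v, f(x, y), f(x, z))]
    return v

# Toy problem: two candidates with one-hot features; the target "cand_b"
# initially ties with "cand_a" and loses the tie-break, forcing one update.
gen = lambda x: ["cand_a", "cand_b"]
f = lambda x, y: [1.0, 0.0] if y == "cand_a" else [0.0, 1.0]
v = perceptron([("x1", "cand_b")], gen, f, d=2)
# one update gives v == [-1.0, 1.0]; afterwards "cand_b" wins and v is stable
```

After the single mistake, the correct candidate scores strictly higher, so later passes make no further updates, which is exactly the algorithm's fixed point on separable data.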

SLIDE 31

Perceptron Experiments: Parse Reranking

Parsing the Wall Street Journal treebank:

- Training set = 40,000 sentences, test = 2,416 sentences
- Generative model (Collins 1999): 88.2% F-measure
- Reranked model: 89.5% F-measure (an 11% relative error reduction)

- Results from Charniak and Johnson, 2005:
  - Improvement from 89.7% (baseline generative model) to 91.0% accuracy
  - Gains from improved n-best lists, better features, and a better baseline model

SLIDE 32

Summary

- A new framework: global linear models (GEN, f, v)

- There are several ways to train the parameters v:
  - Perceptron
  - Boosting
  - Log-linear models (maximum likelihood)

- Applications:
  - Parsing
  - Generation
  - Machine translation
  - Tagging problems
  - Speech recognition