Log-Linear Models
Michael Collins, Columbia University
The Language Modeling Problem
◮ wi is the i'th word in a document
◮ Estimate a distribution p(wi | w1, w2, . . . wi−1) given previous "history" w1, . . . , wi−1.
◮ E.g., w1, . . . , wi−1 =
Third, the notion "grammatical in English" cannot be identified in any way with the notion "high order of statistical approximation to English". It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical
Trigram Models
◮ Estimate a distribution p(wi|w1, w2, . . . wi−1) given previous
“history” w1, . . . , wi−1 =
Third, the notion “grammatical in English” cannot be identified in any way with the notion “high order of statistical approximation to English”. It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical
◮ Trigram estimates:
q(model | w1, . . . wi−1) = λ1 qML(model | wi−2 = any, wi−1 = statistical)
                           + λ2 qML(model | wi−1 = statistical)
                           + λ3 qML(model)
where λi ≥ 0, Σi λi = 1, and qML(y | x) = Count(x, y) / Count(x)
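As a concrete illustration, the interpolated estimate can be computed directly from n-gram counts. The sketch below is an assumption made for illustration only: the count dictionaries, the λ values, and the name interpolated_q are not part of the lecture.

from collections import defaultdict

# Illustrative n-gram counts; in practice these are collected from a training corpus.
unigram_count = defaultdict(int)   # Count(w)
bigram_count = defaultdict(int)    # Count(u, v)
trigram_count = defaultdict(int)   # Count(u, v, w)
total_words = 0                    # denominator for the unigram estimate

def q_ml(numerator, denominator):
    # Maximum-likelihood estimate Count(x, y) / Count(x); 0 when the context is unseen.
    return numerator / denominator if denominator > 0 else 0.0

def interpolated_q(w, u, v, lambdas=(0.5, 0.3, 0.2)):
    # Linearly interpolated estimate q(w | u, v); the lambdas must be >= 0 and sum to 1.
    l1, l2, l3 = lambdas
    return (l1 * q_ml(trigram_count[(u, v, w)], bigram_count[(u, v)])
            + l2 * q_ml(bigram_count[(v, w)], unigram_count[v])
            + l3 * q_ml(unigram_count[w], total_words))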
Trigram Models
q(model | w1, . . . wi−1) = λ1 qML(model | wi−2 = any, wi−1 = statistical)
                           + λ2 qML(model | wi−1 = statistical)
                           + λ3 qML(model)
◮ Makes use of only bigram, trigram, unigram estimates
◮ Many other "features" of w1, . . . , wi−1 may be useful, e.g.,:
qML(model | wi−2 = any)
qML(model | wi−1 is an adjective)
qML(model | wi−1 ends in "ical")
qML(model | author = Chomsky)
qML(model | "model" does not occur somewhere in w1, . . . wi−1)
qML(model | "grammatical" occurs somewhere in w1, . . . wi−1)
A Naive Approach
q(model | w1, . . . wi−1) = λ1 qML(model | wi−2 = any, wi−1 = statistical)
                           + λ2 qML(model | wi−1 = statistical)
                           + λ3 qML(model)
                           + λ4 qML(model | wi−2 = any)
                           + λ5 qML(model | wi−1 is an adjective)
                           + λ6 qML(model | wi−1 ends in "ical")
                           + λ7 qML(model | author = Chomsky)
                           + λ8 qML(model | "model" does not occur somewhere in w1, . . . wi−1)
                           + λ9 qML(model | "grammatical" occurs somewhere in w1, . . . wi−1)

This quickly becomes very unwieldy...
A Second Example: Part-of-Speech Tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

OUTPUT: Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.

N = Noun, V = Verb, P = Preposition, ADV = Adverb, ADJ = Adjective, . . .
A Second Example: Part-of-Speech Tagging
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .
◮ There are many possible tags in the position ??
{NN, NNS, Vt, Vi, IN, DT, . . . }
◮ The task: model the distribution
p(ti | t1, . . . , ti−1, w1 . . . wn)
where ti is the i'th tag in the sequence, wi is the i'th word
A Second Example: Part-of-Speech Tagging
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .
◮ The task: model the distribution
p(ti | t1, . . . , ti−1, w1 . . . wn)
where ti is the i'th tag in the sequence, wi is the i'th word
◮ Again: many "features" of t1, . . . , ti−1, w1 . . . wn may be relevant
qML(NN | wi = base)
qML(NN | ti−1 is JJ)
qML(NN | wi ends in "e")
qML(NN | wi ends in "se")
qML(NN | wi−1 is "important")
qML(NN | wi+1 is "from")
Overview
◮ Log-linear models
◮ The maximum-entropy property
◮ Smoothing, feature selection etc. in log-linear models
The General Problem
◮ We have some input domain X
◮ Have a finite label set Y
◮ Aim is to provide a conditional probability p(y | x) for any x, y where x ∈ X, y ∈ Y
Language Modeling
◮ x is a “history” w1, w2, . . . wi−1, e.g.,
Third, the notion “grammatical in English” cannot be identified in any way with the notion “high order of statistical approximation to English”. It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical
◮ y is an “outcome” wi
Feature Vector Representations
◮ Aim is to provide a conditional probability p(y | x) for
“decision” y given “history” x
◮ A feature is a function fk(x, y) ∈ R
(Often binary features or indicator functions fk(x, y) ∈ {0, 1}).
◮ Say we have m features fk for k = 1 . . . m
⇒ A feature vector f(x, y) ∈ Rm for any x, y
Language Modeling
◮ x is a “history” w1, w2, . . . wi−1, e.g.,
Third, the notion “grammatical in English” cannot be identified in any way with the notion “high order of statistical approximation to English”. It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical
◮ y is an "outcome" wi
◮ Example features:
f1(x, y) = 1 if y = model, 0 otherwise
f2(x, y) = 1 if y = model and wi−1 = statistical, 0 otherwise
f3(x, y) = 1 if y = model, wi−2 = any, wi−1 = statistical, 0 otherwise
f4(x, y) = 1 if y = model, wi−2 = any, 0 otherwise
f5(x, y) = 1 if y = model, wi−1 is an adjective, 0 otherwise
f6(x, y) = 1 if y = model, wi−1 ends in "ical", 0 otherwise
f7(x, y) = 1 if y = model, author = Chomsky, 0 otherwise
f8(x, y) = 1 if y = model, "model" is not in w1, . . . wi−1, 0 otherwise
f9(x, y) = 1 if y = model, "grammatical" is in w1, . . . wi−1, 0 otherwise
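Each of these features is simply an indicator function on a (history, outcome) pair. A minimal sketch of a few of them, assuming a small History record holding the previous words and an author field (the class and its fields are illustrative assumptions, not part of the lecture):

from dataclasses import dataclass, field
from typing import List

@dataclass
class History:
    # Illustrative representation of a history x = w1, ..., w_{i-1}.
    words: List[str] = field(default_factory=list)   # w1 ... w_{i-1}
    author: str = ""                                  # document-level information

def f1(x: History, y: str) -> int:
    # 1 if y = model, 0 otherwise
    return 1 if y == "model" else 0

def f2(x: History, y: str) -> int:
    # 1 if y = model and w_{i-1} = statistical, 0 otherwise
    return 1 if y == "model" and x.words and x.words[-1] == "statistical" else 0

def f8(x: History, y: str) -> int:
    # 1 if y = model and "model" does not occur in w1 ... w_{i-1}, 0 otherwise
    return 1 if y == "model" and "model" not in x.words else 0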
Defining Features in Practice
◮ We had the following “trigram” feature:
f3(x, y) = 1 if y = model, wi−2 = any, wi−1 = statistical, 0 otherwise
◮ In practice, we would probably introduce one trigram feature for every trigram seen in the training data: i.e., for all trigrams (u, v, w) seen in training data, create a feature
fN(u,v,w)(x, y) = 1 if y = w, wi−2 = u, wi−1 = v, 0 otherwise
where N(u, v, w) is a function that maps each (u, v, w) trigram to a different integer
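One simple way to realize N(u, v, w) is to assign consecutive integers to trigrams as they are first encountered in the training data. A sketch under that assumption (the dictionary and function names are illustrative, not part of the lecture):

# Map each trigram (u, v, w) seen in training data to a distinct feature index.
trigram_to_index = {}

def feature_index(u, v, w):
    # Return N(u, v, w), assigning a fresh integer the first time a trigram is seen.
    key = (u, v, w)
    if key not in trigram_to_index:
        trigram_to_index[key] = len(trigram_to_index)
    return trigram_to_index[key]

def trigram_feature(u, v, w, history_words, y):
    # Value of f_{N(u,v,w)}(x, y): 1 if y = w, w_{i-2} = u and w_{i-1} = v, 0 otherwise.
    return 1 if (y == w and len(history_words) >= 2
                 and history_words[-2] == u and history_words[-1] == v) else 0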
The POS-Tagging Example
◮ Each x is a "history" of the form t1, t2, . . . , ti−1, w1 . . . wn, i
◮ Each y is a POS tag, such as NN, NNS, Vt, Vi, IN, DT, . . .
◮ We have m features fk(x, y) for k = 1 . . . m
For example:
f1(x, y) = 1 if current word wi is base and y = Vt, 0 otherwise
f2(x, y) = 1 if current word wi ends in ing and y = VBG, 0 otherwise
. . .
The Full Set of Features in Ratnaparkhi, 1996
◮ Word/tag features for all word/tag pairs, e.g.,
f100(x, y) = 1 if current word wi is base and y = Vt, 0 otherwise
◮ Spelling features for all prefixes/suffixes of length ≤ 4, e.g.,
f101(x, y) = 1 if current word wi ends in ing and y = VBG, 0 otherwise
f102(x, y) = 1 if current word wi starts with pre and y = NN, 0 otherwise
The Full Set of Features in Ratnaparkhi, 1996
◮ Contextual Features, e.g.,
f103(x, y) = 1 if ⟨ti−2, ti−1, y⟩ = ⟨DT, JJ, Vt⟩, 0 otherwise
f104(x, y) = 1 if ⟨ti−1, y⟩ = ⟨JJ, Vt⟩, 0 otherwise
f105(x, y) = 1 if y = Vt, 0 otherwise
f106(x, y) = 1 if previous word wi−1 = the and y = Vt, 0 otherwise
f107(x, y) = 1 if next word wi+1 = the and y = Vt, 0 otherwise
The Final Result
◮ We can come up with practically any questions (features)
regarding history/tag pairs.
◮ For a given history x ∈ X, each label in Y is mapped to a different feature vector:
f(JJ, DT, Hispaniola, . . . , 6, Vt) = 1001011001001100110
f(JJ, DT, Hispaniola, . . . , 6, JJ) = 0110010101011110010
f(JJ, DT, Hispaniola, . . . , 6, NN) = 0001111101001100100
f(JJ, DT, Hispaniola, . . . , 6, IN) = 0001011011000000010
. . .
Parameter Vectors
◮ Given features fk(x, y) for k = 1 . . . m,
also define a parameter vector v ∈ Rm
◮ Each (x, y) pair is then mapped to a “score”
v · f(x, y) = Σk vk fk(x, y)
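The score is just an inner product. If, as is common with binary features, f(x, y) is represented by the set of indices k with fk(x, y) = 1, the computation reduces to a sparse sum (this representation is an assumption for illustration):

from typing import Dict, Iterable

def score(active_features: Iterable[int], v: Dict[int, float]) -> float:
    # v . f(x, y) when f(x, y) is binary and given as the indices of the features that fire.
    return sum(v.get(k, 0.0) for k in active_features)

# e.g. score([3, 101, 205], v) sums the weights of the three features firing on (x, y).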
Language Modeling
◮ x is a “history” w1, w2, . . . wi−1, e.g.,
Third, the notion “grammatical in English” cannot be identified in any way with the notion “high order of statistical approximation to English”. It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical
◮ Each possible y gets a different score:
v · f(x, model) = 5.6
v · f(x, the) = −3.2
v · f(x, is) = 1.5
v · f(x, of) = 1.3
v · f(x, models) = 4.5
. . .
Log-Linear Models
◮ We have some input domain X, and a finite label set Y. Aim is
to provide a conditional probability p(y | x) for any x ∈ X and y ∈ Y.
◮ A feature is a function f : X × Y → R
(Often binary features or indicator functions fk : X × Y → {0, 1}).
◮ Say we have m features fk for k = 1 . . . m
⇒ A feature vector f(x, y) ∈ Rm for any x ∈ X and y ∈ Y.
◮ We also have a parameter vector v ∈ Rm
◮ We define
p(y | x; v) = exp(v · f(x, y)) / Σy′∈Y exp(v · f(x, y′))
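In code, the conditional distribution is a softmax over the label scores. A minimal sketch, assuming each f(x, y) is stored as a dense numpy row and using the standard max-subtraction trick so the exponentials do not overflow:

import numpy as np

def log_linear_probs(feature_vectors, v):
    # p(y | x; v) for every label y.
    # feature_vectors: array of shape (|Y|, m), row y holding f(x, y)
    # v:               parameter vector of shape (m,)
    scores = feature_vectors @ v        # v . f(x, y) for each y
    scores = scores - scores.max()      # stabilizes exp() without changing the distribution
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()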
Why the name?
log p(y | x; v) = v · f(x, y) − log Σy′∈Y exp(v · f(x, y′))
where v · f(x, y) is the linear term and log Σy′∈Y exp(v · f(x, y′)) is the normalization term.
Maximum-Likelihood Estimation
◮ Maximum-likelihood estimates given training sample (xi, yi) for i = 1 . . . n, each (xi, yi) ∈ X × Y:
vML = argmaxv∈Rm L(v)
where
L(v) = Σi=1..n log p(yi | xi; v) = Σi=1..n v · f(xi, yi) − Σi=1..n log Σy′∈Y exp(v · f(xi, y′))
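The log-likelihood can be evaluated directly from these two terms. A sketch, assuming (purely for illustration) that each training example is given as an array of feature vectors for all labels together with the index of the correct label:

import numpy as np

def log_likelihood(examples, v):
    # L(v) = sum_i [ v . f(xi, yi) - log sum_{y'} exp(v . f(xi, y')) ]
    # examples: list of (feature_vectors, gold_index) pairs, where feature_vectors
    #           has shape (|Y|, m) and gold_index selects the row for yi.
    total = 0.0
    for feature_vectors, gold_index in examples:
        scores = feature_vectors @ v
        m = scores.max()
        log_norm = m + np.log(np.exp(scores - m).sum())   # stable log-sum-exp
        total += scores[gold_index] - log_norm
    return total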
Calculating the Maximum-Likelihood Estimates
◮ Need to maximize:
L(v) = Σi=1..n v · f(xi, yi) − Σi=1..n log Σy′∈Y exp(v · f(xi, y′))
◮ Calculating gradients:
dL(v)/dvk = Σi=1..n fk(xi, yi) − Σi=1..n [ Σy′∈Y fk(xi, y′) exp(v · f(xi, y′)) ] / [ Σz′∈Y exp(v · f(xi, z′)) ]
          = Σi=1..n fk(xi, yi) − Σi=1..n Σy′∈Y fk(xi, y′) · exp(v · f(xi, y′)) / Σz′∈Y exp(v · f(xi, z′))
          = Σi=1..n fk(xi, yi)   (Empirical counts)
            − Σi=1..n Σy′∈Y fk(xi, y′) p(y′ | xi; v)   (Expected counts)
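The last line translates almost literally into code: for each example, add the feature vector of the observed label and subtract the probability-weighted feature vectors of all labels. A sketch using the same assumed (feature_vectors, gold_index) representation as before:

import numpy as np

def gradient(examples, v):
    # dL/dv = sum_i f(xi, yi) - sum_i sum_{y'} f(xi, y') p(y' | xi; v)
    grad = np.zeros_like(v)
    for feature_vectors, gold_index in examples:
        scores = feature_vectors @ v
        scores = scores - scores.max()
        probs = np.exp(scores)
        probs = probs / probs.sum()                  # p(y' | xi; v) for each label
        grad += feature_vectors[gold_index]          # empirical counts
        grad -= probs @ feature_vectors              # expected counts
    return grad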
Gradient Ascent Methods
◮ Need to maximize L(v) where
dL(v)/dv = Σi=1..n f(xi, yi) − Σi=1..n Σy′∈Y f(xi, y′) p(y′ | xi; v)

Initialization: v = 0
Iterate until convergence:
◮ Calculate ∆ = dL(v)/dv
◮ Calculate β* = argmaxβ L(v + β∆) (Line Search)
◮ Set v ← v + β*∆
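Put together, vanilla gradient ascent looks like the sketch below, reusing the log_likelihood and gradient helpers sketched earlier; the crude backtracking line search and the stopping tolerance are illustrative choices, not part of the lecture.

import numpy as np

def gradient_ascent(examples, m, iterations=100, tol=1e-6):
    # Maximize L(v) by repeatedly stepping along the gradient direction.
    v = np.zeros(m)                                  # Initialization: v = 0
    for _ in range(iterations):
        delta = gradient(examples, v)                # Delta = dL(v)/dv
        # Crude search for beta* = argmax_beta L(v + beta * Delta)
        beta, best_beta, best_value = 1.0, 0.0, log_likelihood(examples, v)
        for _ in range(20):
            value = log_likelihood(examples, v + beta * delta)
            if value > best_value:
                best_beta, best_value = beta, value
            beta /= 2.0
        if best_beta == 0.0 or np.linalg.norm(best_beta * delta) < tol:
            break                                    # no improving step found: converged
        v = v + best_beta * delta                    # v <- v + beta* Delta
    return v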
Conjugate Gradient Methods
◮ (Vanilla) gradient ascent can be very slow
◮ Conjugate gradient methods require calculation of the gradient at each iteration, but do a line search in a direction which is a function of the current gradient and the previous step taken.
◮ Conjugate gradient packages are widely available. In general, they require a function
calc_gradient(v) → (L(v), dL(v)/dv)
and that's about it!
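For example, with SciPy's off-the-shelf conjugate-gradient routine it is enough to supply a calc_gradient-style function returning the objective and its gradient (minimizing −L(v) maximizes L(v)). This is only a sketch of the interface, reusing the log_likelihood and gradient helpers from earlier; it is not the lecture's implementation.

import numpy as np
from scipy.optimize import minimize

def fit_log_linear(examples, m):
    def neg_objective(v):
        # Returns (-L(v), -dL(v)/dv); the optimizer minimizes, so we negate both.
        return -log_likelihood(examples, v), -gradient(examples, v)

    result = minimize(neg_objective, np.zeros(m), method="CG", jac=True)
    return result.x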
Overview
◮ Log-linear models
◮ The maximum-entropy property
◮ Smoothing, feature selection etc. in log-linear models
Smoothing in Maximum Entropy Models
◮ Say we have a feature:
f100(x, y) = 1 if current word wi is base and y = Vt, 0 otherwise
◮ In training data, base is seen 3 times, with Vt every time
◮ Maximum-likelihood solution satisfies
Σi f100(xi, yi) = Σi Σy p(y | xi; v) f100(xi, y)
⇒ p(Vt | xi; v) = 1 for any history xi where wi = base
⇒ v100 → ∞ at maximum-likelihood solution (most likely)
⇒ p(Vt | x; v) = 1 for any test data history x where wi = base
Regularization
◮ Modified loss function
L(v) = Σi=1..n v · f(xi, yi) − Σi=1..n log Σy′∈Y exp(v · f(xi, y′)) − (λ/2) Σk=1..m vk²
◮ Calculating gradients:
dL(v)/dvk = Σi=1..n fk(xi, yi)   (Empirical counts)
            − Σi=1..n Σy′∈Y fk(xi, y′) p(y′ | xi; v)   (Expected counts)
            − λvk
◮ Can run conjugate gradient methods as before
◮ Adds a penalty for large weights
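In code, the only change to the unregularized training sketches above is subtracting the penalty and its derivative; lam is an illustrative name for the regularization strength λ, not something fixed by the lecture.

import numpy as np

def regularized_log_likelihood(examples, v, lam=0.1):
    # L(v) minus the penalty (lam / 2) * sum_k v_k^2
    return log_likelihood(examples, v) - 0.5 * lam * np.dot(v, v)

def regularized_gradient(examples, v, lam=0.1):
    # Empirical counts minus expected counts, minus lam * v_k for each k
    return gradient(examples, v) - lam * v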
Experiments with Gaussian Priors
◮ [Chen and Rosenfeld, 1998]: apply log-linear models to
language modeling: Estimate q(wi | wi−2, wi−1)
◮ Unigram, bigram, trigram features, e.g.,
f1(wi−2, wi−1, wi) = 1 if trigram is (the, dog, laughs), 0 otherwise
f2(wi−2, wi−1, wi) = 1 if bigram is (dog, laughs), 0 otherwise
f3(wi−2, wi−1, wi) = 1 if unigram is (laughs), 0 otherwise

q(wi | wi−2, wi−1) = exp(f(wi−2, wi−1, wi) · v) / Σw exp(f(wi−2, wi−1, w) · v)
Experiments with Gaussian Priors
◮ In regular (unregularized) log-linear models, if all n-gram features are included, then it's equivalent to maximum-likelihood estimates!
q(wi | wi−2, wi−1) = Count(wi−2, wi−1, wi) / Count(wi−2, wi−1)
◮ [Chen and Rosenfeld, 1998]: with Gaussian priors, get very good results. Performs as well as or better than standardly used "discounting methods" (see lecture 2).
◮ Downside: computing Σw exp(f(wi−2, wi−1, w) · v) is SLOW.