
slide-1
SLIDE 1

CS 6355: Structured Prediction

Predicting Structures: Conditional Models and Local Classifiers

1

slide-2
SLIDE 2

Outline

  • Sequence models
  • Hidden Markov models
    – Inference with HMM
    – Learning
  • Conditional Models and Local Classifiers
  • Global models
    – Conditional Random Fields
    – Structured Perceptron for sequences

2

slide-3
SLIDE 3

Today’s Agenda

  • Conditional models for predicting sequences
  • Log-linear models for multiclass classification
  • Maximum Entropy Markov Models

3


slide-5
SLIDE 5

HMM redux

  • The independence assumption

$P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$

5
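A small sketch may make the factorization concrete; the parameters (initial probabilities, transitions, emissions) and all numbers below are made-up toy values for illustration, not taken from the lecture.

```python
# Toy HMM: joint probability of an (input, output) pair under the factorization above.
# All parameter values here are made-up assumptions, purely for illustration.

pi = {"D": 0.7, "N": 0.3}                                  # initial state probabilities P(y_1)
A  = {("D", "N"): 0.9, ("D", "D"): 0.1,
      ("N", "N"): 0.4, ("N", "D"): 0.6}                    # transition probabilities P(y_{t+1} | y_t)
B  = {("D", "the"): 0.8, ("D", "dog"): 0.2,
      ("N", "the"): 0.1, ("N", "dog"): 0.9}                # emission probabilities P(x_t | y_t)

def joint_probability(x, y):
    """P(x, y) = P(y_1) * prod_t P(y_{t+1} | y_t) * prod_t P(x_t | y_t)."""
    p = pi[y[0]]
    for t in range(len(y) - 1):
        p *= A[(y[t], y[t + 1])]
    for t in range(len(y)):
        p *= B[(y[t], x[t])]
    return p

print(joint_probability(["the", "dog"], ["D", "N"]))       # 0.7 * 0.9 * 0.8 * 0.9 = 0.4536
```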

slide-6
SLIDE 6

HMM redux

  • The independence assumption

$P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$

  • Training via maximum likelihood

$\max_{\pi, A, B} P(D \mid \pi, A, B) = \max_{\pi, A, B} \prod_i P(\mathbf{x}_i, \mathbf{y}_i \mid \pi, A, B)$

We are optimizing the joint likelihood of the input and the output for training

6


slide-8
SLIDE 8

HMM redux

  • The independence assumption

$P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$

  • Training via maximum likelihood

$\max_{\pi, A, B} P(D \mid \pi, A, B) = \max_{\pi, A, B} \prod_i P(\mathbf{x}_i, \mathbf{y}_i \mid \pi, A, B)$

We are optimizing the joint likelihood of the input and the output for training

Probability of input given the prediction!

8

slide-9
SLIDE 9

HMM redux

  • The independence assumption

$P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$

  • Training via maximum likelihood

$\max_{\pi, A, B} P(D \mid \pi, A, B) = \max_{\pi, A, B} \prod_i P(\mathbf{x}_i, \mathbf{y}_i \mid \pi, A, B)$

We are optimizing the joint likelihood of the input and the output for training.

At prediction time, we only care about the probability of the output given the input: $P(y_1, y_2, \cdots, y_n \mid x_1, x_2, \cdots, x_n)$. Why not directly optimize this conditional likelihood instead?

9

Probability of input given the prediction!

slide-10
SLIDE 10

Modeling next-state directly

  • Instead of modeling the joint distribution P(x, y), focus on P(y ∣ x) only

– Which is what we care about eventually anyway

(At least in this context)

  • For sequences, different formulations

    – Maximum Entropy Markov Model [McCallum, et al 2000]
    – Projection-based Markov Model [Punyakanok and Roth, 2001]

(Other names: discriminative/conditional Markov model, …)

10

slide-11
SLIDE 11
Generative vs Discriminative models

  • Generative models
    – Learn P(x, y)
    – Characterize how the data is generated (both inputs and outputs)
    – E.g.: Naïve Bayes, Hidden Markov Model

  • Discriminative models
    – Learn P(y | x)
    – Directly characterize the decision boundary only
    – E.g.: Logistic Regression, conditional models (several names)

11

slide-12
SLIDE 12
Generative vs Discriminative models

  • Generative models
    – Learn P(x, y)
    – Characterize how the data is generated (both inputs and outputs)
    – E.g.: Naïve Bayes, Hidden Markov Model

  • Discriminative models
    – Learn P(y | x)
    – Directly characterize the decision boundary only
    – E.g.: Logistic Regression, conditional models (several names)

A generative model tries to characterize the distribution of the inputs; a discriminative model doesn’t care

12


slide-16
SLIDE 16

HMM redux

  • The independence assumption

$P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$

  • Training via maximum likelihood

$\max_{\pi, A, B} P(D \mid \pi, A, B) = \max_{\pi, A, B} \prod_i P(\mathbf{x}_i, \mathbf{y}_i \mid \pi, A, B)$

We are optimizing the joint likelihood of the input and the output for training

16

Probability of input given the prediction!

At prediction time, we only care about the probability of the output given the input. Why not directly optimize this conditional likelihood instead?
slide-17
SLIDE 17

Let’s revisit the independence assumptions

(Figure: HMM graphical model over states y_{t-1}, y_t and observation x_t)

17

$P(y_t \mid y_{t-1}, \text{anything else}) = P(y_t \mid y_{t-1})$
$P(x_t \mid y_t, \text{anything else}) = P(x_t \mid y_t)$

slide-18
SLIDE 18

Another independence assumption

(Figures: HMM vs. conditional model, each over y_{t-1}, y_t, x_t)

18

$P(y_t \mid y_{t-1}, y_{t-2}, \cdots, x_t, x_{t-1}, \cdots) = P(y_t \mid y_{t-1}, x_t)$


slide-20
SLIDE 20

Another independence assumption

This assumption lets us write the conditional probability of the entire output sequence $\mathbf{y}$ as

(Figures: HMM vs. conditional model, each over y_{t-1}, y_t, x_t)

20

$P(y_t \mid y_{t-1}, y_{t-2}, \cdots, x_t, x_{t-1}, \cdots) = P(y_t \mid y_{t-1}, x_t)$

$P(\mathbf{y} \mid \mathbf{x}) = \prod_t P(y_t \mid y_{t-1}, x_t)$
slide-21
SLIDE 21

Another independence assumption

This assumption lets us write the conditional probability of the entire output sequence $\mathbf{y}$ as

(Figures: HMM vs. conditional model, each over y_{t-1}, y_t, x_t)

21

$P(y_t \mid y_{t-1}, y_{t-2}, \cdots, x_t, x_{t-1}, \cdots) = P(y_t \mid y_{t-1}, x_t)$

$P(\mathbf{y} \mid \mathbf{x}) = \prod_t P(y_t \mid y_{t-1}, x_t)$

We need to learn this function
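As a quick sketch of what this factorization looks like in code, the probability of a whole label sequence is just the product of local next-state probabilities. The `local_prob` function below is a placeholder assumption standing in for the classifier we still need to learn.

```python
# P(y | x) = prod_t P(y_t | y_{t-1}, x_t), with a pluggable local classifier.
# `local_prob` is a stand-in assumption for a trained next-state model.

def sequence_conditional_prob(x, y, local_prob, start="start"):
    """Probability of the whole label sequence y given the input x."""
    prob, prev = 1.0, start
    for word, label in zip(x, y):
        prob *= local_prob(word, prev, label)   # P(y_t | y_{t-1}, x_t)
        prev = label
    return prob

# Example with a hypothetical classifier that is uniform over three labels:
uniform = lambda word, prev, label: 1.0 / 3.0
print(sequence_conditional_prob(["The", "Fed"], ["Determiner", "Noun"], uniform))   # 1/9
```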

slide-22
SLIDE 22

Modeling P(yi | yi-1, xi)

Different approaches possible

  1. Train a maximum entropy classifier
  2. Or, ignore the fact that we are predicting a probability; we only care about maximizing some score. Train any classifier, using, say, the perceptron algorithm

In either case:

    – Use rich features that depend on the input and the previous state
    – We can increase the dependency to arbitrary neighboring xi’s

  • E.g.: Neighboring words influence this word’s POS tag

22

slide-23
SLIDE 23

Where are we?

  • Conditional models for predicting sequences
  • Log-linear models for multiclass classification
  • Maximum Entropy Markov Models

23

slide-24
SLIDE 24

Log-linear models for multiclass

Consider multiclass classification

    – Inputs: x
    – Output: y ∈ {1, 2, ⋯, K}
    – Feature representation: φ(x, y)

  • We have seen this before: Kesler construction

Define the probability of an input x taking a label y as

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

A generalization of logistic regression to the multiclass setting

24


slide-26
SLIDE 26

Log-linear models for multiclass

Consider multiclass classification

    – Inputs: x
    – Output: y ∈ {1, 2, ⋯, K}
    – Feature representation: φ(x, y)

  • We have seen this before: Kesler construction

Define the probability of an input x taking a label y as

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

A generalization of logistic regression to the multiclass setting

26

Interpretation: Score for label, converted to a well-formed probability distribution by exponentiating + normalizing
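A minimal sketch of this definition in code: compute a score w·φ(x, y) for every label and normalize with a softmax. The feature function and weights below are illustrative assumptions, not the lecture's.

```python
import math

# Log-linear multiclass model: P(y | x, w) = exp(w . phi(x, y)) / sum_y' exp(w . phi(x, y'))
# The feature function `phi` and weights `w` below are illustrative assumptions.

def label_distribution(x, labels, phi, w):
    scores = {y: sum(wi * fi for wi, fi in zip(w, phi(x, y))) for y in labels}
    m = max(scores.values())                          # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

labels = ["Noun", "Verb"]
phi = lambda x, y: [float(x.endswith("es") and y == "Verb"),    # suffix feature, per label
                    float(x[0].isupper() and y == "Noun")]      # capitalization feature, per label
w = [2.0, 1.5]
print(label_distribution("raises", labels, phi, w))             # puts most mass on "Verb"
```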

slide-27
SLIDE 27

Training a log-linear model

Given a data set $D = \{(\mathbf{x}_i, y_i)\}$

    – Apply the maximum likelihood principle

$\max_{\mathbf{w}} \sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

    – Maybe with a regularizer

$\max_{\mathbf{w}} \; -\frac{\mu}{2}\mathbf{w}^T\mathbf{w} + \sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

27

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$
slide-28
SLIDE 28

Training a log-linear model

Given a data set $D = \{(\mathbf{x}_i, y_i)\}$

    – Apply the maximum likelihood principle

$\max_{\mathbf{w}} \sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

    – Maybe with a regularizer

$\max_{\mathbf{w}} \; -\frac{\mu}{2}\mathbf{w}^T\mathbf{w} + \sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

28

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

The cross-entropy loss

slide-29
SLIDE 29

Training a log-linear model

  • Gradient based methods to minimize

$L(\mathbf{w}) = -\sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

  • Usual stochastic gradient descent
    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

29

slide-30
SLIDE 30

Training a log-linear model

  • Gradient based methods to minimize

$L(\mathbf{w}) = -\sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

  • Usual stochastic gradient descent
    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

30

Other methods exist, for example the L-BFGS algorithm

slide-31
SLIDE 31

Training a log-linear model

  • Gradient based methods to minimize

$L(\mathbf{w}) = -\sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

  • Usual stochastic gradient descent
    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

31

A vector whose jth element is the derivative of L with respect to wj. It has a neat interpretation.
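A minimal code rendering of the pseudocode above, assuming a generic `grad_fn(w, x, y)` that returns the gradient of the per-example loss (derived on the next slides); the step size and epoch count are made-up choices.

```python
# Stochastic gradient descent, mirroring the pseudocode above.
# `grad_fn(w, x, y)` is assumed to return the gradient of the per-example loss;
# the learning rate and epoch count are illustrative choices.

def sgd(examples, dim, grad_fn, epochs=20, gamma=0.5):
    w = [0.0] * dim                                         # initialize w <- 0
    for _ in range(epochs):                                 # multiple passes over the data
        for x, y in examples:                               # one gradient step per example
            g = grad_fn(w, x, y)
            w = [wj - gamma * gj for wj, gj in zip(w, g)]   # w <- w - gamma * grad
    return w
```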

slide-32
SLIDE 32

Gradients of the loss function

Let us compute this derivative of L with respect to w.

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

$L(\mathbf{w}, \mathbf{x}, y) = -\log P(y \mid \mathbf{x}, \mathbf{w}) = -\mathbf{w}^T \phi(\mathbf{x}, y) + \log \sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))$

The derivative of the loss with respect to the weights is:

$\frac{\partial L}{\partial \mathbf{w}} = -\phi(\mathbf{x}, y) + \frac{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))\, \phi(\mathbf{x}, y')}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))} = -\phi(\mathbf{x}, y) + \sum_{y'} P(y' \mid \mathbf{x}, \mathbf{w})\, \phi(\mathbf{x}, y')$

32
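A sanity-check sketch of the final expression: the gradient is minus the features of the true label plus the model's expected feature vector. The labels and feature function below are illustrative assumptions, matching the earlier sketch.

```python
import math

# Per-example gradient of L(w, x, y) = -log P(y | x, w) for the log-linear model:
#   dL/dw = -phi(x, y) + sum_{y'} P(y' | x, w) * phi(x, y')
# The labels and feature function below are illustrative assumptions.

def label_probs(x, labels, phi, w):
    scores = {y: sum(wi * fi for wi, fi in zip(w, phi(x, y))) for y in labels}
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def loss_gradient(x, y, labels, phi, w):
    probs = label_probs(x, labels, phi, w)
    grad = [-f for f in phi(x, y)]                     # -(features of the true label)
    for y_prime in labels:
        for j, f in enumerate(phi(x, y_prime)):
            grad[j] += probs[y_prime] * f              # + expected features under the model
    return grad

labels = ["Noun", "Verb"]
phi = lambda x, y: [float(x.endswith("es") and y == "Verb"),
                    float(x[0].isupper() and y == "Noun")]
print(loss_gradient("raises", "Verb", labels, phi, [0.0, 0.0]))   # [-0.5, 0.0] at w = 0
```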


slide-36
SLIDE 36

Gradients of the loss function

    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

36

A vector whose jth element is the derivative of L with respect to wj. It has a neat interpretation:

$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}, \mathbf{x}_i, y_i) = -\phi(\mathbf{x}_i, y_i) + \sum_{y'} P(y' \mid \mathbf{x}_i, \mathbf{w})\, \phi(\mathbf{x}_i, y')$

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

$L(\mathbf{w}, \mathbf{x}, y) = -\log P(y \mid \mathbf{x}, \mathbf{w})$

slide-37
SLIDE 37

Gradients of the loss function

    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

37

A vector whose jth element is the derivative of L with respect to wj. It has a neat interpretation:

$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}, \mathbf{x}_i, y_i) = -\phi(\mathbf{x}_i, y_i) + \sum_{y'} P(y' \mid \mathbf{x}_i, \mathbf{w})\, \phi(\mathbf{x}_i, y')$

Features for the true output

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

$L(\mathbf{w}, \mathbf{x}, y) = -\log P(y \mid \mathbf{x}, \mathbf{w})$

slide-38
SLIDE 38

Gradients of the loss function

    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

38

A vector whose jth element is the derivative of L with respect to wj. It has a neat interpretation:

$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}, \mathbf{x}_i, y_i) = -\phi(\mathbf{x}_i, y_i) + \sum_{y'} P(y' \mid \mathbf{x}_i, \mathbf{w})\, \phi(\mathbf{x}_i, y')$

Features for the true output

The expected feature vector according to the current model

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

$L(\mathbf{w}, \mathbf{x}, y) = -\log P(y \mid \mathbf{x}, \mathbf{w})$
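Putting the two sketches together (the `sgd` loop after Slide 31 and `loss_gradient` after Slide 32), training on a couple of made-up examples might look like this:

```python
# Combining the earlier sketches: train the toy log-linear model with SGD.
# `sgd`, `loss_gradient`, `labels`, and `phi` refer to the illustrative definitions above.

toy_data = [("raises", "Verb"), ("Fed", "Noun")]
grad_fn = lambda w, x, y: loss_gradient(x, y, labels, phi, w)
w_learned = sgd(toy_data, dim=2, grad_fn=grad_fn)
print(w_learned)    # for this toy data, both the -es feature and the Caps feature get positive weight
```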

slide-39
SLIDE 39

Another training idea: MaxEnt

Consider all distributions P such that the empirical counts of the features match the expected counts.

Recall: The entropy of a distribution P(y|x) is
    – A measure of smoothness
    – Without any other information, maximized by the uniform distribution

Maximum entropy learning: $\arg\max_P H(P)$ such that it satisfies this constraint

39

slide-40
SLIDE 40

Another training idea: MaxEnt

Consider all distributions P such that the empirical counts of the features match the expected counts.

Recall: The entropy of a distribution P(y|x) is
    – A measure of smoothness
    – Without any other information, maximized by the uniform distribution

Maximum entropy learning: $\arg\max_P H(P)$ such that it satisfies this constraint

40

There can be many conditional probability distributions that satisfy this constraint. What is a trivial one that does so?

slide-41
SLIDE 41

Another training idea: MaxEnt

Consider all distributions P such that the empirical counts of the features match the expected counts.

Recall: The entropy of a distribution P(y|x) is
    – A measure of smoothness
    – Without any other information, maximized by the uniform distribution

Maximum entropy learning: $\arg\max_P H(P)$ such that it satisfies this constraint

41

There can be many conditional probability distributions that satisfy this constraint. What is a trivial one that does so?

We need a principled way to choose between such distributions.

slide-42
SLIDE 42

Another training idea: MaxEnt

Consider all distributions P such that the empirical counts of the features match the expected counts.

Recall: The entropy of a distribution P(y|x) is
    – A measure of smoothness
    – Without any other information, maximized by the uniform distribution

Maximum entropy learning: $\arg\max_P H(P)$ such that it satisfies this constraint

42

There can be many conditional probability distributions that satisfy this constraint. What is a trivial one that does so?

We need a principled way to choose between such distributions: find a distribution that satisfies the constraint, and does not make any other commitments otherwise.

That is, given the constraint, it is maximally uncertain otherwise.


slide-45
SLIDE 45

Maximum Entropy distribution = log-linear

Theorem: The maximum entropy distribution among those satisfying the constraint has an exponential form. Among exponential distributions, the maximum entropy distribution is the most likely distribution.

45

Questions?
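A brief sketch of the optimization behind the theorem; the derivation below is a standard reconstruction of the maximum-entropy duality, not copied from the slides.

$\max_{P}\; H(P) \quad \text{such that} \quad \mathbb{E}_{P}[\phi_j] = \mathbb{E}_{\hat{P}}[\phi_j] \;\; \forall j, \qquad \sum_{y} P(y \mid \mathbf{x}) = 1 \;\; \forall \mathbf{x}$

Introducing a Lagrange multiplier $w_j$ for each feature-matching constraint and solving for the stationary point gives the exponential form $P(y \mid \mathbf{x}) \propto \exp(\mathbf{w}^T \phi(\mathbf{x}, y))$, which is exactly the log-linear model from earlier in the lecture.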

slide-46
SLIDE 46

Today’s Agenda

  • Conditional models for predicting sequences
  • Log-linear models for multiclass classification
  • Maximum Entropy Markov Models

46

slide-47
SLIDE 47

The next-state model

This assumption lets us write the conditional probability of the output as

(Figures: HMM vs. conditional model, each over y_{t-1}, y_t, x_t)

We need to learn this function

47

Back to sequences

slide-48
SLIDE 48

Modeling P(yi | yi-1, xi)

  • Different approaches possible

  1. Train a maximum entropy classifier
  2. Ignore the fact that we are predicting a probability; we only care about maximizing some score. Train any classifier, using, say, the perceptron algorithm

  • For both cases:

    – Use rich features that depend on the input and the previous state
    – We can increase the dependency to arbitrary neighboring xi’s

  • E.g.: Neighboring words influence this word’s POS tag

48

slide-49
SLIDE 49

Modeling P(yi | yi-1, xi)

  • Different approaches possible

1. Train a maximum entropy classifier

Basically, multinomial logistic regression

2. Ignore the fact that we are predicting a probability; we only care about maximizing some score. Train any classifier, using, say, the perceptron algorithm

  • For both cases:

    – Use rich features that depend on the input and the previous state
    – We can increase the dependency to arbitrary neighboring xi’s

  • E.g.: Neighboring words influence this word’s POS tag

49

slide-50
SLIDE 50

Modeling P(yi | yi-1, xi)

  • Different approaches possible

1. Train a maximum entropy classifier

Basically, multinomial logistic regression

2. Ignore the fact that we are predicting a probability; we only care about maximizing some score. Train any classifier, using, say, the perceptron algorithm

  • For both cases:

    – Use rich features that depend on the input and the previous state
    – We can increase the dependency to arbitrary neighboring xi’s

  • E.g.: Neighboring words influence this word’s POS tag

50

P(yi | yi-1, x)

slide-51
SLIDE 51

Maximum Entropy Markov Model

(Example: The/Determiner Fed/Noun raises/Verb interest/Noun rates/Noun, with a start state)

Goal: Compute P(y | x)

The prediction task: Using the entire input and the current label, predict the next label

slide-52
SLIDE 52

Maximum Entropy Markov Model

52

(Example: The/Determiner Fed/Noun raises/Verb interest/Noun rates/Noun, with a start state)

(Feature rows in the figure: Caps, -es, Previous)

Goal: Compute P(y | x)

To model the probability, first, we need to define features for the current classification problem

slide-53
SLIDE 53

Maximum Entropy Markov Model

53

(Example: The/Determiner Fed/Noun raises/Verb interest/Noun rates/Noun, with a start state)

Features at the first position (word “The”): Caps = Y, -es = N, Previous = start

Goal: Compute P(y | x)

φ(x, 0, start, y0)

slide-54
SLIDE 54

Maximum Entropy Markov Model

54

Compare to HMM: Only depends on the word and the previous tag

(Example: The/Determiner Fed/Noun raises/Verb interest/Noun rates/Noun, with a start state)

Questions?

Features per position (Caps, -es, Previous):
    The: Y, N, start
    Fed: Y, N, Determiner
    raises: N, Y, Noun
    interest: N, N, Verb
    rates: N, N, Noun

Goal: Compute P(y | x)

φ(x, 0, start, y0)   φ(x, 1, y0, y1)   φ(x, 2, y1, y2)   φ(x, 3, y2, y3)   φ(x, 4, y3, y4)

Can get very creative here
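A sketch of what such a feature function might look like in code. The feature choices (capitalization, -es suffix, previous label, current word) mirror the table above; the sparse-dictionary encoding and names are assumptions for illustration.

```python
# phi(x, t, y_prev, y): sparse features for assigning label y at position t,
# given the whole input x and the previous label y_prev.
# Feature choices mirror the table above; the encoding itself is an assumption.

def phi(x, t, y_prev, y):
    word = x[t]
    return {
        ("caps", word[0].isupper(), y): 1.0,          # is the word capitalized?
        ("suffix_es", word.endswith("es"), y): 1.0,   # does it end in -es?
        ("prev_label", y_prev, y): 1.0,               # previous label paired with current label
        ("word", word.lower(), y): 1.0,               # the word itself
    }

x = ["The", "Fed", "raises", "interest", "rates"]
print(phi(x, 2, "Noun", "Verb"))                      # features for tagging "raises" as Verb after Noun
```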


slide-60
SLIDE 60

Using MEMM

  • Training

– Train the next-state predictor locally via maximum likelihood

  • Similar to any maximum entropy classifier
  • Prediction/decoding

– Modify the Viterbi algorithm for the new independence assumptions

60

(Figures: HMM vs. conditional Markov model)
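A sketch of Viterbi decoding adapted to the next-state model: each step scores log P(y_t | y_{t-1}, x, t) from the local classifier instead of separate transition and emission probabilities. The `local_log_prob` argument and the toy classifier are assumptions for illustration.

```python
import math

# Viterbi decoding for a next-state (MEMM-style) model: the score of a label sequence
# is sum_t log P(y_t | y_{t-1}, x, t), maximized by dynamic programming.
# `local_log_prob` is a placeholder assumption for a trained local classifier.

def viterbi(x, labels, local_log_prob, start="start"):
    n = len(x)
    best = [dict() for _ in range(n)]          # best[t][y] = (best score ending in y, previous label)
    for y in labels:
        best[0][y] = (local_log_prob(x, 0, start, y), start)
    for t in range(1, n):
        for y in labels:
            best[t][y] = max(
                ((best[t - 1][y_prev][0] + local_log_prob(x, t, y_prev, y), y_prev)
                 for y_prev in labels),
                key=lambda pair: pair[0])
    y_last = max(best[n - 1], key=lambda y: best[n - 1][y][0])
    seq = [y_last]                             # backtrack through the stored previous labels
    for t in range(n - 1, 0, -1):
        seq.append(best[t][seq[-1]][1])
    return list(reversed(seq))

# Tiny usage with a made-up classifier that prefers changing the label at every step:
toy_log_prob = lambda x, t, y_prev, y: math.log(0.7 if y != y_prev else 0.3)
print(viterbi(["The", "Fed", "raises"], ["N", "V"], toy_log_prob))
```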

slide-61
SLIDE 61

Generalization: Any multiclass classifier

  • Viterbi decoding: we only need a score for each decision

– So far, probabilistic classifiers

  • In general, use any learning algorithm to get a score for the label yi given yi-1 and x
    – Multiclass versions of perceptron, SVM
    – Just like MEMM, these allow arbitrary features to be defined

Exercise: Viterbi needs to be re-defined to work with a sum of scores rather than a product of probabilities

61

slide-62
SLIDE 62

Comparison to HMM

What we gain

1. Rich feature representation for inputs

  • Helps generalize better by thinking about properties of the input tokens rather than the entire tokens
  • E.g.: If a word ends with –es, it might be a present-tense verb (such as raises). This could be a feature; an HMM cannot capture it

2. Discriminative predictor

  • Model P(y | x) rather than P(y, x)
  • Joint vs conditional

62

Questions?

slide-63
SLIDE 63

Outline

  • Conditional models for predicting sequences
  • Log-linear models for multiclass classification
  • Maximum Entropy Markov Models

– The Label Bias Problem

63

slide-64
SLIDE 64

The next-state model for sequences

This assumption lets us write the conditional probability of the output as

(Figures: HMM vs. conditional model, each over y_{t-1}, y_t, x_t)

64

We need to train local multiclass classifiers that predict the next state given the previous state and the input

slide-65
SLIDE 65

…local classifiers → label bias problem

Let’s look at the independence assumption

65

E.g.: Part-of-speech tagging the sentence “The robot wheels are round”

(Figure: a transition diagram over the tags D, N, V, A, R in which each state’s outgoing transitions are locally normalized; the probabilities shown include 0.8, 0.2, and 1)

Suppose these are the only state transitions allowed

Option 1: P(D | The) · P(N | D, robot) · P(N | N, wheels) · P(V | N, are) · P(A | V, round)
Option 2: P(D | The) · P(N | D, robot) · P(V | N, wheels) · P(N | V, are) · P(R | N, round)

“Next-state” classifiers are locally normalized

Example based on [Wallach 2002]


slide-68
SLIDE 68

…local classifiers → label bias problem

68

Part-of-speech tagging the sentence “The robot wheels are round”

Suppose these are the only state transitions allowed

(Figure: a transition diagram over the tags D, N, V, A, R in which each state’s outgoing transitions are locally normalized; the probabilities shown include 0.8, 0.2, and 1)

Option 1: P(D | The) · P(N | D, robot) · P(N | N, wheels) · P(V | N, are) · P(A | V, round)
Option 2: P(D | The) · P(N | D, robot) · P(V | N, wheels) · P(N | V, are) · P(R | N, round)

slide-69
SLIDE 69

…local classifiers → label bias problem

69

Part-of-speech tagging the sentence “The robot wheels are round”

Suppose these are the only state transitions allowed

(Figure: a transition diagram over the tags D, N, V, A, R in which each state’s outgoing transitions are locally normalized; the probabilities shown include 0.8, 0.2, and 1)

Option 1: P(D | The) · P(N | D, robot) · P(N | N, wheels) · P(V | N, are) · P(A | V, round)
Option 2: P(D | The) · P(N | D, robot) · P(V | N, wheels) · P(N | V, are) · P(R | N, round)

The robot wheels Fred round

P(V | N, Fred) · P(N | V, Fred) ·

The path scores are the same. Even if the word “Fred” is never observed as a verb in the data, it will be predicted as one. The input “Fred” does not influence the output at all.
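A tiny numeric illustration of the effect (the allowed transitions and raw scores below are simplified assumptions, not the exact diagram): because the local classifier is normalized only over the states reachable from the current state, a state with a single outgoing transition assigns probability 1 no matter what the word is.

```python
import math

# Label bias in miniature: a locally normalized next-state classifier.
# The allowed transitions and raw scores are simplified assumptions for illustration.

def local_prob(prev, word, label, allowed, score):
    """P(label | prev, word): softmax of scores, normalized only over states allowed from prev."""
    if label not in allowed[prev]:
        return 0.0
    z = sum(math.exp(score(prev, word, y)) for y in allowed[prev])
    return math.exp(score(prev, word, label)) / z

allowed = {"N": ["N", "V"], "V": ["N"]}                             # from V there is only one outgoing state
score = lambda prev, word, label: -5.0 if word == "Fred" else 2.0   # made-up word-dependent scores

# From V, local normalization forces all the mass onto N, whatever the word is:
print(local_prob("V", "are", "N", allowed, score))      # 1.0
print(local_prob("V", "Fred", "N", allowed, score))     # 1.0 -- the observation is ignored
```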

slide-70
SLIDE 70

Label Bias

  • States with a single outgoing transition effectively ignore their input
    – States with lower-entropy next states are less influenced by observations

  • Why?
    – Because the next-state classifiers are locally normalized
    – If a state has fewer next states, each of those will get a higher probability mass
      • …and hence be preferred
  • Side note: Surprisingly doesn’t affect some tasks

– Eg: part-of-speech tagging

70

slide-71
SLIDE 71

Summary: Local models for Sequences

  • Conditional models
  • Use rich features in the model
  • Possibly suffer from label bias problem

(Other “local” models may have their own version of the label bias problem too.)

71

slide-72
SLIDE 72

Outline

  • Sequence models
  • Hidden Markov models
    – Inference with HMM
    – Learning
  • Conditional Models and Local Classifiers
  • Global models
    – Conditional Random Fields
    – Structured Perceptron for sequences

72
