
Predicting Structures: Conditional Models and Local Classifiers (CS 6355) - PowerPoint PPT Presentation



  1. Predicting Structures: Conditional Models and Local Classifiers (CS 6355: Structured Prediction)

  2. Outline
     • Sequence models
     • Hidden Markov models
       – Inference with HMM
       – Learning
     • Conditional Models and Local Classifiers
     • Global models
       – Conditional Random Fields
       – Structured Perceptron for sequences

  3. Today’s Agenda
     • Conditional models for predicting sequences
     • Log-linear models for multiclass classification
     • Maximum Entropy Markov Models

  5-9. HMM redux
     • The independence assumption:
       P(x_1, x_2, ..., x_n, y_1, y_2, ..., y_n) = P(y_1) ∏_{t=1}^{n-1} P(y_{t+1} | y_t) ∏_{t=1}^{n} P(x_t | y_t)
     • Training via maximum likelihood (see the counting sketch below):
       max_{π, A, B} P(D | π, A, B) = max_{π, A, B} ∏_i P(x_i, y_i | π, A, B)
       We are optimizing the joint likelihood of the input and the output for training. Note that this models the probability of the input given the prediction!
     • At prediction time, we only care about the probability of the output given the input, P(y_1, y_2, ..., y_n | x_1, x_2, ..., x_n). Why not directly optimize this conditional likelihood instead?
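
The counting view of maximum-likelihood HMM training can be made concrete. Below is a minimal Python sketch (not from the slides): it estimates the initial, transition, and emission tables from a fully tagged corpus by normalizing counts, which is the closed-form maximizer of the joint likelihood. The function name and data layout are illustrative assumptions.

    from collections import Counter, defaultdict

    def train_hmm(tagged_sequences):
        """Maximum-likelihood HMM estimation by counting.

        tagged_sequences: list of sequences, each a list of (x_t, y_t) pairs,
        where x_t is an observed token and y_t its hidden state/tag.
        Returns (initial, transition, emission) as nested dicts of probabilities.
        """
        init_counts = Counter()
        trans_counts = defaultdict(Counter)
        emit_counts = defaultdict(Counter)

        for seq in tagged_sequences:
            states = [y for _, y in seq]
            init_counts[states[0]] += 1                 # counts for P(y_1)
            for prev, curr in zip(states, states[1:]):  # counts for P(y_{t+1} | y_t)
                trans_counts[prev][curr] += 1
            for x, y in seq:                            # counts for P(x_t | y_t)
                emit_counts[y][x] += 1

        def normalize(counter):
            total = sum(counter.values())
            return {k: v / total for k, v in counter.items()}

        initial = normalize(init_counts)
        transition = {s: normalize(c) for s, c in trans_counts.items()}
        emission = {s: normalize(c) for s, c in emit_counts.items()}
        return initial, transition, emission

For example, train_hmm([[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]]) returns the three probability tables estimated from that single sequence.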

  10. Modeling next-state directly
     • Instead of modeling the joint distribution P(x, y), focus on P(y | x) only
       – which is what we care about eventually anyway (at least in this context)
     • For sequences, different formulations:
       – Maximum Entropy Markov Model [McCallum et al., 2000]
       – Projection-based Markov Model [Punyakanok and Roth, 2001]
       (other names: discriminative/conditional Markov model, ...)

  11-15. Generative vs Discriminative models
     • Generative models – learn P(x, y)
       – Characterize how the data is generated (both inputs and outputs)
       – E.g.: Naïve Bayes, Hidden Markov Model
     • Discriminative models – learn P(y | x)
       – Directly characterize the decision boundary only
       – E.g.: Logistic Regression, Conditional models (several names)
     A generative model tries to characterize the distribution of the inputs; a discriminative model doesn’t care.

  17. Let’s revisit the independence assumptions
     HMM:
       P(y_t | y_{t-1}, anything else) = P(y_t | y_{t-1})
       P(x_t | y_t, anything else) = P(x_t | y_t)
     [Figure: HMM graphical model with transition y_{t-1} → y_t and emission y_t → x_t]
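
Under these two assumptions the joint probability is a product of table lookups. A minimal sketch, assuming the dict-of-dicts tables produced by the train_hmm sketch earlier; the function name and the treatment of unseen events (probability zero, i.e. log-probability -inf, with no smoothing) are illustrative choices.

    import math

    def hmm_log_joint(xs, ys, initial, transition, emission):
        """log P(x_1..x_n, y_1..y_n) under the HMM independence assumptions:
        log P(y_1) + sum_t log P(y_t | y_{t-1}) + sum_t log P(x_t | y_t)."""
        def logp(p):
            return math.log(p) if p > 0 else float("-inf")

        score = logp(initial.get(ys[0], 0.0))
        for prev, curr in zip(ys, ys[1:]):
            score += logp(transition.get(prev, {}).get(curr, 0.0))
        for x, y in zip(xs, ys):
            score += logp(emission.get(y, {}).get(x, 0.0))
        return score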

  18-21. Another independence assumption
     P(y_t | y_{t-1}, y_{t-2}, ..., x_t, x_{t-1}, ...) = P(y_t | y_{t-1}, x_t)
     [Figure: HMM vs. conditional model; both share the transition y_{t-1} → y_t, but the HMM’s emission arrow y_t → x_t is reversed to x_t → y_t in the conditional model]
     This assumption lets us write the conditional probability of the entire output sequence y as
       P(y | x) = ∏_t P(y_t | y_{t-1}, x_t)
     We need to learn this function.
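
A minimal sketch of how this factorization is used at prediction time, assuming a hypothetical callable local_prob(y, y_prev, x_t) that returns P(y_t | y_{t-1}, x_t), e.g. a trained maximum entropy classifier. Greedy left-to-right decoding is shown for simplicity; Viterbi over the local log-probabilities gives the exact argmax under this factorization.

    import math

    def greedy_decode(xs, states, local_prob, start="<s>"):
        """Pick the locally most probable state at each position, left to right.
        Not guaranteed to find the globally best sequence."""
        ys, prev = [], start
        for x in xs:
            prev = max(states, key=lambda y: local_prob(y, prev, x))
            ys.append(prev)
        return ys

    def conditional_log_likelihood(xs, ys, local_prob, start="<s>"):
        """log P(y | x) = sum_t log P(y_t | y_{t-1}, x_t)."""
        total, prev = 0.0, start
        for x, y in zip(xs, ys):
            total += math.log(local_prob(y, prev, x))
            prev = y
        return total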

  22. Modeling P(y_t | y_{t-1}, x_t)
     Different approaches are possible:
     1. Train a maximum entropy classifier
     2. Or, ignore the fact that we are predicting a probability; we only care about maximizing some score. Train any classifier, using, say, the perceptron algorithm
     In either case: use rich features that depend on the input and the previous state
       – We can increase the dependency to arbitrary neighboring x_i’s
       – E.g., neighboring words influence this word’s POS tag (see the feature sketch below)
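
One way to make “rich features” concrete: a hypothetical feature map for the local classifier that looks at the previous state, the current word, its neighbors, and simple shape cues. The feature names are illustrative, not from the slides; any multiclass learner (maximum entropy or perceptron) can consume such a representation.

    def local_features(sentence, t, prev_state):
        """Features for predicting y_t from (y_{t-1}, x), where sentence is the
        full list of input tokens. Unlike an HMM emission, these may inspect
        the whole input, not just the current token."""
        word = sentence[t]
        prev_word = sentence[t - 1] if t > 0 else "<s>"
        next_word = sentence[t + 1] if t + 1 < len(sentence) else "</s>"
        return {
            f"word={word}": 1.0,
            f"prev_state={prev_state}": 1.0,
            f"prev_state={prev_state}+word={word}": 1.0,
            f"suffix3={word[-3:]}": 1.0,
            f"capitalized={word[:1].isupper()}": 1.0,
            f"prev_word={prev_word}": 1.0,
            f"next_word={next_word}": 1.0,
        }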

  23. Where are we?
     • Conditional models for predicting sequences
     • Log-linear models for multiclass classification
     • Maximum Entropy Markov Models

  24-26. Log-linear models for multiclass
     Consider multiclass classification:
       – Inputs: x
       – Output: y ∈ {1, 2, ..., K}
       – Feature representation: φ(x, y)
     We have seen this before: the Kesler construction.
     Define the probability of an input x taking a label y as
       P(y | x, w) = exp(w^T φ(x, y)) / Σ_{y'} exp(w^T φ(x, y'))
     A generalization of logistic regression to the multiclass setting.
     Interpretation: a score for each label, converted to a well-formed probability distribution by exponentiating and normalizing.
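
A minimal sketch of this log-linear model, assuming the weight vector and the joint feature function φ(x, y) are represented as plain Python dicts; the function name is illustrative. It exponentiates and normalizes the per-label scores w·φ(x, y), with the usual max-subtraction trick for numerical stability.

    import math

    def softmax_label_probs(w, phi, x, labels):
        """P(y | x, w) = exp(w . phi(x, y)) / sum_{y'} exp(w . phi(x, y')).

        w:      dict mapping feature name -> weight
        phi:    function (x, y) -> dict mapping feature name -> value
        labels: iterable of the K possible labels
        """
        scores = {y: sum(w.get(f, 0.0) * v for f, v in phi(x, y).items())
                  for y in labels}
        m = max(scores.values())          # subtract the max before exponentiating
        exps = {y: math.exp(s - m) for y, s in scores.items()}
        z = sum(exps.values())
        return {y: e / z for y, e in exps.items()}

Predicting the label is then just an argmax over this distribution (or, equivalently, over the raw scores).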
