
slide-1
SLIDE 1

CS 6355: Structured Prediction

Predicting Structures: Conditional Models and Local Classifiers

1

slide-2
SLIDE 2

Outline

  • Sequence models
  • Hidden Markov models
    – Inference with HMM
    – Learning
  • Conditional Models and Local Classifiers
  • Global models
    – Conditional Random Fields
    – Structured Perceptron for sequences

2

slide-3
SLIDE 3

Today’s Agenda

  • Conditional models for predicting sequences
  • Log-linear models for multiclass classification
  • Maximum Entropy Markov Models

3


slide-5
SLIDE 5

HMM redux

  • The independence assumption

$P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$

5
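A small sketch may make the factorization concrete; the parameters (initial probabilities, transitions, emissions) and all numbers below are made-up toy values for illustration, not taken from the lecture.

```python
# Toy HMM: joint probability of an (input, output) pair under the factorization above.
# All parameter values here are made-up assumptions, purely for illustration.

pi = {"D": 0.7, "N": 0.3}                                  # initial state probabilities P(y_1)
A  = {("D", "N"): 0.9, ("D", "D"): 0.1,
      ("N", "N"): 0.4, ("N", "D"): 0.6}                    # transition probabilities P(y_{t+1} | y_t)
B  = {("D", "the"): 0.8, ("D", "dog"): 0.2,
      ("N", "the"): 0.1, ("N", "dog"): 0.9}                # emission probabilities P(x_t | y_t)

def joint_probability(x, y):
    """P(x, y) = P(y_1) * prod_t P(y_{t+1} | y_t) * prod_t P(x_t | y_t)."""
    p = pi[y[0]]
    for t in range(len(y) - 1):
        p *= A[(y[t], y[t + 1])]
    for t in range(len(y)):
        p *= B[(y[t], x[t])]
    return p

print(joint_probability(["the", "dog"], ["D", "N"]))       # 0.7 * 0.9 * 0.8 * 0.9 = 0.4536
```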

slide-6
SLIDE 6

HMM redux

  • The independence assumption

$P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$

  • Training via maximum likelihood

$\max_{\pi, A, B} P(D \mid \pi, A, B) = \max_{\pi, A, B} \prod_i P(\mathbf{x}_i, \mathbf{y}_i \mid \pi, A, B)$

We are optimizing the joint likelihood of the input and the output for training

6


slide-8
SLIDE 8

HMM redux

  • The independence assumption

$P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$

  • Training via maximum likelihood

$\max_{\pi, A, B} P(D \mid \pi, A, B) = \max_{\pi, A, B} \prod_i P(\mathbf{x}_i, \mathbf{y}_i \mid \pi, A, B)$

We are optimizing the joint likelihood of the input and the output for training

Probability of input given the prediction!

8

slide-9
SLIDE 9

HMM redux

  • The independence assumption

$P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$

  • Training via maximum likelihood

$\max_{\pi, A, B} P(D \mid \pi, A, B) = \max_{\pi, A, B} \prod_i P(\mathbf{x}_i, \mathbf{y}_i \mid \pi, A, B)$

We are optimizing the joint likelihood of the input and the output for training.

At prediction time, we only care about the probability of the output given the input: $P(y_1, y_2, \cdots, y_n \mid x_1, x_2, \cdots, x_n)$. Why not directly optimize this conditional likelihood instead?

9

Probability of input given the prediction!

slide-10
SLIDE 10

Modeling next-state directly

  • Instead of modeling the joint distribution P(x, y), focus on P(y ∣ x) only

– Which is what we care about eventually anyway

(At least in this context)

  • For sequences, different formulations

    – Maximum Entropy Markov Model [McCallum, et al 2000]
    – Projection-based Markov Model [Punyakanok and Roth, 2001]

(Other names: discriminative/conditional Markov model, …)

10

slide-11
SLIDE 11
Generative vs Discriminative models

  • Generative models
    – Learn P(x, y)
    – Characterize how the data is generated (both inputs and outputs)
    – E.g.: Naïve Bayes, Hidden Markov Model

  • Discriminative models
    – Learn P(y | x)
    – Directly characterize the decision boundary only
    – E.g.: Logistic Regression, conditional models (several names)

11

slide-12
SLIDE 12
Generative vs Discriminative models

  • Generative models
    – Learn P(x, y)
    – Characterize how the data is generated (both inputs and outputs)
    – E.g.: Naïve Bayes, Hidden Markov Model

  • Discriminative models
    – Learn P(y | x)
    – Directly characterize the decision boundary only
    – E.g.: Logistic Regression, conditional models (several names)

A generative model tries to characterize the distribution of the inputs; a discriminative model doesn’t care

12


slide-16
SLIDE 16

HMM redux

  • The independence assumption

$P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$

  • Training via maximum likelihood

$\max_{\pi, A, B} P(D \mid \pi, A, B) = \max_{\pi, A, B} \prod_i P(\mathbf{x}_i, \mathbf{y}_i \mid \pi, A, B)$

We are optimizing the joint likelihood of the input and the output for training

16

Probability of input given the prediction!

At prediction time, we only care about the probability of the output given the input. Why not directly optimize this conditional likelihood instead?
slide-17
SLIDE 17

Let’s revisit the independence assumptions

(Figure: HMM graphical model over states y_{t-1}, y_t and observation x_t)

17

$P(y_t \mid y_{t-1}, \text{anything else}) = P(y_t \mid y_{t-1})$
$P(x_t \mid y_t, \text{anything else}) = P(x_t \mid y_t)$

slide-18
SLIDE 18

Another independence assumption

(Figures: HMM vs. conditional model, each over y_{t-1}, y_t, x_t)

18

$P(y_t \mid y_{t-1}, y_{t-2}, \cdots, x_t, x_{t-1}, \cdots) = P(y_t \mid y_{t-1}, x_t)$


slide-20
SLIDE 20

Another independence assumption

This assumption lets us write the conditional probability of the entire output sequence $\mathbf{y}$ as

(Figures: HMM vs. conditional model, each over y_{t-1}, y_t, x_t)

20

$P(y_t \mid y_{t-1}, y_{t-2}, \cdots, x_t, x_{t-1}, \cdots) = P(y_t \mid y_{t-1}, x_t)$

$P(\mathbf{y} \mid \mathbf{x}) = \prod_t P(y_t \mid y_{t-1}, x_t)$
slide-21
SLIDE 21

Another independence assumption

This assumption lets us write the conditional probability of the entire output sequence $\mathbf{y}$ as

(Figures: HMM vs. conditional model, each over y_{t-1}, y_t, x_t)

21

$P(y_t \mid y_{t-1}, y_{t-2}, \cdots, x_t, x_{t-1}, \cdots) = P(y_t \mid y_{t-1}, x_t)$

$P(\mathbf{y} \mid \mathbf{x}) = \prod_t P(y_t \mid y_{t-1}, x_t)$

We need to learn this function
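As a quick sketch of what this factorization looks like in code, the probability of a whole label sequence is just the product of local next-state probabilities. The `local_prob` function below is a placeholder assumption standing in for the classifier we still need to learn.

```python
# P(y | x) = prod_t P(y_t | y_{t-1}, x_t), with a pluggable local classifier.
# `local_prob` is a stand-in assumption for a trained next-state model.

def sequence_conditional_prob(x, y, local_prob, start="start"):
    """Probability of the whole label sequence y given the input x."""
    prob, prev = 1.0, start
    for word, label in zip(x, y):
        prob *= local_prob(word, prev, label)   # P(y_t | y_{t-1}, x_t)
        prev = label
    return prob

# Example with a hypothetical classifier that is uniform over three labels:
uniform = lambda word, prev, label: 1.0 / 3.0
print(sequence_conditional_prob(["The", "Fed"], ["Determiner", "Noun"], uniform))   # 1/9
```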

slide-22
SLIDE 22

Modeling P(yi | yi-1, xi)

Different approaches possible

  1. Train a maximum entropy classifier
  2. Or, ignore the fact that we are predicting a probability; we only care about maximizing some score. Train any classifier, using, say, the perceptron algorithm

In either case:

    – Use rich features that depend on the input and the previous state
    – We can increase the dependency to arbitrary neighboring xi’s

  • E.g.: Neighboring words influence this word’s POS tag

22

slide-23
SLIDE 23

Where are we?

  • Conditional models for predicting sequences
  • Log-linear models for multiclass classification
  • Maximum Entropy Markov Models

23

slide-24
SLIDE 24

Log-linear models for multiclass

Consider multiclass classification

    – Inputs: x
    – Output: y ∈ {1, 2, ⋯, K}
    – Feature representation: φ(x, y)

  • We have seen this before: Kesler construction

Define the probability of an input x taking a label y as

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

A generalization of logistic regression to the multiclass setting

24


slide-26
SLIDE 26

Log-linear models for multiclass

Consider multiclass classification

    – Inputs: x
    – Output: y ∈ {1, 2, ⋯, K}
    – Feature representation: φ(x, y)

  • We have seen this before: Kesler construction

Define the probability of an input x taking a label y as

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

A generalization of logistic regression to the multiclass setting

26

Interpretation: Score for label, converted to a well-formed probability distribution by exponentiating + normalizing
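A minimal sketch of this definition in code: compute a score w·φ(x, y) for every label and normalize with a softmax. The feature function and weights below are illustrative assumptions, not the lecture's.

```python
import math

# Log-linear multiclass model: P(y | x, w) = exp(w . phi(x, y)) / sum_y' exp(w . phi(x, y'))
# The feature function `phi` and weights `w` below are illustrative assumptions.

def label_distribution(x, labels, phi, w):
    scores = {y: sum(wi * fi for wi, fi in zip(w, phi(x, y))) for y in labels}
    m = max(scores.values())                          # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

labels = ["Noun", "Verb"]
phi = lambda x, y: [float(x.endswith("es") and y == "Verb"),    # suffix feature, per label
                    float(x[0].isupper() and y == "Noun")]      # capitalization feature, per label
w = [2.0, 1.5]
print(label_distribution("raises", labels, phi, w))             # puts most mass on "Verb"
```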

slide-27
SLIDE 27

Training a log-linear model

Given a data set $D = \{(\mathbf{x}_i, y_i)\}$

    – Apply the maximum likelihood principle

$\max_{\mathbf{w}} \sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

    – Maybe with a regularizer

$\max_{\mathbf{w}} \; -\frac{\mu}{2}\mathbf{w}^T\mathbf{w} + \sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

27

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$
slide-28
SLIDE 28

Training a log-linear model

Given a data set $D = \{(\mathbf{x}_i, y_i)\}$

    – Apply the maximum likelihood principle

$\max_{\mathbf{w}} \sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

    – Maybe with a regularizer

$\max_{\mathbf{w}} \; -\frac{\mu}{2}\mathbf{w}^T\mathbf{w} + \sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

28

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

The cross-entropy loss

slide-29
SLIDE 29

Training a log-linear model

  • Gradient based methods to minimize

$L(\mathbf{w}) = -\sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

  • Usual stochastic gradient descent
    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

29

slide-30
SLIDE 30

Training a log-linear model

  • Gradient based methods to minimize

$L(\mathbf{w}) = -\sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

  • Usual stochastic gradient descent
    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

30

Other methods exist, for example the L-BFGS algorithm

slide-31
SLIDE 31

Training a log-linear model

  • Gradient based methods to minimize

$L(\mathbf{w}) = -\sum_i \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$

  • Usual stochastic gradient descent
    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

31

A vector whose jth element is the derivative of L with respect to wj. It has a neat interpretation.
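A minimal code rendering of the pseudocode above, assuming a generic `grad_fn(w, x, y)` that returns the gradient of the per-example loss (derived on the next slides); the step size and epoch count are made-up choices.

```python
# Stochastic gradient descent, mirroring the pseudocode above.
# `grad_fn(w, x, y)` is assumed to return the gradient of the per-example loss;
# the learning rate and epoch count are illustrative choices.

def sgd(examples, dim, grad_fn, epochs=20, gamma=0.5):
    w = [0.0] * dim                                         # initialize w <- 0
    for _ in range(epochs):                                 # multiple passes over the data
        for x, y in examples:                               # one gradient step per example
            g = grad_fn(w, x, y)
            w = [wj - gamma * gj for wj, gj in zip(w, g)]   # w <- w - gamma * grad
    return w
```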

slide-32
SLIDE 32

Gradients of the loss function

Let us compute this derivative of L with respect to w.

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

$L(\mathbf{w}, \mathbf{x}, y) = -\log P(y \mid \mathbf{x}, \mathbf{w}) = -\mathbf{w}^T \phi(\mathbf{x}, y) + \log \sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))$

The derivative of the loss with respect to the weights is:

$\frac{\partial L}{\partial \mathbf{w}} = -\phi(\mathbf{x}, y) + \frac{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))\, \phi(\mathbf{x}, y')}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))} = -\phi(\mathbf{x}, y) + \sum_{y'} P(y' \mid \mathbf{x}, \mathbf{w})\, \phi(\mathbf{x}, y')$

32
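A sanity-check sketch of the final expression: the gradient is minus the features of the true label plus the model's expected feature vector. The labels and feature function below are illustrative assumptions, matching the earlier sketch.

```python
import math

# Per-example gradient of L(w, x, y) = -log P(y | x, w) for the log-linear model:
#   dL/dw = -phi(x, y) + sum_{y'} P(y' | x, w) * phi(x, y')
# The labels and feature function below are illustrative assumptions.

def label_probs(x, labels, phi, w):
    scores = {y: sum(wi * fi for wi, fi in zip(w, phi(x, y))) for y in labels}
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def loss_gradient(x, y, labels, phi, w):
    probs = label_probs(x, labels, phi, w)
    grad = [-f for f in phi(x, y)]                     # -(features of the true label)
    for y_prime in labels:
        for j, f in enumerate(phi(x, y_prime)):
            grad[j] += probs[y_prime] * f              # + expected features under the model
    return grad

labels = ["Noun", "Verb"]
phi = lambda x, y: [float(x.endswith("es") and y == "Verb"),
                    float(x[0].isupper() and y == "Noun")]
print(loss_gradient("raises", "Verb", labels, phi, [0.0, 0.0]))   # [-0.5, 0.0] at w = 0
```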


slide-36
SLIDE 36

Gradients of the loss function

    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

36

A vector whose jth element is the derivative of L with respect to wj. It has a neat interpretation:

$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}, \mathbf{x}_i, y_i) = -\phi(\mathbf{x}_i, y_i) + \sum_{y'} P(y' \mid \mathbf{x}_i, \mathbf{w})\, \phi(\mathbf{x}_i, y')$

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

$L(\mathbf{w}, \mathbf{x}, y) = -\log P(y \mid \mathbf{x}, \mathbf{w})$

slide-37
SLIDE 37

Gradients of the loss function

    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

37

A vector whose jth element is the derivative of L with respect to wj. It has a neat interpretation:

$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}, \mathbf{x}_i, y_i) = -\phi(\mathbf{x}_i, y_i) + \sum_{y'} P(y' \mid \mathbf{x}_i, \mathbf{w})\, \phi(\mathbf{x}_i, y')$

Features for the true output

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

$L(\mathbf{w}, \mathbf{x}, y) = -\log P(y \mid \mathbf{x}, \mathbf{w})$

slide-38
SLIDE 38

Gradients of the loss function

    – Initialize $\mathbf{w} \leftarrow \mathbf{0}$
    – Iterate through examples for multiple epochs
      • For each example $(\mathbf{x}_i, y_i)$, take a gradient step for the loss at that example
        – Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, y_i)$
    – Return $\mathbf{w}$

38

A vector whose jth element is the derivative of L with respect to wj. It has a neat interpretation:

$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}, \mathbf{x}_i, y_i) = -\phi(\mathbf{x}_i, y_i) + \sum_{y'} P(y' \mid \mathbf{x}_i, \mathbf{w})\, \phi(\mathbf{x}_i, y')$

Features for the true output

The expected feature vector according to the current model

$P(y \mid \mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$

$L(\mathbf{w}, \mathbf{x}, y) = -\log P(y \mid \mathbf{x}, \mathbf{w})$
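Putting the two sketches together (the `sgd` loop after Slide 31 and `loss_gradient` after Slide 32), training on a couple of made-up examples might look like this:

```python
# Combining the earlier sketches: train the toy log-linear model with SGD.
# `sgd`, `loss_gradient`, `labels`, and `phi` refer to the illustrative definitions above.

toy_data = [("raises", "Verb"), ("Fed", "Noun")]
grad_fn = lambda w, x, y: loss_gradient(x, y, labels, phi, w)
w_learned = sgd(toy_data, dim=2, grad_fn=grad_fn)
print(w_learned)    # for this toy data, both the -es feature and the Caps feature get positive weight
```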

slide-39
SLIDE 39

Another training idea: MaxEnt

Consider all distributions P such that the empirical counts of the features match the expected counts.

Recall: The entropy of a distribution P(y|x) is
    – A measure of smoothness
    – Without any other information, maximized by the uniform distribution

Maximum entropy learning: $\arg\max_P H(P)$ such that it satisfies this constraint

39

slide-40
SLIDE 40

Another training idea: MaxEnt

Consider all distributions P such that the empirical counts of the features match the expected counts.

Recall: The entropy of a distribution P(y|x) is
    – A measure of smoothness
    – Without any other information, maximized by the uniform distribution

Maximum entropy learning: $\arg\max_P H(P)$ such that it satisfies this constraint

40

There can be many conditional probability distributions that satisfy this constraint. What is a trivial one that does so?

slide-41
SLIDE 41

Another training idea: MaxEnt

Consider all distributions P such that the empirical counts of the features match the expected counts.

Recall: The entropy of a distribution P(y|x) is
    – A measure of smoothness
    – Without any other information, maximized by the uniform distribution

Maximum entropy learning: $\arg\max_P H(P)$ such that it satisfies this constraint

41

There can be many conditional probability distributions that satisfy this constraint. What is a trivial one that does so?

We need a principled way to choose between such distributions.

slide-42
SLIDE 42

Another training idea: MaxEnt

Consider all distributions P such that the empirical counts of the features match the expected counts.

Recall: The entropy of a distribution P(y|x) is
    – A measure of smoothness
    – Without any other information, maximized by the uniform distribution

Maximum entropy learning: $\arg\max_P H(P)$ such that it satisfies this constraint

42

There can be many conditional probability distributions that satisfy this constraint. What is a trivial one that does so?

We need a principled way to choose between such distributions: find a distribution that satisfies the constraint, and does not make any other commitments otherwise.

That is, given the constraint, it is maximally uncertain otherwise.


slide-45
SLIDE 45

Maximum Entropy distribution = log-linear

Theorem: The maximum entropy distribution among those satisfying the constraint has an exponential form. Among exponential distributions, the maximum entropy distribution is the most likely distribution.

45

Questions?
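A brief sketch of the optimization behind the theorem; the derivation below is a standard reconstruction of the maximum-entropy duality, not copied from the slides.

$\max_{P}\; H(P) \quad \text{such that} \quad \mathbb{E}_{P}[\phi_j] = \mathbb{E}_{\hat{P}}[\phi_j] \;\; \forall j, \qquad \sum_{y} P(y \mid \mathbf{x}) = 1 \;\; \forall \mathbf{x}$

Introducing a Lagrange multiplier $w_j$ for each feature-matching constraint and solving for the stationary point gives the exponential form $P(y \mid \mathbf{x}) \propto \exp(\mathbf{w}^T \phi(\mathbf{x}, y))$, which is exactly the log-linear model from earlier in the lecture.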

slide-46
SLIDE 46

Today’s Agenda

  • Conditional models for predicting sequences
  • Log-linear models for multiclass classification
  • Maximum Entropy Markov Models

46

slide-47
SLIDE 47

The next-state model

This assumption lets us write the conditional probability of the output as

(Figures: HMM vs. conditional model, each over y_{t-1}, y_t, x_t)

We need to learn this function

47

Back to sequences

slide-48
SLIDE 48

Modeling P(yi | yi-1, xi)

  • Different approaches possible

  1. Train a maximum entropy classifier
  2. Ignore the fact that we are predicting a probability; we only care about maximizing some score. Train any classifier, using, say, the perceptron algorithm

  • For both cases:

    – Use rich features that depend on the input and the previous state
    – We can increase the dependency to arbitrary neighboring xi’s

  • E.g.: Neighboring words influence this word’s POS tag

48

slide-49
SLIDE 49

Modeling P(yi | yi-1, xi)

  • Different approaches possible

1. Train a maximum entropy classifier

Basically, multinomial logistic regression

2. Ignore the fact that we are predicting a probability; we only care about maximizing some score. Train any classifier, using, say, the perceptron algorithm

  • For both cases:

    – Use rich features that depend on the input and the previous state
    – We can increase the dependency to arbitrary neighboring xi’s

  • E.g.: Neighboring words influence this word’s POS tag

49

slide-50
SLIDE 50

Modeling P(yi | yi-1, xi)

  • Different approaches possible

1. Train a maximum entropy classifier

Basically, multinomial logistic regression

2. Ignore the fact that we are predicting a probability; we only care about maximizing some score. Train any classifier, using, say, the perceptron algorithm

  • For both cases:

    – Use rich features that depend on the input and the previous state
    – We can increase the dependency to arbitrary neighboring xi’s

  • E.g.: Neighboring words influence this word’s POS tag

50

P(yi | yi-1, x)

slide-51
SLIDE 51

Maximum Entropy Markov Model

(Example: The/Determiner Fed/Noun raises/Verb interest/Noun rates/Noun, with a start state)

Goal: Compute P(y | x)

The prediction task: Using the entire input and the current label, predict the next label

slide-52
SLIDE 52

Maximum Entropy Markov Model

52

(Example: The/Determiner Fed/Noun raises/Verb interest/Noun rates/Noun, with a start state)

(Feature rows in the figure: Caps, -es, Previous)

Goal: Compute P(y | x)

To model the probability, first, we need to define features for the current classification problem

slide-53
SLIDE 53

Maximum Entropy Markov Model

53

(Example: The/Determiner Fed/Noun raises/Verb interest/Noun rates/Noun, with a start state)

Features at the first position (word “The”): Caps = Y, -es = N, Previous = start

Goal: Compute P(y | x)

φ(x, 0, start, y0)

slide-54
SLIDE 54

Maximum Entropy Markov Model

54

Compare to HMM: Only depends on the word and the previous tag

(Example: The/Determiner Fed/Noun raises/Verb interest/Noun rates/Noun, with a start state)

Questions?

Features per position (Caps, -es, Previous):
    The: Y, N, start
    Fed: Y, N, Determiner
    raises: N, Y, Noun
    interest: N, N, Verb
    rates: N, N, Noun

Goal: Compute P(y | x)

φ(x, 0, start, y0)   φ(x, 1, y0, y1)   φ(x, 2, y1, y2)   φ(x, 3, y2, y3)   φ(x, 4, y3, y4)

Can get very creative here
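A sketch of what such a feature function might look like in code. The feature choices (capitalization, -es suffix, previous label, current word) mirror the table above; the sparse-dictionary encoding and names are assumptions for illustration.

```python
# phi(x, t, y_prev, y): sparse features for assigning label y at position t,
# given the whole input x and the previous label y_prev.
# Feature choices mirror the table above; the encoding itself is an assumption.

def phi(x, t, y_prev, y):
    word = x[t]
    return {
        ("caps", word[0].isupper(), y): 1.0,          # is the word capitalized?
        ("suffix_es", word.endswith("es"), y): 1.0,   # does it end in -es?
        ("prev_label", y_prev, y): 1.0,               # previous label paired with current label
        ("word", word.lower(), y): 1.0,               # the word itself
    }

x = ["The", "Fed", "raises", "interest", "rates"]
print(phi(x, 2, "Noun", "Verb"))                      # features for tagging "raises" as Verb after Noun
```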


slide-60
SLIDE 60

Using MEMM

  • Training

– Train the next-state predictor locally via maximum likelihood

  • Similar to any maximum entropy classifier
  • Prediction/decoding

– Modify the Viterbi algorithm for the new independence assumptions

60

(Figures: HMM vs. conditional Markov model)
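A sketch of Viterbi decoding adapted to the next-state model: each step scores log P(y_t | y_{t-1}, x, t) from the local classifier instead of separate transition and emission probabilities. The `local_log_prob` argument and the toy classifier are assumptions for illustration.

```python
import math

# Viterbi decoding for a next-state (MEMM-style) model: the score of a label sequence
# is sum_t log P(y_t | y_{t-1}, x, t), maximized by dynamic programming.
# `local_log_prob` is a placeholder assumption for a trained local classifier.

def viterbi(x, labels, local_log_prob, start="start"):
    n = len(x)
    best = [dict() for _ in range(n)]          # best[t][y] = (best score ending in y, previous label)
    for y in labels:
        best[0][y] = (local_log_prob(x, 0, start, y), start)
    for t in range(1, n):
        for y in labels:
            best[t][y] = max(
                ((best[t - 1][y_prev][0] + local_log_prob(x, t, y_prev, y), y_prev)
                 for y_prev in labels),
                key=lambda pair: pair[0])
    y_last = max(best[n - 1], key=lambda y: best[n - 1][y][0])
    seq = [y_last]                             # backtrack through the stored previous labels
    for t in range(n - 1, 0, -1):
        seq.append(best[t][seq[-1]][1])
    return list(reversed(seq))

# Tiny usage with a made-up classifier that prefers changing the label at every step:
toy_log_prob = lambda x, t, y_prev, y: math.log(0.7 if y != y_prev else 0.3)
print(viterbi(["The", "Fed", "raises"], ["N", "V"], toy_log_prob))
```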

slide-61
SLIDE 61

Generalization: Any multiclass classifier

  • Viterbi decoding: we only need a score for each decision

– So far, probabilistic classifiers

  • In general, use any learning algorithm to get a score for the label yi given yi-1 and x
    – Multiclass versions of perceptron, SVM
    – Just like MEMM, these allow arbitrary features to be defined

Exercise: Viterbi needs to be re-defined to work with a sum of scores rather than a product of probabilities

61

slide-62
SLIDE 62

Comparison to HMM

What we gain

1. Rich feature representation for inputs

  • Helps generalize better by thinking about properties of the input tokens rather than the entire tokens
  • E.g.: If a word ends with –es, it might be a present-tense verb (such as raises). This could be a feature; an HMM cannot capture it

2. Discriminative predictor

  • Model P(y | x) rather than P(y, x)
  • Joint vs conditional

62

Questions?

slide-63
SLIDE 63

Outline

  • Conditional models for predicting sequences
  • Log-linear models for multiclass classification
  • Maximum Entropy Markov Models

– The Label Bias Problem

63

slide-64
SLIDE 64

The next-state model for sequences

This assumption lets us write the conditional probability of the output as

(Figures: HMM vs. conditional model, each over y_{t-1}, y_t, x_t)

64

We need to train local multiclass classifiers that predict the next state given the previous state and the input

slide-65
SLIDE 65

…local classifiers → label bias problem

Let’s look at the independence assumption

65

E.g.: Part-of-speech tagging the sentence “The robot wheels are round”

(Figure: a transition diagram over the tags D, N, V, A, R in which each state’s outgoing transitions are locally normalized; the probabilities shown include 0.8, 0.2, and 1)

Suppose these are the only state transitions allowed

Option 1: P(D | The) · P(N | D, robot) · P(N | N, wheels) · P(V | N, are) · P(A | V, round)
Option 2: P(D | The) · P(N | D, robot) · P(V | N, wheels) · P(N | V, are) · P(R | N, round)

“Next-state” classifiers are locally normalized

Example based on [Wallach 2002]


slide-68
SLIDE 68

…local classifiers → label bias problem

68

Part-of-speech tagging the sentence “The robot wheels are round”

Suppose these are the only state transitions allowed

(Figure: a transition diagram over the tags D, N, V, A, R in which each state’s outgoing transitions are locally normalized; the probabilities shown include 0.8, 0.2, and 1)

Option 1: P(D | The) · P(N | D, robot) · P(N | N, wheels) · P(V | N, are) · P(A | V, round)
Option 2: P(D | The) · P(N | D, robot) · P(V | N, wheels) · P(N | V, are) · P(R | N, round)

slide-69
SLIDE 69

…local classifiers → label bias problem

69

Part-of-speech tagging the sentence “The robot wheels are round”

Suppose these are the only state transitions allowed

(Figure: a transition diagram over the tags D, N, V, A, R in which each state’s outgoing transitions are locally normalized; the probabilities shown include 0.8, 0.2, and 1)

Option 1: P(D | The) · P(N | D, robot) · P(N | N, wheels) · P(V | N, are) · P(A | V, round)
Option 2: P(D | The) · P(N | D, robot) · P(V | N, wheels) · P(N | V, are) · P(R | N, round)

The robot wheels Fred round

P(V | N, Fred) · P(N | V, Fred) ·

The path scores are the same. Even if the word “Fred” is never observed as a verb in the data, it will be predicted as one. The input “Fred” does not influence the output at all.
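A tiny numeric illustration of the effect (the allowed transitions and raw scores below are simplified assumptions, not the exact diagram): because the local classifier is normalized only over the states reachable from the current state, a state with a single outgoing transition assigns probability 1 no matter what the word is.

```python
import math

# Label bias in miniature: a locally normalized next-state classifier.
# The allowed transitions and raw scores are simplified assumptions for illustration.

def local_prob(prev, word, label, allowed, score):
    """P(label | prev, word): softmax of scores, normalized only over states allowed from prev."""
    if label not in allowed[prev]:
        return 0.0
    z = sum(math.exp(score(prev, word, y)) for y in allowed[prev])
    return math.exp(score(prev, word, label)) / z

allowed = {"N": ["N", "V"], "V": ["N"]}                             # from V there is only one outgoing state
score = lambda prev, word, label: -5.0 if word == "Fred" else 2.0   # made-up word-dependent scores

# From V, local normalization forces all the mass onto N, whatever the word is:
print(local_prob("V", "are", "N", allowed, score))      # 1.0
print(local_prob("V", "Fred", "N", allowed, score))     # 1.0 -- the observation is ignored
```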

slide-70
SLIDE 70

Label Bias

  • States with a single outgoing transition effectively ignore their input
    – States with lower-entropy next states are less influenced by observations

  • Why?
    – Because the next-state classifiers are locally normalized
    – If a state has fewer next states, each of those will get a higher probability mass
      • …and hence be preferred
  • Side note: Surprisingly doesn’t affect some tasks

– Eg: part-of-speech tagging

70

slide-71
SLIDE 71

Summary: Local models for Sequences

  • Conditional models
  • Use rich features in the model
  • Possibly suffer from label bias problem

(Other “local” models may have their own version of the label bias problem too.)

71

slide-72
SLIDE 72

Outline

  • Sequence models
  • Hidden Markov models
    – Inference with HMM
    – Learning
  • Conditional Models and Local Classifiers
  • Global models
    – Conditional Random Fields
    – Structured Perceptron for sequences

72
