SLIDE 1

CS 6355: Structured Prediction

Training Strategies


SLIDE 2

So far we saw

  • What is structured output prediction?
  • Different ways for modeling structured prediction

– Conditional random fields, factor graphs, constraints

  • What we only occasionally touched upon:

– Algorithms for training and inference

  • Viterbi (inference in sequences)
  • Structured perceptron (training in general)


SLIDE 3

Rest of the semester

  • Strategies for training

– Structural SVM
– Stochastic gradient descent
– More on local vs. global training

  • Algorithms for inference

– Exact inference
– “Approximate” inference
– Formulating inference problems in general

  • Latent/hidden variables, representations and such


SLIDE 4

Up next

  • Structural Support Vector Machine

– How it naturally extends multiclass SVM

  • Empirical Risk Minimization

– Or: how structural SVM and CRF are solving very similar problems

  • Training Structural SVM via stochastic gradient descent

– And some tricks


SLIDE 5

Where are we?

  • Structural Support Vector Machine

– How it naturally extends multiclass SVM

  • Empirical Risk Minimization

– Or: how structural SVM and CRF are solving very similar problems

  • Training Structural SVM via stochastic gradient descent

– And some tricks


SLIDE 6

Recall: Binary and Multiclass SVM

  • Binary SVM

– Maximize margin
– Equivalently: minimize the norm of the weights such that the closest points to the hyperplane have a score of ±1

  • Multiclass SVM

– Each label has a different weight vector (like one-vs-all)
– Maximize the multiclass margin
– Equivalently: minimize the total norm of the weights such that the true label is scored at least 1 more than the second-best one
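The objectives themselves appear as images in the original deck. As a reference, the standard hard (separable-case) formulations being described are:

  Binary SVM:      \min_{\mathbf{w}} \; \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i \, \mathbf{w}^\top \mathbf{x}_i \ge 1 \;\; \forall i

  Multiclass SVM:  \min_{\mathbf{w}} \; \tfrac{1}{2}\sum_k \|\mathbf{w}_k\|^2 \quad \text{s.t.} \quad \mathbf{w}_{y_i}^\top \mathbf{x}_i \ge \mathbf{w}_k^\top \mathbf{x}_i + 1 \;\; \forall i, \; \forall k \neq y_i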



SLIDE 8

Multiclass SVM in the separable case

[Equation: the hard multiclass SVM, as above. Callouts: recall the hard binary SVM; we have a data set D = {(x_i, y_i)}; the score for the true label is higher than the score for any other label by 1; the size of the weights acts, effectively, as a regularizer.]


SLIDE 11

Structural SVM: First attempt

Suppose we have some definition of a structure (a factor graph)

– And feature definitions for each “part” p as \Phi_p(\mathbf{x}, \mathbf{y}_p)
– Remember: we can talk about the feature vector for the entire structure

  • Features sum over the parts:

  \Phi(\mathbf{x}, \mathbf{y}) = \sum_{p \in \mathrm{parts}(\mathbf{x})} \Phi_p(\mathbf{x}, \mathbf{y}_p)

We also have a data set D = \{(\mathbf{x}_i, \mathbf{y}_i)\}

What we want from training (following the multiclass idea), for each training example (\mathbf{x}_i, \mathbf{y}_i):

– The annotated structure \mathbf{y}_i gets the highest score among all structures
– Or, to be safe, \mathbf{y}_i gets a score that is at least one more than all other structures:

  \forall \mathbf{y} \neq \mathbf{y}_i: \quad \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}_i) \ge \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}) + 1


SLIDE 16

Structural SVM: First attempt

[Equation: maximize the margin by minimizing the norm of w. For every training example (an input with its gold structure), the score for the gold structure is constrained against the score for any other structure.]
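The formulation is an image in the deck. Putting together the constraint from SLIDE 11 with the callouts here, the intended problem reads:

  \min_{\mathbf{w}} \;\; \frac{1}{2}\mathbf{w}^\top\mathbf{w}

  \text{s.t.} \quad \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}_i) \ge \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}) + 1 \qquad \forall i, \; \forall \mathbf{y} \neq \mathbf{y}_i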


SLIDE 22

Structural SVM: First attempt

Problem

Gold structure
Other structure A: only one mistake
Other structure B: fully incorrect

Structure B is more wrong, but this formulation will be equally happy if both A and B are scored one less than the gold. No partial credit!

[Figure: the formulation from before, annotated with the gold structure and the two competing structures.]


SLIDE 24

Structural SVM: Second attempt

Hamming distance between structures: counts the number of differences between them.

[Figure: maximize the margin by minimizing the norm of w; the score for the gold structure vs. the score for other structures.]
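One standard way to write the Hamming distance the slide describes, counting part-by-part disagreements between a structure y and the gold y_i (with the zero-at-gold convention stated later on SLIDE 32), is:

  \Delta(\mathbf{y}, \mathbf{y}_i) = \sum_{p} \mathbf{1}\left[ y_p \neq (y_i)_p \right], \qquad \text{so that } \Delta(\mathbf{y}_i, \mathbf{y}_i) = 0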



SLIDE 32

Structural SVM: Second attempt

Intuition

  • It is okay for a structure that is close (in the Hamming sense) to the true one to get a score that is close to that of the true structure
  • Structures that are very different from the true structure should get much lower scores

[Figure: maximize the margin by minimizing the norm of w; for an input with its gold structure, the score for the gold must exceed the score for another structure y (which could be y_i) by the Hamming distance between them, defined to be zero if y = y_i.]
SLIDE 34

Structural SVM: Second attempt

Problem? What if the data is not separable? That is, what if these constraints are not satisfied by any w for a given dataset?

SLIDE 36

Structural SVM: Third attempt

[Figure: the full formulation. Maximize the margin by minimizing the norm of w, and also minimize the total slack, with one slack variable per example, which must be positive. For every input with its gold structure and all other structures, the score for gold is separated from the score for the other by the Hamming distance, softened by the slack.]
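The formulation itself is an image; reconstructing it from the callouts above and the notation of the earlier slides, the third attempt is:

  \min_{\mathbf{w}, \xi} \;\; \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C \sum_i \xi_i

  \text{s.t.} \quad \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}_i) \ge \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}) + \Delta(\mathbf{y}, \mathbf{y}_i) - \xi_i \qquad \forall i, \; \forall \mathbf{y}

  \xi_i \ge 0 \qquad \forall i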



SLIDE 39

Structural SVM: Third attempt

For every labeled example (an input with its gold structure) and every competing structure, the score for the ground truth should be greater than the score for the competing structure by the Hamming distance between them.


SLIDE 42

Structural SVM: Third attempt

Slack variables allow some examples to be misclassified. Minimizing the slack forces this to happen as few times as possible.

[Figure callouts: maximize the margin and minimize the slack; C is the tradeoff parameter; one slack variable per example; all slacks must be positive; Hamming distance between gold and the other structure.]


SLIDE 44

Structural SVM

[The final formulation, unchanged from the third attempt: maximize the margin and minimize the slack, with C as the tradeoff parameter.]

SLIDE 45

Structural SVM

[The same problem, rewritten as an equivalent unconstrained formulation.]
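The unconstrained version is an image in the deck. Since at the optimum each slack equals the largest constraint violation, \xi_i = \max_{\mathbf{y}} \left[ \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}) + \Delta(\mathbf{y}, \mathbf{y}_i) \right] - \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}_i), eliminating the slacks gives:

  \min_{\mathbf{w}} \;\; \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C \sum_i \left( \max_{\mathbf{y}} \left[ \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}) + \Delta(\mathbf{y}, \mathbf{y}_i) \right] - \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}_i) \right)

which matches the Structural SVM objective on SLIDE 53.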


SLIDE 47

Comments

  • Other slightly different formulations exist

– Generally the same principle

  • Multiclass is a special case of structure

– Structural SVM strictly generalizes multiclass SVM

  • Can be seen as minimizing a structured version of the hinge loss

– Remember empirical risk minimization? (Exercise: work it out)

  • Learning as optimization

– We have framed the optimization problem
– We haven’t seen how it can be solved yet

  • That is, we don’t have a learning algorithm yet

SLIDE 48

Where are we?

  • Structural Support Vector Machine

– How it naturally extends multiclass SVM

  • Empirical Risk Minimization

– Or: how structural SVM and CRF are solving very similar problems

  • Training Structural SVM via stochastic gradient descent

– And some tricks


SLIDE 49

Broader picture: Learning as loss minimization

  • Collect some annotated data. More is generally better
  • Pick a hypothesis class (also called model)

– Decide how the score decomposes over the parts of the output

  • Choose a loss function

– Decide on how to penalize incorrect decisions

  • Learning = minimize empirical risk + regularizer

– Typically an optimization procedure needed here


This must look familiar. We have seen this before for binary classification!

SLIDE 50

Empirical risk minimization

  • Suppose the function Loss scores the quality of a prediction with respect to the true structure

– Loss(f(x), y) tells us how good f is for this x by comparing it against y

  • Evaluate the quality of the predictor f by averaging over the unknown distribution P that generates the data

– Expected risk: [equation shown as an image]

  • We don’t know P, so use the empirical risk instead, possibly with a regularizer

Learning: minimizing the regularized risk; various algorithms exist
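The risk formulas did not survive extraction. The standard definitions the slide is pointing at, for a data distribution P and loss function Loss, are:

  \text{Expected risk:} \quad R(f) = \mathbb{E}_{(x, y) \sim P}\left[ \mathrm{Loss}(f(x), y) \right]

  \text{Empirical risk:} \quad \hat{R}(f) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Loss}(f(x_i), y_i)

Learning then minimizes \hat{R}(f), possibly plus a regularizer.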

SLIDE 51

The loss function zoo: binary classification

[Plot: the zero-one loss as a function of the margin.]

SLIDE 52

The loss function zoo

[Plot: the zero-one loss together with its surrogates: Perceptron, hinge (SVM), logistic regression, and exponential (AdaBoost).]
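The curves themselves are a plot in the deck. Written in terms of the margin s = y w^T x (a standard convention, not spelled out on the slide), the losses in the legend are:

  \text{Zero-one:} \quad \ell(s) = \mathbf{1}[s \le 0]
  \text{Perceptron:} \quad \ell(s) = \max(0, -s)
  \text{Hinge (SVM):} \quad \ell(s) = \max(0, 1 - s)
  \text{Logistic regression:} \quad \ell(s) = \log(1 + e^{-s})
  \text{Exponential (AdaBoost):} \quad \ell(s) = e^{-s}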

SLIDE 53

Structured classifiers: Different learning objectives

  • Structural SVM

  \min_{\mathbf{w}} \; \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C \sum_i \left( \max_{\mathbf{y}} \left[ \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}) + \Delta(\mathbf{y}, \mathbf{y}_i) \right] - \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}_i) \right)

– The first term is the regularizer; the summation (the structured hinge loss) measures how badly w does on the training data

  • Conditional Random Field (via the maximum a posteriori criterion)

  \min_{\mathbf{w}} \; \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C \sum_i - \log P(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{w})

– Same regularizer; here the summation is the log loss

where P is defined as

  P(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{w}) = \frac{\exp\left( \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}_i) \right)}{Z(\mathbf{x}_i, \mathbf{w})}


SLIDE 58

Structured classifiers: Different learning objectives

  • Structured Perceptron

  \min_{\mathbf{w}} \; \sum_i \left( \max_{\mathbf{y}} \; \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}) - \mathbf{w}^\top \Phi(\mathbf{x}_i, \mathbf{y}_i) \right)

– The summation is the structured Perceptron loss: the structured hinge loss without the Hamming term, and with no regularizer


SLIDE 60

Summary

  • Different structured training objectives are really different loss functions
  • The structured versions of the hinge, log, and Perceptron losses all involve inference

– Hinge, Perceptron: solve a maximization problem
– Log: solve an expectation problem

  • Learning as stochastic optimization, even for structures (a sketch follows below)

– But computing the loss (and the gradient) can be expensive
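The deck stops before spelling out an algorithm. As a minimal sketch (not from the slides) of what learning as stochastic optimization looks like for the structural SVM objective above, here is one epoch of stochastic subgradient descent in Python; phi and loss_augmented_inference are hypothetical callables the user must supply, since both the feature map and the inference routine depend on the structure being modeled.

import numpy as np

def sgd_epoch(examples, w, phi, loss_augmented_inference, C=1.0, lr=0.01):
    """One epoch of stochastic subgradient descent on the structural SVM
    objective  (1/2) w.w + C * sum_i [ max_y (w.phi(x_i, y) + Delta(y, y_i))
                                       - w.phi(x_i, y_i) ].

    examples: list of (x, y_gold) pairs
    phi(x, y): joint feature vector, a numpy array the same shape as w
    loss_augmented_inference(x, y_gold, w): solves the inner max, returning
        argmax_y  w . phi(x, y) + Delta(y, y_gold)
    """
    n = len(examples)
    for x, y_gold in examples:
        # The expensive step: inference inside the loss (see the summary above)
        y_hat = loss_augmented_inference(x, y_gold, w)

        # Subgradient for this example: the regularizer's share plus the hinge
        # term; the feature difference vanishes when loss-augmented inference
        # returns the gold structure itself
        grad = w / n + C * (phi(x, y_hat) - phi(x, y_gold))

        w = w - lr * grad
    return w

A decaying learning rate and shuffling of the examples between epochs are the usual refinements.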