SLIDE 1
Structured Output Learning with Indirect Supervision

Ming-Wei Chang, Vivek Srikumar, Dan Goldwasser and Dan Roth

Computer Science Department, University of Illinois at Urbana-Champaign

  • Page. 1/31
SLIDES 2-3

Review: structured output prediction

Example. Input: a sentence. Output: its part-of-speech tags.

    INPUT x:  Natural language processing is fun
    OUTPUT h: JJ      NN       NN         VBZ ADJ

Properties of structured output prediction:
  • Many interdependent decisions
  • Expensive to label: exponential number of structures for a given input
  • Many important tasks in NLP, computer vision and other domains are structured output prediction tasks
SLIDES 4-6

Notation

    INPUT x:  Natural language processing is fun
    OUTPUT h: JJ      NN       NN         VBZ ADJ

Training: model w, feature vector Φ(x, h). Key idea: learn a scoring function over (x, h) pairs. Scoring function: wᵀΦ(x, h).

Inference-based prediction: given x, find the h that maximizes the score,

    argmax_{h ∈ H(x)} wᵀΦ(x, h)

where H(x) is the set of all possible structures for an example x.
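The inference-based prediction above can be sketched with a brute-force enumeration of H(x). This is a toy illustration only: the tag set, the feature map `phi`, and the weights are hypothetical, and real taggers replace the enumeration with dynamic programming (e.g. Viterbi).

```python
import itertools

import numpy as np

TAGS = ["JJ", "NN", "VBZ", "ADJ"]

def phi(x, h, tags=TAGS):
    """Toy feature map: one indicator feature per (position, tag) pair."""
    vec = np.zeros(len(x) * len(tags))
    for i, tag in enumerate(h):
        vec[i * len(tags) + tags.index(tag)] = 1.0
    return vec

def predict(w, x):
    """argmax over h in H(x) of w^T phi(x, h); here H(x) = all tag sequences."""
    candidates = itertools.product(TAGS, repeat=len(x))
    return max(candidates, key=lambda h: w @ phi(x, h))

x = ["Natural", "language", "processing", "is", "fun"]
w = np.zeros(len(x) * len(TAGS))
w[0 * len(TAGS) + TAGS.index("JJ")] = 1.0   # reward JJ at position 0
w[3 * len(TAGS) + TAGS.index("VBZ")] = 1.0  # reward VBZ at position 3
h_hat = predict(w, x)                       # h_hat[0] == "JJ", h_hat[3] == "VBZ"
```

Even this tiny sentence has 4⁵ = 1024 candidate structures, which is why exact enumeration is only viable as an illustration.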
SLIDES 7-8

Motivation

Our goal: Given that supervising structures is time consuming and often requires expertise, our goal is to reduce the supervision effort for structured output learning. Reducing the supervision effort is a major challenge in many domains.

Research question: Is it possible to use (and gain from) additional cheap sources of supervision?
SLIDES 9-19

Supervising structured output problems

Task: Given a car image, where are the body, windows and wheels?

  • Supervised approach: train a machine learning model on labeled data. Expensive!
  • Semi-supervised approach: add unlabeled data, but it ignores invalid data!
  • Can we use invalid data to improve the model?
SLIDES 20-21

Outline

1. Motivation
2. Structured Output Prediction and Its Companion Task
3. Joint Learning with Indirect Supervision
4. Optimization
5. Experiments
SLIDES 22-28

Example: Object Part Recognition

Structured output learning: Given a car image, where are the body, windows and wheels?

Companion binary output problem: Is there a car in this image?

Is there any connection between these two problems? Only a car image can contain car parts in the right positions; a non-car image cannot have the car parts in the right positions.
SLIDES 29-35

Example: Phonetic Alignment

Structured output learning: Given an English named entity (e.g. "Italy", "Israel") and its Hebrew transliteration, what is the phonetic alignment?

Companion binary output problem: Are these two NEs a transliteration pair? (Yes/No)

Relationships: Only a transliteration pair can have a good phonetic alignment; non-transliteration pairs cannot have a good phonetic alignment.
SLIDES 36-40

Key Intuition

Structured output task and companion binary task: how do we exploit the connection?

Observation: Many structured output prediction problems have a companion binary decision problem: predicting whether an input possesses a good structure or not.

Why is this important? Binary labeled data is very easy to obtain.
SLIDES 41-46

Geometric Interpretation for SSVM

Decision function: argmax_{h ∈ H(x_i)} wᵀΦ(x_i, h)

Training intuition: Given an example (x_i, h_i), find a w such that the gold structure h_i has the highest score.

Geometrically: among the feature vectors {Φ(x_1, h) | h ∈ H(x_1)}, w separates the gold point Φ(x_1, h*_1) from every other candidate Φ(x_1, ĥ) the model might predict.
SLIDES 47-48

Structural SVM

    min_w  ‖w‖²/2 + C₁ Σ_{i ∈ S} L_S(x_i, h_i, w)

  • Regularization ‖w‖²/2: measures the model complexity.
  • Structural loss: S is the set of structured labeled examples; L_S(x_i, h_i, w) measures "the distance" between the current best prediction and the gold structure h_i. L_S can use hinge, squared-hinge or other loss functions.
  • This is a convex optimization problem.

Now, add supervision from the companion task!
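As a concrete toy illustration of L_S, here is a margin-rescaled hinge computed over an explicitly enumerated candidate set. The feature rows, distances Δ, and weight vectors below are invented for the example, not taken from the paper.

```python
import numpy as np

def structural_hinge(w, phi_x, gold_idx, delta):
    """L_S for one example: hinge of max_h [Delta(h, h_gold) - w.phi(gold) + w.phi(h)].

    phi_x:  one row of features per candidate structure in H(x).
    delta:  Delta(h, h_gold) for each candidate; 0 at gold_idx.
    """
    scores = phi_x @ w
    margins = delta - scores[gold_idx] + scores   # loss-augmented margins
    return max(0.0, margins.max())                # square this for an L2 hinge

# Three candidate structures; the gold structure is index 0.
phi_x = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])
delta = np.array([0.0, 2.0, 1.0])
loss_good = structural_hinge(np.array([3.0, 0.0]), phi_x, 0, delta)  # 0.0: gold wins by a margin
loss_bad = structural_hinge(np.array([0.0, 3.0]), phi_x, 0, delta)   # 5.0: a wrong structure scores highest
```

The loss is zero exactly when the gold structure beats every competitor by at least its distance Δ, which is the margin condition the slide describes.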
SLIDES 49-52

The role of binary labeled data

Structured output problem: phonetic alignment of an English NE (e.g. "Italy") with its Hebrew transliteration. Companion binary output problem: Is this a transliteration pair? (Yes/No)

Companion task: Does this example possess a good structure?

  • x₁ is positive: there must exist a good structure that justifies the positive label: ∃h, wᵀΦ(x₁, h) ≥ 0.
  • x₂ is negative: no structure is good enough: ∀h, wᵀΦ(x₂, h) ≤ 0.
SLIDES 53-61

Why is binary labeled data useful?

  • x₁ is positive: there exists a good structure: ∃h, wᵀΦ(x₁, h) ≥ 0, i.e. max_h wᵀΦ(x₁, h) ≥ 0.
  • x₂ is negative: no structure is good enough: ∀h, wᵀΦ(x₂, h) ≤ 0, i.e. max_h wᵀΦ(x₂, h) ≤ 0.

Geometrically: SSVM alone separates the gold structure Φ(x₁, h*₁) from the other candidates in {Φ(x₁, h) | h ∈ H(x₁)}. Indirect supervision additionally requires the whole set {Φ(x₂, h) | h ∈ H(x₂)} of a negative example to lie on the negative side, yielding a better weight vector (SSVM + indirect supervision).
SLIDE 62

Outline

1. Motivation
2. Structured Output Prediction and Its Companion Task
3. Joint Learning with Indirect Supervision
4. Optimization
5. Experiments
SLIDES 63-67

Binary and structured labeled data

Direct supervision S (target task):
  • An example: (x_i, h_i)
  • Goal: wᵀΦ(x_i, h_i) ≥ max_{h ∈ H(x_i)} wᵀΦ(x_i, h)
  • Structural loss: L_S

Indirect supervision B (companion task):
  • An example: (x_i, y_i)
  • Goal: y_i · max_{h ∈ H(x_i)} wᵀΦ(x_i, h) ≥ 0
  • Binary loss: L_B

Both L_S and L_B can use hinge, squared-hinge, logistic or other loss functions.
SLIDES 68-70

Joint Learning with Indirect Supervision

    min_w  ‖w‖²/2 + C₁ Σ_{i ∈ S} L_S(x_i, h_i, w) + C₂ Σ_{i ∈ B} L_B(x_i, y_i, w)

  • Regularization: measures the model complexity.
  • Direct supervision: structured labeled data S = {(x, h)}.
  • Indirect supervision: binary labeled data B = {(x, y)}.

Shared weight vector w: use the same weight vector for both the structured labeled data and the binary labeled data.
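To make the shared-w objective concrete, here is a toy numerical evaluation with a squared-hinge ℓ for both loss terms and all candidate structures enumerated explicitly. The feature matrices and weights are invented for illustration, not the paper's setup.

```python
import numpy as np

def l2_hinge(z):
    """Squared hinge: l(z) = max(0, z)^2."""
    return max(0.0, z) ** 2

def joint_objective(w, S, B, C1, C2):
    """||w||^2/2 + C1 * sum_S L_S + C2 * sum_B L_B over enumerated structures.

    S: list of (phi_x, gold_idx, delta) structured examples.
    B: list of (phi_x, y) binary examples, y in {+1, -1}.
    """
    obj = 0.5 * w @ w
    for phi_x, gold_idx, delta in S:                  # direct supervision
        scores = phi_x @ w
        obj += C1 * l2_hinge((delta - scores[gold_idx] + scores).max())
    for phi_x, y in B:                                # indirect supervision
        obj += C2 * l2_hinge(1.0 - y * (phi_x @ w).max())
    return obj

w = np.array([1.0, -1.0])
S = [(np.array([[2.0, 0.0], [0.0, 2.0]]), 0, np.array([0.0, 1.0]))]
B = [(np.array([[1.5, 0.0], [0.0, 1.5]]), +1),        # positive: wants max score >= 1
     (np.array([[0.0, 1.0]]), -1)]                    # negative: wants all scores <= -1
obj = joint_objective(w, S, B, C1=1.0, C2=1.0)        # 1.0: only the regularizer is nonzero
```

With this w every margin constraint is satisfied, so both loss sums vanish and only the regularizer ‖w‖²/2 = 1 remains, which is why sharing one w lets both kinds of labeled data shape the same scoring function.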
SLIDE 71

Outline

1. Motivation
2. Structured Output Prediction and Its Companion Task
3. Joint Learning with Indirect Supervision
4. Optimization
5. Experiments
SLIDE 72

Convexity Properties

    min_w  ‖w‖²/2 + C₁ Σ_{i ∈ S} L_S(x_i, h_i, w) + C₂ Σ_{i ∈ B} L_B(x_i, y_i, w)

where

    L_S(x_i, h_i, w) = ℓ( max_h [ Δ(h, h_i) − wᵀΦ(x_i, h_i) + wᵀΦ(x_i, h) ] )    (1)
    L_B(x_i, y_i, w) = ℓ( 1 − y_i max_{h ∈ H(x_i)} wᵀΦ_B(x_i, h) )               (2)
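A quick numeric illustration (toy, invented features) of why L_B in (2) is convex in w for negative examples but not for positive ones: max_h wᵀΦ(x, h) is convex in w, and composing it with the decreasing side of the hinge (y = +1) can break convexity, as a midpoint check shows.

```python
import numpy as np

def hinge(z):
    return max(0.0, z)

def binary_loss(w, phi_x, y):
    """L_B = hinge(1 - y * max_h w^T phi(x, h)) over enumerated structures."""
    return hinge(1.0 - y * (phi_x @ w).max())

phi_x = np.array([[1.0, 0.0], [0.0, 1.0]])   # two candidate structures
w1, w2 = np.array([4.0, -4.0]), np.array([-4.0, 4.0])
mid = 0.5 * (w1 + w2)

# Positive example (y = +1): the loss at the midpoint EXCEEDS the average of
# the endpoint losses, violating the defining inequality of convexity.
pos_mid = binary_loss(mid, phi_x, +1)                                  # 1.0
pos_avg = 0.5 * (binary_loss(w1, phi_x, +1) + binary_loss(w2, phi_x, +1))  # 0.0

# Negative example (y = -1): here the inequality holds (midpoint <= average),
# consistent with the negative-data term being convex.
neg_mid = binary_loss(mid, phi_x, -1)
neg_avg = 0.5 * (binary_loss(w1, phi_x, -1) + binary_loss(w2, phi_x, -1))
```

One violated midpoint is enough to prove non-convexity of the positive-example term; the negative-example check is merely consistent with (not a proof of) its convexity, which follows analytically from composing an increasing convex ℓ with a convex max.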
SLIDE 73

Convexity Properties

    min_w  ‖w‖²/2 + C₁ Σ_{i ∈ S} L_S(x_i, h_i, w) + C₂ Σ_{i ∈ B⁻} L_B(x_i, y_i, w) + C₂ Σ_{i ∈ B⁺} L_B(x_i, y_i, w)

  • Convex parts: the regularization, the direct supervision term, and the negative data B⁻.
  • Neither convex nor concave: the positive data B⁺.
SLIDES 74-76

JLIS: optimization procedure

Algorithm:
  1. Find the best structures for the positive examples.
  2. Find the weight vector using the structures found in Step 1. (We still need to do inference for the structured examples and the negative examples.)
  3. Repeat!

This algorithm converges when ℓ is monotonically increasing and convex.

Properties of the algorithm:
  • Asymmetric nature: positive and negative binary examples are handled differently.
  • Converts a non-convex problem into a series of smaller convex problems.
  • Inference allows incorporating constraints on the output space (Chang, Goldwasser, Roth, and Srikumar, NAACL 2010).
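The three-step loop above can be sketched as follows. This is a schematic re-implementation, not the authors' JLIS solver: the convex sub-problem is only approximated with plain subgradient steps on a squared-hinge ℓ, and the two-feature data are invented.

```python
import numpy as np

def best_structure(w, phi_x):
    """Inference over an enumerated candidate set: index of the top-scoring structure."""
    return int(np.argmax(phi_x @ w))

def train_jlis(positives, negatives, dim, outer_iters=10, lr=0.1, steps=50):
    w = np.zeros(dim)
    for _ in range(outer_iters):
        # Step 1: fix the best structure for each positive example under current w.
        fixed = [phi_x[best_structure(w, phi_x)] for phi_x in positives]
        # Step 2: approximately solve the convex sub-problem in w.
        for _ in range(steps):
            grad = w.copy()                           # from ||w||^2 / 2
            for f in fixed:                           # positives: structure held fixed
                z = 1.0 - w @ f                       # want score >= 1
                if z > 0:
                    grad += 2.0 * z * (-f)
            for phi_x in negatives:                   # negatives: re-run inference
                s = phi_x @ w
                j = int(np.argmax(s))
                z = 1.0 + s[j]                        # want max score <= -1
                if z > 0:
                    grad += 2.0 * z * phi_x[j]
            w -= lr * grad
        # Step 3: repeat.
    return w

positives = [np.array([[1.0, 0.0], [0.2, 0.2]])]      # rows = candidate structures
negatives = [np.array([[0.0, 1.0]])]
w = train_jlis(positives, negatives, dim=2)
```

After training, the positive example's best structure scores above zero and the negative example's best structure scores below zero, which is exactly the companion-task requirement from the earlier slides.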
SLIDES 77-80

Solving the convex sub-problem

    min_w  ‖w‖²/2 + C₁ Σ_{i ∈ S} L_S(x_i, h_i, w) + C₂ Σ_{i ∈ B⁻} L_B(x_i, y_i, w) + C₂ Σ_{i ∈ B⁺} L_B(x_i, y_i, w)

with fixed structures for the examples in B⁺.

Cutting plane method:
  • Find the "best structure" for the examples in S and B⁻ with the current w.
  • Add the chosen structure to the cache and solve again!

Dual coordinate descent method:
  • Simple implementation with squared (L2) hinge loss.
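The constraint-cache idea behind the cutting plane method can be sketched like this. Again a toy, not the authors' solver: the "solve" step is a few subgradient passes on a squared hinge rather than a real QP or dual coordinate descent, only negative examples are handled, and the data are invented.

```python
import numpy as np

def cutting_plane(negatives, dim, rounds=5, lr=0.1, steps=100):
    """Alternate inference (grow a cache of violated structures) with re-fitting w."""
    w = np.zeros(dim)
    cache = [set() for _ in negatives]                # cached structure indices per example
    for _ in range(rounds):
        # Inference: for each negative example, cache its top-scoring structure
        # under the current w (the most violated constraint candidate).
        for i, phi_x in enumerate(negatives):
            cache[i].add(int(np.argmax(phi_x @ w)))
        # Solve against the cache only: push every cached structure's score below -1.
        for _ in range(steps):
            grad = w.copy()                           # from ||w||^2 / 2
            for i, phi_x in enumerate(negatives):
                for j in cache[i]:
                    z = 1.0 + phi_x[j] @ w            # squared-hinge margin violation
                    if z > 0:
                        grad += 2.0 * z * phi_x[j]
            w -= lr * grad
    return w

negatives = [np.array([[1.0, 0.0], [0.0, 1.0]])]      # two candidate structures
w = cutting_plane(negatives, dim=2)
```

The point of the cache is that each re-fit only touches the handful of structures found so far, instead of the exponentially many in H(x); inference is what discovers which constraints matter.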
SLIDE 81

Outline

1. Motivation
2. Structured Output Prediction and Its Companion Task
3. Joint Learning with Indirect Supervision
4. Optimization
5. Experiments
SLIDE 82

Experimental Setting

Tasks:
  • Task 1: Phonetic alignment
  • Task 2: Part-of-speech tagging
  • Task 3: Information extraction (citation recognition; advertisement field recognition)

Companion tasks:
  • Phonetic alignment: Is this a transliteration pair or not?
  • POS tagging: Does the sentence have a legitimate POS tag sequence or not?
  • IE: Is this a legitimate citation/advertisement or not?
SLIDES 83-84

Experimental Results

[Bar chart: accuracy (roughly 60-80%) on the PA, POS, Citation and ADS tasks, comparing Structural SVM against Joint Learning with Indirect Supervision]

PA: Phonetic Alignment. ADS: Advertisement field recognition.
SLIDES 85-87

Impact of negative examples

J-LIS takes advantage of both positively and negatively labeled data.

[Plot: accuracy (62-66%) vs. number of tokens in the negative examples (100 up to 25.6k and "all"), comparing Structural SVM and JLIS]
SLIDES 88-90

Comparison to other learning frameworks

Generalization over several frameworks:
  • B = ∅ ⇒ Structured SVM (Tsochantaridis, Hofmann, Joachims, and Altun 2004)
  • S = ∅ ⇒ Latent SVM/LR (Felzenszwalb, Girshick, McAllester, and Ramanan 2009; Chang, Goldwasser, Roth, and Srikumar NAACL 2010)

Semi-supervised learning methods:
  • Transductive structural SVM (Zien, Brefeld, and Scheffer 2007); co-structural SVM (Brefeld and Scheffer 2006)
  • Unlike these, J-LIS uses "negative" examples.

Compared to Contrastive Estimation: conceptually related (more discussion on slides 94-99).
SLIDES 91-92

Conclusions

  • It is possible to use binary labeled data for learning structures!
  • J-LIS gains from both direct and indirect supervision.
  • Similarly, structured labeled data can help the binary task.
  • Allows the use of constraints on structures.

Many exciting new directions:
  • Using existing labeled datasets as structured-task supervision
  • How do we generate good "negative" examples?
  • Other forms of indirect supervision?
SLIDE 93

Thank you!

Our learning code is available: the JLIS package
http://l2r.cs.uiuc.edu/~cogcomp/software.php
SLIDE 94

Compared to Contrastive Estimation: I

Contrastive Estimation (CE) performs unsupervised learning with log-linear models by maximizing log P(x):

    Model 1:  P(x) = Σ_h exp(wᵀΦ(x, h)) / Σ_{h, x̂} exp(wᵀΦ(x̂, h))
    CE:       P(x) = Σ_h exp(wᵀΦ(x, h)) / Σ_{h, x̂ ∈ N(x)} exp(wᵀΦ(x̂, h))

where N(x) is the set of "neighbors" of the input x.
SLIDES 95-99

Compared to Contrastive Estimation: II

    P(x) = Σ_h exp(wᵀΦ(x, h)) / Σ_{h, x̂ ∈ N(x)} exp(wᵀΦ(x̂, h))

                            CE                                     J-LIS
    Supervision type        "Neighbors"                            Structured + binary
    Inference problem       sum                                    max
    Can use existing data   CE needs to know the relationship      J-LIS can use existing
                            between "neighbors" of the input x.    binary labeled data.

Comparing J-LIS and CE without using labeled data: part-of-speech tagging experiments with the same features and dataset.
  • Random baseline: 35%
  • EM: 60.9% (62.1%); CE: 74.7% (79.0%)
  • J-LIS: 70.1%; J-LIS + 5 labeled examples: 79.1%
SLIDE 100

Joint learning: Results

[Plot: accuracy on the binary classification task vs. the size of the binary training data |B| (100 to 1600), for |S| = 10 and |S| = 20, initialization only vs. joint training]

Impact of structured labeled data when binary classification is our target: results (for transliteration identification) show that joint training with direct and indirect supervision significantly improves performance, especially when direct supervision is scarce.
SLIDE 101

References

  • Brefeld, U. and T. Scheffer (2006). Semi-supervised learning for structured output variables. In ICML.
  • Tsochantaridis, I., T. Hofmann, T. Joachims, and Y. Altun (2004). Support vector machine learning for interdependent and structured output spaces. In ICML.
  • Zien, A., U. Brefeld, and T. Scheffer (2007). Transductive support vector machines for structured variables. In ICML.