SLIDE 1

Discriminative Learning over Constrained Latent Representations

Ming-Wei Chang, Dan Goldwasser, Dan Roth and Vivek Srikumar

Computer Science Department, University of Illinois at Urbana-Champaign

SLIDES 2–4

A one-minute version of the talk

What we did: we provide a general recipe for many important NLP problems. Our algorithm: Learning over Constrained Latent Representations.

Example NLP problems: transliteration (Klementiev and Roth 2008), textual entailment (RTE) (Dagan, Glickman, and Magnini 2006), paraphrase identification (Dolan, Quirk, and Brockett 2004), question answering, and many more!

Problems of interest: binary classification tasks that require an intermediate representation.
SLIDES 5–11

Example task: Paraphrase Identification

Sentence 1: Alan will face murder charges, Bob said.
Sentence 2: Bob said Alan will be charged with murder.

Q: Are sentence 1 and sentence 2 paraphrases of each other? (Yes/No)

Yes, but why? They carry the same information!

Justifying the decision requires an intermediate representation. (This is just an example; the real intermediate representation is more complicated.)

Problem of interest: a binary output problem, y ∈ {−1, 1}, with an intermediate representation h: some structure that justifies the positive label. The intermediate representation is latent (not present in the data).
SLIDE 12

Limitations of existing approaches: two-stage approach

Most systems: a two-stage approach Stage 1: Generate the intermediate representation Obtain intermediate representation → Fix it (ignore the second stage) ! X → H

  • Page. 4/27
slide-13
SLIDE 13

Limitations of existing approaches: two-stage approach

Most systems: a two-stage approach Stage 1: Generate the intermediate representation Obtain intermediate representation → Fix it (ignore the second stage) ! X → H Stage 2: Classification based on the intermediate representation Extract features using the fixed representation and learn: Φ(X, H) → Y

  • Page. 4/27
slide-14
SLIDE 14

Limitations of existing approaches: two-stage approach

Most systems: a two-stage approach Stage 1: Generate the intermediate representation Obtain intermediate representation → Fix it (ignore the second stage) ! X → H Stage 2: Classification based on the intermediate representation Extract features using the fixed representation and learn: Φ(X, H) → Y Problem: the intermediate representation ignores the binary task

  • Page. 4/27
slide-15
SLIDE 15

Limitations of existing approaches: two-stage approach

Most systems: a two-stage approach Stage 1: Generate the intermediate representation Obtain intermediate representation → Fix it (ignore the second stage) ! X → H Stage 2: Classification based on the intermediate representation Extract features using the fixed representation and learn: Φ(X, H) → Y Problem: the intermediate representation ignores the binary task

  • Page. 4/27
slide-16
SLIDE 16

Limitations of existing approaches: inference

Observation: decisions on intermediate representation are interdependent Alan Bob will said face Alan murder will charges be , charged Bob with said murder

  • Page. 5/27
slide-17
SLIDE 17

Limitations of existing approaches: inference

Observation: decisions on intermediate representation are interdependent Alan Bob will said face Alan murder will charges be , charged Bob with said murder

  • Page. 5/27
slide-18
SLIDE 18

Limitations of existing approaches: inference

Observation: decisions on intermediate representation are interdependent Alan Bob will said face Alan murder will charges be , charged Bob with said murder

  • Page. 5/27
slide-19
SLIDE 19

Limitations of existing approaches: inference

Observation: decisions on intermediate representation are interdependent Alan Bob will said face Alan murder will charges be , charged Bob with said murder Many frameworks use custom designed inference procedures Difficult to add linguistic intuition/constraints on the intermediate representation Difficult to generalize to other tasks

  • Page. 5/27
slide-20
SLIDE 20

Learning Constrained Latent Representation (LCLR)

Property 1: Jointly learn intermediate representations and labels

X H Φ(X, H) Y

  • Page. 6/27
slide-21
SLIDE 21

Learning Constrained Latent Representation (LCLR)

Property 1: Jointly learn intermediate representations and labels

X H Φ(X, H) Y

input

  • Page. 6/27
slide-22
SLIDE 22

Learning Constrained Latent Representation (LCLR)

Property 1: Jointly learn intermediate representations and labels

X H Φ(X, H) Y

input intermediate rep- resentation

  • Page. 6/27
slide-23
SLIDE 23

Learning Constrained Latent Representation (LCLR)

Property 1: Jointly learn intermediate representations and labels

X H Φ(X, H) Y

input intermediate rep- resentation features

  • Page. 6/27
slide-24
SLIDE 24

Learning Constrained Latent Representation (LCLR)

Property 1: Jointly learn intermediate representations and labels

X H Φ(X, H) Y

input intermediate rep- resentation features binary label

  • Page. 6/27
slide-25
SLIDE 25

Learning Constrained Latent Representation (LCLR)

Property 1: Jointly learn intermediate representations and labels

X H Φ(X, H) Y feedback

input intermediate rep- resentation features binary label

  • Page. 6/27
slide-26
SLIDE 26

Learning Constrained Latent Representation (LCLR)

Property 1: Jointly learn intermediate representations and labels

X H Φ(X, H) Y feedback

input intermediate rep- resentation features binary label Find an intermediate representation that helps the binary task

  • Page. 6/27
slide-27
SLIDE 27

Learning Constrained Latent Representation (LCLR)

Property 1: Jointly learn intermediate representations and labels

X H Φ(X, H) Y feedback

input intermediate rep- resentation features binary label Find an intermediate representation that helps the binary task

Property 2: Constraint-based inference for the intermediate representation

Uses integer linear programming on latent variables Easy to inject constraints on latent variables Easy to generalize to other tasks

  • Page. 6/27
slide-28
SLIDE 28

Outline

1

Motivation and Contribution

2

Property 1: Jointly learn intermediate representations and labels

3

Property 2: Constraint-based inference for the intermediate representation

4

LCLR: Putting Everything Together

5

Experiments

  • Page. 7/27
slide-29
SLIDE 29

Outline

1

Motivation and Contribution

2

Property 1: Jointly learn intermediate representations and labels

3

Property 2: Constraint-based inference for the intermediate representation

4

LCLR: Putting Everything Together

5

Experiments

  • Page. 8/27
slide-30
SLIDE 30

The intuition behind the joint approach

Alan Bob will said face Alan murder will charges be , charged Bob with said murder

Yes/NO

  • Page. 9/27
slide-31
SLIDE 31

The intuition behind the joint approach

Alan Bob will said face Alan murder will charges be , charged Bob with said murder

Yes/NO

intermediate representation ⇔ {1, −1} Only positive examples have good intermediate representations No negative example has a good intermediate representation

  • Page. 9/27
slide-32
SLIDE 32

The intuition behind the joint approach

Alan Bob will said face Alan murder will charges be , charged Bob with said murder

Yes/NO

intermediate representation ⇔ {1, −1} Only positive examples have good intermediate representations No negative example has a good intermediate representation x: a sentence pair h: an alignment between two sentences H(x): all possible alignments for x

  • Page. 9/27
slide-33
SLIDE 33

The intuition behind the joint approach

Alan Bob will said face Alan murder will charges be , charged Bob with said murder

Yes/NO

intermediate representation ⇔ {1, −1} Only positive examples have good intermediate representations No negative example has a good intermediate representation x: a sentence pair, weight vector: u h: an alignment between two sentences H(x): all possible alignments for x

  • Page. 9/27
slide-34
SLIDE 34

The intuition behind the joint approach

Alan Bob will said face Alan murder will charges be , charged Bob with said murder

Yes/NO

intermediate representation ⇔ {1, −1} Only positive examples have good intermediate representations No negative example has a good intermediate representation x: a sentence pair, weight vector: u h: an alignment between two sentences H(x): all possible alignments for x Pair x1 is positive

There must exist a good explanation that justifies the positive label ∃h, uTΦ(x1, h) ≥ 0

Pair x2 is negative

No explanation is good enough to justify the positive label ∀h, uTΦ(x2, h) ≤ 0

  • Page. 9/27
SLIDES 35–44

Geometric interpretation: the case of two examples

Pair x1 is positive: there must exist a good explanation that justifies the positive label: ∃h, uᵀΦ(x1, h) ≥ 0, or equivalently max_h uᵀΦ(x1, h) ≥ 0.

Pair x2 is negative: no explanation is good enough to justify the positive label: ∀h, uᵀΦ(x2, h) ≤ 0, or equivalently max_h uᵀΦ(x2, h) ≤ 0.

[Figure: the feature-vector sets {Φ(x1, h) | h ∈ H(x1)} and {Φ(x2, h) | h ∈ H(x2)}, the weight vector u, and the maximizers Φ(x1, h1*) and Φ(x2, h2*).]

The prediction function: max_h uᵀΦ(x, h).
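To make the prediction function concrete, here is a minimal sketch (not the authors' code) that enumerates a small candidate set H(x) explicitly and predicts positive exactly when the best-scoring latent representation clears zero; feature_vec is a hypothetical stand-in for Φ:

    import numpy as np

    def lclr_predict(u, x, candidates, feature_vec):
        # f(x, u) = max_h u^T Phi(x, h); candidates plays the role of H(x)
        best = max(u @ feature_vec(x, h) for h in candidates)
        return 1 if best >= 0 else -1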
SLIDES 46–49

Integer Linear Programming for LCLR

Why is a declarative framework important? No more custom-designed inference procedures; easy to generalize to other tasks; easy to inject constraints and linguistic intuition. Check out the CCM tutorial!

The declarative framework plugs into LCLR. For paraphrasing, model the input as graphs, with Ga the first sentence and Gb the second sentence:
• Each vertex in Ga can be mapped to at most one vertex in Gb (and vice versa).
• Each edge in Ga can be mapped to at most one edge in Gb (and vice versa).
• An edge mapping is active iff the corresponding node mappings are active.
A sketch of these constraints as an ILP appears below.
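The following is a minimal sketch of the alignment ILP under stated assumptions: it uses the open-source PuLP modeler (not the solver the authors used), integer node/edge identifiers, and hypothetical per-part scores standing in for uᵀΦ_s(x):

    from pulp import LpProblem, LpVariable, LpMaximize, lpSum

    def alignment_ilp(nodes_a, nodes_b, edges_a, edges_b, node_score, edge_score):
        prob = LpProblem("latent_alignment", LpMaximize)
        # One binary variable per candidate node mapping and edge mapping (the parts h_s).
        n = {(i, j): LpVariable(f"n_{i}_{j}", cat="Binary")
             for i in nodes_a for j in nodes_b}
        e = {(p, q): LpVariable(f"e_{p[0]}_{p[1]}_{q[0]}_{q[1]}", cat="Binary")
             for p in edges_a for q in edges_b}
        # Objective: summed scores of the active parts, i.e. u^T sum_s h_s Phi_s(x).
        prob += (lpSum(node_score[k] * v for k, v in n.items())
                 + lpSum(edge_score[k] * v for k, v in e.items()))
        # Each vertex maps to at most one vertex on the other side (and vice versa).
        for i in nodes_a:
            prob += lpSum(n[i, j] for j in nodes_b) <= 1
        for j in nodes_b:
            prob += lpSum(n[i, j] for i in nodes_a) <= 1
        # Each edge maps to at most one edge on the other side (and vice versa).
        for p in edges_a:
            prob += lpSum(e[p, q] for q in edges_b) <= 1
        for q in edges_b:
            prob += lpSum(e[p, q] for p in edges_a) <= 1
        # An edge mapping is active iff both corresponding node mappings are active.
        for (i1, i2) in edges_a:
            for (j1, j2) in edges_b:
                prob += e[(i1, i2), (j1, j2)] <= n[i1, j1]
                prob += e[(i1, i2), (j1, j2)] <= n[i2, j2]
                prob += e[(i1, i2), (j1, j2)] >= n[i1, j1] + n[i2, j2] - 1
        prob.solve()
        return n, e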
SLIDES 50–54

Finding the intermediate representation using ILP

Sentence 1: Alan will face murder charges, Bob said.
Sentence 2: Bob said Alan will be charged with murder.
(We need this because of the formulation; you do not need to parse the symbols on this page.)

Γ(x) is the set of all "parts" that x can generate; here |Γ(x)| = 8 × 8 = 64.
Rewrite h as a binary vector, h ∈ {0, 1}⁶⁴, e.g. h = (0, 0, 0, …, 1, 0, 0, 1, 1).
There is a feature vector Φ_s(x) for every part h_s.

Inference problem = ILP formulation:
max_{h∈H} uᵀΦ(x, h) = max_{h∈H} uᵀ Σ_{s∈Γ(x)} h_s Φ_s(x)
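As a sanity check on this decomposition, note that without cross-part constraints the max decouples: a part s is switched on exactly when its score uᵀΦ_s(x) is positive. A tiny sketch with made-up numbers (not from the paper):

    import numpy as np

    u = np.array([0.5, -1.0, 2.0])      # weight vector (hypothetical values)
    Phi = np.array([[1.0, 0.0, 1.0],    # rows: Phi_s(x) for three parts
                    [0.0, 1.0, 0.0],
                    [1.0, 1.0, 1.0]])
    scores = Phi @ u                    # u^T Phi_s(x) for each part s
    h = (scores > 0).astype(int)        # unconstrained argmax: h_s = 1 iff score_s > 0
    print(h, float(scores @ h))         # best h and max_h u^T sum_s h_s Phi_s(x)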
SLIDES 56–59

LCLR: the objective function

Review: logistic regression (LR) and support vector machines (SVM).

Decision function: uᵀΦ(x) ≥ 0
Objective function:
min_u ½‖u‖² + C Σ_{i=1}^{l} ℓ(−y_i uᵀΦ(x_i))

[Plot: hinge loss, squared hinge loss, and logistic loss as functions of the margin.]
SLIDES 60–63

LCLR: the objective function

Learning over Constrained Latent Representations.

Decision function (ILP): max_{h∈H} uᵀ Σ_{s∈Γ(x)} h_s Φ_s(x) ≥ 0
Objective function:
min_u ½‖u‖² + C Σ_{i=1}^{l} ℓ(−y_i max_{h∈H} uᵀ Σ_{s∈Γ(x_i)} h_s Φ_s(x_i))

Beyond standard LR/SVM: LCLR solves an inference problem (the max) to select h, which also affects the features.
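A minimal sketch of evaluating this objective, assuming the squared-hinge loss and an explicitly enumerated latent set per example (toy setup, not the paper's solver):

    import numpy as np

    def lclr_objective(u, examples, C):
        """examples: list of (y, phis) with y in {-1, +1} and
        phis = [Phi(x, h) for h in H(x)] as numpy arrays."""
        sq_hinge = lambda z: max(0.0, 1.0 + z) ** 2   # ell(z)
        reg = 0.5 * float(u @ u)
        loss = sum(sq_hinge(-y * max(float(u @ phi) for phi in phis))
                   for y, phis in examples)
        return reg + C * loss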
SLIDES 64–66

Challenges in optimizing the objective function

min_u ½‖u‖² + C Σ_{i=1}^{l} ℓ(−y_i max_{h∈H} uᵀ Σ_{s∈Γ(x_i)} h_s Φ_s(x_i))

Not a regular LR/SVM problem: LCLR has an inference procedure inside the minimization problem.

No shortcut: the naive loop (find the best representation for all examples, obtain a new weight vector from a LR/SVM package with the updated representations, repeat) does not minimize the objective function.
SLIDE 67

LCLR: optimization procedure

Algorithm 1: Find the best intermediate representations for positive examples 2: Find the weight vector with this intermediate representation

Still need to do inference for negative examples Not a regular SVM problem even in this step!

3: Repeat!

  • Page. 17/27
slide-68
SLIDE 68

LCLR: optimization procedure

Algorithm 1: Find the best intermediate representations for positive examples 2: Find the weight vector with this intermediate representation

Still need to do inference for negative examples Not a regular SVM problem even in this step!

3: Repeat! This algorithm converges when ℓ is monotonically increasing and convex.

  • Page. 17/27
slide-69
SLIDE 69

LCLR: optimization procedure

Algorithm 1: Find the best intermediate representations for positive examples 2: Find the weight vector with this intermediate representation

Still need to do inference for negative examples Not a regular SVM problem even in this step!

3: Repeat! This algorithm converges when ℓ is monotonically increasing and convex. Properties of the algorithm: Asymmetric nature Asymmetry between positive and negative examples Converting a non-convex problem into a series of smaller convex problems

  • Page. 17/27
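A minimal sketch of the alternating procedure under stated assumptions: explicitly enumerated latent sets, the squared-hinge loss, and a plain subgradient-descent inner solver (the paper uses dual coordinate descent and cutting-plane methods instead):

    import numpy as np

    def lclr_train(examples, C, outer_iters=10, inner_iters=200, lr=0.01):
        """examples: list of (y, phis), y in {-1, +1},
        phis = [Phi(x, h) for h in H(x)] as numpy arrays."""
        u = np.zeros(len(examples[0][1][0]))
        for _ in range(outer_iters):
            # Step 1: fix the best current representation for each POSITIVE example.
            fixed = [(y, [max(phis, key=lambda p: float(u @ p))] if y > 0 else phis)
                     for y, phis in examples]
            # Step 2: solve for u; negatives still require inference inside the loop.
            for _ in range(inner_iters):
                grad = u.copy()                      # gradient of (1/2)||u||^2
                for y, phis in fixed:
                    phi = max(phis, key=lambda p: float(u @ p))
                    z = -y * float(u @ phi)
                    if 1.0 + z > 0:                  # subgradient of max(0, 1+z)^2
                        grad += C * 2.0 * (1.0 + z) * (-y) * phi
                u -= lr * grad
        return u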
SLIDES 70–72

Comparison to other latent-variable frameworks

Inference procedure: other frameworks often use application-specific inference; LCLR allows you to add constraints and to generalize to other tasks.

Learning: not only for SVM; many different loss functions can be used. Dual coordinate descent and cutting-plane methods mean fewer parameters to tune and allow a parallel inference procedure.

CRF-like latent-variable frameworks: LCLR can use logistic regression and thereby has a probabilistic interpretation. LCLR solves the "max" problem, while CRF-like models solve the "sum" problem; "max" enables adding constraints. (Details on SLIDE 94.)
SLIDE 74

Experimental setting

Tasks:
• Transliteration: is named entity B a transliteration of A?
• Textual entailment: does sentence A entail sentence B?
• Paraphrase identification

Goal of the experiments: determine whether a joint approach is better than a two-stage approach. We compare the two-stage approach against LCLR with exactly the same features and the same definition of the latent structures. Our two-stage approach uses a domain-dependent heuristic to find an intermediate representation; LCLR finds the intermediate representation automatically. LCLR is initialized with the two-stage solution.
SLIDES 75–82

Experimental results

Transliteration:

System                      | Joint | ILP | Acc  | MRR
(Goldwasser and Roth 2008)  |       |  ⋆  | N/A  | 89.4
Our two-stage               |       |  ⋆  | 80.0 | 85.7
Our LCLR                    |   ⋆   |  ⋆  | 92.3 | 95.4

Entailment:

System                      | Joint | ILP | Acc
Median of TAC 2009 systems  |       |     | 61.5
Our two-stage               |       |  ⋆  | 65.0
Our LCLR                    |   ⋆   |  ⋆  | 66.8
SLIDES 83–86

Paraphrase Identification

Experiments using (Dolan, Quirk, and Brockett 2004):

System                             | Joint | ILP | Acc
(Qiu, Kan, and Chua 2006)          |       |     | 72.00
(Das and Smith 2009)               |   ⋆   |     | 73.86
(Wan, Dras, Dale, and Paris 2006)  |       |     | 75.60
Our two-stage                      |       |  ⋆  | 76.23
Our LCLR                           |   ⋆   |  ⋆  | 76.41

Experiments using a noisy data set:

System                             | Joint | ILP | Acc
Our two-stage                      |       |  ⋆  | 72.00
Our LCLR                           |   ⋆   |  ⋆  | 72.75
SLIDES 87–88

Conclusions

LCLR = constraint-based inference + large-margin learning.

Contributions: the LCLR joint approach is better than two-stage approaches; LCLR allows the use of constraints on latent variables; and it is a novel learning framework.

Bonus: learning structures with indirect supervision. Easy-to-get binary labeled data can be used to improve the learning of structures. Check out our ICML paper this year!
SLIDE 89

Thank you!

Our learning code is available: the JLIS package
http://l2r.cs.uiuc.edu/~cogcomp/software.php
SLIDES 90–93

Main idea: learning with indirect supervision

[Diagram: labeled structures train a machine learning model that is applied to testing data; indirect supervision over unlabeled examples provides an additional training signal.]

Indirect supervision: a form of supervision that does not tell you the target output directly.

Advantages of using indirect supervision: it lets you directly use human/domain knowledge to improve the model; it allows supervision signals that are much easier to obtain than labeled structures; and it can reuse existing labeled data from related tasks. Indirect supervision greatly reduces the supervision effort!
SLIDE 94

Compared to CRF-like latent-variable frameworks

CRF-like latent-variable framework:
P(y = 1 | x) = Σ_h P(y = 1, h | x) = Σ_h exp(uᵀφ(x, h, y = 1)) / Σ_{h,y} exp(uᵀφ(x, h, y))

LCLR with the logistic loss:
P(y = 1 | x) = max_h exp(uᵀφ(x, h)) / (1 + max_h exp(uᵀφ(x, h)))

Difference 1: LCLR only models the "goodness" of the best representation. This is important for many NLP problems, where only positive examples have good representations.

Difference 2: LCLR only needs to solve the max inference. Sometimes calculating the sum is a lot harder! A small numeric contrast appears below.
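A tiny sketch contrasting the two probabilities over an enumerated latent set, with made-up scores and the simplifying assumption that the negative class scores are the negated positive-class scores, i.e. uᵀφ(x, h, y = −1) = −uᵀφ(x, h, y = 1):

    import numpy as np

    s = np.array([1.2, -0.3, 0.7])   # u^T phi(x, h, y=1) for each h in H(x)

    # CRF-like: sum over h in the numerator, over (h, y) in the denominator.
    p_sum = np.exp(s).sum() / (np.exp(s).sum() + np.exp(-s).sum())
    # LCLR with logistic loss: only the single best h matters.
    p_max = np.exp(s.max()) / (1.0 + np.exp(s.max()))
    print(p_sum, p_max)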
SLIDE 95

Paraphrase Identification: revisited

Sentence 1: Alan will face murder charges, Bob said.
Sentence 2: Bob said Alan will be charged with murder.

The word-level intermediate representation shown earlier is not expressive enough; for example, word ordering is a problem.

The real setting: the input, two word sequences, becomes two graphs; we used the Stanford Parser to construct a dependency parse tree for each sentence. Integer linear programming solves the resulting graph-matching problem over four types of sub-structure: node matching, node deletion, edge matching, and edge deletion. Constraints enforce consistency: an edge matching is active if and only if the corresponding nodes are matched.
SLIDES 96–97

References

Dagan, I., O. Glickman, and B. Magnini (Eds.) (2006). The PASCAL Recognising Textual Entailment Challenge.
Das, D. and N. A. Smith (2009). Paraphrase identification as probabilistic quasi-synchronous recognition. In ACL.
Dolan, W., C. Quirk, and C. Brockett (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In COLING.
Goldwasser, D. and D. Roth (2008). Active sample selection for named entity transliteration. In ACL (short paper).
Klementiev, A. and D. Roth (2008). Named entity transliteration and discovery in multilingual corpora. In C. Goutte, N. Cancedda, M. Dymetman, and G. Foster (Eds.), Learning Machine Translation.
Qiu, L., M.-Y. Kan, and T.-S. Chua (2006). Paraphrase recognition via dissimilarity significance classification. In EMNLP.
Wan, S., M. Dras, R. Dale, and C. Paris (2006). Using dependency-based features to take the "para-farce" out of paraphrase. In Proc. of the Australasian Language Technology Workshop (ALTW).