Constrained Conditional Models: Learning and Inference in Natural Language Understanding


SLIDE 1

Constrained Conditional Models
Learning and Inference in Natural Language Understanding

Dan Roth
Department of Computer Science, University of Illinois at Urbana-Champaign

December 2008, ICMLA

With thanks to:
Collaborators: Ming-Wei Chang, Vasin Punyakanok, Lev Ratinov, Nick Rizzolo, Mark Sammons, Scott Yih, Dav Zimak
Funding: ARDA, under the AQUAINT program; NSF: ITR IIS-0085836, ITR IIS-0428472, ITR IIS-0085980, SoD-HCER-0613885; a DOI grant under the Reflex program; DHS; DASH Optimization (Xpress-MP)

SLIDE 2

Nice to Meet You

SLIDE 3

Learning and Inference

  • Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
  • E.g., Structured Output Problems: multiple dependent output variables.
  • (Learned) models/classifiers for different sub-problems.
  • In some cases, not all models are available to be learned simultaneously; key examples in NLP are Textual Entailment and QA. In these cases, constraints may appear only at evaluation time.
  • Incorporate the models’ information, along with prior knowledge/constraints, in making coherent decisions: decisions that respect the learned models as well as domain- and context-specific knowledge/constraints.

SLIDE 4

Inference

SLIDE 5

Comprehension

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

  • 1. Christopher Robin was born in England.
  • 2. Winnie the Pooh is a title of a book.
  • 3. Christopher Robin’s dad was a magician.
  • 4. Christopher Robin must be at least 65 now.

A process that maintains and updates a collection of propositions about the state of affairs.

This is an Inference Problem.

SLIDE 6

This Talk: Constrained Conditional Models

 A general inference framework that combines

Learning conditional models with using declarative expressive constraints

Within a constrained optimization framework

Formulate a decision process as a constrained optimization problem, or

Break up a complex problem into a set of sub-problems and require the components’ outcomes to be consistent modulo constraints

 Has been shown useful in the context of many NLP problems

SRL, Summarization, Co-reference, Information Extraction

[Roth & Yih 04, 07; Punyakanok et al. 05, 08; Chang et al. 07, 08; Clarke & Lapata 06, 07; Denis & Baldridge 07]

 Here: focus on Learning and Inference for Structured NLP Problems

SLIDE 7

Outline

 Constrained Conditional Models

Motivation

Examples

 Training Paradigms: Investigate ways for training models and combining constraints

Joint Learning and Inference vs. decoupling Learning & Inference

Guiding Semi-Supervised Learning with Constraints

Features vs. Constraints

Hard and Soft Constraints

 Examples

Semantic Parsing

Information Extraction

Pipeline processes

SLIDE 8

Inference with General Constraint Structure [Roth&Yih’04]

Dole ’s wife, Elizabeth , is a native of N.C.

Entity classifier scores (local models):

                   other   per    loc
  E1 (Dole)        0.05    0.85   0.10
  E2 (Elizabeth)   0.10    0.60   0.30
  E3 (N.C.)        0.05    0.50   0.45

Relation classifier scores (local models):

                   irrelevant   spouse_of   born_in
  R12              0.05         0.45        0.50
  R23              0.10         0.05        0.85

Improvement over no inference: 2-5%.

Some questions: How to guide the global inference? Why not learn jointly?

Models could be learned separately; constraints may come up only at decision time.
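The decision process on this slide can be reproduced with a small brute-force search standing in for the ILP solver. The scores are the ones from the slide (the assignment of the two relation score rows to R12/R23 is inferred from the example); `best_global_assignment` and the type-constraint table are illustrative names, not part of the original system:

```python
from itertools import product

# Local classifier scores, as on the slide, for
# "Dole 's wife, Elizabeth , is a native of N.C."
entity_scores = {
    "E1": {"other": 0.05, "per": 0.85, "loc": 0.10},  # Dole
    "E2": {"other": 0.10, "per": 0.60, "loc": 0.30},  # Elizabeth
    "E3": {"other": 0.05, "per": 0.50, "loc": 0.45},  # N.C.
}
relation_scores = {
    ("E1", "E2"): {"irrelevant": 0.05, "spouse_of": 0.45, "born_in": 0.50},
    ("E2", "E3"): {"irrelevant": 0.10, "spouse_of": 0.05, "born_in": 0.85},
}

# Type constraints: spouse_of needs (per, per); born_in needs (per, loc).
ALLOWED = {"spouse_of": ("per", "per"), "born_in": ("per", "loc")}

def consistent(ents, rels):
    return all(r == "irrelevant" or ALLOWED[r] == (ents[a], ents[b])
               for (a, b), r in rels.items())

def best_global_assignment():
    """Exhaustively maximize the summed local scores subject to the
    type constraints (a tiny stand-in for the ILP formulation)."""
    names, pairs = list(entity_scores), list(relation_scores)
    best, best_score = None, float("-inf")
    for etypes in product(("other", "per", "loc"), repeat=len(names)):
        ents = dict(zip(names, etypes))
        for rtypes in product(("irrelevant", "spouse_of", "born_in"),
                              repeat=len(pairs)):
            rels = dict(zip(pairs, rtypes))
            if not consistent(ents, rels):
                continue
            score = (sum(entity_scores[n][ents[n]] for n in names)
                     + sum(relation_scores[p][rels[p]] for p in pairs))
            if score > best_score:
                best, best_score = (ents, rels), score
    return best

ents, rels = best_global_assignment()
# The constraints flip E3 from "per" (locally best, 0.50) to "loc" (0.45),
# because committing to born_in for R23 is worth far more globally.
```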

SLIDE 9

Task of Interests: Structured Output

For each instance, assign values to a set of variables.

Output variables depend on each other.

Common tasks in

Natural language processing: Parsing; Semantic Parsing; Summarization; Transliteration; Co-reference resolution, …

Information extraction: Entities, Relations, …

Many pure machine learning approaches exist: Hidden Markov Models (HMMs); CRFs; Perceptrons, …

However, …

SLIDE 10

Information Extraction via Hidden Markov Models

Prediction result of a trained HMM:

Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .

(Labels assigned, in order: [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE].)

Unsatisfactory results !

SLIDE 11

Strategies for Improving the Results

(Pure) Machine Learning Approaches

Higher Order HMM/CRF?

Increasing the window size?

Adding a lot of new features

Requires a lot of labeled examples

What if we only have a few labeled examples?

Any other options?

Humans can immediately tell bad outputs

The output does not make sense

Increasing the model complexity?

Can we keep the learned model simple and still make expressive decisions?

SLIDE 12

Information extraction without Prior Knowledge

Prediction result of a trained HMM:

Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .

(Labels assigned, in order: [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE].)

Violates lots of natural constraints!

SLIDE 13

Examples of Constraints

Each field must be a consecutive list of words and can appear at most once in a citation.

State transitions must occur on punctuation marks.

The citation can only start with AUTHOR or EDITOR.

The words pp., pages correspond to PAGE.

Four digits starting with 20xx and 19xx are DATE.

Quotations can appear only in TITLE

…….

Easy to express pieces of “knowledge”.

Non-propositional; may use quantifiers.
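A couple of these constraints can be written down directly as Boolean functions over a predicted label sequence. This is an illustrative sketch; the function name and the exact label inventory are assumptions, not the original implementation:

```python
def satisfies_constraints(labels):
    """Check a per-token label sequence against two of the constraints:
    - the citation can only start with AUTHOR or EDITOR;
    - each field is one consecutive run of tokens (appears at most once).
    """
    if labels and labels[0] not in ("AUTHOR", "EDITOR"):
        return False
    # Collapse consecutive repeats: AUTHOR AUTHOR TITLE -> [AUTHOR, TITLE]
    runs = [labels[0]] if labels else []
    for lab in labels[1:]:
        if lab != runs[-1]:
            runs.append(lab)
    # A label repeated in the run list means a non-consecutive field.
    return len(runs) == len(set(runs))

good = ["AUTHOR", "AUTHOR", "TITLE", "TITLE", "DATE"]
bad = ["AUTHOR", "TITLE", "AUTHOR"]   # AUTHOR appears twice, non-consecutively
```

Functions like these act as hard filters on candidate outputs; the CCM framework generalizes them to weighted penalties.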

SLIDE 14

Adding constraints, we get correct results!

Without changing the model.

[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .

Information Extraction with Constraints

SLIDE 15

Random Variables Y:

Conditional Distributions P (learned by models/classifiers)

Constraints C: any Boolean function defined on partial assignments (possibly with weights W).

Goal: Find the “best” assignment

The assignment that achieves the highest global performance.

This is an Integer Programming Problem

Problem Setting

(Figure: a constraints network over output variables y1 … y8, with constraints C(y1, y4) and C(y2, y3, y6, y7, y8).)

Y*=argmaxY PY subject to constraints C

(+ WC)

  • bservations
SLIDE 16

Formal Model

y* = argmax_y  Σ_i w_i φ_i(x, y)  -  Σ_i ρ_i d_{C_i}(x, y)

  • First term: the weight vector w for the “local” models, a collection of classifiers; log-linear models (HMM, CRF), or a combination.
  • Second term: the (soft) constraints component; ρ_i is the penalty for violating constraint C_i, and d_{C_i}(x, y) measures how far y is from a “legal” assignment.

How to solve? This is an Integer Linear Program. Solving using ILP packages gives an exact solution; search techniques are also possible.

How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
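The objective can be sketched in a few lines: the local models' score minus weighted penalties for constraint violations. All names and numbers below are illustrative; a real system would search the space with an ILP solver rather than an explicit candidate list:

```python
def ccm_score(local_score, y, constraints):
    """CCM objective: w·phi(x, y) minus the weighted distance of y
    from a 'legal' assignment, summed over all constraints."""
    return local_score(y) - sum(rho * d(y) for rho, d in constraints)

def ccm_predict(candidates, local_score, constraints):
    """argmax over candidate assignments of the constrained objective."""
    return max(candidates, key=lambda y: ccm_score(local_score, y, constraints))

# Toy example (hypothetical numbers): the locally best labeling
# violates a "no duplicate labels" constraint.
candidates = [("A0", "A0"), ("A0", "A1")]
local = {("A0", "A0"): 2.0, ("A0", "A1"): 1.5}
no_dup = (10.0, lambda y: len(y) - len(set(y)))  # (rho, d): # of duplicates

best_unconstrained = ccm_predict(candidates, local.get, [])
best_constrained = ccm_predict(candidates, local.get, [no_dup])
```

With the penalty in place, the globally coherent labeling wins even though its local score is lower.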

SLIDE 17

Example: Semantic Role Labeling

I left my pearls to my daughter in my will . [I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .

A0 Leaver

A1 Things left

A2 Benefactor

AM-LOC Location

I left my pearls to my daughter in my will .

Special case (structured output problem): here, all the data is available at one time; in general, classifiers might be learned from different sources, at different times, in different contexts. Implications on training paradigms.

Overlapping arguments.

If A2 is present, A1 must also be present.

Who did what to whom, when, where, why, …

SLIDE 18

Semantic Role Labeling (2/2)

PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations.

It adds a layer of generic semantic labels to Penn Tree Bank II.

(Almost) all the labels are on the constituents of the parse trees.

Core arguments: A0-A5 and AA

different semantics for each verb

specified in the PropBank Frame files

13 types of adjuncts labeled as AM-arg

where arg specifies the adjunct type

SLIDE 19

Algorithmic Approach

Identify argument candidates

Pruning [Xue&Palmer, EMNLP’04]

Argument Identifier

Binary classification (SNoW)

Classify argument candidates

Argument Classifier

Multi-class classification (SNoW)

Inference

Use the estimated probability distribution given by the argument classifier

Use structural and linguistic constraints

Infer the optimal global output

I left my nice pearls to her
[ [ [ [ [ ] ] ] ] ]

Identify candidate arguments (EASY); inference over the (old and new) vocabulary of candidate arguments.

SLIDE 20

Inference

I left my nice pearls to her

The output of the argument classifier often violates some constraints, especially when the sentence is long.

Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming. [Punyakanok et al. 04; Roth & Yih 04, 05]

Input:

The probability estimation (by the argument classifier)

Structural and linguistic constraints

Allows incorporating expressive (non-sequential) constraints on the variables (the arguments types).

SLIDE 21

Integer Linear Programming Inference

For each argument a_i and type t:

  Set up a Boolean variable a_{i,t}, indicating whether a_i is classified as type t.

Goal is to maximize

  Σ_{i,t} score(a_i = t) · a_{i,t}

subject to the (linear) constraints.

If score(ai = t ) = P(ai = t ), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.

The Constrained Conditional Model is completely decomposed during training
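A minimal stand-in for this ILP, using exhaustive search over labelings instead of a solver (feasible only for a handful of candidates; the label set and the `NULL` convention are assumptions for illustration):

```python
from itertools import product

def infer_arguments(scores):
    """scores: a list of {label: score} dicts, one per argument candidate.
    Maximize the summed score subject to 'no duplicate argument classes'
    (only the NULL label may repeat). Brute force replaces the ILP here."""
    labels = sorted(set().union(*scores))
    best, best_val = None, float("-inf")
    for assign in product(labels, repeat=len(scores)):
        used = [t for t in assign if t != "NULL"]
        if len(used) != len(set(used)):
            continue  # a real argument class was duplicated
        val = sum(s.get(t, float("-inf")) for s, t in zip(scores, assign))
        if val > best_val:
            best, best_val = assign, val
    return list(best)

# Both candidates locally prefer A0; the constraint forces the second
# one to its runner-up label.
scores = [{"A0": 0.9, "A1": 0.5, "NULL": 0.1},
          {"A0": 0.8, "A1": 0.6, "NULL": 0.1}]
```

When score(a_i = t) is a probability estimate, as on the slide, the returned assignment maximizes the expected number of correct arguments among the constraint-satisfying labelings.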

SLIDE 22

Constraints

No duplicate argument classes:

  Σ_{a ∈ POTARG} x{a = A0} ≤ 1

R-ARG (if there is an R-ARG phrase, there is an ARG phrase):

  ∀ a2 ∈ POTARG, ∃ a ∈ POTARG:  x{a2 = R-A0} ⇒ x{a = A0}

C-ARG (if there is a C-ARG phrase, there is an ARG phrase before it):

  ∀ a2 ∈ POTARG, ∃ (a ∈ POTARG) ∧ (a is before a2):  x{a2 = C-A0} ⇒ x{a = A0}

Many other possible constraints:

  Unique labels

  No overlapping or embedding

  Relations between number of arguments; order constraints

  If verb is of type A, no argument of type B

Any Boolean rule can be encoded as a linear constraint; universally quantified rules are allowed.

Joint inference can be used also to combine different SRL systems.

LBJ allows a developer to encode constraints in FOL; these are compiled into linear inequalities automatically.

SLIDE 23

Semantic Role Labeling

Screen shot from a CCG demo: http://L2R.cs.uiuc.edu/~cogcomp

Semantic parsing reveals several relations in the sentence along with their arguments.

Top ranked system in the CoNLL’05 shared task; the key difference is the inference.

This approach produces a very good semantic parser: F1 ~90%. Easy and fast: ~7 sentences/sec (using Xpress-MP).

SLIDE 24

Outline

 Constrained Conditional Models

Motivation

Examples

 Training Paradigms: Investigate ways for training models and combining constraints

Joint Learning and Inference vs. decoupling Learning & Inference

Guiding Semi-Supervised Learning with Constraints

Features vs. Constraints

Hard and Soft Constraints

 Examples

Semantic Parsing

Information Extraction

Pipeline processes

SLIDE 25

Textual Entailment

Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year  ⇒  Yahoo acquired Overture

Is it true that…? (Textual Entailment)

Overture is a search company. Google is a search company. … Google owns Overture.

Components: phrasal verb paraphrasing [Connor & Roth ’07]; entity matching [Li et al., AAAI’04, NAACL’04]; Semantic Role Labeling [Punyakanok et al. ’05, ’08]; inference for entailment [Braz et al. ’05, ’07].

SLIDE 26

Training Paradigms that Support Global Inference

Incorporating general constraints (Algorithmic Approach)

  Allow both statistical and expressive declarative constraints

  Allow non-sequential constraints (generally difficult)

Coupling vs. Decoupling Training and Inference

  Incorporating global constraints is important, but should it be done only at evaluation time, or also at training time?

  How to decompose the objective function and train in parts?

  Issues related to: modularity, efficiency and performance, availability of training data, and problem-specific considerations

SLIDE 27

Training in the presence of Constraints

General Training Paradigm:

First term: learning from data (could be further decomposed).

Second term: guiding the model by constraints.

Can choose whether the constraints’ weights are trained (and when and how), or taken into account only in evaluation.

Decompose the model (SRL case); decompose the model from the constraints.

SLIDE 28

L+I: Learning plus Inference. IBT: Inference-Based Training.

Training w/o constraints. Testing: inference with constraints.

(Figure: inputs X = {x1 … x7}, outputs Y = {y1 … y5}, local models f1(x) … f5(x).)

Learning the components together!

Cartoon: each model can be more complex and may have a view on a set of output variables.

SLIDE 29

Perceptron-based Global Learning

(Figure: inputs X = {x1 … x7}, local models f1(x) … f5(x), outputs Y.)

True global labeling Y:          -1  1  1  -1  -1
Local predictions Y’:            -1  1  1   1   1
After applying constraints Y’:   -1  1  1   1  -1

Which one is better? When and why?
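IBT can be sketched as a structured perceptron whose predictions come from constrained inference, so constraint violations feed directly into the weight updates. The toy problem and all names below are illustrative, not the talk's actual experimental setup:

```python
from itertools import product

def ibt_perceptron(data, phi, infer, dim, epochs=10):
    """Inference-Based Training: a structured perceptron where `infer`
    returns argmax over *constraint-satisfying* outputs, so constraints
    participate in learning, not just in decoding."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            y_hat = infer(w, x)            # constrained global prediction
            if y_hat != y:                 # additive perceptron update
                fy, fh = phi(x, y), phi(x, y_hat)
                w = [wi + a - b for wi, a, b in zip(w, fy, fh)]
    return w

# Toy structured problem: label each token 0/1; the (made-up) constraint
# allows at most one label 1 in an output.
def phi(x, y):
    match = sum(1 for a, b in zip(x, y) if a == b)
    return [match, len(x) - match]          # [# matches, # mismatches]

def infer(w, x):
    legal = [y for y in product((0, 1), repeat=len(x)) if sum(y) <= 1]
    return max(legal, key=lambda y: sum(wi * fi
                                        for wi, fi in zip(w, phi(x, y))))

data = [((0, 0, 1), (0, 0, 1)), ((1, 0, 0), (1, 0, 0)), ((0, 0, 0), (0, 0, 0))]
w = ibt_perceptron(data, phi, infer, dim=2)
```

After training, constrained inference with the learned weights recovers every gold labeling in this toy set.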

SLIDE 30

Claims

When the local models are “easy” to learn, L+I outperforms IBT.

  In many applications, the components are identifiable and easy to learn (e.g., argument, open-close, PER).

Only when the local problems become difficult to solve in isolation does IBT outperform L+I, but it needs a larger number of training examples.

When data is scarce, problems are not easy, and constraints can be used, along with a “weak” model, to label unlabeled data and improve the model.

Other training paradigms are possible

  Pipeline-like sequential models: identify a preferred ordering among components; learn the k-th model jointly with the previously learned models.

L+I: cheaper computationally; modular. IBT is better in the limit, and in other extreme cases.

SLIDE 31

Bound Prediction

  Local:   ε ≤ ε_opt + ( ( d·log m + log 1/δ ) / m )^{1/2}

  Global:  ε ≤ 0 + ( ( c·d·log m + c²·d + log 1/δ ) / m )^{1/2}

(Plot: bounds and simulated data for ε_opt = 0, 0.1, 0.2.)

L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.

Indication for hardness of problem.

SLIDE 32

Relative Merits: SRL

Difficulty of the learning problem (# features): easy → hard

L+I is better. When the problem is artificially made harder, the tradeoff is clearer.

In some cases problems are hard due to lack of training data → semi-supervised learning.

SLIDE 33

Outline

 Constrained Conditional Models

Motivation

Examples

 Training Paradigms: Investigate ways for training models and combining constraints

Joint Learning and Inference vs. decoupling Learning & Inference

Guiding Semi-Supervised Learning with Constraints

Features vs. Constraints

Hard and Soft Constraints

 Examples

Semantic Parsing

Information Extraction

Pipeline processes

SLIDE 34

Information extraction without Prior Knowledge

Prediction result of a trained HMM:

Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .

(Labels assigned, in order: [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE].)

Violates lots of natural constraints!

SLIDE 35

Examples of Constraints

Each field must be a consecutive list of words and can appear at most once in a citation.

State transitions must occur on punctuation marks.

The citation can only start with AUTHOR or EDITOR.

The words pp., pages correspond to PAGE.

Four digits starting with 20xx and 19xx are DATE.

Quotations can appear only in TITLE

…….

Easy to express pieces of “knowledge”.

Non-propositional; may use quantifiers.

SLIDE 36

Adding constraints, we get correct results!

Without changing the model.

[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .

Information Extraction with Constraints

SLIDE 37

Features Versus Constraints

φ_i : X × Y → R;   C_i : X × Y → {0,1};   d : X × Y → R

 In principle, constraints and features can encode the same properties.

 In practice, they are very different:

 Features

   Local, short-distance properties, to allow tractable inference

   Propositional (grounded): e.g., true if “‘the’ followed by a Noun occurs in the sentence”

 Constraints

   Global properties

   Quantified, first-order logic expressions: e.g., true if “all y_i's in the sequence y are assigned different values”

Indeed, used differently.
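The two examples can be written out to make the contrast concrete: a grounded, local feature versus a constraint quantified over the whole output (function names are illustrative):

```python
# Feature phi_i : X x Y -> R. Propositional (grounded) and local:
# it fires on one specific lexical pattern.
def phi_the_noun(x, y):
    """1.0 iff 'the' followed by a token tagged Noun occurs in the sentence."""
    return float(any(w == "the" and t == "Noun" for w, t in zip(x, y[1:])))

# Constraint C_i : X x Y -> {0, 1}. Quantified over the entire output,
# regardless of the input x.
def c_all_different(x, y):
    """1 iff all y_i's in the sequence y are assigned different values."""
    return int(len(set(y)) == len(y))
```

The feature contributes a weighted term to the score wherever its pattern appears; the constraint is a single global predicate that inference must respect (or pay a penalty for).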

SLIDE 38

Encoding Prior Knowledge

Consider encoding the knowledge that: entities of type A and B cannot occur simultaneously in a sentence.

The “Feature” Way

  Results in a higher order HMM, CRF

  May require designing a model tailored to the knowledge/constraints

  Large number of new features: might require more labeled data

  Wastes parameters to learn indirectly knowledge we already have

The Constraints Way

  Keeps the model simple; adds expressive constraints directly

  A small set of constraints

  Allows for decision-time incorporation of constraints

SLIDE 39

Outline

 Constrained Conditional Models

Motivation

Examples

 Training Paradigms: Investigate ways for training models and combining constraints

Joint Learning and Inference vs. decoupling Learning & Inference

Guiding Semi-Supervised Learning with Constraints

Features vs. Constraints

Hard and Soft Constraints

 Examples

Semantic Parsing

Information Extraction

Pipeline processes

SLIDE 40

Guiding Semi-Supervised Learning with Constraints

(Figure: model; decision-time constraints; unlabeled data; constraints.)

In traditional semi-supervised learning, the model can drift away from the correct one.

Constraints can be used:

  At decision time, to bias the objective function towards favoring constraint satisfaction.

  At training time, to improve the labeling of unlabeled data (and thus improve the model).

SLIDE 41

Training Strategies

Hard Constraints or Weighted Constraints?

  Hard constraints: set penalties to infinity; no more degrees of violation

  Weighted constraints: need to figure out penalty values

Factored / Joint Approaches

  Factored models (L+I): learn the model weights and the constraints’ penalties separately

  Joint models (IBT): learn the model weights and the constraints’ penalties jointly

L+I vs. IBT: [Punyakanok et al. 05]

Training algorithms: L+CI, L+wCI, CIBT, wCIBT

SLIDE 42

Factored (L+I) Approaches

Learning model weights: HMM

Constraint penalties:

  Hard constraints: infinity

  Weighted constraints:  ρ_i = -log P{constraint C_i is violated in the training data}
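This penalty estimate is a one-liner; the function below is an illustrative sketch (the hard-constraint convention for zero observed violations is an assumption):

```python
import math

def constraint_penalty(num_violated, num_examples):
    """rho_i = -log P{constraint C_i is violated in the training data}.
    Rarely-violated constraints get large penalties; a constraint violated
    in every training example gets penalty 0 (it carries no information).
    A never-violated constraint is treated as hard (infinite penalty)."""
    if num_violated == 0:
        return math.inf
    return -math.log(num_violated / num_examples)
```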

SLIDE 43

Joint Approaches

Structured Perceptron

SLIDE 44

Semi-supervised Learning with Constraints

Γ = learn(T)                      (supervised learning algorithm, parameterized by γ)
For N iterations do:
    T = ∅
    For each x in the unlabeled dataset:
        {y1, …, yK} = InferenceWithConstraints(x, C, Γ)      (inference-based augmentation of the training set: feedback, with constrained inference)
        T = T ∪ {(x, yi)}, i = 1…K
    Γ = γ·Γ + (1-γ)·learn(T)      (learn from the new training data; weigh the supervised and unsupervised models)

[Chang, Ratinov, Roth, ACL’07]
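The loop above can be sketched in Python. The toy `learn` and `infer_top_k` at the bottom are purely illustrative stand-ins for a real model and constrained inference:

```python
def codl(labeled, unlabeled, learn, infer_top_k, gamma=0.9, iters=5):
    """Constraint-driven learning sketch: repeatedly retrain on
    constraint-guided self-labeled data, then interpolate with the
    previous model (gamma limits semi-supervised drift)."""
    model = learn(labeled)
    for _ in range(iters):
        t = list(labeled)                    # keep the supervised seed
        for x in unlabeled:
            for y in infer_top_k(model, x):  # K best legal labelings of x
                t.append((x, y))
        retrained = learn(t)
        model = [gamma * a + (1 - gamma) * b
                 for a, b in zip(model, retrained)]
    return model

# Purely illustrative instantiation: the "model" is [P(y = 1)] and
# inference just thresholds it (K = 1).
def learn(examples):
    return [sum(y for _, y in examples) / len(examples)]

def infer_top_k(model, x):
    return [1 if model[0] >= 0.5 else 0]

labeled = [(0, 1), (1, 1), (2, 0)]
model = codl(labeled, unlabeled=[3, 4], learn=learn, infer_top_k=infer_top_k)
```

Because γ is close to 1, the self-labeled data nudges the model rather than replacing it, which is the point of the interpolation step.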

SLIDE 45

Outline

Constrained Conditional Model

Features vs. Constraints

Inference

Training

Semi-supervised Learning

Results

Discussion

SLIDE 46

Results on Factored Model -- Citations

In all cases: the semi-supervised setting uses 1000 unlabeled examples.

In all cases: significantly better results than previously existing results. [Chang et al. ’07]
SLIDE 47

Results on Factored Model -- Advertisements

SLIDE 48

Hard Constraints vs. Weighted Constraints

Constraints are close to perfect.

Labeled data might not follow the constraints.

SLIDE 49

Factored vs. Joint Training

Using the best models for both settings:

  Factored training: HMM + weighted constraints

  Joint training: Perceptron + weighted constraints

  Same feature set. Agrees with earlier results in the supervised setting [ICML’05, IJCAI’05].

With constraints: the factored model is better; more significant with a small # of examples.

Without constraints: with few labeled examples, HMM > Perceptron; with many labeled examples, Perceptron > HMM.

SLIDE 50

Value of Constraints in Semi-Supervised Learning

(Plot: objective function value vs. # of available labeled examples.)

Learning with 10 constraints: constraints are used to bootstrap a semi-supervised learner. A poor model + constraints are used to annotate unlabeled data, which in turn is used to keep training the model.

Learning w/o constraints: 300 examples. Factored model.

SLIDE 51

Summary: Constrained Conditional Models

y* = argmax_y  Σ_i w_i φ_i(x, y)  -  Σ_i ρ_i d_{C_i}(x, y)

First term: a Conditional (Markov) Random Field. Linear objective function; typically φ(x, y) will be local functions, or φ(x, y) = φ(x).

Second term: a Constraints Network. Expressive constraints over output variables; soft, weighted constraints; specified declaratively as FOL formulae.

Clearly, there is a joint probability distribution that represents this mixed model.

We would like to:

  Learn a simple model, or several simple models

  Make decisions with respect to a complex model

Key difference from MLNs, which provide a concise definition of a model, but of the whole joint one.

SLIDE 52

Conclusion

Constrained Conditional Models combine

  Learning conditional models with using declarative expressive constraints

  Within a constrained optimization framework

Use constraints! The framework supports:

  A clean way of incorporating constraints to bias and improve decisions of supervised learning models

  Significant success on several NLP and IE tasks (often, with ILP)

  A clean way to use (declarative) prior knowledge to guide semi-supervised learning

Training protocol matters: more work needed here.

LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/~cogcomp. A modeling language for Constrained Conditional Models. Supports programming along with building learned models, high-level specification of constraints, and inference with constraints.

SLIDE 53

Questions?

Thank you