SLIDE 1

Policy Shaping and Generalized Update Equations for Semantic Parsing from Denotations

SLIDE 2

Semantic Parsing with Execution

Text ⇒ [Semantic Parsing] ⇒ Meaning Representation ⇒ [Execution, against the Environment] ⇒ Denotation (Answer)

SLIDE 3

Semantic Parsing with Execution

Text: “What nation scored the most points?”
Meaning Representation (via Semantic Parsing): Select Nation Where Points is Max
Denotation (via Execution): “England”

Environment:

Index  Name                Nation   Points  Games  Pts/game
1      Karen Andrew        England  44      5      8.8
2      Daniella Waterman   England  40      5      8
3      Christelle Le Duff  France   33      5      6.6
4      Charlotte Barras    England  30      5      6
5      Naomi Thomas        Wales    25      5      5
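
To make the execution step concrete, here is a minimal Python sketch that runs the slide's program against the slide's table. The list-of-dicts table encoding and the `select_where_max` helper are assumptions made for illustration; they are not the executor used in the talk.

```python
# Toy environment: the table from the slide, one dict per row.
COLUMNS = ["Index", "Name", "Nation", "Points", "Games", "Pts/game"]
ROWS = [
    (1, "Karen Andrew", "England", 44, 5, 8.8),
    (2, "Daniella Waterman", "England", 40, 5, 8.0),
    (3, "Christelle Le Duff", "France", 33, 5, 6.6),
    (4, "Charlotte Barras", "England", 30, 5, 6.0),
    (5, "Naomi Thomas", "Wales", 25, 5, 5.0),
]
TABLE = [dict(zip(COLUMNS, row)) for row in ROWS]


def select_where_max(table, select_col, arg_col):
    """Hypothetical executor for 'Select <select_col> Where <arg_col> is Max'."""
    best_row = max(table, key=lambda row: row[arg_col])
    return best_row[select_col]


# Meaning representation: Select Nation Where Points is Max
denotation = select_where_max(TABLE, "Nation", "Points")
print(denotation)
```

The denotation is simply whatever the program returns when run against the environment, which is why the same program can yield different answers on different tables.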

SLIDE 4
Indirect Supervision

  • No gold programs during training

Text: “What nation scored the most points?”
Meaning Representation (via Semantic Parsing): Select Nation Where Points is Max
Denotation (via Execution): “England”

Environment:

Index  Name                Nation   Points  Games  Pts/game
1      Karen Andrew        England  44      5      8.8
2      Daniella Waterman   England  40      5      8
3      Christelle Le Duff  France   33      5      6.6
4      Charlotte Barras    England  30      5      6
5      Naomi Thomas        Wales    25      5      5

SLIDE 5
Learning

  • Neural Model
      ○ x: “What nation scored the most points?”
      ○ y: Select Nation Where Index is Minimum
      ○ neural models ⇒ score(x, y): encode x, encode y, and produce a score
  • Argmax procedure
      ○ Beam search: argmax_y score(x, y)
  • Indirect supervision
      ○ Find approximate gold meaning representations
      ○ Reinforcement learning algorithms
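
The argmax step can be sketched as a small beam search. This is an illustrative sketch only: the two-step program space (choose a column, then an operator) and the hand-written `TOY_SCORES` are assumptions standing in for the neural score(x, y) on the slide.

```python
def beam_search(expand, score_step, max_len, beam_size):
    """Approximate argmax_y score(x, y) by keeping the top `beam_size` prefixes."""
    beam = [((), 0.0)]  # each entry is (program prefix, cumulative score)
    for _ in range(max_len):
        candidates = [
            (prefix + (token,), total + score_step(prefix, token))
            for prefix, total in beam
            for token in expand(prefix)
        ]
        # Keep only the highest-scoring partial programs.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam


# Hypothetical program space: "Select Nation Where <column> is <operator>".
COLUMN_CHOICES = ["Points", "Index", "Pts/game"]
OPERATOR_CHOICES = ["Maximum", "Minimum"]
# Made-up per-token scores; a real model would compute these from x and the prefix.
TOY_SCORES = {"Points": 1.0, "Index": 0.2, "Pts/game": 0.6,
              "Maximum": 0.9, "Minimum": 0.1}


def expand(prefix):
    """First step picks a column, second step picks an operator."""
    return COLUMN_CHOICES if len(prefix) == 0 else OPERATOR_CHOICES


def score_step(prefix, token):
    return TOY_SCORES[token]


beam = beam_search(expand, score_step, max_len=2, beam_size=2)
best_program, best_score = beam[0]
print(best_program)
```

Beam search is approximate: a prefix pruned early can never be recovered, so the returned program is the best among those explored rather than a guaranteed global argmax.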

SLIDE 6
Semantic Parsing with Indirect Supervision

For Training:

  • Question: “What nation scored the most points?”
  • Answer: “England”

Index  Name                Nation   Points  Games  Pts/game
1      Karen Andrew        England  44      5      8.8
2      Daniella Waterman   England  40      5      8
3      Christelle Le Duff  France   33      5      6.6
4      Charlotte Barras    England  30      5      6
5      Naomi Thomas        Wales    25      5      5

SLIDE 7

Search for Training

  • A correct program should execute to the gold answer.
  • In general, there are several spurious programs that execute to the gold answer but are semantically incorrect.

SLIDE 8

Search for Training: Spurious Programs

  • Question: “What nation scored the most points?”

Select Nation Where Points = 44 ⇒ “England”
Select Nation Where Index is Minimum ⇒ “England”
Select Nation Where Pts/game is Maximum ⇒ “England”
Select Nation Where Points is Maximum ⇒ “England”

  • Search for training. Goal: find the semantically correct parse!
  • All programs above generate the right answer, but only one is correct.
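
The spurious-program problem can be reproduced in a few lines. The sketch below executes the four programs from this slide against the example table and keeps those that return the gold answer; the `(kind, column, value)` program encoding and the `execute` helper are assumptions for illustration.

```python
COLUMNS = ["Index", "Name", "Nation", "Points", "Games", "Pts/game"]
ROWS = [
    (1, "Karen Andrew", "England", 44, 5, 8.8),
    (2, "Daniella Waterman", "England", 40, 5, 8.0),
    (3, "Christelle Le Duff", "France", 33, 5, 6.6),
    (4, "Charlotte Barras", "England", 30, 5, 6.0),
    (5, "Naomi Thomas", "Wales", 25, 5, 5.0),
]
TABLE = [dict(zip(COLUMNS, row)) for row in ROWS]


def execute(table, program):
    """Return the Nation of the row selected by a (kind, column, value) condition."""
    kind, column, value = program
    if kind == "eq":
        row = next(r for r in table if r[column] == value)
    elif kind == "max":
        row = max(table, key=lambda r: r[column])
    elif kind == "min":
        row = min(table, key=lambda r: r[column])
    else:
        raise ValueError(f"unknown condition: {kind}")
    return row["Nation"]


# Hypothetical encodings of the four programs listed on the slide.
PROGRAMS = {
    "Select Nation Where Points = 44": ("eq", "Points", 44),
    "Select Nation Where Index is Minimum": ("min", "Index", None),
    "Select Nation Where Pts/game is Maximum": ("max", "Pts/game", None),
    "Select Nation Where Points is Maximum": ("max", "Points", None),
}

gold_answer = "England"
consistent = [name for name, prog in PROGRAMS.items()
              if execute(TABLE, prog) == gold_answer]
print(len(consistent), "of", len(PROGRAMS), "programs return the gold answer")
```

Execution alone cannot separate the four candidates, which is why the search step needs an extra signal about which program actually matches the question.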

SLIDE 9

Update Step

  • Generally there are several methods to update the model.
  • Examples: maximum marginal likelihood, reinforcement learning, margin methods.

SLIDE 10
Contributions

  • (1) Policy shaping for handling spurious programs
  • (2) Generalized update equation for generalizing common update strategies and allowing novel updates
  • (1) and (2) seem independent, but they interact with each other!
  • 5% absolute improvement over SOTA on the SQA dataset

SLIDE 11

Learning from Indirect Supervision

  • Question x, Table t, Answer z, Parameters θ

1. [Search for Training] Given x, t, z, use beam search to find a suitable set K = {y’}
2. [Update] Update θ according to K = {y’}

SLIDE 12

Spurious Programs

  • Question x, Table t, Answer z, Parameters θ

1. [Search for Training] Given x, t, z, use beam search to find a suitable set {y’}

  • If the model selects a spurious program for the update, it increases the chance of selecting spurious programs in the future.

SLIDE 13

Policy Shaping [Griffith et al., NIPS-2013]

SLIDE 14

Search with Shaped Policy

  • Question x, Table t, Answer z, Parameters θ

1. [Search for Training] Given x, t, z, use beam search to find a suitable set {y’}
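
One way to picture searching with a shaped policy is sketched below. It assumes the multiplicative combination associated with policy shaping (the model's distribution times a critique distribution, renormalized); the probabilities are made-up numbers, and `shape_policy` is a hypothetical helper rather than the talk's implementation.

```python
def shape_policy(model_probs, critique_probs):
    """Behavior policy p_b(y) proportional to p_model(y) * p_critique(y)."""
    unnormalized = {y: model_probs[y] * critique_probs[y] for y in model_probs}
    total = sum(unnormalized.values())
    return {y: value / total for y, value in unnormalized.items()}


# Made-up model probabilities: the model currently prefers a spurious program.
model_probs = {
    "Select Nation Where Index is Minimum": 0.5,
    "Select Nation Where Points is Maximum": 0.3,
    "Select Nation Where Points = 44": 0.2,
}
# Made-up critique probabilities: the critique favors programs matching the question.
critique_probs = {
    "Select Nation Where Index is Minimum": 0.1,
    "Select Nation Where Points is Maximum": 0.7,
    "Select Nation Where Points = 44": 0.2,
}

shaped = shape_policy(model_probs, critique_probs)
print(max(model_probs, key=model_probs.get))   # preferred before shaping
print(max(shaped, key=shaped.get))             # preferred after shaping
```

Under these toy numbers the critique is strong enough to move the search away from the spurious program even though the model alone would have ranked it first.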

SLIDE 15
Critique Policy

  • 1. Surface-form Match: features triggered for constants in the program that match a token in the question.
  • 2. Lexical Pair Score: features triggered between keywords and tokens (e.g., Maximum and “most”).

SLIDE 16

Critique Policy Features

Question: “What nation scored the most points?”

Select Nation Where Points = 44
Select Nation Where Index is Minimum
Select Nation Where Pts/game is Maximum
Select Nation Where Points is Maximum
Select Nation Where Name = Karen Andrew

Figure annotations: lexical pair match; surface-form match
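
The two critique features can be approximated with simple string matching. In the sketch below, the treatment of column names and cell values as "constants", the one-entry-per-keyword lexicon, and the unweighted sum used for ranking are all assumptions for illustration; the actual feature definitions and weights are not given on the slides.

```python
QUESTION = "What nation scored the most points?"

# Hypothetical lexicon of (program keyword, question token) pairs.
LEXICAL_PAIRS = {("Maximum", "most"), ("Minimum", "least")}

# Hypothetical decomposition of each program into keywords and constants.
PROGRAMS = {
    "Select Nation Where Points = 44":
        {"keywords": [], "constants": ["Nation", "Points", "44"]},
    "Select Nation Where Index is Minimum":
        {"keywords": ["Minimum"], "constants": ["Nation", "Index"]},
    "Select Nation Where Pts/game is Maximum":
        {"keywords": ["Maximum"], "constants": ["Nation", "Pts/game"]},
    "Select Nation Where Points is Maximum":
        {"keywords": ["Maximum"], "constants": ["Nation", "Points"]},
    "Select Nation Where Name = Karen Andrew":
        {"keywords": [], "constants": ["Nation", "Name", "Karen Andrew"]},
}


def tokenize(text):
    """Lowercase whitespace tokenization with trailing punctuation removed."""
    return [tok.strip("?.,!").lower() for tok in text.split()]


def surface_form_match(constants, tokens):
    """Count constants whose words all appear among the question tokens."""
    return sum(all(w in tokens for w in c.lower().split()) for c in constants)


def lexical_pair_score(keywords, tokens):
    """Count (keyword, token) pairs that appear in the lexicon."""
    return sum((kw, tok) in LEXICAL_PAIRS for kw in keywords for tok in tokens)


tokens = tokenize(QUESTION)
features = {
    name: (surface_form_match(p["constants"], tokens),
           lexical_pair_score(p["keywords"], tokens))
    for name, p in PROGRAMS.items()
}
best = max(features, key=lambda name: sum(features[name]))
print(best, features[best])
```

With these toy features, only "Select Nation Where Points is Maximum" fires both a lexical pair ("Maximum" with "most") and two surface-form matches ("Nation" and "Points").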

SLIDE 17

Learning Pipeline Revisited

1. [Search for Training] Given x, t, z, use beam search to find a suitable set K = {y’}   ⇐ Shaping affects here
     • Using policy shaping to find a “better” K
2. [Update] Update θ according to K = {y’}
     • What is the better objective function Jθ?

SLIDE 18

Objective Functions Look Different!

  • Maximum Marginal Likelihood (MML)
  • Reinforcement learning (RL)
  • Maximum Margin Reward (MMR)
      ○ Maximum-reward program
      ○ Most violated program, generated according to reward-augmented inference
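
The equations on this slide did not survive extraction. As a hedged reconstruction, the three objectives are commonly written as below. The notation is an assumption for illustration and may differ from the original slide: p_θ(y | x, t) is the model policy, s_θ(y) its score, K the set of programs found by search, R(y, z) the reward, y* the maximum-reward program, and ȳ the most violated program.

```latex
% Assumed notation, not taken from the slide.
% For MML, K is restricted to programs that execute to the answer z.
\begin{align*}
J_{\mathrm{MML}} &= \log \sum_{y \in K} p_\theta(y \mid x, t) \\
J_{\mathrm{RL}}  &= \sum_{y \in K} p_\theta(y \mid x, t)\, R(y, z) \\
J_{\mathrm{MMR}} &= -\max\Bigl(0,\; s_\theta(\bar{y}) - s_\theta(y^{*})
                     + \delta(y^{*}, \bar{y}, z)\Bigr),
\qquad \delta(y^{*}, \bar{y}, z) = R(y^{*}, z) - R(\bar{y}, z) \\
\bar{y} &= \arg\max_{y}\; s_\theta(y) + \delta(y^{*}, y, z)
\end{align*}
```

All three are written as quantities to maximize. The last line is the reward-augmented inference mentioned in the MMR annotation.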

SLIDE 19

Update Rules are Similar

  • Maximum Marginal Likelihood (MML)
  • Reinforcement learning (RL)
  • Maximum Margin Reward (MMR)

SLIDE 20

Generalized Update Equation

2. [Update] Update θ according to K = {y’}
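
The equation itself is missing from the extracted slide. A hedged sketch of its general shape, using the "intensity" and "competing distribution" terminology that appears later in the deck, is given below. The symbols w, q, s_θ, and the specific instantiations are assumptions: they are what one obtains by differentiating the usual MML, RL, and MMR objectives for a softmax policy, not a transcription of the slide.

```latex
% Assumed form: w is the intensity, q the competing distribution.
\begin{align*}
\Delta(K) &= \sum_{y \in K} w(y, K)
  \Bigl( \nabla_\theta s_\theta(y)
       - \sum_{y'} q(y' \mid y, K)\, \nabla_\theta s_\theta(y') \Bigr) \\[4pt]
\text{MML:}\quad & w(y, K) = \frac{p_\theta(y)}{\sum_{y'' \in K} p_\theta(y'')},
  \qquad q(y' \mid y, K) = p_\theta(y') \\
\text{RL:}\quad  & w(y, K) = p_\theta(y)\, R(y, z),
  \qquad q(y' \mid y, K) = p_\theta(y') \\
\text{MMR:}\quad & w(y, K) = \mathbf{1}[y = y^{*}],
  \qquad q(y' \mid y, K) = \mathbf{1}[y' = \bar{y}]
\end{align*}
```

Read this way, each update pushes the score of selected programs up (weighted by the intensity) and the score of competing programs down (weighted by the competing distribution).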

SLIDE 21
Improvement over Margin Approaches

  • MMR
  • MAVER

SLIDE 22

Results on SQA: Answer Accuracy (%)

  • Policy shaping helps improve performance.
  • With policy shaping, the choice of update matters even more.
  • Achieves a new state of the art (previously 44.7%) on SQA.

SLIDE 23

Comparing Updates

  • MMR and MAVER are more “aggressive” than MML
      ○ MMR and MAVER update toward one program
      ○ MML updates toward all programs that can generate the correct answer
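
The difference in aggressiveness can be seen in how much credit each program in K receives. The sketch below assumes MML weights each program by its renormalized model probability and MMR puts all weight on the highest-scoring program in K; the probabilities and helper names are made up for illustration.

```python
def mml_intensity(probs, K):
    """MML-style credit: p(y) renormalized over K, so every program in K gets some."""
    mass = sum(probs[y] for y in K)
    return {y: probs[y] / mass for y in K}


def mmr_intensity(probs, K):
    """MMR-style credit: all weight on the single highest-probability program in K."""
    best = max(K, key=lambda y: probs[y])
    return {y: 1.0 if y == best else 0.0 for y in K}


# Made-up model probabilities over candidate programs.
probs = {
    "Points = 44": 0.1,
    "Index is Minimum": 0.4,
    "Points is Maximum": 0.3,
    "Name = Naomi Thomas": 0.2,
}
# Programs returned by search that execute to the gold answer.
K = ["Points = 44", "Index is Minimum", "Points is Maximum"]

mml = mml_intensity(probs, K)
mmr = mmr_intensity(probs, K)
print(mml)
print(mmr)
```

With these numbers the MMR-style update commits entirely to "Index is Minimum", a spurious program, while the MML-style update hedges across all three. A one-program update therefore depends much more on the search returning the right program.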

SLIDE 24
Conclusion

  • Discussed problems with the search and update steps in semantic parsing from denotations.
  • Introduced policy shaping for biasing the search away from spurious programs.
  • Introduced a generalized update equation that generalizes common update strategies and allows novel updates.
  • Policy shaping allows more aggressive updates!

SLIDE 25

BACKUP

SLIDE 26

Generalized Update as an Analysis Tool

  • MMR and MAVER are more “aggressive” than MML
      ○ MMR and MAVER only pick one program
      ○ MML gives credit to all programs y that yield the answer z
      ○ MMR and MAVER benefit more from shaping

SLIDE 27

Learning from Indirect Supervision

  • Question x, Table t, Answer z, Parameters θ

1. [Search for Training] Given x, t, z, use beam search to find a suitable set {y’}
     • Search in training. Goal: find a semantically correct y’
2. [Update] Update θ according to {y’}
     • There are many different ways to update θ

SLIDE 28

Shaping and update

1. [Search for Training] Given x, t, z, use beam search to find a suitable set K = {y’}   ⇐ Shaping affects here directly
     • Using policy shaping to find a “better” K
2. [Update] Update θ according to K = {y’}   ⇐ Shaping affects here indirectly
     • What is the better objective function Jθ?

Better search ⇒ more aggressive update

SLIDE 29
Novel Learning Algorithm

  • Mixing MMR’s intensity with MML’s competing distribution gives an update that outperforms MMR.

Intensity                          Competing Distribution             Dev Performance w/o shaping
Maximum Marginal Likelihood (MML)  Maximum Marginal Likelihood (MML)  32.4
Maximum Margin Reward (MMR)        Maximum Margin Reward (MMR)        40.7
Maximum Margin Reward (MMR)        Maximum Marginal Likelihood (MML)  41.9
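
A minimal sketch of how such a mixed update could be assembled follows. It assumes a linear score θ·φ(y) with a softmax policy, an update direction of the form Σ_y w(y) (φ(y) − Σ_y' q(y') φ(y')), and toy features; the function names and all numbers are hypothetical and do not reproduce the reported results.

```python
import math


def softmax(scores):
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def mml_intensity(probs, K):
    mass = sum(probs[y] for y in K)
    return {y: probs[y] / mass for y in K}


def mmr_intensity(probs, K):
    best = max(K, key=lambda y: probs[y])
    return {y: 1.0 if y == best else 0.0 for y in K}


def mml_competing(probs, scores, K):
    """Compete against the model's full distribution."""
    return dict(probs)


def mmr_competing(probs, scores, K):
    """Compete against the most violated program (score plus a 0/1 reward margin)."""
    worst = max(scores, key=lambda y: scores[y] + (0.0 if y in K else 1.0))
    return {y: 1.0 if y == worst else 0.0 for y in scores}


def generalized_update(features, theta, K, intensity, competing):
    """Direction sum_y w(y) * (phi(y) - sum_y' q(y') * phi(y')) for a linear score."""
    scores = {y: sum(t * f for t, f in zip(theta, phi)) for y, phi in features.items()}
    probs = softmax(scores)
    w = intensity(probs, K)
    q = competing(probs, scores, K)
    dim = len(theta)
    expected = [sum(q[y] * features[y][i] for y in q) for i in range(dim)]
    return [sum(w[y] * (features[y][i] - expected[i]) for y in w) for i in range(dim)]


# Toy setup: three candidate programs with indicator features; y1 and y2 give the gold answer.
features = {"y1": [1, 0, 0], "y2": [0, 1, 0], "y3": [0, 0, 1]}
theta = [1.0, 0.0, 0.5]
K = ["y1", "y2"]

mml = generalized_update(features, theta, K, mml_intensity, mml_competing)
mmr = generalized_update(features, theta, K, mmr_intensity, mmr_competing)
mixed = generalized_update(features, theta, K, mmr_intensity, mml_competing)
print(mml, mmr, mixed, sep="\n")
```

In this toy case the mixed update promotes a single program, as MMR does, but pushes down every competitor in proportion to its model probability, as MML does, rather than only the single most violated one.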

SLIDE 30

Novel Learning Algorithms

SLIDE 31

Learning Method #1: Maximum Marginal Likelihood (MML)

SLIDE 32

Learning Method #2: Reinforcement Learning (RL)

SLIDE 33

Learning Method #3: Maximum Margin Reward (MMR)

SLIDE 34

Learning Method #4: Maximum Margin Average Violation Reward (MAVER)