Trainable Decoding of Sets of Sequences for Neural Sequence Models
Ashwin Kalyan, Peter Anderson, Stefan Lee, Dhruv Batra
ICML 2019



SLIDE 4

Standard Sequence Prediction Pipeline

[Diagram: an RNN unrolled over time — hidden state h_{t−1} feeds h_t, producing outputs y_t, y_{t+1}]

  • 1. Train the RNN to maximize log likelihood

[Diagram: beam search with beam width B = 2 over candidate words "This", "is", "a", "the", "picture", "shows"]

  • 2. Perform Beam Search to decode the top K
  • 3. Return the best sequence in the top K

> A kitchen with a stove.
> A kitchen with a stove and a sink.
> A kitchen with a stove and a microwave.
> A kitchen with a stove and a refrigerator.
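Steps 2 and 3 of the pipeline can be sketched as a plain beam search over per-step log-probabilities. Here `step_log_probs` and `toy_model` are hypothetical stand-ins for the trained RNN's softmax, not the authors' code:

```python
import math

def beam_search(step_log_probs, B=2, T=4, eos="</s>"):
    """Keep the B highest-scoring prefixes at every time step."""
    beams = [([], 0.0)]  # (prefix, cumulative log-prob)
    for _ in range(T):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:      # finished beams carry over
                candidates.append((prefix, score))
                continue
            for word, lp in step_log_probs(prefix).items():
                candidates.append((prefix + [word], score + lp))
        # keep the top-B of the (up to) |V| x B expansions
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams

# Hypothetical toy model: slightly prefers "a", emits eos after 3 words.
def toy_model(prefix):
    if len(prefix) >= 3:
        return {"</s>": math.log(0.9), "a": math.log(0.1)}
    return {"a": math.log(0.5), "the": math.log(0.3), "is": math.log(0.2)}

print(beam_search(toy_model, B=2, T=4))
```

Because every beam greedily chases the same high-probability words, the B returned sequences tend to differ only in a word or two — the failure mode the next slides illustrate.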

SLIDE 5

But… many real-world tasks are multi-modal!

✓ A group of people riding horses.
✓ Kids riding horses with adults help.
✓ A girl poses on her horse in equestrian dress by a small crowd.
✓ Some people stand near some horses in a field.
✓ People are standing around children riding horses in a grassy area.
✓ A small girl is riding a large light brown horse.
✓ A young girl in riding gear mounts a pony in front of a group.
✓ A group of people with a jockey and her horse.
✓ Several people playing with ponies in a park.

How to model more than one correct output?

SLIDE 6

Retool the Standard Sequence Prediction Pipeline

[Diagram: an RNN unrolled over time — hidden state h_{t−1} feeds h_t, producing outputs y_t, y_{t+1}]

  • 1. Train the RNN to maximize log likelihood

[Diagram: beam search with beam width B = 2 over candidate words "This", "is", "a", "the", "picture", "shows"]

  • 2. Perform Beam Search to decode the top K
  • 3. Return the best sequence in the top K

> A kitchen with a stove.
> A kitchen with a stove and a sink.
> A kitchen with a stove and a microwave.
> A kitchen with a stove and a refrigerator.



SLIDE 9

Beam Search outputs are nearly identical!

> A group of people riding horses on a field.
> A group of people riding horses in a field.
> A group of people riding horses down a dirt road.
> A group of people riding horses through a field.
> A group of people riding on the back of horses.
> A group of people riding on the back of a horse.
> A group of people riding on a horse.
> A couple of people riding on the back of horses.
> A couple of people riding on the back of a horse.
> A couple of people riding horses on a field.

Fails to COVER the variation in the output space!
Doesn’t model intra-set interactions!

SLIDE 13

Learning to Decode Sets of Sequences

Select the top-B words at each time step, until the end token is generated or the maximum time step is reached

[Diagram: beam search (B = 2) expanding candidate words ("This", "a", "shows", "is", "picture") at step t]


SLIDE 15

Beam Search as Subset Selection

[Diagram: the incoming beams are EXPANDed into the set of all |V| × B possible expansions, then MERGEd into the outgoing beams]

SLIDE 19

Submodular Maximization for Subset Selection

[Diagram: the incoming beams are EXPANDed into the set of all |V| × B possible expansions; the outgoing beams are selected by submodular maximization]

  • Naturally models coverage, promoting diversity
  • NP-hard!
  • Greedy algorithms with approximation guarantees exist!
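The greedy algorithm alluded to here (the classic 1 − 1/e approximation for monotone submodular maximization) can be sketched as follows; the `coverage` objective and the candidate captions are illustrative toys, not the paper's learned function:

```python
def greedy_max(f, candidates, B):
    """Greedily pick B elements, each maximizing the marginal gain of f."""
    S = []
    for _ in range(B):
        best = max((c for c in candidates if c not in S),
                   key=lambda c: f(S + [c]) - f(S))
        S.append(best)
    return S

# Toy submodular objective: number of distinct words covered by the set.
def coverage(S):
    return len({w for caption in S for w in caption.split()})

candidates = ["a group riding horses",
              "a group riding horses in a field",
              "kids riding ponies with adults"]
print(greedy_max(coverage, candidates, B=2))
```

Note how the second pick is the caption with the largest *marginal* gain, not the second-highest-scoring caption on its own — that is exactly the diversity-promoting behavior beam search lacks.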

SLIDE 20

Learning Submodular Functions

[Diagram: a deep submodular function — every element e ∈ S has a non-negative feature φ(e) ≥ 0; the set feature Σ_{e∈S} φ(e) passes through an MLP with non-negative weights W ≥ 0 and a log(1 + ·) nonlinearity, and f(S) is a weighted sum Σ_i w_i (·) with w_i ≥ 0]

[Bilmes et al., 2017]
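A minimal sketch of the construction in the diagram, relying on the standard fact that a non-negative combination of concave functions of non-negative modular features is monotone submodular; `phi`, `W`, and `w` are hypothetical toy values, not learned parameters:

```python
import numpy as np

def deep_submodular_f(S, phi, W, w):
    """f(S) = sum_i w_i * log(1 + (W @ m(S))_i), m(S) = sum_{e in S} phi(e).
    Monotone submodular as long as phi(e) >= 0, W >= 0 and w >= 0."""
    m = sum((phi[e] for e in S), np.zeros(W.shape[1]))  # modular set feature
    return float(w @ np.log1p(W @ m))                   # concave + nonneg mix

# Hypothetical toy parameters (all non-negative, as the construction requires).
phi = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 2.0])}
W = np.array([[1.0, 0.5], [0.2, 1.0]])
w = np.array([0.3, 0.7])
```

Because the non-negativity constraints alone guarantee submodularity, the weights can be trained freely (e.g. by gradient descent with a projection to W ≥ 0, w ≥ 0) without losing the greedy algorithm's approximation guarantee.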

SLIDE 21

∇BS (diff-BS)

FOR t = 1 to T:
  • 1. Construct the set of all possible extensions, Y_{t−1} × |V|
  FOR k = 1 to K:
    • 2. Compute the marginal gain of each extension
    • 3. Sample an extension proportional to its marginal gain
RETURN: the set of K sequences of length T
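A rough sketch of one time step of the loop above, with a generic `marginal_gain` callable standing in for the learned submodular scorer; the function names and the toy gain are hypothetical, not the paper's implementation:

```python
import random

def diff_bs_step(beams, vocab, marginal_gain, K, rng=random):
    """One diff-BS time step: pick K extensions, sampling each with
    probability proportional to its marginal gain given the set so far."""
    extensions = [b + [w] for b in beams for w in vocab]   # Y_{t-1} x |V|
    chosen = []
    for _ in range(K):
        pool = [e for e in extensions if e not in chosen]
        gains = [marginal_gain(chosen, e) for e in pool]
        # inverse-CDF sampling proportional to marginal gain
        r, acc = rng.uniform(0, sum(gains)), 0.0
        for ext, g in zip(pool, gains):
            acc += g
            if acc >= r:
                chosen.append(ext)
                break
    return chosen

# Hypothetical gain: reward extensions ending in a word not yet used.
def toy_gain(chosen, ext):
    used = {seq[-1] for seq in chosen}
    return 2.0 if ext[-1] not in used else 0.5

rng = random.Random(0)
out = diff_bs_step([["This"], ["A"]], ["is", "cat"], toy_gain, K=2, rng=rng)
print(out)
```

Sampling (rather than argmax-ing) the extensions is what makes the procedure differentiable in expectation, so the scorer can be trained end-to-end.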


SLIDE 32

“Set of Sequences” Level Training

π* = argmax_{π ∈ Π} E_{(Y₁, …, Y_T) ∼ π(·|x)} [ SET-METRIC(Y | x) ]

  • Set-metric?
    • Oracle accuracy, average accuracy
    • Facility Location Accuracy [NEW]
  • Training?
    • Teacher Forcing if multiple annotations are available
    • Imitation Learning if an expert is available
    • REINFORCE to directly optimize for the set-metric
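The flavor of the Facility Location Accuracy above can be illustrated with a facility-location-style coverage score (a sketch, not the paper's exact definition): each reference caption is "served" by its most similar predicted sequence, so a set covering all reference modes beats a redundant one.

```python
def facility_location_score(predictions, references, sim):
    """Average over references of the best similarity achieved by
    any prediction -- rewards sets that COVER the reference modes."""
    return sum(max(sim(r, p) for p in predictions)
               for r in references) / len(references)

# Toy similarity: Jaccard overlap of word sets (a stand-in for e.g. BLEU).
def jaccard(a, b):
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

refs = ["people riding horses", "a girl on a pony"]
diverse = ["people riding horses", "girl on a pony"]
redundant = ["people riding horses", "people riding a horse"]
print(facility_location_score(diverse, refs, jaccard),
      facility_location_score(redundant, refs, jaccard))
```

Unlike oracle accuracy, which only scores the single best prediction, this metric depends on the whole set, which is why it pairs naturally with set-level REINFORCE training.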
SLIDE 33

In Summary

  • Novel perspective: Beam Search as Subset Selection
    • Models intra-set dependencies
    • Can be used with arbitrary set constraints
  • No train-test or loss-evaluation mismatch
  • Outperforms Beam Search and other baselines on captioning
  • Doesn’t scale very well with beam size (some tricks in the paper)


SLIDE 39

Poster: Pacific Ballroom #48, June 13th, 6:30 pm

Paper: http://proceedings.mlr.press/v97/kalyan19a.html
Code: https://github.com/ashwinkalyan/diff-bs