Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement


Wouter Kool, Herke van Hoof, Max Welling - "This is how you randomize a beam search!" - "You will never have duplicate samples again!"


slide-1
SLIDE 1

STOCHASTIC BEAMS AND WHERE TO FIND THEM

The Gumbel-Top-k Trick for Sampling Sequences Without Replacement

Wouter Kool, Herke van Hoof, Max Welling

ICML 2019, Best Paper Honorable Mention

“This is how you randomize a beam search!”

“You will never have duplicate samples again!”

slide-2
SLIDE 2

TL;DR

Stochastic Beam Search finds a set of unique samples (without replacement) from a sequence model.

slide-3
SLIDE 3

Example

Binarese language model

Vocabulary: {Abra, Cadabra}

[Figure: tree of all sequences up to length 3 over the tokens A and C (A, C; AA, AC, CA, CC; AAA, ..., CCC), annotated with (log-)probability]

What if we want a sample from our model?

ACC

$P(C) \, P(A \mid C) \, P(C \mid CA) = P(CAC)$

slide-4
SLIDE 4

The Gumbel-Max Trick

$G_i \sim \mathrm{Gumbel}(0)$: Gumbel noise

$\phi_i = \log p_i$: log-probability

$G_{\phi_i} \sim \mathrm{Gumbel}(\phi_i)$: perturbed log-probability

$G_{\phi_i} = \phi_i + G_i$

(Gumbel, 1945; Maddison et al., 2014)

“Prof. Gumbeldore”

slide-5
SLIDE 5

The Gumbel-Max Trick

$\max_i G_{\phi_i} \sim \mathrm{Gumbel}\left(\log \sum_i \exp \phi_i\right)$

$I = \arg\max_i G_{\phi_i} \sim \mathrm{Categorical}(p_i)$, i.e. $P(I = i) = p_i$

max and argmax are independent

(Gumbel, 1945; Maddison et al., 2014)

“Prof. Gumbeldore”
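The Gumbel-Max trick can be checked empirically: add independent Gumbel(0) noise to the log-probabilities, take the argmax, and compare the frequencies with the original probabilities. This is a minimal sketch; the function name `gumbel_max_sample` and the three-way distribution `p` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def gumbel_max_sample(log_p, rng):
    """Return (argmax, max) of log_p[i] + G_i with G_i ~ Gumbel(0)."""
    perturbed = log_p + rng.gumbel(size=log_p.shape)
    return int(np.argmax(perturbed)), float(np.max(perturbed))

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # illustrative probabilities
log_p = np.log(p)

draws = np.array([gumbel_max_sample(log_p, rng)[0] for _ in range(20000)])
freq = np.bincount(draws, minlength=len(p)) / len(draws)
print(freq)   # empirical frequencies, to be compared with p
```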

slide-6
SLIDE 6

Example

Binarese language model

Vocabulary: {Abra, Cadabra}

[Figure: tree of all sequences up to length 3 over the tokens A and C, annotated with (log-)probability]

What if we want a sample from our model?

ACC

This will be our sample!
slide-7
SLIDE 7

What happens if, instead of 1 (one), we take the $k$ largest elements (top $k$)?

$I_1, \ldots, I_k = \arg \underset{i}{\operatorname{top}\,k} \; G_{\phi_i}$

$k = 3$

slide-8
SLIDE 8

The ‘Gumbel-Top-k’ Trick

$I_1, \ldots, I_k = \arg \underset{i}{\operatorname{top}\,k} \; G_{\phi_i}$

This is equivalent to repeated sampling without replacement!

$$P(I_1 = i_1, \ldots, I_k = i_k) = p_{i_1} \cdot \frac{p_{i_2}}{1 - p_{i_1}} \cdot \ldots \cdot \frac{p_{i_k}}{1 - \sum_{\ell=1}^{k-1} p_{i_\ell}} = \prod_{j=1}^{k} \frac{p_{i_j}}{1 - \sum_{\ell=1}^{j-1} p_{i_\ell}}$$

(Vieira, 2014) Also known as Plackett-Luce

slide-9
SLIDE 9

Example

Binarese language model

Vocabulary: {Abra, Cadabra}

[Figure: tree of all sequences up to length 3 over the tokens A and C, annotated with (log-)probability]

ACC CAA ACA

This will be our set of samples!

We can get a set of unique samples from our model!
slide-10
SLIDE 10

PROBLEM

In general, constructing the full tree is not possible…

… but we don’t have to!

slide-11
SLIDE 11

[Figure: tree of all sequences up to length 3 over the tokens A and C; $S$ marks the subtree below the partial sequence “C”]

Perturbed log-probability of partial sequence (“C”)

Look at maximum of perturbed log-probabilities in subtree:

$G_{\phi_S} = \max_{i \in S} G_{\phi_i} \sim \mathrm{Gumbel}\left(\log \sum_{i \in S} \exp \phi_i\right)$

where $\phi_S = \log \sum_{i \in S} \exp \phi_i$ is the log-probability of “C”

Noise $G_S \sim \mathrm{Gumbel}(0)$ is inferred

We can sample $G_{\phi_S} \sim \mathrm{Gumbel}(\phi_S)$ directly

slide-12
SLIDE 12

[Figure: tree of all sequences up to length 3 over the tokens A and C]

Top-down sampling

Start from root, sample $G_{\phi_S} \sim \mathrm{Gumbel}(\phi_S)$

Sample children $G_{\phi_{S'}}$ conditionally on $\max_{S' \in \mathrm{Children}(S)} G_{\phi_{S'}} = G_{\phi_S}$:

1. Sample $G_{\phi_{S'}}$ independently, compute $Z = \max_{S'} G_{\phi_{S'}}$

2. ‘Shift’ Gumbels in (negative) exponential space:

$\tilde{G}_{\phi_{S'}} = -\log\left(\exp(-G_{\phi_S}) - \exp(-Z) + \exp(-G_{\phi_{S'}})\right)$

CAA ACC ACA

… the result is equivalent to sampling $G_{\phi_i}$ for leaves directly!

(Maddison et al., 2014)
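The two-step conditional sampling can be sketched as follows: draw the children's perturbed log-probabilities independently, then shift them so that their maximum equals the parent's value. This is a minimal, numerically naive sketch; the function name `sample_children_given_max` and the probabilities are illustrative assumptions.

```python
import numpy as np

def sample_children_given_max(child_log_p, parent_perturbed, rng):
    """Sample children's perturbed log-probs conditioned on their max being the parent's."""
    # Step 1: independent Gumbels located at the children's log-probabilities.
    g = child_log_p + rng.gumbel(size=child_log_p.shape)
    z = g.max()
    # Step 2: shift in negative exponential space so the maximum becomes parent_perturbed.
    return -np.log(np.exp(-parent_perturbed) - np.exp(-z) + np.exp(-g))

rng = np.random.default_rng(0)
phi_parent = np.log(0.6)                                    # illustrative log-probability
child_log_p = phi_parent + np.log(np.array([0.25, 0.75]))   # children sum to the parent
g_parent = phi_parent + rng.gumbel()                        # parent's perturbed log-probability
g_children = sample_children_given_max(child_log_p, g_parent, rng)
print(g_parent, g_children)
```

After the shift, the largest child value coincides with `g_parent` and every other child lies below it, which is the consistency condition stated above.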

slide-13
SLIDE 13

[Figure: partially expanded tree over the tokens A and C (A, C; AA, AC, CA, CC; ACA, ACC, CAA, CAC, CCA, CCC); the node AA is not expanded]

The Key Insight

We only need to expand the top $k$ nodes at each level in the tree

CAA ACC ACA

Threshold

Each top $k$ node generates (at least) one leaf (maximum) above threshold

At least $k$ leaves will be above threshold

Other nodes only generate leaves below threshold

No need to expand

slide-14
SLIDE 14

Stochastic Beam Search

We only need to expand the top $k$ nodes at each level in the tree

This is a beam search

Top $k$ according to perturbed log-probability = Sampling (without replacement)

Gumbel-Top-k trick

[Figure: partially expanded tree over the tokens A and C; the node AA is not expanded; leaves CAA, ACC, ACA]
slide-15
SLIDE 15
Stochastic Beam Search

  • A beam search that samples the nodes to expand
  • But… samples children conditionally on parent
  • The result is a sample without replacement from the full sequence model
  • Is a generalization of ancestral sampling ($k = 1$)

Important!
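Putting the pieces together, a beam search that ranks candidates by perturbed log-probability, and samples each node's children conditionally on the parent's perturbed value, might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the toy model `next_token_log_probs`, its probabilities, and the helper `shift_to_max` are hypothetical, and the arithmetic is not numerically stabilized.

```python
import numpy as np

VOCAB = ["A", "C"]

def next_token_log_probs(prefix):
    """Hypothetical toy model; the probabilities are made up for illustration."""
    probs = [0.6, 0.4] if prefix.endswith("A") else [0.3, 0.7]
    return np.log(probs)

def shift_to_max(g, target_max):
    """Condition independent Gumbels `g` on their maximum being `target_max`."""
    z = g.max()
    return -np.log(np.exp(-target_max) - np.exp(-z) + np.exp(-g))

def stochastic_beam_search(k, length, rng):
    # Each beam entry: (prefix, log-probability phi, perturbed log-probability G).
    # The root has probability 1, so phi = 0 and G ~ Gumbel(0).
    beam = [("", 0.0, rng.gumbel())]
    for _ in range(length):
        candidates = []
        for prefix, phi, g_parent in beam:
            child_phi = phi + next_token_log_probs(prefix)
            # Sample the children's perturbed values conditionally on the parent.
            g = shift_to_max(child_phi + rng.gumbel(size=len(VOCAB)), g_parent)
            candidates += [(prefix + tok, cp, cg)
                           for tok, cp, cg in zip(VOCAB, child_phi, g)]
        # Keep the top k by perturbed (not plain) log-probability.
        beam = sorted(candidates, key=lambda c: c[2], reverse=True)[:k]
    return beam

rng = np.random.default_rng(0)
samples = stochastic_beam_search(k=3, length=3, rng=rng)
for seq, phi, g in samples:
    print(seq, round(float(np.exp(phi)), 3))
```

With `k=1` the beam holds a single sequence and the procedure reduces to ancestral sampling; with larger `k` the returned sequences are always distinct.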

slide-16
SLIDE 16

Experiments

slide-17
SLIDE 17
Translation Diversity

  • Generate $k$ translations
  • Plot BLEU against diversity
  • Vary softmax temperature
  • Compare:
      • Beam Search
      • Stochastic Beam Search
      • Sampling
      • Diverse Beam Search (Vijayakumar et al., 2018)

slide-18
SLIDE 18

BLEU Score Estimation

  • Estimate expected sentence-level BLEU
  • Plot mean and 95% interval vs. number of samples
  • Compare:
      • Monte Carlo Sampling
      • Stochastic Beam Search with (normalized) Importance Weighted estimator
      • Beam Search with deterministic estimate
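The slide does not spell out the (normalized) importance-weighted estimator. The sketch below assumes one common form: each of the $k$ samples is weighted by $p(s)/q(s)$, where $q(s) = 1 - \exp(-\exp(\phi_s - \kappa))$ is the probability that a Gumbel located at $\phi_s$ exceeds the threshold $\kappa$, taken as the $(k+1)$-th largest perturbed log-probability, and the weights are then normalized to sum to one. This form, the function name `normalized_iw_estimate`, and the toy distribution are assumptions for illustration; a flat categorical distribution stands in for the sequence model, and `f` stands in for a sentence-level score such as BLEU.

```python
import numpy as np

def normalized_iw_estimate(log_p, f, k, rng):
    """Estimate E_p[f] from k samples drawn without replacement (Gumbel-Top-k)."""
    perturbed = log_p + rng.gumbel(size=log_p.shape)
    order = np.argsort(-perturbed)
    chosen = order[:k]
    kappa = perturbed[order[k]]   # assumed threshold: (k+1)-th largest value
    # Assumed inclusion probability P(G > kappa) for a Gumbel located at log_p.
    q = 1.0 - np.exp(-np.exp(log_p[chosen] - kappa))
    w = np.exp(log_p[chosen]) / q
    return float(np.sum(w * f[chosen]) / np.sum(w))   # normalized weights

rng = np.random.default_rng(0)
p = np.array([0.4, 0.3, 0.15, 0.1, 0.05])   # illustrative distribution
f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # illustrative scores
estimates = [normalized_iw_estimate(np.log(p), f, k=3, rng=rng) for _ in range(5000)]
print(np.mean(estimates), np.dot(p, f))
```

Because the weights are normalized, each estimate is a weighted average of the sampled scores and therefore stays within the range of `f`; normalization trades a small bias for lower variance.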

slide-19
SLIDE 19

STOCHASTIC BEAMS AND WHERE TO FIND THE POSTER?

#41

Wouter Kool, Herke van Hoof, Max Welling