Advanced Search Algorithms - Graham Neubig (PowerPoint PPT Presentation)


SLIDE 1

CS11-747 Neural Networks for NLP

Advanced Search Algorithms

Graham Neubig https://phontron.com/class/nn4nlp2020/

(Some Slides by Daniel Clothiaux)

SLIDE 2

The Generation Problem

  • We have a model of P(Y|X); how do we use it to generate a sentence?
  • Two methods:
  • Sampling: try to generate a random sentence according to the probability distribution.
  • Argmax: try to generate the sentence with the highest probability.
SLIDE 3

Which to Use?

  • We want the best possible single output → Search
  • We want to observe multiple outputs according to the probability distribution → Sampling
  • We want to generate diverse outputs so that we are not boring → Sampling? Search?
SLIDE 4

Sampling

SLIDE 5

Ancestral Sampling

  • Randomly generate words one by one:

    while y_{j-1} != "</s>": y_j ~ P(y_j | X, y_1, …, y_{j-1})

  • An exact method for sampling from P(X), no further work needed.
  • Any other sampling method is not an appropriate way of visualizing/understanding the underlying distribution.
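The sampling loop above can be sketched in Python; `next_word_probs` is a hypothetical toy stand-in for the model's conditional distribution (a real model would condition on the source X and the full prefix):

```python
import random

def next_word_probs(prefix):
    # Hypothetical stand-in for P(y_j | X, y_1, ..., y_{j-1}).
    if not prefix:
        return {"hello": 0.7, "world": 0.25, "</s>": 0.05}
    return {"hello": 0.1, "world": 0.3, "</s>": 0.6}

def ancestral_sample(max_len=20):
    # Draw words one by one from the model until "</s>" is generated.
    y = []
    while not y or y[-1] != "</s>":
        probs = next_word_probs(y)
        words, weights = zip(*probs.items())
        y.append(random.choices(words, weights=weights)[0])
        if len(y) >= max_len:  # safety cap for the sketch
            break
    return y

print(ancestral_sample())
```

Because each word is drawn from the model's own conditional, the resulting sentences are exact samples from the joint distribution.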

SLIDE 6

Search Basics

SLIDE 7

Why do we Search?

  • We want to find the best output
  • What is "best"?
  • The most accurate output:

    Ŷ = argmin_Ỹ error(Y, Ỹ)

    → impossible! we don't know the reference Y
  • The most probable output according to the model:

    Ŷ = argmax_Ỹ P(Ỹ | X)

    → simple, but not necessarily tied to accuracy
  • The output with the lowest Bayes risk:

    Ŷ = argmin_Ỹ Σ_{Y'} P(Y' | X) error(Y', Ỹ)

    → which output looks like it has the lowest error?
SLIDE 8

Search Errors, Model Errors

(example from Neubig (2015))

  • Search error: the search algorithm fails to find an output that optimizes its search criterion
  • Model error: the output that optimizes the search criterion does not optimize accuracy

SLIDE 9

Searching Probable Outputs

SLIDE 10

Greedy Search

  • One by one, pick the single highest-probability word:

    while y_{j-1} != "</s>": y_j = argmax P(y_j | X, y_1, …, y_{j-1})

  • Not exact, real problems:
  • Will often generate the "easy" words first
  • Will prefer multiple common words to one rare word
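The greedy loop can be sketched with a hypothetical toy distribution standing in for the model:

```python
def next_word_probs(prefix):
    # Hypothetical toy distribution standing in for P(y_j | X, prefix).
    if not prefix:
        return {"hello": 0.7, "world": 0.25, "</s>": 0.05}
    return {"hello": 0.1, "world": 0.3, "</s>": 0.6}

def greedy_decode(max_len=20):
    # At each step keep only the single most probable word (not exact search).
    y = []
    while (not y or y[-1] != "</s>") and len(y) < max_len:
        probs = next_word_probs(y)
        y.append(max(probs, key=probs.get))
    return y

print(greedy_decode())  # deterministic: no randomness involved
```

Unlike sampling, greedy decoding is deterministic, but it can commit early to a locally probable word and miss a globally better sequence.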

SLIDE 11

Why will this Help

  Next word     P(next word)
  Pittsburgh    0.4
  New York      0.3
  New Jersey    0.25
  Other         0.05

SLIDE 12

Beam Search

  • Instead of picking the single highest probability/score, maintain multiple paths
  • At each time step:
  • Expand each path
  • Choose a subset of paths from the expanded set

SLIDE 13
Basic Pruning Methods

(Steinbiss et al. 1994)

  • How to select which paths to keep expanding?
  • Histogram Pruning: keep exactly k hypotheses at every time step
  • Score Threshold Pruning: keep all hypotheses whose score is within a threshold α of the best score s_1, i.e. s_n + α > s_1
  • Probability Mass Pruning: keep all hypotheses up until probability mass α
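Beam search with histogram pruning can be sketched as follows, again with a hypothetical toy `next_word_probs` as the model; log-probabilities are summed so that pruning compares full-path scores:

```python
import math

def next_word_probs(prefix):
    # Hypothetical toy distribution standing in for the model.
    if not prefix:
        return {"hello": 0.7, "world": 0.25, "</s>": 0.05}
    return {"hello": 0.1, "world": 0.3, "</s>": 0.6}

def beam_search(beam_size=2, max_len=10):
    # Histogram pruning: keep exactly beam_size hypotheses per time step.
    beam = [([], 0.0)]  # (prefix, log-probability)
    finished = []
    for _ in range(max_len):
        expanded = []
        for prefix, lp in beam:
            for w, p in next_word_probs(prefix).items():
                hyp = (prefix + [w], lp + math.log(p))
                (finished if w == "</s>" else expanded).append(hyp)
        if not expanded:
            break
        beam = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam_size]
    finished.extend(beam)  # include unfinished hypotheses at the length cap
    return max(finished, key=lambda h: h[1])[0]

print(beam_search())
```

Score-threshold or probability-mass pruning would replace the fixed `[:beam_size]` cut with a filter on the sorted `expanded` list.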

SLIDE 14

What beam size should I use?

  • Larger beam sizes will be slower
  • May not give better results due to model errors
  • Sometimes result in shorter sequences
  • May favor high-frequency words
  • Mostly done empirically → experiment (range of 5-100 for histogram pruning?)

SLIDE 15

Problems w/ Disparate Search Difficulty

  • Sometimes need to cover specific content, some easy some hard:

    I saw the escarpment
    watashi mita dangai? zeppeki? kyushamen? iwa?

  • Can cause the search algorithm to select the easy thing first, then the hard thing later:

    watashi ga mita dangai (the escarpment I saw)
    watashi wa dangai wo mita (I saw the escarpment)

SLIDE 16

Future Cost

  • Also predict how hard it will be to process as-of-yet-unprocessed words, and search for the maximum of the sum f(n) = g(n) + h(n)
  • g(n): cost to current point
  • h(n): estimated cost to goal
  • See Koehn (2010, Chapter 6), or Li et al. (2017) for a neural approximation
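f(n) = g(n) + h(n) is the classic A* scoring rule; here is a generic sketch over a hypothetical toy graph, not the neural approximation cited above (with h = 0 it degenerates to uniform-cost search, while a decoder would plug in an estimated future cost):

```python
import heapq

def a_star(start, goal, neighbors, h):
    # Expand nodes in order of f(n) = g(n) + h(n), where g(n) is the
    # cost accumulated so far and h(n) estimates the cost still to go.
    frontier = [(h(start), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None, float("inf")

# Toy graph: the locally cheap first step A->B leads to an expensive
# finish, while A->C->D is cheaper overall.
GRAPH = {"A": [("B", 1.0), ("C", 2.0)], "B": [("D", 5.0)], "C": [("D", 1.0)], "D": []}
path, cost = a_star("A", "D", lambda n: GRAPH[n], h=lambda n: 0.0)
print(path, cost)
```

The toy graph mirrors the disparate-difficulty problem on the previous slide: choosing the easy thing first (A→B) is not globally optimal.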

SLIDE 17

Search and Problems with Modeling

SLIDE 18

Better Search can Hurt Results!

(Koehn and Knowles 2017)

  • Better search (= better model score) can result in worse BLEU score!
  • Why? Model errors!
SLIDE 19

How to Fix Model Errors?

  • Train the model to maximize accuracy / minimize risk (best! covered previously)
  • Change the decision rule to minimize risk (best!)
  • Heuristically modify the model score post-hoc (OK)
  • Hobble the search algorithm so it makes more search errors, but the kind of errors you want (meh)

SLIDE 20

Minimum Bayes Risk Decoding

SLIDE 21

Basic Concept

  • We want outputs that look "safe" given all the other high-probability outputs:

    p=0.30  I don't know
    p=0.20  My name is Graham
    p=0.18  My name is Graham Neubig
    p=0.17  My name is Neubig
    ...

    (the "My name is ..." variants are higher in aggregate)

  • Operationalized as searching for the hypothesis that minimizes risk:

    Ŷ = argmin_Ỹ Σ_{Y'} P(Y' | X) error(Y', Ỹ)
SLIDE 22

Minimum Bayes Risk Reranking

  • Create n-best list
  • Create error matrix and probability vector:

    E_{i,j} = error(Y_i, Y_j)
    p_i = P(Y_i | X)

  • Multiply to get the risk: r = E p
  • Find the element with lowest risk
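The reranking steps above can be sketched directly, using the slide's example n-best list; the Jaccard token distance is an assumed stand-in for the error function (a real system might use 1 − BLEU):

```python
def jaccard_error(a, b):
    # Assumed error function for the sketch: 1 - token-set overlap.
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def mbr_rerank(nbest, probs, error=jaccard_error):
    # risk_i = sum_j p_j * error(Y_j, Y_i); in matrix form r = E p.
    risks = [sum(p * error(yj, yi) for yj, p in zip(nbest, probs))
             for yi in nbest]
    return min(zip(nbest, risks), key=lambda t: t[1])[0]

nbest = ["I don't know", "My name is Graham",
         "My name is Graham Neubig", "My name is Neubig"]
probs = [0.3, 0.2, 0.18, 0.17]
print(mbr_rerank(nbest, probs))
```

Note that the single most probable hypothesis ("I don't know", p=0.3) loses: the "My name is ..." cluster shares mass, so its members have lower expected error.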
SLIDE 23

Improving Diversity in top N Choices

(Li et al., 2016)

  • Entries in the beam can be very similar
  • Improving the diversity of the top N list can help
  • Score using source→target and target→source translation models, plus a language model

SLIDE 24

Sampling without Replacement

(Kool et al. 2019)

  • Ancestral sampling samples hypotheses with replacement; how can we do it without replacement?
  • Gumbel distribution: if U is Uniform(0,1), then G(φ) = φ − log(−log U)
  • Perturbing log probabilities with Gumbel noise and finding the largest elements = sampling from a categorical distribution without replacement
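The perturb-and-take-top-k trick can be sketched as follows; the vocabulary and probabilities are made up, and U is clamped away from 0 to keep the logs finite:

```python
import math
import random

def gumbel_top_k(log_probs, k, rng=random):
    # Perturb each log-probability phi with Gumbel noise,
    # G(phi) = phi - log(-log U) with U ~ Uniform(0,1); the k largest
    # perturbed scores are k distinct items, i.e. a sample without
    # replacement from the categorical distribution.
    perturbed = {
        w: lp - math.log(-math.log(max(rng.random(), 1e-12)))
        for w, lp in log_probs.items()
    }
    return sorted(perturbed, key=perturbed.get, reverse=True)[:k]

log_probs = {w: math.log(p) for w, p in
             {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}.items()}
print(gumbel_top_k(log_probs, k=2))
```

Taking only the single largest perturbed score recovers ordinary (with-replacement) sampling from the categorical.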

SLIDE 25

Heuristic Modifications to Model Score

SLIDE 26

A Typical Model Error: Length Bias

  • In many tasks (e.g. MT), the output sequences will be of variable length
  • Maximum likelihood training + local normalization results in gradually decreasing probability
  • Running beam search may then favor short sentences

SLIDE 27

Length Normalization

  • Normalize by the length, dividing by |Y| (Cho et al. 2014)
  • More complicated heuristics (Wu et al. 2016)
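In score terms, length normalization divides the summed log-probability by |Y| (or, in more elaborate heuristics, by a power of a length term); a tiny sketch with made-up scores:

```python
def length_normalized(log_prob, length, alpha=1.0):
    # alpha=1.0 is plain length normalization (divide by |Y|);
    # other exponents are heuristic variants in the spirit of
    # the more complicated penalties above.
    return log_prob / (length ** alpha)

# Made-up scores: raw log-probability favors the short hypothesis...
short_lp, short_len = -2.0, 2
long_lp, long_len = -4.0, 5
print(short_lp > long_lp)                                   # raw comparison
print(length_normalized(long_lp, long_len)
      > length_normalized(short_lp, short_len))             # per-word comparison
```

The per-word score (-0.8 vs. -1.0) now prefers the longer hypothesis, counteracting the length bias.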
SLIDE 28

Predict the output length

(Eriguchi et al. 2016)

  • Add a penalty based on length differences between sentences
  • Calculate P(len(y) | len(x)) using corpus statistics
SLIDE 29

Hobbled Search Algorithms

SLIDE 30

Remember Limited Beam Search Can "Help"

  • How else can we modify our search algorithm?
SLIDE 31

Limited Sampling

  • Top-k sampling: like beam search with histogram pruning, but sample from the top k instead of enumerating them
  • Nucleus sampling: like beam search with probability mass pruning, but sample from the remaining hypotheses (Holtzman et al. 2020)
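Both variants can be sketched in one function; `dist` and the parameter names are illustrative, not taken from the cited papers:

```python
import random

def limited_sample(probs, k=None, p=None, rng=random):
    # top-k: restrict to the k most probable words;
    # nucleus (top-p): restrict to the smallest set of words whose
    # cumulative probability reaches p; then sample from what remains
    # (random.choices renormalizes the weights implicitly).
    items = sorted(probs.items(), key=lambda t: t[1], reverse=True)
    if k is not None:
        items = items[:k]
    if p is not None:
        kept, mass = [], 0.0
        for word, prob in items:
            kept.append((word, prob))
            mass += prob
            if mass >= p:
                break
        items = kept
    words, weights = zip(*items)
    return rng.choices(words, weights=weights)[0]

dist = {"the": 0.5, "a": 0.3, "dog": 0.15, "zyzzyva": 0.05}
print(limited_sample(dist, k=3))  # never samples the long tail ("zyzzyva")
```

Truncating the tail trades a little diversity for protection against sampling very unlikely words, which is exactly the "hobbled search" idea: deliberately allowed search errors of a useful kind.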

SLIDE 32

Cautions about Sampling-based Search

  • Is sampling necessary for diversity? Questionable; we could do diverse beam search instead.
  • Results are inconsistent from run to run: need to consider variance from this in reporting (in addition to variance in training and data selection)
  • Conflates model and search errors: if you make a better model you might get worse results, because the search algorithm can't find the outputs your model likes

SLIDE 33

Search in Training

SLIDE 34

Using Beam Search in Training

(Wiseman et al., 2016)

  • Decoding with beam search has biases:
  • Exposure bias: the model is not exposed to its errors during training
  • Label bias: scores are locally normalized
  • Possible solution: train with beam search
SLIDE 35

Continuous Beam Search

(Goyal et al., 2017)

SLIDE 36

Actor Critic

(Bahdanau et al., 2017)

  • Basic idea:
  • Use a neural model as an actor that predicts actions (say, the next word)
  • Use a critic to predict the final reward (in this case, BLEU) for MT models
  • Actor trained similarly to REINFORCE, critic trained with TD

SLIDE 37

Actor Critic (continued)

  • T is the sequence, M is the set of examples, a the potential next actions, and Q the reward (the actor and critic training objectives appear as equations on the original slide)
  • C is a measure of reward over average reward, similar to REINFORCE-style algorithms

SLIDE 38

Questions?