SLIDE 1

Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits

Branislav Kveton, Google Research; Csaba Szepesvári, DeepMind and University of Alberta; Sharan Vaswani, Mila, University of Montreal; Zheng Wen, Adobe Research; Mohammad Ghavamzadeh, Facebook AI Research; Tor Lattimore, DeepMind

SLIDE 2
Stochastic Multi-Armed Bandit

  • A learning agent sequentially pulls one of K arms over n rounds
  • The agent pulls arm It in round t ∈ [n] and observes its reward
  • The reward of arm i lies in [0, 1] and is drawn i.i.d. from a distribution with mean μi
  • Goal: Maximize the expected n-round reward
  • Challenge: Exploration-exploitation trade-off (a minimal code sketch of this protocol follows below)

[Figure: K arms, Arm 1 through Arm K]
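To make the protocol concrete, here is a minimal Python sketch of this setting. It is a hypothetical illustration, not code from the talk; the means in `mu` are arbitrary and would be unknown to the agent, and Bernoulli rewards are the simplest case of the [0, 1] rewards above:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])  # unknown arm means (illustrative values)
K, n = len(mu), 1000            # K arms, n rounds

def pull(i):
    """Pull arm i; rewards are in {0, 1}, drawn i.i.d. with mean mu[i]."""
    return float(rng.random() < mu[i])

# A bandit policy chooses I_t in each round from the history of past pulls
# and rewards; the goal is to maximize the expected sum of the n rewards.
```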

SLIDE 3

Thompson Sampling (Thompson, 1933)

  • Sample μi,t from posterior distribution Pi,t and pull arm It = argmaxi μi,t

[Figure: posteriors P1,t and P2,t over expected reward, centered near μ1 and μ2]

SLIDE 4

Thompson Sampling (Thompson, 1933)

  • Sample μi,t from posterior distribution Pi,t and pull arm It = argmaxi μi,t
  • Key properties

○ Pi,t concentrates at μi as the number of pulls grows

[Figure: posteriors P1,t and P2,t over expected reward, centered near μ1 and μ2]

SLIDE 5

Thompson Sampling (Thompson, 1933)

  • Sample μi,t from posterior distribution Pi,t and pull arm It = argmaxi μi,t
  • Key properties

○ Pi,t concentrates at μi as the number of pulls grows
○ μi,t overestimates μi with sufficient probability

[Figure: posteriors P1,t and P2,t over expected reward, centered near μ1 and μ2]

SLIDE 6

Thompson Sampling (Thompson, 1933)

  • Sample μi,t from posterior distribution Pi,t and pull arm It = argmaxi μi,t
  • Key properties

○ Pi,t concentrates at μi as the number of pulls grows
○ μi,t overestimates μi with sufficient probability

Bernoulli bandit: Pi,t = Beta (see the sketch below)
Gaussian bandit: Pi,t = Normal
Neural network: Pi,t = ???

[Figure: posteriors P1,t and P2,t over expected reward, centered near μ1 and μ2]
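For the Bernoulli bandit, where Pi,t is a Beta posterior, Thompson sampling fits in a few lines. Here is a minimal sketch under the setup above, assuming Beta(1, 1) priors (an assumption; the slides do not specify priors):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])        # unknown arm means (illustrative)
K, n = len(mu), 1000

alpha, beta = np.ones(K), np.ones(K)  # Beta(1, 1) posterior per arm

for t in range(n):
    mu_t = rng.beta(alpha, beta)      # sample mu_{i,t} from P_{i,t}
    It = int(np.argmax(mu_t))         # pull I_t = argmax_i mu_{i,t}
    r = float(rng.random() < mu[It])  # observe the reward of arm I_t
    alpha[It] += r                    # posterior update: P_{i,t} concentrates
    beta[It] += 1.0 - r               #   at mu_i as arm i accumulates pulls
```

Both key properties are visible here: the Beta posterior tightens around μi as arm i is pulled, and a fresh sample μi,t exceeds μi often enough to keep the agent exploring.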

SLIDE 7

General Randomized Exploration

  • Sample μi,t from posterior distribution Pi,t and pull arm It = argmaxi μi,t
  • Key properties

○ Pi,t concentrates at (scaled and shifted) μi as the number of pulls grows
○ μi,t overestimates (scaled and shifted) μi with sufficient probability

How do we design the distribution Pi,t?

[Figure: posteriors P1,t and P2,t over expected reward, centered near μ1 and μ2]

SLIDE 8
Giro (Garbage In, Reward Out) with [0, 1] Rewards

  • μi,t is the mean of a non-parametric bootstrap sample of the history of arm i with pseudo-rewards (garbage)

SLIDE 9
Giro (Garbage In, Reward Out) with [0, 1] Rewards

  • μi,t is the mean of a non-parametric bootstrap sample of the history of arm i with pseudo-rewards (garbage)

[Figure: the observed reward history of Arm 1 and Arm 2]

SLIDE 10
Giro (Garbage In, Reward Out) with [0, 1] Rewards

  • μi,t is the mean of a non-parametric bootstrap sample of the history of arm i with pseudo-rewards (garbage)

[Figure: the history of Arm 1 and Arm 2 augmented with pseudo-rewards (garbage)]

SLIDE 11
Giro (Garbage In, Reward Out) with [0, 1] Rewards

  • μi,t is the mean of a non-parametric bootstrap sample of the history of arm i with pseudo-rewards (garbage)

[Figure: history, garbage, and bootstrap sample for Arm 1 and Arm 2; the resulting bootstrap means are μ1,t = 2/3 and μ2,t = 5/9]

SLIDE 12
Giro (Garbage In, Reward Out) with [0, 1] Rewards

  • μi,t is the mean of a non-parametric bootstrap sample of the history of arm i with pseudo-rewards (garbage); see the sketch below
  • Benefits and challenges of randomized garbage

○ μi,t overestimates the scaled and shifted μi with sufficient probability
○ Bias in the estimate of μi

[Figure: history, garbage, and bootstrap sample for Arm 1 and Arm 2; the resulting bootstrap means are μ1,t = 2/3 and μ2,t = 5/9]
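Reading the last few slides as code gives the following minimal sketch of Bernoulli Giro (a hypothetical illustration; here each observed reward is matched by one pseudo-reward of 0 and one of 1, i.e. a = 1 pseudo-reward pairs per observation):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (illustrative)
K, n, a = len(mu), 1000, 1        # a pseudo-reward pairs per observation

history = [[] for _ in range(K)]  # observed rewards of each arm

for t in range(n):
    mu_t = np.full(K, np.inf)     # unpulled arms are tried first
    for i in range(K):
        s = len(history[i])
        if s == 0:
            continue
        # Garbage in: augment the history with a*s zeros and a*s ones,
        # then resample it with replacement (non-parametric bootstrap).
        augmented = np.array(history[i] + [0.0] * (a * s) + [1.0] * (a * s))
        boot = rng.choice(augmented, size=augmented.size, replace=True)
        mu_t[i] = boot.mean()     # reward out: the bootstrap mean
    It = int(np.argmax(mu_t))
    history[It].append(float(rng.random() < mu[It]))
```

The garbage is both the benefit and the bias: the augmented history of arm i has mean (μi + a) / (2a + 1), so μi,t concentrates at a scaled and shifted μi (with a = 1, at (μi + 1) / 3) rather than at μi itself. The shift preserves the ordering of the arms while guaranteeing that the bootstrap distribution never collapses, even for near-deterministic rewards.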

SLIDE 13

Contextual Giro with [0, 1] Rewards

  • Straightforward generalization to complex structured problems
  • μi,t is the estimated reward of arm i under a model trained on a non-parametric bootstrap sample of the history with pseudo-rewards (garbage); see the sketch below
  • Giro is as general as the ε-greedy policy... but with no tuning!

[Figure: the history of (context, reward) pairs, augmented with pseudo-rewards, is bootstrapped; μi,t is the estimate from a model learned on the sample]
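A minimal sketch of this step for a single arm (hypothetical code; the slides do not fix a model class, so ridge regression stands in for the learned model):

```python
import numpy as np
from sklearn.linear_model import Ridge

def giro_estimate(contexts, rewards, x_t, rng, a=1):
    """Return mu_{i,t} for one arm: fit a model on a bootstrap resample
    of the arm's (context, reward) history augmented with pseudo-rewards.

    contexts: (s, d) array of past contexts of this arm (s >= 1)
    rewards:  (s,) array of the corresponding [0, 1] rewards
    x_t:      (d,) current context
    """
    s = len(rewards)
    # Garbage in: each past context reappears a times with reward 0 and
    # a times with reward 1.
    X = np.vstack([contexts] * (2 * a + 1))
    y = np.concatenate([rewards, np.zeros(a * s), np.ones(a * s)])
    idx = rng.integers(len(y), size=len(y))  # non-parametric bootstrap
    model = Ridge().fit(X[idx], y[idx])      # any fit/predict model works
    return float(model.predict(x_t.reshape(1, -1))[0])
```

In each round this estimate would be computed once per arm, and the arm with the highest μi,t is pulled, exactly as in the tabular case.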

SLIDE 14

See you at poster #125!

How do you run bandits with neural networks easily?
How does Giro compare to Thompson sampling?