SLIDE 1

Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits

Branislav Kveton, Google Research; Csaba Szepesvári, DeepMind and University of Alberta; Sharan Vaswani, Mila, University of Montreal; Zheng Wen, Adobe Research; Mohammad Ghavamzadeh, Facebook AI Research; Tor Lattimore, DeepMind

SLIDE 2
Stochastic Multi-Armed Bandit

  • A learning agent sequentially pulls one of K arms over n rounds
  • The agent pulls arm It in round t ∈ [n] and observes its reward
  • The reward of arm i lies in [0, 1] and is drawn i.i.d. from a distribution with mean μi
  • Goal: Maximize the expected n-round reward
  • Challenge: Exploration-exploitation trade-off (a minimal code sketch of this protocol follows below)

[Figure: K arms, Arm 1 through Arm K]
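To make the protocol concrete, here is a minimal Python sketch of this setting. It is a hypothetical illustration, not code from the talk; the means in `mu` are arbitrary and would be unknown to the agent, and Bernoulli rewards are the simplest case of the [0, 1] rewards above:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])  # unknown arm means (illustrative values)
K, n = len(mu), 1000            # K arms, n rounds

def pull(i):
    """Pull arm i; rewards are in {0, 1}, drawn i.i.d. with mean mu[i]."""
    return float(rng.random() < mu[i])

# A bandit policy chooses I_t in each round from the history of past pulls
# and rewards; the goal is to maximize the expected sum of the n rewards.
```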

SLIDE 3

Thompson Sampling (Thompson, 1933)

  • Sample μi,t from posterior distribution Pi,t and pull arm It = argmaxi μi,t

[Figure: posteriors P1,t and P2,t over expected reward, centered near μ1 and μ2]

SLIDE 4

Thompson Sampling (Thompson, 1933)

  • Sample μi,t from posterior distribution Pi,t and pull arm It = argmaxi μi,t
  • Key properties

○ Pi,t concentrates at μi as the number of pulls grows

[Figure: posteriors P1,t and P2,t over expected reward, centered near μ1 and μ2]

SLIDE 5

Thompson Sampling (Thompson, 1933)

  • Sample μi,t from posterior distribution Pi,t and pull arm It = argmaxi μi,t
  • Key properties

○ Pi,t concentrates at μi as the number of pulls grows
○ μi,t overestimates μi with sufficient probability

[Figure: posteriors P1,t and P2,t over expected reward, centered near μ1 and μ2]

SLIDE 6

Thompson Sampling (Thompson, 1933)

  • Sample μi,t from posterior distribution Pi,t and pull arm It = argmaxi μi,t
  • Key properties

○ Pi,t concentrates at μi as the number of pulls grows
○ μi,t overestimates μi with sufficient probability

Bernoulli bandit: Pi,t = Beta (see the sketch below)
Gaussian bandit: Pi,t = Normal
Neural network: Pi,t = ???

[Figure: posteriors P1,t and P2,t over expected reward, centered near μ1 and μ2]
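For the Bernoulli bandit, where Pi,t is a Beta posterior, Thompson sampling fits in a few lines. Here is a minimal sketch under the setup above, assuming Beta(1, 1) priors (an assumption; the slides do not specify priors):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])        # unknown arm means (illustrative)
K, n = len(mu), 1000

alpha, beta = np.ones(K), np.ones(K)  # Beta(1, 1) posterior per arm

for t in range(n):
    mu_t = rng.beta(alpha, beta)      # sample mu_{i,t} from P_{i,t}
    It = int(np.argmax(mu_t))         # pull I_t = argmax_i mu_{i,t}
    r = float(rng.random() < mu[It])  # observe the reward of arm I_t
    alpha[It] += r                    # posterior update: P_{i,t} concentrates
    beta[It] += 1.0 - r               #   at mu_i as arm i accumulates pulls
```

Both key properties are visible here: the Beta posterior tightens around μi as arm i is pulled, and a fresh sample μi,t exceeds μi often enough to keep the agent exploring.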

SLIDE 7

General Randomized Exploration

  • Sample μi,t from posterior distribution Pi,t and pull arm It = argmaxi μi,t
  • Key properties

○ Pi,t concentrates at (scaled and shifted) μi as the number of pulls grows
○ μi,t overestimates (scaled and shifted) μi with sufficient probability

How do we design the distribution Pi,t?

[Figure: posteriors P1,t and P2,t over expected reward, centered near μ1 and μ2]

SLIDE 8
Giro (Garbage In, Reward Out) with [0, 1] Rewards

  • μi,t is the mean of a non-parametric bootstrap sample of the history of arm i with pseudo-rewards (garbage)

SLIDE 9
Giro (Garbage In, Reward Out) with [0, 1] Rewards

  • μi,t is the mean of a non-parametric bootstrap sample of the history of arm i with pseudo-rewards (garbage)

[Figure: the observed reward history of Arm 1 and Arm 2]

SLIDE 10
Giro (Garbage In, Reward Out) with [0, 1] Rewards

  • μi,t is the mean of a non-parametric bootstrap sample of the history of arm i with pseudo-rewards (garbage)

[Figure: the history of Arm 1 and Arm 2 augmented with pseudo-rewards (garbage)]

SLIDE 11
Giro (Garbage In, Reward Out) with [0, 1] Rewards

  • μi,t is the mean of a non-parametric bootstrap sample of the history of arm i with pseudo-rewards (garbage)

[Figure: history, garbage, and bootstrap sample for Arm 1 and Arm 2; the resulting bootstrap means are μ1,t = 2/3 and μ2,t = 5/9]

SLIDE 12
Giro (Garbage In, Reward Out) with [0, 1] Rewards

  • μi,t is the mean of a non-parametric bootstrap sample of the history of arm i with pseudo-rewards (garbage); see the sketch below
  • Benefits and challenges of randomized garbage

○ μi,t overestimates the scaled and shifted μi with sufficient probability
○ Bias in the estimate of μi

[Figure: history, garbage, and bootstrap sample for Arm 1 and Arm 2; the resulting bootstrap means are μ1,t = 2/3 and μ2,t = 5/9]
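Reading the last few slides as code gives the following minimal sketch of Bernoulli Giro (a hypothetical illustration; here each observed reward is matched by one pseudo-reward of 0 and one of 1, i.e. a = 1 pseudo-reward pairs per observation):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (illustrative)
K, n, a = len(mu), 1000, 1        # a pseudo-reward pairs per observation

history = [[] for _ in range(K)]  # observed rewards of each arm

for t in range(n):
    mu_t = np.full(K, np.inf)     # unpulled arms are tried first
    for i in range(K):
        s = len(history[i])
        if s == 0:
            continue
        # Garbage in: augment the history with a*s zeros and a*s ones,
        # then resample it with replacement (non-parametric bootstrap).
        augmented = np.array(history[i] + [0.0] * (a * s) + [1.0] * (a * s))
        boot = rng.choice(augmented, size=augmented.size, replace=True)
        mu_t[i] = boot.mean()     # reward out: the bootstrap mean
    It = int(np.argmax(mu_t))
    history[It].append(float(rng.random() < mu[It]))
```

The garbage is both the benefit and the bias: the augmented history of arm i has mean (μi + a) / (2a + 1), so μi,t concentrates at a scaled and shifted μi (with a = 1, at (μi + 1) / 3) rather than at μi itself. The shift preserves the ordering of the arms while guaranteeing that the bootstrap distribution never collapses, even for near-deterministic rewards.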

SLIDE 13

Contextual Giro with [0, 1] Rewards

  • Straightforward generalization to complex structured problems
  • μi,t is the estimated reward of arm i under a model trained on a non-parametric bootstrap sample of the history with pseudo-rewards (garbage); see the sketch below
  • Giro is as general as the ε-greedy policy... but with no tuning!

[Figure: the history of (context, reward) pairs, augmented with pseudo-rewards, is bootstrapped; μi,t is the estimate from a model learned on the sample]
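A minimal sketch of this step for a single arm (hypothetical code; the slides do not fix a model class, so ridge regression stands in for the learned model):

```python
import numpy as np
from sklearn.linear_model import Ridge

def giro_estimate(contexts, rewards, x_t, rng, a=1):
    """Return mu_{i,t} for one arm: fit a model on a bootstrap resample
    of the arm's (context, reward) history augmented with pseudo-rewards.

    contexts: (s, d) array of past contexts of this arm (s >= 1)
    rewards:  (s,) array of the corresponding [0, 1] rewards
    x_t:      (d,) current context
    """
    s = len(rewards)
    # Garbage in: each past context reappears a times with reward 0 and
    # a times with reward 1.
    X = np.vstack([contexts] * (2 * a + 1))
    y = np.concatenate([rewards, np.zeros(a * s), np.ones(a * s)])
    idx = rng.integers(len(y), size=len(y))  # non-parametric bootstrap
    model = Ridge().fit(X[idx], y[idx])      # any fit/predict model works
    return float(model.predict(x_t.reshape(1, -1))[0])
```

In each round this estimate would be computed once per arm, and the arm with the highest μi,t is pulled, exactly as in the tabular case.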

SLIDE 14

See you at poster #125!

How do you run bandits with neural networks easily?
How does Giro compare to Thompson sampling?