Bandits and Exploration: How do we (optimally) gather information?

SLIDE 1

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

Machine Learning for Big Data CSE547/STAT548 University of Washington

S. M. Kakade (UW), Optimization for Big data, 1 / 18

SLIDE 2

Announcements...

HW 4 posted soon (short)
Poster session: June 1, 9-11:30a; ask TA/CSE students for help printing
Projects: the term is approaching the end...

Today:
Quick overview: parallelization and deep learning
Bandits:
1. Vanilla k-arm setting
2. Linear bandits and ad-placement
3. Game trees?

SLIDE 3

The problem

In unsupervised learning, we just have data...
In supervised learning, we have inputs X and labels Y (often we spend resources to get these labels).
In reinforcement learning (very general), we act in the world, there is "state", and we observe rewards.
Bandit settings: we have K decisions each round, and we only receive feedback for the chosen decision...

SLIDE 4

Gambling in a casino...

SLIDE 5

Multi-Armed Bandit Game

K independent arms: a ∈ {1, . . . , K}.
Each arm a returns a random reward R_a if pulled.
(Simpler case) Assume R_a is not time varying.
Game: You choose arm a_t at time t. You then observe X_t = R_{a_t}, where R_{a_t} is sampled from the underlying distribution of that arm.
The distribution of R_a is not known.
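The game loop above can be sketched in a few lines; the Bernoulli reward distributions below are hypothetical stand-ins for the unknown R_a, and the round-robin arm choice is just a placeholder for a real strategy:

```python
import random

# A minimal simulation of the k-armed bandit game from this slide.
# The per-arm means are hypothetical; the player never sees them.
K = 3
true_means = [0.2, 0.5, 0.8]  # unknown to the player

def pull(a, rng):
    """Pulling arm a returns a Bernoulli reward with mean true_means[a]."""
    return 1.0 if rng.random() < true_means[a] else 0.0

rng = random.Random(0)
# Play a few rounds with an arbitrary (here: round-robin) choice of arms.
T = 9
rewards = [pull(t % K, rng) for t in range(T)]
print(rewards)  # each X_t is observed only for the arm actually pulled
```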

SLIDE 6

More real motivations...

SLIDE 7

Ad placement...

SLIDE 8

The Goal

We would like to maximize our long-term future reward.
Our (possibly randomized) sequential strategy/algorithm A is:
a_t = A(a_1, X_1, a_2, X_2, . . . , a_{t−1}, X_{t−1})
In T rounds, our expected reward is:
E[ Σ_{t=1}^T X_t | A ]
where the expectation is with respect to the reward process and our algorithm.
Objective: What is a strategy which maximizes our long-term reward?

SLIDE 9

Our Regret

Suppose µ_a = E[R_a]. Assume 0 ≤ µ_a ≤ 1. Let µ* = max_a µ_a.
In expectation, the best we can do is obtain µ*T reward in T steps.
In T rounds, our regret is:
µ*T − E[ Σ_{t=1}^T X_t | A ] ≤ ??
Objective: What is a strategy which makes our regret small?

SLIDE 10

A Naive Strategy

For the first τ rounds, sample each arm τ/K times.
For the remainder of the rounds, choose the arm with the best observed empirical reward.
How good is this strategy? How do we set τ? Let's look at confidence intervals.
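A minimal sketch of this explore-then-exploit strategy, assuming (hypothetically) Bernoulli arms; the arm means and the value of τ below are illustrative, not from the slides:

```python
import random

def explore_then_exploit(means, T, tau, seed=0):
    """Naive strategy: split the first tau rounds evenly over the K arms,
    then commit to the empirically best arm for the remaining rounds.
    `means` are hypothetical Bernoulli parameters, unseen by the strategy."""
    rng = random.Random(seed)
    K = len(means)

    def pull(a):
        return 1.0 if rng.random() < means[a] else 0.0

    total, counts, sums = 0.0, [0] * K, [0.0] * K
    # Exploration: each arm gets tau // K pulls.
    for a in range(K):
        for _ in range(tau // K):
            x = pull(a)
            counts[a] += 1
            sums[a] += x
            total += x
    # Exploitation: commit to the best empirical mean.
    best = max(range(K), key=lambda a: sums[a] / max(counts[a], 1))
    for _ in range(T - K * (tau // K)):
        total += pull(best)
    return total, best

reward, best = explore_then_exploit([0.2, 0.5, 0.8], T=10_000, tau=300)
```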

SLIDE 11

Hoeffding’s bound

If we pull arm a N_a times, our empirical estimate for arm a is:
µ̂_a = (1/N_a) Σ_{t: a_t = a} X_t
By Hoeffding's bound, with probability greater than 1 − δ,
|µ̂_a − µ_a| ≤ O( √( log(1/δ) / N_a ) )
By the union bound, with probability greater than 1 − δ, for all a,
|µ̂_a − µ_a| ≤ O( √( log(K/δ) / N_a ) )
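A quick empirical check of the Hoeffding interval, written with the explicit constant √(log(2/δ)/(2N)) (the O(·) above hides constants); the arm mean, N, and δ are illustrative choices:

```python
import math
import random

# Empirical coverage check: the true mean should fall inside the Hoeffding
# interval at least a (1 - delta) fraction of the time.
mu, N, delta = 0.6, 200, 0.05
width = math.sqrt(math.log(2 / delta) / (2 * N))  # two-sided Hoeffding width

rng = random.Random(1)
trials, covered = 2000, 0
for _ in range(trials):
    mu_hat = sum(rng.random() < mu for _ in range(N)) / N
    if abs(mu_hat - mu) <= width:
        covered += 1
coverage = covered / trials
print(coverage)  # typically well above 1 - delta; Hoeffding is conservative
```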

SLIDE 12

Our regret

(Exploration rounds) What is our regret for the first τ rounds?
(Exploitation rounds) What is our regret for the remaining T − τ rounds?
Our total regret is:
µ*T − Σ_{t=1}^T X_t ≤ τ + O( √( log(K/δ) / (τ/K) ) ) · (T − τ)
How do we choose τ?

SLIDE 13

The Naive Strategy’s Regret

Choose τ = K^{1/3} T^{2/3} and δ = 1/T.
Theorem: Our total (expected) regret is:
µ*T − E[ Σ_{t=1}^T X_t | A ] ≤ O( K^{1/3} T^{2/3} (log(KT))^{1/3} )
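A numeric sanity check (ignoring log factors) that this τ balances the exploration cost τ against the exploitation cost √(K/τ) · T from the previous slide; K and T are illustrative:

```python
# With tau = K^(1/3) T^(2/3), the exploration cost tau and the exploitation
# cost T * sqrt(K / tau) are both of order K^(1/3) T^(2/3), so neither term
# dominates. K and T below are arbitrary example values.
K, T = 10, 1_000_000
tau = (K ** (1 / 3)) * (T ** (2 / 3))
explore_cost = tau
exploit_cost = T * (K / tau) ** 0.5
print(explore_cost, exploit_cost)  # the two costs coincide for this tau
```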

SLIDE 14

Can we be more adaptive?

Are we still pulling arms that we know are sub-optimal? How do we know this?
Let N_{a,t} be the number of times we pulled arm a up to time t.
Confidence interval at time t: with probability greater than 1 − δ,
|µ̂_{a,t} − µ_a| ≤ O( √( log(1/δ) / N_{a,t} ) )
With δ → δ/(TK), the above bound will hold for all arms a ∈ [K] and all timesteps t ≤ T.

SLIDE 15

Example

SLIDE 16

Example

SLIDE 17

Example

SLIDE 18

Confidence Bounds...

SLIDE 19

UCB: a reasonable state of our uncertainty...

SLIDE 20

Upper Confidence Bound (UCB) Algorithm

At each time t,

Pull arm:
a_t = argmax_a [ µ̂_{a,t} + c √( log(KT/δ) / N_{a,t} ) ] =: argmax_a [ µ̂_{a,t} + ConfBound_{a,t} ]
(where c ≤ 10 is a constant).
Observe reward X_t. Update µ̂_{a,t}, N_{a,t}, and ConfBound_{a,t}.

How well does this do?
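A sketch of the UCB rule above, assuming (hypothetically) Bernoulli arms; the constant c = 2, δ, and the arm means are illustrative choices, not from the slides:

```python
import math
import random

def ucb(means, T, delta=0.01, c=2.0, seed=0):
    """UCB: pull the arm maximizing mu_hat(a) + c * sqrt(log(K*T/delta) / N(a)).
    `means` are hypothetical Bernoulli parameters, unseen by the algorithm."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums, total = [0] * K, [0.0] * K, 0.0

    def value(a):
        if counts[a] == 0:
            return float("inf")  # force every arm to be pulled at least once
        mu_hat = sums[a] / counts[a]
        return mu_hat + c * math.sqrt(math.log(K * T / delta) / counts[a])

    for _ in range(T):
        a = max(range(K), key=value)       # arm with largest upper bound
        x = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1                     # update N_{a,t}
        sums[a] += x                       # update the empirical mean
        total += x
    return total, counts

total, counts = ucb([0.2, 0.5, 0.8], T=5000)
```

In a typical run the best arm accumulates the vast majority of the pulls, while clearly sub-optimal arms stop being pulled once their upper confidence bounds fall below the best arm's value.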

SLIDE 21

Instantaneous Regret

With probability greater than 1 − δ, all the confidence bounds will hold.
Question: If µ̂_{a,t} + ConfBound_{a,t} ≤ µ*, could UCB pull arm a at time t?
Question: If we pull arm a at time t, how much regret do we pay? i.e., µ* − µ_{a_t} ≤ ??

SLIDE 22

Total Regret

Theorem: The total (expected) regret of UCB is:
µ*T − E[ Σ_{t=1}^T X_t | A ] ≤ O( √( KT log(KT) ) )
This is better than the naive strategy. Up to log factors, it is optimal.
Practical algorithm?

SLIDE 23

Proof Idea: for K = 2

Suppose arm a = 2 is not optimal.
Claim 1: All confidence intervals will be valid (with probability ≥ 1 − δ).
Claim 2: If we pull arm a = 1, then we pay no regret.
Claim 3: If we pull arm a = 2, then we pay at most 2C_{a,t} regret. To see this:
Why? µ̂_{a,t} + C_{a,t} ≥ µ̂_{1,t} + C_{1,t} ≥ µ*
Why? µ_a ≥ µ̂_{a,t} − C_{a,t}
The total regret is:
Σ_t C_{a,t} ≲ Σ_t √( 1 / N_{a,t} )
Note that N_{a,t} ≤ t (and increasing).
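The last sum can be made concrete with a standard integral comparison plus Cauchy-Schwarz, which recovers the √(KT log(KT)) bound from the theorem; a sketch:

```latex
% Sum the per-pull bonuses over the n-th pull of each arm, then compare
% with an integral and apply Cauchy-Schwarz (using \sum_a N_{a,T} = T):
\sum_{n=1}^{N} \frac{1}{\sqrt{n}}
  \;\le\; 1 + \int_{1}^{N} \frac{dx}{\sqrt{x}}
  \;\le\; 2\sqrt{N},
\qquad\text{so}\qquad
\sum_{a=1}^{K} \sum_{n=1}^{N_{a,T}} \sqrt{\frac{\log(KT/\delta)}{n}}
  \;\le\; 2\sqrt{\log(KT/\delta)} \sum_{a=1}^{K} \sqrt{N_{a,T}}
  \;\le\; 2\sqrt{KT \log(KT/\delta)}.
```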

SLIDE 24

Acknowledgements

http://gdrro.lip6.fr/sites/default/files/JourneeCOSdec2015-Kaufman.pdf
https://sites.google.com/site/banditstutorial/
