SLIDE 1

The Nonstochastic Multi Armed Bandit Problem Part 2 and counting...

Shahaf Nacson

TAU

Nov 15, 2017

SLIDE 2

Reminder from last week

SLIDE 3

Background

Problem setup: K arms

Assume K is known to the player in advance

Rewards $x_i(t)$ are bounded in [0, 1]

Generalizes trivially to [a, b] via $(b - a)x + a$

Partial information

The player learns only the rewards of the arms they chose

The slot machines need not have a fixed distribution

They can even be adversarial

SLIDE 4

Background

Problem setup: the rewards assignment is determined in advance

I.e., before the first arm is pulled. Assignments can be picked after the player's strategy is already known.

We want to minimize the regret

SLIDE 5

Notations

K - number of possible actions (i.e., arms)

denoted usually by $i \in \{1, \dots, K\}$

T - total time

denoted usually by time $t \in \{1, \dots, T\}$; one action per time t

$x_i(t)$ - reward of arm i at time t

$x_i(t) \in [0, 1]$

A - player's strategy

Chooses arm $i_t$ at time t (and receives reward $x_{i_t}(t)$). The player only knows the rewards $x_{i_1}(1), \dots, x_{i_t}(t)$ of the previously chosen actions $i_1, \dots, i_t$

SLIDE 6

Notations take 2

K - number of possible actions (i.e., arms)

denoted usually by $i \in \{1, \dots, K\}$

T - total time

denoted usually by time $t \in \{1, \dots, T\}$; one action per time t

$x_i(t)$ - reward of arm i at time t

$x_i(t) \in [0, 1]$

A - player's strategy

Can be viewed as a sequence $I_1, I_2, \dots$ where each $I_t$ is a mapping $(\{1, \dots, K\} \times [0, 1])^{t-1} \to \{1, \dots, K\}$, i.e., from the previously chosen action indices and their observed rewards to the index of the next action
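To make the protocol concrete, here is a minimal sketch of the interaction loop in Python (illustrative only; the function and strategy names are ours, not from the slides):

    import random

    # rewards[t][i] is x_i(t), assigned by the adversary before play begins.
    def play(rewards, strategy, K, T):
        history = []                    # [(i_1, r_1), ..., (i_{t-1}, r_{t-1})]
        total = 0.0
        for t in range(T):
            i_t = strategy(history, K)  # I_t maps past (action, reward) pairs to an action
            r_t = rewards[t][i_t]       # partial information: only the chosen arm's reward
            history.append((i_t, r_t))
            total += r_t
        return total                    # G_A(T)

    # A trivial strategy: pick an arm uniformly at random each round.
    uniform_strategy = lambda history, K: random.randrange(K)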

SLIDE 7

Notations take 3

$G_A(T)$ - total reward of strategy A at time horizon T:

$$G_A(T) := \sum_{t=1}^{T} x_{i_t}(t)$$

denoted $G_A$ when the context of T is obvious

SLIDE 8

Notations take 4

Regret take 1: Given a sequence of actions $(j_1, \dots, j_T)$, we denote

$$G_{(j_1, \dots, j_T)} := \sum_{t=1}^{T} x_{j_t}(t)$$

as the return of the sequence. The (worst-case) regret is defined as $G_{(j_1, \dots, j_T)} - G_A(T)$

SLIDE 9

Notations take 5

Regret take 2: $G_{\max}(T)$ - total reward of the best arm at time horizon T:

$$G_{\max}(T) := \max_{j} \sum_{t=1}^{T} x_j(t)$$

denoted $G_{\max}$ as well

The weak regret is defined as $G_{\max} - G_A(T)$. We will consider the weak regret from now on and will refer to it simply as "the regret"

SLIDE 10

Goals

SLIDE 11

Goals

The lower bound on the weak regret is $\Omega(\sqrt{KT})$

This does not match the upper bound of previous week's algorithm, which was $O(\sqrt{KT \ln K})$

Closing the gap is still an open problem (today??)

SLIDE 12

Goals

Lower bound on the weak regret: $\Omega(\sqrt{KT})$

Upper bounds on the weak regret that hold with probability 1 (if time permits...)

SLIDE 13

Lower bounds on the weak regret

SLIDE 14

Theorem 5.1. For any number of actions $K \ge 2$ and for any time horizon T, there exists a distribution over the assignment of rewards such that the expected weak regret of any algorithm is $\Omega(\sqrt{KT})$

SLIDE 15

Proof overview

Construct a random distribution of rewards

s.t. all strategies suffer an expected regret of at least our lower bound

SLIDE 16

Proof overview

Construct a random distribution of rewards

s.t. all strategies suffer an expected regret of at least our lower bound

Find a lower bound on the expected gain of the best arm, $G_{\max}$

Pretty straightforward

SLIDE 17

Proof overview

Construct a random distribution of rewards

s.t. all strategies suffer an expected regret of at least our lower bound

Find a lower bound on the expected gain of the best arm, $G_{\max}$

Pretty straightforward

Find an upper bound on the expected gain of any given strategy, $G_A$

Here is where all the magic happens

SLIDE 18

Proof overview

Construct a random distribution of rewards

s.t. all strategies suffer an expected regret of at least our lower bound

Find a lower bound on the expected gain of the best arm, $G_{\max}$

Pretty straightforward

Find an upper bound on the expected gain of any given strategy, $G_A$

Here is where all the magic happens

Deduce a lower bound on their difference

SLIDE 19

Proof overview

Construct a random distribution of rewards

s.t. all strategies suffer an expected regret of at least our lower bound

Find a lower bound on the expected gain of the best arm, $G_{\max}$

Find an upper bound on the expected gain of any given strategy, $G_A$

Here is where all the magic happens

Deduce a lower bound on their difference

Proof by notations :)

SLIDE 20

Constructing the distribution

Before play begins, one action I is chosen uniformly at random to be the "good" action. Define binary rewards:

if $j = I$, meaning j is the "good" action: $\Pr[x_j(t) = 1] = \frac{1}{2} + \epsilon$, $\Pr[x_j(t) = 0] = \frac{1}{2} - \epsilon$

if $j \neq I$, meaning j is not the "good" action: $\Pr[x_j(t) = 1] = \frac{1}{2}$, $\Pr[x_j(t) = 0] = \frac{1}{2}$

for some small, fixed $\epsilon \in (0, \frac{1}{2}]$ to be chosen later down the road

SLIDE 21

Constructing the distribution

Before play begins, one action I is chosen uniformly at random to be the "good" action. Define binary rewards:

if $j = I$, meaning j is the "good" action: $\Pr[x_j(t) = 1] = \frac{1}{2} + \epsilon$, $\Pr[x_j(t) = 0] = \frac{1}{2} - \epsilon$

if $j \neq I$, meaning j is not the "good" action: $\Pr[x_j(t) = 1] = \frac{1}{2}$, $\Pr[x_j(t) = 0] = \frac{1}{2}$

Then the expected reward of the best action is $(\frac{1}{2} + \epsilon)T$
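As a quick sanity check on the construction (an illustration, not part of the proof), we can sample such a reward table and watch the good arm's empirical mean concentrate around $\frac{1}{2} + \epsilon$:

    import random

    def sample_rewards(K, T, eps, rng=random.Random(0)):
        I = rng.randrange(K)          # the "good" action, chosen uniformly at random
        p = [0.5 + eps if j == I else 0.5 for j in range(K)]
        # rewards[t][j] = x_j(t): Bernoulli(1/2 + eps) for the good arm, Bernoulli(1/2) otherwise
        rewards = [[1 if rng.random() < p[j] else 0 for j in range(K)] for _ in range(T)]
        return I, rewards

    K, T, eps = 5, 20000, 0.05
    I, rewards = sample_rewards(K, T, eps)
    print(sum(r[I] for r in rewards) / T)   # concentrates around 1/2 + eps = 0.55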

SLIDE 22

Constructing the distribution

Translation of our problem: our goal now is to show that for any given strategy A, we can find an $\epsilon$ s.t. A's expected regret is $\Omega(\sqrt{KT})$. We will soon see that $\epsilon$ depends only on the number of actions K and the total time T.
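To see the phenomenon numerically (again an illustration, with a hypothetical follow-the-leader player of our choosing): against this construction the expected weak regret w.r.t. the good arm is $\epsilon(T - E[N_I])$, which is at most $\epsilon T = \sqrt{KT}$ when $\epsilon = \sqrt{K/T}$:

    import math, random

    def ftl_regret(K, T, n_runs=20, rng=random.Random(1)):
        eps = math.sqrt(K / T)
        regrets = []
        for _ in range(n_runs):
            I = rng.randrange(K)
            sums = [0.0] * K
            counts = [0] * K
            gain, best = 0.0, 0.0
            for t in range(T):
                # explore each arm once, then follow the empirical leader
                i = t if t < K else max(range(K), key=lambda j: sums[j] / counts[j])
                p = 0.5 + (eps if i == I else 0.0)
                r = 1.0 if rng.random() < p else 0.0
                sums[i] += r
                counts[i] += 1
                gain += r
                # an independent draw of the good arm, to estimate G_max in expectation
                best += 1.0 if rng.random() < 0.5 + eps else 0.0
            regrets.append(best - gain)
        return sum(regrets) / n_runs

    print(ftl_regret(5, 20000))   # empirically on the order of sqrt(K*T) ~ 316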

SLIDE 23

Some more notations

$P_*\{\cdot\}$ - probability w.r.t. the aforementioned distribution

$P_i\{\cdot\}$ - probability conditioned on i being the good action:

$P_i\{\cdot\} = P_*\{\cdot \mid i = I\}$

$P_{unif}\{\cdot\}$ - probability w.r.t. the uniform distribution

Same for expectations $E_*[\cdot]$, $E_i[\cdot]$, $E_{unif}[\cdot]$

SLIDE 24

Some more notations

$P_*\{\cdot\}$ - probability w.r.t. the aforementioned distribution

$P_i\{\cdot\}$ - probability conditioned on i being the good action:

$P_i\{\cdot\} = P_*\{\cdot \mid i = I\}$

$P_{unif}\{\cdot\}$ - probability w.r.t. the uniform distribution

Same for expectations $E_*[\cdot]$, $E_i[\cdot]$, $E_{unif}[\cdot]$

We want to show: $E_*[G_{\max} - G_A] \ge \Omega(\sqrt{KT})$

SLIDE 25

Some more notations (...)

A - as before, the player's strategy

$r_t = x_{i_t}(t)$ - random variable denoting the reward received at time t

$\mathbf{r}^t = (r_1, \dots, r_t)$ - sequence of rewards up to time t

$\mathbf{r} = \mathbf{r}^T$ - the entire sequence

$N_i$ - number of times action i is chosen by A

SLIDE 26

Lemma A.1

Lemma A.1. Let $f : \{0,1\}^T \to [0, T]$ be any function defined on reward sequences $\mathbf{r}$. Then for any action i,

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

for some small constant $c < 1$.

SLIDE 27

Lemma A.1

Lemma A.1. Let $f : \{0,1\}^T \to [0, T]$ be any function defined on reward sequences $\mathbf{r}$. Then for any action i,

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

for some small constant $c < 1$ (an explicit constant, $c \approx 0.925$, comes out of the proof, but let's keep it simple).

SLIDE 28

Lemma A.1

Lemma A.1. Let $f : \{0,1\}^T \to [0, T]$ be any function defined on reward sequences $\mathbf{r}$. Then for any action i,

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

for some small constant $c < 1$.

Our first lemma bounds the difference between expectations when measured under our constructed distribution $E_i[\cdot]$ vs. the uniform distribution $E_{unif}[\cdot]$

SLIDE 29

Lemma A.1

Lemma A.1. Let $f : \{0,1\}^T \to [0, T]$ be any function defined on reward sequences $\mathbf{r}$. Then for any action i,

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

for some small constant c.

We can view $N_i$ as a function from reward sequences $\mathbf{r}$ to $[0, T]$ and apply Lemma A.1 to it.

SLIDE 30

Lemma A.1 - a bit of intuition

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

The smaller our bias $\epsilon$ towards the "good" arm, the closer we are to the uniform distribution, and the harder it becomes to set them apart.

SLIDE 31

Lemma A.1 - a bit of intuition

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

The smaller our bias $\epsilon$ towards the "good" arm, the closer we are to the uniform distribution, and the harder it becomes to set them apart.

The more time we play the machines (i.e., the larger T), the more information we have on the reward distribution, and so we can guess the "good" arm with better probability, or more times over the sequence.

SLIDE 32

Lemma A.1 - a bit of intuition

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

The smaller our bias $\epsilon$ towards the "good" arm, the closer we are to the uniform distribution, and the harder it becomes to set them apart.

The more time we play the machines (i.e., the larger T), the more information we have on the reward distribution, and so we can guess the "good" arm with better probability, or more times over the sequence.

A larger K means we choose each arm fewer times (i.e., smaller $E_{unif}[N_i]$): the more arms we have to choose from, the harder it is to distinguish between our distribution and the uniform one.

SLIDE 33

Lemma A.1

Proof of Lemma A.1.

SLIDE 34

Lemma A.1

Proof of Lemma A.1. That will have to wait a bit... Let's see what it gives us first! We will assume its correctness for now, and keep it in mind as our "technical debt"

SLIDE 35

Theorem A.2

Theorem A.2. For any player strategy A, and for the distribution on rewards described before, the expected regret of algorithm A is lower bounded by

$$E_*[G_{\max} - G_A] \ge \epsilon \left( T - \frac{T}{K} - c \epsilon T \sqrt{\frac{T}{K}} \right)$$

SLIDE 36

Theorem A.2

Theorem A.2. For any player strategy A, and for the distribution on rewards described before, the expected regret of algorithm A is lower bounded by

$$E_*[G_{\max} - G_A] \ge \epsilon \left( T - \frac{T}{K} - c \epsilon T \sqrt{\frac{T}{K}} \right)$$

Notice that the lhs is exactly what we are trying to bound!

SLIDE 37

Theorem A.2

Theorem A.2. For any player strategy A, and for the distribution on rewards described before, the expected regret of algorithm A is lower bounded by

$$E_*[G_{\max} - G_A] \ge \epsilon \left( T - \frac{T}{K} - c \epsilon T \sqrt{\frac{T}{K}} \right)$$

Let's start with the easy one, and bound $E_*[G_{\max}]$

SLIDE 38

Proof of Theorem A.2

Let's start with the easy one, and bound $E_*[G_{\max}]$: the expected gain of the best action is at least the expected gain of the good action, so $E_*[G_{\max}] \ge T(\frac{1}{2} + \epsilon)$

SLIDE 39

Proof of Theorem A.2

The expected gain of the best action is at least the expected gain of the good action, so $E_*[G_{\max}] \ge T(\frac{1}{2} + \epsilon)$

Now let's find an upper bound for $E_*[G_A]$, yielding the theorem.

SLIDE 40

Proof of Theorem A.2

If action I is chosen to be the good action, then clearly the expected reward at time t is $\frac{1}{2} + \epsilon$ if $i_t = I$ and $\frac{1}{2}$ if $i_t \neq I$:

$$E_i[r_t] = \left(\tfrac{1}{2} + \epsilon\right) P_i\{i_t = I\} + \tfrac{1}{2} P_i\{i_t \neq I\} = \tfrac{1}{2} + \epsilon P_i\{i_t = I\}$$

The last equality is obtained by inserting $P_i\{i_t \neq I\} = 1 - P_i\{i_t = I\}$

SLIDE 41

Proof of Theorem A.2

If action I is chosen to be the good action, then clearly the expected reward at time t is $\frac{1}{2} + \epsilon$ if $i_t = I$ and $\frac{1}{2}$ if $i_t \neq I$:

$$E_i[r_t] = \left(\tfrac{1}{2} + \epsilon\right) P_i\{i_t = I\} + \tfrac{1}{2} P_i\{i_t \neq I\} = \tfrac{1}{2} + \epsilon P_i\{i_t = I\}$$

Thus, the expected gain of algorithm A is:

$$E_i[G_A] = \sum_{t=1}^{T} E_i[r_t] = \frac{T}{2} + \epsilon E_i[N_i] \quad (31)$$

SLIDE 42

Proof of Theorem A.2

$$E_i[G_A] = \sum_{t=1}^{T} E_i[r_t] = \frac{T}{2} + \epsilon E_i[N_i]$$

Note that:

$$E_*[G_A] = \frac{1}{K} \sum_{i=1}^{K} E_i[G_A] = \frac{T}{2} + \epsilon \frac{1}{K} \sum_{i=1}^{K} E_i[N_i]$$

And so, by applying Lemma A.1 to $N_i$ we will get an upper bound for $E_i[N_i]$, and using the equality above, we will get an upper bound for $E_*[G_A]$

SLIDE 43

Proof of Theorem A.2

Applying Lemma A.1 to $N_i$ (note $N_i \in [0, T]$):

$$E_i[N_i] \le E_{unif}[N_i] + c \epsilon T \sqrt{E_{unif}[N_i]}$$

$$\sum_{i=1}^{K} E_i[N_i] \le \sum_{i=1}^{K} \left( E_{unif}[N_i] + c \epsilon T \sqrt{E_{unif}[N_i]} \right) \le T + c \epsilon T \sqrt{TK}$$

relying on $\sum_{i=1}^{K} E_{unif}[N_i] = T$, and so $\sum_{i=1}^{K} \sqrt{E_{unif}[N_i]} \le \sqrt{TK}$ by the Cauchy-Schwarz inequality
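The Cauchy-Schwarz step says that for any nonnegative $n_i$ summing to T, $\sum_i \sqrt{n_i} \le \sqrt{TK}$; a quick numeric check (sketch, with a random split of T pulls among K arms):

    import math, random

    rng = random.Random(0)
    K, T = 10, 1000
    cuts = sorted(rng.randrange(T + 1) for _ in range(K - 1))
    n = [b - a for a, b in zip([0] + cuts, cuts + [T])]        # random split: sum(n) == T
    print(sum(math.sqrt(v) for v in n) <= math.sqrt(T * K))    # True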

SLIDE 44

Proof of Theorem A.2

$$\sum_{i=1}^{K} E_i[N_i] \le T + c \epsilon T \sqrt{TK}$$

Combining with (31) we get:

$$E_*[G_A] = \frac{T}{2} + \epsilon \frac{1}{K} \sum_{i=1}^{K} E_i[N_i] \le \frac{T}{2} + \epsilon \left( \frac{T}{K} + c \epsilon T \sqrt{\frac{T}{K}} \right)$$

And we got an upper bound :)

SLIDE 45

Proof of Theorem A.2

$$E_*[G_A] \le \frac{T}{2} + \epsilon \left( \frac{T}{K} + c \epsilon T \sqrt{\frac{T}{K}} \right)$$

And finally, the regret:

$$E_*[G_{\max}] - E_*[G_A] \ge T\left(\tfrac{1}{2} + \epsilon\right) - \left( \frac{T}{2} + \epsilon \left( \frac{T}{K} + c \epsilon T \sqrt{\frac{T}{K}} \right) \right) = \epsilon \left( T - \frac{T}{K} - c \epsilon T \sqrt{\frac{T}{K}} \right)$$

SLIDE 46

Proof of Theorem A.2

$$E_*[G_{\max}] - E_*[G_A] \ge \epsilon \left( T - \frac{T}{K} - c \epsilon T \sqrt{\frac{T}{K}} \right)$$

Since $c < 1$, choosing $\epsilon$ proportional to $\sqrt{K/T}$ (with a small enough constant) gives a lower bound of $\Omega(\sqrt{KT})$.
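Plugging in concrete numbers (a quick check of the algebra; we treat the constant c as a parameter and set $\epsilon = \alpha\sqrt{K/T}$ with $\alpha$ tuned to keep the bracket positive):

    import math

    def regret_lower_bound(K, T, alpha, c=0.925):
        # epsilon = alpha*sqrt(K/T) turns the bound into sqrt(K*T) * alpha * (1 - 1/K - c*alpha)
        eps = alpha * math.sqrt(K / T)
        return eps * (T - T / K - c * eps * T * math.sqrt(T / K))

    K, T, c = 10, 100_000, 0.925
    alpha = (1 - 1 / K) / (2 * c)                          # maximizes alpha*(1 - 1/K - c*alpha)
    print(regret_lower_bound(K, T, alpha))                 # ~ 218.9
    print(math.sqrt(K * T) * (1 - 1 / K) ** 2 / (4 * c))   # same value in closed form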

SLIDE 47

Back to Lemma A.1

SLIDE 48

Lemma A.1

Lemma A.1. Let $f : \{0,1\}^T \to [0, T]$ be any function defined on reward sequences $\mathbf{r}$. Then for any action i,

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$$

for some small constant $c < 1$.

SLIDE 49

Statistical distance

For any distributions P and Q, let

$$\|P - Q\|_1 := \sum_{x \in \{0,1\}^n} |P\{x\} - Q\{x\}|$$

denote the statistical distance between P and Q. Intuitively, this is the largest possible difference between the probabilities that the two probability distributions can assign to the same event.

SLIDE 50

Statistical distance

$$\|P - Q\|_1 := \sum_{x \in \{0,1\}^n} |P\{x\} - Q\{x\}|$$

Thus,

$$\tfrac{1}{2}\|P - Q\|_1 = \sum_{x : P\{x\} \ge Q\{x\}} \left( P\{x\} - Q\{x\} \right)$$

Proof on board
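A small numeric sanity check of this identity on a random example (sketch; the domain size and distributions are arbitrary):

    import random

    rng = random.Random(0)
    n = 8
    p = [rng.random() for _ in range(n)]; p = [v / sum(p) for v in p]   # normalize to a distribution
    q = [rng.random() for _ in range(n)]; q = [v / sum(q) for v in q]

    lhs = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    rhs = sum(a - b for a, b in zip(p, q) if a >= b)
    print(lhs, rhs)   # equal, since both distributions sum to 1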

SLIDE 51

Proof of Lemma A.1.

We have that

$$E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] = \sum_{\mathbf{r}} f(\mathbf{r}) \left( P_i\{\mathbf{r}\} - P_{unif}\{\mathbf{r}\} \right) \le \sum_{\mathbf{r} : P_i\{\mathbf{r}\} \ge P_{unif}\{\mathbf{r}\}} f(\mathbf{r}) \left( P_i\{\mathbf{r}\} - P_{unif}\{\mathbf{r}\} \right)$$

Since $f(\mathbf{r}) \in [0, T]$,

$$\le T \sum_{\mathbf{r} : P_i\{\mathbf{r}\} \ge P_{unif}\{\mathbf{r}\}} \left( P_i\{\mathbf{r}\} - P_{unif}\{\mathbf{r}\} \right)$$

and from the property we just proved on board,

$$= \frac{T}{2} \|P_i - P_{unif}\|_1$$

SLIDE 52

Proof of Lemma A.1.

Lemma A.1: $E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le c \epsilon T \sqrt{E_{unif}[N_i]}$

What we've got so far: $E_i[f(\mathbf{r})] - E_{unif}[f(\mathbf{r})] \le \frac{T}{2} \|P_i - P_{unif}\|_1$

We're left to prove that $\|P_i - P_{unif}\|_1 \le 2 c \epsilon \sqrt{E_{unif}[N_i]}$

SLIDE 53

Proof of Lemma A.1.

We're left to prove that $\|P_i - P_{unif}\|_1 \le 2 c \epsilon \sqrt{E_{unif}[N_i]}$

We will actually see

$$\|P_i - P_{unif}\|_1 \le \sqrt{-E_{unif}[N_i] \ln(1 - 4\epsilon^2)}$$

And since $-\ln(1 - x) \le 4 \ln(\frac{4}{3}) x$ for $x \in [0, \frac{1}{4}]$, the lemma will hold.
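The elementary inequality $-\ln(1 - x) \le 4\ln(\frac{4}{3})x$ on $[0, \frac{1}{4}]$ is easy to spot-check numerically (sketch; the small tolerance only absorbs floating-point rounding at the endpoint):

    import math

    c4 = 4 * math.log(4 / 3)                             # ~ 1.1507
    xs = [i / 4000 for i in range(1, 1001)]              # grid over (0, 1/4]
    print(all(-math.log(1 - x) <= c4 * x + 1e-12 for x in xs))   # True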

SLIDE 54

Story time

Let P, Q be any two distributions. Define a random variable Y and a distribution $\bar{P}$ as follows:

$$y = \log \frac{P\{x\}}{Q\{x\}}, \qquad \bar{P}\{y\} = P\{x\}$$

Then

$$\sum_y \bar{P}\{y\} = \sum_x P\{x\} = 1, \qquad 0 \le \bar{P}\{y\} \le 1$$

meaning $\bar{P}\{y\}$ is indeed a well defined probability function.

SLIDE 55

Story time

Let P, Q be any two distributions. Define a random variable Y and a distribution $\bar{P}$ as follows:

$$y = \log \frac{P\{x\}}{Q\{x\}}, \qquad \bar{P}\{y\} = P\{x\}$$

Then the expectation of Y is given by

$$E[Y] = \sum_y y \bar{P}\{y\} = \sum_x P\{x\} \log \frac{P\{x\}}{Q\{x\}}$$

$E[Y]$ is defined as the KL divergence, or relative entropy, of the distributions P, Q, denoted $KL(P \| Q)$

SLIDE 56

KL divergence (a.k.a relative entropy)

Let

$$KL(P \| Q) := \sum_{x \in \{0,1\}^n} P\{x\} \log \frac{P\{x\}}{Q\{x\}}$$

be the Kullback-Leibler divergence, or relative entropy, between the two distributions. In other words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P.

SLIDE 57

KL divergence (a.k.a relative entropy)

$$KL(P \| Q) := \sum_{x \in \{0,1\}^n} P\{x\} \log \frac{P\{x\}}{Q\{x\}}$$

Or in our notations:

$$KL(P \| Q) := \sum_{\mathbf{r} \in \{0,1\}^T} P\{\mathbf{r}\} \log \frac{P\{\mathbf{r}\}}{Q\{\mathbf{r}\}}$$

SLIDE 58

KL divergence (a.k.a relative entropy)

So naturally, the conditional relative entropy of $r_t$ given $\mathbf{r}^{t-1}$ is

$$KL(P\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, Q\{r_t \mid \mathbf{r}^{t-1}\}) := \sum_{\mathbf{r}^{t-1} \in \{0,1\}^{t-1}} P\{\mathbf{r}^{t-1}\} \sum_{r_t \in \{0,1\}} P\{r_t \mid \mathbf{r}^{t-1}\} \log \frac{P\{r_t \mid \mathbf{r}^{t-1}\}}{Q\{r_t \mid \mathbf{r}^{t-1}\}} = \sum_{\mathbf{r}^t \in \{0,1\}^t} P\{\mathbf{r}^t\} \log \frac{P\{r_t \mid \mathbf{r}^{t-1}\}}{Q\{r_t \mid \mathbf{r}^{t-1}\}}$$

SLIDE 59

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$
SLIDE 60

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

So, what is $KL(\frac{1}{2} \| \frac{1}{2})$?

SLIDE 61

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

So, what is $KL(\frac{1}{2} \| \frac{1}{2})$? 0

Note that this is the divergence between the uniform distribution and our own conditional distribution, given we didn't choose the good action.

SLIDE 62

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

So, what is $KL(\frac{1}{2} \| \frac{1}{2} + \epsilon)$?

SLIDE 63

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

So, what is $KL(\frac{1}{2} \| \frac{1}{2} + \epsilon)$? $-\frac{1}{2} \ln(1 - 4\epsilon^2)$

Note that this is the divergence between the uniform distribution and our own conditional distribution, given we chose the good action.
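A numeric check of this closed form (sketch; KL taken here in nats, i.e., with the natural log, matching the ln in the formula):

    import math

    def kl_bern(p, q):   # KL(p || q) for Bernoulli distributions, in nats
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    eps = 0.1
    print(kl_bern(0.5, 0.5 + eps))                 # ~ 0.02041
    print(-0.5 * math.log(1 - 4 * eps * eps))      # same: -(1/2) ln(1 - 4 eps^2)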

SLIDE 64

Proof of Lemma A.1.

Note that in our case, we are discussing Bernoulli random variables. So, relative entropy, Bernoulli case:

$$KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

We're left to prove that

$$\|P_i - P_{unif}\|_1 \le \sqrt{-E_{unif}[N_i] \ln(1 - 4\epsilon^2)}$$

We will now see

$$\|P_{unif} - P_i\|_1^2 \le (2 \ln 2) KL(P_{unif} \| P_i)$$

Remember, $P_{unif}$, $P_i$ are Bernoulli distributions with $p_{unif} = \frac{1}{2}$ and $p_i = \frac{1}{2} + \epsilon$

SLIDE 65

Proof of Lemma A.1., Bernoulli case

For two Bernoulli distributions P, Q with parameters $p \ge q$ respectively:

$$\|P - Q\|_1^2 \le (2 \ln 2) KL(P \| Q)$$

SLIDE 66

Proof of Lemma A.1., Bernoulli case

For two Bernoulli distributions P, Q with parameters $p \ge q$ respectively:

$$\|P - Q\|_1^2 \le (2 \ln 2) KL(P \| Q)$$

Actually, this is true for any two distributions, but we only need it for Bernoulli distributions here.

SLIDE 67

Proof of Lemma A.1., Bernoulli case

For two Bernoulli distributions P, Q with parameters $p \ge q$ respectively:

$$\|P - Q\|_1^2 \le (2 \ln 2) KL(P \| Q)$$

Recall that

$$KL(P \| Q) := KL(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

SLIDE 68

Proof of Lemma A.1., Bernoulli case

Reminder:

$$\tfrac{1}{2}\|P - Q\|_1 = \sum_{r : P\{r\} \ge Q\{r\}} \left( P\{r\} - Q\{r\} \right)$$

Thus,

$$\|P - Q\|_1 = 2 \sum_{r : P\{r\} \ge Q\{r\}} \left( P\{r\} - Q\{r\} \right), \qquad \|P - Q\|_1^2 = 4 \left( \sum_{r : P\{r\} \ge Q\{r\}} \left( P\{r\} - Q\{r\} \right) \right)^2$$

SLIDE 69

Proof of Lemma A.1., Bernoulli case

So in our case,

$$\|P - Q\|_1^2 \le (2 \ln 2) KL(p \| q)$$

translates to

$$4 \left( \sum_{r : P\{r\} \ge Q\{r\}} \left( P\{r\} - Q\{r\} \right) \right)^2 \le (2 \ln 2) KL(p \| q)$$

But the lhs is simply $4(p - q)^2$ for Bernoulli distributions. So we are left to prove

$$4(p - q)^2 \le (2 \ln 2) KL(p \| q)$$

SLIDE 70

Proof of Lemma A.1., Bernoulli case

$$4(p - q)^2 \le (2 \ln 2) KL(p \| q)$$

Indeed, denote $g(p, q) = (2 \ln 2) KL(p \| q) - 4(p - q)^2$. Then

$$\frac{\partial g(p, q)}{\partial q} = (2 \ln 2)\left( -\frac{p}{q \ln 2} + \frac{1 - p}{(1 - q) \ln 2} \right) + 8(p - q) = \frac{2(q - p)}{q(1 - q)} - 8(q - p) \le 0$$

since $q(1 - q) \le \frac{1}{4}$ and $q \le p$. For $q = p$ we have $g(p, q) = 0$; hence, as g only grows when q decreases from p, $g(p, q) \ge 0$ for $q \le p$, which concludes the proof for the Bernoulli case.
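The Bernoulli case can also be spot-checked on a grid (sketch; KL taken here in bits, i.e., base-2 log, matching the $(2\ln 2)$ factor):

    import math

    def kl2(p, q):   # Bernoulli KL in bits (log base 2)
        return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

    ok = True
    for i in range(1, 100):
        for j in range(1, i + 1):          # q <= p, both strictly inside (0, 1)
            p, q = i / 100, j / 100
            l1_sq = 4 * (p - q) ** 2       # ||P - Q||_1^2 = (2|p - q|)^2 for Bernoullis
            ok &= l1_sq <= (2 * math.log(2)) * kl2(p, q) + 1e-12
    print(ok)   # True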

SLIDE 71

Finishing the proof

We're left to prove that

$$\|P_i - P_{unif}\|_1 \le \sqrt{-E_{unif}[N_i] \ln(1 - 4\epsilon^2)}$$

We have just proven

$$\|P_{unif} - P_i\|_1^2 \le (2 \ln 2) KL(P_{unif} \| P_i) \le 2 KL(P_{unif} \| P_i)$$

(as $\ln 2 < 1$). Thus,

$$\|P_{unif} - P_i\|_1 \le \sqrt{2 KL(P_{unif} \| P_i)}$$
Shahaf Nacson The Nonstochastic Multi Armed Bandit Problem

slide-72
SLIDE 72

Reminder from last week Goals Lower bounds on the weak regret

Finishing the proof

We're left to prove that

$$\|P_i - P_{unif}\|_1 \le \sqrt{-E_{unif}[N_i] \ln(1 - 4\epsilon^2)}$$

We have just proven

$$\|P_{unif} - P_i\|_1 \le \sqrt{2 KL(P_{unif} \| P_i)}$$

So, by proving

$$KL(P_{unif} \| P_i) \le -\tfrac{1}{2} E_{unif}[N_i] \ln(1 - 4\epsilon^2)$$

we conclude the proof and all is well ;)

SLIDE 73

Proof.finalize()

To prove $KL(P_{unif} \| P_i) \le -\frac{1}{2} E_{unif}[N_i] \ln(1 - 4\epsilon^2)$ we will use the "chain rule for relative entropy":

$$KL(P\{x, y\} \,\|\, Q\{x, y\}) = KL(P\{x\} \,\|\, Q\{x\}) + KL(P\{y \mid x\} \,\|\, Q\{y \mid x\})$$

Intuitively, this is the relative-entropy analogue of conditional probability. Its correctness follows directly from the definitions of conditional probability and conditional relative entropy, together with the logarithm product rule.

SLIDE 74

Proof.finalize()

As $P_{unif} = P_{unif}\{r_1, \dots, r_T\}$ and $P_i = P_i\{r_1, \dots, r_T\}$, the chain rule for relative entropy gives

$$KL(P_{unif} \| P_i) = \sum_{t=1}^{T} KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}\})$$

SLIDE 75

Proof.finalize()

$$KL(P_{unif} \| P_i) = \sum_{t=1}^{T} KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}\})$$

$$= \sum_{t=1}^{T} \Big( P_{unif}\{i_t = i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t = i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t = i\}) + P_{unif}\{i_t \neq i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\}) \Big)$$

This equality is given by the law of total expectation (i.e., the smoothing theorem)

SLIDE 76

Proof.finalize()

$$KL(P_{unif} \| P_i) = \sum_{t=1}^{T} KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}\})$$

$$= \sum_{t=1}^{T} \Big( P_{unif}\{i_t = i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t = i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t = i\}) + P_{unif}\{i_t \neq i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\}) \Big)$$

$$= \sum_{t=1}^{T} \left( P_{unif}\{i_t = i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2} + \epsilon) + P_{unif}\{i_t \neq i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2}) \right)$$

$P_{unif}$ is always $\frac{1}{2}$ regardless of past results. $P_i$ is $\frac{1}{2}$ if we didn't choose the good action, and $\frac{1}{2} + \epsilon$ if we did.

SLIDE 77

Proof.finalize()

$$KL(P_{unif} \| P_i) = \sum_{t=1}^{T} KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}\})$$

$$= \sum_{t=1}^{T} \Big( P_{unif}\{i_t = i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t = i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t = i\}) + P_{unif}\{i_t \neq i\} \, KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}, i_t \neq i\}) \Big)$$

$$= \sum_{t=1}^{T} \left( P_{unif}\{i_t = i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2} + \epsilon) + P_{unif}\{i_t \neq i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2}) \right)$$

(Note that $KL(\frac{1}{2} \| \frac{1}{2}) = 0$ and $KL(\frac{1}{2} \| \frac{1}{2} + \epsilon) = -\frac{1}{2} \ln(1 - 4\epsilon^2)$)

SLIDE 78

Proof.finalize()

$$KL(P_{unif} \| P_i) = \sum_{t=1}^{T} KL(P_{unif}\{r_t \mid \mathbf{r}^{t-1}\} \,\|\, P_i\{r_t \mid \mathbf{r}^{t-1}\})$$

$$= \sum_{t=1}^{T} \left( P_{unif}\{i_t = i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2} + \epsilon) + P_{unif}\{i_t \neq i\} \, KL(\tfrac{1}{2} \,\|\, \tfrac{1}{2}) \right)$$

$$= \sum_{t=1}^{T} P_{unif}\{i_t = i\} \left( -\tfrac{1}{2} \ln(1 - 4\epsilon^2) \right) = E_{unif}[N_i] \left( -\tfrac{1}{2} \ln(1 - 4\epsilon^2) \right)$$
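To put numbers on the final identity (an illustration: we hypothetically take a player that pulls arms uniformly at random, so $E_{unif}[N_i] = T/K$; any other strategy just changes $E_{unif}[N_i]$):

    import math

    K, T, eps = 5, 1000, 0.05
    E_Ni = T / K                                   # uniform play: ~T/K pulls of arm i
    kl_total = E_Ni * (-0.5 * math.log(1 - 4 * eps ** 2))
    print(kl_total)                                # KL(P_unif || P_i) ~ 1.005
    # Plugging into ||P_unif - P_i||_1 <= sqrt(2 * KL):
    print(math.sqrt(2 * kl_total))                 # = sqrt(-E[N_i] ln(1 - 4 eps^2)) ~ 1.418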

SLIDE 79

Proof.finalize()

Q.E.D
