SLIDE 1

Delay and Cooperation in Nonstochastic Bandits

Nicolò Cesa-Bianchi
Università degli Studi di Milano

Joint work with: Claudio Gentile and Alberto Minora (Varese), Yishay Mansour (Tel-Aviv)

N. Cesa-Bianchi (UNIMI)

SLIDE 2

Themes

  • Learning with partial and delayed feedback
  • Distributed online learning
  • Trade-off between quality and quantity of feedback information

SLIDE 3

The nonstochastic bandit problem

A sequential decision problem:
  • K actions
  • Unknown deterministic assignment of losses to actions: ℓt = (ℓt(1), . . . , ℓt(K)) ∈ [0, 1]^K for t = 1, 2, . . .

For t = 1, 2, . . .
  1. Player picks an action It (possibly using randomization) and incurs loss ℓt(It)
  2. Player gets partial information: only ℓt(It) is revealed

Applications: ad placement, recommender systems, online auctions, . . .

SLIDE 4

Regret

Regret of a randomized agent I1, I2, . . .

  RT  =def  E[ Σ_{t=1}^T ℓt(It) ] − min_{i=1,...,K} Σ_{t=1}^T ℓt(i)

We want RT = o(T). Lower bound: RT = Ω(√(KT)).
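The regret definition above can be checked numerically for any finished run; a minimal sketch, where the T × K list-of-lists loss table and the function name are illustrative assumptions, not part of the slides:

```python
def regret(action_seq, losses):
    """Empirical regret of a single run: total loss of the plays minus the
    total loss of the best fixed action in hindsight."""
    K = len(losses[0])
    incurred = sum(loss_t[i] for loss_t, i in zip(losses, action_seq))
    best_fixed = min(sum(loss_t[i] for loss_t in losses) for i in range(K))
    return incurred - best_fixed
```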

SLIDE 5

The Exp3 algorithm [Auer et al., 2002]

Agent's strategy:

  Pt(It = i) ∝ exp(−η Σ_{s=1}^{t−1} ℓ̂s(i))   for i = 1, . . . , K

Importance-weighted loss estimates:

  ℓ̂t(i) = ℓt(i) / Pt(ℓt(i) is observed)  if It = i,  and 0 otherwise

Only one non-zero component in ℓ̂t.

Properties of the importance-weighted estimator:
  • Et[ℓ̂t(i)] = ℓt(i)   (unbiasedness)
  • Et[ℓ̂t(i)²] ≤ 1 / Pt(ℓt(i) is observed)   (variance control)
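The strategy and estimator above admit a compact sketch. The T × K loss table, function names, and fixed η are illustrative assumptions, not the authors' reference code:

```python
import math
import random

def exp3_distribution(cum_est, eta):
    """Exponential-weights distribution: Pt(It = i) ∝ exp(-eta * cum. estimate)."""
    m = min(cum_est)  # shift by the minimum for numerical stability
    w = [math.exp(-eta * (c - m)) for c in cum_est]
    s = sum(w)
    return [x / s for x in w]

def exp3(losses, eta, seed=0):
    """Run Exp3 on a T x K table of losses in [0, 1]; return the actions played.
    Only the played arm's loss is observed; it enters the cumulative estimates
    with importance weight 1 / Pt(It = i)."""
    rng = random.Random(seed)
    K = len(losses[0])
    cum_est = [0.0] * K
    plays = []
    for loss_t in losses:
        p = exp3_distribution(cum_est, eta)
        i = rng.choices(range(K), weights=p)[0]  # play It ~ pt
        plays.append(i)
        cum_est[i] += loss_t[i] / p[i]  # importance-weighted estimate of ℓt(i)
    return plays
```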

SLIDE 6

Regret bounds

Matching the lower bound up to logarithmic factors:

  RT ≤ ln K / η + (η/2) E[ Σ_{t=1}^T Σ_{i=1}^K Pt(It = i) Et[ℓ̂t(i)²] ]
     ≤ ln K / η + (η/2) E[ Σ_{t=1}^T Σ_{i=1}^K Pt(It = i) / Pt(ℓt(i) is observed) ]
     = ln K / η + (η/2) K T

Tuning η gives RT = O(√(KT ln K)).

The full information (experts) setting:
  • Agent observes the whole vector of losses ℓt after each play
  • Pt(ℓt(i) is observed) = 1
  • RT = O(√(T ln K))

SLIDE 7

Learning with delayed losses

At the end of each round t > d the agent pays ℓt(It) and observes ℓt−d(It−d).

Upper bound [Neu et al., 2010; Joulani et al., 2013]:

  RT = O(√((d + 1)KT))

Proof (by reduction): run d + 1 instances of a bandit algorithm for the standard (no delay) setting in parallel. At each time step t = (d + 1)r + s, use instance s + 1 for the current play.

Lower bound:

  max{ √(KT) (bandit lower bound), √((d + 1)T ln K) (delayed experts lower bound) } = Ω(√((d + K)T))
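The reduction in the proof can be sketched as follows. The `make_instance` factory and the `act`/`update` interface are assumed for illustration, not part of the slides:

```python
class DelayedBanditReduction:
    """The reduction above: d + 1 independent copies of a no-delay bandit
    algorithm, used round-robin. Instance s plays at rounds t with
    t % (d + 1) == s, so the d-delayed feedback for its previous play has
    already arrived by the time it plays again."""
    def __init__(self, make_instance, d):
        self.d = d
        self.instances = [make_instance() for _ in range(d + 1)]

    def act(self, t):
        return self.instances[t % (self.d + 1)].act()

    def update(self, t, arm, loss):
        # feedback for round t (available at round t + d) is routed back to
        # the instance that played at round t
        self.instances[t % (self.d + 1)].update(arm, loss)
```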

SLIDE 8

Simpler and better solution

Delayed importance sampling estimates: run Exp3 and make an importance-weighted update whenever a loss becomes available.

  ℓ̂t(i) = ℓt−d(i) / Pt−d(ℓt−d(i) is observed)  if It−d = i,  and 0 otherwise

Regret bound:

  RT = O( d + √((d + K) T ln K) )

matching the lower bound up to logarithmic factors.
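A minimal sketch of this delayed variant, assuming a T × K list-of-lists loss table and an arbitrary fixed η (both illustrative choices):

```python
import math
import random

def delayed_exp3(losses, d, eta, seed=0):
    """Plain Exp3 whose importance-weighted update at round t uses the loss
    observed for round t - d, as in the 'simpler solution' above."""
    rng = random.Random(seed)
    K = len(losses[0])
    cum_est = [0.0] * K
    history = []  # (played arm, probability of that arm, observed loss)
    total = 0.0
    for t, loss_t in enumerate(losses):
        m = min(cum_est)  # shift for numerical stability
        w = [math.exp(-eta * (c - m)) for c in cum_est]
        s = sum(w)
        p = [x / s for x in w]
        i = rng.choices(range(K), weights=p)[0]
        total += loss_t[i]
        history.append((i, p[i], loss_t[i]))
        if t >= d:  # the loss of round t - d becomes available only now
            j, pj, lj = history[t - d]
            cum_est[j] += lj / pj  # delayed importance-weighted update
    return total
```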

SLIDE 9

Properties of the delayed loss estimate

Recall the key step in the Exp3 analysis (a.k.a. "bandit magic"):

  Σ_{i=1}^K Pt(It = i) / Pt(ℓt(i) is observed) = K

For the delayed loss estimate we have:

  Σ_{i=1}^K Pt(It = i) / Pt−d(ℓt−d(i) is observed) ≤ eK   for η ≤ 1/(eK(d + 1))

SLIDE 10

Cooperation with delay

  • N agents sit on the vertices of an unknown communication graph G = (V, E)
  • Agents cooperate to solve a common bandit problem
  • Each agent runs an instance of the same bandit algorithm

[Figure: an example communication graph on 10 numbered vertices]

SLIDE 11

Some related works

  • Cooperative nonstochastic bandits without delays [Awerbuch and Kleinberg, 2008]
  • Cooperative stochastic bandits on dynamic P2P networks [Szorenyi et al., 2013]
  • Stochastic bandits that compete for shared resources (cognitive radio networks)
  • Distributed gradient descent

SLIDE 12

The communication protocol with fixed delay d

For each t = 1, . . . , T each agent v ∈ V does the following:

  1. Plays an action It(v) drawn according to his private distribution pt(v), observing loss ℓt(It(v)) (same loss vector for all agents)
  2. Sends to his neighbors the message mt(v) = ⟨t, v, It(v), ℓt(It(v)), pt(v)⟩
  3. Receives messages from his neighbors, forwarding those that are not older than d

An agent receives a message from another agent with a delay equal to the shortest-path distance between them. A message sent by some agent v at time t will be received by all agents whose shortest-path distance from v is at most d.
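Steps 2 and 3 imply the two properties just stated; a small sketch of the flooding behavior (the graph is an assumed adjacency dict, and the code tracks only delivery delays, not full message contents):

```python
def broadcast(adj, origin, d):
    """Forward a message hop by hop until it is d hops old, mirroring step 3
    of the protocol. Returns {receiver: delay}; the message reaches exactly
    the agents within shortest-path distance d of the sender, with delay
    equal to that distance. `adj` maps each vertex to its neighbors."""
    delay = {origin: 0}
    frontier = [origin]
    for hop in range(1, d + 1):
        nxt = []
        for u in frontier:
            for w in adj[u]:
                if w not in delay:  # first arrival travels a shortest path
                    delay[w] = hop
                    nxt.append(w)
        frontier = nxt
    return delay
```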

SLIDE 13

Average welfare regret

  Rcoop_T = (1/N) Σ_{v∈V} E[ Σ_{t=1}^T ℓt(It(v)) ] − min_{i=1,...,K} Σ_{t=1}^T ℓt(i)

Remarks:
  • Clearly, Rcoop_T ≤ √(TK ln K) when agents run vanilla Exp3 (no cooperation)
  • By using other agents' plays, each agent may estimate ℓt better (thus learning nearly at the full-information rate)
  • In general, d trades off between quality and quantity of information

SLIDE 14

Cooperative delayed loss estimator

Each agent v uses the messages received from the other agents in order to estimate ℓt better:

  ℓ̂t(i, v) = ℓt−d(i) · I{Bt−d(i, v)} / Pt−d(Bt−d(i, v))  if t > d,  and 0 otherwise

Bt−d(i, v) is the event that some agent in the d-neighborhood of v played action i at time t − d.

  • Now ℓ̂t(v) may have many non-zero components (better estimate)
  • Agents need pt−d(v′) in order to compute P(Bt−d(i, v))
  • A message mt(v′) received by some agent v is always used at time t + d (even when v′ = v)
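Given the distributions pt−d(v′) carried by the messages, P(Bt−d(i, v)) has a simple closed form if the agents' draws are independent; that independence is an assumption made explicit here, and the function name is illustrative:

```python
def prob_action_observed(i, dists):
    """P(B(i, v)): probability that at least one agent in v's d-neighborhood
    plays action i, given the received probability vectors `dists` (one per
    such agent) and assuming the agents draw their actions independently."""
    p_none = 1.0
    for p in dists:
        p_none *= 1.0 - p[i]  # probability this agent does not play i
    return 1.0 - p_none
```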

SLIDE 15

Key inequality

  E[ Σ_{i=1}^K Σ_{v∈V} Pt(It(v) = i) / Pt−d(ℓt−d(i) is observed by v) ] ≤ e(1 + e^{−1})(K αd + N)

αd is the independence number of the graph obtained from G by connecting any two vertices whose shortest-path distance is at most d.
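On small illustrative graphs, αd can be computed exactly by building the d-th power of G and brute-forcing its independence number. This is a sketch in exponential time, not a practical algorithm (maximum independent set is NP-hard in general):

```python
from itertools import combinations

def power_graph_edges(adj, d):
    """Edge set of G^d: connect any two vertices of G whose shortest-path
    distance is at most d (BFS from every vertex)."""
    edges = set()
    for s in adj:
        dist = {s: 0}
        frontier = [s]
        for hop in range(1, d + 1):
            nxt = []
            for u in frontier:
                for w in adj[u]:
                    if w not in dist:
                        dist[w] = hop
                        nxt.append(w)
            frontier = nxt
        edges.update(frozenset((s, w)) for w in dist if w != s)
    return edges

def independence_number(vertices, edges):
    """Brute-force alpha: size of the largest vertex set with no internal edge."""
    for size in range(len(vertices), 0, -1):
        for cand in combinations(vertices, size):
            s = set(cand)
            if not any(e <= s for e in edges):  # no edge inside the candidate
                return size
    return 0
```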

SLIDE 16

Average welfare regret bound

  Rcoop_T = O( √( ((d + 1) + (K/N) αd) T ln K ) + d ln T )

where (d + 1) + (K/N)αd is the main term, the T ln K factor is unavoidable, and the d ln T term comes from the doubling trick.

Case N ≈ K and d ≥ diam(G). Then αd = 1 and

  Rcoop_T = O( √((d + 1) T ln K) + d ln T )

  • Each agent sees the losses of K random actions with delay d
  • All agents see the same information and update in the same way: no need to include pt(v) in the messages

SLIDE 17

More examples

Case G is a clique. Then αd = 1 and

  Rcoop_T = O( √((1 + K/N) T ln K) + ln T )

Compare to RT = O(√((K/N) T ln K)) in the standard (single-agent) bandit when the losses of N actions can be observed in each play [Seldin et al., 2014].

Case d = K^{1/2}, for any connected graph G. Then αd ≤ 2N/(d + 2) and

  Rcoop_T = O( √(K^{1/2} T ln K) + √K ln T )

This is better than √(KT) (the minimax rate for non-cooperating bandits).

SLIDE 18

Individual delays

  • Agents may use personalized delay parameters d(v) in order to decrease the regret
  • We do not assume agents know the delay parameters of the other agents
  • Hence each agent sends out his messages with a time-to-live parameter ttl(v), possibly different from d(v)
  • Now an agent v uses a message sent from v′ if their shortest-path distance is not greater than min{ttl(v′), d(v)}
  • This defines a directed communication graph derived from the original graph G
  • Time-to-live parameters also bound the number of times each message is replicated

SLIDE 19

Cooperative delayed loss estimator, revisited

  ℓ̂t(i, v) = ℓt−d(v)(i) · I{Bt−d(v)(i, v)} / Pt−d(v)(Bt−d(v)(i, v))  if t > d(v),  and 0 otherwise

Bt−d(v)(i, v) is the event that some agent v′ within shortest-path distance min{ttl(v′), d(v)} from v played action i at time t − d(v).

Because the communication graph is now directed, an exploration term is added to Exp3.

SLIDE 20

Example

  • Nodes in dense regions (red agents) can afford small delay/time-to-live parameters
  • Nodes in sparse regions (white agents) prefer large delay/time-to-live parameters

SLIDE 21

Regret bound

  Rcoop_T = O( √( (dave + 1 + (K/N) αP ln(TNK)) T ln K ) + (doubling-trick term) )

where the T ln K factor is unavoidable, and
  • dave is the delay averaged over all agents
  • αP is the independence number of the directed communication graph induced by the delay and time-to-live parameters

Compare the main terms in the regret bounds (when K ≈ N):
  • Common delay parameter: d + αd
  • Personalized delay parameters: dave + αP

SLIDE 22

Improving regret via the personalized parameters

N − √N red agents and √N white agents.

  • Single delay parameter d = N^{1/4}:  d + αd ≈ N^{1/4}
  • Personalized delay parameters d(v) = 1 for red agents and d(v) = √N for white agents:  dave + αP = O(1)

SLIDE 23

Open problems

  1. Simultaneous regret bounds that hold for each agent individually
  2. Local and adaptive tuning of personalized delay parameters
  3. Dropping the distributions pt(v) from the messages