From Bandits to Experts: A Tale of Domination and Independence (PowerPoint presentation)

Nicolò Cesa-Bianchi, Università degli Studi di Milano


slide-1
SLIDE 1

From Bandits to Experts: A Tale of Domination and Independence

Nicolò Cesa-Bianchi
Università degli Studi di Milano


slide-2
SLIDE 2

From Bandits to Experts: A Tale of Domination and Independence

Nicolò Cesa-Bianchi
Università degli Studi di Milano

Joint work with: Noga Alon, Ofer Dekel, Tomer Koren


slide-3
SLIDE 3

Theory of repeated games

James Hannan (1922–2010), David Blackwell (1919–2010)

Learning to play a game (1956): play a game repeatedly against a possibly suboptimal opponent


slide-4
SLIDE 4

Zero-sum 2-person games played more than once

An N × M loss matrix ℓ over R, known to both players:

         1         2       ...   M
  1   ℓ(1, 1)   ℓ(1, 2)   ...
  2   ℓ(2, 1)   ℓ(2, 2)   ...
  ⋮      ⋮         ⋮        ⋱
  N

Row player (the player) has N actions; column player (the opponent) has M actions.

For each game round t = 1, 2, ...
  • Player chooses action i_t and opponent chooses action y_t
  • The player suffers loss ℓ(i_t, y_t) (= gain of the opponent)
  • The player can learn from the opponent's history of past choices y_1, ..., y_{t−1}


slide-5
SLIDE 5

Prediction with expert advice

Volodya Vovk, Manfred Warmuth

        t = 1     t = 2    ...
  1    ℓ_1(1)    ℓ_2(1)    ...
  2    ℓ_1(2)    ℓ_2(2)    ...
  ⋮       ⋮         ⋮        ⋱
  N    ℓ_1(N)    ℓ_2(N)    ...

Play an unknown loss matrix: the opponent's moves y_1, y_2, ... define a sequential prediction problem with a time-varying loss function ℓ(i_t, y_t) = ℓ_t(i_t)


slide-6
SLIDE 6

Playing the experts game

N actions, all losses hidden. For t = 1, 2, ...

1. Loss ℓ_t(i) ∈ [0, 1] is assigned to every action i = 1, ..., N (hidden from the player)


slide-7
SLIDE 7

Playing the experts game

N actions, all losses hidden. For t = 1, 2, ...

1. Loss ℓ_t(i) ∈ [0, 1] is assigned to every action i = 1, ..., N (hidden from the player)

2. Player picks an action I_t (possibly using randomization) and incurs loss ℓ_t(I_t)


slide-8
SLIDE 8

Playing the experts game

N actions, all losses revealed after play (figure values: 7 3 6 7 2 1 4 9 4). For t = 1, 2, ...

1. Loss ℓ_t(i) ∈ [0, 1] is assigned to every action i = 1, ..., N (hidden from the player)

2. Player picks an action I_t (possibly using randomization) and incurs loss ℓ_t(I_t)

3. Player gets feedback information: the full loss vector ℓ_t = (ℓ_t(1), ..., ℓ_t(N)) is revealed

slide-9
SLIDE 9

Oblivious opponents

The loss process (ℓ_t)_{t≥1} is deterministic and unknown to the (randomized) player I_1, I_2, ...

Oblivious regret minimization:

  R_T = E[ Σ_{t=1}^T ℓ_t(I_t) ] − min_{i=1,...,N} Σ_{t=1}^T ℓ_t(i)      want: R_T = o(T)

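As a concrete illustration, the regret in this definition is easy to compute once the loss sequence and the plays are known. A minimal sketch (the function name and the toy loss matrix are illustrative, not from the talk):

```python
import numpy as np

def oblivious_regret(losses, actions):
    # losses: (T, N) matrix of oblivious losses; actions: the plays I_1, ..., I_T.
    # Regret = cumulative loss of the plays minus that of the best fixed action.
    losses = np.asarray(losses)
    player_loss = losses[np.arange(len(actions)), actions].sum()
    best_fixed = losses.sum(axis=0).min()
    return player_loss - best_fixed

# Toy run: 3 rounds, 2 actions; action 0 is best in hindsight (total loss 1 vs 2).
L = [[0.0, 1.0],
     [0.0, 1.0],
     [1.0, 0.0]]
print(oblivious_regret(L, [0, 0, 0]))  # 0.0
print(oblivious_regret(L, [1, 1, 1]))  # 1.0
```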

slide-10
SLIDE 10

Bounds on regret

[How to use expert advice, 1997]

Lower bound using random losses: the losses ℓ_t(i) are independent random coin flips L_t(i) ∈ {0, 1}.

For any player strategy:

  E[ Σ_{t=1}^T L_t(I_t) ] = T/2

Then the expected regret is

  E[ max_{i=1,...,N} Σ_{t=1}^T ( 1/2 − L_t(i) ) ] = ( 1 − o(1) ) √( (T ln N) / 2 )


slide-11
SLIDE 11

Exponentially weighted forecaster

At time t pick action I_t = i with probability proportional to

  exp( −η Σ_{s=1}^{t−1} ℓ_s(i) )

(the sum in the exponent is the total loss of action i so far)

Regret bound  [How to use expert advice, 1997]

If η = √( 8(ln N)/T ) then

  R_T ≤ √( (T ln N) / 2 )

Matching the lower bound including constants. The dynamic choice η_t = √( 8(ln N)/t ) only loses small constants.
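The forecaster on this slide takes only a few lines. A sketch on synthetic data (the loss matrix, the seed, and the favored action are made up; only the sampling rule and the tuning of η come from the slide):

```python
import numpy as np

def hedge(losses, eta, rng):
    # Exponentially weighted forecaster: P(I_t = i) is proportional to
    # exp(-eta * cumulative loss of action i so far).
    T, N = losses.shape
    cum = np.zeros(N)      # cumulative loss of every action
    player_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))   # shift by min for numerical stability
        p = w / w.sum()
        i = rng.choice(N, p=p)
        player_loss += losses[t, i]
        cum += losses[t]   # experts feedback: the whole loss vector is revealed
    return player_loss - cum.min()             # realized regret vs. best fixed action

rng = np.random.default_rng(0)
T, N = 2000, 5
losses = rng.random((T, N))
losses[:, 2] *= 0.2                  # make action 2 clearly best on average
eta = np.sqrt(8 * np.log(N) / T)     # the tuning from the slide
regret = hedge(losses, eta, rng)
bound = np.sqrt(T * np.log(N) / 2)
print(regret, bound)                 # the realized regret stays well below the bound here
```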

slide-12
SLIDE 12

The bandit problem: playing an unknown game

N actions, all losses hidden. For t = 1, 2, ...

1. Loss ℓ_t(i) ∈ [0, 1] is assigned to every action i = 1, ..., N (hidden from the player)


slide-13
SLIDE 13

The bandit problem: playing an unknown game

N actions, all losses hidden. For t = 1, 2, ...

1. Loss ℓ_t(i) ∈ [0, 1] is assigned to every action i = 1, ..., N (hidden from the player)

2. Player picks an action I_t (possibly using randomization) and incurs loss ℓ_t(I_t)


slide-14
SLIDE 14

The bandit problem: playing an unknown game

N actions, only the played action's loss revealed (figure: ? 3 ? ? ? ? ? ? ?). For t = 1, 2, ...

1. Loss ℓ_t(i) ∈ [0, 1] is assigned to every action i = 1, ..., N (hidden from the player)

2. Player picks an action I_t (possibly using randomization) and incurs loss ℓ_t(I_t)

3. Player gets feedback information: only ℓ_t(I_t) is revealed


slide-15
SLIDE 15

The bandit problem: playing an unknown game

N actions, only the played action's loss revealed (figure: ? 3 ? ? ? ? ? ? ?). For t = 1, 2, ...

1. Loss ℓ_t(i) ∈ [0, 1] is assigned to every action i = 1, ..., N (hidden from the player)

2. Player picks an action I_t (possibly using randomization) and incurs loss ℓ_t(I_t)

3. Player gets feedback information: only ℓ_t(I_t) is revealed

Many applications: ad placement, dynamic content adaptation, routing, online auctions


slide-16
SLIDE 16

Relationships between actions

[Mannor and Shamir, 2011]


slide-17
SLIDE 17

A graph of relationships over actions

(figure: a graph over the N actions; all losses hidden)



slide-19
SLIDE 19

A graph of relationships over actions

(figure: the same graph with some losses revealed: 7 3 6 7 ? 2 ? ? ? ?)


slide-20
SLIDE 20

Recovering expert and bandit settings

Experts: clique (playing any action reveals all losses: 7 3 6 7 2 2 1 4 9 4)

Bandits: empty graph (only the played action's loss is revealed: ? 3 ? ? ? ? ? ? ? ?)


slide-21
SLIDE 21

Exponentially weighted forecaster — Reprise

Player’s strategy  [Alon, C-B, Gentile, Mannor, Mansour and Shamir, 2013]

  P_t(I_t = i) ∝ exp( −η Σ_{s=1}^{t−1} ℓ̂_s(i) ),   i = 1, ..., N

with the importance sampling estimator

  ℓ̂_t(i) = ℓ_t(i) / P_t( ℓ_t(i) is observed )   if ℓ_t(i) is observed, and 0 otherwise

Its two key properties:

  E_t[ ℓ̂_t(i) ]  = ℓ_t(i)                            (unbiasedness)
  E_t[ ℓ̂_t(i)² ] ≤ 1 / P_t( ℓ_t(i) is observed )     (variance control)
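Both properties can be verified exactly on a toy feedback graph, since E_t is just a finite sum over the draw of I_t. A sketch (the 3-action graph, the distribution P_t, and the losses are made up):

```python
import numpy as np

# Undirected feedback graph with self-loops: playing action j reveals the loss
# of every i with adj[i, j] = 1.
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]], dtype=float)
p    = np.array([0.5, 0.3, 0.2])   # player's distribution P_t over the actions
loss = np.array([0.7, 0.1, 0.9])   # true (hidden) losses, in [0, 1]

obs_prob = adj @ p                 # P_t(loss of i is observed)

# E_t[lhat_t(i)]   = sum_j p_j * 1[j observes i] * loss_i / obs_prob_i
mean = np.array([(p * adj[i]).sum() * loss[i] / obs_prob[i] for i in range(3)])
# E_t[lhat_t(i)^2] = sum_j p_j * 1[j observes i] * (loss_i / obs_prob_i)^2
second = np.array([(p * adj[i]).sum() * (loss[i] / obs_prob[i]) ** 2
                   for i in range(3)])

print(np.allclose(mean, loss))          # True: unbiasedness
print(np.all(second <= 1 / obs_prob))   # True: variance control
```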

slide-22
SLIDE 22

Regret bounds

Analysis (undirected graphs):

  R_T ≤ ln N / η + (η/2) Σ_{t=1}^T Σ_{i=1}^N P_t(I_t = i) / ( P_t(I_t = i) + Σ_{j ∈ N_G(i)} P_t(I_t = j) )

Lemma. For any undirected graph G = (V, E) and for any probability assignment p_1, ..., p_N over its vertices,

  Σ_{i=1}^N p_i / ( p_i + Σ_{j ∈ N_G(i)} p_j ) ≤ α(G)

where α(G) is the independence number of G (the size of the largest subset of V such that no two distinct vertices in it are adjacent in G)

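The lemma can be sanity-checked by brute force on a small graph, e.g. the 5-cycle, whose independence number is 2 (the graph and the random distributions below are illustrative):

```python
import itertools
import numpy as np

def independence_number(n, edges):
    # Brute-force alpha(G): size of the largest vertex subset with no internal edge.
    for r in range(n, 0, -1):
        for S in itertools.combinations(range(n), r):
            if all((u, v) not in edges and (v, u) not in edges
                   for u, v in itertools.combinations(S, 2)):
                return r
    return 0

def lemma_sum(n, edges, p):
    # sum_i p_i / (p_i + sum of p_j over i's neighbors), as in the lemma.
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return sum(p[i] / (p[i] + sum(p[j] for j in nbrs[i])) for i in range(n))

edges = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)}   # the 5-cycle C_5
alpha = independence_number(5, edges)               # alpha(C_5) = 2
rng = np.random.default_rng(0)
worst = max(lemma_sum(5, edges, rng.dirichlet(np.ones(5))) for _ in range(1000))
print(alpha, worst)   # worst stays below alpha, as the lemma promises
```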

slide-23
SLIDE 23

Regret bounds

Analysis (undirected graphs):

  R_T ≤ ln N / η + (η/2) Σ_{t=1}^T α(G) = √( T α(G) ln N )   by choosing η appropriately

Special cases:

  • Experts (clique): α(G) = 1, so R_T ≲ √(T ln N)
  • Bandits (empty graph): α(G) = N, so R_T ≲ √(T N ln N)

Minimax rate: the general bound is tight, R_T = Θ( √( T α(G) ln N ) )

slide-24
SLIDE 24

More general feedback models

Directed Interventions


slide-25
SLIDE 25

Old and new examples

  • Experts
  • Bandits
  • Cops & Robbers
  • Revealing Action


slide-26
SLIDE 26

Exponentially weighted forecaster with exploration

Player’s strategy  [Alon, C-B, Dekel and Koren, 2015]

  P_t(I_t = i) = (1 − γ) exp( −η Σ_{s=1}^{t−1} ℓ̂_s(i) ) / Z_t + γ U_G(i),   i = 1, ..., N

  ℓ̂_t(i) = ℓ_t(i) / P_t( ℓ_t(i) is observed )   if ℓ_t(i) is observed, and 0 otherwise

U_G is the uniform distribution supported on a subset of the vertex set V


slide-27
SLIDE 27

A characterization of feedback graphs

A vertex of G is:

  • observable if it has at least one incoming edge (possibly a self-loop)
  • strongly observable if it has either a self-loop or incoming edges from all other vertices
  • weakly observable if it is observable but not strongly observable

(figure: a directed graph on vertices 1, ..., 5)

In the example: 3 is not observable, 2 and 5 are weakly observable, 1 and 4 are strongly observable

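This characterization is easy to mirror in code. A sketch with a hypothetical 4-vertex graph (the slide's 5-vertex figure is not reproduced in the transcript, so the arcs below are made up; (u, v) means that playing u reveals the loss of v):

```python
def classify(n, arcs):
    # arcs: set of (u, v) pairs, "playing u reveals the loss of v";
    # (v, v) is a self-loop.  Returns each vertex's observability class.
    labels = {}
    for v in range(n):
        incoming = {u for (u, w) in arcs if w == v}
        others = {u for u in range(n) if u != v}
        if not incoming:
            labels[v] = "not observable"
        elif v in incoming or others <= incoming:
            labels[v] = "strongly observable"
        else:
            labels[v] = "weakly observable"
    return labels

# Hypothetical graph: 0 has a self-loop, 1 is observed by all other vertices,
# 2 has a single incoming arc (no self-loop), 3 has no incoming arc at all.
arcs = {(0, 0), (0, 1), (2, 1), (3, 1), (0, 2)}
labels = classify(4, arcs)
print(labels)  # 0 and 1 strongly observable, 2 weakly observable, 3 not observable
```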

slide-28
SLIDE 28

Minimax rates

  • G is strongly observable:  R_T = Θ( √( α(G) T ) )    (U_G is uniform on V)
  • G is weakly observable:    R_T = Θ( δ(G)^{1/3} T^{2/3} )    (U_G is uniform on a weakly dominating set)
  • G is not observable:       R_T = Θ(T)

(figure: a directed graph on vertices 1, ..., 5)

Weakly dominating set: δ(G) is the size of the smallest set that dominates all weakly observable nodes of G

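For small graphs, δ(G) can be computed by brute force straight from these definitions. A sketch (the arc convention (u, v) = "playing u reveals the loss of v" and the example graph are made up):

```python
import itertools

def weakly_dominating_number(n, arcs):
    # delta(G): size of the smallest set D whose members have an arc into
    # every weakly observable vertex of G.
    def incoming(v):
        return {u for (u, w) in arcs if w == v}
    weak = [v for v in range(n)
            if incoming(v)                                           # observable ...
            and v not in incoming(v)                                 # ... no self-loop
            and not {u for u in range(n) if u != v} <= incoming(v)]  # ... not seen by all
    if not weak:
        return 0
    for r in range(1, n + 1):
        for D in itertools.combinations(range(n), r):
            if all(incoming(v) & set(D) for v in weak):
                return r

# Hypothetical graph: vertex 2 is the only weakly observable vertex, and the
# single vertex 0 dominates it, so delta(G) = 1.
arcs = {(0, 0), (0, 1), (2, 1), (3, 1), (0, 2)}
print(weakly_dominating_number(4, arcs))  # 1
```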

slide-29
SLIDE 29

Minimax regret

Example 1: here the presence of red self-loops does not affect the minimax regret: R_T = Θ( √(T ln N) )

(figure: a directed graph on vertices 1, ..., 5)

Example 2: here the red loop matters:

  • with the red loop: strongly observable with α(G) = N − 1, so R_T = Θ( √(NT) )
  • without the red loop: weakly observable with δ(G) = 1, so R_T = Θ( T^{2/3} )

slide-30
SLIDE 30

Reactive opponents

[Dekel, Koren and Peres, 2014]

The loss of action i at time t depends on the player's past m actions:

  ℓ_t(i) → L_t(I_{t−m}, ..., I_{t−1}, i)

Adaptive regret:

  R_T^ada = E[ Σ_{t=1}^T L_t(I_{t−m}, ..., I_{t−1}, I_t) ] − min_{i=1,...,N} Σ_{t=1}^T L_t(i, ..., i)   (the action i repeated m + 1 times)

(plot: the losses ℓ_t(1) and ℓ_t(2) as functions of t)

Minimax rate (m > 0):  R_T^ada = Θ( T^{2/3} )
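For small memory m, the adaptive regret can be computed empirically. A sketch with a made-up switching-cost loss; padding the initial history with the first played action is an assumption, not from the talk:

```python
def adaptive_regret(L, actions, m, num_actions):
    # L[t] is a callable L_t(i_{t-m}, ..., i_{t-1}, i_t); actions = I_1, ..., I_T.
    # Assumption: the initial history is padded with the first played action.
    T = len(L)
    hist = [actions[0]] * m + list(actions)
    player = sum(L[t](*hist[t:t + m + 1]) for t in range(T))
    best = min(sum(L[t](*([i] * (m + 1))) for t in range(T))
               for i in range(num_actions))
    return player - best

# Memory-1 losses: a fixed base loss plus a unit switching cost.
base = [[0.25, 0.75]] * 10
L = [lambda prev, i, t=t: base[t][i] + (i != prev) for t in range(10)]
# Alternating play pays the switching cost in every round after the first,
# while the best constant action (0) pays 0.25 per round and never switches.
print(adaptive_regret(L, [0, 1] * 5, m=1, num_actions=2))  # 11.5
```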

slide-31
SLIDE 31

Conclusions

An abstract, game-theoretic framework for studying a variety of sequential decision problems.

  • Applicable to machine learning (e.g., binary classification) and online convex optimization settings
  • Exponential weights can be replaced by polynomial weights (cf. Mirror Descent for convex optimization)
  • Connections to gambling, portfolio management, competitive analysis of algorithms