

SLIDE 1

Online Learning and Online Convex Optimization

Nicolò Cesa-Bianchi
Università degli Studi di Milano

SLIDE 2

Summary

1. My beautiful regret
2. A supposedly fun game I'll play again
3. The joy of convex


SLIDE 4

Machine learning

  • Classification/regression tasks
  • Predictive models: $h$ maps data instances $X$ to labels $Y$ (e.g., a binary classifier)
  • Training data $S_T = (X_1, Y_1), \dots, (X_T, Y_T)$ (e.g., email messages with spam vs. nonspam annotations)
  • A learning algorithm $A$ (e.g., a Support Vector Machine) maps the training data $S_T$ to a model $h = A(S_T)$
  • Evaluate the risk of the trained model $h$ with respect to a given loss function


SLIDE 6

Two notions of risk

  • View data as a statistical sample: statistical risk
    $\mathbb{E}\,\ell\big(\underbrace{A(S_T)}_{\text{trained model}}, \underbrace{(X,Y)}_{\text{test example}}\big)$
    Training set $S_T = (X_1,Y_1),\dots,(X_T,Y_T)$ and test example $(X,Y)$ drawn i.i.d. from the same unknown and fixed distribution
  • View data as an arbitrary sequence: sequential risk
    $\sum_{t=1}^{T} \ell\big(\underbrace{A(S_{t-1})}_{\text{trained model}}, \underbrace{(X_t,Y_t)}_{\text{test example}}\big)$
    Sequence of models trained on the growing prefixes $S_t = (X_1,Y_1),\dots,(X_t,Y_t)$ of the data sequence


SLIDE 8

Regrets, I had a few

A learning algorithm $A$ maps datasets to models in a given class $\mathcal{H}$.

Variance error in statistical learning:
$\mathbb{E}\,\ell\big(A(S_T),(X,Y)\big) - \inf_{h\in\mathcal{H}} \mathbb{E}\,\ell\big(h,(X,Y)\big)$
(compare to the expected loss of the best model in the class)

Regret in online learning:
$\sum_{t=1}^{T} \ell\big(A(S_{t-1}),(X_t,Y_t)\big) - \inf_{h\in\mathcal{H}} \sum_{t=1}^{T} \ell\big(h,(X_t,Y_t)\big)$
(compare to the cumulative loss of the best model in the class)


SLIDE 11

Incremental model update

A natural blueprint for online learning algorithms (sketched in code below). For $t = 1, 2, \dots$

1. Apply the current model $h_{t-1}$ to the next data element $(X_t, Y_t)$
2. Update the current model: $h_{t-1} \to h_t \in \mathcal{H}$ (local optimization)

Goal: control the regret
$\sum_{t=1}^{T} \ell\big(h_{t-1},(X_t,Y_t)\big) - \inf_{h\in\mathcal{H}} \sum_{t=1}^{T} \ell\big(h,(X_t,Y_t)\big)$

View this as a repeated game between a player generating predictors $h_t \in \mathcal{H}$ and an opponent generating data $(X_t, Y_t)$.
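In code, the blueprint is a short loop. Below is a minimal sketch (not from the slides): `model`, `predict`, `update`, and `loss` are illustrative names for the abstract objects above.

```python
# Minimal sketch of the incremental-update blueprint (names are illustrative).
# `model` is assumed to expose predict(x) and update(x, y); `loss` is any loss.

def online_learning_loop(model, stream, loss):
    """Run the protocol: predict on (x_t, y_t) with h_{t-1}, then update to h_t."""
    total_loss = 0.0
    for x_t, y_t in stream:                  # data arrives one element at a time
        prediction = model.predict(x_t)      # apply the current model h_{t-1}
        total_loss += loss(prediction, y_t)  # suffer loss on the new element
        model.update(x_t, y_t)               # local optimization: h_{t-1} -> h_t
    return total_loss
```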

SLIDE 12

Summary

1. My beautiful regret
2. A supposedly fun game I'll play again
3. The joy of convex

SLIDE 13

Theory of repeated games

James Hannan (1922–2010) and David Blackwell (1919–2010): learning to play a game (1956), that is, playing a game repeatedly against a possibly suboptimal opponent.

SLIDE 14

Zero-sum 2-person games played more than once

An $N \times M$ known loss matrix with entries $\ell(i, j)$: the row player (player) has $N$ actions, the column player (opponent) has $M$ actions.

For each game round $t = 1, 2, \dots$
  • The player chooses action $i_t$ and the opponent chooses action $y_t$
  • The player suffers loss $\ell(i_t, y_t)$ (= gain of the opponent)
  • The player can learn from the opponent's history of past choices $y_1, \dots, y_{t-1}$

SLIDE 15

Prediction with expert advice

Volodya Vovk and Manfred Warmuth. The loss matrix now has one column per round $t = 1, 2, \dots$ and one row per action $i = 1, \dots, N$, with entries $\ell_t(i)$. The opponent's moves $y_1, y_2, \dots$ define a sequential prediction problem with a time-varying loss function $\ell(i_t, y_t) = \ell_t(i_t)$.


SLIDE 18

Playing the experts game

A sequential decision problem with $N$ actions and an unknown deterministic assignment of losses to actions: $\ell_t = \big(\ell_t(1), \dots, \ell_t(N)\big) \in [0,1]^N$ for $t = 1, 2, \dots$

For $t = 1, 2, \dots$

1. The player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
2. The player gets full feedback information: $\ell_t(1), \dots, \ell_t(N)$


SLIDE 20

Regret analysis

Regret: $R_T \stackrel{\text{def}}{=} \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(I_t)\Big] - \min_{i=1,\dots,N} \sum_{t=1}^{T} \ell_t(i)$; we want $R_T = o(T)$.

Lower bound using random losses [Experts' paper, 1997]
  • $\ell_t(i) \to L_t(i) \in \{0,1\}$, an independent random coin flip
  • For any player strategy, $\mathbb{E}\Big[\sum_{t=1}^{T} L_t(I_t)\Big] = \frac{T}{2}$
  • Then the expected regret is $\mathbb{E}\Big[\max_{i=1,\dots,N} \sum_{t=1}^{T} \Big(\frac12 - L_t(i)\Big)\Big] = \big(1 - o(1)\big)\sqrt{\frac{T \ln N}{2}}$ for $N, T \to \infty$

SLIDE 21

Exponentially weighted forecaster (Hedge)

At time $t$ pick action $I_t = i$ with probability proportional to $\exp\Big(-\eta \sum_{s=1}^{t-1} \ell_s(i)\Big)$; the sum in the exponent is the total loss of action $i$ up to now.

Regret bound [Experts' paper, 1997]: if $\eta = \sqrt{(8 \ln N)/T}$ then
$R_T \le \sqrt{\frac{T \ln N}{2}}$
matching the lower bound including constants. The dynamic choice $\eta_t = \sqrt{(8 \ln N)/t}$ only loses small constants.
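A minimal sketch of Hedge under the assumptions above (losses in $[0,1]$, fixed $\eta$); the demo data and array shapes are illustrative, and the tuning follows the slide.

```python
import numpy as np

def hedge(losses, eta, seed=0):
    """Exponentially weighted forecaster on a T x N array of losses in [0,1].

    Each round, action i is drawn with probability proportional to
    exp(-eta * cumulative loss of i). Returns the sequence of chosen actions.
    """
    T, N = losses.shape
    cum_loss = np.zeros(N)
    rng = np.random.default_rng(seed)
    actions = []
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shift for stability
        p = w / w.sum()
        actions.append(rng.choice(N, p=p))
        cum_loss += losses[t]           # full information: all losses revealed
    return actions

# Example: T rounds, N actions, eta tuned as on the slide.
T, N = 1000, 10
losses = np.random.default_rng(1).random((T, N))
hedge(losses, eta=np.sqrt(8 * np.log(N) / T))
```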

SLIDE 25

The nonstochastic bandit problem

For $t = 1, 2, \dots$

1. The player picks an action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
2. The player gets partial information: only $\ell_t(I_t)$ is revealed

The player is still competing against the best offline action:
$R_T = \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(I_t)\Big] - \min_{i=1,\dots,N} \sum_{t=1}^{T} \ell_t(i)$


SLIDE 27

The Exp3 algorithm [Auer et al., 2002]

Hedge with estimated losses (see the sketch below):
$P_t(I_t = i) \propto \exp\Big(-\eta \sum_{s=1}^{t-1} \widehat{\ell}_s(i)\Big)$, $i = 1, \dots, N$, where
$\widehat{\ell}_t(i) = \dfrac{\ell_t(i)}{P_t\big(\ell_t(i) \text{ observed}\big)}$ if $I_t = i$, and $0$ otherwise.
Only one non-zero component in $\widehat{\ell}_t$.

Properties of the importance-weighting estimator:
  • $\mathbb{E}_t\big[\widehat{\ell}_t(i)\big] = \ell_t(i)$ (unbiasedness)
  • $\mathbb{E}_t\big[\widehat{\ell}_t(i)^2\big] \le \dfrac{1}{P_t\big(\ell_t(i) \text{ observed}\big)}$ (variance control)
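A minimal Exp3 sketch under the slide's assumptions (losses in $[0,1]$); in the bandit case $P_t(\ell_t(i)\text{ observed}) = P_t(I_t = i)$, so the played arm's loss is divided by its probability. The $\eta$ in the demo is one standard tuning, not taken from the slide.

```python
import numpy as np

def exp3(loss_fn, N, T, eta, seed=0):
    """Exp3: Hedge run on importance-weighted loss estimates (bandit feedback).

    loss_fn(t, i) returns the loss of arm i at round t, in [0, 1];
    only the played arm's loss is used, as in the bandit protocol.
    """
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(N)                  # cumulative estimated losses
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        i = rng.choice(N, p=p)             # play I_t ~ p_t
        ell = loss_fn(t, i)                # only this loss is observed
        total += ell
        cum_est[i] += ell / p[i]           # unbiased: E_t[estimate] = true loss
    return total

# Example run: 5 arms, arm 0 slightly better on average.
exp3(lambda t, i: 0.3 if i == 0 else 0.5, N=5, T=10000,
     eta=np.sqrt(2 * np.log(5) / (5 * 10000)))
```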


SLIDE 30

Exp3 regret bound

$R_T \le \frac{\ln N}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Bigg[\sum_{t=1}^{T}\sum_{i=1}^{N} P_t(I_t = i)\,\mathbb{E}_t\big[\widehat{\ell}_t(i)^2\big]\Bigg] \le \frac{\ln N}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Bigg[\sum_{t=1}^{T}\sum_{i=1}^{N} \frac{P_t(I_t = i)}{P_t\big(\ell_t(i) \text{ is observed}\big)}\Bigg] = \frac{\ln N}{\eta} + \frac{\eta}{2}\,NT = \sqrt{NT \ln N}$

Lower bound: $\Omega\big(\sqrt{NT}\big)$. Improved matching upper bound by [Audibert and Bubeck, 2009].

The full information (experts) setting: the player observes the vector of losses $\ell_t$ after each play, so $P_t\big(\ell_t(i) \text{ is observed}\big) = 1$ and $R_T \le \sqrt{T \ln N}$.

SLIDE 34

Nonoblivious opponents

The adaptive adversary: the loss of action $i$ at time $t$ depends on the player's past $m$ actions, $\ell_t(i) \to \ell_t(I_{t-m}, \dots, I_{t-1}, i)$. Example: bandits with switching costs.

Nonoblivious regret:
$R_T^{\text{non}} = \mathbb{E}\Bigg[\sum_{t=1}^{T} \ell_t(I_{t-m},\dots,I_{t-1},I_t) - \min_{i=1,\dots,N} \sum_{t=1}^{T} \ell_t(I_{t-m},\dots,I_{t-1},i)\Bigg]$

Policy regret:
$R_T^{\text{pol}} = \mathbb{E}\Bigg[\sum_{t=1}^{T} \ell_t(I_{t-m},\dots,I_{t-1},I_t) - \min_{i=1,\dots,N} \sum_{t=1}^{T} \ell_t(\underbrace{i,\dots,i}_{m \text{ times}}, i)\Bigg]$


SLIDE 36

Bandits and reactive opponents

Bounds on the nonoblivious regret (even when $m$ depends on $T$):
$R_T^{\text{non}} = O\big(\sqrt{TN \ln N}\big)$, achieved by Exp3 with biased loss estimates. Is the $\sqrt{\ln N}$ factor necessary?

Bounds on the policy regret for any constant $m \ge 1$:
$R_T^{\text{pol}} = O\big((N \ln N)^{1/3}\, T^{2/3}\big)$
achieved by a very simple player strategy, and optimal up to log factors! [Dekel, Koren, and Peres, 2014]

SLIDE 37

Partial monitoring: not observing any loss

Dynamic pricing: perform as well as the best fixed price.

1. Post a T-shirt price
2. Observe whether the next customer buys or not
3. Adjust the price

Feedback does not reveal the player's loss:

Loss matrix (row = posted price, column = customer's valuation):

        1  2  3  4  5
    1   0  1  2  3  4
    2   c  0  1  2  3
    3   c  c  0  1  2
    4   c  c  c  0  1
    5   c  c  c  c  0

Feedback matrix (1 = the customer buys, blank = no sale):

        1  2  3  4  5
    1   1  1  1  1  1
    2      1  1  1  1
    3         1  1  1
    4            1  1
    5               1


SLIDE 39

A characterization of minimax regret

Special case: multiarmed bandits, where the loss and feedback matrices are the same.

A general gap theorem [Bartók, Foster, Pál, Rakhlin and Szepesvári, 2013]: a constructive characterization of the minimax regret for any pair of loss/feedback matrices. Only three possible rates for nontrivial games:

1. Easy games (e.g., bandits): $\Theta\big(\sqrt{T}\big)$
2. Hard games (e.g., revealing action): $\Theta\big(T^{2/3}\big)$
3. Impossible games: $\Theta(T)$

SLIDE 40

A game equivalent to prediction with expert advice

Online linear optimization in the simplex:

1. Play $p_t$ from the $N$-dimensional simplex $\Delta_N$
2. Incur the linear loss $\mathbb{E}\big[\ell_t(I_t)\big] = p_t^\top \ell_t$
3. Observe the loss gradient $\ell_t$

Regret: compete against the best point in the simplex,
$\sum_{t=1}^{T} p_t^\top \ell_t - \min_{q\in\Delta_N} \sum_{t=1}^{T} q^\top \ell_t = \sum_{t=1}^{T} p_t^\top \ell_t - \min_{i=1,\dots,N} \sum_{t=1}^{T} \ell_t(i)$
(a linear function over the simplex is minimized at a vertex)

SLIDE 41

From game theory to machine learning

[Diagram: unlabeled data flows into the classification system, which outputs a guessed label; the opponent supplies the true label.]

  • The opponent's moves $y_t$ are viewed as values or labels assigned to observations $x_t \in \mathbb{R}^d$ (e.g., categories of documents)
  • A repeated game between the player choosing an element $w_t$ of a linear space and the opponent choosing a label $y_t$ for $x_t$
  • Regret with respect to the best element in the linear space

SLIDE 42

Summary

1. My beautiful regret
2. A supposedly fun game I'll play again
3. The joy of convex


SLIDE 45

Online convex optimization [Zinkevich, 2003]

1. Play $w_t$ from a convex and compact subset $S$ of a linear space
2. Observe a convex loss $\ell_t : S \to \mathbb{R}$ and pay $\ell_t(w_t)$
3. Update: $w_t \to w_{t+1} \in S$

Examples:
  • Regression with the square loss: $\ell_t(w) = \big(w^\top x_t - y_t\big)^2$, $y_t \in \mathbb{R}$
  • Classification with the hinge loss: $\ell_t(w) = \big[1 - y_t\, w^\top x_t\big]_+$, $y_t \in \{-1,+1\}$

Regret: $R_T(u) = \sum_{t=1}^{T} \ell_t(w_t) - \sum_{t=1}^{T} \ell_t(u)$, $u \in S$

SLIDE 47

Finding a good online algorithm

Follow the leader: $w_{t+1} = \operatorname{arginf}_{w\in S} \sum_{s=1}^{t} \ell_s(w)$

Regret can be linear due to lack of stability (verified numerically below). Take $S = [-1,+1]$ with
$\ell_1(w) = \frac{w}{2}$, and $\ell_t(w) = -w$ if $t$ is even, $+w$ if $t$ is odd.

Note: $\sum_{s=1}^{t} \ell_s(w) = -\frac{w}{2}$ if $t$ is even, $+\frac{w}{2}$ if $t$ is odd. Hence $\ell_{t+1}(w_{t+1}) = 1$ for all $t = 1, 2, \dots$
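The counterexample is easy to check numerically; a small sketch (checking the endpoints suffices because the cumulative loss is linear in $w$):

```python
# Numerical check of the FTL counterexample: the leader flips sign every
# round, so FTL pays loss ~1 per round while the fixed point u = 0 pays 0.
def ftl_counterexample(T):
    def loss(t, w):                          # losses from the slide, t = 1, 2, ...
        if t == 1:
            return w / 2
        return -w if t % 2 == 0 else w

    total = 0.0
    w = 0.0                                  # arbitrary first play
    for t in range(1, T + 1):
        total += loss(t, w)
        # Follow the leader: the cumulative loss is linear in w, so its
        # minimizer over S = [-1, +1] is always one of the two endpoints.
        w = min((-1.0, 1.0), key=lambda v: sum(loss(s, v) for s in range(1, t + 1)))
    return total

print(ftl_counterexample(100))               # 99.0: linear regret
```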

SLIDE 48

Follow the regularized leader [Shalev-Shwartz, 2007; Abernethy, Hazan and Rakhlin, 2008]

$w_{t+1} = \operatorname{argmin}_{w\in S} \Bigg(\eta \sum_{s=1}^{t} \ell_s(w) + \Phi(w)\Bigg)$

where $\Phi$ is a strongly convex regularizer and $\eta > 0$ is a scale parameter.


SLIDE 51

Convexity, smoothness, and duality

Strong convexity: $\Phi : S \to \mathbb{R}$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ if for all $u, v \in S$
$\Phi(v) \ge \Phi(u) + \nabla\Phi(u)^\top (v-u) + \frac{\beta}{2}\,\|u-v\|^2$

Smoothness: $\Phi : S \to \mathbb{R}$ is $\alpha$-smooth w.r.t. a norm $\|\cdot\|$ if for all $u, v \in S$
$\Phi(v) \le \Phi(u) + \nabla\Phi(u)^\top (v-u) + \frac{\alpha}{2}\,\|u-v\|^2$

If $\Phi$ is $\beta$-strongly convex w.r.t. $\|\cdot\|_2$, then $\nabla^2\Phi \succeq \beta I$; if $\Phi$ is $\alpha$-smooth w.r.t. $\|\cdot\|_2$, then $\nabla^2\Phi \preceq \alpha I$.


SLIDE 55

Examples

  • Euclidean norm: $\Phi = \frac12\|\cdot\|_2^2$ is 1-strongly convex w.r.t. $\|\cdot\|_2$
  • $p$-norm: $\Phi = \frac12\|\cdot\|_p^2$ is $(p-1)$-strongly convex w.r.t. $\|\cdot\|_p$ (for $1 < p \le 2$)
  • Entropy: $\Phi(p) = \sum_{i=1}^{d} p_i \ln p_i$ is 1-strongly convex w.r.t. $\|\cdot\|_1$ (for $p$ in the probability simplex)
  • Power norm: $\Phi(w) = \frac12\, w^\top A w$ is 1-strongly convex w.r.t. $\|w\| = \sqrt{w^\top A w}$ (for $A$ symmetric and positive definite)

SLIDE 56

Convex duality

Definition: the convex dual of $\Phi$ is $\Phi^*(\theta) = \max_{w\in S} \big(\theta^\top w - \Phi(w)\big)$.

A 1-dimensional example: take a convex $f : \mathbb{R} \to \mathbb{R}$ such that $f(0) = 0$, so that
$f^*(\theta) = \max_{w\in\mathbb{R}} \big(w\theta - f(w)\big)$
The maximizer is the $w_0$ such that $f'(w_0) = \theta$, which gives $f^*(\theta) = w_0 f'(w_0) - f(w_0)$. As $f(0) = 0$, $f^*(\theta)$ is the error in approximating $f(0)$ with a first-order expansion around $f(w_0)$.

SLIDE 57

Convex duality

[Figure illustrating the construction above; thanks to Shai Shalev-Shwartz for the image.]

SLIDE 61

Convexity, smoothness, and duality

Examples:
  • Euclidean norm: $\Phi = \frac12\|\cdot\|_2^2$ and $\Phi^* = \Phi$
  • $p$-norm: $\Phi = \frac12\|\cdot\|_p^2$ and $\Phi^* = \frac12\|\cdot\|_q^2$ where $\frac1p + \frac1q = 1$
  • Entropy: $\Phi(p) = \sum_{i=1}^{d} p_i \ln p_i$ and $\Phi^*(\theta) = \ln\big(e^{\theta_1} + \cdots + e^{\theta_d}\big)$
  • Power norm: $\Phi(w) = \frac12\, w^\top A w$ and $\Phi^*(\theta) = \frac12\,\theta^\top A^{-1}\theta$


SLIDE 63

Some useful properties

If $\Phi : S \to \mathbb{R}$ is $\beta$-strongly convex w.r.t. $\|\cdot\|$, then its convex dual $\Phi^*$ is everywhere differentiable and $\frac1\beta$-smooth w.r.t. $\|\cdot\|_*$ (the dual norm of $\|\cdot\|$), with
$\nabla\Phi^*(\theta) = \operatorname{argmax}_{w\in S}\big(\theta^\top w - \Phi(w)\big)$

Recall follow the regularized leader (FTRL):
$w_{t+1} = \operatorname{argmin}_{w\in S}\Bigg(\eta \sum_{s=1}^{t} \ell_s(w) + \Phi(w)\Bigg)$
where $\Phi$ is a strongly convex regularizer and $\eta > 0$ is a scale parameter.

SLIDE 64

Using the loss gradient

Linearization of convex losses (write $\ell_t$ for the gradient $\nabla\ell_t(w_t)$):
$\ell_t(w_t) - \ell_t(u) \le \nabla\ell_t(w_t)^\top w_t - \nabla\ell_t(w_t)^\top u$

FTRL with linearized losses (set $\theta_{t+1} = -\eta\sum_{s=1}^{t} \ell_s$):
$w_{t+1} = \operatorname{argmin}_{w\in S}\Bigg(\eta \sum_{s=1}^{t} \ell_s^\top w + \Phi(w)\Bigg) = \operatorname{argmax}_{w\in S}\big(\theta_{t+1}^\top w - \Phi(w)\big) = \nabla\Phi^*\big(\theta_{t+1}\big)$

Note: $w_{t+1} \in S$ always holds.

SLIDE 65

The Mirror Descent algorithm [Nemirovsky and Yudin, 1983]

Recall: $w_{t+1} = \nabla\Phi^*\big(\theta_{t+1}\big) = \nabla\Phi^*\Bigg(-\eta \sum_{s=1}^{t} \nabla\ell_s(w_s)\Bigg)$

Online Mirror Descent (FTRL with linearized losses)
Parameters: strongly convex regularizer $\Phi$ with domain $S$, $\eta > 0$
Initialize: $\theta_1 = 0$  // primal parameter
For $t = 1, 2, \dots$

1. Use $w_t = \nabla\Phi^*(\theta_t)$  // dual parameter (via mirror step)
2. Suffer loss $\ell_t(w_t)$
3. Observe the loss gradient $\nabla\ell_t(w_t)$
4. Update $\theta_{t+1} = \theta_t - \eta\,\nabla\ell_t(w_t)$  // gradient step
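A compact sketch of this loop. As one concrete instantiation (chosen here for illustration, not prescribed by the slide), take the entropic regularizer on the simplex, for which the mirror step $\nabla\Phi^*(\theta)$ is the softmax map:

```python
import numpy as np

def omd_entropy(grad_fn, d, T, eta):
    """Online Mirror Descent with the entropic regularizer on the simplex.

    For Phi(p) = sum_i p_i log p_i restricted to the simplex, the mirror map
    is grad Phi*(theta) = softmax(theta), so step 1 below is a closed form.
    grad_fn(t, w) returns the loss gradient at w.
    """
    theta = np.zeros(d)                      # primal parameter, theta_1 = 0
    plays = []
    for t in range(T):
        z = np.exp(theta - theta.max())      # mirror step: w_t = softmax(theta_t)
        w = z / z.sum()
        plays.append(w)
        g = grad_fn(t, w)                    # observe the loss gradient
        theta = theta - eta * g              # gradient step in the dual
    return plays

# Example: linear losses l_t(w) = w . c_t, so the gradient is c_t.
costs = np.random.default_rng(0).random((500, 4))
omd_entropy(lambda t, w: costs[t], d=4, T=500, eta=0.1)
```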


SLIDE 67

An equivalent formulation

Under some assumptions on the regularizer $\Phi$, OMD can be equivalently written in terms of projected gradient descent.

Online Mirror Descent (alternative version)
Parameters: strongly convex regularizer $\Phi$ and learning rate $\eta > 0$
Initialize: $z_1 = \nabla\Phi^*(0)$ and $w_1 = \operatorname{argmin}_{w\in S} D_\Phi(w, z_1)$
For $t = 1, 2, \dots$

1. Use $w_t$ and suffer loss $\ell_t(w_t)$
2. Observe the loss gradient $\nabla\ell_t(w_t)$
3. Update $z_{t+1} = \nabla\Phi^*\big(\nabla\Phi(z_t) - \eta\,\nabla\ell_t(w_t)\big)$  // gradient step
4. $w_{t+1} = \operatorname{argmin}_{w\in S} D_\Phi(w, z_{t+1})$  // projection step

$D_\Phi$ is the Bregman divergence induced by $\Phi$.


SLIDE 69

Some examples

Online Gradient Descent (OGD) [Zinkevich, 2003; Gentile, 2003]
$\Phi(w) = \frac12\|w\|^2$; $p$-norm version: $\Phi(w) = \frac12\|w\|_p^2$
Update: $w' = w_t - \eta\,\nabla\ell_t(w_t)$, then $w_{t+1} = \operatorname{arginf}_{w\in S}\|w - w'\|_2$ (a code sketch follows below)

Exponentiated Gradient (EG) [Kivinen and Warmuth, 1997]
$\Phi(p) = \sum_{i=1}^{d} p_i \ln p_i$, $p \in S \equiv$ the simplex
$p_{t+1,i} = \dfrac{p_{t,i}\, e^{-\eta\,\nabla\ell_t(p_t)_i}}{\sum_{j=1}^{d} p_{t,j}\, e^{-\eta\,\nabla\ell_t(p_t)_j}}$
Note: when the losses are linear, this is Hedge.
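For comparison with the EG instantiation of OMD above, here is a sketch of OGD with $S$ an L2 ball of radius $B$, where the projection step has a closed form (rescaling); $B$ and the demo losses are illustrative.

```python
import numpy as np

def ogd_ball(grad_fn, d, T, eta, B):
    """Online Gradient Descent over S = {w : ||w||_2 <= B}.

    Update: w' = w_t - eta * grad, then project w' back onto the ball
    (the Euclidean projection just rescales vectors that are too long).
    """
    w = np.zeros(d)
    plays = []
    for t in range(T):
        plays.append(w.copy())
        g = grad_fn(t, w)
        w = w - eta * g                      # gradient step
        norm = np.linalg.norm(w)
        if norm > B:                         # projection onto the L2 ball
            w = w * (B / norm)
    return plays

# Example: square loss l_t(w) = (w.x_t - y_t)^2, gradient 2*(w.x_t - y_t)*x_t.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 3)), rng.normal(size=200)
ogd_ball(lambda t, w: 2 * (w @ X[t] - y[t]) * X[t], d=3, T=200, eta=0.05, B=1.0)
```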

SLIDE 70

Regret analysis

Regret bound [Kakade, Shalev-Shwartz and Tewari, 2012]:
$R_T(u) \le \frac{\Phi(u) - \min_{w\in S}\Phi(w)}{\eta} + \frac{\eta}{2\beta} \sum_{t=1}^{T} \|\nabla\ell_t(w_t)\|_*^2$
for all $u \in S$, where $\ell_1, \ell_2, \dots$ are arbitrary convex losses.

$R_T(u) \le GD\sqrt{T}$ for all $u \in S$ when $\eta$ is tuned w.r.t.
$\sup_{w\in S} \|\nabla\ell_t(w)\|_* \le G$ and $\sup_{u,w\in S}\big(\Phi(u) - \Phi(w)\big) \le D$

Boundedness of the gradients of $\ell_t$ w.r.t. $\|\cdot\|_*$ is equivalent to Lipschitzness of $\ell_t$ w.r.t. $\|\cdot\|$. The regret bound is optimal for general convex losses $\ell_t$.

SLIDE 73

Analysis relies on smoothness of Φ*

$\Phi^*(\theta_{t+1}) - \Phi^*(\theta_t) \le \underbrace{\nabla\Phi^*(\theta_t)}_{w_t}{}^{\!\top} \underbrace{(\theta_{t+1} - \theta_t)}_{-\eta\nabla\ell_t(w_t)} + \frac{1}{2\beta}\,\|\theta_{t+1} - \theta_t\|_*^2$

$\sum_{t=1}^{T}\big(-\eta\, u^\top\nabla\ell_t(w_t)\big) - \Phi(u) = u^\top\theta_{T+1} - \Phi(u) \le \Phi^*\big(\theta_{T+1}\big)$  (Fenchel–Young inequality)
$= \sum_{t=1}^{T}\Big(\Phi^*\big(\theta_{t+1}\big) - \Phi^*\big(\theta_t\big)\Big) + \Phi^*\big(\theta_1\big) \le \sum_{t=1}^{T}\Big(-\eta\, w_t^\top\nabla\ell_t(w_t) + \frac{\eta^2}{2\beta}\,\|\nabla\ell_t(w_t)\|_*^2\Big) + \Phi^*(0)$

$\Phi^*(0) = \max_{w\in S}\big(w^\top 0 - \Phi(w)\big) = -\min_{w\in S} \Phi(w)$

SLIDE 76

Some examples

Losses of the form $\ell_t(w) = \ell_t\big(w^\top x_t\big)$ with $\max_t |\ell_t'| \le L$ and $\max_t \|x_t\|_p \le X_p$.

Bounds for OGD with convex losses:
$R_T(u) \le B L X_2 \sqrt{T} = O\big(L\sqrt{dT}\big)$ for all $u$ such that $\|u\|_2 \le B$

Bounds logarithmic in the dimension: for EG run in the simplex, $S = \Delta_d$,
$R_T(q) \le L X_\infty \sqrt{(\ln d)\,T} = O\big(L\sqrt{(\ln d)\,T}\big)$, $q \in \Delta_d$
The same bound holds for the $p$-norm regularizer with $p = \frac{\ln d}{\ln d - 1}$. If the losses are linear with coefficients in $[0,1]$, we recover the bound for Hedge.


SLIDE 78

Exploiting curvature: minimization of the SVM objective

Training set $(x_1,y_1),\dots,(x_m,y_m) \in \mathbb{R}^d \times \{-1,+1\}$. SVM objective over $\mathbb{R}^d$:
$F(w) = \frac1m \sum_{t=1}^{m} \underbrace{\big[1 - y_t\, w^\top x_t\big]_+}_{\text{hinge loss } h_t(w)} + \frac{\lambda}{2}\,\|w\|^2$

Rewrite $F(w) = \frac1m \sum_{t=1}^{m} \ell_t(w)$ where $\ell_t(w) = h_t(w) + \frac{\lambda}{2}\|w\|^2$. Each loss $\ell_t$ is $\lambda$-strongly convex.

The Pegasos algorithm: run OGD on a random sequence of $T$ training examples. Then
$\mathbb{E}\Bigg[F\Bigg(\frac1T \sum_{t=1}^{T} w_t\Bigg)\Bigg] \le \min_{w\in\mathbb{R}^d} F(w) + \frac{G^2}{2\lambda}\,\frac{\ln T + 1}{T}$
$O(\ln T)$ rates hold for any sequence of strongly convex losses.
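A Pegasos-style sketch. The $1/(\lambda t)$ step size is the usual choice for OGD on $\lambda$-strongly convex losses and is assumed here rather than stated on the slide; the averaged iterate matches the quantity in the bound.

```python
import numpy as np

def pegasos(X, y, lam, T, seed=0):
    """Pegasos sketch: OGD on l_t(w) = hinge(w) + (lam/2)||w||^2, step 1/(lam*t).

    Samples one training example per round and returns the averaged iterate.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                  # random training example
        eta_t = 1.0 / (lam * t)
        if y[i] * (w @ X[i]) < 1:            # subgradient of the hinge part
            grad = lam * w - y[i] * X[i]
        else:
            grad = lam * w
        w = w - eta_t * grad
        w_sum += w
    return w_sum / T                         # averaged iterate, as in the bound

# Example on synthetic linearly separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = np.sign(X @ np.ones(5) + 1e-9)
w_bar = pegasos(X, y, lam=0.1, T=5000)
```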

SLIDE 79

Exp-concave losses

Exp-concavity (strong convexity along the gradient direction): a convex $\ell : S \to \mathbb{R}$ is $\alpha$-exp-concave when $g(w) = e^{-\alpha\,\ell(w)}$ is concave. For twice-differentiable losses this is
$\nabla^2\ell(w) \succeq \alpha\,\nabla\ell(w)\,\nabla\ell(w)^\top$ for all $w \in S$
Example: $\ell_t(w) = -\ln\big(w^\top x_t\big)$ is exp-concave.


SLIDE 82

Online Newton Step [Hazan, Agarwal and Kale, 2007]

Update: $w' = w_t - A_t^{-1}\,\nabla\ell_t(w_t)$, then $w_{t+1} = \operatorname{argmin}_{w\in S} \|w - w'\|_{A_t}$
where $A_t = \varepsilon I + \sum_{s=1}^{t} \nabla\ell_s(w_s)\,\nabla\ell_s(w_s)^\top$. Note: not an instance of OMD.

Logarithmic regret bound for exp-concave losses:
$R_T(u) \le 5d\,\Big(\frac1\alpha + GD\Big)\ln(T+1)$, $u \in S$

Extension of ONS to convex losses [Luo, Agarwal, C-B, Langford, 2016]: for $\ell_t(w) = \ell_t\big(w^\top x_t\big)$ with $\max_t|\ell_t'| \le L$,
$R_T(u) \le O\big(CL\sqrt{dT}\big)$ for all $u$ s.t. $\big|u^\top x_t\big| \le C$
Invariance to linear transformations of the data.
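A sketch of the ONS update. The generalized projection onto $S$ in the $\|\cdot\|_{A_t}$ norm is a small quadratic program with no closed form in general, so this sketch omits it (i.e., it assumes the iterates stay inside $S$); the demo loss is the slide's exp-concave example.

```python
import numpy as np

def online_newton_step(grad_fn, d, T, eps=1.0, w0=None):
    """Online Newton Step sketch (projection omitted).

    Maintains A_t = eps*I + sum_s g_s g_s^T and sets w' = w_t - A_t^{-1} g_t.
    The generalized projection argmin_{w in S} ||w - w'||_{A_t} is skipped,
    i.e. this sketch assumes the iterates stay inside S.
    """
    w = np.zeros(d) if w0 is None else np.asarray(w0, dtype=float).copy()
    A = eps * np.eye(d)
    plays = []
    for t in range(T):
        plays.append(w.copy())
        g = grad_fn(t, w)
        A += np.outer(g, g)                  # rank-one update of A_t
        w = w - np.linalg.solve(A, g)        # Newton-style step
    return plays

# Example: exp-concave losses l_t(w) = -log(w . x_t), gradient -x_t/(w . x_t),
# started from an interior point so that w . x_t stays positive.
rng = np.random.default_rng(0)
X = rng.random((300, 3)) + 0.1
grad = lambda t, w: -X[t] / max(w @ X[t], 1e-3)   # guard against w . x <= 0
online_newton_step(grad, d=3, T=300, w0=np.ones(3) / 3)
```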

SLIDE 83

Online Ridge Regression [Vovk, 2001; Azoury and Warmuth, 2001]

Logarithmic regret for the square loss $\ell_t(u) = \big(u^\top x_t - y_t\big)^2$, with $Y = \max_{t=1,\dots,T}|y_t|$ and $X = \max_{t=1,\dots,T}\|x_t\|$.

OMD with the adaptive regularizer $\Phi_t(w) = \frac12\|w\|_{A_t}^2$, where
$A_t = I + \sum_{s=1}^{t} x_s x_s^\top$ and $\theta_t = -\sum_{s=1}^{t} y_s x_s$

Regret bound (oracle inequality):
$\sum_{t=1}^{T} \ell_t(w_t) \le \inf_{u\in\mathbb{R}^d}\Bigg(\sum_{t=1}^{T} \ell_t(u) + \|u\|^2\Bigg) + dY^2 \ln\Big(1 + \frac{TX^2}{d}\Big)$

Parameterless; scale-free (unbounded comparison set).
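A sketch of the resulting predictor in its familiar ridge form $w_t = A_{t-1}^{-1}\sum_{s<t} y_s x_s$ (the Vovk and Azoury–Warmuth forecasters differ in how $x_t$ enters $A_t$; this is the plain online ridge variant, assumed here for illustration):

```python
import numpy as np

def online_ridge(X, y):
    """Online ridge regression: predict with w_t = A^{-1} b before seeing y_t.

    A = I + sum_s x_s x_s^T and b = sum_s y_s x_s over the examples seen so
    far. Returns the sequence of predictions w_t . x_t.
    """
    T, d = X.shape
    A = np.eye(d)
    b = np.zeros(d)
    preds = []
    for t in range(T):
        w = np.linalg.solve(A, b)            # current ridge solution
        preds.append(w @ X[t])               # predict on x_t
        A += np.outer(X[t], X[t])            # then update with (x_t, y_t)
        b += y[t] * X[t]
    return np.array(preds)

# Example: noisy linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=400)
online_ridge(X, y)
```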

SLIDE 84

Scale-free algorithm for convex losses [Orabona and Pál, 2015]

OMD with the adaptive regularizer
$\Phi_t(w) = \Phi_0(w)\,\sqrt{\sum_{s=1}^{t-1} \|\nabla\ell_s(w_s)\|_*^2}$
where $\Phi_0$ is a $\beta$-strongly convex base regularizer.

Regret bound (oracle inequality) for convex loss functions $\ell_t$:
$\sum_{t=1}^{T} \ell_t(w_t) \le \inf_{u\in\mathbb{R}^d}\Bigg(\sum_{t=1}^{T} \ell_t(u) + \Big(\Phi_0(u) + \frac1\beta\Big)\max_t \|\nabla\ell_t(w_t)\|_*\,\sqrt{T}\Bigg)$

SLIDE 85

Regularization via stochastic smoothing

$w_{t+1} = \mathbb{E}_Z\Bigg[\operatorname{argmin}_{w\in S} \Bigg(\eta \sum_{s=1}^{t} \nabla\ell_s(w_s) + Z\Bigg)^{\!\top} w\Bigg]$

  • The distribution of $Z$ must be "stable" (small variance and small average sensitivity)
  • Regret bound similar to FTRL/OMD
  • For some choices of $Z$, FPL becomes equivalent to OMD [Abernethy, Lee, Sinha and Tewari, 2014]
  • Linear losses: the Follow the Perturbed Leader (FPL) algorithm [Kalai and Vempala, 2005], sketched below
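For linear losses over $N$ actions, FPL is one line per round: perturb the cumulative losses and play the argmin. A sketch with fresh exponential perturbations each round (a standard choice for FPL, assumed here; the tuning of $\eta$ is likewise illustrative):

```python
import numpy as np

def fpl(losses, eta, seed=0):
    """Follow the Perturbed Leader on a T x N array of linear losses.

    Each round, draw a fresh perturbation Z (exponential, scale 1/eta) and
    play the action minimizing cumulative loss minus perturbation.
    """
    T, N = losses.shape
    rng = np.random.default_rng(seed)
    cum = np.zeros(N)
    total = 0.0
    for t in range(T):
        z = rng.exponential(scale=1.0 / eta, size=N)  # fresh perturbation
        i = int(np.argmin(cum - z))                   # perturbed leader
        total += losses[t, i]
        cum += losses[t]
    return total

# Example with eta on the order of sqrt(log(N)/T), as for Hedge-type bounds.
T, N = 2000, 8
losses = np.random.default_rng(1).random((T, N))
fpl(losses, eta=np.sqrt(np.log(N) / T))
```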

SLIDE 86

Shifting regret [Herbster and Warmuth, 2001]

Nonstationarity: if the data source is not fitted well by any model in the class, then comparing to the best fixed model $u \in S$ is trivial. Compare instead to the best sequence $u_1, u_2, \dots \in S$ of models.

Shifting regret for OMD [Zinkevich, 2003]:
$\underbrace{\sum_{t=1}^{T} \ell_t(w_t)}_{\text{cumulative loss}} \le \inf_{u_1,\dots,u_T\in S}\Bigg(\underbrace{\sum_{t=1}^{T} \ell_t(u_t)}_{\text{model fit}} + \underbrace{\sum_{t=1}^{T} \|u_t - u_{t-1}\|}_{\text{shifting model cost}}\Bigg) + \operatorname{diam}(S) + \cdots$


SLIDE 89

Strongly adaptive regret [Daniely, Gonen, Shalev-Shwartz, 2015]

Definition: for all intervals $I = \{r, \dots, s\}$ with $1 \le r < s \le T$,
$R_{T,I}(u) = \sum_{t\in I} \ell_t(w_t) - \sum_{t\in I} \ell_t(u)$

Regret bound for strongly adaptive OGD:
$R_{T,I}(u) \le \big(BLX_2 + \ln(T+1)\big)\sqrt{|I|}$ for all $u$ such that $\|u\|_2 \le B$

Remarks: a generic black-box reduction, applicable to any online learning algorithm; it runs a logarithmic number of instances of the base learner.


SLIDE 94

Online bandit convex optimization

1. Play $w_t$ from a convex and compact subset $S$ of a linear space
2. Observe $\ell_t(w_t)$, where $\ell_t : S \to \mathbb{R}$ is an unobserved convex loss
3. Update: $w_t \to w_{t+1} \in S$

Regret: $R_T(u) = \sum_{t=1}^{T} \ell_t(w_t) - \sum_{t=1}^{T} \ell_t(u)$, $u \in S$

Results:
  • Linear losses: $\Omega\big(d\sqrt{T}\big)$ [Dani, Hayes, and Kakade, 2008]
  • Linear losses: $O\big(d\sqrt{T}\big)$ [Bubeck, C-B, and Kakade, 2012]
  • Strongly convex and smooth losses: $O\big(d^{3/2}\sqrt{T}\big)$ [Hazan and Levy, 2014]
  • Convex losses: $O\big(d^{9.5}\sqrt{T}\big)$ [Bubeck, Eldan, and Lee, 2016]