SLIDE 1

Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm

Jacob Steinhardt and Percy Liang

Stanford University

{jsteinhardt,pliang}@cs.stanford.edu

Jun 11, 2013

SLIDE 2

Setup

Setting is learning from experts: n experts, T rounds. For t = 1, ..., T:

  • Learner chooses a distribution $w_t \in \Delta_n$ over the experts
  • Nature reveals losses $z_t \in [-1,1]^n$ of the experts
  • Learner suffers loss $w_t^\top z_t$

Goal: minimize $\mathrm{Regret} \overset{\text{def}}{=} \sum_{t=1}^{T} w_t^\top z_t - \sum_{t=1}^{T} z_{t,i^*}$, where $i^*$ is the best fixed expert.

Typical algorithm: multiplicative weights (a.k.a. exponentiated gradient): $w_{t+1,i} \propto w_{t,i}\exp(-\eta z_{t,i})$.
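To make the setup concrete, here is a minimal Python sketch of the experts game with the multiplicative-weights update above. It is not from the slides; the function name, the step size, and the random losses are illustrative.

```python
import numpy as np

def mw_regret(losses, eta):
    """Play the experts game with the update w_{t+1,i} ∝ w_{t,i} exp(-eta * z_{t,i})."""
    n = losses.shape[1]
    w = np.full(n, 1.0 / n)              # uniform initial distribution over experts
    learner_loss = 0.0
    for z in losses:                      # z_t in [-1, 1]^n
        learner_loss += w @ z             # learner suffers w_t^T z_t
        w = w * np.exp(-eta * z)          # multiplicative-weights update
        w /= w.sum()                      # renormalize onto the simplex
    return learner_loss - losses.sum(axis=0).min()   # regret vs. best fixed expert

# Illustrative usage on random losses:
rng = np.random.default_rng(0)
Z = rng.uniform(-1.0, 1.0, size=(1000, 5))
print(mw_regret(Z, eta=np.sqrt(np.log(5) / 1000)))
```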

SLIDE 3

Outline

  • Compare two variants of the multiplicative weights (exponentiated gradient) algorithm.
  • Understand the difference through the lens of adaptive mirror descent (Orabona et al., 2013).
  • Combine with the machinery of optimistic updates (Rakhlin & Sridharan, 2012) to beat the best existing bounds.

SLIDE 4

Two Types of Updates

In the literature there are two similar but different updates (Kivinen & Warmuth, 1997; Cesa-Bianchi et al., 2007):

  $w_{t+1,i} \propto w_{t,i}\exp(-\eta z_{t,i})$  (MW1)
  $w_{t+1,i} \propto w_{t,i}(1 - \eta z_{t,i})$  (MW2)

The regret is bounded as

  $\mathrm{Regret} \le \frac{\log(n)}{\eta} + \eta \sum_{t=1}^{T} \|z_t\|_\infty^2$  (Regret:MW1)
  $\mathrm{Regret} \le \frac{\log(n)}{\eta} + \eta \sum_{t=1}^{T} z_{t,i^*}^2$  (Regret:MW2)

If the best expert $i^*$ has loss close to zero, then the second bound is better than the first. The gap can be $\Theta(\sqrt{T})$ (in actual performance, not just in the upper bounds).
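A small sketch (illustrative, not the authors' code) contrasting the two updates on the same loss sequence; note that (MW2) needs $\eta < 1$ so the weights stay positive when $z_t \in [-1,1]^n$.

```python
import numpy as np

def run_variant(losses, eta, variant):
    """Regret of MW1 (exponential update) or MW2 (linearized update) on one loss sequence."""
    n = losses.shape[1]
    w = np.full(n, 1.0 / n)
    total = 0.0
    for z in losses:
        total += w @ z
        if variant == "MW1":
            w = w * np.exp(-eta * z)       # w_{t+1,i} ∝ w_{t,i} exp(-eta * z_{t,i})
        else:                              # "MW2"
            w = w * (1.0 - eta * z)        # w_{t+1,i} ∝ w_{t,i} (1 - eta * z_{t,i}); needs eta < 1
        w /= w.sum()
    return total - losses.sum(axis=0).min()
```

On loss sequences where the best expert's cumulative squared loss is small, the (MW2) run typically incurs visibly lower regret, consistent with the gap noted above.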

SLIDE 5

Two Types of Updates

Recall the two updates $w_{t+1,i} \propto w_{t,i}\exp(-\eta z_{t,i})$ (MW1) and $w_{t+1,i} \propto w_{t,i}(1 - \eta z_{t,i})$ (MW2). Mirror descent is the gold-standard meta-algorithm for online learning. How do (MW1) and (MW2) relate to it?

SLIDE 6

Two Types of Updates

Mirror descent is the gold-standard meta-algorithm for online learning; how do (MW1) and (MW2) relate to it? (MW1) is mirror descent with the regularizer $\frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$.

SLIDE 7

Two Types of Updates

(MW1) is mirror descent with the regularizer $\frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$. (MW2) is NOT mirror descent for any fixed regularizer.

SLIDE 8

Two Types of Updates

(MW1) is mirror descent with the regularizer $\frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$, but (MW2) is NOT mirror descent for any fixed regularizer. Unsettling: should we abandon mirror descent as a gold standard?

SLIDE 9

Two Types of Updates

(MW1) is mirror descent with the regularizer $\frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$, but (MW2) is NOT mirror descent for any fixed regularizer. Should we abandon mirror descent as a gold standard?

No: we can cast (MW2) as adaptive mirror descent (Orabona et al., 2013).

SLIDE 10

Adaptive Mirror Descent to the Rescue

Recall that mirror descent is the (meta-)algorithm

  $w_t = \operatorname*{argmin}_{w} \; \psi(w) + \sum_{s=1}^{t-1} w^\top z_s.$

For $\psi(w) = \frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$, we recover (MW1).
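As a quick sanity check (not from the slides): with the entropy regularizer, the argmin over the simplex has a softmax closed form, $w_t \propto \exp(-\eta \sum_{s<t} z_s)$, which is exactly the (MW1) iterate. A minimal sketch with illustrative random losses:

```python
import numpy as np

rng = np.random.default_rng(0)
Z, eta = rng.uniform(-1, 1, size=(50, 4)), 0.1

w_mw1 = np.full(4, 0.25)          # iterative MW1 weights, started uniform
cumulative = np.zeros(4)          # running sum of past losses
for z in Z:
    logits = -eta * cumulative
    w_md = np.exp(logits - logits.max())   # stable softmax
    w_md /= w_md.sum()                     # closed-form mirror-descent solution
    assert np.allclose(w_md, w_mw1)        # matches the MW1 iterate
    cumulative += z
    w_mw1 = w_mw1 * np.exp(-eta * z)
    w_mw1 /= w_mw1.sum()
```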

SLIDE 11

Adaptive Mirror Descent to the Rescue

Recall that mirror descent is the (meta-)algorithm $w_t = \operatorname*{argmin}_{w} \psi(w) + \sum_{s=1}^{t-1} w^\top z_s$; for $\psi(w) = \frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$, we recover (MW1).

Adaptive mirror descent (Orabona et al., 2013) is the meta-algorithm

  $w_t = \operatorname*{argmin}_{w} \; \psi_t(w) + \sum_{s=1}^{t-1} w^\top z_s.$

SLIDE 13

Adaptive Mirror Descent to the Rescue

Recall: mirror descent is $w_t = \operatorname*{argmin}_{w} \psi(w) + \sum_{s=1}^{t-1} w^\top z_s$; for $\psi(w) = \frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i)$, we recover (MW1). Adaptive mirror descent (Orabona et al., 2013) is the meta-algorithm $w_t = \operatorname*{argmin}_{w} \psi_t(w) + \sum_{s=1}^{t-1} w^\top z_s$.

For $\psi_t(w) = \frac{1}{\eta}\sum_{i=1}^{n} w_i \log(w_i) + \eta \sum_{i=1}^{n}\sum_{s=1}^{t-1} w_i z_{s,i}^2$, we approximately recover (MW2):

  Update: $w_{t+1,i} \propto w_{t,i}\exp(-\eta z_{t,i} - \eta^2 z_{t,i}^2) \approx w_{t,i}(1 - \eta z_{t,i})$

This is enough to achieve the better regret bound. We can recover (MW2) exactly with a more complicated $\psi_t$.
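A sketch of this corrected update in code (illustrative; the log-weight vector and the stability shift are implementation details, not from the slides):

```python
import numpy as np

def adaptive_eg(losses, eta):
    """Exponentiated gradient with a second-order correction:
    w_{t+1,i} ∝ w_{t,i} exp(-eta*z_{t,i} - eta^2*z_{t,i}^2), which ≈ w_{t,i}(1 - eta*z_{t,i})."""
    beta = np.zeros(losses.shape[1])       # log-weights
    total = 0.0
    for z in losses:
        w = np.exp(beta - beta.max())      # shift by the max for numerical stability
        w /= w.sum()
        total += w @ z
        beta += -eta * z - (eta * z) ** 2  # the adaptive-mirror-descent update
    return total - losses.sum(axis=0).min()
```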

SLIDE 14

Advantages of Our Perspective

So far, we have cast (MW2) as adaptive mirror descent, with regularizer

  $\psi_t(w) = \sum_{i=1}^{n} w_i \left( \frac{1}{\eta}\log(w_i) + \eta \sum_{s=1}^{t-1} z_{s,i}^2 \right).$

This explains the better regret bound while staying within the mirror descent framework, which is nice. Our new perspective also allows us to apply lots of modern machinery:

  • optimistic updates (Rakhlin & Sridharan, 2012)
  • matrix multiplicative weights (Tsuda et al., 2005; Arora & Kale, 2007)

By "turning the crank", we get results that beat the state of the art!

SLIDE 15

Beating State of the Art

Existing regret bounds can be organized along two axes: optimism and adaptivity. The table is built up over the next few slides.

SLIDE 16

Beating State of the Art

Optimism / Adaptivity table so far: $S_\infty$ (Kivinen & Warmuth, 1997).

In the above we let $S_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t\|_\infty^2$.

SLIDE 17

Beating State of the Art

Optimism / Adaptivity table so far: $S_{i^*}$, $\max_i S_i$ (Cesa-Bianchi et al., 2007); $S_\infty$ (Kivinen & Warmuth, 1997).

In the above we let $S_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t\|_\infty^2$ and $S_i \overset{\text{def}}{=} \sum_{t=1}^{T} z_{t,i}^2$.

SLIDE 18

Beating State of the Art

Optimism / Adaptivity table so far: $\max_i V_i$, $V_\infty$ (Hazan & Kale, 2008); $S_{i^*}$, $\max_i S_i$ (Cesa-Bianchi et al., 2007); $S_\infty$ (Kivinen & Warmuth, 1997).

In the above we let $V_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t - \bar z\|_\infty^2$, $S_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t\|_\infty^2$, $V_i \overset{\text{def}}{=} \sum_{t=1}^{T} (z_{t,i} - \bar z_i)^2$, and $S_i \overset{\text{def}}{=} \sum_{t=1}^{T} z_{t,i}^2$.

SLIDE 19

Beating State of the Art

Optimism / Adaptivity table so far: $D_\infty$ (Chiang et al., 2012); $\max_i V_i$, $V_\infty$ (Hazan & Kale, 2008); $S_{i^*}$, $\max_i S_i$ (Cesa-Bianchi et al., 2007); $S_\infty$ (Kivinen & Warmuth, 1997).

In the above we let $D_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t - z_{t-1}\|_\infty^2$, $V_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t - \bar z\|_\infty^2$, $S_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t\|_\infty^2$, $V_i \overset{\text{def}}{=} \sum_{t=1}^{T} (z_{t,i} - \bar z_i)^2$, and $S_i \overset{\text{def}}{=} \sum_{t=1}^{T} z_{t,i}^2$.

SLIDE 20

Beating State of the Art

Full Optimism / Adaptivity table, with each bound attributed to its source (the optimism axis runs across the $S$/$V$/$D$ families; the adaptivity axis runs across the $\infty$-norm, $\max_i$, and best-expert $i^*$ variants):

  • $D_{i^*}$, $\max_i D_i$ (this work); $D_\infty$ (Chiang et al., 2012)
  • $V_{i^*}$ (this work); $\max_i V_i$, $V_\infty$ (Hazan & Kale, 2008)
  • $S_{i^*}$, $\max_i S_i$ (Cesa-Bianchi et al., 2007); $S_\infty$ (Kivinen & Warmuth, 1997)

In the above we let
  $D_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t - z_{t-1}\|_\infty^2$, $V_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t - \bar z\|_\infty^2$, $S_\infty \overset{\text{def}}{=} \sum_{t=1}^{T} \|z_t\|_\infty^2$,
  $D_i \overset{\text{def}}{=} \sum_{t=1}^{T} (z_{t,i} - z_{t-1,i})^2$, $V_i \overset{\text{def}}{=} \sum_{t=1}^{T} (z_{t,i} - \bar z_i)^2$, $S_i \overset{\text{def}}{=} \sum_{t=1}^{T} z_{t,i}^2$.
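For concreteness, a small sketch (not from the slides) computing these quantities from a T x n loss matrix; taking $z_0 = 0$ is an assumption the slides do not spell out.

```python
import numpy as np

def table_quantities(Z):
    """Given losses Z with shape (T, n), return the per-expert and infinity-norm quantities."""
    prev = np.vstack([np.zeros((1, Z.shape[1])), Z[:-1]])   # z_{t-1}, with z_0 := 0
    S_i = (Z ** 2).sum(axis=0)                              # sum_t z_{t,i}^2
    V_i = ((Z - Z.mean(axis=0)) ** 2).sum(axis=0)           # sum_t (z_{t,i} - mean_i)^2
    D_i = ((Z - prev) ** 2).sum(axis=0)                     # sum_t (z_{t,i} - z_{t-1,i})^2
    S_inf = (np.abs(Z).max(axis=1) ** 2).sum()              # sum_t ||z_t||_inf^2
    V_inf = (np.abs(Z - Z.mean(axis=0)).max(axis=1) ** 2).sum()
    D_inf = (np.abs(Z - prev).max(axis=1) ** 2).sum()
    return S_i, V_i, D_i, S_inf, V_inf, D_inf
```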

SLIDE 21

Optimistic Updates: A Brief Review

(Normal) mirror descent:

  $w_t = \operatorname*{argmin}_{w} \; \psi(w) + w^\top \sum_{s=1}^{t-1} z_s$

SLIDE 22

Optimistic Updates: A Brief Review

(Normal) mirror descent: $w_t = \operatorname*{argmin}_{w} \psi(w) + w^\top \sum_{s=1}^{t-1} z_s$.

Optimistic mirror descent (Rakhlin & Sridharan, 2012) adds a hint $m_t$:

  $w_t = \operatorname*{argmin}_{w} \; \psi(w) + w^\top \left( m_t + \sum_{s=1}^{t-1} z_s \right)$

The hint $m_t$ guesses the next term $z_t$ in the cost function, so we pay regret in terms of $z_t - m_t$ rather than $z_t$.

SLIDE 23

Optimistic Updates: A Brief Review

(Normal) mirror descent: $w_t = \operatorname*{argmin}_{w} \psi(w) + w^\top \sum_{s=1}^{t-1} z_s$.

Optimistic mirror descent (Rakhlin & Sridharan, 2012) adds a hint $m_t$:

  $w_t = \operatorname*{argmin}_{w} \; \psi(w) + w^\top \left( m_t + \sum_{s=1}^{t-1} z_s \right)$

The hint $m_t$ guesses the next term $z_t$ in the cost function, so we pay regret in terms of $z_t - m_t$ rather than $z_t$. Examples: $m_t = z_{t-1}$, or $m_t = \frac{1}{t}\sum_{s=1}^{t-1} z_s$.
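A minimal sketch of the optimistic update over the simplex with the entropy regularizer, using the hint $m_t = z_{t-1}$ (illustrative assumptions: the softmax closed form of the argmin and $m_1 = 0$; not the authors' code):

```python
import numpy as np

def optimistic_md(losses, eta):
    """w_t ∝ exp(-eta * (m_t + sum_{s<t} z_s)) with hint m_t = z_{t-1} (m_1 = 0)."""
    n = losses.shape[1]
    cumulative = np.zeros(n)
    m = np.zeros(n)                       # hint for the upcoming loss
    total = 0.0
    for z in losses:
        logits = -eta * (cumulative + m)  # play against the guessed cost m_t + sum_{s<t} z_s
        w = np.exp(logits - logits.max())
        w /= w.sum()
        total += w @ z
        cumulative += z
        m = z                             # next round's hint: the most recent loss
    return total - losses.sum(axis=0).min()
```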

SLIDE 24

Multiplicative Weights with Optimism

Name / Auxiliary update / Prediction ($w_t$):

  MW1:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i}$;  $w_{t,i} \propto \exp(\beta_{t,i})$

SLIDE 25

Multiplicative Weights with Optimism

Name / Auxiliary update / Prediction ($w_t$):

  MW1:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i}$;  $w_{t,i} \propto \exp(\beta_{t,i})$
  MW2:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 z_{t,i}^2$;  $w_{t,i} \propto \exp(\beta_{t,i})$

SLIDE 26

Multiplicative Weights with Optimism

Name / Auxiliary update / Prediction ($w_t$):

  MW1:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i}$;  $w_{t,i} \propto \exp(\beta_{t,i})$
  MW2:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 z_{t,i}^2$;  $w_{t,i} \propto \exp(\beta_{t,i})$
  MW3:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 (z_{t,i} - z_{t-1,i})^2$;  $w_{t,i} \propto \exp(\beta_{t,i} - \eta z_{t-1,i})$

SLIDE 27

Multiplicative Weights with Optimism

Name / Auxiliary update / Prediction ($w_t$):

  MW1:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i}$;  $w_{t,i} \propto \exp(\beta_{t,i})$
  MW2:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 z_{t,i}^2$;  $w_{t,i} \propto \exp(\beta_{t,i})$
  MW3:  $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 (z_{t,i} - z_{t-1,i})^2$;  $w_{t,i} \propto \exp(\beta_{t,i} - \eta z_{t-1,i})$

Regret of MW3:

  $\mathrm{Regret} \le \frac{\log(n)}{\eta} + \eta \sum_{t=1}^{T} (z_{t,i^*} - z_{t-1,i^*})^2$

This dominates all existing bounds in this setting!
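A sketch of MW3 following the auxiliary-update/prediction split in the table (an illustrative translation of the two displayed formulas; taking $z_0 = 0$ is an assumption):

```python
import numpy as np

def mw3(losses, eta):
    """MW3: beta_{t+1,i} = beta_{t,i} - eta*z_{t,i} - eta^2*(z_{t,i} - z_{t-1,i})^2,
    prediction w_{t,i} ∝ exp(beta_{t,i} - eta*z_{t-1,i})."""
    n = losses.shape[1]
    beta = np.zeros(n)
    z_prev = np.zeros(n)                         # z_0 := 0
    total = 0.0
    for z in losses:
        logits = beta - eta * z_prev             # optimistic prediction uses the last loss as hint
        w = np.exp(logits - logits.max())
        w /= w.sum()
        total += w @ z
        beta += -eta * z - eta**2 * (z - z_prev) ** 2   # auxiliary update
        z_prev = z
    return total - losses.sum(axis=0).min()
```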

SLIDE 28

Summary

  • Cast the multiplicative weights algorithm as adaptive mirror descent.
  • Applied the machinery of optimistic updates to beat the best existing bounds.
  • Also in the paper:
      - extension to general convex losses
      - extension to matrices
      - generalization of the FTRL lemma to convex cones
