SLIDE 1

Follow the Leader with Dropout Perturbations

Tim van Erven

COLT 2014

Joint work with: Wojciech Kotłowski and Manfred Warmuth

SLIDE 2

Neural Network

SLIDE 3

Neural Network

SLIDE 4

Dropout Training

  • Stochastic gradient descent
  • Randomly remove every hidden/input unit with probability 1/2 before each gradient descent update

[Hinton et al., 2012]
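
As a concrete illustration, here is a minimal sketch of one such update in Python/NumPy, assuming a toy one-hidden-layer network with ReLU units and squared loss; the network, function name and hyperparameters are illustrative, not from the talk:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_sgd_step(W1, W2, x, y, lr=0.1, p_drop=0.5):
        # Sample binary masks: each input/hidden unit is removed
        # with probability p_drop before this update.
        mask_in = rng.random(x.shape) >= p_drop
        x = x * mask_in
        h = np.maximum(0.0, W1 @ x)            # ReLU hidden layer
        mask_hid = rng.random(h.shape) >= p_drop
        h = h * mask_hid
        err = W2 @ h - y                       # squared-loss gradient at the output
        # Backpropagate only through the units that survived the masks.
        grad_W2 = np.outer(err, h)
        grad_h = (W2.T @ err) * mask_hid * (h > 0)
        grad_W1 = np.outer(grad_h, x)
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
        return W1, W2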

SLIDE 5

Dropout Training

  • Very successful in e.g. image classification, speech recognition
  • Many people are trying to analyse why it works

[Wager, Wang, Liang, 2013]

SLIDE 6

Prediction with Expert Advice

  • Every round t = 1, \dots, T:
  • 1. (Randomly) choose expert k_t \in \{1, \dots, K\}
  • 2. Observe expert losses \ell_t = (\ell_{t,1}, \dots, \ell_{t,K})
  • 3. Our loss is \ell_{t,k_t}

Goal: minimize expected regret

    R_T = \mathbb{E}\Big[\sum_{t=1}^T \ell_{t,k_t}\Big] - L_T^*, \quad where \quad L_T^* = \min_k \sum_{t=1}^T \ell_{t,k}

is the loss of the best expert.
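
The protocol is easy to pin down in code. Below is a small Monte-Carlo harness that plays a strategy against a fixed T x K loss matrix and estimates its expected regret; an illustrative sketch, with names of my choosing rather than the talk's:

    import numpy as np

    def expected_regret(choose, losses, n_runs=100):
        # `choose(past)` receives the (t x K) losses of all past rounds
        # and returns the index k_t of the expert to follow in round t.
        T, K = losses.shape
        totals = []
        for _ in range(n_runs):
            total = 0.0
            for t in range(T):
                k = choose(losses[:t])   # 1. (randomly) choose expert k_t
                total += losses[t, k]    # 2.+3. observe losses, suffer loss of k_t
            totals.append(total)
        best = losses.sum(axis=0).min()  # L*_T: cumulative loss of the best expert
        return sum(totals) / n_runs - best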

SLIDE 7

Follow-the-Leader

  • Deterministically choose the expert that has predicted best in the past:

    k_t = \arg\min_k L_{t-1,k}, \quad where \quad L_{t-1,k} = \sum_{s=1}^{t-1} \ell_{s,k},

is the leader.

  • Can be fooled: regret grows linearly in T for adversarial data.
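
In the notation of the harness above, FTL is one line plus tie-breaking, and the standard alternating-loss construction for two experts exhibits the linear regret (a sketch; this is the textbook bad sequence, not necessarily the one used in the talk):

    import numpy as np

    def ftl(past):
        # Deterministic leader: smallest cumulative loss so far
        # (expert 0 on the first round and on ties).
        return 0 if len(past) == 0 else int(past.sum(axis=0).argmin())

    # Alternating losses (1,0), (0,1), (1,0), ... keep luring FTL onto
    # the expert that is about to pay 1, so FTL pays ~T while the best
    # expert pays only ~T/2: regret grows linearly in T.
    T = 1000
    losses = np.zeros((T, 2))
    losses[0::2, 0] = 1.0
    losses[1::2, 1] = 1.0
    print(expected_regret(ftl, losses, n_runs=1))  # roughly T/2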

SLIDE 8

Dropout Perturbations

  • Independently drop every past loss (set it to 0) with probability \alpha, and follow the leader on what remains:

    k_t = \arg\min_k \sum_{s=1}^{t-1} \delta_{s,k}\,\ell_{s,k}, \quad \delta_{s,k} \sim \text{Bernoulli}(1-\alpha) i.i.d.,

is the perturbed leader.
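
A sketch of the resulting algorithm, reusing the harness above and reading the slide as drawing fresh dropout variables every round:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 0.5  # dropout probability; any value in (0,1) works (next slides)

    def dropout_ftpl(past):
        if len(past) == 0:
            return 0
        # delta_{s,k} ~ Bernoulli(1 - alpha): keep each past loss w.p. 1 - alpha.
        keep = rng.random(past.shape) >= alpha
        return int((keep * past).sum(axis=0).argmin())

    # On the alternating sequence that fools FTL, the perturbed leader
    # is randomized, so the adversary can no longer force a switch to
    # the losing expert every round:
    # print(expected_regret(dropout_ftpl, losses))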

SLIDE 9
Dropout Perturbations for Binary Losses

  • For losses in \{0,1\} it works: for any dropout probability \alpha \in (0,1),

    \mathbb{E}[R_T] = O\big(\sqrt{L_T^* \ln K} + \ln K\big)

  • No tuning required!

SLIDE 10
Dropout Perturbations for Binary Losses

  • For losses in \{0,1\} it works: for any dropout probability \alpha \in (0,1),

    \mathbb{E}[R_T] = O\big(\sqrt{L_T^* \ln K} + \ln K\big)

  • No tuning required!
  • But it does not work for continuous losses in [0,1]: there exist losses such that \mathbb{E}[R_T] = \Omega(T).

SLIDE 11

Binarized Dropout Perturbations: Continuous Losses

  • The right generalization: for losses in [0,1], follow the leader on binarized dropout samples \tilde\ell_{s,k} \sim \text{Bernoulli}\big((1-\alpha)\,\ell_{s,k}\big); the same bound \mathbb{E}[R_T] = O\big(\sqrt{L_T^* \ln K} + \ln K\big) again holds for any \alpha \in (0,1).
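
Under this reading (binarize each loss to a Bernoulli(\ell) coin, then drop it out, which collapses to a single Bernoulli((1-\alpha)\ell) draw), the change to the sketch above is one line:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 0.5

    def binarized_dropout_ftpl(past):
        if len(past) == 0:
            return 0
        # Replace each past loss ell in [0,1] by a fresh draw from
        # Bernoulli((1 - alpha) * ell); for ell in {0,1} this is
        # exactly the plain dropout perturbation.
        perturbed = (rng.random(past.shape) < (1.0 - alpha) * past).astype(float)
        return int(perturbed.sum(axis=0).argmin())
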
SLIDE 12

Small Regret for IID Data

If loss vectors are

  – independent, identically distributed between trials,
  – with a gap \Delta > 0 between the expected loss of the best expert and the rest,

then the regret is constant w.h.p.: R_T = O(1), independent of T.

  • Algorithms that rely on the doubling trick for T or L_T^* do not get this.
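
A quick way to see the constant-regret behaviour is a toy i.i.d. experiment, reusing the harness and the binarized-dropout strategy sketched earlier; the gap value and sizes here are arbitrary choices of mine:

    import numpy as np

    rng_data = np.random.default_rng(1)
    K, gap = 5, 0.2
    means = np.full(K, 0.5)
    means[0] -= gap                     # expert 0 is better by the gap
    for T in (250, 1000, 4000):
        losses = (rng_data.random((T, K)) < means).astype(float)
        # The estimated regret should level off as T grows,
        # rather than growing like sqrt(T).
        print(T, expected_regret(binarized_dropout_ftpl, losses, n_runs=20))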

SLIDE 13

Instance of Follow-the-Perturbed-Leader

  • Follow-the-Perturbed-Leader [Kalai, Vempala, 2005]: play k_t = \arg\min_k \big(L_{t-1,k} + \xi_{t,k}\big) for random perturbations \xi_{t,k}. We have data-dependent perturbations that differ between experts: \xi_{t,k} = -\sum_{s=1}^{t-1} (1-\delta_{s,k})\,\ell_{s,k}.
  • Standard analysis: bound the probability of a leader change in the be-the-leader lemma.
  • Elegant simple bound for the perturbations of Kalai & Vempala, but not for us.

SLIDE 14

Related Work: RWP

  • Random walk perturbation [Devroye et al., 2013]: play k_t = \arg\min_k \big(L_{t-1,k} + \sum_{s=1}^{t} X_{s,k}\big), where each X_{s,k} is a centered Bernoulli variable.
  • Equivalent to dropout if all losses equal 1.
  • But the perturbations do not adapt to the data, so no \sqrt{L_T^*}-bound.
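
For comparison, a stateless sketch of the random-walk idea in the same harness. Assumption on my part: the whole walk is redrawn each round, which matches the per-round distribution; Devroye et al. instead keep one walk and add a single increment per round:

    import numpy as np

    rng = np.random.default_rng(2)

    def rwp(past):
        if len(past) == 0:
            return 0
        t, K = past.shape
        # Perturb each cumulative loss with a walk of centered Bernoulli
        # steps X_{s,k} in {-1/2, +1/2}; the noise scale grows like
        # sqrt(t) regardless of the observed losses.
        walk = (rng.integers(0, 2, size=(t, K)) - 0.5).sum(axis=0)
        return int((past.sum(axis=0) + walk).argmin())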

SLIDE 15

Proof Outline

  • Find worst-case loss sequence
SLIDE 16

Proof Outline

  • Find worst-case loss sequence: e.g. for 3 experts with cumulative losses 1, 3 and 5

SLIDE 17

Proof Outline

  • Find worst-case loss sequence: e.g. for 3 experts with cumulative losses 1, 3 and 5
  • 1. Cumulative losses approximately equal: apply the lemma from RWP roughly once per K rounds
  • 2. Expert 1 has much smaller cumulative loss: Hoeffding
SLIDE 18

Summary

  • Simple algorithm: follow-the-leader on losses that are perturbed by binarized dropout
  • No tuning necessary
  • On any losses: \mathbb{E}[R_T] = O\big(\sqrt{L_T^* \ln K} + \ln K\big)
  • On i.i.d. loss vectors with a gap between the best expert and the rest: constant regret w.h.p.

SLIDE 19

Many Open Questions

  • Can we use dropout for:
    – Tracking the best expert?
    – Combinatorial settings (e.g. online shortest path)?
  • Need to reuse randomness between experts
  • What about variations on the dropout perturbations?
    – Drop the whole loss vector at once?

To discuss at the poster!

SLIDE 20

References

  • Hinton, Srivastava, Krizhevsky, Sutskever, Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
  • Wager, Wang, Liang. Dropout training as adaptive regularization. NIPS, 2013.
  • Kalai, Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
  • Devroye, Lugosi, Neu. Prediction by random-walk perturbation. COLT, 2013.
  • Van Erven, Kotłowski, Warmuth. Follow the leader with dropout perturbations. COLT, 2014.