SLIDE 1
Follow the Leader with Dropout Perturbations
Tim van Erven
COLT, 2014
Joint work with: Wojciech Kotłowski, Manfred Warmuth
SLIDE 2
Neural Network
SLIDE 3
Neural Network
SLIDE 4
Dropout Training
- Stochastic gradient descent
- Randomly remove every hidden/input unit with probability 1/2 before each gradient descent update (see the sketch below)
[Hinton et al., 2012]
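A minimal numpy sketch of this procedure, assuming a one-hidden-layer network with ReLU units (the function and variable names are illustrative, not from the talk or from Hinton et al.):

```python
# Dropout at training time (sketch): each hidden unit is zeroed out
# independently with probability 1/2 before the update is computed.
import numpy as np

rng = np.random.default_rng(0)

def forward_with_dropout(x, W1, W2, p_drop=0.5):
    """Forward pass with a fresh dropout mask on the hidden layer."""
    h = np.maximum(0.0, x @ W1)           # hidden activations (ReLU)
    mask = rng.random(h.shape) >= p_drop  # keep each unit w.p. 1 - p_drop
    return (h * mask) @ W2                # dropped units contribute nothing
```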
SLIDE 5
Dropout Training
- Very successful in, e.g., image classification and speech recognition
- Many people are trying to analyse why it works
[Wager, Wang, Liang, 2013]
SLIDE 6
Prediction with Expert Advice
- Every round $t = 1, \dots, T$:
- 1. (Randomly) choose expert $k_t \in \{1, \dots, K\}$
- 2. Observe expert losses $\ell_{t,1}, \dots, \ell_{t,K}$
- 3. Our loss is $\ell_{t,k_t}$
- Goal: minimize the expected regret $\mathbb{E}[R_T] = \mathbb{E}\big[\sum_{t=1}^T \ell_{t,k_t}\big] - L_T^*$, where $L_T^* = \min_k \sum_{t=1}^T \ell_{t,k}$ is the loss of the best expert (see the protocol sketch below)
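A minimal sketch of this protocol in Python (illustrative names; `choose` stands for any rule that picks an expert from the losses seen so far):

```python
import numpy as np

def play(losses, choose, rng):
    """Run the game on a (T, K) loss matrix; return the realized regret."""
    T, K = losses.shape
    total = 0.0
    for t in range(T):
        k = choose(losses[:t], K, rng)  # pick expert k_t from rounds 1..t-1
        total += losses[t, k]           # only then are round t's losses revealed
    return total - losses.sum(axis=0).min()  # compare to the best expert
```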
SLIDE 7
Follow-the-Leader
- Deterministically choose the expert that has predicted best in the past: $k_t = \arg\min_k L_{t-1,k}$, where $L_{t-1,k} = \sum_{s<t} \ell_{s,k}$; the minimizer is the leader
- Can be fooled: regret grows linearly in $T$ for adversarial data (see the sketch below)
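Follow-the-Leader written as a `choose` rule for the protocol sketch above:

```python
import numpy as np

def ftl(past_losses, K, rng):
    """Pick the expert with the smallest cumulative past loss."""
    if past_losses.shape[0] == 0:
        return 0                                    # arbitrary first pick
    return int(np.argmin(past_losses.sum(axis=0))) # the current leader
```

The classic fooling sequence for two experts starts with losses $(1/2, 0)$ and then alternates $(0,1), (1,0), \dots$: FTL switches to yesterday's winner every round and pays about $T$, while the best expert pays about $T/2$.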
SLIDE 8
Dropout Perturbations
- Run FTL on randomly perturbed cumulative losses: drop each past loss independently with probability $\alpha$, i.e. $\tilde L_{t-1,k} = \sum_{s<t} Z_{s,k}\,\ell_{s,k}$ with $Z_{s,k} \sim \text{Bernoulli}(1-\alpha)$ i.i.d.
- $k_t = \arg\min_k \tilde L_{t-1,k}$ is the perturbed leader (sketch below)
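A sketch of the rule as described on this slide, assuming a fresh dropout mask is drawn every round:

```python
import numpy as np

def dropout_ftl(past_losses, K, rng, alpha=0.5):
    """FTL on cumulative losses with each past loss dropped w.p. alpha."""
    if past_losses.shape[0] == 0:
        return int(rng.integers(K))                # no data yet: pick at random
    keep = rng.random(past_losses.shape) >= alpha  # Z_{s,k} ~ Bernoulli(1 - alpha)
    return int(np.argmin((past_losses * keep).sum(axis=0)))
```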
SLIDE 9
Dropout Perturbations for Binary Losses
- For losses in $\{0,1\}$ it works: $\mathbb{E}[R_T] = O\big(\sqrt{L_T^* \log K} + \log K\big)$ for any dropout probability $\alpha \in (0,1)$
- No tuning required!
SLIDE 10
Dropout Perturbations for Binary Losses
- For losses in $\{0,1\}$ it works: $\mathbb{E}[R_T] = O\big(\sqrt{L_T^* \log K} + \log K\big)$ for any dropout probability $\alpha \in (0,1)$
- No tuning required!
- But it does not work for continuous losses in $[0,1]$: there exist losses for which the regret grows linearly in $T$
SLIDE 11
Binarized Dropout Perturbations: Continuous Losses
- The right generalization for losses in $[0,1]$: binarize each loss before dropping it, replacing $\ell_{s,k}$ by an independent draw $\tilde\ell_{s,k} \sim \text{Bernoulli}\big((1-\alpha)\,\ell_{s,k}\big)$; for binary losses this coincides with plain dropout (sketch below)
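A sketch under the Bernoulli reading above (each past loss in $[0,1]$ is replaced by a fresh 0/1 draw with mean $(1-\alpha)\,\ell_{s,k}$):

```python
import numpy as np

def binarized_dropout_ftl(past_losses, K, rng, alpha=0.5):
    """FTL on binarized, dropped-out losses: Bernoulli((1 - alpha) * loss)."""
    if past_losses.shape[0] == 0:
        return int(rng.integers(K))
    perturbed = rng.random(past_losses.shape) < (1.0 - alpha) * past_losses
    return int(np.argmin(perturbed.sum(axis=0)))
```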
SLIDE 12
Small Regret for IID Data
If loss vectors are
– independent, identically distributed between trials,
– with a gap between the expected loss of the best expert and the rest,
then the regret is constant: $R_T = O(1)$ w.h.p. (toy simulation below)
- Algorithms that rely on the doubling trick for $L_T^*$ or $T$ do not get this.
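A toy check of the i.i.d. claim, reusing `play` and `binarized_dropout_ftl` from the sketches above (illustrative parameters, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 5000, 5
means = np.array([0.3, 0.5, 0.5, 0.5, 0.5])      # expert 1 is best, gap 0.2
losses = (rng.random((T, K)) < means).astype(float)
print(play(losses, binarized_dropout_ftl, rng))  # regret stays small as T grows
```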
SLIDE 13
Instance of Follow-the-Perturbed-Leader
- Follow-the-Perturbed-Leader [Kalai, Vempala, 2005]: we have data-dependent perturbations that differ between experts.
- Standard analysis: bound the probability of a leader change in the be-the-leader lemma (standard form below).
- There is an elegant, simple bound for the perturbations of Kalai & Vempala, but not for ours.
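For reference, the textbook skeleton of that analysis for plain FTL (the paper adapts it to data-dependent perturbations): with $k_t^* = \arg\min_k \sum_{s \le t} \ell_{s,k}$ the leader after round $t$, induction gives the be-the-leader lemma, and for losses in $[0,1]$ each leader change costs at most one unit of regret:

```latex
\sum_{t=1}^{T} \ell_{t,k_t^*} \le \min_k \sum_{t=1}^{T} \ell_{t,k},
\qquad
R_T \le \mathbb{E}\Big[\sum_{t=1}^{T} \mathbf{1}\{k_t^* \neq k_{t-1}^*\}\Big].
```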
SLIDE 14
Related Work: RWP
- Random-walk perturbation [Devroye et al., 2013]: add an independent centered Bernoulli variable to each expert's loss in every round, so the perturbations accumulate into random walks
- Equivalent to dropout (with $\alpha = 1/2$) when all losses equal 1
- But the perturbations do not adapt to the data, so no $L_T^*$-bound
SLIDE 15
Proof Outline
- Find worst-case loss sequence
SLIDE 16
Proof Outline
- Find worst-case loss sequence: e.g. for 3 experts with cumulative losses 1, 3 and 5
SLIDE 17
Proof Outline
- Find worst-case loss sequence: e.g. for 3 experts with cumulative losses 1, 3 and 5
- 1. Cumulative losses approximately equal: apply the lemma from RWP roughly once per K rounds
- 2. Expert 1 has a much smaller cumulative loss: Hoeffding (standard statement below)
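For reference, the standard form of Hoeffding's inequality used in case 2 (how it enters the proof is spelled out in the paper): for independent $X_1, \dots, X_n \in [0,1]$,

```latex
\Pr\Big( \sum_{i=1}^{n} X_i - \mathbb{E}\Big[\sum_{i=1}^{n} X_i\Big] \ge \varepsilon \Big) \le \exp\!\big(-2\varepsilon^2 / n\big).
```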
SLIDE 18
Summary
- Simple algorithm: follow-the-leader on losses that are perturbed by binarized dropout
- No tuning necessary
- On any losses in $[0,1]$: $\mathbb{E}[R_T] = O\big(\sqrt{L_T^* \log K} + \log K\big)$
- On i.i.d. loss vectors with a gap between the best expert and the rest: $R_T = O(1)$ w.h.p.
SLIDE 19
Many Open Questions
- Can we use dropout for:
– Tracking the best expert?
– Combinatorial settings (e.g. online shortest path)?
- Need to reuse randomness between experts
- What about variations on the dropout perturbations?
– Drop the whole loss vector at once?
To discuss at the poster!
SLIDE 20
References
- Hinton, Srivastava, Krizhevsky, Sutskever, Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
- Wager, Wang, Liang. Dropout training as adaptive regularization. NIPS, 2013.
- Kalai, Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
- Devroye, Lugosi, Neu. Prediction by random-walk perturbation. COLT, 2013.
- Van Erven, Kotłowski, Warmuth. Follow the leader with dropout perturbations. COLT, 2014.