Learning Faster from Easy Data II (Wouter Koolen, Tim van Erven) - PowerPoint presentation transcript



SLIDE 1

Learning Faster from Easy Data II

Wouter Koolen Tim van Erven

SLIDE 2

Aim of the Workshop

  • Minimax analysis gives robust algorithms
  • But in common easy cases these are overly conservative
    – Large gap between performance predicted by theory and observed in practice
  • This workshop:
    – Bring together easy cases in different learning settings
    – New algorithms: robust to the worst case, but automatically adapting to easy cases to learn faster

SLIDE 3

Learning Settings and Easy Cases (non-exhaustive list)

Standard statistical learning / Active learning

  • Margin condition (classification), Bernstein condition
  • Data fit a low-complexity model
  • Sparsity

Online learning

  • Curvature of the loss: strong convexity, exp-concavity, mixability
  • Small variance: 2nd-order bounds, i.i.d. losses + gap, small losses, ...
  • Many “good” experts

Bandits

  • Stochastic = i.i.d. losses + gap

Clustering

  • K-means “works”
SLIDE 4

Easy Land

[Figure: map of easy cases spanning Statistical Learning, Bandits, and Online Learning, with the margin condition marked; labels indicate which regions are covered by this talk and which by posters.]

SLIDE 5

Outline

  • Easy data
    – statistical learning
    – online learning
    – bandits
  • How to exploit easy data
    – statistical learning
    – online learning
  • The price of adaptivity
SLIDE 6

Statistical Learning

Goal: small risk compared to the risk minimizer in the model
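The goal above can be written out. In standard notation (the slide's own formula was lost in extraction), the quantity to control is the excess risk of a learned predictor $\hat f$ over the model $\mathcal{F}$:

```latex
% Risk = expected loss; excess risk compares to the best element of the model
\mathrm{Risk}(f) = \mathbb{E}_{(X,Y)}\,\ell(f(X), Y),
\qquad
\text{excess risk} \;=\; \mathrm{Risk}(\hat f) \;-\; \inf_{f \in \mathcal{F}} \mathrm{Risk}(f).
```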

slide-7
SLIDE 7

Easy Data in Classification

In the worst case, learning is slow: excess risk of order $n^{-1/2}$.

Margin condition: $P\bigl(0 < |\eta(X) - \tfrac{1}{2}| \le t\bigr) \le C\,t^{\beta}$, where $\eta(x) = P(Y = 1 \mid X = x)$.

  – common case: $\eta(X)$ not too close to $1/2$
  – then learning is much faster, up to rate $n^{-1}$

[Tsybakov, 2004]

slide-8
SLIDE 8

The Margin Condition

[Figure: distributions of $\eta(X)$ around $1/2$, illustrating easy (mass away from $1/2$), moderate, and hard (mass concentrated near $1/2$) cases.]

slide-9
SLIDE 9

Large Margin Reduces Variance

  • An important source of excess risk is the variance of the excess loss
  • Margin condition ⇒ Bernstein condition:
    $\mathbb{E}\bigl[(\text{excess loss})^2\bigr] \le B\,\bigl(\mathbb{E}[\text{excess loss}]\bigr)^{\beta}$
  • Smaller excess risk ⇒ smaller variance
SLIDE 11

Online Learning

Goal: small cumulative loss compared to the minimizer of cumulative loss in the model
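In standard notation (the slide's formula was an image), the corresponding quantity is the regret over $T$ rounds:

```latex
% Regret: learner's cumulative loss minus the best fixed model in hindsight
R_T \;=\; \sum_{t=1}^{T} \ell_t(f_t) \;-\; \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell_t(f).
```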

SLIDE 14

Easy Data in Online Learning

  • Curved losses: strongly convex, exp-concave, and mixable losses are easier than linear loss
  • Small empirical variance in the excess losses. Implied by:
    – small losses ($L^*$ bounds)
    – i.i.d. losses + gap
    – Bernstein condition! [Grünwald]

SLIDE 15

Bandit Online Learning

Goal: small cumulative loss compared to the best fixed arm

  • K arms/treatments with losses
  • Only observe the loss of one's own (randomized) choice
SLIDE 17

Easy Data for Bandits

  • Stochastic bandits (easier):
    – Losses for the arms are independent, identically distributed (i.i.d.)
    – Positive gap between the expected performance of the best arm and all others
  • Adversarial bandits (harder):
    – Losses can be anything, even chosen to make learning as difficult as possible
  • Can a single algorithm adapt to:
    – i.i.d. + gap and adversarial? [Auer]
    – small losses and adversarial? [Neu]
    – small variance in general and adversarial?
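As a concrete reference point for the adversarial setting above, here is a minimal EXP3-style sketch (not from the slides; `loss_fn` and all parameter choices are illustrative). It runs exponential weights on importance-weighted loss estimates, since only the pulled arm's loss is observed each round:

```python
import numpy as np

def exp3(loss_fn, K, T, eta, rng):
    """EXP3-style bandit sketch.

    loss_fn(t, k) -> loss in [0, 1] of arm k at round t; it is only
    queried for the arm actually pulled (bandit feedback).
    Returns the learner's total realized loss.
    """
    est = np.zeros(K)   # importance-weighted cumulative loss estimates
    total = 0.0
    for t in range(T):
        # exponential weights on the estimates (shifted for stability)
        w = np.exp(-eta * (est - est.min()))
        p = w / w.sum()
        k = rng.choice(K, p=p)
        loss = loss_fn(t, k)
        est[k] += loss / p[k]   # unbiased estimate of the full loss vector
        total += loss
    return total
```

The importance-weighting step `loss / p[k]` is what makes the estimates unbiased despite seeing only one arm per round; explicit extra exploration can be mixed in but is not required in this loss formulation.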

SLIDE 18

Outline

  • Easy data
    – statistical learning
    – online learning
    – bandits
  • How to exploit easy data
    – statistical learning
    – online learning
  • The price of adaptivity
SLIDE 19

Adaptive Statistical Learning

We consider exploiting $\beta$-Bernstein cases.

Method: penalized ERM minimizes (for simplicity: a prior $\pi$ on a countable model $\mathcal{F}$)
$\hat{f} = \arg\min_{f \in \mathcal{F}} \Bigl[\sum_{i=1}^{n} \ell(f, Z_i) + \frac{1}{\eta} \log \frac{1}{\pi(f)}\Bigr]$

How to tune $\eta$?

SLIDE 20

Adaptive Statistical Learning

  • Knowing $\beta$, penalized ERM with a correctly tuned $\eta$ achieves the fast rate
  • Adaptive method: tune $\eta$ through a holdout estimate
  • More sophisticated adaptive methods:
    – Slope heuristic [Birgé, Massart]
    – Lepski's method
    – Safe Bayes [Grünwald]
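The holdout approach can be sketched for a finite model class (the setup and all names here are illustrative, not the slides'): run penalized ERM on training losses for each candidate $\eta$, then keep the $\eta$ whose selected model has the smallest holdout loss.

```python
import numpy as np

def penalized_erm(train_losses, log_inv_prior, eta):
    """Penalized ERM over a finite model class.

    train_losses: (n, M) array of per-example losses of M candidate models.
    log_inv_prior: (M,) array of log(1/pi(f)) complexity penalties.
    Returns the index of the model minimizing total loss + penalty / eta.
    """
    return int(np.argmin(train_losses.sum(axis=0) + log_inv_prior / eta))

def tune_eta_by_holdout(train_losses, holdout_losses, log_inv_prior, eta_grid):
    """Pick the eta whose penalized-ERM choice does best on holdout data."""
    picks = [penalized_erm(train_losses, log_inv_prior, eta) for eta in eta_grid]
    risks = [holdout_losses[:, f].mean() for f in picks]
    best = int(np.argmin(risks))
    return eta_grid[best], picks[best]
```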

SLIDE 23

Adaptive Online Learning: Probabilistic Estimators

  • Penalized ERM: $\min_{f} \bigl[\sum_{t} \ell_t(f) + \frac{1}{\eta} \log \frac{1}{\pi(f)}\bigr]$
  • Allow probability distributions $p$ on the model: $\min_{p} \bigl[\mathbb{E}_{f \sim p} \sum_{t} \ell_t(f) + \frac{1}{\eta} \mathrm{KL}(p \,\|\, \pi)\bigr]$
  • Solution: exponential weights, $p(f) \propto \pi(f)\, e^{-\eta \sum_t \ell_t(f)}$

Remark: obtain other methods like gradient descent by:

  • changing the KL to other regularizers
  • more general sets for $p$
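A minimal sketch of the exponential weights update for a finite set of experts (losses and the tuning of `eta` are illustrative):

```python
import numpy as np

def exponential_weights(losses, eta, prior=None):
    """Exponential weights over K experts.

    losses: (T, K) array of per-round expert losses in [0, 1].
    Returns the (T, K) array of distributions p_t played at each round
    (p_1 equals the prior) and the learner's cumulative mixture loss.
    """
    T, K = losses.shape
    pi = np.full(K, 1.0 / K) if prior is None else np.asarray(prior, float)
    cum = np.zeros(K)            # cumulative loss per expert
    weights = np.empty((T, K))
    total = 0.0
    for t in range(T):
        # p_t depends only on losses from rounds before t
        w = pi * np.exp(-eta * (cum - cum.min()))  # shift for stability
        p = w / w.sum()
        weights[t] = p
        total += p @ losses[t]   # play the mean (convex losses)
        cum += losses[t]
    return weights, total
```

Subtracting `cum.min()` before exponentiating leaves the normalized weights unchanged but avoids numerical underflow when cumulative losses grow.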
SLIDE 24

Adaptive Online Learning

  • For convex losses, play the mean: $w_t = \mathbb{E}_{f \sim p_t}[f]$
  • Standard tuning for the worst case: $\eta \propto \sqrt{\log K / T}$
  • Gives the worst-case regret bound $O(\sqrt{T \log K})$
  • Can we do better if we get $\beta$-Bernstein data?
SLIDE 25

Adaptive Online Learning

  • It turns out we can indeed exploit $\beta$-Bernstein data with a correctly tuned $\eta$; in fact we want an $\eta$ that depends on the data
  • But we cannot do holdout online
  • Then how to tune $\eta$?
    – One approach: tune $\eta$ in terms of an upper bound on the regret that includes some measure of variance
    – Next slide: learn the empirically best learning rate for the data at hand

SLIDE 27

Squint

  • Exponential weights needs external tuning of $\eta$; weights are exponential in the regret: $p(k) \propto \pi(k)\, e^{\eta R_t^k}$
  • Squint: learn the best $\eta$ for the data: $p(k) \propto \pi(k) \int \eta\, e^{\eta R_t^k - \eta^2 V_t^k}\, d\gamma(\eta)$, with variance penalty $V_t^k$

[Koolen and Van Erven 2015]

SLIDE 28

Squint

  • Philosophy: learn the best $\eta$ for the data
  • Important for the current overview:
    – Optimal rate in Bernstein cases
  • Further advantages beyond the stochastic case:
    – Fast rates on sub-adversarial data
    – Second-order and quantile adaptivity
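A Squint-style sketch with a finite grid prior over learning rates standing in for a continuous prior $\gamma$ (the grid and all parameters are illustrative): each expert's weight averages the exponential-weights evidence $\eta\, e^{\eta R_k - \eta^2 V_k}$ over the $\eta$ grid, so no single learning rate has to be tuned externally.

```python
import numpy as np

def squint_grid(losses, prior=None, etas=None):
    """Squint-style expert aggregation with a grid prior over eta.

    losses: (T, K) array of per-round expert losses in [0, 1].
    Returns the (T, K) array of prediction weights.
    """
    T, K = losses.shape
    pi = np.full(K, 1.0 / K) if prior is None else np.asarray(prior, float)
    etas = np.geomspace(1e-3, 0.5, 20) if etas is None else np.asarray(etas)
    R = np.zeros(K)   # cumulative instantaneous regret per expert
    V = np.zeros(K)   # cumulative squared instantaneous regret (variance penalty)
    P = np.empty((T, K))
    for t in range(T):
        # evidence for (expert k, rate eta): eta * exp(eta R_k - eta^2 V_k)
        ev = (etas * np.exp(np.outer(R, etas) - np.outer(V, etas ** 2))).mean(axis=1)
        w = pi * ev
        p = w / w.sum()
        P[t] = p
        r = p @ losses[t] - losses[t]   # instantaneous regret vector
        R += r
        V += r ** 2
    return P
```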

SLIDE 29

Outline

  • Easy data
    – statistical learning
    – online learning
    – bandits
  • How to exploit easy data
    – statistical learning
    – online learning
  • The price of adaptivity
SLIDE 30

Price of adaptivity

  • Settings where adaptivity is cheap:
    – Statistical learning: holdout, etc.
    – Online learning (full information): Squint
  • Settings where adaptivity is subtle/unknown:
    – Bandits (i.i.d. stochastic / adversarial):
      • Adaptivity to both settings is affordable (Auer)
      • Can adapt to small losses, but the general intermediate case is very tricky (Neu)
    – Active learning (Singh)
    – Online boosting (Kale):
      • Newly introduced setting (ICML best paper)
      • Seems to be some cost for adaptivity
    – Clustering (Ben-David)
    – ...

(Grünwald, Foster)

SLIDE 31

Schedule

  • Invited speakers
  • Spotlights + posters:
    – Online learning, online convex optimization
    – Clustering
    – Statistical learning
    – Non-i.i.d. data
    – Bandits
  • Panel discussion
SLIDE 32

Easy Land: great unknowns

[Figure: the Easy Land map again, spanning Statistical Learning, Bandits, and Online Learning (margin condition marked), now with question marks over Clustering, Active Learning, and Non-Stationarity.]