ECML 2015 Big Targets Workshop
Paul Mineiro



SLIDE 1

How can we generalize well?

ECML 2015 Big Targets Workshop

Paul Mineiro

Paul Mineiro ECML 2015 Big Targets Workshop

SLIDE 2

Extreme Challenges

- How can we generalize well?
- Can we compete with OAA?
- When can we predict quickly?

SLIDE 3

How can we generalize well?

SLIDE 4

Chasing Tails

Typical extreme datasets have many rare classes.

SLIDE 5

Chasing Tails

Typical extreme datasets have many rare classes. What are the implications for generalization?

SLIDE 6

Chasing Tails

Typical extreme datasets have many rare classes. What are the implications for generalization? Let's use the bootstrap to get intuition.

SLIDE 7

Bootstrap Lesson

Observation (Tail Frequencies)

The true frequencies of tail classes are not clear given the training set.
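The bootstrap intuition can be made concrete with a quick simulation. This is a toy sketch of my own, not code from the talk: resample a long-tailed training set with replacement and watch how much of the tail disappears from each resample.

```python
import random
from collections import Counter

random.seed(0)

# Toy long-tailed training set: class 0 is the head (1000 examples),
# classes 1..999 are the tail, each appearing exactly once.
labels = [0] * 1000 + list(range(1, 1000))

def bootstrap_tail_presence(labels, reps=100):
    """Fraction of tail classes (count == 1) that survive a bootstrap resample."""
    tail = {c for c, n in Counter(labels).items() if n == 1}
    fractions = []
    for _ in range(reps):
        resample = Counter(random.choices(labels, k=len(labels)))
        fractions.append(sum(1 for c in tail if resample[c] > 0) / len(tail))
    return fractions

fracs = bootstrap_tail_presence(labels)
# A class with one training example is absent from a resample of size n with
# probability (1 - 1/n)^n ~ e^{-1} ~ 0.37, so over a third of the tail vanishes
# in every resample: the training set pins down tail frequencies very loosely.
print(min(fracs), max(fracs))
```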

SLIDE 8

Two Loss Patterns

All classes below have 1 training example. Which hypothesis do you like better?

            h1    h2
  class 1   1     0.6
  class 2   1     0.6
  class 3   0     0.42
  class 4   0     0.42

SLIDE 9

Two Loss Patterns

All classes below have 1 training example. Which hypothesis do you like better?

            h1    h2
  class 1   1     0.6
  class 2   1     0.6
  class 3   0     0.42
  class 4   0     0.42

ERM likes h1 better. I like h2 better.
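The trade-off in the table can be checked by hand. A small sketch (assuming h1's per-class losses are (1, 1, 0, 0) and h2's are (0.6, 0.6, 0.42, 0.42), which is what makes ERM prefer h1 by a hair):

```python
# Per-class losses for the two hypotheses on the slide.
h1 = [1.0, 1.0, 0.0, 0.0]
h2 = [0.6, 0.6, 0.42, 0.42]

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])

# ERM compares averages only: h1 wins narrowly (0.5 vs ~0.51).
print(mean(h1), mean(h2))
# But h1's losses are far more variable (0.25 vs ~0.008), which is
# exactly the quantity a variance penalty cares about.
print(variance(h1), variance(h2))
# With even a small kappa, the variance-penalized objective prefers h2.
kappa = 0.25
print(mean(h1) + kappa * variance(h1) > mean(h2) + kappa * variance(h2))  # True
```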

SLIDE 10

The Extreme Deficiencies of ERM

ERM cares only about average loss:

$h^* = \arg\min_{h \in H} \mathbb{E}_{(x,y) \sim D}[\ell(h(x); y)]$

. . . but extreme learning empirical losses can have high variance. ERM doesn't care about empirical loss variance. ERM is based upon a uniform bound on the hypothesis space.

SLIDE 11

eXtreme Risk Minimization

Sample Variance Penalization (XRM) penalizes a combination of expected loss and loss variance:

$h^* = \arg\min_{h \in H} \left( \mathbb{E}[\ell(h(x); y)] + \kappa \, \mathbb{V}[\ell(h(x); y)] \right)$

(κ is a hyperparameter in practice.) XRM is based upon empirical Bernstein bounds.
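The objective is a one-liner on top of per-example losses. A minimal sketch (the function name and NumPy usage are mine, not from the talk):

```python
import numpy as np

def xrm_objective(losses, kappa):
    """Sample-variance-penalized risk: E[l] + kappa * V[l].

    `losses` holds per-example losses for a hypothesis h;
    kappa = 0 recovers plain ERM.
    """
    losses = np.asarray(losses, dtype=float)
    return losses.mean() + kappa * losses.var()

# The two hypotheses from the earlier slide: ERM (kappa = 0) prefers h1,
# while a modest kappa flips the preference to the lower-variance h2.
h1 = [1.0, 1.0, 0.0, 0.0]
h2 = [0.6, 0.6, 0.42, 0.42]
print(xrm_objective(h1, 0.0) < xrm_objective(h2, 0.0))    # True: ERM picks h1
print(xrm_objective(h1, 0.25) > xrm_objective(h2, 0.25))  # True: XRM picks h2
```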

SLIDE 12

Example: Neural Language Modeling

Mini-batch XRM gradient:

$\mathbb{E}_i\!\left[\left(1 + \kappa \, \frac{l_i(\phi) - \mathbb{E}_j[l_j(\phi)]}{\sqrt{\mathbb{E}_j[l_j^2(\phi)] - \mathbb{E}_j[l_j(\phi)]^2}}\right) \frac{\partial l_i(\phi)}{\partial \phi}\right]$

Smaller than average loss ⇒ lower learning rate. Larger than average loss ⇒ larger learning rate. Loss variance is the unit of loss measurement.
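The per-example multiplier in front of each gradient can be sketched as follows (a hypothetical helper of my own, assuming NumPy; `eps` is my addition to guard a zero-variance batch):

```python
import numpy as np

def xrm_example_weights(losses, kappa, eps=1e-8):
    """Per-example gradient multipliers for a mini-batch XRM update.

    Each example's gradient is scaled by
        1 + kappa * (l_i - mean(l)) / std(l),
    so below-average losses get a smaller effective learning rate and
    above-average losses a larger one.
    """
    losses = np.asarray(losses, dtype=float)
    std = np.sqrt(losses.var() + eps)
    return 1.0 + kappa * (losses - losses.mean()) / std

# One hard example in an otherwise easy batch: its weight exceeds 1,
# the easy examples' weights dip below 1, and the weights average to 1.
batch_losses = [0.2, 0.4, 0.9, 2.5]
w = xrm_example_weights(batch_losses, kappa=0.25)
print(w)
```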

SLIDE 13

Example: Neural Language Modeling

enwiki9 data set. FNN-LM of Zhang et al. Same everything except κ.

  method            perplexity
  ERM (κ = 0)       106.3
  XRM (κ = 0.25)    104.1

Modest lift, but over a SOTA baseline and with minimal code changes.

SLIDE 14

Example: Neural Language Modeling

[Figure: progressive loss variance vs. example number (10^4 to 10^10), comparing ERM and XRM.]

SLIDE 15

Example: Randomized Embeddings

Based upon (randomized) SVD.

[Diagram: X (n × d) W (d × c) = T (n × k) V⊤ (k × c) ≈ Y (n × c).]

How to adapt a black-box technique to XRM? Idea: proxy model ⇒ importance weights.
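The randomized-SVD primitive the slide builds on can be sketched as a generic Halko-style range finder. This is not the exact Rembed pipeline from the talk, just the underlying decomposition:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Rank-k randomized SVD: project onto a random subspace,
    orthonormalize, then take an exact SVD of the small projected matrix."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    # A random test matrix captures the dominant column space of A.
    Omega = rng.standard_normal((m, k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)
    # Exact SVD of the small (k + oversample) x m matrix B = Q^T A.
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k, :]

# Sanity check: a rank-3 matrix is recovered (near) exactly at k = 3.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 50))
U, s, Vt = randomized_svd(A, k=3)
print(np.allclose(A, (U * s) @ Vt, atol=1e-8))  # True
```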

SLIDE 16

Imbalanced binary XRM

Binary classification with a constant predictor q:

$\ell(y; q) = -y \log(q) - (1 - y) \log(1 - q)$

$\left. 1 + \kappa \, \frac{\ell(y; q) - \mathbb{E}[\ell(\cdot; q)]}{\sqrt{\mathbb{E}[\ell^2(\cdot; q)] - \mathbb{E}[\ell(\cdot; q)]^2}} \right|_{q = p} = \begin{cases} 1 - \kappa \sqrt{\frac{p}{1-p}} & y = 0 \\[4pt] 1 + \kappa \sqrt{\frac{1-p}{p}} & y = 1 \end{cases} \qquad (p \leq 0.5)$
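The closed form can be verified numerically. A toy check of my own (function name is hypothetical), computing the normalized loss deviation for a constant predictor q = p under negative log loss:

```python
import math

def xrm_weight(y, p, kappa):
    """Normalized-loss weight for a constant predictor q = p under log loss."""
    l0 = -math.log(1 - p)          # loss when y = 0
    l1 = -math.log(p)              # loss when y = 1
    mean = p * l1 + (1 - p) * l0
    var = p * l1 ** 2 + (1 - p) * l0 ** 2 - mean ** 2
    loss = l1 if y == 1 else l0
    return 1 + kappa * (loss - mean) / math.sqrt(var)

p, kappa = 0.1, 1.0
# Matches the closed form on the slide (for p <= 0.5): the common negative
# class is down-weighted, the rare positive class is up-weighted.
print(xrm_weight(0, p, kappa), 1 - kappa * math.sqrt(p / (1 - p)))
print(xrm_weight(1, p, kappa), 1 + kappa * math.sqrt((1 - p) / p))
```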

SLIDE 17

XRM Rembed for ODP

Compute the base rate q_c for each class c. Importance weight: 1 + κ (1 / √q_{y_i}).

  method            error rate (%)
  ODP ERM           [80.3, 80.4]
  ODP XRM (κ = 1)   [78.5, 78.7]

Modest lift, but over a SOTA baseline and with minimal code changes.
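The weighting scheme can be sketched as follows (a hypothetical helper on toy labels; the talk's actual ODP pipeline is not shown here):

```python
from collections import Counter

def rembed_importance_weights(labels, kappa=1.0):
    """Per-example importance weights 1 + kappa / sqrt(q_c), where q_c is
    the empirical base rate of example i's class; rare classes get large
    weights, so the tail is emphasized relative to plain ERM."""
    n = len(labels)
    rate = {c: cnt / n for c, cnt in Counter(labels).items()}
    return [1 + kappa * rate[y] ** -0.5 for y in labels]

# Toy labels: one head class (base rate 0.8) and two singleton tail classes
# (base rate 0.1 each). Head weight ~2.12, tail weight ~4.16.
labels = ["head"] * 8 + ["tail_a", "tail_b"]
w = rembed_importance_weights(labels)
print(w[0], w[-1])
```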

SLIDE 18

Summary

The tail can deviate wildly between train and test. Controlling loss variance helps a little bit. Speculation: explicitly treat the head and tail differently?