SLIDE 1

Logarithmic Time Prediction

John Langford Microsoft Research DIMACS Workshop on Big Data through the Lens of Sublinear Algorithms

SLIDE 2

The Multiclass Prediction Problem

Repeatedly

1. See x
2. Predict ŷ ∈ {1, ..., K}
3. See y

SLIDE 3

The Multiclass Prediction Problem

Repeatedly

1. See x
2. Predict ŷ ∈ {1, ..., K}
3. See y

Goal: find h(x) minimizing the error rate

Pr_{(x,y)∼D}(h(x) ≠ y)

with h(x) fast.
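A minimal Python sketch of this protocol; the learner interface (`predict`/`learn`) is a hypothetical naming for illustration, not something fixed by the talk.

```python
def run_protocol(learner, stream):
    """stream yields (x, y) pairs; returns the final error rate."""
    errors, t = 0, 0
    for x, y in stream:
        y_hat = learner.predict(x)   # 1. see x, 2. predict y_hat in {1..K}
        errors += int(y_hat != y)    # 3. see y, count mistakes
        learner.learn(x, y)          # online update
        t += 1
    return errors / max(t, 1)
```

The rest of the talk is about making `learner.predict` cost O(log K) rather than O(K).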

SLIDE 4

Why?

SLIDE 5

Why?

SLIDE 6

Trick #1

K is small

SLIDE 7

Trick #2: A hierarchy exists

SLIDE 8

Trick #2: A hierarchy exists

So use Trick #1 repeatedly.

SLIDE 9

Trick #3: Shared representation

SLIDE 10

Trick #3: Shared representation

Very helpful... but computation in the last layer can still blow up.

SLIDE 11

Trick #4: “Structured Prediction”

SLIDE 12

Trick #4: “Structured Prediction”

But what if the structure is unclear?

SLIDE 13

Trick #5: GPU

SLIDE 14

Trick #5: GPU

4 Teraflops is great... yet still burns energy.

SLIDE 15

How fast can we hope to go?

SLIDE 16

How fast can we hope to go?

Theorem: There exist multiclass classification problems where achieving 0 error rate requires Ω(log K) time to train or test per example.

SLIDE 17

How fast can we hope to go?

Theorem: There exist multiclass classification problems where achieving 0 error rate requires Ω(log K) time to train or test per example.

Proof: By construction. Pick y ∼ U({1, ..., K}).

SLIDE 18

How fast can we hope to go?

Theorem: There exist multiclass classification problems where achieving 0 error rate requires Ω(log K) time to train or test per example.

Proof: By construction. Pick y ∼ U({1, ..., K}). Any prediction algorithm outputting fewer than log₂ K bits loses with constant probability. Any training algorithm reading an example requires Ω(log₂ K) time.

SLIDE 19

Can we predict in time O(log₂ K)?

[Plot: Computational Advantage of Log Time: the benefit K / log(K) as a function of K, for K from 10 up to 10⁶.]
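The plotted advantage is just K / log K; a few values make the scale concrete. The plot's log base isn't specified, so base 2 (the base the lower bound uses) is assumed here.

```python
import math

# Speedup of O(log K) prediction over O(K) prediction, as in the plot.
for K in (10, 100, 1000, 10**4, 10**5, 10**6):
    print(f"K = {K:>7}: K / log2(K) ~ {K / math.log2(K):,.0f}")
```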

SLIDE 20

Not it #1: Sparse Error Correcting Output Codes

1. Create O(log K) binary vectors b_i of length K.

SLIDE 21

Not it #1: Sparse Error Correcting Output Codes

1. Create O(log K) binary vectors b_i of length K.
2. Train O(log K) binary classifiers h_i to minimize the error rate Pr_{x,y}(h_i(x) ≠ b_{iy}).

SLIDE 22

Not it #1: Sparse Error Correcting Output Codes

1. Create O(log K) binary vectors b_i of length K.
2. Train O(log K) binary classifiers h_i to minimize the error rate Pr_{x,y}(h_i(x) ≠ b_{iy}).
3. Predict by finding the y with minimal error.

SLIDE 23

Not it #1: Sparse Error Correcting Output Codes

1. Create O(log K) binary vectors b_i of length K.
2. Train O(log K) binary classifiers h_i to minimize the error rate Pr_{x,y}(h_i(x) ≠ b_{iy}).
3. Predict by finding the y with minimal error.

Prediction is Ω(K)
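A toy sketch of the scheme, with random codewords and pluggable binary classifiers; the code length and learners here are illustrative assumptions, not the slides' exact construction. It makes the Ω(K) decode step explicit.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_codes(K, n_bits):
    # b[y, i]: the i-th code bit of label y; n_bits = O(log K)
    return rng.integers(0, 2, size=(K, n_bits))

def ecoc_predict(x, classifiers, codes):
    # O(log K) classifier evaluations, one per code bit...
    bits = np.array([h(x) for h in classifiers])
    # ...but decoding compares against all K codewords: Omega(K).
    hamming = (codes != bits).sum(axis=1)
    return int(hamming.argmin())
```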

SLIDE 24

Not it #2: Hierarchy Construction

1. Build a confusion matrix of errors.

SLIDE 25

Not it #2: Hierarchy Construction

1. Build a confusion matrix of errors.
2. Recursively partition it to create a hierarchy.

SLIDE 26

Not it #2: Hierarchy Construction

1. Build a confusion matrix of errors.
2. Recursively partition it to create a hierarchy.
3. Apply the hierarchy solution.

SLIDE 27

Not it #2: Hierarchy Construction

1. Build a confusion matrix of errors.
2. Recursively partition it to create a hierarchy.
3. Apply the hierarchy solution.

Training is Ω(K) or worse.
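For concreteness, a sketch of step 2 on a precomputed K×K confusion matrix (a NumPy array); the spectral cut below is one plausible choice of recursive partitioner, not the slides' specific method. Note that producing the confusion matrix at all means first training and evaluating a classifier over all K classes, which is where the Ω(K) cost above comes from.

```python
import numpy as np

def build_hierarchy(C, labels=None):
    """C[i, j] = how often true class i is predicted as class j."""
    labels = list(range(len(C))) if labels is None else labels
    if len(labels) == 1:
        return labels[0]
    W = C[np.ix_(labels, labels)]
    W = (W + W.T) / 2.0                    # symmetrize the confusions
    L = np.diag(W.sum(axis=1)) - W         # graph Laplacian
    fiedler = np.linalg.eigh(L)[1][:, 1]   # second-smallest eigenvector
    order = np.argsort(fiedler)            # balanced cut along spectral order
    half = len(labels) // 2
    left = [labels[i] for i in order[:half]]
    right = [labels[i] for i in order[half:]]
    return (build_hierarchy(C, left), build_hierarchy(C, right))
```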

SLIDE 28

Not it #3: Unnormalized learning

Train K regressors. For each example (x, y):

1. Train regressor y with (x, 1).

SLIDE 29

Not it #3: Unnormalized learning

Train K regressors. For each example (x, y):

1. Train regressor y with (x, 1).
2. Pick y′ ≠ y uniformly at random.
3. Train regressor y′ with (x, −1).

SLIDE 30

Not it #3: Unnormalized learning

Train K regressors. For each example (x, y):

1. Train regressor y with (x, 1).
2. Pick y′ ≠ y uniformly at random.
3. Train regressor y′ with (x, −1).

Prediction is still Ω(K).
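A direct sketch of the update, assuming hypothetical per-label regressors exposing `update(x, target)` and `score(x)`; those names are illustrative, not a fixed API.

```python
import random

def train_step(regressors, x, y):
    # Two regressor updates per example, independent of K...
    regressors[y].update(x, 1.0)                   # label y as positive
    y_neg = random.randrange(len(regressors) - 1)  # uniform over y' != y
    y_neg += int(y_neg >= y)
    regressors[y_neg].update(x, -1.0)              # one random negative

def predict(regressors, x):
    # ...but prediction scores every regressor: still Omega(K).
    return max(range(len(regressors)), key=lambda k: regressors[k].score(x))
```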

SLIDE 31

Can we predict in time O(log₂ K)?

SLIDE 32

Is logarithmic time even possible?

[Tree diagram: labels {1, 2, 3} with P(y=1) = .4, P(y=2) = .3, P(y=3) = .3; the root splits 1 v {2, 3}, then 2 v 3.]

P({2, 3}) > P(1) ⇒ divide and conquer loses: descending toward the more probable branch at each node ends at label 2 or 3 (probability .3) instead of the single best label 1 (probability .4).
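A toy version of this failure, comparing greedy descent through the tree with the best single label:

```python
# Greedy descent through the tree 1 v {2,3} -> 2 v 3 under the
# distribution from the diagram above.
P = {1: 0.4, 2: 0.3, 3: 0.3}

best_single = max(P, key=P.get)       # predicting 1 is right 40% of the time
assert P[2] + P[3] > P[1]             # so the root descends into {2, 3}...
greedy = 2 if P[2] >= P[3] else 3     # ...where any leaf wins only 30%
print(best_single, P[best_single])    # 1 0.4
print(greedy, P[greedy])              # 2 0.3
```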

SLIDE 33

Filter Trees [BLR09]

[Tree diagram: labels {1, 2, 3} with P(y=1) = .4, P(y=2) = .3, P(y=3) = .3; root split 1 v {2, 3}, then 2 v 3.]

1. Learn 2 v 3 first.
2. Throw away all error examples.
3. Learn 1 v survivors.

Theorem: For all multiclass problems, for all binary classifiers, Multiclass Regret ≤ Average Binary Regret × log(K).
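A compact sketch of this bottom-up training over a fixed label tree; `train_binary` is an assumed stand-in for any binary learner and returns a classifier mapping x to 0 (go left) or 1 (go right).

```python
def train_filter_tree(node, examples, train_binary):
    """node is a label at the leaves, else a (left, right) pair."""
    if not isinstance(node, tuple):
        return lambda x: node                  # leaf: constant label
    left, right = node
    f_left = train_filter_tree(left, examples, train_binary)
    f_right = train_filter_tree(right, examples, train_binary)
    survivors = []
    for x, y in examples:
        if y == f_left(x):                     # true label survived the left
            survivors.append((x, 0))
        elif y == f_right(x):                  # true label survived the right
            survivors.append((x, 1))
        # else: a subtree already erred on (x, y), so filter it out
    h = train_binary(survivors)
    return lambda x: f_right(x) if h(x) else f_left(x)
```

On the tree (1, (2, 3)) this learns 2 v 3 first, filters its errors, then learns 1 v survivors, exactly as above; prediction costs one binary evaluation per level, O(log K) for a balanced tree.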

SLIDE 34

Can you make it robust?

[Diagram: a single-elimination tournament over labels 1-8 producing a single winner.]

SLIDE 35

Can you make it robust?

[Diagram: repeated tournaments over labels 1-8 producing multiple winners.]

SLIDE 36

Can you make it robust?

[Diagram: repeated tournaments over labels 1-8 producing multiple winners.]

SLIDE 37

Can you make it robust?

[Diagram: repeated tournaments over labels 1-8 producing multiple winners.]

Theorem [BLR09]: For all multiclass problems, for all binary classifiers, a log(K)-correcting tournament satisfies: Multiclass Regret ≤ Average Binary Regret × 5.5.

This tournament determined the best paper prize for ICML 2012 (from area chair decisions).

SLIDE 38

How do you learn structure?

Not all partitions are equally difficult. Compare {1, 7} v {3, 8} to {1, 8} v {3, 7}. Which is better?

SLIDE 39

How do you learn structure?

Not all partitions are equally difficult. Compare {1, 7} v {3, 8} to {1, 8} v {3, 7}. Which is better?

[BWG10]: Better to confuse near the leaves than near the root. Intuition: the root predictor tends to be overconstrained while leafward predictors are less constrained.

SLIDE 40

The Partitioning Problem [CL14]

Given a set of n examples, each with one of K labels, find a partitioner h that maximizes:

E_{x,y} | Pr(h(x) = 1, y) − Pr(h(x) = 1) Pr(y) |

SLIDE 41

The Partitioning Problem [CL14]

Given a set of n examples, each with one of K labels, find a partitioner h that maximizes:

Σ_y Pr(y) | Pr(h(x) = 1 | x ∈ X_y) − Pr(h(x) = 1) |

where X_y is the set of x associated with y.
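A toy numeric check of this objective on hard split decisions; the data, the two candidate splits, and the helper name `split_quality` are made up for illustration.

```python
import numpy as np

def split_quality(h_vals, y_vals):
    # sum_y Pr(y) * |Pr(h(x) = 1 | y) - Pr(h(x) = 1)| from empirical counts
    h, y = np.asarray(h_vals), np.asarray(y_vals)
    p1 = h.mean()                                    # Pr(h(x) = 1)
    return sum((y == c).mean() * abs(h[y == c].mean() - p1)
               for c in np.unique(y))

y = np.array([1, 1, 2, 2, 3, 3])
print(split_quality([0, 0, 1, 1, 1, 1], y))  # label-aligned split: ~0.444
print(split_quality([0, 1, 0, 1, 0, 1], y))  # label-independent split: 0.0
```

A split that depends on the label scores high; a split independent of the label scores zero, matching the intuition that a good partitioner should separate classes.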

SLIDE 42

The Partitioning Problem [CL14]

Given a set of n examples, each with one of K labels, find a partitioner h that maximizes the objective above.

Nonconvex for any symmetric hypothesis class (ouch).

SLIDE 43

Bottom Up doesn’t work

[Diagram: three classes 1, 2, 3 arranged along a line.]

Suppose you use linear representations.

SLIDE 44

Bottom Up doesn’t work

[Diagram: three classes 1, 2, 3 arranged along a line.]

Suppose you use linear representations. Suppose you first build a 1v3 predictor.

SLIDE 45

Bottom Up doesn’t work

[Diagram: three classes 1, 2, 3 arranged along a line.]

Suppose you use linear representations. Suppose you first build a 1v3 predictor. Suppose you then build a 2v{1v3} predictor. You lose.

SLIDE 46

Does partitioning recurse well?

Theorem: If at every node, E_{x,y} | Pr(h(x) = 1, y) − Pr(h(x) = 1) Pr(y) | > γ, then after (1/ε)^{4(1−γ)² ln K / γ²} splits, the multiclass error is less than ε.

SLIDE 47

Online Partitioning

Relax the optimization criterion:

E_{x,y} | E_{x|y}[ŷ(x)] − E_x[ŷ(x)] |

... and approximate it with a running average.

SLIDE 48

Online Partitioning

Relax the optimization criterion:

E_{x,y} | E_{x|y}[ŷ(x)] − E_x[ŷ(x)] |

... and approximate it with a running average.

Let e = 0 and, for all y, e_y = 0 and n_y = 0. For each example (x, y):

1. If e_y < e then b = −1, else b = 1.
2. Update w using (x, b).
3. n_y ← n_y + 1
4. e_y ← ((n_y − 1)/n_y) · e_y + ŷ(x)/n_y
5. e ← ((t − 1)/t) · e + ŷ(x)/t, where t is the total number of examples seen so far.

Apply recursively to construct a tree structure.
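A direct transcription of this node update in Python; the binary regressor `w`, with `update(x, b)` and a `predict(x)` returning ŷ(x), is an assumed interface.

```python
class PartitionNode:
    def __init__(self, w):
        self.w = w      # binary regressor: update(x, b) and predict(x)
        self.e = 0.0    # running mean of yhat(x) over all examples
        self.ey = {}    # per-label running means e_y
        self.ny = {}    # per-label counts n_y
        self.t = 0      # total example count

    def learn(self, x, y):
        # Route label y toward whichever side keeps the split balanced.
        b = -1 if self.ey.get(y, 0.0) < self.e else 1
        self.w.update(x, b)
        self.t += 1
        n = self.ny[y] = self.ny.get(y, 0) + 1
        yhat = self.w.predict(x)
        self.ey[y] = (n - 1) / n * self.ey.get(y, 0.0) + yhat / n
        self.e = (self.t - 1) / self.t * self.e + yhat / self.t
```

Recursing on the two sides of each trained node yields the tree; this is the construction behind the LOMtree results below.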

SLIDE 49

Accuracy for a fixed training time

[Plot: accuracy (log scale) vs number of classes for LOMtree vs one-against-all (OAA), on isolet (26 classes), sector (105), aloi (1000), imagenet (21841), and ODP (105033).]

SLIDE 50

Test Error %, optimized, no train-time constraint

[Bar chart: test error % of the log-time algorithms (Rand tree, Filter tree, LOMtree) on Isolet, Sector, Aloi, Imagenet, and ODP.]

SLIDE 51

Test Error %, optimized, no train-time constraint

[Bar chart: test error % of Rand tree, Filter tree, and LOMtree compared to OAA on Isolet, Sector, Aloi, Imagenet, and ODP.]

SLIDE 52

Classes vs Test time ratio

[Plot: log₂(test-time ratio) vs log₂(number of classes) for LOMtree vs one-against-all.]

SLIDE 53

Can we predict in time O(log₂ K)?

SLIDE 54

Can we predict in time O(log₂ K)?

What is the right way to achieve consistency and dynamic partitioning?

SLIDE 55

Can we predict in time O(log₂ K)?

What is the right way to achieve consistency and dynamic partitioning? How can you balance representation complexity and sample complexity?

SLIDE 56

Bibliography

[BLR09] Alina Beygelzimer, John Langford, Pradeep Ravikumar. Error-Correcting Tournaments. http://arxiv.org/abs/0902.3176

[BWG10] Samy Bengio, Jason Weston, David Grangier. Label Embedding Trees for Large Multi-class Tasks. NIPS 2010.

[CL14] Anna Choromanska, John Langford. Logarithmic Time Online Multiclass Prediction. http://arxiv.org/abs/1406.1822