Logarithmic Time Prediction
John Langford, Microsoft Research
DIMACS Workshop on Big Data through the Lens of Sublinear Algorithms
The Multiclass Prediction Problem

Repeatedly:
1. See x
2. Predict ŷ ∈ {1, ..., K}
3. See y
Goal: find h(x) minimizing the error rate Pr_{(x,y)∼D}(h(x) ≠ y), with h(x) fast.
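As a concrete rendering of this protocol, here is a minimal Python sketch of the interaction loop; `stream` and `learner` (with `predict`/`learn` methods) are hypothetical stand-ins, not anything from the talk.

```python
# Minimal sketch of the online multiclass protocol (hypothetical learner API).
errors = 0
for t, (x, y) in enumerate(stream, start=1):  # stream yields (x, y) pairs
    y_hat = learner.predict(x)  # 1. see x, 2. predict y_hat in {1, ..., K}
    errors += (y_hat != y)      # 3. see the true label y
    learner.learn(x, y)
    error_rate = errors / t     # running estimate of Pr(h(x) != y)
```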
Why?
Trick #1: K is small
Trick #2: A hierarchy exists
So use Trick #1 repeatedly.
Trick #3: Shared representation
Very helpful... but computation in the last layer can still blow up.
Trick #4: “Structured Prediction”
But what if the structure is unclear?
Trick #5: GPU
4 Teraflops is great... yet still burns energy.
How fast can we hope to go?

Theorem: There exist multiclass classification problems where achieving 0 error rate requires Ω(log K) time to train or test per example.

Proof: By construction. Pick y ∼ U{1, ..., K}. Any prediction algorithm outputting fewer than log₂ K bits loses with constant probability. Any training algorithm reading an example requires Ω(log₂ K) time.
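A back-of-the-envelope restatement of the prediction half of that argument (my paraphrase, not the slide's exact proof):

```latex
% A predictor running in time t writes at most t bits, so it can name at
% most 2^t distinct labels. With y drawn uniformly from {1, ..., K}:
\Pr[\hat{y} = y] \;\le\; \frac{2^{t}}{K}
% Driving the error rate to 0 therefore forces 2^t >= K, i.e. t >= log_2 K.
```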
Can we predict in time O(log₂ K)?
[Plot: Computational Advantage of Log Time. Benefit K / log(K) as a function of K, for K up to 10⁶.]
Not it #1: Sparse Error Correcting Output Codes

1. Create O(log K) binary vectors b_i of length K (entry b_{iy} for label y).
2. Train O(log K) binary classifiers h_i to minimize the error rate Pr_{x,y}(h_i(x) ≠ b_{iy}).
3. Predict by finding the y with minimal error.

Prediction is Ω(K).
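A minimal sketch of this reduction, showing where the Ω(K) cost lives; `K`, `X`, `y_train`, and `make_binary_learner` (a factory returning a fit/predict binary learner) are assumptions of the sketch, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits = int(np.ceil(np.log2(K)))             # O(log K) binary problems
codes = rng.integers(0, 2, size=(n_bits, K))  # codes[i, y] = b_{iy}

learners = []
for i in range(n_bits):
    h = make_binary_learner()      # hypothetical binary learner factory
    h.fit(X, codes[i, y_train])    # target: bit i of each example's label code
    learners.append(h)

def ecoc_predict(x):
    bits = np.array([h.predict(x) for h in learners])  # O(log K) evaluations
    # Decoding compares against all K codewords: this scan is Omega(K).
    disagreement = (codes != bits[:, None]).sum(axis=0)
    return int(disagreement.argmin())
```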
Not it #2: Hierarchy Construction

1. Build a confusion matrix of errors.
2. Recursively partition it to create a hierarchy.
3. Apply the hierarchy solution.

Training is Ω(K) or worse.
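A sketch of that recipe, under stated assumptions: `partition` and `confusion_matrix_of` are hypothetical helpers, and `flat_model`, `X`, `y`, `K` are stand-ins. It makes the cost visible: the confusion matrix alone has K² entries.

```python
def build_hierarchy(labels, C):
    """Recursively split a label set using the confusion matrix C (K x K)."""
    if len(labels) == 1:
        return labels[0]                    # leaf: a single class
    # hypothetical splitter, e.g. a min-cut keeping confused labels together
    left, right = partition(labels, C)
    return (build_hierarchy(left, C), build_hierarchy(right, C))

# The catch: C comes from an already-trained flat model, so measuring
# confusion is itself Omega(K) work (or worse) before the tree exists.
C = confusion_matrix_of(flat_model, X, y)   # hypothetical helper
tree = build_hierarchy(list(range(K)), C)   # then train one binary node per split
```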
Not it #3: Unnormalized learning

Train K regressors: for each example (x, y),
1. Train regressor y with (x, 1).
2. Pick y′ ≠ y uniformly at random.
3. Train regressor y′ with (x, −1).

Prediction is still Ω(K).
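A minimal sketch of the scheme, assuming a hypothetical list of online regressors with `learn`/`predict` methods; the prediction step shows why test time stays Ω(K).

```python
import random

def train_example(regressors, x, y):
    """One online step; regressors[y] has hypothetical learn/predict methods."""
    K = len(regressors)
    regressors[y].learn(x, 1.0)       # regress the true label toward +1
    y_neg = random.randrange(K - 1)   # pick y' != y uniformly at random
    if y_neg >= y:
        y_neg += 1
    regressors[y_neg].learn(x, -1.0)  # regress one other label toward -1

def predict(regressors, x):
    # Still Omega(K): every regressor gets evaluated at test time.
    scores = [r.predict(x) for r in regressors]
    return max(range(len(scores)), key=scores.__getitem__)
```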
Can we predict in time O(log₂ K)?
Is logarithmic time even possible?
[Tree: root splits 1 v {2, 3}, then 2 v 3, with P(y=1) = 0.4, P(y=2) = 0.3, P(y=3) = 0.3.]

P({2, 3}) > P(1) ⇒ divide and conquer loses: the root prefers the {2, 3} branch, yet the single most probable label is 1.
Filter Trees [BLR09]

[Same tree as above: 1 v {2, 3} at the root, 2 v 3 below.]

1. Learn 2 v 3 first.
2. Throw away all error examples.
3. Learn 1 v survivors.
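A sketch of those three steps on this 3-class tree; `make_binary_learner` (a fit/predict binary learner factory) and `data` (a list of (x, y) pairs with y in {1, 2, 3}) are assumptions of the sketch.

```python
# Round 1: learn 2 v 3 on the examples whose label is 2 or 3.
lower = [(x, y) for x, y in data if y in (2, 3)]
h23 = make_binary_learner()
h23.fit([x for x, _ in lower], [int(y == 2) for _, y in lower])

# Round 2: learn 1 v survivors. A label-2/3 example survives only if the
# lower game's winner is its true label ("throw away all error examples").
top_X, top_b = [], []
for x, y in data:
    if y == 1:
        top_X.append(x); top_b.append(1)
    else:
        winner = 2 if h23.predict(x) == 1 else 3
        if winner == y:                      # filter out lower-round errors
            top_X.append(x); top_b.append(0)
h_root = make_binary_learner()
h_root.fit(top_X, top_b)

def predict(x):
    # One binary test per level: O(log K) in general.
    if h_root.predict(x) == 1:
        return 1
    return 2 if h23.predict(x) == 1 else 3
```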
Theorem: For all multiclass problems and all binary classifiers, Multiclass Regret ≤ Average Binary Regret × log(K).
Can you make it robust?

[Figure: tournament bracket over labels 1 through 8, shown first with a single Winner, then with multiple Winners.]
Theorem [BLR09]: For all multiclass problems and all binary classifiers, a log(K)-correcting tournament satisfies: Multiclass Regret ≤ Average Binary Regret × 5.5.

This tournament determined the best paper prize for ICML 2012 (run over area chair decisions).
How do you learn structure?

Not all partitions are equally difficult. Compare {1, 7} v {3, 8} to {1, 8} v {3, 7}. Which is better?

[BWG10]: Better to confuse near the leaves than near the root. Intuition: the root predictor tends to be overconstrained, while predictors nearer the leaves are less constrained.
The Partitioning Problem [CL14]

Given a set of n examples, each with one of K labels, find a partitioner h that maximizes
E_{x,y} |Pr(h(x) = 1, y) − Pr(h(x) = 1) Pr(y)|,
that is,
Σ_y Pr(y) |Pr(h(x) = 1 | x ∈ X_y) − Pr(h(x) = 1)|,
where X_y is the set of x associated with label y.

Nonconvex for any symmetric hypothesis class (ouch).
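A sketch of the second (per-label) form of this objective, evaluated empirically for a fixed partitioner; `h` is an assumed callable returning 0 or 1, and `X`, `y`, `K` are stand-in data.

```python
import numpy as np

def partition_objective(h, X, y, K):
    """Empirical sum_y Pr(y) * |Pr(h=1 | x in X_y) - Pr(h=1)| for fixed h."""
    hx = np.array([h(xi) == 1 for xi in X], dtype=float)
    p_h1 = hx.mean()                        # Pr(h(x) = 1)
    total = 0.0
    for c in range(K):
        mask = (y == c)
        if mask.any():
            p_y = mask.mean()               # Pr(y = c)
            p_h1_given_y = hx[mask].mean()  # Pr(h(x) = 1 | x in X_c)
            total += p_y * abs(p_h1_given_y - p_h1)
    return total
```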
Bottom Up doesn't work

[Figure: three classes 1, 2, 3 arranged left to right.]

Suppose you use linear representations, and you first build a 1 v 3 predictor. If you then build a 2 v {1, 3} predictor, you lose: with class 2 between 1 and 3, no linear predictor separates 2 from both.
Does partitioning recurse well?

Theorem: If at every node n, E_{x,y} |Pr(h(x) = 1, y) − Pr(h(x) = 1) Pr(y)| > γ, then after (1/ε)^{4(1−γ)² ln(k) / γ²} splits, the multiclass error is less than ε.
Online Partitioning

Relax the optimization criterion to
E_{x,y} |E_{x|y}[ŷ(x)] − E_x[ŷ(x)]|
... and approximate it with running averages.
Let e = 0 and, for all y, e_y = 0, n_y = 0.
For each example (x, y):
1. If e_y < e then b = −1 else b = 1.
2. Update w using (x, b).
3. n_y ← n_y + 1
4. e_y ← ((n_y − 1)·e_y + ŷ(x)) / n_y
5. e ← ((t − 1)·e + ŷ(x)) / t
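A direct Python transcription of these five steps, assuming a hypothetical binary learner `w` where `w.score(x)` plays the role of ŷ(x) and `w.update(x, b)` takes b in {−1, +1}.

```python
# State for the running averages in the algorithm above.
e, t = 0.0, 0      # global mean of y_hat(x) and example count
e_y, n_y = {}, {}  # per-label mean of y_hat(x) and per-label counts

def partition_step(w, x, y):
    """One online update for example (x, y)."""
    global e, t
    e_y.setdefault(y, 0.0); n_y.setdefault(y, 0)
    b = -1 if e_y[y] < e else 1      # step 1: send y to the emptier side
    w.update(x, b)                   # step 2: one binary update
    n_y[y] += 1                      # step 3
    s = w.score(x)                   # y_hat(x)
    e_y[y] += (s - e_y[y]) / n_y[y]  # step 4: ((n_y - 1)*e_y + y_hat(x)) / n_y
    t += 1
    e += (s - e) / t                 # step 5: ((t - 1)*e + y_hat(x)) / t
```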