
Logarithmic Time Prediction - John Langford, Microsoft Research



1. Logarithmic Time Prediction. John Langford, Microsoft Research. DIMACS Workshop on Big Data through the Lens of Sublinear Algorithms.

2. The Multiclass Prediction Problem. Repeatedly: 1. See x. 2. Predict ŷ ∈ {1, ..., K}. 3. See y.

3. The Multiclass Prediction Problem. Repeatedly: 1. See x. 2. Predict ŷ ∈ {1, ..., K}. 3. See y. Goal: find h(x) minimizing the error rate Pr_{(x,y)∼D}(h(x) ≠ y), with h(x) fast.
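
For concreteness, here is a minimal sketch of this online protocol; `predict` and `update` are hypothetical stand-ins for whatever learner implements h, not something from the talk.

```python
# A minimal sketch of the repeated protocol above; `predict` and `update`
# are placeholders for whatever learner implements h.
def run(stream, predict, update):
    """stream yields (x, y) pairs with y in {1, ..., K}."""
    mistakes = total = 0
    for x, y in stream:
        y_hat = predict(x)        # 1. see x, 2. predict y_hat
        mistakes += (y_hat != y)  # 3. see y, count errors
        update(x, y)              # an online learner may adapt here
        total += 1
    return mistakes / max(total, 1)   # empirical estimate of Pr(h(x) != y)
```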

  4. Why?

  5. Why?

6. Trick #1: K is small

  7. Trick #2: A hierarchy exists

8. Trick #2: A hierarchy exists. So use Trick #1 repeatedly.

  9. Trick #3: Shared representation

10. Trick #3: Shared representation. Very helpful... but computation in the last layer can still blow up.

  11. Trick #4: “Structured Prediction”

12. Trick #4: “Structured Prediction”. But what if the structure is unclear?

  13. Trick #5: GPU

14. Trick #5: GPU. 4 Teraflops is great... yet still burns energy.

  15. How fast can we hope to go?

16. How fast can we hope to go? Theorem: There exist multiclass classification problems where achieving 0 error rate requires Ω(log K) time to train or test per example.

17. How fast can we hope to go? Theorem: There exist multiclass classification problems where achieving 0 error rate requires Ω(log K) time to train or test per example. Proof: by construction. Pick y ∼ U(1, ..., K).

18. How fast can we hope to go? Theorem: There exist multiclass classification problems where achieving 0 error rate requires Ω(log K) time to train or test per example. Proof: by construction. Pick y ∼ U(1, ..., K). Any prediction algorithm outputting fewer than log₂ K bits loses with constant probability. Any training algorithm reading an example requires Ω(log₂ K) time.
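
A one-line version of the prediction-side counting step, as a sketch under the uniform-label construction above: a predictor that emits only b bits can name at most 2^b distinct labels.

```latex
% Counting step for the prediction lower bound (sketch): with y ~ U(1,...,K),
% a predictor emitting only b bits has a range of at most 2^b labels, so
\[
  \Pr[\hat{y} \neq y] \;\ge\; 1 - \frac{2^{b}}{K} \;\ge\; \tfrac{1}{2}
  \qquad \text{whenever } b \le \log_2 K - 1 .
\]
```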

19. Can we predict in time O(log₂ K)? [Figure: Computational Advantage of Log Time, plotting the benefit K / log(K) against the number of classes K, from 10 to 10⁶.]
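
The plotted quantity is easy to reproduce; a tiny script (not from the talk) over the same range of K:

```python
# The "benefit" curve in the figure is just K / log2(K): the best-case speedup
# of an O(log K)-time predictor over an O(K)-time one, ignoring constants.
import math

for K in (10, 100, 1000, 10**4, 10**5, 10**6):
    print(f"K = {K:>7}:  K / log2(K) = {K / math.log2(K):,.0f}")
```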

20. Not it #1: Sparse Error Correcting Output Codes. 1. Create O(log K) binary vectors b_iy of length K.

21. Not it #1: Sparse Error Correcting Output Codes. 1. Create O(log K) binary vectors b_iy of length K. 2. Train O(log K) binary classifiers h_i to minimize the error rate Pr_{x,y}(h_i(x) ≠ b_iy).

22. Not it #1: Sparse Error Correcting Output Codes. 1. Create O(log K) binary vectors b_iy of length K. 2. Train O(log K) binary classifiers h_i to minimize the error rate Pr_{x,y}(h_i(x) ≠ b_iy). 3. Predict by finding the y with minimal error.

23. Not it #1: Sparse Error Correcting Output Codes. 1. Create O(log K) binary vectors b_iy of length K. 2. Train O(log K) binary classifiers h_i to minimize the error rate Pr_{x,y}(h_i(x) ≠ b_iy). 3. Predict by finding the y with minimal error. Prediction is Ω(K).
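
A rough sketch of the decode step, assuming a list `classifiers` of trained binary predictors h_i and a code table `codes[i][y]` holding the bits b_iy (both hypothetical names); the final argmin scans every label, which is why prediction remains Ω(K).

```python
# ECOC decoding sketch (step 3 above); labels are indexed 0..K-1 here.
def ecoc_predict(x, classifiers, codes, K):
    bits = [h_i(x) for h_i in classifiers]          # O(log K) classifier calls

    def disagreements(y):
        # Hamming distance between the predicted bits and label y's codeword.
        return sum(b != codes[i][y] for i, b in enumerate(bits))

    return min(range(K), key=disagreements)         # Omega(K) decode loop
```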

24. Not it #2: Hierarchy Construction. 1. Build a confusion matrix of errors.

25. Not it #2: Hierarchy Construction. 1. Build a confusion matrix of errors. 2. Recursively partition to create a hierarchy.

26. Not it #2: Hierarchy Construction. 1. Build a confusion matrix of errors. 2. Recursively partition to create a hierarchy. 3. Apply the hierarchy solution.

27. Not it #2: Hierarchy Construction. 1. Build a confusion matrix of errors. 2. Recursively partition to create a hierarchy. 3. Apply the hierarchy solution. Training is Ω(K) or worse.
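
The slides do not specify how the recursive partition is done; the following is a hedged sketch of steps 1-2 using a spectral-style heuristic on a hypothetical K×K `confusion` count matrix. Even forming that matrix already requires evaluating all K classes, which is the slide's point about Ω(K) training.

```python
# Recursively bisect the label set so that frequently-confused labels stay on
# the same side (one possible instantiation, not the talk's method).
import numpy as np

def split_labels(labels, confusion):
    if len(labels) <= 1:
        return labels
    sub = confusion[np.ix_(labels, labels)]
    sym = sub + sub.T                           # symmetrize confusion counts
    lap = np.diag(sym.sum(axis=1)) - sym        # graph Laplacian
    _, vecs = np.linalg.eigh(lap)
    side = vecs[:, 1] >= 0                      # sign of the Fiedler vector
    left = [l for l, s in zip(labels, side) if s]
    right = [l for l, s in zip(labels, side) if not s]
    if not left or not right:                   # degenerate cut: fall back
        left, right = labels[: len(labels) // 2], labels[len(labels) // 2:]
    return [split_labels(left, confusion), split_labels(right, confusion)]
```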

28. Not it #3: Unnormalized learning. Train K regressors: for each example (x, y), 1. Train regressor y with (x, 1).

29. Not it #3: Unnormalized learning. Train K regressors: for each example (x, y), 1. Train regressor y with (x, 1). 2. Pick y′ ≠ y uniformly at random. 3. Train regressor y′ with (x, −1).

30. Not it #3: Unnormalized learning. Train K regressors: for each example (x, y), 1. Train regressor y with (x, 1). 2. Pick y′ ≠ y uniformly at random. 3. Train regressor y′ with (x, −1). Prediction is still Ω(K).
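
A sketch of the scheme, assuming `regressors` is a list of K online regressors with hypothetical `update(x, target)` and `score(x)` methods: each example touches only two regressors, but prediction still scans all K.

```python
import random

def train_example(regressors, x, y):
    regressors[y].update(x, 1.0)                 # push the true class toward +1
    y_neg = random.choice([c for c in range(len(regressors)) if c != y])
    regressors[y_neg].update(x, -1.0)            # push one random class toward -1

def predict(regressors, x):
    # Omega(K): every regressor must be scored to find the argmax.
    return max(range(len(regressors)), key=lambda c: regressors[c].score(x))
```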

31. Can we predict in time O(log₂ K)?

32. Is logarithmic time even possible? [Figure: a tree that first splits 1 v {2,3}, then 2 v 3, on a distribution with P(y=1) = .4, P(y=2) = .3, P(y=3) = .3.] P({2,3}) > P(1) ⇒ lose for divide and conquer.

33. Filter Trees [BLR09]. [Figure: the same 1 v {2,3}, then 2 v 3 tree with P(y=1) = .4, P(y=2) = .3, P(y=3) = .3.] 1. Learn 2 v 3 first. 2. Throw away all error examples. 3. Learn 1 v Survivors. Theorem: For all multiclass problems, for all binary classifiers, Multiclass Regret ≤ Average Binary Regret * log(K).
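
A hedged sketch of filter-tree training on this 3-class example, using a hypothetical `BinaryLearner` that exposes `fit(pairs)` and `predict(x)` in {−1, +1}; error examples at the lower node are filtered out of the root's training set, which is what makes the regret transfer work.

```python
def train_filter_tree(data, BinaryLearner):
    # Lower node: 2 vs 3 (label +1 means "2", -1 means "3").
    node_23 = BinaryLearner()
    node_23.fit([(x, +1 if y == 2 else -1) for x, y in data if y in (2, 3)])

    # Root: 1 vs the survivor of {2, 3}; keep only examples the child got right.
    root_data = []
    for x, y in data:
        if y == 1:
            root_data.append((x, +1))                 # +1 means "predict 1"
        elif y in (2, 3) and (y == 2) == (node_23.predict(x) > 0):
            root_data.append((x, -1))                 # -1 means "defer to child"
    root = BinaryLearner()
    root.fit(root_data)
    return root, node_23
```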

34. Can you make it robust? [Figure: a single-elimination tournament over labels 1-8 selecting a Winner.]

35. Can you make it robust? [Figure: tournaments over labels 1-8 selecting several Winners.]

36. Can you make it robust? [Figure: tournaments over labels 1-8 selecting several Winners.]

37. Can you make it robust? [Figure: tournaments over labels 1-8 selecting several Winners.] Theorem [BLR09]: For all multiclass problems, for all binary classifiers, a log(K)-correcting tournament satisfies: Multiclass Regret ≤ Average Binary Regret * 5.5. This determined the best paper prize for ICML 2012 (area chair decisions).

38. How do you learn structure? Not all partitions are equally difficult. Compare {1, 7} v {3, 8} to {1, 8} v {3, 7}. Which is better?

39. How do you learn structure? Not all partitions are equally difficult. Compare {1, 7} v {3, 8} to {1, 8} v {3, 7}. Which is better? [BWG10]: Better to confuse near the leaves than near the root. Intuition: the root predictor tends to be overconstrained, while the leafward predictors are less constrained.

40. The Partitioning Problem [CL14]. Given a set of n examples, each with one of K labels, find a partitioner h that maximizes: E_{x,y} |Pr(h(x) = 1, y) − Pr(h(x) = 1) Pr(y)|

41. The Partitioning Problem [CL14]. Given a set of n examples, each with one of K labels, find a partitioner h that maximizes: Σ_y Pr(y) |Pr(h(x) = 1 | x ∈ X_y) − Pr(h(x) = 1)|, where X_y is the set of x associated with y.

42. The Partitioning Problem [CL14]. Given a set of n examples, each with one of K labels, find a partitioner h that maximizes the objective above. Nonconvex for any symmetric hypothesis class (ouch).
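
A sketch of evaluating the per-label (Σ_y) form of the objective empirically for a fixed candidate partitioner h: x → {0, 1}, using plug-in frequency estimates; maximizing this over h is the nonconvex part, which the sketch does not attempt.

```python
from collections import Counter

def partition_objective(data, h):
    """data is a list of (x, y) pairs; h maps x to 0 or 1."""
    n = len(data)
    p_h1 = sum(h(x) == 1 for x, _ in data) / n            # Pr(h(x) = 1)
    n_y = Counter(y for _, y in data)                      # per-label counts
    n_h1_y = Counter(y for x, y in data if h(x) == 1)      # joint counts
    # sum_y | Pr(h(x)=1, y) - Pr(h(x)=1) Pr(y) |
    return sum(abs(n_h1_y[y] / n - p_h1 * n_y[y] / n) for y in n_y)
```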

43. Bottom Up doesn't work. [Figure: classes 1, 2, 3 laid out with class 2 between 1 and 3.] Suppose you use linear representations.

44. Bottom Up doesn't work. [Figure: classes 1, 2, 3 laid out with class 2 between 1 and 3.] Suppose you use linear representations. Suppose you first build a 1v3 predictor.

45. Bottom Up doesn't work. [Figure: classes 1, 2, 3 laid out with class 2 between 1 and 3.] Suppose you use linear representations. Suppose you first build a 1v3 predictor. Suppose you then build a 2v{1v3} predictor. You lose: no linear separator can isolate class 2 from the merged {1, 3} node when 2 lies between them.

46. Does partitioning recurse well? Theorem: If at every node n, E_{x,y} |Pr(h(x) = 1, y) − Pr(h(x) = 1) Pr(y)| > γ, then after (4(1 − γ)² ln k / γ²)^(1/ε) splits the multiclass error is less than ε.

47. Online Partitioning. Relax the optimization criteria: E_{x,y} | E_{x|y}[ŷ(x)] − E_x[ŷ(x)] | ... and approximate with a running average.

48. Online Partitioning. Relax the optimization criteria: E_{x,y} | E_{x|y}[ŷ(x)] − E_x[ŷ(x)] | ... and approximate with a running average. Let e = 0 and, for all y, e_y = 0 and n_y = 0. For each example (x, y): 1. If e_y < e then b = −1, else b = 1. 2. Update w using (x, b). 3. n_y ← n_y + 1. 4. e_y ← ((n_y − 1) e_y + ŷ(x)) / n_y. 5. e ← ((t − 1) e + ŷ(x)) / t. Apply recursively to construct a tree structure.
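
A hedged sketch of this per-node update (not the authors' code): `w` is any online binary scorer with hypothetical `update(x, b)` and `score(x)` methods, and `score(x)` plays the role of ŷ(x) in the running averages. Applying the same node recursively to the examples routed to each child gives the tree structure the slide describes.

```python
class PartitionNode:
    def __init__(self, w):
        self.w = w
        self.e = 0.0        # running mean of y_hat(x) over all examples
        self.t = 0          # total example count
        self.e_y = {}       # per-label running means of y_hat(x)
        self.n_y = {}       # per-label example counts

    def learn(self, x, y):
        e_y = self.e_y.get(y, 0.0)
        b = -1 if e_y < self.e else 1      # route label y toward the side it
        self.w.update(x, b)                #   currently under-uses, train splitter
        y_hat = self.w.score(x)
        self.t += 1
        self.n_y[y] = n_y = self.n_y.get(y, 0) + 1
        self.e_y[y] = ((n_y - 1) * e_y + y_hat) / n_y
        self.e = ((self.t - 1) * self.e + y_hat) / self.t
        return b                           # which child receives (x, y)
```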

49. Accuracy for a fixed training time. [Figure: LOMtree vs one-against-all, accuracy on a log scale against the number of classes: isolet (26), sector (105), aloi (1000), imagenet (21841), ODP (105033).]

50. Test Error %, optimized, no train-time constraint. [Figure: Performance of Log-time algorithms, test error % of Rand, Filter, and LOM trees on Isolet, Sector, Aloi, Imagenet, and ODP.]

51. Test Error %, optimized, no train-time constraint. [Figure: Compared to OAA, test error % of Rand, Filter, LOM, and OAA on Isolet, Sector, Aloi, Imagenet, and ODP.]

52. Classes vs Test time ratio. [Figure: LOMtree vs one-against-all, log₂(time ratio) plotted against log₂(number of classes).]

53. Can we predict in time O(log₂ K)?

54. Can we predict in time O(log₂ K)? What is the right way to achieve consistency and dynamic partitioning?

55. Can we predict in time O(log₂ K)? What is the right way to achieve consistency and dynamic partitioning? How can you balance representation complexity and sample complexity?

56. Bibliography
[BLR09] Alina Beygelzimer, John Langford, Pradeep Ravikumar. Error-Correcting Tournaments. http://arxiv.org/abs/0902.3176
[BWG10] Samy Bengio, Jason Weston, David Grangier. Label Embedding Trees for Large Multi-Class Tasks. NIPS 2010.
[CL14] Anna Choromanska, John Langford. Logarithmic Time Online Multiclass Prediction. http://arxiv.org/abs/1406.1822
