

How can we generalize well?
ECML 2015 Big Targets Workshop
Paul Mineiro

Extreme Challenges
- How can we generalize well?
- Can we compete with OAA (one-against-all)?
- When can we predict quickly?

How can we generalize well?

Chasing Tails
Typical extreme datasets have many rare classes. What are the implications for generalization? Let's use the bootstrap to get intuition.

Bootstrap Lesson
Observation (Tail Frequencies): The true frequencies of tail classes are not clear given the training set.
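
To make the observation concrete, here is a small bootstrap sketch (my illustration, not from the slides): on a synthetic long-tailed training set, the resampled count of the rarest observed class swings over 0, 1, 2, ..., so its true frequency is far from pinned down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic long-tailed training set: Zipf-like class frequencies.
n_classes, n_train = 1000, 10_000
probs = 1.0 / np.arange(1, n_classes + 1)
probs /= probs.sum()
labels = rng.choice(n_classes, size=n_train, p=probs)

counts = np.bincount(labels, minlength=n_classes)
seen = np.flatnonzero(counts > 0)
tail = int(seen[np.argmin(counts[seen])])  # rarest class actually observed

# Bootstrap: resample the training set; how often does the tail class appear?
boot = [np.sum(rng.choice(labels, size=n_train) == tail) for _ in range(1000)]
print(np.bincount(boot))  # substantial mass on 0, 1, 2, ...
```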

Two Loss Patterns
All classes below have 1 training example. Which hypothesis do you like better?

              h1    h2
    class 1   1     0.6
    class 2   1     0.6
    class 3   0     0.42
    class 4   0     0.42

ERM likes h1 better. I like h2 better.
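
The arithmetic behind the preference (worked out here, not on the slide): h1 wins on average loss by a hair, while h2's loss variance is roughly 30× smaller.

```python
import numpy as np

h1 = np.array([1.0, 1.0, 0.0, 0.0])
h2 = np.array([0.6, 0.6, 0.42, 0.42])

for name, losses in [("h1", h1), ("h2", h2)]:
    print(f"{name}: mean={losses.mean():.4f} var={losses.var():.4f}")
# h1: mean=0.5000 var=0.2500
# h2: mean=0.5100 var=0.0081
```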

The Extreme Deficiencies of ERM
ERM cares only about average loss:

    h* = argmin_{h ∈ H} E_{(x,y)~D}[ l(h(x); y) ]

...but in extreme learning, empirical losses can have high variance. ERM doesn't care about empirical loss variance. ERM is based upon a uniform bound over the hypothesis space.

eXtreme Risk Minimization
Sample Variance Penalization (XRM) penalizes a combination of expected loss and loss variance:

    h* = argmin_{h ∈ H} ( E[ l(h(x); y) ] + κ √(V[ l(h(x); y) ]) )

(κ is a hyperparameter in practice.) XRM is based upon empirical Bernstein bounds.
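
A minimal sketch of the penalized objective, assuming per-example losses are collected in a vector (the function name is mine):

```python
import numpy as np

def xrm_objective(losses: np.ndarray, kappa: float) -> float:
    """Sample-variance-penalized empirical risk: mean loss + kappa * loss std."""
    mean = losses.mean()
    var = np.mean(losses ** 2) - mean ** 2  # E[l^2] - E[l]^2, as in the slides
    return mean + kappa * np.sqrt(var)
```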

Example: Neural Language Modeling
Mini-batch XRM gradient:

    E_i[ (1 + κ (l_i(φ) − E_j[l_j(φ)]) / √(E_j[l_j(φ)²] − E_j[l_j(φ)]²)) ∂l_i(φ)/∂φ ]

Smaller than average loss ⟹ lower learning rate.
Larger than average loss ⟹ larger learning rate.
Deviations from the mean loss are measured in units of the loss standard deviation.
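
In code, the gradient above amounts to scaling each example's gradient by 1 + κ times its standardized loss. A numpy sketch (function and variable names are mine; eps guards against a zero-variance batch):

```python
import numpy as np

def xrm_example_weights(losses: np.ndarray, kappa: float, eps: float = 1e-12) -> np.ndarray:
    """Per-example gradient weights for a mini-batch: 1 + kappa * standardized loss.

    Examples with above-average loss get weight > 1 (larger effective learning
    rate); below-average examples get weight < 1.
    """
    mean = losses.mean()
    std = np.sqrt(np.mean(losses ** 2) - mean ** 2 + eps)
    return 1.0 + kappa * (losses - mean) / std

# Usage: scale each example's loss (before backprop) by its weight.
losses = np.array([2.3, 0.7, 1.1, 4.0])
print(xrm_example_weights(losses, kappa=0.25))
```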

Example: Neural Language Modeling
enwiki9 dataset; FNN-LM of Zhang et al. Same everything except κ.

    method            perplexity
    ERM (κ = 0)       106.3
    XRM (κ = 0.25)    104.1

Modest lift, but over a SOTA baseline and with minimal code changes.

Example: Neural Language Modeling
[Figure: progressive loss variance vs. example number (10^4 through 10^10), comparing ERM and XRM.]

Example: Randomized Embeddings
Based upon (randomized) SVD: find a rank-k factorization W = T V^⊤ such that X W ≈ Y, where X ∈ R^{n×d} (features), Y ∈ R^{n×c} (labels), T ∈ R^{d×k}, and V^⊤ ∈ R^{k×c}.

How to adapt a black-box technique to XRM? Idea: proxy model ⟹ importance weights.
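
For flavor, here is a heavily simplified sketch of a randomized low-rank label embedding (my illustration under assumed shapes, not the exact algorithm from the talk): randomly project the labels to k dimensions, regress onto the projection, then decode back to classes.

```python
import numpy as np

def rembed_fit(X: np.ndarray, Y: np.ndarray, k: int, seed: int = 0):
    """Sketch: rank-k label embedding via a random projection of the labels.

    Assumed shapes: X is (n, d) features, Y is (n, c) label indicators.
    Returns T (d, k) and Vt (k, c) with X @ T @ Vt ≈ Y (least squares).
    """
    rng = np.random.default_rng(seed)
    n, c = Y.shape
    Omega = rng.standard_normal((c, k))
    Z = Y @ Omega                                    # (n, k): random sketch of the labels
    T, *_ = np.linalg.lstsq(X, Z, rcond=None)        # (d, k): proxy regression onto the sketch
    Vt, *_ = np.linalg.lstsq(X @ T, Y, rcond=None)   # (k, c): decode embedding back to classes
    return T, Vt
```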

Imbalanced Binary XRM
Binary classification with a constant predictor q:

    l(y; q) = −y log(q) − (1 − y) log(1 − q)

XRM importance weight:

    1 + κ (l(y; q) − E[l(·; q)]) / √(E[l²(·; q)] − E[l(·; q)]²)

Evaluated at q = p (with base rate p ≤ 0.5):

    = 1 − κ √(p / (1 − p))    if y = 0
    = 1 + κ √((1 − p) / p)    if y = 1
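
A quick numerical check (mine, not from the slides) that the closed form matches the general weight formula:

```python
import numpy as np

def xrm_weight(y: int, q: float, p: float, kappa: float) -> float:
    """General XRM weight 1 + kappa * (l(y) - E[l]) / sqrt(V[l]), label prob p."""
    l0, l1 = -np.log(1 - q), -np.log(q)   # losses for y = 0 and y = 1
    mean = p * l1 + (1 - p) * l0
    std = np.sqrt(p * l1**2 + (1 - p) * l0**2 - mean**2)
    return 1 + kappa * ((l1 if y else l0) - mean) / std

p, kappa = 0.1, 1.0
print(xrm_weight(0, p, p, kappa), 1 - kappa * np.sqrt(p / (1 - p)))   # y = 0: both ≈ 0.667
print(xrm_weight(1, p, p, kappa), 1 + kappa * np.sqrt((1 - p) / p))   # y = 1: both = 4.0
```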

XRM Rembed for ODP
Compute the base rate q_c for each class c on the ODP (Open Directory Project) data. Per-example importance weight: 1 + κ / √(q_{y_i}).

    method            error rate (%)
    ODP ERM           [80.3, 80.4]
    ODP XRM (κ = 1)   [78.5, 78.7]

Modest lift, but over a SOTA baseline and with minimal code changes.
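
In code, the recipe is a couple of lines (a sketch with assumed variable names):

```python
import numpy as np

def odp_xrm_weights(labels: np.ndarray, kappa: float = 1.0) -> np.ndarray:
    """Per-example importance weights 1 + kappa / sqrt(q_{y_i}), where q_c is the
    empirical base rate of class c in the training set."""
    counts = np.bincount(labels)
    q = counts / labels.size                   # base rate q_c per class
    return 1.0 + kappa / np.sqrt(q[labels])    # weight for example i via its label y_i
```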

Summary
The tail can deviate wildly between train and test. Controlling loss variance helps a little bit. Speculation: explicitly treat the head and tail differently?
