Calibrated Surrogate Maximization of Linear-fractional Utility



  1. Calibrated Surrogate Maximization of Linear-fractional Utility. 7th Feb. Han Bao (The University of Tokyo / RIKEN AIP)

  2. Is accuracy appropriate?
     ■ Our focus: binary classification
     [Figure: two example predictions (positive/negative counts 2, 8 and 5, 5), each reaching accuracy 0.8. Relying on accuracy alone may cause severe issues! (e.g. in medical diagnosis)]

  3. Is accuracy appropriate? F-measure
     [Figure: the same two predictions; both reach accuracy 0.8, but one has F-measure 0.75 and the other F-measure 0.]
     F₁ = 2TP / (2TP + FP + FN)
     TP = 𝔼_{X,Y=+1}[1{f(X) > 0}]    TN = 𝔼_{X,Y=−1}[1{f(X) < 0}]
     FP = 𝔼_{X,Y=−1}[1{f(X) > 0}]    FN = 𝔼_{X,Y=+1}[1{f(X) < 0}]
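To make the slide's comparison concrete, here is a minimal sketch (Python, not part of the original slides; the confusion-matrix counts are hypothetical choices that reproduce the reported values) of accuracy and F₁ computed from TP, TN, FP, FN:

```python
# Minimal sketch (not from the slides): accuracy vs. F-measure on two
# hypothetical 10-sample predictions with the same accuracy.

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Prediction A: 5 positives / 5 negatives, 2 positives missed.
print(accuracy(tp=3, tn=5, fp=0, fn=2), f1(tp=3, fp=0, fn=2))  # 0.8 0.75
# Prediction B: 2 positives / 8 negatives, everything predicted negative.
print(accuracy(tp=0, tn=8, fp=0, fn=2), f1(tp=0, fp=0, fn=2))  # 0.8 0.0
```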

  4. Training and Evaluation
     ■ Usual empirical risk minimization (ERM): training by minimizing the 0/1-error is compatible with evaluation by accuracy, since Acc = TP + TN = 1 − (0/1-risk).
     ■ Training with accuracy but evaluating with F₁ = 2TP / (2TP + FP + FN): training and evaluation are incompatible.
     ■ Why not? Direct optimization of F₁ = 2TP / (2TP + FP + FN).

  5. Wanna Unify!!
     ■ Accuracy:                          Acc = TP + TN
     ■ Weighted Accuracy:                 WAcc = (w₁TP + w₂TN) / (w₁TP + w₂TN + w₃FP + w₄FN)
     ■ Balanced Error Rate:               BER = (1/π)FN + (1/(1−π))FP
     ■ F-measure:                         F₁ = 2TP / (2TP + FP + FN)
     ■ Jaccard index:                     Jac = TP / (TP + FP + FN)
     ■ Gower-Legendre index:              GLI = (TP + TN) / (TP + α(FP + FN) + TN)
     ■ Fowlkes-Mallows index:             FMI = TP / √(π(TP + FP))
     ■ Matthews Correlation Coefficient:  MCC = (TP·TN − FP·FN) / √(π(1−π)(TP + FP)(TN + FN))

  6. Unification of Metrics
     ■ Actual metrics are linear-fractional in TP and FP:
       U(f) = (a₀·TP + b₀·FP + c₀) / (a₁·TP + b₁·FP + c₁),    a_k, b_k, c_k: constants
     Note: TN = ℙ(Y = −1) − FP and FN = ℙ(Y = +1) − TP, so TN and FN need not appear explicitly.
     e.g.  F₁ = 2TP / (2TP + FP + FN),   Jac = TP / (TP + FP + FN)
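As a sanity check on this unified form (my own illustration, not from the slides), the sketch below evaluates U(f) from the coefficients (a₀, b₀, c₀, a₁, b₁, c₁); with π = ℙ(Y = +1) and FN = π − TP, the F-measure corresponds to (a₀, b₀, c₀) = (2, 0, 0), (a₁, b₁, c₁) = (1, 1, π), and the Jaccard index to (1, 0, 0) and (0, 1, π):

```python
# Sketch (not from the slides): F1 and Jaccard as linear-fractional utilities
# U = (a0*TP + b0*FP + c0) / (a1*TP + b1*FP + c1), using FN = pi - TP.

def linear_fractional(tp, fp, a0, b0, c0, a1, b1, c1):
    return (a0 * tp + b0 * fp + c0) / (a1 * tp + b1 * fp + c1)

pi = 0.5            # class prior P(Y = +1), assumed for illustration
tp, fp = 0.3, 0.0   # TP and FP as probabilities (expectations of 0/1-losses)
fn = pi - tp

f1_direct   = 2 * tp / (2 * tp + fp + fn)
f1_unified  = linear_fractional(tp, fp, a0=2, b0=0, c0=0, a1=1, b1=1, c1=pi)
jac_direct  = tp / (tp + fp + fn)
jac_unified = linear_fractional(tp, fp, a0=1, b0=0, c0=0, a1=0, b1=1, c1=pi)

assert abs(f1_direct - f1_unified) < 1e-12
assert abs(jac_direct - jac_unified) < 1e-12
```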

  7. Unification of Metrics
     ■ TP and FP are expectations of the 0/1-loss
       ▶ e.g. TP = ℙ(Y = +1, f(X) > 0) = 𝔼_{X,Y=+1}[1{f(X) > 0}]
     ■ Hence every linear-fractional metric is a ratio of expectations:
       U(f) = (a₀·TP + b₀·FP + c₀) / (a₁·TP + b₁·FP + c₁)
            = (a₀·𝔼_P[1{f(X) > 0}] + b₀·𝔼_N[1{f(X) > 0}] + c₀) / (a₁·𝔼_P[1{f(X) > 0}] + b₁·𝔼_N[1{f(X) > 0}] + c₁)
           =: 𝔼_X[W₀(f(X))] / 𝔼_X[W₁(f(X))]

  8. Goal of This Talk
     Given a metric (utility)  U(f) = (a₀·TP + b₀·FP + c₀) / (a₁·TP + b₁·FP + c₁)
     and an i.i.d. labeled sample  {(xᵢ, yᵢ)}ⁿᵢ₌₁ ∼ ℙ,
     find a classifier  f : 𝒳 → ℝ  s.t.  U(f) = sup_{f′} U(f′).
     Q. How to optimize U(f) directly?
        ▶ without estimating the class-posterior probability

  9. Outline
     ■ Introduction
     ■ Preliminary
       ▶ Convex Risk Minimization
       ▶ Plug-in Principle vs. Cost-sensitive Learning
     ■ Key Idea
       ▶ Quasi-concave Surrogate
     ■ Calibration Analysis & Experiments

  10. Formulation of Classification
      ■ Goal of classification: maximize accuracy = minimize the misclassification rate
        R̂(f) = (1/n) Σᵢ₌₁ⁿ 1[yᵢ ≠ sign(f(xᵢ))]
      ■ (Empirical) surrogate risk: make the 0/1-loss smoother
        R̂_φ(f) = (1/n) Σᵢ₌₁ⁿ φ(yᵢ f(xᵢ)),    m = yᵢ f(xᵢ): prediction margin
      ■ Examples of φ convex in m:
        ▶ logistic loss
        ▶ hinge loss ⇒ SVM
        ▶ exponential loss ⇒ AdaBoost
      [Figure: 0/1, logistic, and hinge losses plotted against the margin m; m > 0 means classified correctly, m < 0 classified incorrectly.]
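A minimal sketch (my own illustration, assuming a linear scorer f(x) = w·x, labels in {−1, +1}, and synthetic data) of the empirical 0/1-risk and the empirical surrogate risk with logistic and hinge losses:

```python
import numpy as np

# Sketch (not from the slides): empirical risks for a linear scorer f(x) = w.x.
def zero_one_risk(w, X, y):
    margins = y * (X @ w)                        # m_i = y_i f(x_i)
    return np.mean(margins <= 0)                 # misclassification rate

def surrogate_risk(w, X, y, phi):
    return np.mean(phi(y * (X @ w)))             # (1/n) sum_i phi(y_i f(x_i))

logistic = lambda m: np.log1p(np.exp(-m))        # logistic loss
hinge    = lambda m: np.maximum(0.0, 1.0 - m)    # hinge loss (used by SVMs)

# Toy data (assumption: synthetic, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X @ rng.normal(size=5))
w = rng.normal(size=5)
print(zero_one_risk(w, X, y), surrogate_risk(w, X, y, logistic), surrogate_risk(w, X, y, hinge))
```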

  11. 3 Actors in Risk Minimization
      ■ Classification risk (= 1 − Accuracy): what we want to minimize
        R(f) = 𝔼[ℓ_{0/1}(Y f(X))]
        ▶ the 0/1-loss indicates whether X is correctly classified by f (prediction margin Y f(X) > 0: correct, < 0: incorrect)
      ■ Surrogate risk: the surrogate loss φ, a differentiable upper bound of the 0/1-loss, makes minimization tractable
        R_φ(f) = 𝔼[φ(Y f(X))]
      ■ Empirical surrogate risk: what we actually minimize (sample approximation, M-estimation)
        R̂_φ(f) = (1/n) Σᵢ₌₁ⁿ φ(yᵢ f(xᵢ))
      [Figure: 0/1, logistic, and hinge losses plotted against the margin m = yᵢ f(xᵢ).]

  12. Convexity & Statistical Property
      ■ What we minimize: R̂_φ(f) = (1/n) Σᵢ₌₁ⁿ φ(yᵢ f(xᵢ))  (tractable, convex)  → generalizes to R_φ(f) = 𝔼[φ(Y f(X))]
      ■ What we care about: R(f) = 𝔼[ℓ_{0/1}(Y f(X))]  (intractable)
      Q. Is argmin_f R_φ(f) = argmin_f R(f)?
      A. Yes, with a calibrated surrogate.
      Theorem (informal) [Bartlett+ 2006]. Assume φ is convex. Then argmin_f R_φ(f) = argmin_f R(f) iff φ′(0) < 0.
      P. L. Bartlett, M. I. Jordan, & J. D. McAuliffe. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.
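As a quick worked check of the φ′(0) < 0 condition (my own examples, not on the slide), the common differentiable convex losses satisfy it:

```latex
% Worked check (illustration): standard convex losses satisfy phi'(0) < 0.
\begin{align*}
\text{logistic:}    \quad \phi(m) &= \log(1 + e^{-m}), & \phi'(0) &= -\tfrac{e^{-0}}{1 + e^{-0}} = -\tfrac{1}{2} < 0,\\
\text{exponential:} \quad \phi(m) &= e^{-m},           & \phi'(0) &= -e^{-0} = -1 < 0,\\
\text{squared:}     \quad \phi(m) &= (1 - m)^2,        & \phi'(0) &= -2(1 - 0) = -2 < 0.
\end{align*}
```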

  13. Related Work: Plug-in Rule
      ■ Classifier based on the class-posterior probability ℙ(Y = +1 | x)
        ▶ Bayes-optimal classifier (accuracy): Y = +1 if ℙ(Y = +1 | x) − 1/2 > 0, else Y = −1
        ▶ Bayes-optimal classifier (general case): Y = +1 if ℙ(Y = +1 | x) − δ* > 0, else Y = −1
      ⇒ estimate ℙ(Y = +1 | x) and the threshold δ* independently [Koyejo+ NIPS2014; Yan+ ICML2018]
      O. O. Koyejo, N. Natarajan, P. K. Ravikumar, & I. S. Dhillon. Consistent binary classification with generalized performance metrics. In NIPS, 2014.
      B. Yan, O. Koyejo, K. Zhong, & P. Ravikumar. Binary classification with Karmic, threshold-quasi-concave metrics. In ICML, 2018.
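A minimal sketch of this plug-in idea (my own illustration, assuming scikit-learn's LogisticRegression as the posterior estimator and a simple grid search for the threshold; the cited papers use their own estimators and consistency arguments):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Sketch (illustration only): plug-in rule = posterior estimate + tuned threshold.
# Labels are assumed to be in {-1, +1}.
def fit_plugin(X_train, y_train, X_val, y_val):
    model = LogisticRegression().fit(X_train, y_train)   # estimate P(Y=+1 | x)
    probs = model.predict_proba(X_val)[:, 1]
    deltas = np.linspace(0.05, 0.95, 19)
    # pick the threshold delta* that maximizes F1 on held-out data
    scores = [f1_score(y_val, np.where(probs >= d, 1, -1)) for d in deltas]
    return model, deltas[int(np.argmax(scores))]

def predict_plugin(model, delta, X):
    return np.where(model.predict_proba(X)[:, 1] >= delta, 1, -1)
```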

  14. Outline
      ■ Introduction
      ■ Preliminary
        ▶ Convex Risk Minimization
        ▶ Plug-in Principle vs. Cost-sensitive Learning
      ■ Key Idea
        ▶ Quasi-concave Surrogate
      ■ Calibration Analysis & Experiments

  15. Convexity & Statistical Property
      ■ For the classification risk the chain is in place:
        R̂_φ(f) = (1/n) Σᵢ₌₁ⁿ φ(yᵢ f(xᵢ))  (tractable, convex)
          → generalizes to R_φ(f) = 𝔼[φ(Y f(X))]
          → ① calibration: argmin R_φ = argmin R, with R(f) = 𝔼[ℓ_{0/1}(Y f(X))]  (intractable)
      ■ For the utility U(f) = 𝔼_X[W₀(f(X))] / 𝔼_X[W₁(f(X))]  (intractable):
      Q. ② What is a tractable & calibrated objective for U(f)?

  16. Non-concave, but Quasi-concave
      ■ Idea: concave / convex = quasi-concave
        ▶ quasi-concave ⊇ concave: non-concave but unimodal ⇒ can still be optimized efficiently
        ▶ a function is quasi-concave iff all its super-level sets are convex
      Claim. If f is concave, g is convex, and f(x) ≥ 0, g(x) > 0 for all x, then f/g is quasi-concave.
      (proof) Show that the super-level set {x | f/g ≥ α} is convex for every α ≥ 0:
        f(x)/g(x) ≥ α ⟺ f(x) − α·g(x) ≥ 0,
      and f − α·g is concave, so this set is convex.  NB: the super-level set of a concave function is convex.
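A concrete instance of the claim (my own example, not on the slide): with f(x) = x and g(x) = e^{−x} on x ≥ 0, the ratio is quasi-concave but not concave.

```latex
% Example (illustration): f(x) = x is concave and >= 0, g(x) = e^{-x} is convex and > 0 on x >= 0.
\[
  h(x) = \frac{f(x)}{g(x)} = x\,e^{x}, \qquad h''(x) = (x + 2)\,e^{x} > 0,
\]
so $h$ is not concave; yet $h$ is increasing, so every super-level set
\[
  \{\, x \ge 0 \mid h(x) \ge \alpha \,\} = [x_\alpha, \infty)
\]
is an interval, hence convex, and $h$ is quasi-concave.
```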

  17. Surrogate Utility
      ■ Idea: bound the true (linear-fractional) utility from below
        U(f) = (a₀·TP + b₀·FP + c₀) / (a₁·TP + b₁·FP + c₁)
             = (a₀·𝔼_P[1{f(X) > 0}] + b₀·𝔼_N[1{f(X) > 0}] + c₀) / (a₁·𝔼_P[1{f(X) > 0}] + b₁·𝔼_N[1{f(X) > 0}] + c₁)
             ≥ surrogate utility
        ▶ bound the numerator from below: a non-negative sum of concave terms ⇒ concave
        ▶ bound the denominator from above: a non-negative sum of convex terms ⇒ convex

  18. Surrogate Utility
      ■ Idea: bound the true utility from below with a surrogate loss φ
        U(f) ≥ U_φ(f) = (a₀·𝔼_P[1 − φ(f(X))] + b₀·𝔼_N[−φ(−f(X))] + c₀) / (a₁·𝔼_P[1 + φ(f(X))] + b₁·𝔼_N[φ(−f(X))] + c₁)
                     =: 𝔼[W_{0,φ}] / 𝔼[W_{1,φ}]
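A sketch of this surrogate utility as an empirical objective (my own rendering, not the authors' code; I use the logistic loss as φ, a linear scorer, and read 𝔼_P, 𝔼_N as class-restricted expectations carrying their class-prior weights so that they reduce to TP and FP):

```python
import numpy as np

# Sketch (illustration only): empirical surrogate utility U_phi for a linear scorer.
phi = lambda m: np.log1p(np.exp(-m))            # logistic surrogate loss

def surrogate_utility(w, X, y, a0, b0, c0, a1, b1, c1):
    scores = X @ w
    pos, neg = scores[y == 1], scores[y == -1]
    pi = np.mean(y == 1)                        # empirical class prior P(Y = +1)
    # E_P / E_N below include the class-prior weights (my reading of the slide).
    num = a0 * pi * np.mean(1 - phi(pos)) + b0 * (1 - pi) * np.mean(-phi(-neg)) + c0
    den = a1 * pi * np.mean(1 + phi(pos)) + b1 * (1 - pi) * np.mean(phi(-neg)) + c1
    return num / den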

  19. Hybrid Optimization Strategy
      ■ Note: the numerator of U_φ can be negative
        ▶ U_φ isn't quasi-concave if its numerator < 0
        ▶ maximize the numerator first (concave), then maximize the fractional form (quasi-concave)

  20. Hybrid Optimization Strategy
      ■ Stage 1: maximize the numerator
      ■ Stage 2: maximize the fraction, using normalized gradients for quasi-concave optimization [Hazan+ NeurIPS2015]
      Hazan, E., Levy, K., & Shalev-Shwartz, S. (2015). Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems (pp. 1594-1602).
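A rough sketch of the two-stage procedure (my own illustration, not the authors' implementation; it uses finite-difference gradients, whereas the actual method would use analytic gradients and the stochastic normalized gradient scheme of Hazan et al.). Here `numerator` and `utility` are assumed to be callables returning the numerator of U_φ and U_φ itself, e.g. built from `surrogate_utility` above:

```python
import numpy as np

# Rough sketch (illustration only) of the hybrid strategy.
def num_grad(fun, w, eps=1e-6):
    # Central finite-difference gradient (placeholder for analytic gradients).
    g = np.zeros_like(w, dtype=float)
    for i in range(len(w)):
        e = np.zeros_like(w, dtype=float)
        e[i] = eps
        g[i] = (fun(w + e) - fun(w - e)) / (2 * eps)
    return g

def hybrid_maximize(numerator, utility, w0, lr=0.1, steps=200):
    w = w0.astype(float)
    # Stage 1: gradient ascent on the concave numerator until it is non-negative.
    for _ in range(steps):
        if numerator(w) >= 0:
            break
        w += lr * num_grad(numerator, w)
    # Stage 2: normalized gradient ascent on the quasi-concave ratio.
    for _ in range(steps):
        g = num_grad(utility, w)
        norm = np.linalg.norm(g)
        if norm < 1e-12:
            break
        w += lr * g / norm
    return w
```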

  21. Outline
      ■ Introduction
      ■ Preliminary
        ▶ Convex Risk Minimization
        ▶ Plug-in Principle vs. Cost-sensitive Learning
      ■ Key Idea
        ▶ Quasi-concave Surrogate
      ■ Calibration Analysis & Experiments
