Learning Theory Bridges Loss Functions
  1. Learning Theory Bridges Loss Functions. July 13th, 2020. Han Bao (The University of Tokyo / RIKEN AIP)

  2. Han Bao (包 含, read "Tsutsumi Fukumu") https://hermite.jp/
■ 2nd-year Ph.D. student @ Sugiyama-Honda-Yokoya Lab
■ Research Interests: similarity learning, transfer learning, robustness and knowledge transfer via loss functions
■ Selected work:
▶ Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification. (AISTATS2020)
▶ Calibrated Surrogate Losses for Adversarially Robust Classification. (COLT2020)
▶ Calibrated Surrogate Maximization of Dice. (MICCAI2020)
▶ Similarity-based Classification: Connecting Similarity Learning to Binary Classification. (preprint)
▶ Unsupervised Domain Adaptation Based on Source-guided Discrepancy. (AAAI2019)
▶ Classification from Pairwise Similarity and Unlabeled Data. (ICML2018)

  3. Deep learning in practice: cross-entropy + softmax (figure from https://devblogs.nvidia.com/mocha-jl-deep-learning-julia/image1/)

  4. Training: a neural network with a softmax output maps a traffic feature (x) to a prediction; training minimizes the distance between the label (y, e.g. "light traffic") and the prediction via the cross-entropy. Prediction: given a new feature (x), the trained network with softmax answers "light traffic?".

  5. Training: the neural network with softmax minimizes the cross-entropy $-\sum_i y_i \log z_i$. Evaluation: the misclassification rate, i.e. the average of $\mathbf{1}[y \neq z]$.
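A minimal sketch (my own illustration, not the slides' code) contrasting the training criterion (softmax cross-entropy) with the evaluation criterion (misclassification rate) on toy network outputs:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # mean of -log p(correct class): what training minimizes
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def misclassification_rate(probs, labels):
    # 0-1 evaluation criterion: fraction of wrong argmax predictions
    return np.mean(probs.argmax(axis=1) != labels)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))       # toy network outputs: 5 samples, 3 classes
labels = rng.integers(0, 3, size=5)    # toy ground-truth labels
p = softmax(logits)
print(cross_entropy(p, labels), misclassification_rate(p, labels))
```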

  6. SVM: margin maximization = hinge loss minimization, $\min_{w,b} \sum_i \max\{0,\, 1 - y_i (w^\top x_i + b)\}$, used in place of the misclassification rate.
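A small sketch (assumed example, not from the slides) of the linear-SVM hinge objective above and plain subgradient descent on it:

```python
import numpy as np

def hinge_objective(w, b, X, y):
    margins = y * (X @ w + b)
    return np.sum(np.maximum(0.0, 1.0 - margins))

def hinge_subgradient(w, b, X, y):
    margins = y * (X @ w + b)
    active = margins < 1.0                            # points with nonzero hinge loss
    gw = -(y[active, None] * X[active]).sum(axis=0)   # subgradient w.r.t. w
    gb = -y[active].sum()                             # subgradient w.r.t. b
    return gw, gb

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=20))      # toy labels in {-1, +1}
w, b = np.zeros(2), 0.0
for _ in range(100):                                  # plain subgradient descent
    gw, gb = hinge_subgradient(w, b, X, y)
    w, b = w - 0.01 * gw, b - 0.01 * gb
print(hinge_objective(w, b, X, y))
```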

  7. Deep learning: a neural network + softmax classifier trained with the cross-entropy. SVM: a classifier trained with the hinge loss. In both cases, learning = minimizing a loss, yet the evaluation criterion is the misclassification rate. Does it work?

  8. Background: Binary Classification
■ Input: a sample of feature-label pairs $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathcal{X}$ and $y_i \in \{\pm 1\}$
■ Output: a classifier $f : \mathcal{X} \to \mathbb{R}$; the class is predicted by $\mathrm{sign}(f(\cdot))$
■ Criterion: the misclassification rate $R_{01}(f) = \mathbb{E}\left[\mathbf{1}[Y \neq \mathrm{sign}(f(X))]\right]$, where the indicator is 1 if $Y \neq \mathrm{sign}(f(X))$ and 0 if $Y = \mathrm{sign}(f(X))$

  9. Loss Function and Risk
■ Goal of classification: minimize the misclassification rate $R_{01}(f) = \mathbb{E}\left[\mathbf{1}[Y \neq \mathrm{sign}(f(X))]\right]$
■ Misclassification rate = expectation of the 0-1 loss: $\mathbf{1}[Y \neq \mathrm{sign}(f(X))] = \phi_{01}(Yf(X))$, where $\phi_{01}$ is 1 when the prediction is wrong ($Y \neq \mathrm{sign}(f(X))$) and 0 when it is correct ($Y = \mathrm{sign}(f(X))$)
■ Minimizing $R_{01}$ by gradient descent? $\phi_{01}$ is a discrete function with no useful gradient, and 0-1 loss minimization is NP-hard [Feldman+ 2012]
Feldman, V., Guruswami, V., Raghavendra, P., & Wu, Y. (2012). Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6), 1558-1590.
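A small check (my own illustration) that the 0-1 loss of the margin, $\phi_{01}(yf(x)) = \mathbf{1}[yf(x) \leq 0]$, coincides with the error indicator $\mathbf{1}[y \neq \mathrm{sign}(f(x))]$:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)                 # f(x)
y = rng.choice([-1, 1], size=1000)             # labels
phi01 = (y * scores <= 0).astype(float)        # 0-1 loss of the margin y * f(x)
error = (y != np.sign(scores)).astype(float)   # misclassification indicator
print(np.all(phi01 == error))                  # True (ties at f(x) = 0 count as errors)
```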

  10. Target Loss vs. Surrogate Loss
■ Target loss (0-1 loss $\phi_{01}$): the final learning criterion; 1 when wrong, 0 when correct; hard to optimize (nonconvex, no gradient)
■ Surrogate loss $\phi$: different from the target loss, but an easily optimizable criterion (usually convex and smooth)
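A sketch (assumed example) comparing the 0-1 target loss with common convex surrogates as functions of the margin $m = yf(x)$:

```python
import numpy as np

m = np.linspace(-2, 2, 9)                 # margin values
zero_one = (m <= 0).astype(float)         # target loss: nonconvex, no gradient
hinge    = np.maximum(0.0, 1.0 - m)       # convex, piecewise linear
logistic = np.log1p(np.exp(-m))           # convex, smooth
squared  = (1.0 - m) ** 2                 # convex, smooth
for row in zip(m, zero_one, hinge, logistic, squared):
    print("m=%+.1f  0-1=%.0f  hinge=%.2f  logistic=%.2f  squared=%.2f" % row)
```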

  11. Elements of Learning Theory
■ Empirical surrogate risk: $\hat{R}_\phi(f) = \frac{1}{n}\sum_{i=1}^n \phi(y_i f(x_i))$
■ (Population) surrogate risk: $R_\phi(f) = \mathbb{E}[\phi(Yf(X))]$
■ (Population) target risk: $R_{01}(f) = \mathbb{E}[\phi_{01}(Yf(X))]$
■ Generalization theory: if the model is not too complicated, the empirical surrogate risk converges to the population surrogate risk (roughly speaking)
■ Key ingredient here: calibration theory for loss functions, which relates the surrogate risk to the target risk

  12. What Surrogate Is Desirable?
■ Target loss (0-1 loss): the final learning criterion, with target risk $R_{01}(f)$ and optimal value $R_{01}^*$
■ Surrogate loss $\phi$: easily optimizable, with surrogate risk $R_\phi(f)$ and optimal value $R_\phi^*$
■ Calibrated surrogate: for any sequence $f_m$, $R_\phi(f_m) \to R_\phi^*$ as $m \to \infty$ implies $R_{01}(f_m) \to R_{01}^*$

  13. How to Check Risk Convergence?
Idea: write the surrogate (excess) risk as a function of the target (excess) risk, by using contraposition [Steinwart 2007].
Definition (calibration function). $\delta(\varepsilon) = \inf_f \left\{ R_\phi(f) - R_\phi^* \;\middle|\; R_\psi(f) - R_\psi^* \geq \varepsilon \right\}$
Definition. $\phi$ is $\psi$-calibrated for a target loss $\psi$ if for any $\varepsilon > 0$ there exists $\delta > 0$ such that for all $f$, $R_\phi(f) - R_\phi^* < \delta \implies R_\psi(f) - R_\psi^* < \varepsilon$.
If $\delta(\varepsilon) > 0$ for all $\varepsilon > 0$, the surrogate is calibrated!
Steinwart, I. (2007). How to compare different loss functions and their risks. Constructive Approximation, 26(2), 225-287.
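To spell out the contraposition behind this claim (an added clarifying step, not on the slide):

```latex
% Take any f with R_\phi(f) - R_\phi^* < \delta(\varepsilon). It cannot satisfy
% R_\psi(f) - R_\psi^* \ge \varepsilon, since every such f is feasible for the
% infimum defining \delta(\varepsilon) and would give R_\phi(f) - R_\phi^* \ge \delta(\varepsilon):
R_\phi(f) - R_\phi^* < \delta(\varepsilon)
  \;\Longrightarrow\;
  R_\psi(f) - R_\psi^* < \varepsilon .
% Hence, whenever \delta(\varepsilon) > 0 for all \varepsilon > 0, the choice
% \delta := \delta(\varepsilon) witnesses that \phi is \psi-calibrated.
```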

  14. Main Tool: Calibration Function
$\delta(\varepsilon) = \inf_f \left\{ R_\phi(f) - R_\phi^* \;\middle|\; R_\psi(f) - R_\psi^* \geq \varepsilon \right\}$ (surrogate excess risk as a function of the target excess risk)
■ Provides an iff condition: $\phi$ is $\psi$-calibrated $\iff$ $\delta(\varepsilon) > 0$ for all $\varepsilon > 0$
■ Provides an excess risk bound: $\phi$ is $\psi$-calibrated $\implies$ $R_\psi(f) - R_\psi^* \leq (\delta^{**})^{-1}\!\left(R_\phi(f) - R_\phi^*\right)$, where $\delta^{**}$ is the (monotonically increasing) biconjugate of $\delta$
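As a concrete reading of the bound, plugging in the calibration functions reported on the next slide (hinge: $\delta(\varepsilon) = \varepsilon$; squared: $\delta(\varepsilon) = \varepsilon^2$; both already convex, so $\delta^{**} = \delta$) gives the standard bounds of Bartlett et al. (2006), not an additional result:

```latex
% Hinge loss (\delta(\varepsilon) = \varepsilon):
R_{01}(f) - R_{01}^* \;\le\; R_\phi(f) - R_\phi^*
% Squared loss (\delta(\varepsilon) = \varepsilon^2):
R_{01}(f) - R_{01}^* \;\le\; \sqrt{R_\phi(f) - R_\phi^*}
```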

  15. Example: Binary Classification
Theorem [Bartlett+ 2006]. If a surrogate $\phi$ is convex, it is $\phi_{01}$-calibrated iff $\phi$ is differentiable at 0 and $\phi'(0) < 0$.
▶ Squared loss: $\phi(\alpha) = (1-\alpha)^2$, with calibration function $\delta(\varepsilon) = \varepsilon^2$
▶ Hinge loss: $\phi(\alpha) = [1-\alpha]_+$, with calibration function $\delta(\varepsilon) = \varepsilon$
P. L. Bartlett, M. I. Jordan, & J. D. McAuliffe. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.
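A numerical sanity check of these two calibration functions (a sketch using the standard pointwise/conditional-risk reduction, not the slides' own derivation): for $C_\phi(\eta, \alpha) = \eta\,\phi(\alpha) + (1-\eta)\,\phi(-\alpha)$, the infimum runs over class-posterior values $\eta$ whose conditional 0-1 excess risk $|2\eta - 1|$ is at least $\varepsilon$ and over scores $\alpha$ on the non-Bayes side.

```python
import numpy as np

def conditional_risk(phi, eta, alphas):
    return eta * phi(alphas) + (1.0 - eta) * phi(-alphas)

def calibration_function(phi, eps,
                         etas=np.linspace(0, 1, 201),
                         alphas=np.linspace(-3, 3, 1201)):
    best = np.inf
    for eta in etas:
        if abs(2 * eta - 1) < eps:               # conditional 0-1 excess risk < eps
            continue
        risks = conditional_risk(phi, eta, alphas)
        wrong = alphas * (2 * eta - 1) <= 0      # scores on the non-Bayes side
        best = min(best, risks[wrong].min() - risks.min())
    return best

hinge = lambda a: np.maximum(0.0, 1.0 - a)
squared = lambda a: (1.0 - a) ** 2
for eps in (0.2, 0.5, 0.8):
    print(eps,
          calibration_function(hinge, eps),      # approximately eps
          calibration_function(squared, eps))    # approximately eps ** 2
```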

  16. Counterintuitive Result
■ e.g. multi-class classification with $f : \mathcal{X} \to \mathbb{R}^3$: feature $x$, prediction score $f(x)$, and the prediction margin of the correct class is maximized
■ Crammer-Singer loss [Crammer & Singer 2001], one of the multi-class extensions of the hinge loss: $\max\{0,\, 1 - \text{prediction margin}\}$
■ The Crammer-Singer loss is not calibrated to the 0-1 loss! [Zhang 2004] (a similar extension of the logistic loss is calibrated)
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec), 265-292.
Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct), 1225-1251.
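A small sketch (my own illustration) of the Crammer-Singer multiclass hinge loss, $\max\{0,\, 1 - (f_y(x) - \max_{k \neq y} f_k(x))\}$, i.e. a hinge on the prediction margin:

```python
import numpy as np

def crammer_singer_loss(scores, y):
    # scores: (n_classes,) prediction scores f(x); y: index of the correct class
    others = np.delete(scores, y)
    margin = scores[y] - others.max()            # prediction margin
    return max(0.0, 1.0 - margin)

print(crammer_singer_loss(np.array([2.0, 0.5, -1.0]), y=0))  # margin 1.5 -> loss 0.0
print(crammer_singer_loss(np.array([0.2, 0.5, -1.0]), y=0))  # margin -0.3 -> loss 1.3
```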

  17. Summary: Calibration Theory
■ Surrogate vs. target loss: the target loss is often hard to optimize ⇒ replace it with a surrogate loss $\phi(Yf(X))$
■ Calibrated surrogate: minimizing the surrogate risk $R_\phi(f)$ leads to minimization of the target risk $R_\psi(f)$, a stringent justification of the surrogate loss!
■ Binary classification: hinge and logistic losses are calibrated (a convex surrogate is calibrated iff $\phi'(0) < 0$)
■ Multi-class classification (details omitted): cross-entropy is calibrated, but the CS-loss (multi-class hinge loss) is not calibrated!

  18. When the Target Is Not the 0-1 Loss
H. Bao and M. Sugiyama. Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification. In AISTATS, 2020.

  19. Is Accuracy Appropriate?
■ Our focus: binary classification
■ Two classifiers can both reach accuracy 0.8: a seemingly sensible one, and an unreasonable one (e.g. one that never predicts positive on data with few positives)
■ Accuracy can't detect unreasonable classifiers under class imbalance!

  20. Is Accuracy Appropriate?
■ The F-measure is more appropriate under class imbalance: the seemingly sensible classifier has accuracy 0.8 and F-measure 0.75, while the unreasonable classifier has accuracy 0.8 but F-measure 0
■ $\mathsf{F}_1 = \dfrac{2\,\mathsf{TP}}{2\,\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}$, where $\mathsf{TP} = \mathbb{E}_{X, Y=+1}[\mathbf{1}\{f(X)>0\}]$, $\mathsf{TN} = \mathbb{E}_{X, Y=-1}[\mathbf{1}\{f(X)<0\}]$, $\mathsf{FP} = \mathbb{E}_{X, Y=-1}[\mathbf{1}\{f(X)>0\}]$, $\mathsf{FN} = \mathbb{E}_{X, Y=+1}[\mathbf{1}\{f(X)<0\}]$
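A sketch reproducing the slides' accuracy-vs-F-measure comparison; the exact confusion counts below are my reconstruction, chosen to be consistent with the reported numbers (accuracy 0.8 in both cases, F-measure 0.75 vs. 0):

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f_measure(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# seemingly sensible classifier: 5 positives / 5 negatives
print(accuracy(tp=3, fp=0, fn=2, tn=5), f_measure(tp=3, fp=0, fn=2, tn=5))  # 0.8, 0.75
# unreasonable classifier: predicts everything negative on 2 positives / 8 negatives
print(accuracy(tp=0, fp=0, fn=2, tn=8), f_measure(tp=0, fp=0, fn=2, tn=8))  # 0.8, 0.0
```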

  21. Training and Evaluation
■ Usual training with accuracy: training minimizes a surrogate risk that is calibrated to the 0-1 risk, and evaluation is the 0-1 risk ⇒ compatible
■ Training with accuracy but evaluating with the F-measure: the surrogate risk is calibrated to the 0-1 risk, yet evaluation uses the F-measure ⇒ compatible??? What we want is a surrogate utility calibrated to the F-measure.

  22. Not Only F₁, but Many Others
■ Accuracy: $\mathsf{Acc} = \mathsf{TP} + \mathsf{TN}$
■ Weighted Accuracy: $\mathsf{WAcc} = \dfrac{w_1 \mathsf{TP} + w_2 \mathsf{TN}}{w_1 \mathsf{TP} + w_2 \mathsf{TN} + w_3 \mathsf{FP} + w_4 \mathsf{FN}}$
■ F-measure: $\mathsf{F}_1 = \dfrac{2\,\mathsf{TP}}{2\,\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}$
■ Jaccard index: $\mathsf{Jac} = \dfrac{\mathsf{TP}}{\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}$
■ Gower-Legendre index: $\mathsf{GLI} = \dfrac{\mathsf{TP} + \mathsf{TN}}{\mathsf{TP} + \alpha(\mathsf{FP} + \mathsf{FN}) + \mathsf{TN}}$
■ Balanced Error Rate: $\mathsf{BER} = \dfrac{1}{2}\left(\dfrac{\mathsf{FN}}{\pi} + \dfrac{\mathsf{FP}}{1 - \pi}\right)$, where $\pi = \mathbb{P}(Y = +1)$
Q. Can we handle all of them in the same way?
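A sketch (assumed helper, not from the slides) evaluating the listed metrics from the confusion rates TP, FP, FN, TN (probabilities summing to 1), with $\pi = \mathbb{P}(Y=+1) = \mathsf{TP} + \mathsf{FN}$:

```python
def metrics(tp, fp, fn, tn, w=(1, 1, 1, 1), alpha=0.5):
    pi = tp + fn                                   # class prior of the positive class
    return {
        "Acc":  tp + tn,
        "WAcc": (w[0]*tp + w[1]*tn) / (w[0]*tp + w[1]*tn + w[2]*fp + w[3]*fn),
        "F1":   2*tp / (2*tp + fp + fn),
        "Jac":  tp / (tp + fp + fn),
        "GLI":  (tp + tn) / (tp + alpha*(fp + fn) + tn),
        "BER":  0.5 * (fn / pi + fp / (1 - pi)),
    }

print(metrics(tp=0.30, fp=0.05, fn=0.10, tn=0.55))
```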

  23. Unification of Metrics
Actual metrics are linear-fractional in $(\mathsf{TP}, \mathsf{FP})$:
$U(f) = \dfrac{a_0 \mathsf{TP} + b_0 \mathsf{FP} + c_0}{a_1 \mathsf{TP} + b_1 \mathsf{FP} + c_1}$, where $a_k, b_k, c_k$ are constants.
Note: $\mathsf{TN} = \mathbb{P}(Y = -1) - \mathsf{FP}$ and $\mathsf{FN} = \mathbb{P}(Y = +1) - \mathsf{TP}$, so e.g. $\mathsf{F}_1 = \dfrac{2\,\mathsf{TP}}{2\,\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}$ and $\mathsf{Jac} = \dfrac{\mathsf{TP}}{\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}$ fit this form (a worked instance follows below).
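Worked instance (following the slide's note, not an additional result): substituting $\mathsf{FN} = \pi - \mathsf{TP}$ shows the F-measure is linear-fractional in $(\mathsf{TP}, \mathsf{FP})$.

```latex
\mathsf{F}_1
  = \frac{2\,\mathsf{TP}}{2\,\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}
  = \frac{2\,\mathsf{TP}}{2\,\mathsf{TP} + \mathsf{FP} + (\pi - \mathsf{TP})}
  = \frac{2\,\mathsf{TP} + 0 \cdot \mathsf{FP} + 0}{1 \cdot \mathsf{TP} + 1 \cdot \mathsf{FP} + \pi},
\qquad \pi := \mathbb{P}(Y = +1),
% i.e. U(f) with (a_0, b_0, c_0) = (2, 0, 0) and (a_1, b_1, c_1) = (1, 1, \pi).
```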

  24. Unification of Metrics
■ TP and FP are expectations of 0/1-losses:
$\mathsf{TP} = \mathbb{E}_{X, Y=+1}[\mathbf{1}[f(X) > 0]]$ ▶ positive data && positive prediction
$\mathsf{FP} = \mathbb{E}_{X, Y=-1}[\mathbf{1}[f(X) > 0]]$ ▶ negative data && positive prediction
■ Hence the numerator and the denominator of $U(f) = \dfrac{a_0 \mathsf{TP} + b_0 \mathsf{FP} + c_0}{a_1 \mathsf{TP} + b_1 \mathsf{FP} + c_1}$ are each a single expectation, $a_k \mathsf{TP} + b_k \mathsf{FP} + c_k = \mathbb{E}_X[W_k(f(X))]$, so the linear-fractional utility is an expectation divided by an expectation.
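A sketch (assumed example, not the paper's estimator) of plugging empirical expectations into the linear-fractional utility; with $(a_0, b_0, c_0) = (2, 0, 0)$ and $(a_1, b_1, c_1) = (1, 1, \pi)$ this estimates the F-measure:

```python
import numpy as np

def utility_estimate(scores, y, a0, b0, c0, a1, b1, c1):
    tp = np.mean((y == +1) & (scores > 0))   # empirical TP = E[1{Y = +1, f(X) > 0}]
    fp = np.mean((y == -1) & (scores > 0))   # empirical FP = E[1{Y = -1, f(X) > 0}]
    return (a0 * tp + b0 * fp + c0) / (a1 * tp + b1 * fp + c1)

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=1000, p=[0.8, 0.2])     # imbalanced toy labels
scores = y * 0.5 + rng.normal(size=1000)             # toy classifier scores f(x)
pi = np.mean(y == 1)                                 # empirical class prior
print(utility_estimate(scores, y, a0=2, b0=0, c0=0, a1=1, b1=1, c1=pi))  # F-measure
```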
