Learning Theory Bridges Loss Functions
  1. Learning Theory Bridges Loss Functions. July 13th, 2020. Han Bao (The University of Tokyo / RIKEN AIP)

  2. Han Bao (包 含, read "Tsutsumi Fukumu") https://hermite.jp/
■ 2nd-year Ph.D. student @ Sugiyama-Honda-Yokoya Lab
■ Research Interests: similarity learning, transfer learning, robustness and knowledge transfer via loss functions
■ Selected work:
▶ Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification. (AISTATS2020)
▶ Calibrated Surrogate Losses for Adversarially Robust Classification. (COLT2020)
▶ Calibrated Surrogate Maximization of Dice. (MICCAI2020)
▶ Similarity-based Classification: Connecting Similarity Learning to Binary Classification. (preprint)
▶ Unsupervised Domain Adaptation Based on Source-guided Discrepancy. (AAAI2019)
▶ Classification from Pairwise Similarity and Unlabeled Data. (ICML2018)

  3. Deep learning in practice: cross-entropy + softmax (figure from https://devblogs.nvidia.com/mocha-jl-deep-learning-julia/image1/)

  4. Training: a neural network with a softmax output maps a traffic feature (x) to a prediction; training minimizes the distance between the label (y, e.g. "light traffic") and the prediction via the cross-entropy. Prediction: given a new feature (x), the trained network with softmax answers "light traffic?".

  5. Training: the neural network with softmax minimizes the cross-entropy $-\sum_i y_i \log z_i$. Evaluation: the misclassification rate, i.e. the average of $\mathbf{1}[y \neq z]$.
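A minimal sketch (my own illustration, not the slides' code) contrasting the training criterion (softmax cross-entropy) with the evaluation criterion (misclassification rate) on toy network outputs:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # mean of -log p(correct class): what training minimizes
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def misclassification_rate(probs, labels):
    # 0-1 evaluation criterion: fraction of wrong argmax predictions
    return np.mean(probs.argmax(axis=1) != labels)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))       # toy network outputs: 5 samples, 3 classes
labels = rng.integers(0, 3, size=5)    # toy ground-truth labels
p = softmax(logits)
print(cross_entropy(p, labels), misclassification_rate(p, labels))
```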

  6. SVM: margin maximization = hinge loss minimization, $\min_{w,b} \sum_i \max\{0,\, 1 - y_i (w^\top x_i + b)\}$, used in place of the misclassification rate.
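A small sketch (assumed example, not from the slides) of the linear-SVM hinge objective above and plain subgradient descent on it:

```python
import numpy as np

def hinge_objective(w, b, X, y):
    margins = y * (X @ w + b)
    return np.sum(np.maximum(0.0, 1.0 - margins))

def hinge_subgradient(w, b, X, y):
    margins = y * (X @ w + b)
    active = margins < 1.0                            # points with nonzero hinge loss
    gw = -(y[active, None] * X[active]).sum(axis=0)   # subgradient w.r.t. w
    gb = -y[active].sum()                             # subgradient w.r.t. b
    return gw, gb

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=20))      # toy labels in {-1, +1}
w, b = np.zeros(2), 0.0
for _ in range(100):                                  # plain subgradient descent
    gw, gb = hinge_subgradient(w, b, X, y)
    w, b = w - 0.01 * gw, b - 0.01 * gb
print(hinge_objective(w, b, X, y))
```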

  7. Deep learning: a neural network + softmax classifier trained with the cross-entropy. SVM: a classifier trained with the hinge loss. In both cases, learning = minimizing a loss, yet the evaluation criterion is the misclassification rate. Does it work?

  8. Background: Binary Classification
■ Input: a sample of feature-label pairs $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathcal{X}$ and $y_i \in \{\pm 1\}$
■ Output: a classifier $f : \mathcal{X} \to \mathbb{R}$; the class is predicted by $\mathrm{sign}(f(\cdot))$
■ Criterion: the misclassification rate $R_{01}(f) = \mathbb{E}\left[\mathbf{1}[Y \neq \mathrm{sign}(f(X))]\right]$, where the indicator is 1 if $Y \neq \mathrm{sign}(f(X))$ and 0 if $Y = \mathrm{sign}(f(X))$

  9. Loss Function and Risk
■ Goal of classification: minimize the misclassification rate $R_{01}(f) = \mathbb{E}\left[\mathbf{1}[Y \neq \mathrm{sign}(f(X))]\right]$
■ Misclassification rate = expectation of the 0-1 loss: $\mathbf{1}[Y \neq \mathrm{sign}(f(X))] = \phi_{01}(Yf(X))$, where $\phi_{01}$ is 1 when the prediction is wrong ($Y \neq \mathrm{sign}(f(X))$) and 0 when it is correct ($Y = \mathrm{sign}(f(X))$)
■ Minimizing $R_{01}$ by gradient descent? $\phi_{01}$ is a discrete function with no useful gradient, and 0-1 loss minimization is NP-hard [Feldman+ 2012]
Feldman, V., Guruswami, V., Raghavendra, P., & Wu, Y. (2012). Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6), 1558-1590.
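A small check (my own illustration) that the 0-1 loss of the margin, $\phi_{01}(yf(x)) = \mathbf{1}[yf(x) \leq 0]$, coincides with the error indicator $\mathbf{1}[y \neq \mathrm{sign}(f(x))]$:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)                 # f(x)
y = rng.choice([-1, 1], size=1000)             # labels
phi01 = (y * scores <= 0).astype(float)        # 0-1 loss of the margin y * f(x)
error = (y != np.sign(scores)).astype(float)   # misclassification indicator
print(np.all(phi01 == error))                  # True (ties at f(x) = 0 count as errors)
```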

  10. Target Loss vs. Surrogate Loss
■ Target loss (0-1 loss $\phi_{01}$): the final learning criterion; 1 when wrong, 0 when correct; hard to optimize (nonconvex, no gradient)
■ Surrogate loss $\phi$: different from the target loss, but an easily optimizable criterion (usually convex and smooth)
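A sketch (assumed example) comparing the 0-1 target loss with common convex surrogates as functions of the margin $m = yf(x)$:

```python
import numpy as np

m = np.linspace(-2, 2, 9)                 # margin values
zero_one = (m <= 0).astype(float)         # target loss: nonconvex, no gradient
hinge    = np.maximum(0.0, 1.0 - m)       # convex, piecewise linear
logistic = np.log1p(np.exp(-m))           # convex, smooth
squared  = (1.0 - m) ** 2                 # convex, smooth
for row in zip(m, zero_one, hinge, logistic, squared):
    print("m=%+.1f  0-1=%.0f  hinge=%.2f  logistic=%.2f  squared=%.2f" % row)
```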

  11. Elements of Learning Theory
■ Empirical surrogate risk: $\hat{R}_\phi(f) = \frac{1}{n}\sum_{i=1}^n \phi(y_i f(x_i))$
■ (Population) surrogate risk: $R_\phi(f) = \mathbb{E}[\phi(Yf(X))]$
■ (Population) target risk: $R_{01}(f) = \mathbb{E}[\phi_{01}(Yf(X))]$
■ Generalization theory: if the model is not too complicated, the empirical surrogate risk converges to the population surrogate risk (roughly speaking)
■ Key ingredient here: calibration theory for loss functions, which relates the surrogate risk to the target risk

  12. What Surrogate Is Desirable?
■ Target loss (0-1 loss): the final learning criterion, with target risk $R_{01}(f)$ and optimal value $R_{01}^*$
■ Surrogate loss $\phi$: easily optimizable, with surrogate risk $R_\phi(f)$ and optimal value $R_\phi^*$
■ Calibrated surrogate: for any sequence $f_m$, $R_\phi(f_m) \to R_\phi^*$ as $m \to \infty$ implies $R_{01}(f_m) \to R_{01}^*$

  13. How to Check Risk Convergence?
Idea: write the surrogate (excess) risk as a function of the target (excess) risk, by using contraposition [Steinwart 2007].
Definition (calibration function). $\delta(\varepsilon) = \inf_f \left\{ R_\phi(f) - R_\phi^* \;\middle|\; R_\psi(f) - R_\psi^* \geq \varepsilon \right\}$
Definition. $\phi$ is $\psi$-calibrated for a target loss $\psi$ if for any $\varepsilon > 0$ there exists $\delta > 0$ such that for all $f$, $R_\phi(f) - R_\phi^* < \delta \implies R_\psi(f) - R_\psi^* < \varepsilon$.
If $\delta(\varepsilon) > 0$ for all $\varepsilon > 0$, the surrogate is calibrated!
Steinwart, I. (2007). How to compare different loss functions and their risks. Constructive Approximation, 26(2), 225-287.
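To spell out the contraposition behind this claim (an added clarifying step, not on the slide):

```latex
% Take any f with R_\phi(f) - R_\phi^* < \delta(\varepsilon). It cannot satisfy
% R_\psi(f) - R_\psi^* \ge \varepsilon, since every such f is feasible for the
% infimum defining \delta(\varepsilon) and would give R_\phi(f) - R_\phi^* \ge \delta(\varepsilon):
R_\phi(f) - R_\phi^* < \delta(\varepsilon)
  \;\Longrightarrow\;
  R_\psi(f) - R_\psi^* < \varepsilon .
% Hence, whenever \delta(\varepsilon) > 0 for all \varepsilon > 0, the choice
% \delta := \delta(\varepsilon) witnesses that \phi is \psi-calibrated.
```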

  14. Main Tool: Calibration Function
$\delta(\varepsilon) = \inf_f \left\{ R_\phi(f) - R_\phi^* \;\middle|\; R_\psi(f) - R_\psi^* \geq \varepsilon \right\}$ (surrogate excess risk as a function of the target excess risk)
■ Provides an iff condition: $\phi$ is $\psi$-calibrated $\iff$ $\delta(\varepsilon) > 0$ for all $\varepsilon > 0$
■ Provides an excess risk bound: $\phi$ is $\psi$-calibrated $\implies$ $R_\psi(f) - R_\psi^* \leq (\delta^{**})^{-1}\!\left(R_\phi(f) - R_\phi^*\right)$, where $\delta^{**}$ is the (monotonically increasing) biconjugate of $\delta$
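As a concrete reading of the bound, plugging in the calibration functions reported on the next slide (hinge: $\delta(\varepsilon) = \varepsilon$; squared: $\delta(\varepsilon) = \varepsilon^2$; both already convex, so $\delta^{**} = \delta$) gives the standard bounds of Bartlett et al. (2006), not an additional result:

```latex
% Hinge loss (\delta(\varepsilon) = \varepsilon):
R_{01}(f) - R_{01}^* \;\le\; R_\phi(f) - R_\phi^*
% Squared loss (\delta(\varepsilon) = \varepsilon^2):
R_{01}(f) - R_{01}^* \;\le\; \sqrt{R_\phi(f) - R_\phi^*}
```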

  15. Example: Binary Classification
Theorem [Bartlett+ 2006]. If a surrogate $\phi$ is convex, it is $\phi_{01}$-calibrated iff $\phi$ is differentiable at 0 and $\phi'(0) < 0$.
▶ Squared loss: $\phi(\alpha) = (1-\alpha)^2$, with calibration function $\delta(\varepsilon) = \varepsilon^2$
▶ Hinge loss: $\phi(\alpha) = [1-\alpha]_+$, with calibration function $\delta(\varepsilon) = \varepsilon$
P. L. Bartlett, M. I. Jordan, & J. D. McAuliffe. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.
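A numerical sanity check of these two calibration functions (a sketch using the standard pointwise/conditional-risk reduction, not the slides' own derivation): for $C_\phi(\eta, \alpha) = \eta\,\phi(\alpha) + (1-\eta)\,\phi(-\alpha)$, the infimum runs over class-posterior values $\eta$ whose conditional 0-1 excess risk $|2\eta - 1|$ is at least $\varepsilon$ and over scores $\alpha$ on the non-Bayes side.

```python
import numpy as np

def conditional_risk(phi, eta, alphas):
    return eta * phi(alphas) + (1.0 - eta) * phi(-alphas)

def calibration_function(phi, eps,
                         etas=np.linspace(0, 1, 201),
                         alphas=np.linspace(-3, 3, 1201)):
    best = np.inf
    for eta in etas:
        if abs(2 * eta - 1) < eps:               # conditional 0-1 excess risk < eps
            continue
        risks = conditional_risk(phi, eta, alphas)
        wrong = alphas * (2 * eta - 1) <= 0      # scores on the non-Bayes side
        best = min(best, risks[wrong].min() - risks.min())
    return best

hinge = lambda a: np.maximum(0.0, 1.0 - a)
squared = lambda a: (1.0 - a) ** 2
for eps in (0.2, 0.5, 0.8):
    print(eps,
          calibration_function(hinge, eps),      # approximately eps
          calibration_function(squared, eps))    # approximately eps ** 2
```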

  16. Counterintuitive Result
■ e.g. multi-class classification with $f : \mathcal{X} \to \mathbb{R}^3$: feature $x$, prediction score $f(x)$, and the prediction margin of the correct class is maximized
■ Crammer-Singer loss [Crammer & Singer 2001], one of the multi-class extensions of the hinge loss: $\max\{0,\, 1 - \text{prediction margin}\}$
■ The Crammer-Singer loss is not calibrated to the 0-1 loss! [Zhang 2004] (a similar extension of the logistic loss is calibrated)
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec), 265-292.
Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct), 1225-1251.
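A small sketch (my own illustration) of the Crammer-Singer multiclass hinge loss, $\max\{0,\, 1 - (f_y(x) - \max_{k \neq y} f_k(x))\}$, i.e. a hinge on the prediction margin:

```python
import numpy as np

def crammer_singer_loss(scores, y):
    # scores: (n_classes,) prediction scores f(x); y: index of the correct class
    others = np.delete(scores, y)
    margin = scores[y] - others.max()            # prediction margin
    return max(0.0, 1.0 - margin)

print(crammer_singer_loss(np.array([2.0, 0.5, -1.0]), y=0))  # margin 1.5 -> loss 0.0
print(crammer_singer_loss(np.array([0.2, 0.5, -1.0]), y=0))  # margin -0.3 -> loss 1.3
```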

  17. Summary: Calibration Theory
■ Surrogate vs. target loss: the target loss is often hard to optimize ⇒ replace it with a surrogate loss $\phi(Yf(X))$
■ Calibrated surrogate: minimizing the surrogate risk $R_\phi(f)$ leads to minimization of the target risk $R_\psi(f)$, a stringent justification of the surrogate loss!
■ Binary classification: hinge and logistic losses are calibrated (a convex surrogate is calibrated iff $\phi'(0) < 0$)
■ Multi-class classification (details omitted): cross-entropy is calibrated, but the CS-loss (multi-class hinge loss) is not calibrated!

  18. When the Target Is Not the 0-1 Loss
H. Bao and M. Sugiyama. Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification. In AISTATS, 2020.

  19. Is Accuracy Appropriate?
■ Our focus: binary classification
■ Two classifiers can both reach accuracy 0.8: a seemingly sensible one, and an unreasonable one (e.g. one that never predicts positive on data with few positives)
■ Accuracy can't detect unreasonable classifiers under class imbalance!

  20. Is Accuracy Appropriate?
■ The F-measure is more appropriate under class imbalance: the seemingly sensible classifier has accuracy 0.8 and F-measure 0.75, while the unreasonable classifier has accuracy 0.8 but F-measure 0
■ $\mathsf{F}_1 = \dfrac{2\,\mathsf{TP}}{2\,\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}$, where $\mathsf{TP} = \mathbb{E}_{X, Y=+1}[\mathbf{1}\{f(X)>0\}]$, $\mathsf{TN} = \mathbb{E}_{X, Y=-1}[\mathbf{1}\{f(X)<0\}]$, $\mathsf{FP} = \mathbb{E}_{X, Y=-1}[\mathbf{1}\{f(X)>0\}]$, $\mathsf{FN} = \mathbb{E}_{X, Y=+1}[\mathbf{1}\{f(X)<0\}]$
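A sketch reproducing the slides' accuracy-vs-F-measure comparison; the exact confusion counts below are my reconstruction, chosen to be consistent with the reported numbers (accuracy 0.8 in both cases, F-measure 0.75 vs. 0):

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f_measure(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# seemingly sensible classifier: 5 positives / 5 negatives
print(accuracy(tp=3, fp=0, fn=2, tn=5), f_measure(tp=3, fp=0, fn=2, tn=5))  # 0.8, 0.75
# unreasonable classifier: predicts everything negative on 2 positives / 8 negatives
print(accuracy(tp=0, fp=0, fn=2, tn=8), f_measure(tp=0, fp=0, fn=2, tn=8))  # 0.8, 0.0
```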

  21. Training and Evaluation
■ Usual training with accuracy: training minimizes a surrogate risk that is calibrated to the 0-1 risk, and evaluation is the 0-1 risk ⇒ compatible
■ Training with accuracy but evaluating with the F-measure: the surrogate risk is calibrated to the 0-1 risk, yet evaluation uses the F-measure ⇒ compatible??? What we want is a surrogate utility calibrated to the F-measure.

  22. Not Only F₁, but Many Others
■ Accuracy: $\mathsf{Acc} = \mathsf{TP} + \mathsf{TN}$
■ Weighted Accuracy: $\mathsf{WAcc} = \dfrac{w_1 \mathsf{TP} + w_2 \mathsf{TN}}{w_1 \mathsf{TP} + w_2 \mathsf{TN} + w_3 \mathsf{FP} + w_4 \mathsf{FN}}$
■ F-measure: $\mathsf{F}_1 = \dfrac{2\,\mathsf{TP}}{2\,\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}$
■ Jaccard index: $\mathsf{Jac} = \dfrac{\mathsf{TP}}{\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}$
■ Gower-Legendre index: $\mathsf{GLI} = \dfrac{\mathsf{TP} + \mathsf{TN}}{\mathsf{TP} + \alpha(\mathsf{FP} + \mathsf{FN}) + \mathsf{TN}}$
■ Balanced Error Rate: $\mathsf{BER} = \dfrac{1}{2}\left(\dfrac{\mathsf{FN}}{\pi} + \dfrac{\mathsf{FP}}{1 - \pi}\right)$, where $\pi = \mathbb{P}(Y = +1)$
Q. Can we handle all of them in the same way?
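A sketch (assumed helper, not from the slides) evaluating the listed metrics from the confusion rates TP, FP, FN, TN (probabilities summing to 1), with $\pi = \mathbb{P}(Y=+1) = \mathsf{TP} + \mathsf{FN}$:

```python
def metrics(tp, fp, fn, tn, w=(1, 1, 1, 1), alpha=0.5):
    pi = tp + fn                                   # class prior of the positive class
    return {
        "Acc":  tp + tn,
        "WAcc": (w[0]*tp + w[1]*tn) / (w[0]*tp + w[1]*tn + w[2]*fp + w[3]*fn),
        "F1":   2*tp / (2*tp + fp + fn),
        "Jac":  tp / (tp + fp + fn),
        "GLI":  (tp + tn) / (tp + alpha*(fp + fn) + tn),
        "BER":  0.5 * (fn / pi + fp / (1 - pi)),
    }

print(metrics(tp=0.30, fp=0.05, fn=0.10, tn=0.55))
```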

  23. Unification of Metrics
Actual metrics are linear-fractional in $(\mathsf{TP}, \mathsf{FP})$:
$U(f) = \dfrac{a_0 \mathsf{TP} + b_0 \mathsf{FP} + c_0}{a_1 \mathsf{TP} + b_1 \mathsf{FP} + c_1}$, where $a_k, b_k, c_k$ are constants.
Note: $\mathsf{TN} = \mathbb{P}(Y = -1) - \mathsf{FP}$ and $\mathsf{FN} = \mathbb{P}(Y = +1) - \mathsf{TP}$, so e.g. $\mathsf{F}_1 = \dfrac{2\,\mathsf{TP}}{2\,\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}$ and $\mathsf{Jac} = \dfrac{\mathsf{TP}}{\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}$ fit this form (a worked instance follows below).
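Worked instance (following the slide's note, not an additional result): substituting $\mathsf{FN} = \pi - \mathsf{TP}$ shows the F-measure is linear-fractional in $(\mathsf{TP}, \mathsf{FP})$.

```latex
\mathsf{F}_1
  = \frac{2\,\mathsf{TP}}{2\,\mathsf{TP} + \mathsf{FP} + \mathsf{FN}}
  = \frac{2\,\mathsf{TP}}{2\,\mathsf{TP} + \mathsf{FP} + (\pi - \mathsf{TP})}
  = \frac{2\,\mathsf{TP} + 0 \cdot \mathsf{FP} + 0}{1 \cdot \mathsf{TP} + 1 \cdot \mathsf{FP} + \pi},
\qquad \pi := \mathbb{P}(Y = +1),
% i.e. U(f) with (a_0, b_0, c_0) = (2, 0, 0) and (a_1, b_1, c_1) = (1, 1, \pi).
```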

  24. Unification of Metrics
■ TP and FP are expectations of 0/1-losses:
$\mathsf{TP} = \mathbb{E}_{X, Y=+1}[\mathbf{1}[f(X) > 0]]$ ▶ positive data && positive prediction
$\mathsf{FP} = \mathbb{E}_{X, Y=-1}[\mathbf{1}[f(X) > 0]]$ ▶ negative data && positive prediction
■ Hence the numerator and the denominator of $U(f) = \dfrac{a_0 \mathsf{TP} + b_0 \mathsf{FP} + c_0}{a_1 \mathsf{TP} + b_1 \mathsf{FP} + c_1}$ are each a single expectation, $a_k \mathsf{TP} + b_k \mathsf{FP} + c_k = \mathbb{E}_X[W_k(f(X))]$, so the linear-fractional utility is an expectation divided by an expectation.
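A sketch (assumed example, not the paper's estimator) of plugging empirical expectations into the linear-fractional utility; with $(a_0, b_0, c_0) = (2, 0, 0)$ and $(a_1, b_1, c_1) = (1, 1, \pi)$ this estimates the F-measure:

```python
import numpy as np

def utility_estimate(scores, y, a0, b0, c0, a1, b1, c1):
    tp = np.mean((y == +1) & (scores > 0))   # empirical TP = E[1{Y = +1, f(X) > 0}]
    fp = np.mean((y == -1) & (scores > 0))   # empirical FP = E[1{Y = -1, f(X) > 0}]
    return (a0 * tp + b0 * fp + c0) / (a1 * tp + b1 * fp + c1)

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=1000, p=[0.8, 0.2])     # imbalanced toy labels
scores = y * 0.5 + rng.normal(size=1000)             # toy classifier scores f(x)
pi = np.mean(y == 1)                                 # empirical class prior
print(utility_estimate(scores, y, a0=2, b0=0, c0=0, a1=1, b1=1, c1=pi))  # F-measure
```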
