SLIDE 1 Learning with Non-decomposable Performance Measures:
Stochastic Optimization & Statistical Consistency
Harikrishna Narasimhan
Department of Computer Science and Automation Indian Institute of Science, Bangalore
SLIDE 2
SLIDE 3 performance measure?
SLIDE 4
0-1 Classification Error
SLIDE 5
0-1 Classification Error
point-wise loss
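As a concrete contrast with the non-decomposable measures that follow, the 0-1 error is decomposable: it is just an average of per-point indicator losses. A minimal illustrative sketch (not code from the talk), assuming binary labels:

```python
def zero_one_error(y_true, y_pred):
    """0-1 classification error: the average of point-wise losses 1[y != h(x)]."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)
```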
SLIDE 6 Text Retrieval
F-measure
F = 2 × Precision × Recall / (Precision + Recall)
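The slide's formula can be computed directly from label vectors. A sketch assuming binary labels in {0, 1} and the convention F = 0 when there are no true positives:

```python
def f_measure(y_true, y_pred):
    """F-measure: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # convention: no true positives means F = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```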
SLIDE 7 Medical Diagnosis
Area Under the ROC Curve (AUC)
False Positive Rate vs. True Positive Rate (ROC axes)
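AUC equals the probability that a randomly drawn positive is scored above a randomly drawn negative. A brute-force pairwise sketch (counting ties as half is an assumption of this sketch, not stated on the slide):

```python
def auc(scores_pos, scores_neg):
    """Fraction of positive-negative pairs ranked correctly (ties count half)."""
    correct = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(scores_pos) * len(scores_neg))
```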
SLIDE 8 Information Retrieval
Precision@K
in Top-K positions
SLIDE 9 http://www.tagxedo.com
SLIDE 10 Non-decomposable Performance Measures
……
SLIDE 11 Non-decomposable Performance Measures
……
cannot be expressed as a sum of point-wise errors!
SLIDE 12 Performance Measures
Algorithms
……
SLIDE 13 Performance Measures
Algorithms
……
Q1: Efficient Optimization?
SLIDE 14 Performance Measures
Algorithms
……
Q1: Efficient Optimization? Q2: Statistical Consistency?
SLIDE 15 Performance Measures
Algorithms
……
Efficient Learning Algorithms
Kar, P., Narasimhan, H. and Jain, P. “Online and Stochastic Gradient Methods for Non-decomposable Loss Functions”, NIPS 2014. To appear.
Narasimhan, H. and Agarwal, S. “SVMpAUC-tight: A new support vector method for optimizing partial AUC based on a tight convex upper bound”, KDD 2013.
Narasimhan, H. and Agarwal, S. “A structural SVM based approach for optimizing partial AUC”, ICML 2013.
SLIDE 16 Performance Measures
Algorithms
……
Statistical Consistency of Learning Algorithms
Narasimhan, H. and Agarwal, S. “On the statistical consistency of plug-in classifiers for non-decomposable performance measures”, NIPS 2014. To appear.
Narasimhan, H. and Agarwal, S. “On the relationship between binary classification, bipartite ranking, and binary class probability estimation”, NIPS 2013.
Menon, A., Narasimhan, H., Agarwal, S. and Chawla, S. “On the statistical consistency of algorithms for binary classification under class imbalance”, ICML 2013.
SLIDE 17
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
SLIDE 18
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 19 Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Purushottam Kar MSR, Bangalore Prateek Jain MSR, Bangalore
SLIDE 20
Stochastic Gradient Descent
convex (point-wise)
SLIDE 21
Stochastic Gradient Descent
convex (point-wise)
SLIDE 22
Stochastic Gradient Descent
convex (point-wise)
SLIDE 23
Stochastic Gradient Descent
convex (point-wise)
SLIDE 24
Stochastic Gradient Descent
convex (point-wise) point-wise update
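The point-wise update on the slide is ordinary SGD: each streamed example contributes one gradient step. A minimal sketch, where the names (`grad_fn`, `eta`) and the constant step size are illustrative assumptions; projections and decaying step-size schedules are omitted:

```python
def sgd(grad_fn, data, w0, eta=0.1):
    """Vanilla SGD for a decomposable loss: one point-wise update per example.

    grad_fn(w, x, y) is assumed to return the gradient of the point-wise
    loss at the single example (x, y).
    """
    w = list(w0)
    for x, y in data:
        g = grad_fn(w, x, y)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w
```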
SLIDE 25
Stochastic Gradient Descent
SLIDE 26
Stochastic Gradient Descent
Note on Proof: – Unbiased gradient estimates (estimated gradient = true gradient)
SLIDE 27
Stochastic Gradient Descent
Note on Proof: – Unbiased gradient estimates (estimated gradient = true gradient) – Point-wise arguments!
SLIDE 28
Stochastic Gradient Descent
convex function of all points!
SLIDE 29
Stochastic Gradient Descent
convex function of all points! point-wise update
SLIDE 30
Previous Work
SLIDE 31 Previous Work
- Stochastic methods for pair-wise performance
measures (Zhao et al., 11; Kar et al., 13)
– Finite buffer sampling schemes
SLIDE 32 Previous Work
- Stochastic methods for pair-wise performance
measures (Zhao et al., 11; Kar et al., 13)
– Finite buffer sampling schemes
pair-wise decomposability
SLIDE 33 Previous Work
- Stochastic methods for pair-wise performance
measures (Zhao et al., 11; Kar et al., 13)
– Finite buffer sampling schemes
- Online learning with non-additive regret
(Rakhlin et al., 11)
– Algorithms provided not tractable; instantiation to popular losses not clear
pair-wise decomposability
SLIDE 34
- Convex surrogates for non-decomposable measures
- Mini-batch stochastic methods
- Convergence guarantees
- Experimental results
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 35
non-decomposable performance measures … non-convex, discontinuous
convex
SLIDE 36
non-decomposable performance measures … non-convex, discontinuous … convex relaxation (Joachims, 2005)
convex
(SVMPerf Package)
SLIDE 37
F-measure
SLIDE 38
F-measure
(true labeling) (predicted labeling)
SLIDE 39 F-measure
Convex Surrogate Loss (SVMPerf: Joachims, 05)
SLIDE 40 F-measure
non-decomposable
Convex Surrogate Loss (SVMPerf: Joachims, 05)
SLIDE 41 Precision@K
- No. of positive instances in the Top-K
positions of the ranked list
SLIDE 42 Precision@K
- No. of positive instances in the Top-K
positions of the ranked list
non-decomposable
Convex Surrogate Loss (SVMPerf: Joachims, 05)
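Precision@K as defined on these slides counts positives in the top-K positions of the ranked list; the sketch below normalizes by K, and the sort-by-score implementation is an illustrative assumption:

```python
def precision_at_k(scores, labels, k):
    """Fraction of the top-K scored instances that are positive (labels in {0, 1})."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])  # rank by decreasing score
    return sum(y for _, y in ranked[:k]) / k
```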
SLIDE 43
(Partial) AUC
SLIDE 44
(Partial) AUC
SLIDE 45 (Partial) AUC
non-decomposable
Convex Surrogate Loss (Narasimhan & Agarwal, 13)
SLIDE 46
- Convex surrogates for non-decomposable measures
- Mini-batch stochastic methods
- Convergence guarantees
- Experimental results
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 47 Data stream: (x1, y1), (x2, y2), (x3, y3), (x4, y4), …
point-wise updates?
SLIDE 48 Data stream: (x1, y1), (x2, y2), (x3, y3), (x4, y4), …
……
SLIDE 49 Data stream: (x1, y1), (x2, y2), (x3, y3), (x4, y4), …
……
SLIDE 50
1-Pass Mini-Batch
SLIDE 51
2-Pass Mini-Batch
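The 1-pass mini-batch scheme replaces the point-wise gradient with a (sub)gradient of the non-decomposable surrogate evaluated jointly on a buffer of ‘s’ streamed points. A sketch, where `batch_grad_fn` and the constant step size are assumptions; the 2-pass variant, which revisits the buffered points, is omitted:

```python
def minibatch_sgd(batch_grad_fn, stream, w0, s=4, eta=0.1):
    """1-pass mini-batch SGD sketch for a non-decomposable surrogate.

    batch_grad_fn(w, batch) is assumed to return the (sub)gradient of the
    surrogate evaluated jointly on the whole buffer of 's' points.
    """
    w = list(w0)
    buffer = []
    for x, y in stream:
        buffer.append((x, y))
        if len(buffer) == s:
            g = batch_grad_fn(w, buffer)
            w = [wi - eta * gi for wi, gi in zip(w, g)]
            buffer = []  # 1-pass: each streamed point is used exactly once
    return w
```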
SLIDE 52
- Convex surrogates for non-decomposable measures
- Mini-batch stochastic methods
- Convergence guarantees
- Experimental results
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 53
But first, some intuition
(‘s’ random points)
SLIDE 54
But first, some intuition
(‘s’ random points) (population of ‘n’ points)
SLIDE 55
But first, some intuition
(‘s’ random points) (population of ‘n’ points)
How well does the loss evaluated on the ‘s’ random points generalize to the entire population?
SLIDE 56
But first, some intuition
(‘s’ random points) (population of ‘n’ points)
How well does the loss evaluated on the ‘s’ random points generalize to the entire population?
SLIDE 57
Uniform Convergence
SLIDE 58
Uniform Convergence
decreases with mini-batch length ‘s’
SLIDE 59
Convergence Guarantee
SLIDE 60
Convergence Guarantee
decreases with mini-batch length ‘s’
SLIDE 61 Convergence Guarantee
decreases with mini-batch length ‘s’ (no. of updates)
SLIDE 62 Convergence Guarantee
decreases with mini-batch length ‘s’ (no. of updates)
increases with mini-batch length ‘s’
SLIDE 63
Instantiation to Specific Measures
SLIDE 64
- Convex surrogates for non-decomposable measures
- Mini-batch stochastic methods
- Convergence guarantees
- Experimental results
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 65 Experimental Results
(Partial AUC)
PPI KDD Cup 08 IJCNN Letter
SLIDE 66 Experimental Results
(Partial AUC)
PPI KDD Cup 08 IJCNN Letter
Batch Methods
SLIDE 67 Experimental Results
(Precision@K)
PPI KDD Cup 08 IJCNN Letter
Batch Method
SLIDE 68
Experimental Results
(Robustness to Epoch Lengths)
SLIDE 69
- Convex surrogates for non-decomposable measures
- Mini-batch stochastic methods
- Convergence guarantees
- Experimental results
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 70 Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
Rohit Vaish IISc, Bangalore Shivani Agarwal IISc, Bangalore
SLIDE 71 Our goal: In practice: (surrogate)
SLIDE 72 Our goal: In practice: (surrogate)
SLIDE 73 Our goal: In practice: (surrogate)
SLIDE 74 Our goal: In practice: (surrogate)
Part I was about solving this problem for non-decomposable measures with linear predictors
SLIDE 75 Our goal: In practice: (surrogate)
?
SLIDE 76 Does the given learning algorithm for a performance measure converge, in the limit of infinite training data, to the (Bayes) optimal predictor for the measure?
SLIDE 77 Statistical Consistency
Data Space Model Space
SLIDE 78 Statistical Consistency
Data Space Model Space
SLIDE 79 Statistical Consistency
Data Space Model Space
regret
SLIDE 80 Statistical Consistency
Data Space Model Space
regret
regret → 0 in probability?
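In standard notation (generic symbols here, not necessarily the slides' exact ones), consistency for a performance measure Ψ asks that the Ψ-regret of the learned model h_n vanish in probability as the training size grows:

```latex
\mathrm{regret}_\Psi(h_n) \;=\; \Psi^{*} - \Psi(h_n),
\qquad
\mathrm{regret}_\Psi(h_n) \xrightarrow{\;P\;} 0
\quad \text{as } n \to \infty,
```

where Ψ* denotes the (Bayes) optimal value of the measure over all predictors.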
SLIDE 81
Statistical Consistency
Underlying (unknown) distribution D over instances and labels
SLIDE 82
Statistical Consistency
Underlying (unknown) distribution D over instances and labels
SLIDE 83
Statistical Consistency
Underlying (unknown) distribution D over instances and labels
SLIDE 84
Statistical Consistency
Underlying (unknown) distribution D over instances and labels
SLIDE 85
Statistical Consistency
SLIDE 86 Statistical Consistency
– 0-1 classification error: Zhang, 04; Bartlett et al., 06
– Cost-weighted classification error: Scott, 12
– Balanced classification error: Narasimhan et al., 13
– Logistic, squared, exponential losses (strictly proper losses): Reid & Williamson, 09, 10
– AUC: Clemencon et al., 08; Agarwal et al., 14
SLIDE 87 Statistical Consistency
– 0-1 classification error: Zhang, 04; Bartlett et al., 06
– Cost-weighted classification error: Scott, 12
– Balanced classification error: Narasimhan et al., 13
– Logistic, squared, exponential losses (strictly proper losses): Reid & Williamson, 09, 10
– AUC: Clemencon et al., 08; Agarwal et al., 14
- General non-decomposable measure?
SLIDE 88
- Plug-in methods for classification measures
- Main consistency result
- Experimental results
- Proof intuition
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 89 Plug-in Method
Training Set
SLIDE 90 Plug-in Method
Training Set Class Probability Estimate
SLIDE 91 Plug-in Method
Training Set Class Probability Estimate Threshold Choice
SLIDE 92 Classification Measures
+1
+1
SLIDE 93 Classification Measures
true positive rate (TPR)
true negative rate (TNR)
SLIDE 94
Classification Measures
SLIDE 95
AM-measure (1 - BER)
Classification Measures
SLIDE 96
Classification Measures
G-mean
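Both measures on these slides are simple functions of TPR and TNR: the AM-measure is their arithmetic mean (equal to 1 minus the balanced error rate) and the G-mean is their geometric mean. A sketch, assuming binary labels in {0, 1} with both classes present:

```python
import math

def tpr_tnr(y_true, y_pred):
    """True positive rate and true negative rate (assumes both classes occur)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg

def am_measure(y_true, y_pred):
    """Arithmetic mean of TPR and TNR, i.e. 1 - balanced error rate (BER)."""
    tpr, tnr = tpr_tnr(y_true, y_pred)
    return (tpr + tnr) / 2

def g_mean(y_true, y_pred):
    """Geometric mean of TPR and TNR."""
    tpr, tnr = tpr_tnr(y_true, y_pred)
    return math.sqrt(tpr * tnr)
```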
SLIDE 97
F-measure
Classification Measures
where Prec = proportion of points with y = 1 among those with h(x) = 1
SLIDE 98
Classification Measures
non-dec ecomp mposab sable le
SLIDE 99
More formally,
Underlying (unknown) distribution D with:
SLIDE 100 More formally,
Underlying (unknown) distribution D with: proportion
SLIDE 101 More formally,
Underlying (unknown) distribution D with: proportion
Plug-in Method
estimate: (using S1)
SLIDE 102 More formally,
Underlying (unknown) distribution D with: proportion
Plug-in Method
estimate: (using S1) threshold: (using S2)
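The two-step procedure above can be sketched as: fit a class-probability estimate on S1, then search for the threshold on S2 that maximizes the target measure. All names here (`eta_hat`, `metric`, the grid of candidate thresholds) are illustrative assumptions, not the talk's exact construction:

```python
def plug_in(eta_hat, S2, metric, grid=None):
    """Plug-in sketch: eta_hat(x) is a class-probability estimate fit on S1;
    pick the threshold on held-out sample S2 that maximizes the target metric.

    S2 is a list of (x, y) pairs; metric(y_true, y_pred) returns a score
    to maximize.
    """
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # candidate thresholds
    best_t, best_v = 0.5, float("-inf")
    for t in grid:
        preds = [1 if eta_hat(x) >= t else 0 for x, _ in S2]
        v = metric([y for _, y in S2], preds)
        if v > best_v:
            best_t, best_v = t, v
    # the plug-in classifier: threshold the probability estimate at best_t
    return lambda x: 1 if eta_hat(x) >= best_t else 0
```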
SLIDE 103
- Plug-in methods for classification measures
- Main consistency result
- Experimental results
- Proof intuition
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 104
But first, some intuition
SLIDE 105
But first, some intuition
SLIDE 106
But first, some intuition
Optimal classifier for ?
SLIDE 107 But first, some intuition
0.5
Classification error
SLIDE 108 But first, some intuition
0.5
Classification error General non-decomposable measure
?
SLIDE 109
Main Consistency Result
SLIDE 110
Main Consistency Result
(w.r.t. S1)
SLIDE 111
Main Consistency Result
(w.r.t. S1)
✓
SLIDE 112
Main Consistency Result
(w.r.t. S1)
✓
?
SLIDE 113 Instantiation to Specific Measures
(Menon et al., 13) (Ye et al., 12)
SLIDE 114 Instantiation to Specific Measures
(Menon et al., 13) (Ye et al., 12)
SLIDE 115
Instantiation to Specific Measures
SLIDE 116
- Plug-in methods for classification measures
- Main consistency result
- Experimental results
- Proof intuition
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 117 Experimental Results
– Gaussian class conditionals, equal covariance, p = 0.1 – Optimal classifier can be computed by hand
SLIDE 118 Experimental Results
– Gaussian class conditionals, equal covariance, p = 0.1 – Optimal classifier can be computed by hand
SLIDE 119 Experimental Results
– Gaussian class conditionals, equal covariance, p = 0.1 – Optimal classifier can be computed by hand
do not converge to zero
SLIDE 120 Experimental Results
– Gaussian class conditionals, equal covariance, p = 0.1 – Optimal classifier can be computed by hand
SLIDE 121
- Plug-in methods for classification measures
- Main consistency result
- Experimental results
- Proof intuition
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 122
Proof Intuition
SLIDE 123 Proof Intuition
implies for any fixed ‘c’
SLIDE 124 Proof Intuition
implies for any fixed ‘c’
Uniform convergence generalization bound
SLIDE 125
- Plug-in methods for classification measures
- Main consistency result
- Experimental results
- Proof intuition
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 126
Questions?