SLIDE 1 Learning with Non-decomposable Performance Measures:
Stochastic Optimization & Statistical Consistency
Harikrishna Narasimhan
Department of Computer Science and Automation Indian Institute of Science, Bangalore
SLIDE 2
SLIDE 3 performance measure?
SLIDE 4
0-1 Classification Error
SLIDE 5
0-1 Classification Error
point-wise loss
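As a concrete contrast with the non-decomposable measures that follow, the 0-1 error is decomposable: it is just an average of per-point indicator losses. A minimal illustrative sketch (not code from the talk), assuming binary labels:

```python
def zero_one_error(y_true, y_pred):
    """0-1 classification error: the average of point-wise losses 1[y != h(x)]."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)
```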
SLIDE 6 Text Retrieval
F-measure
F = 2 × Precision × Recall / (Precision + Recall)
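The slide's formula can be computed directly from label vectors. A sketch assuming binary labels in {0, 1} and the convention F = 0 when there are no true positives:

```python
def f_measure(y_true, y_pred):
    """F-measure: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # convention: no true positives means F = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```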
SLIDE 7 Medical Diagnosis
Area Under the ROC Curve (AUC)
False Positive Rate vs. True Positive Rate (ROC axes)
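AUC equals the probability that a randomly drawn positive is scored above a randomly drawn negative. A brute-force pairwise sketch (counting ties as half is an assumption of this sketch, not stated on the slide):

```python
def auc(scores_pos, scores_neg):
    """Fraction of positive-negative pairs ranked correctly (ties count half)."""
    correct = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(scores_pos) * len(scores_neg))
```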
SLIDE 8 Information Retrieval
Precision@K
in Top-K positions
SLIDE 9 http://www.tagxedo.com
SLIDE 10 Non-decomposable Performance Measures
……
SLIDE 11 Non-decomposable Performance Measures
……
cannot be expressed as a sum of point-wise errors!
SLIDE 12 Performance Measures
Algorithms
……
SLIDE 13 Performance Measures
Algorithms
……
Q1: Efficient Optimization?
SLIDE 14 Performance Measures
Algorithms
……
Q1: Efficient Optimization? Q2: Statistical Consistency?
SLIDE 15 Performance Measures
Algorithms
……
Efficient Learning Algorithms
Kar, P., Narasimhan, H. and Jain, P. “Online and Stochastic Gradient Methods for Non-decomposable Loss Functions”, NIPS 2014. To appear.
Narasimhan, H. and Agarwal, S. “SVMpAUC-tight: A new support vector method for optimizing partial AUC based on a tight convex upper bound”, KDD 2013.
Narasimhan, H. and Agarwal, S. “A structural SVM based approach for optimizing partial AUC”, ICML 2013.
SLIDE 16 Performance Measures
Algorithms
……
Statistical Consistency of Learning Algorithms
Narasimhan, H. and Agarwal, S. “On the statistical consistency of plug-in classifiers for non-decomposable performance measures”, NIPS 2014. To appear.
Narasimhan, H. and Agarwal, S. “On the relationship between binary classification, bipartite ranking, and binary class probability estimation”, NIPS 2013.
Menon, A., Narasimhan, H., Agarwal, S. and Chawla, S. “On the statistical consistency of algorithms for binary classification under class imbalance”, ICML 2013.
SLIDE 17
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
SLIDE 18
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 19 Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Purushottam Kar MSR, Bangalore Prateek Jain MSR, Bangalore
SLIDE 20
Stochastic Gradient Descent
convex (point-wise)
SLIDE 21
Stochastic Gradient Descent
convex (point-wise)
SLIDE 22
Stochastic Gradient Descent
convex (point-wise)
SLIDE 23
Stochastic Gradient Descent
convex (point-wise)
SLIDE 24
Stochastic Gradient Descent
convex (point-wise) point-wise update
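The point-wise update on the slide is ordinary SGD: each streamed example contributes one gradient step. A minimal sketch, where the names (`grad_fn`, `eta`) and the constant step size are illustrative assumptions; projections and decaying step-size schedules are omitted:

```python
def sgd(grad_fn, data, w0, eta=0.1):
    """Vanilla SGD for a decomposable loss: one point-wise update per example.

    grad_fn(w, x, y) is assumed to return the gradient of the point-wise
    loss at the single example (x, y).
    """
    w = list(w0)
    for x, y in data:
        g = grad_fn(w, x, y)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w
```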
SLIDE 25
Stochastic Gradient Descent
SLIDE 26
Stochastic Gradient Descent
Note on Proof: – Unbiased gradient estimates (estimated gradient = true gradient)
SLIDE 27
Stochastic Gradient Descent
Note on Proof: – Unbiased gradient estimates (estimated gradient = true gradient) – Point-wise arguments!
SLIDE 28
Stochastic Gradient Descent
convex function of all points!
SLIDE 29
Stochastic Gradient Descent
convex function of all points! point-wise update
SLIDE 30
Previous Work
SLIDE 31 Previous Work
- Stochastic methods for pair-wise performance
measures (Zhao et al., 11; Kar et al., 13)
– Finite buffer sampling schemes
SLIDE 32 Previous Work
- Stochastic methods for pair-wise performance
measures (Zhao et al., 11; Kar et al., 13)
– Finite buffer sampling schemes
pair-wise decomposability
SLIDE 33 Previous Work
- Stochastic methods for pair-wise performance
measures (Zhao et al., 11; Kar et al., 13)
– Finite buffer sampling schemes
- Online learning with non-additive regret
(Rakhlin et al., 11)
– Algorithms provided not tractable; instantiation to popular losses not clear
pair-wise decomposability
SLIDE 34
- Convex surrogates for non-decomposable measures
- Mini-batch stochastic methods
- Convergence guarantees
- Experimental results
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 35
non-decomposable performance measures … non-convex, discontinuous
convex
SLIDE 36
non-decomposable performance measures … non-convex, discontinuous … convex relaxation (Joachims, 2005)
convex
(SVMPerf Package)
SLIDE 37
F-measure
SLIDE 38
F-measure
(true labeling) (predicted labeling)
SLIDE 39 F-measure
Convex Surrogate Loss (SVMPerf: Joachims, 05)
SLIDE 40 F-measure
non-decomposable
Convex Surrogate Loss (SVMPerf: Joachims, 05)
SLIDE 41 Precision@K
- No. of positive instances in the Top-K
positions of the ranked list
SLIDE 42 Precision@K
- No. of positive instances in the Top-K
positions of the ranked list
non-decomposable
Convex Surrogate Loss (SVMPerf: Joachims, 05)
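Precision@K as defined on these slides counts positives in the top-K positions of the ranked list; the sketch below normalizes by K, and the sort-by-score implementation is an illustrative assumption:

```python
def precision_at_k(scores, labels, k):
    """Fraction of the top-K scored instances that are positive (labels in {0, 1})."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])  # rank by decreasing score
    return sum(y for _, y in ranked[:k]) / k
```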
SLIDE 43
(Partial) AUC
SLIDE 44
(Partial) AUC
SLIDE 45 (Partial) AUC
non-decomposable
Convex Surrogate Loss (Narasimhan & Agarwal, 13)
SLIDE 46
- Convex surrogates for non-decomposable measures
- Mini-batch stochastic methods
- Convergence guarantees
- Experimental results
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 47 Data stream: (x1, y1), (x2, y2), (x3, y3), (x4, y4), …
point-wise updates?
SLIDE 48 Data stream: (x1, y1), (x2, y2), (x3, y3), (x4, y4), …
……
SLIDE 49 Data stream: (x1, y1), (x2, y2), (x3, y3), (x4, y4), …
……
SLIDE 50
1-Pass Mini-Batch
SLIDE 51
2-Pass Mini-Batch
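The 1-pass mini-batch scheme replaces the point-wise gradient with a (sub)gradient of the non-decomposable surrogate evaluated jointly on a buffer of ‘s’ streamed points. A sketch, where `batch_grad_fn` and the constant step size are assumptions; the 2-pass variant, which revisits the buffered points, is omitted:

```python
def minibatch_sgd(batch_grad_fn, stream, w0, s=4, eta=0.1):
    """1-pass mini-batch SGD sketch for a non-decomposable surrogate.

    batch_grad_fn(w, batch) is assumed to return the (sub)gradient of the
    surrogate evaluated jointly on the whole buffer of 's' points.
    """
    w = list(w0)
    buffer = []
    for x, y in stream:
        buffer.append((x, y))
        if len(buffer) == s:
            g = batch_grad_fn(w, buffer)
            w = [wi - eta * gi for wi, gi in zip(w, g)]
            buffer = []  # 1-pass: each streamed point is used exactly once
    return w
```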
SLIDE 52
- Convex surrogates for non-decomposable measures
- Mini-batch stochastic methods
- Convergence guarantees
- Experimental results
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 53
But first, some intuition
(‘s’ random points)
SLIDE 54
But first, some intuition
(‘s’ random points) (population of ‘n’ points)
SLIDE 55
But first, some intuition
(‘s’ random points) (population of ‘n’ points)
How well does the loss evaluated on the ‘s’ random points generalize to the entire population?
SLIDE 56
But first, some intuition
(‘s’ random points) (population of ‘n’ points)
How well does the loss evaluated on the ‘s’ random points generalize to the entire population?
SLIDE 57
Uniform Convergence
SLIDE 58
Uniform Convergence
decreases with mini-batch length ‘s’
SLIDE 59
Convergence Guarantee
SLIDE 60
Convergence Guarantee
decreases with mini-batch length ‘s’
SLIDE 61 Convergence Guarantee
decreases with mini-batch length ‘s’ (no. of updates)
SLIDE 62 Convergence Guarantee
decreases with mini-batch length ‘s’ (no. of updates)
increases with mini-batch length ‘s’
SLIDE 63
Instantiation to Specific Measures
SLIDE 64
- Convex surrogates for non-decomposable measures
- Mini-batch stochastic methods
- Convergence guarantees
- Experimental results
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 65 Experimental Results
(Partial AUC)
PPI KDD Cup 08 IJCNN Letter
SLIDE 66 Experimental Results
(Partial AUC)
PPI KDD Cup 08 IJCNN Letter
Batch Methods
SLIDE 67 Experimental Results
(Precision@K)
PPI KDD Cup 08 IJCNN Letter
Batch Method
SLIDE 68
Experimental Results
(Robustness to Epoch Lengths)
SLIDE 69
- Convex surrogates for non-decomposable measures
- Mini-batch stochastic methods
- Convergence guarantees
- Experimental results
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 70 Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
Rohit Vaish IISc, Bangalore Shivani Agarwal IISc, Bangalore
SLIDE 71 Our goal: In practice: (surrogate)
SLIDE 72 Our goal: In practice: (surrogate)
SLIDE 73 Our goal: In practice: (surrogate)
SLIDE 74 Our goal: In practice: (surrogate)
Part I was about solving this problem for non-decomposable measures with linear predictors
SLIDE 75 Our goal: In practice: (surrogate)
?
SLIDE 76 Does the given learning algorithm for a performance measure converge, in the limit of infinite training data, to the (Bayes) optimal predictor for the measure?
SLIDE 77 Statistical Consistency
Data Space Model Space
SLIDE 78 Statistical Consistency
Data Space Model Space
SLIDE 79 Statistical Consistency
Data Space Model Space
regret
SLIDE 80 Statistical Consistency
Data Space Model Space
regret
regret → 0 in probability?
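In standard notation (generic symbols here, not necessarily the slides' exact ones), consistency for a performance measure Ψ asks that the Ψ-regret of the learned model h_n vanish in probability as the training size grows:

```latex
\mathrm{regret}_\Psi(h_n) \;=\; \Psi^{*} - \Psi(h_n),
\qquad
\mathrm{regret}_\Psi(h_n) \xrightarrow{\;P\;} 0
\quad \text{as } n \to \infty,
```

where Ψ* denotes the (Bayes) optimal value of the measure over all predictors.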
SLIDE 81
Statistical Consistency
Underlying (unknown) distribution D over instances and labels
SLIDE 82
Statistical Consistency
Underlying (unknown) distribution D over instances and labels
SLIDE 83
Statistical Consistency
Underlying (unknown) distribution D over instances and labels
SLIDE 84
Statistical Consistency
Underlying (unknown) distribution D over instances and labels
SLIDE 85
Statistical Consistency
SLIDE 86 Statistical Consistency
– 0-1 classification error: Zhang, 04; Bartlett et al., 06
– Cost-weighted classification error: Scott, 12
– Balanced classification error: Narasimhan et al., 13
– Logistic, squared, exponential losses (strictly proper losses): Reid & Williamson, 09, 10
– AUC: Clemencon et al., 08; Agarwal et al., 14
SLIDE 87 Statistical Consistency
– 0-1 classification error: Zhang, 04; Bartlett et al., 06
– Cost-weighted classification error: Scott, 12
– Balanced classification error: Narasimhan et al., 13
– Logistic, squared, exponential losses (strictly proper losses): Reid & Williamson, 09, 10
– AUC: Clemencon et al., 08; Agarwal et al., 14
- General non-decomposable measure?
SLIDE 88
- Plug-in methods for classification measures
- Main consistency result
- Experimental results
- Proof intuition
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 89 Plug-in Method
Training Set
SLIDE 90 Plug-in Method
Training Set Class Probability Estimate
SLIDE 91 Plug-in Method
Training Set Class Probability Estimate Threshold Choice
SLIDE 92 Classification Measures
+1
+1
SLIDE 93 Classification Measures
true positive rate (TPR)
true negative rate (TNR)
SLIDE 94
Classification Measures
SLIDE 95
AM-measure (1 - BER)
Classification Measures
SLIDE 96
Classification Measures
G-mean
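Both measures on these slides are simple functions of TPR and TNR: the AM-measure is their arithmetic mean (equal to 1 minus the balanced error rate) and the G-mean is their geometric mean. A sketch, assuming binary labels in {0, 1} with both classes present:

```python
import math

def tpr_tnr(y_true, y_pred):
    """True positive rate and true negative rate (assumes both classes occur)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg

def am_measure(y_true, y_pred):
    """Arithmetic mean of TPR and TNR, i.e. 1 - balanced error rate (BER)."""
    tpr, tnr = tpr_tnr(y_true, y_pred)
    return (tpr + tnr) / 2

def g_mean(y_true, y_pred):
    """Geometric mean of TPR and TNR."""
    tpr, tnr = tpr_tnr(y_true, y_pred)
    return math.sqrt(tpr * tnr)
```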
SLIDE 97
F-measure
Classification Measures
where Prec = proportion of points with y = 1 among those with h(x) = 1
SLIDE 98
Classification Measures
non-dec ecomp mposab sable le
SLIDE 99
More formally,
Underlying (unknown) distribution D with:
SLIDE 100 More formally,
Underlying (unknown) distribution D with: proportion
SLIDE 101 More formally,
Underlying (unknown) distribution D with: proportion
Plug-in Method
estimate: (using S1)
SLIDE 102 More formally,
Underlying (unknown) distribution D with: proportion
Plug-in Method
estimate: (using S1) threshold: (using S2)
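The two-step procedure above can be sketched as: fit a class-probability estimate on S1, then search for the threshold on S2 that maximizes the target measure. All names here (`eta_hat`, `metric`, the grid of candidate thresholds) are illustrative assumptions, not the talk's exact construction:

```python
def plug_in(eta_hat, S2, metric, grid=None):
    """Plug-in sketch: eta_hat(x) is a class-probability estimate fit on S1;
    pick the threshold on held-out sample S2 that maximizes the target metric.

    S2 is a list of (x, y) pairs; metric(y_true, y_pred) returns a score
    to maximize.
    """
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # candidate thresholds
    best_t, best_v = 0.5, float("-inf")
    for t in grid:
        preds = [1 if eta_hat(x) >= t else 0 for x, _ in S2]
        v = metric([y for _, y in S2], preds)
        if v > best_v:
            best_t, best_v = t, v
    # the plug-in classifier: threshold the probability estimate at best_t
    return lambda x: 1 if eta_hat(x) >= best_t else 0
```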
SLIDE 103
- Plug-in methods for classification measures
- Main consistency result
- Experimental results
- Proof intuition
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 104
But first, some intuition
SLIDE 105
But first, some intuition
SLIDE 106
But first, some intuition
Optimal classifier for ?
SLIDE 107 But first, some intuition
0.5
Classification error
SLIDE 108 But first, some intuition
0.5
Classification error General non-decomposable measure
?
SLIDE 109
Main Consistency Result
SLIDE 110
Main Consistency Result
(w.r.t. S1)
SLIDE 111
Main Consistency Result
(w.r.t. S1)
✓
SLIDE 112
Main Consistency Result
(w.r.t. S1)
✓
?
SLIDE 113 Instantiation to Specific Measures
(Menon et al., 13) (Ye et al., 12)
SLIDE 114 Instantiation to Specific Measures
(Menon et al., 13) (Ye et al., 12)
SLIDE 115
Instantiation to Specific Measures
SLIDE 116
- Plug-in methods for classification measures
- Main consistency result
- Experimental results
- Proof intuition
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 117 Experimental Results
– Gaussian class conditionals, equal covariance, p = 0.1 – Optimal classifier can be computed by hand
SLIDE 118 Experimental Results
– Gaussian class conditionals, equal covariance, p = 0.1 – Optimal classifier can be computed by hand
SLIDE 119 Experimental Results
– Gaussian class conditionals, equal covariance, p = 0.1 – Optimal classifier can be computed by hand
do not converge to zero
SLIDE 120 Experimental Results
– Gaussian class conditionals, equal covariance, p = 0.1 – Optimal classifier can be computed by hand
SLIDE 121
- Plug-in methods for classification measures
- Main consistency result
- Experimental results
- Proof intuition
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 122
Proof Intuition
SLIDE 123 Proof Intuition
implies for any fixed ‘c’
SLIDE 124 Proof Intuition
implies for any fixed ‘c’
Uniform convergence generalization bound
SLIDE 125
- Plug-in methods for classification measures
- Main consistency result
- Experimental results
- Proof intuition
Part I
Stochastic Gradient Methods for Non-decomposable Performance Measures
Part II
Statistical Consistency of Plug-in Methods for Non-decomposable Performance Measures
SLIDE 126
Questions?