Calibrated Surrogate Maximization of Linear-fractional Utility (PowerPoint PPT Presentation)



SLIDE 1

Calibrated Surrogate Maximization of Linear-fractional Utility

7th Feb.

Han Bao (The University of Tokyo / RIKEN AIP)

SLIDE 2

■ Our focus: binary classification

Is accuracy appropriate?

[Figure: two toy sets of positive and negative points, each classified with accuracy 0.8]

May cause severe issues! (e.g. in medical diagnosis)

SLIDE 3

Is accuracy appropriate?

[Figure: the same two classifiers, both with accuracy 0.8, but F-measure 0.75 vs. 0]

$\mathrm{TP} = \mathbb{E}_{X,Y=+1}[\mathbf{1}\{f(X)>0\}]$
$\mathrm{TN} = \mathbb{E}_{X,Y=-1}[\mathbf{1}\{f(X)<0\}]$
$\mathrm{FP} = \mathbb{E}_{X,Y=-1}[\mathbf{1}\{f(X)>0\}]$
$\mathrm{FN} = \mathbb{E}_{X,Y=+1}[\mathbf{1}\{f(X)<0\}]$

F-measure: $F_1 = \dfrac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$
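To make the gap concrete, here is a minimal numeric sketch in Python. The exact counts behind the slide's figure are not recoverable, so the 5-positive/5-negative and 2-positive/8-negative splits below are assumptions chosen to reproduce the stated scores (accuracy 0.8 in both cases, F-measure 0.75 vs. 0):

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Balanced data (5 pos / 5 neg): 3 positives found, all negatives correct.
print(accuracy(tp=3, tn=5, fp=0, fn=2), f1(tp=3, fp=0, fn=2))  # 0.8 0.75
# Imbalanced data (2 pos / 8 neg): predict everything negative.
print(accuracy(tp=0, tn=8, fp=0, fn=2), f1(tp=0, fp=0, fn=2))  # 0.8 0.0
```

The second classifier never finds a positive case yet scores the same accuracy, which is exactly the medical-diagnosis failure mode the slide warns about.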

SLIDE 4

Training and Evaluation

■ Usual empirical risk minimization (ERM)

▶ training: minimize the 0/1-error; evaluation: $\mathrm{Acc} = \mathrm{TP} + \mathrm{TN} = 1 - (\text{0/1-risk})$ → compatible

▶ training: minimize the 0/1-error; evaluation: $F_1 = \dfrac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$ → incompatible

■ Training with accuracy but evaluating with F1 is mismatched; we want a training objective compatible with evaluating $F_1$.

■ Why not direct optimization of the evaluation metric?

SLIDE 5

$\mathrm{Acc} = \mathrm{TP} + \mathrm{TN}$ (Accuracy)

$F_1 = \dfrac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$ (F-measure)

$\mathrm{Jac} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$ (Jaccard index)

$\mathrm{WAcc} = \dfrac{w_1\,\mathrm{TP} + w_2\,\mathrm{TN}}{w_1\,\mathrm{TP} + w_2\,\mathrm{TN} + w_3\,\mathrm{FP} + w_4\,\mathrm{FN}}$ (Weighted Accuracy)

$\mathrm{BER} = \dfrac{1}{\pi}\,\mathrm{FN} + \dfrac{1}{1-\pi}\,\mathrm{FP}$ (Balanced Error Rate)

$\mathrm{GL} = \dfrac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \alpha(\mathrm{FP} + \mathrm{FN}) + \mathrm{TN}}$ (Gower-Legendre index)

$\mathrm{FM} = \dfrac{\mathrm{TP}}{\sqrt{\pi\,(\mathrm{TP} + \mathrm{FP})}}$ (Fowlkes-Mallows index)

$\mathrm{MCC} = \dfrac{\mathrm{TP}\cdot\mathrm{TN} - \mathrm{FP}\cdot\mathrm{FN}}{\sqrt{\pi(1-\pi)(\mathrm{TP}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$ (Matthews Correlation Coefficient)

Wanna unify!!

SLIDE 6

Unification of Metrics

Actual metrics, e.g. $F_1 = \dfrac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$ and $\mathrm{Jac} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$, fit a common linear-fractional form:

$U(f) = \dfrac{a_0\,\mathrm{TP} + b_0\,\mathrm{FP} + c_0}{a_1\,\mathrm{TP} + b_1\,\mathrm{FP} + c_1}$

Note: $\mathrm{TN} = \mathbb{P}(Y=-1) - \mathrm{FP}$ and $\mathrm{FN} = \mathbb{P}(Y=+1) - \mathrm{TP}$, so TN and FN can be eliminated; $a_k, b_k, c_k$ are constants.
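As a sketch of this unification: eliminating FN via $\mathrm{FN} = \pi - \mathrm{TP}$ (with $\pi = \mathbb{P}(Y=+1)$) gives $F_1 = 2\,\mathrm{TP}/(\mathrm{TP} + \mathrm{FP} + \pi)$ and $\mathrm{Jac} = \mathrm{TP}/(\mathrm{FP} + \pi)$. The Python check below uses arbitrary example rates; the helper `linear_fractional` is ours, not the talk's notation:

```python
def linear_fractional(tp, fp, coeffs):
    (a0, b0, c0), (a1, b1, c1) = coeffs
    return (a0 * tp + b0 * fp + c0) / (a1 * tp + b1 * fp + c1)

pi = 0.3                       # class prior P(Y = +1); arbitrary test value
tp, fp = 0.2, 0.1              # example rates (tp <= pi)
fn = pi - tp

f1_coeffs  = ((2, 0, 0), (1, 1, pi))   # F1  = 2TP / (TP + FP + pi)
jac_coeffs = ((1, 0, 0), (0, 1, pi))   # Jac = TP / (FP + pi)

print(linear_fractional(tp, fp, f1_coeffs),  2 * tp / (2 * tp + fp + fn))  # both 0.666...
print(linear_fractional(tp, fp, jac_coeffs), tp / (tp + fp + fn))          # both 0.5
```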

SLIDE 7

Unification of Metrics

■ TP, FP = expectations of the 0/1-loss
▶ e.g. $\mathrm{TP} = \mathbb{P}(Y=+1,\, f(X)>0) = \mathbb{E}_{X,Y=+1}[\mathbf{1}\{f(X)>0\}]$

$U(f) = \dfrac{a_0\,\mathbb{E}_P[\mathbf{1}\{f(X)>0\}] + b_0\,\mathbb{E}_N[\mathbf{1}\{f(X)>0\}] + c_0}{a_1\,\mathbb{E}_P[\mathbf{1}\{f(X)>0\}] + b_1\,\mathbb{E}_N[\mathbf{1}\{f(X)>0\}] + c_1} =: \dfrac{\mathbb{E}_X[W_0(f(X))]}{\mathbb{E}_X[W_1(f(X))]}$

SLIDE 8

Goal of This Talk

Given a metric (utility) $U(f) = \dfrac{a_0\,\mathrm{TP} + b_0\,\mathrm{FP} + c_0}{a_1\,\mathrm{TP} + b_1\,\mathrm{FP} + c_1}$ and a labeled sample $\{(x_i, y_i)\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} \mathbb{P}$, find a classifier $f: \mathcal{X} \to \mathbb{R}$ such that $U(f) = \sup_{f'} U(f')$.

▶ Q. How to optimize $U(f)$ directly?
▶ without estimating the class-posterior probability

SLIDE 9

Outline

■ Introduction
■ Preliminary
▶ Convex Risk Minimization
▶ Plug-in Principle vs. Cost-sensitive Learning
■ Key Idea
▶ Quasi-concave Surrogate
■ Calibration Analysis & Experiments

SLIDE 10

Formulation of Classification

■ Goal of classification: maximize accuracy = minimize the mis-classification rate

$\hat{R}(f) = \dfrac{1}{n}\sum_{i=1}^n \mathbf{1}[y_i \neq \mathrm{sign}(f(x_i))] = \dfrac{1}{n}\sum_{i=1}^n \ell(y_i f(x_i))$

■ (Empirical) surrogate risk: make the 0/1-loss smoother

$\hat{R}_\phi(f) = \dfrac{1}{n}\sum_{i=1}^n \phi(y_i f(x_i))$

[Figure: 0/1, logistic, and hinge losses plotted against the margin $m = y_i f(x_i)$; $m > 0$ means classified correctly, $m < 0$ incorrectly]

Examples of $\phi$ (convex in $f$!):
▶ logistic loss
▶ hinge loss ⇒ SVM
▶ exponential loss ⇒ AdaBoost
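A small self-contained sketch of these definitions (the toy data and linear model are illustrative assumptions, not the talk's setup):

```python
import numpy as np

# Common convex surrogates, as functions of the margin m = y * f(x).
def zero_one(m):    return (m <= 0).astype(float)
def logistic(m):    return np.log1p(np.exp(-m))
def hinge(m):       return np.maximum(0.0, 1.0 - m)    # -> SVM
def exponential(m): return np.exp(-m)                  # -> AdaBoost

def empirical_surrogate_risk(phi, f, X, y):
    """(1/n) * sum_i phi(y_i * f(x_i))."""
    return phi(y * f(X)).mean()

# Toy usage with a linear model f(x) = <w, x>.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
w = np.array([1.0, 0.0])
f = lambda X: X @ w
print(empirical_surrogate_risk(logistic, f, X, y))
```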

SLIDE 11

3 Actors in Risk Minimization

■ Minimize the classification risk (= 1 − Accuracy)

$R(f) = \mathbb{E}[\ell(Y f(X))]$, where $\ell$ is the 0/1-loss and $Y f(X)$ the prediction margin; the 0/1-loss represents whether $X$ is correctly classified by $f$.

■ A surrogate loss makes it tractable

$R_\phi(f) = \mathbb{E}[\phi(Y f(X))]$ (surrogate risk), where $\phi$ is a differentiable upper bound of the 0/1-loss.

■ Sample approximation (M-estimation)

$\hat{R}_\phi(f) = \dfrac{1}{n}\sum_{i=1}^n \phi(y_i f(x_i))$ (empirical (surrogate) risk): what we actually minimize.

[Figure: 0/1, logistic, and hinge losses against the margin $m = y_i f(x_i)$]

SLIDE 12

Convexity & Statistical Property

$R(f) = \mathbb{E}[\ell(Y f(X))]$ (intractable) → $R_\phi(f) = \mathbb{E}[\phi(Y f(X))]$ (tractable, convex) → $\hat{R}_\phi(f) = \dfrac{1}{n}\sum_{i=1}^n \phi(y_i f(x_i))$ (generalizes)

▶ Q. $\mathrm{argmin}\, R_\phi = \mathrm{argmin}\, R$?
▶ A. Yes, with a calibrated surrogate.

Theorem (informal) [Bartlett+ 2006]. Assume $\phi$ is convex. Then $\mathrm{argmin}_f R_\phi(f) = \mathrm{argmin}_f R(f)$ iff $\phi'(0) < 0$.

• P. L. Bartlett, M. I. Jordan, & J. D. McAuliffe. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.

SLIDE 13

Related Work: Plug-in Rule

■ Classifier based on the class-posterior probability [Koyejo+ NIPS2014; Yan+ ICML2018]

▶ Bayes-optimal classifier (accuracy): $\mathrm{sign}\big(\mathbb{P}(Y=+1 \mid x) - \tfrac{1}{2}\big)$
▶ Bayes-optimal classifier (general case): $\mathrm{sign}\big(\mathbb{P}(Y=+1 \mid x) - \delta^*\big)$
⇒ estimate $\mathbb{P}(Y=+1 \mid x)$ and $\delta^*$ independently

[Figure: thresholding the class-posterior $\mathbb{P}(Y=+1 \mid X)$ at $1/2$ vs. at a metric-dependent threshold $\delta^*$]

• O. O. Koyejo, N. Natarajan, P. K. Ravikumar, & I. S. Dhillon. Consistent binary classification with generalized performance metrics. In NIPS, 2014.
• B. Yan, O. Koyejo, K. Zhong, & P. Ravikumar. Binary classification with Karmic, threshold-quasi-concave metrics. In ICML, 2018.
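For contrast with the direct approach developed next, here is a hedged sketch of the plug-in recipe: fit any posterior estimator, then grid-search the threshold $\delta$ on the target metric. The synthetic data, the use of scikit-learn's `LogisticRegression`, and the helper `f1_at_threshold` are our illustrative assumptions, not the cited papers' algorithms:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def f1_at_threshold(eta, y, delta):
    # F1 of the classifier sign(eta(x) - delta), labels in {-1, +1}.
    pred = np.where(eta > delta, 1, -1)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == -1))
    fn = np.sum((pred == -1) & (y == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=500) > 0.8, 1, -1)  # imbalanced labels

clf = LogisticRegression().fit(X, y)
eta = clf.predict_proba(X)[:, 1]                 # estimate of P(Y=+1|x)
deltas = np.linspace(0.05, 0.95, 19)
best = max(deltas, key=lambda d: f1_at_threshold(eta, y, d))
print("delta* =", best, "F1 =", f1_at_threshold(eta, y, best))
```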

SLIDE 14

Outline

■ Introduction
■ Preliminary
▶ Convex Risk Minimization
▶ Plug-in Principle vs. Cost-sensitive Learning
■ Key Idea
▶ Quasi-concave Surrogate
■ Calibration Analysis & Experiments

SLIDE 15

Convexity & Statistical Property

For the risk: $R(f) = \mathbb{E}[\ell(Y f(X))]$ (intractable) → calibration → $R_\phi(f) = \mathbb{E}[\phi(Y f(X))]$ (tractable, convex) → generalization → $\hat{R}_\phi(f) = \dfrac{1}{n}\sum_{i=1}^n \phi(y_i f(x_i))$

For the utility: $U(f) = \dfrac{\mathbb{E}_X[W_0(f(X))]}{\mathbb{E}_X[W_1(f(X))]}$ is intractable.

▶ Q. Can we design an objective that is ① tractable and ② calibrated, so that surrogate maximization recovers the maximizer of $U$, just as $\mathrm{argmin}\, R_\phi = \mathrm{argmin}\, R$ does for the risk?

SLIDE 16

Non-concave, but Quasi-concave

Idea: concave / convex = quasi-concave.

$\dfrac{f(x)}{g(x)}$ is quasi-concave if $f$ is concave, $g$ is convex, and $f(x) \geq 0$, $g(x) > 0$ for all $x$.

(proof) Show that the super-level set $\{x \mid f/g \geq \alpha\}$ is convex for all $\alpha \geq 0$: $\dfrac{f(x)}{g(x)} \geq \alpha \iff f(x) - \alpha g(x) \geq 0$, and $f - \alpha g$ is concave; the super-level set of a concave function is convex, so $\{x \mid f/g \geq \alpha\}$ is convex. ∎

■ quasi-concave ⊋ concave
■ super-level sets are convex ⇒ non-concave, but unimodal ⇒ efficiently optimized
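A quick numeric illustration of the proof idea, checking on a grid that the super-level sets of a concave/convex ratio are intervals (the particular $f$ and $g$ are arbitrary choices satisfying the assumptions):

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 2001)
f = 4.0 - x**2            # concave, >= 0 on [-2, 2]
g = np.exp(x) + 1.0       # convex, > 0
h = f / g                 # should be quasi-concave by the claim above

for alpha in [0.2, 0.5, 1.0]:
    level = np.where(h >= alpha)[0]
    # A convex set on the line is an interval: grid indices are consecutive.
    is_interval = level.size == 0 or np.all(np.diff(level) == 1)
    print(f"alpha={alpha}: super-level set is an interval -> {is_interval}")
```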

SLIDE 17

Surrogate Utility

■ Idea: bound the true utility from below

$U(f) = \dfrac{a_0\,\mathbb{E}_P[\mathbf{1}\{f(X)>0\}] + b_0\,\mathbb{E}_N[\mathbf{1}\{f(X)>0\}] + c_0}{a_1\,\mathbb{E}_P[\mathbf{1}\{f(X)>0\}] + b_1\,\mathbb{E}_N[\mathbf{1}\{f(X)>0\}] + c_1}$

≥ bound the numerator from below and the denominator from above

▶ non-negative sum of concave functions ⇒ concave
▶ non-negative sum of convex functions ⇒ convex

SLIDE 18

Surrogate Utility

■ Idea: bound the true utility from below, replacing each 0/1 indicator with a surrogate loss $\phi$:

$U(f) \geq U_\phi(f) = \dfrac{a_0\,\mathbb{E}_P[1 - \phi(f(X))] + b_0\,\mathbb{E}_N[-\phi(-f(X))] + c_0}{a_1\,\mathbb{E}_P[1 + \phi(f(X))] + b_1\,\mathbb{E}_N[\phi(-f(X))] + c_1} =: \dfrac{\mathbb{E}[W_{0,\phi}]}{\mathbb{E}[W_{1,\phi}]}$

($U_\phi$: surrogate utility)
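A minimal empirical sketch of $U_\phi$, instantiated for F1 with the slide-6 coefficients $(a_0,b_0,c_0)=(2,0,0)$, $(a_1,b_1,c_1)=(1,1,\pi)$. The base-2 logistic loss is one choice of $\phi$ that upper-bounds the 0/1-loss (so $\phi(0) = 1$); the synthetic scores are illustrative assumptions:

```python
import numpy as np

def phi(m):
    # Base-2 logistic loss: an upper bound of the 0/1-loss with phi(0) = 1.
    return np.log2(1.0 + np.exp(-m))

def surrogate_utility(f_vals, y, coeffs):
    (a0, b0, c0), (a1, b1, c1) = coeffs
    n = len(y)
    P, N = f_vals[y == 1], f_vals[y == -1]
    # Sums are divided by n (not class sizes): TP and FP are joint,
    # not class-conditional, expectations (slide 7).
    num = a0 * np.sum(1.0 - phi(P)) / n + b0 * np.sum(-phi(-N)) / n + c0
    den = a1 * np.sum(1.0 + phi(P)) / n + b1 * np.sum(phi(-N)) / n + c1
    return num / den

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=200, p=[0.7, 0.3])
f_vals = 2.0 * y + rng.normal(size=200)     # scores correlated with the labels
pi_hat = np.mean(y == 1)
print(surrogate_utility(f_vals, y, ((2, 0, 0), (1, 1, pi_hat))))
```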

SLIDE 19

Hybrid Optimization Strategy

■ Note: the numerator of $U_\phi$ can be negative
▶ $U_\phi$ isn't quasi-concave if the numerator < 0
▶ so maximize the numerator first (concave), then maximize the fractional form (quasi-concave)

SLIDE 20

Hybrid Optimization Strategy

① maximize the numerator ② maximize the fraction, using the normalized gradient for quasi-concave optimization [Hazan+ NeurIPS2015]

• Hazan, E., Levy, K., & Shalev-Shwartz, S. (2015). Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems (pp. 1594-1602).
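A sketch of the two phases on a toy concave/convex ratio (the objective, step sizes, and finite-difference gradients are illustrative assumptions; the talk's method would apply this to the surrogate utility of slide 18):

```python
import numpy as np

def num(w):  return 1.0 - np.sum((w - 1.0) ** 2)    # concave numerator
def den(w):  return 1.0 + np.sum(w ** 2)            # convex, positive denominator

def grad(fun, w, eps=1e-6):
    # Finite-difference gradient; enough for a toy demo.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (fun(w + e) - fun(w - e)) / (2 * eps)
    return g

w = np.array([-2.0, -2.0])
while num(w) <= 0:                      # phase 1: concave ascent until numerator > 0
    w = w + 0.1 * grad(num, w)
for _ in range(500):                    # phase 2: normalized gradient ascent on the ratio
    g = grad(lambda v: num(v) / den(v), w)
    w = w + 0.05 * g / (np.linalg.norm(g) + 1e-12)
print(w, num(w) / den(w))
```

Normalizing the gradient is the key trick from Hazan et al. (2015): for quasi-concave objectives the gradient direction still points toward the super-level sets even where the magnitude is uninformative.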

SLIDE 21

Outline

■ Introduction
■ Preliminary
▶ Convex Risk Minimization
▶ Plug-in Principle vs. Cost-sensitive Learning
■ Key Idea
▶ Quasi-concave Surrogate
■ Calibration Analysis & Experiments

SLIDE 22

Justify Surrogate Optimization

■ For the classification risk (note: informal) [Bartlett+ 2006]: if $\phi$ is a classification-calibrated loss, then for every sequence $\{f_n\}$,

$R_\phi(f_n) - \inf_f R_\phi(f) \to 0 \implies R(f_n) - \inf_f R(f) \to 0$ as $n \to \infty$

■ For the fractional utility (note: informal), we want the analogous guarantee:

$\sup_f U_\phi(f) - U_\phi(f_n) \to 0 \implies \sup_f U(f) - U(f_n) \to 0$ as $n \to \infty$, for every $\{f_n\}$

▶ Q. What kind of conditions does $\phi$ need to satisfy for this to hold?

• P. L. Bartlett, M. I. Jordan, & J. D. McAuliffe. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.

SLIDE 23

Special Case: F1-measure

Theorem (note: informal). The calibration

$\sup_f U_\phi(f) - U_\phi(f_n) \to 0 \implies \sup_f U(f) - U(f_n) \to 0 \quad \forall \{f_n\}$

holds if $\phi$ satisfies:
▶ $\phi$ is convex
▶ $\phi$ is non-increasing
▶ $\exists c \in (0,1)$ s.t. $\sup_f U_\phi(f) \geq \dfrac{2c}{1-c}$ and $\lim_{m \to +0} \phi'(m) \geq c \lim_{m \to -0} \phi'(m)$ (a gradient discrepancy; such a $\phi$ is non-differentiable at $m=0$)

These conditions are merely sufficient!

■ Example: glue $\phi_{-1}(m) = \log(1 + e^{-m})$ and $\phi_{+1}(m) = \log(1 + e^{-cm})$, so that

$\lim_{m \to +0} \phi'(m) = -\dfrac{c}{2}, \qquad \lim_{m \to -0} \phi'(m) = -\dfrac{1}{2}$

[Figure: the surrogate loss $\phi(m)$, with a kink at $m = 0$]
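The loss used in the next slide's experiment, $\phi(m) = \max\{\log(1+e^{-m}), \log(1+e^{-m/3})\}$, instantiates this with $c = 1/3$. A quick numeric check of the one-sided slopes at $m = 0$:

```python
import numpy as np

def phi(m):
    # c = 1/3: logistic piece active on the left, flattened piece on the right.
    return np.maximum(np.log1p(np.exp(-m)), np.log1p(np.exp(-m / 3.0)))

eps = 1e-6
left  = (phi(0.0) - phi(-eps)) / eps    # one-sided slope from the left
right = (phi(eps) - phi(0.0)) / eps     # one-sided slope from the right
print(left, right)                      # ~ -0.5 and ~ -1/6 = -c/2
```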

SLIDE 24

Experiment: F1-measure

[Figure: F1-measure results]

surrogate loss: $\phi(m) = \max\{\log(1 + e^{-m}),\, \log(1 + e^{-m/3})\}$
model: linear-in-parameter

SLIDE 25

Experiment: Jaccard index

[Figure: Jaccard index results]

surrogate loss: $\phi(m) = \max\{\log(1 + e^{-m}),\, \log(1 + e^{-3m/4})\}$
model: linear-in-parameter

SLIDE 26

■ Goal: maximize a linear-fractional utility

$U(f) = \dfrac{a_0\,\mathrm{TP} + b_0\,\mathrm{FP} + c_0}{a_1\,\mathrm{TP} + b_1\,\mathrm{FP} + c_1}$

■ Tractable Optimization: a surrogate utility (concave numerator / convex denominator = quasi-concave) enables quasi-concave optimization

■ Calibrated Surrogate: if the loss $\phi$ satisfies ① a gradient discrepancy at $m = 0$, ② non-increasing, ③ convex, then

$\mathrm{argmax}_f\, U_\phi(f) = \mathrm{argmax}_f\, U(f)$

■ Open Problems
▶ necessary and sufficient condition of calibration
▶ explicit convergence rate
▶ theoretical comparison with probability estimation