Calibrated Surrogate Maximization of Linear-fractional Utility
07th Feb.
Han Bao (The University of Tokyo / RIKEN AIP)

■ Is accuracy appropriate?
[Figure: two different confusion matrices, both with accuracy 0.8]
The same accuracy may cause severe issues! (e.g. in medical diagnosis)
■ Our focus: binary classification
[Figure: confusion matrices over positive/negative classes]
F-measure
■ Usual empirical risk minimization (ERM)
▶ training: minimize the 0/1-error; evaluation: Acc = TP + TN = 1 − (0/1-risk) → compatible
▶ training: minimize the 0/1-error; evaluation: F1 = 2TP / (2TP + FP + FN) → incompatible
■ Training with accuracy but evaluating with F1 is a mismatch
▶ instead, make training match evaluation: optimize F1 = 2TP / (2TP + FP + FN) directly → compatible (see the sketch below)
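The mismatch is easy to reproduce numerically. Below is a minimal sketch (with hypothetical confusion-matrix counts, not the slide's numbers) showing how accuracy and F1 can disagree on imbalanced data:

```python
def accuracy(tp, fp, fn, tn):
    """Accuracy = (TP + TN) / total: rewards correct negatives too."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1(tp, fp, fn, tn):
    """F1 = 2TP / (2TP + FP + FN): ignores TN entirely."""
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical imbalanced sample: 90 negatives, 10 positives.
# A classifier that predicts "negative" for everyone:
tp, fp, fn, tn = 0, 0, 10, 90
print(accuracy(tp, fp, fn, tn))  # 0.9  (looks fine)
print(f1(tp, fp, fn, tn))        # 0.0  (useless for the positive class)
```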
Direct Optimization
■ Why not? Metrics of interest:
▶ Accuracy
▶ F-measure
▶ Jaccard index
▶ Weighted Accuracy
▶ Balanced Error Rate
▶ Gower-Legendre index
▶ Fowlkes-Mallows index
▶ Matthews Correlation Coefficient
TN = ℙ(Y = −1) − FP,  FN = ℙ(Y = +1) − TP
Note:
■ TP and FP are expectations of the 0/1-loss
▶ e.g. such metrics take the linear-fractional form
U( f ) = (a0 TP + b0 FP + c0) / (a1 TP + b1 FP + c1)
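As an illustration (my own worked example, using the identities above with pi_pos = ℙ(Y = +1)), F1 and the Jaccard index reduce to this linear-fractional form in (TP, FP):

```python
# F1 and Jaccard as linear-fractional functions of the rates (TP, FP),
# with coefficients (a0, b0, c0, a1, b1, c1); derived by substituting
# TN = pi_neg - FP and FN = pi_pos - TP into the usual definitions.

def linear_fractional(tp, fp, coef):
    a0, b0, c0, a1, b1, c1 = coef
    return (a0 * tp + b0 * fp + c0) / (a1 * tp + b1 * fp + c1)

def f1_coef(pi_pos):
    # F1 = 2TP / (2TP + FP + FN) = 2TP / (TP + FP + pi_pos)
    return (2, 0, 0, 1, 1, pi_pos)

def jaccard_coef(pi_pos):
    # Jaccard = TP / (TP + FP + FN) = TP / (FP + pi_pos)
    return (1, 0, 0, 0, 1, pi_pos)

pi_pos = 0.3        # P(Y = +1), the positive class prior
tp, fp = 0.2, 0.1   # example rates (tp <= pi_pos by definition)
print(linear_fractional(tp, fp, f1_coef(pi_pos)))      # 0.666... (F1)
print(linear_fractional(tp, fp, jaccard_coef(pi_pos))) # 0.5      (Jaccard)
```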
■ Goal: given a metric U (utility) and a labeled sample, obtain a classifier f : 𝒳 → ℝ s.t. U( f ) = sup_{f′} U( f′)
▶ without estimating the class-posterior probability
■ Introduction
■ Preliminary
▶ Convex Risk Minimization
▶ Plug-in Principle vs. Cost-sensitive Learning
■ Key Idea
▶ Quasi-concave Surrogate
■ Calibration Analysis & Experiments
■ Goal of classification: maximize accuracy, i.e., minimize the empirical 0/1-risk (1/n) Σ_{i=1}^n ℓ01(yi f(xi))
[Figure: 0/1-loss, logistic loss, and hinge loss against the margin m = yi f(xi); m > 0: classified correctly, m < 0: classified incorrectly]
▶ idea: make the 0/1-loss smoother
(Empirical) Surrogate Risk
■ Examples of φ (each convex in f), see the sketch below:
▶ logistic loss
▶ hinge loss → SVM
▶ exponential loss → AdaBoost
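A minimal sketch of these surrogates as functions of the margin (the base-2 logistic is my normalization so that all three upper-bound the 0/1-loss; the talk may use the natural log):

```python
import numpy as np

def zero_one(m):
    return (m <= 0).astype(float)    # 1 iff classified incorrectly

def logistic(m):
    return np.log2(1 + np.exp(-m))   # smooth, convex

def hinge(m):
    return np.maximum(0.0, 1 - m)    # SVM

def exponential(m):
    return np.exp(-m)                # AdaBoost

m = np.linspace(-2, 2, 5)            # margins m = y * f(x)
for loss in (zero_one, logistic, hinge, exponential):
    print(loss.__name__, np.round(loss(m), 3))
```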
■ Minimize the classification risk (= 1 − Accuracy)
R( f ) = 𝔼[ℓ01(Y f(X))], where the 0/1-loss of the prediction margin represents whether X is correctly classified by f
■ The surrogate loss φ, a differentiable upper bound of the 0/1-loss, makes minimization tractable
Rφ( f ) = 𝔼[φ(Y f(X))] (surrogate risk)
■ Sample approximation (M-estimation) gives what we actually minimize
R̂φ( f ) = (1/n) Σ_{i=1}^n φ(yi f(xi)) (empirical surrogate risk)
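A minimal sketch of the empirical surrogate risk R̂φ for a linear scorer f(x) = ⟨w, x⟩ (the toy data and the base-2 logistic surrogate are my assumptions):

```python
import numpy as np

def empirical_surrogate_risk(w, X, y, phi):
    """(1/n) * sum_i phi(y_i * f(x_i)) with f(x) = X @ w."""
    return np.mean(phi(y * (X @ w)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # toy features
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))    # labels in {-1, +1}
w = np.zeros(3)

logistic = lambda m: np.log2(1 + np.exp(-m))
print(empirical_surrogate_risk(w, X, y, logistic))   # 1.0 at w = 0
```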
[Diagram: empirical surrogate risk → (generalize) → surrogate risk Rφ (tractable, convex) vs. 0/1-risk R (intractable)]
■ Question: does argmin Rφ = argmin R?
Theorem (informal) [Bartlett+ 2006]. Assume φ is convex. Then argminf Rφ( f ) = argminf R( f ) iff φ′(0) < 0.
Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138-156.
■ Classifier based on the class-posterior probability [Koyejo+ NIPS2014; Yan+ ICML2018]
▶ Bayes-optimal classifier (accuracy): sign(ℙ(Y = +1|x) − 1/2)
▶ Bayes-optimal classifier (general case): sign(ℙ(Y = +1|x) − δ*)
[Figure: thresholding ℙ(Y = +1 ∣ X) at 1/2 vs. at δ* to separate Y = +1 from Y = −1]
→ estimate ℙ(Y = +1|x) and δ* independently
Koyejo, O., Natarajan, N., Ravikumar, P., & Dhillon, I. S. (2014). Consistent binary classification with generalized performance metrics. In NIPS, 2014.
Yan, B., Koyejo, O., Zhong, K., & Ravikumar, P. (2018). Binary classification with Karmic, threshold-quasi-concave metrics. In ICML, 2018.
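A minimal sketch of this plug-in principle (an assumed setup for illustration, not the cited papers' exact procedure): estimate ℙ(Y = +1|x) with logistic regression, then choose the threshold δ* that maximizes F1 on held-out data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(size=400) > 1.0).astype(int)  # imbalanced labels

model = LogisticRegression().fit(X[:200], y[:200])
proba = model.predict_proba(X[200:])[:, 1]  # estimated P(Y=+1|x)

# Search delta* over a grid, independently of the posterior estimate.
deltas = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y[200:], (proba >= d).astype(int)) for d in deltas]
print("best delta*:", deltas[int(np.argmax(scores))])
```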
■ Introduction
■ Preliminary
▶ Convex Risk Minimization
▶ Plug-in Principle vs. Cost-sensitive Learning
■ Key Idea
▶ Quasi-concave Surrogate
■ Calibration Analysis & Experiments
■ Same recipe for utilities: ① replace the intractable utility by a tractable surrogate, and ② prove calibration so that, as with argmin Rφ = argmin R for risks, maximizing the surrogate maximizes the true utility
■ Idea: concave / convex = quasi-concave
▶ f(x)/g(x) is quasi-concave if f is concave, g is convex, f(x) ≥ 0, and g(x) > 0 for all x
▶ quasi-concave ⟺ all super-level sets are convex: non-concave, but unimodal → efficiently optimized
(proof) Show {x | f/g ≥ α} is convex for all α ≥ 0. Since f(x)/g(x) ≥ α ⟺ f(x) − αg(x) ≥ 0 and f − αg is concave, {x | f/g ≥ α} is a super-level set of a concave function, hence convex. ∎ (see the numeric check below)
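A quick numeric check (my illustrative example, not from the talk): a nonnegative concave numerator over a positive convex denominator gives a unimodal ratio.

```python
import numpy as np

x = np.linspace(-1.0, 3.0, 401)
f = -(x - 1.0) ** 2 + 4.0    # concave, >= 0 on [-1, 3]
g = x ** 2 + 1.0             # convex, > 0 everywhere
r = f / g                    # quasi-concave by the statement above

d = np.diff(r)
peak = int(np.argmax(r))
print(np.all(d[:peak] > 0) and np.all(d[peak:] < 0))  # True: rises, then falls
```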
■ Idea: bound the true utility from below
U( f ) = (a0 TP + b0 FP + c0) / (a1 TP + b1 FP + c1)
▶ bound the numerator from below and the denominator from above, replacing the 0/1-loss with a surrogate loss φ(m)
[Figure: concave lower bounds for the numerator and convex upper bounds for the denominator]
▶ non-negative sum of concave → concave; non-negative sum of convex → convex (sketch below)
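A minimal sketch (my own illustrative construction following this recipe, not necessarily the talk's exact surrogate) for F1 = 2TP / (TP + FP + pi_pos): lower-bound TP in the numerator and upper-bound TP and FP in the denominator via a surrogate loss φ ≥ 0/1-loss, so the result is concave-over-convex.

```python
import numpy as np

def phi(m):
    """Convex, decreasing upper bound of the 0/1-loss."""
    return np.log2(1 + np.exp(-m))

def surrogate_f1(w, X, y, pi_pos):
    f = X @ w
    pos, neg = (y == 1), (y == -1)
    tp_low = np.mean(pos * (1 - phi(f)))   # TP >= this (concave in w)
    tp_up  = np.mean(pos * phi(-f))        # TP <= this (convex in w)
    fp_up  = np.mean(neg * phi(-f))        # FP <= this (convex in w)
    # Lower bound of F1 whenever the numerator is non-negative.
    return 2 * tp_low / (tp_up + fp_up + pi_pos)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0.8, 1, -1)          # imbalanced toy labels
print(surrogate_f1(np.array([1.0, 0, 0]), X, y, np.mean(y == 1)))
```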
■ Note: the numerator can be negative, in which case the ratio is not yet quasi-concave
▶ maximize the numerator first (a concave problem), then maximize the fraction
▶ use the normalized gradient for quasi-concave optimization [Hazan+ NeurIPS2015]
Hazan, E., Levy, K., & Shalev-Shwartz, S. (2015). Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems (pp. 1594-1602).
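A minimal sketch of normalized gradient ascent for a quasi-concave objective, in the spirit of [Hazan+ NeurIPS2015]; `grad_U` is a hypothetical gradient oracle (e.g. the gradient of `surrogate_f1` above).

```python
import numpy as np

def normalized_gradient_ascent(grad_U, w0, lr=0.01, steps=500):
    w = w0.astype(float).copy()
    for _ in range(steps):
        g = grad_U(w)
        norm = np.linalg.norm(g)
        if norm < 1e-12:              # (near-)stationary point
            break
        w += lr * g / norm            # only the gradient direction matters
    return w

# Toy usage: maximize the unimodal ratio r(w) = (-(w-1)^2 + 4) / (w^2 + 1);
# the quotient rule gives the closed-form derivative below.
def grad_r(w):
    v = w[0]
    num = -2 * (v - 1) * (v**2 + 1) - (-(v - 1)**2 + 4) * 2 * v
    return np.array([num / (v**2 + 1)**2])

print(normalized_gradient_ascent(grad_r, np.array([3.0])))  # ~ -2 + sqrt(5)
```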
■ Introduction
■ Preliminary
▶ Convex Risk Minimization
▶ Plug-in Principle vs. Cost-sensitive Learning
■ Key Idea
▶ Quasi-concave Surrogate
■ Calibration Analysis & Experiments
■ For the classification risk (informal) [Bartlett+ 2006]: if φ is a classification-calibrated loss, then for all {fn},
Rφ( fn) − inf Rφ → 0 (n → ∞) ⟹ R( fn) − inf R → 0 (n → ∞)
(surrogate risk → classification risk)
■ For the fractional utility (informal):
true utility: U( f ) = 𝔼X[W0( f(X))] / 𝔼X[W1( f(X))]
surrogate utility: Uφ( f ) = 𝔼X[W0,φ( f(X))] / 𝔼X[W1,φ( f(X))]
Q. What conditions does φ need so that, for all {fn},
Uφ( fn) → sup Uφ (n → ∞) ⟹ U( fn) → sup U (n → ∞)?
Theorem (informal). Suppose φ satisfies
▶ φ is convex,
▶ φ is non-increasing,
▶ ∃c ∈ (0,1) s.t. supf Uφ( f ) ≥ 2c / (1 − c) and lim_{m→+0} φ′(m) ≥ c · lim_{m→−0} φ′(m).
Then Uφ( fn) → sup Uφ (n → ∞) ⟹ U( fn) → sup U (n → ∞) for all {fn}.
■ Example: an asymmetric logistic loss, pieced together at m = 0
φ−1(m) = log(1 + e^{−m}) (for m < 0)
φ+1(m) = log(1 + e^{−cm}) (for m > 0)
lim_{m→+0} φ′(m) = −c/2,  lim_{m→−0} φ′(m) = −1/2, so the gradient condition holds (with equality)
[Figure: plot of φ(m); non-differentiable at m = 0]
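A quick numeric check of the example's one-sided derivative limits (illustrative; c = 0.5 is my choice):

```python
import numpy as np

c = 0.5
phi_neg = lambda m: np.log(1 + np.exp(-m))      # branch for m < 0
phi_pos = lambda m: np.log(1 + np.exp(-c * m))  # branch for m > 0

eps = 1e-6
right = (phi_pos(eps) - phi_pos(0.0)) / eps     # -> -c/2 = -0.25
left  = (phi_neg(0.0) - phi_neg(-eps)) / eps    # -> -1/2 = -0.5
print(right, left)
print(right >= c * left)   # gradient condition phi'(+0) >= c * phi'(-0): True
```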
[Experimental results: F1-measure; surrogate loss with a linear-in-parameter model]
[Experimental results: Jaccard index; surrogate loss with a linear-in-parameter model]
■ Goal
■ Tractable Optimization
■ Calibrated Surrogate
■ Open Problems
▶ necessary and sufficient condition of calibration
▶ explicit convergence rate
▶ theoretical comparison with probability estimation
Summary: maximize the linear-fractional utility
U( f ) = (a0 TP + b0 FP + c0) / (a1 TP + b1 FP + c1)
▶ surrogate utility: concave / convex = quasi-concave → quasi-concave optimization
▶ if the surrogate loss φ(m) is ① decreasing, ② satisfies the gradient-discrepancy condition, and ③ is convex, then
argmaxf Uφ( f ) = argmaxf U( f )