Calibration of Convex Surrogate Losses via Property Elicitation


SLIDE 1

Calibration of Convex Surrogate Losses via Property Elicitation

Jessie Finocchiaro October 10, 2019

SLIDE 2

SLIDE 3

Introduction

Machine learning: making predictions about future events, i.e., about some property of the data.

SLIDE 4

Outline

  • Background
  • Surrogates
  • Calibration
  • Properties
  • Necessary and Sufficient Conditions
  • Case Study: Abstain
  • Dimensionality
  • Conclusion
SLIDE 5

Empirical Risk Minimization (ERM)

Assumption: data comes i.i.d. from a distribution P over 𝒳 Γ— 𝒴. Goal: minimize risk,

Risk(h) = ∫ L(h(x), y) dP(x, y).

Problem: we don't know P.

Settle for minimizing empirical risk. Take n inputs x_i ∈ 𝒳, y_i ∈ 𝒴:

RiskΜ‚(h) = (1/n) Ξ£_{i=1}^{n} L(h(x_i), y_i),

the average loss of the prediction h(x_i), given feature x_i, against the outcome y_i, where (X, Y) ~ P.
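To make ERM concrete, here is a minimal sketch, assuming a squared loss and a one-parameter linear hypothesis class (both hypothetical choices for illustration, not from the talk):

```python
import numpy as np

# Hypothetical setup: squared loss and a one-parameter linear hypothesis class.
def loss(prediction, y):
    return (prediction - y) ** 2

def empirical_risk(h, xs, ys):
    """Average loss of hypothesis h over the n samples (x_i, y_i)."""
    return np.mean([loss(h(x), y) for x, y in zip(xs, ys)])

rng = np.random.default_rng(0)
xs = rng.normal(size=100)
ys = 2.0 * xs + rng.normal(scale=0.1, size=100)  # the unknown P behind the data

# ERM over the toy class {x -> a*x : a in a grid}
grid = np.linspace(-5, 5, 201)
risks = [empirical_risk(lambda x, a=a: a * x, xs, ys) for a in grid]
print("ERM slope:", grid[int(np.argmin(risks))])  # close to the true slope 2
```

Grid search stands in for a real optimizer here; the point is only that we minimize the empirical average because the true risk under P is unavailable.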

SLIDE 6

Conditional Probability

Form a hypothesis h: 𝒳 β†’ β„›. We want to learn h minimizing 𝔼_{(X,Y)~P} L(h(X), Y). If h(x) minimizes 𝔼[L(r, Y) | X = x] over reports r for every x ∈ 𝒳, this is immediate. So abstract h(x) to a report r, and look at the conditional distribution as a point on the simplex, Ξ”_𝒴.

[Figure: a table of outcomes y with Pr[Y = y] and Pr[Y = y | X = 3].]
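A small numeric illustration of this pointwise view, assuming the 3-outcome 0-1 loss as the loss matrix (a hypothetical example): once we condition on x, choosing h(x) is just choosing the report that minimizes expected loss under p.

```python
import numpy as np

# Pointwise view (hypothetical example): fix x, let p_y = Pr[Y = y | X = x],
# and pick the report minimizing expected loss E_{Y~p} L(r, Y).
# Here L is the 3-outcome 0-1 loss matrix, indexed L[r, y].
L = 1 - np.eye(3)

p = np.array([0.2, 0.5, 0.3])          # conditional distribution on the simplex
expected = L @ p                        # expected loss of each report under p
print("best report:", int(np.argmin(expected)))  # 1, the mode of p
```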

SLIDE 7

Setting

Classification-like problems

  β€’ Multiclass classification (with reject option)
  β€’ Ranking
  β€’ Top-k classification

Notation

Finite outcome set 𝒴 with |𝒴| = n. Report set β„›, reports r ∈ β„›. Probability distribution over 𝒴: p ∈ Ξ”_𝒴, with p_y = Pr[Y = y].

SLIDE 8

Outline

  • Background
  • Surrogates
  • Calibration
  • Properties
  • Necessary and Sufficient Conditions
  • Case Study: Abstain
  • Dimensionality
  • Conclusion
SLIDE 9

Surrogates

Loss functions measure error. They are created with a task in mind.

They are often discrete.

Discrete losses are hard to optimize, so we use surrogates, which should approximate the original loss well.

[Figure: surrogates for the 0-1 loss on 𝒴 = {βˆ’1, +1}; vertical axis L(r, 1), horizontal axis report r.]

SLIDE 10

Calibration

A calibrated loss L is a β€œgood approximation” of a discrete loss β„“(r, y). Let β„“ be a discrete loss. A surrogate loss function L: ℝ^d Γ— 𝒴 β†’ ℝ₊ is said to be β„“-calibrated if there exists a link function ψ: ℝ^d β†’ β„› such that, for all p ∈ Ξ”_𝒴,

inf {𝔼_p L(u, Y) : u ∈ ℝ^d, ψ(u) βˆ‰ argmin_r 𝔼_p β„“(r, Y)}  >  inf {𝔼_p L(u, Y) : u ∈ ℝ^d}.

That is, the infimum over reports not linked to the argmin of the discrete loss is strictly worse than the unrestricted infimum.
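As a numeric sanity check (an illustration, not a proof), one can probe this condition for the hinge loss against 0-1 loss on 𝒴 = {βˆ’1, +1} with link ψ(u) = sign(u); a finite grid over u stands in for the infima:

```python
import numpy as np

# Illustration: the calibration gap for hinge loss vs. 0-1 loss on {-1, +1},
# with link psi(u) = sign(u).  A grid over u stands in for the infima.
us = np.linspace(-3, 3, 6001)

def exp_hinge(p1, u):
    """E_{Y~p} L_hinge(u, Y) with p1 = Pr[Y = +1]."""
    return p1 * np.maximum(0.0, 1 - u) + (1 - p1) * np.maximum(0.0, 1 + u)

p1 = 0.7                              # here the 0-1-optimal report is +1
vals = exp_hinge(p1, us)
inf_all = vals.min()                  # unrestricted infimum (about 0.6)
inf_mislinked = vals[us <= 0].min()   # u not linked to the argmin (about 1.0)
print(inf_mislinked > inf_all)        # True: the strict gap the definition asks for
```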

SLIDE 11

Consistency

Let f_n: 𝒳 β†’ ℝ^d be the hypothesis learned from training on n samples drawn i.i.d. from P. Write er_P^L(f) for the expected loss L incurred by predicting f(X) when (X, Y) ~ P.

The sequence f_n is said to be L-consistent if

er_P^L(f_n) β†’ inf_{f: 𝒳 β†’ ℝ^d} er_P^L(f) =: er_P^{L,*}.

Losses are calibrated. Hypotheses are consistent.

SLIDE 12

Relating calibration and consistency

Let β„“: β„› Γ— 𝒴 β†’ ℝ₊ be a discrete loss. A surrogate loss function L: ℝ^d Γ— 𝒴 β†’ ℝ₊ is β„“-calibrated if and only if there exists a link function ψ: ℝ^d β†’ β„› such that, for all distributions P on 𝒳 Γ— 𝒴 and all sequences of surrogate hypotheses f_n: 𝒳 β†’ ℝ^d,

er_P^L(f_n) β†’ er_P^{L,*}  β‡’  er_P^β„“(ψ ∘ f_n) β†’ er_P^{β„“,*}.

Ramaswamy et al. (2015), Theorem 3; originally Tewari and Bartlett (2007), Theorem 2. In words: if the hypotheses converge to optimal for the surrogate, then the linked hypotheses converge to optimal for the discrete loss.

SLIDE 13

Pause

Questions so far? Next, we formalize properties.

We use these to study calibration.

SLIDE 14

Properties

A property Ξ“: Ξ”_𝒴 β†’ β„› is a function mapping probability distributions to reports. (If it's easier, think of p ∈ Ξ”_𝒴 as a conditional probability.) A property Ξ“: Ξ”_𝒴 β†’ β„› is elicitable if there is a loss function L: β„› Γ— 𝒴 β†’ ℝ₊ such that, for all p ∈ Ξ”_𝒴, Ξ“(p) = argmin_r 𝔼_{Y~p} L(r, Y). Here we say the loss L elicits Ξ“. Elicitable properties have convex level sets (Lambert and Shoham 2009), where the level set of a report r is

Ξ“_r = {p ∈ Ξ”_𝒴 : r ∈ Ξ“(p)}.

SLIDE 15

β€œDrawing” a property

π‘œ-simplex in (π‘œ βˆ’ 1)-dimensional space. Example: n = 3

SLIDE 16

Calibration… in terms of properties

A property Ξ“: Ξ”_𝒴 β†’ ℝ^d and link ψ: ℝ^d β†’ β„› are β„“-calibrated if

u_n β†’ Ξ“(p)  β‡’  𝔼_p β„“(ψ(u_n), Y) β†’ min_r 𝔼_p β„“(r, Y),

i.e., the property value can always be linked to the argmin of the loss. This is a tool to study geometric properties of losses eliciting Ξ“.

Definition courtesy of Agarwal and Agarwal (2015).

SLIDE 17

Property Papers

Lambert, Shoham (2009): Eliciting Truthful Answers to Multiple-Choice Questions.

Finite properties are elicitable iff their level sets form a power diagram.

Agarwal, Agarwal (2015): On Consistent Surrogate Risk Minimization and Property Elicitation.

There’s a connection between properties and surrogate losses.

Frongillo, Kash (2015): On Elicitation Complexity.

Every property is elicitable, but the question is how elicitable.

SLIDE 18

Calibrated surrogates

Positive Normal Sets. Necessary Conditions. Sufficient Conditions. Relationship between positive normal sets and level sets of property.

SLIDE 19

Positive Normal Sets

Finite-outcome setting: rewrite L: β„› β†’ ℝ^n as the vector of loss values should each outcome occur. By linearity of expectation, 𝔼_p L(r, Y) = ⟨p, L(r)⟩.

The positive normal set of an outcome vector u collects the distributions under which u's expected loss attains the lowest expected loss possible:

𝒩_L(u) = {p ∈ Ξ”_𝒴 : ⟨p, u⟩ = inf_r ⟨p, L(r)⟩}.

(For surrogates the definition is stated with sequences: a sequence of outcome vectors whose expected loss converges to the infimum.)
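For a finite discrete loss, membership in a positive normal set is directly checkable from the loss matrix. A minimal sketch, assuming the 3-outcome 0-1 loss as the example:

```python
import numpy as np

# Sketch for a finite discrete loss: p lies in the positive normal set N_L(u)
# when the expected loss <p, u> matches the best achievable expected loss over
# all loss vectors L(r).  Rows of L below are the 0-1 loss vectors L(r).
L = 1 - np.eye(3)

def in_positive_normal_set(p, u, tol=1e-12):
    return np.dot(p, u) <= (L @ p).min() + tol

print(in_positive_normal_set(np.array([0.5, 0.3, 0.2]), L[0]))  # True
print(in_positive_normal_set(np.array([0.2, 0.5, 0.3]), L[0]))  # False
```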

SLIDE 20

Necessary Condition

Let β„“ be a discrete loss and let L be β„“-calibrated. Let Ξ³ be the property elicited by β„“. Then for all u ∈ 𝒮_L = conv(im L), there exists an r ∈ β„› such that 𝒩_L(u) βŠ† Ξ³_r.

Ramaswamy et al. (2015), Theorem 6.

SLIDE 21

Sufficient Condition

Suppose there exists some finite set of points u_i ∈ 𝒮_L such that βˆͺ_i 𝒩_L(u_i) = Ξ”_𝒴, and for each i there exists an r_i ∈ β„› such that 𝒩_L(u_i) βŠ† Ξ³_{r_i}. Then L is β„“-calibrated.

Ramaswamy et al. (2015), Theorem 8.

Example: 0-1 loss and hinge. Ordering loss vectors over outcomes (βˆ’1, +1) and writing p₁ = Pr[Y = +1], the hinge loss vectors at the reports u = +1 and u = βˆ’1 are (2, 0) and (0, 2), and

𝒩_hinge((2, 0)) = {p ∈ Ξ”_2 : p₁ β‰₯ 1/2} βŠ† Ξ³_{+1}
𝒩_hinge((0, 2)) = {p ∈ Ξ”_2 : p₁ ≀ 1/2} βŠ† Ξ³_{βˆ’1},

overlapping at p₁ = 1/2. Together the two sets cover Ξ”_2, so hinge is calibrated for 0-1 loss.
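A numeric version of the hinge example (an illustration under a grid approximation of the infimum, not a proof):

```python
import numpy as np

# Sketch verifying the hinge example: loss vectors over outcomes (y=-1, y=+1)
# are L(u) = ((1+u)_+, (1-u)_+), so L(+1) = (2,0) and L(-1) = (0,2).
# A grid over u stands in for the infimum in the positive-normal-set definition.
us = np.linspace(-4, 4, 8001)
loss_vecs = np.stack([np.maximum(0, 1 + us), np.maximum(0, 1 - us)], axis=1)

def in_N(p, u_vec, tol=1e-9):
    return np.dot(p, u_vec) <= (loss_vecs @ p).min() + tol

for p1 in [0.3, 0.5, 0.8]:             # p1 = Pr[Y = +1]
    p = np.array([1 - p1, p1])
    print(p1, in_N(p, np.array([2.0, 0.0])), in_N(p, np.array([0.0, 2.0])))
# (2,0) contains exactly p1 >= 1/2; (0,2) contains exactly p1 <= 1/2.
```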

SLIDE 22

Outline

  • Background
  • Surrogates
  • Calibration
  • Properties
  • Necessary and Sufficient Conditions
  • Case Study: Abstain
  • Dimensionality
  • Conclusion
SLIDE 23

Case Study: Abstain

Situations where the cost of misclassification is high.

College admissions. Medical diagnoses.

Discrete loss for this problem:

β„“_{1/2}(r, y) = 0 if r = y;  1/2 if r = βŠ₯;  1 if r β‰  y and r β‰  βŠ₯.
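A minimal sketch of this loss and its Bayes-optimal report; `BOT` is our own placeholder name for βŠ₯. Reporting the mode beats abstaining exactly when the mode's mass exceeds 1/2 (expected loss 1 βˆ’ max_y p_y vs. 1/2):

```python
import numpy as np

BOT = "abstain"  # our placeholder for the reject report

def abstain_loss(r, y):
    """The l_{1/2} discrete loss: 0 if correct, 1/2 to abstain, 1 if wrong."""
    if r == BOT:
        return 0.5
    return 0.0 if r == y else 1.0

def bayes_report(p):
    """Minimize E_{Y~p} l_{1/2}(r, Y): report the mode if it has mass > 1/2,
    otherwise abstain (both are optimal at exactly 1/2)."""
    return int(np.argmax(p)) if p.max() > 0.5 else BOT

print(bayes_report(np.array([0.7, 0.1, 0.1, 0.1])))      # 0
print(bayes_report(np.array([0.25, 0.25, 0.25, 0.25])))  # abstain
```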

SLIDE 24

Historical calibrated surrogates

  • Crammer Singer (2001)

𝑀𝐷𝑇 𝑣, 𝑧 = 1 + max

π‘˜β‰ π‘§ π‘£π‘˜ βˆ’ 𝑣𝑧

+

πœ”π·π‘‡ 𝑣 = α‰Šπ‘π‘ π‘•π‘›π‘π‘¦1β‰€π‘—β‰€π‘œπ‘£π‘—, 𝑣 1 βˆ’ 𝑣 2 > 𝜐 βŠ₯, π‘π‘’β„Žπ‘“π‘ π‘₯𝑗𝑑𝑓

  • One vs All (Rafkin, Klateau 2004)

𝑀𝑃𝑀𝐡 𝑣, 𝑧 = βˆ‘π•{𝑧 = 𝑗} 1 βˆ’ 𝑣𝑗 + + 𝕁 𝑧 β‰  𝑗 1 + 𝑣𝑗 + πœ”π‘ƒπ‘€π΅ 𝑣 = ࡝ 𝑏𝑠𝑕𝑛𝑏𝑦1β‰€π‘—β‰€π‘œπ‘£π‘—, max

π‘˜

π‘£π‘˜ > 𝜐 βŠ₯, π‘π‘’β„Žπ‘“π‘ π‘₯𝑗𝑑𝑓

SLIDE 25

BEP Surrogate and Link

Ramaswamy et al. (2018), Section 4. Example distributions: p = (1/4, 1/4, 1/4, 1/4) and p = (7/10, 1/10, 1/10, 1/10).

SLIDE 26

Why BEP?

BEP, CS, and OvA are all β„“_{1/2}-calibrated for the abstain loss.

A convex surrogate takes reports u ∈ ℝ^d.

BEP: d = ⌈logβ‚‚(n)βŒ‰. CS and OvA: d = n.

Why does the dimension d matter?

SLIDE 27

Outline

  • Background
  • Surrogates
  • Calibration
  • Properties
  • Necessary and Sufficient Conditions
  • Case Study: Abstain
  • Dimensionality
  • Conclusion
SLIDE 28

Dimensionality

Want algorithms that are efficient and accurate.

Reducing the dimension makes the optimization problem more efficient. Calibration guarantees accuracy.

SLIDE 29

Elicitation Complexity

Elic(Ξ“) = the minimum dimension d such that Ξ“ is elicitable by a d-dimensional loss. Maybe Ξ“ isn't itself 1-elicitable, but it can be computed as g ∘ Γ̃, where Γ̃ is 1-elicitable.

Then Elic(Ξ“) = 1. This is called indirect elicitation.

Example: Ξ“(p) = (𝔼_p[Y])Β².

Elicit 𝔼_p[Y] and take g: x ↦ xΒ².

SLIDE 30

Convex Calibration Dimension

A special case of elicitation complexity. ccdim(β„“) = the minimum dimension d such that there is a convex β„“-calibrated surrogate L: ℝ^d Γ— 𝒴 β†’ ℝ.

Example: ccdim(β„“_{1/2}) ≀ ⌈logβ‚‚(n)βŒ‰, via the BEP surrogate.

From Ramaswamy et al. (2015), Definition 10.

SLIDE 31

Bounds on CC dimension

Understood through the feasible subspace dimension. There is a tight bound for properties where all the level sets intersect in the interior of the simplex: ccdim(β„“) = n βˆ’ 1.

This does not apply to abstain.

These results are from Ramaswamy et al. (2015).

SLIDE 32

Feasible Subspace Dimension

The feasible subspace dimension Ξ½_π’ž(p) of a convex set π’ž at a point p ∈ π’ž is the dimension of β„±_π’ž(p) ∩ βˆ’β„±_π’ž(p).

β„±_π’ž(p) is the cone of feasible directions of π’ž at p.

Essentially: the dimension of the smallest face of π’ž containing p.
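For the special case π’ž = Ξ”_𝒴, which is the case the lower bound below uses, the smallest face containing p is spanned by p's support, so Ξ½_π’ž(p) = β€–pβ€–β‚€ βˆ’ 1. A one-function sketch under that assumption:

```python
import numpy as np

# Sketch for the special case C = the probability simplex: the smallest face
# containing p is spanned by p's support, so nu_C(p) = ||p||_0 - 1.
def nu_simplex(p, tol=1e-12):
    return int((np.asarray(p) > tol).sum()) - 1

print(nu_simplex([0.5, 0.5, 0.0]))   # 1: an edge of the simplex
print(nu_simplex([1.0, 0.0, 0.0]))   # 0: a vertex
print(nu_simplex([0.2, 0.3, 0.5]))   # 2: the interior of Delta_3
```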

SLIDE 33

Feasible Subspace Dimension

SLIDE 34

Feasible Subspace Dimension

SLIDE 35

Lower bound on CC Dimension

Let β„“ be the discrete loss eliciting the finite property Ξ³: Ξ”_𝒴 β†’ β„›. For p ∈ Ξ”_𝒴 and r such that p ∈ Ξ³_r,

ccdim(β„“) β‰₯ β€–pβ€–β‚€ βˆ’ Ξ½_{Ξ³_r}(p) βˆ’ 1.

Ramaswamy et al. (2016), Theorem 16.

SLIDE 36

Proof Sketch

Consider π‘ž ∈ π‘ π‘“π‘šπ‘—π‘œπ‘’(Δ𝒡). Other case is a reduction to this. Suppose surrogate 𝑀: ℝ𝑒 β†’ ℝ+

π‘œ is β„“-calibrated.

Want to show 𝑒 β‰₯ ||π‘ž||0 βˆ’ πœˆπ›Ώπ‘  π‘ž βˆ’ 1.

Want to show there is β„‹ βŠ† Δ𝒡 and 𝑠′ ∈ β„› so that three things are true:

  • 1. π‘ž ∈ β„‹
  • 2. πœˆβ„‹ π‘ž = π‘œ βˆ’ 𝑒 βˆ’ 1
  • 3. β„‹ βŠ† 𝛿𝑠′

π‘ž ∈ 𝛿𝑠′ πœˆπ›Ώπ‘  π‘ž = πœˆπ›Ώπ‘ β€² π‘ž β‰₯ πœˆβ„‹ π‘ž = π‘œ βˆ’ 𝑒 βˆ’ 1

SLIDE 37

Proof Sketch: Conditions 1 and 2

Construct β„‹ and r'. Consider u ∈ ℝ^d (a sequence of points, if needed) approaching

inf_{z ∈ 𝒮_L} ⟨p, z⟩ = inf_{u' ∈ ℝ^d} ⟨p, L(u')⟩.

Claim: there exist w_y ∈ βˆ‚L_y(u) for all y ∈ 𝒴 such that Ξ£_y p_y w_y = 0. Set A = [w_1, w_2, …, w_n] ∈ ℝ^{d Γ— n} and

β„‹ = {q ∈ Ξ”_𝒴 : Aq = 0}.

Then Ξ½_β„‹(p) equals the nullity of A augmented with the all-ones row πŸ™α΅€, which is β‰₯ n βˆ’ d βˆ’ 1, by Ramaswamy et al., Lemma 15.

  β€’ Conditions 1 and 2 are satisfied.
SLIDE 38

Condition 3 Intuition

We construct β„‹ using sequences that converge to u; the Ο΅-subdifferential is used to pass to the infimum. They show q ∈ β„‹ β‡’ q ∈ 𝒩_L(u) with a fair amount of arithmetic.

The result follows since 𝒩_L(u) βŠ† Ξ³_{r'} for some r' ∈ β„› (the necessary condition above): β€œLet β„“ be the discrete loss eliciting the finite property Ξ³: Ξ”_𝒴 β†’ β„›. For p ∈ Ξ”_𝒴 and r such that p ∈ Ξ³_r, ccdim(β„“) β‰₯ β€–pβ€–β‚€ βˆ’ Ξ½_{Ξ³_r}(p) βˆ’ 1.”

SLIDE 39

Mode: CC dimension tight bound

Trivial upper bound: ccdim(β„“_{0-1}) ≀ n βˆ’ 1. Previous result: for p ∈ Ξ³_r, ccdim(β„“) β‰₯ β€–pβ€–β‚€ βˆ’ Ξ½_{Ξ³_r}(p) βˆ’ 1. Take p ∈ ∩_r Ξ³_r: for the mode this is the uniform distribution, where β€–pβ€–β‚€ = n and Ξ½_{Ξ³_r}(p) = 0, so ccdim(β„“_{0-1}) β‰₯ n βˆ’ 0 βˆ’ 1. Thus the bound is tight: ccdim(β„“_{0-1}) = n βˆ’ 1.

SLIDE 40

Abstain: Unknown bounds

ccdim β‰₯ 3 βˆ’ 1 βˆ’ 1 = 1.  ccdim β‰₯ 3 βˆ’ 2 βˆ’ 1 = 0.  ccdim β‰₯ 2 βˆ’ 0 βˆ’ 1 = 1.

With abstain, ∩_r Ξ³_r = βˆ…, so the tight lower bound above does not apply.

SLIDE 41

Calibrated surrogate papers

Ramaswamy, Agarwal (2015): Convex calibration dimension for multiclass loss matrices.

Reducing surrogate dimension is important. Here are some bounds on dimension of calibrated convex surrogate losses.

Ramaswamy et al. (2018): Consistent Algorithms for Multiclass classification with a reject option.

Abstain is cool, and here's a ⌈logβ‚‚(n)βŒ‰-dimensional consistent convex surrogate.

SLIDE 42

Other discrete examples

Ramaswamy et al. (2015): Convex calibrated surrogates for hierarchical classification.

Here’s a calibrated surrogate for this problem using abstain surrogate.

Lapin et al. (2015): Analysis and Optimization of Loss Functions for Multiclass, Top-k, and Multilabel Classification.

Here are some surrogates for these problems. Not sure if they’re calibrated.

Yu, Blaschko (2015): The LovΓ‘sz Hinge: A Novel Convex Surrogate for Submodular Losses.

Here’s how you go from submodular functions to convex surrogates.

SLIDE 43

Future Work

Embedding Dimension. Constructing Link Functions. Approximate Elicitation.

SLIDE 44

Summary

β„“_{1/2}(r, y) = 0 if r = y;  1/2 if r = βŠ₯;  1 if r β‰  y and r β‰  βŠ₯.

Thank you, questions?