Calibration of Convex Surrogate Losses via Property Elicitation


SLIDE 1

Calibration of Convex Surrogate Losses via Property Elicitation

Jessie Finocchiaro October 10, 2019

SLIDE 2

SLIDE 3

Introduction

Machine learning: making predictions about future events, i.e., about some property of the data.

SLIDE 4

Outline

  • Background
  • Surrogates
  • Calibration
  • Properties
  • Necessary and Sufficient Conditions
  • Case Study: Abstain
  • Dimensionality
  • Conclusion
SLIDE 5

Empirical Risk Minimization (ERM)

Assumption: data comes i.i.d. from a distribution P over 𝒳 Γ— 𝒴. Goal: minimize risk,

Risk(h) = ∫ L(h(x), y) dP(x, y).

Problem: we don't know P.

Settle for minimizing empirical risk. Take n inputs x_i ∈ 𝒳, y_i ∈ 𝒴:

RiskΜ‚(h) = (1/n) Ξ£_{i=1}^{n} L(h(x_i), y_i),

the average loss of the prediction h(x_i), given feature x_i, against the outcome y_i, where (X, Y) ~ P.
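To make ERM concrete, here is a minimal sketch, assuming a squared loss and a one-parameter linear hypothesis class (both hypothetical choices for illustration, not from the talk):

```python
import numpy as np

# Hypothetical setup: squared loss and a one-parameter linear hypothesis class.
def loss(prediction, y):
    return (prediction - y) ** 2

def empirical_risk(h, xs, ys):
    """Average loss of hypothesis h over the n samples (x_i, y_i)."""
    return np.mean([loss(h(x), y) for x, y in zip(xs, ys)])

rng = np.random.default_rng(0)
xs = rng.normal(size=100)
ys = 2.0 * xs + rng.normal(scale=0.1, size=100)  # the unknown P behind the data

# ERM over the toy class {x -> a*x : a in a grid}
grid = np.linspace(-5, 5, 201)
risks = [empirical_risk(lambda x, a=a: a * x, xs, ys) for a in grid]
print("ERM slope:", grid[int(np.argmin(risks))])  # close to the true slope 2
```

Grid search stands in for a real optimizer here; the point is only that we minimize the empirical average because the true risk under P is unavailable.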

SLIDE 6

Conditional Probability

Form a hypothesis h: 𝒳 β†’ β„›. We want to learn h minimizing 𝔼_{(X,Y)~P} L(h(X), Y). If h(x) minimizes 𝔼[L(r, Y) | X = x] over reports r for every x ∈ 𝒳, this is immediate. So abstract h(x) to a report r, and look at the conditional distribution as a point on the simplex, Ξ”_𝒴.

[Figure: a table of outcomes y with Pr[Y = y] and Pr[Y = y | X = 3].]
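A small numeric illustration of this pointwise view, assuming the 3-outcome 0-1 loss as the loss matrix (a hypothetical example): once we condition on x, choosing h(x) is just choosing the report that minimizes expected loss under p.

```python
import numpy as np

# Pointwise view (hypothetical example): fix x, let p_y = Pr[Y = y | X = x],
# and pick the report minimizing expected loss E_{Y~p} L(r, Y).
# Here L is the 3-outcome 0-1 loss matrix, indexed L[r, y].
L = 1 - np.eye(3)

p = np.array([0.2, 0.5, 0.3])          # conditional distribution on the simplex
expected = L @ p                        # expected loss of each report under p
print("best report:", int(np.argmin(expected)))  # 1, the mode of p
```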

SLIDE 7

Setting

Classification-like problems

  β€’ Multiclass classification (with reject option)
  β€’ Ranking
  β€’ Top-k classification

Notation

Finite outcome set 𝒴 with |𝒴| = n. Report set β„›, reports r ∈ β„›. Probability distribution over 𝒴: p ∈ Ξ”_𝒴, with p_y = Pr[Y = y].

SLIDE 8

Outline

  • Background
  • Surrogates
  • Calibration
  • Properties
  • Necessary and Sufficient Conditions
  • Case Study: Abstain
  • Dimensionality
  • Conclusion
SLIDE 9

Surrogates

Loss functions measure error. They are created with a task in mind.

They are often discrete.

Discrete losses are hard to optimize, so we use surrogates, which should approximate the original loss well.

[Figure: surrogates for the 0-1 loss on 𝒴 = {βˆ’1, +1}; vertical axis L(r, 1), horizontal axis report r.]

SLIDE 10

Calibration

A calibrated loss L is a β€œgood approximation” of a discrete loss β„“(r, y). Let β„“ be a discrete loss. A surrogate loss function L: ℝ^d Γ— 𝒴 β†’ ℝ₊ is said to be β„“-calibrated if there exists a link function ψ: ℝ^d β†’ β„› such that, for all p ∈ Ξ”_𝒴,

inf {𝔼_p L(u, Y) : u ∈ ℝ^d, ψ(u) βˆ‰ argmin_r 𝔼_p β„“(r, Y)}  >  inf {𝔼_p L(u, Y) : u ∈ ℝ^d}.

That is, the infimum over reports not linked to the argmin of the discrete loss is strictly worse than the unrestricted infimum.
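As a numeric sanity check (an illustration, not a proof), one can probe this condition for the hinge loss against 0-1 loss on 𝒴 = {βˆ’1, +1} with link ψ(u) = sign(u); a finite grid over u stands in for the infima:

```python
import numpy as np

# Illustration: the calibration gap for hinge loss vs. 0-1 loss on {-1, +1},
# with link psi(u) = sign(u).  A grid over u stands in for the infima.
us = np.linspace(-3, 3, 6001)

def exp_hinge(p1, u):
    """E_{Y~p} L_hinge(u, Y) with p1 = Pr[Y = +1]."""
    return p1 * np.maximum(0.0, 1 - u) + (1 - p1) * np.maximum(0.0, 1 + u)

p1 = 0.7                              # here the 0-1-optimal report is +1
vals = exp_hinge(p1, us)
inf_all = vals.min()                  # unrestricted infimum (about 0.6)
inf_mislinked = vals[us <= 0].min()   # u not linked to the argmin (about 1.0)
print(inf_mislinked > inf_all)        # True: the strict gap the definition asks for
```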

SLIDE 11

Consistency

Let f_n: 𝒳 β†’ ℝ^d be the hypothesis learned from training on n samples drawn i.i.d. from P. Write er_P^L(f) for the expected loss L incurred by predicting f(X) when (X, Y) ~ P.

The sequence f_n is said to be L-consistent if

er_P^L(f_n) β†’ inf_{f: 𝒳 β†’ ℝ^d} er_P^L(f) =: er_P^{L,*}.

Losses are calibrated. Hypotheses are consistent.

SLIDE 12

Relating calibration and consistency

Let β„“: β„› Γ— 𝒴 β†’ ℝ₊ be a discrete loss. A surrogate loss function L: ℝ^d Γ— 𝒴 β†’ ℝ₊ is β„“-calibrated if and only if there exists a link function ψ: ℝ^d β†’ β„› such that, for all distributions P on 𝒳 Γ— 𝒴 and all sequences of surrogate hypotheses f_n: 𝒳 β†’ ℝ^d,

er_P^L(f_n) β†’ er_P^{L,*}  β‡’  er_P^β„“(ψ ∘ f_n) β†’ er_P^{β„“,*}.

Ramaswamy et al. (2015), Theorem 3; originally Tewari and Bartlett (2007), Theorem 2. In words: if the hypotheses converge to optimal for the surrogate, then the linked hypotheses converge to optimal for the discrete loss.

SLIDE 13

Pause

Questions so far? Next, we formalize properties.

We use these to study calibration.

SLIDE 14

Properties

A property Ξ“: Ξ”_𝒴 β†’ β„› is a function mapping probability distributions to reports. (If it's easier, think of p ∈ Ξ”_𝒴 as a conditional probability.) A property Ξ“: Ξ”_𝒴 β†’ β„› is elicitable if there is a loss function L: β„› Γ— 𝒴 β†’ ℝ₊ such that, for all p ∈ Ξ”_𝒴, Ξ“(p) = argmin_r 𝔼_{Y~p} L(r, Y). Here we say the loss L elicits Ξ“. Elicitable properties have convex level sets (Lambert and Shoham 2009), where the level set of a report r is

Ξ“_r = {p ∈ Ξ”_𝒴 : r ∈ Ξ“(p)}.

SLIDE 15

β€œDrawing” a property

π‘œ-simplex in (π‘œ βˆ’ 1)-dimensional space. Example: n = 3

SLIDE 16

Calibration… in terms of properties

A property Ξ“: Ξ”_𝒴 β†’ ℝ^d and link ψ: ℝ^d β†’ β„› are β„“-calibrated if

u_n β†’ Ξ“(p)  β‡’  𝔼_p β„“(ψ(u_n), Y) β†’ min_r 𝔼_p β„“(r, Y),

i.e., the property value can always be linked to the argmin of the loss. This is a tool to study geometric properties of losses eliciting Ξ“.

Definition courtesy of Agarwal and Agarwal (2015).

SLIDE 17

Property Papers

Lambert, Shoham (2009): Eliciting Truthful Answers to Multiple-Choice Questions.

Finite properties are elicitable iff their level sets form a power diagram.

Agarwal, Agarwal (2015): On Consistent Surrogate Risk Minimization and Property Elicitation.

There’s a connection between properties and surrogate losses.

Frongillo, Kash (2015): On Elicitation Complexity.

Every property is elicitable, but the question is how elicitable.

SLIDE 18

Calibrated surrogates

Positive Normal Sets. Necessary Conditions. Sufficient Conditions. Relationship between positive normal sets and level sets of property.

SLIDE 19

Positive Normal Sets

Finite-outcome setting: rewrite L: β„› β†’ ℝ^n as the vector of loss values should each outcome occur. By linearity of expectation, 𝔼_p L(r, Y) = ⟨p, L(r)⟩.

The positive normal set of an outcome vector u collects the distributions under which u's expected loss attains the lowest expected loss possible:

𝒩_L(u) = {p ∈ Ξ”_𝒴 : ⟨p, u⟩ = inf_r ⟨p, L(r)⟩}.

(For surrogates the definition is stated with sequences: a sequence of outcome vectors whose expected loss converges to the infimum.)
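For a finite discrete loss, membership in a positive normal set is directly checkable from the loss matrix. A minimal sketch, assuming the 3-outcome 0-1 loss as the example:

```python
import numpy as np

# Sketch for a finite discrete loss: p lies in the positive normal set N_L(u)
# when the expected loss <p, u> matches the best achievable expected loss over
# all loss vectors L(r).  Rows of L below are the 0-1 loss vectors L(r).
L = 1 - np.eye(3)

def in_positive_normal_set(p, u, tol=1e-12):
    return np.dot(p, u) <= (L @ p).min() + tol

print(in_positive_normal_set(np.array([0.5, 0.3, 0.2]), L[0]))  # True
print(in_positive_normal_set(np.array([0.2, 0.5, 0.3]), L[0]))  # False
```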

SLIDE 20

Necessary Condition

Let β„“ be a discrete loss and let L be β„“-calibrated. Let Ξ³ be the property elicited by β„“. Then for all u ∈ 𝒮_L = conv(im L), there exists an r ∈ β„› such that 𝒩_L(u) βŠ† Ξ³_r.

Ramaswamy et al. (2015), Theorem 6.

SLIDE 21

Sufficient Condition

Suppose there exists some finite set of points u_i ∈ 𝒮_L such that βˆͺ_i 𝒩_L(u_i) = Ξ”_𝒴, and for each i there exists an r_i ∈ β„› such that 𝒩_L(u_i) βŠ† Ξ³_{r_i}. Then L is β„“-calibrated.

Ramaswamy et al. (2015), Theorem 8.

Example: 0-1 loss and hinge. Ordering loss vectors over outcomes (βˆ’1, +1) and writing p₁ = Pr[Y = +1], the hinge loss vectors at the reports u = +1 and u = βˆ’1 are (2, 0) and (0, 2), and

𝒩_hinge((2, 0)) = {p ∈ Ξ”_2 : p₁ β‰₯ 1/2} βŠ† Ξ³_{+1}
𝒩_hinge((0, 2)) = {p ∈ Ξ”_2 : p₁ ≀ 1/2} βŠ† Ξ³_{βˆ’1},

overlapping at p₁ = 1/2. Together the two sets cover Ξ”_2, so hinge is calibrated for 0-1 loss.
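A numeric version of the hinge example (an illustration under a grid approximation of the infimum, not a proof):

```python
import numpy as np

# Sketch verifying the hinge example: loss vectors over outcomes (y=-1, y=+1)
# are L(u) = ((1+u)_+, (1-u)_+), so L(+1) = (2,0) and L(-1) = (0,2).
# A grid over u stands in for the infimum in the positive-normal-set definition.
us = np.linspace(-4, 4, 8001)
loss_vecs = np.stack([np.maximum(0, 1 + us), np.maximum(0, 1 - us)], axis=1)

def in_N(p, u_vec, tol=1e-9):
    return np.dot(p, u_vec) <= (loss_vecs @ p).min() + tol

for p1 in [0.3, 0.5, 0.8]:             # p1 = Pr[Y = +1]
    p = np.array([1 - p1, p1])
    print(p1, in_N(p, np.array([2.0, 0.0])), in_N(p, np.array([0.0, 2.0])))
# (2,0) contains exactly p1 >= 1/2; (0,2) contains exactly p1 <= 1/2.
```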

SLIDE 22

Outline

  • Background
  • Surrogates
  • Calibration
  • Properties
  • Necessary and Sufficient Conditions
  • Case Study: Abstain
  • Dimensionality
  • Conclusion
SLIDE 23

Case Study: Abstain

Situations where the cost of misclassification is high.

College admissions. Medical diagnoses.

Discrete loss for this problem:

β„“_{1/2}(r, y) = 0 if r = y;  1/2 if r = βŠ₯;  1 if r β‰  y and r β‰  βŠ₯.
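A minimal sketch of this loss and its Bayes-optimal report; `BOT` is our own placeholder name for βŠ₯. Reporting the mode beats abstaining exactly when the mode's mass exceeds 1/2 (expected loss 1 βˆ’ max_y p_y vs. 1/2):

```python
import numpy as np

BOT = "abstain"  # our placeholder for the reject report

def abstain_loss(r, y):
    """The l_{1/2} discrete loss: 0 if correct, 1/2 to abstain, 1 if wrong."""
    if r == BOT:
        return 0.5
    return 0.0 if r == y else 1.0

def bayes_report(p):
    """Minimize E_{Y~p} l_{1/2}(r, Y): report the mode if it has mass > 1/2,
    otherwise abstain (both are optimal at exactly 1/2)."""
    return int(np.argmax(p)) if p.max() > 0.5 else BOT

print(bayes_report(np.array([0.7, 0.1, 0.1, 0.1])))      # 0
print(bayes_report(np.array([0.25, 0.25, 0.25, 0.25])))  # abstain
```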

SLIDE 24

Historical calibrated surrogates

  • Crammer Singer (2001)

𝑀𝐷𝑇 𝑣, 𝑧 = 1 + max

π‘˜β‰ π‘§ π‘£π‘˜ βˆ’ 𝑣𝑧

+

πœ”π·π‘‡ 𝑣 = α‰Šπ‘π‘ π‘•π‘›π‘π‘¦1β‰€π‘—β‰€π‘œπ‘£π‘—, 𝑣 1 βˆ’ 𝑣 2 > 𝜐 βŠ₯, π‘π‘’β„Žπ‘“π‘ π‘₯𝑗𝑑𝑓

  • One vs All (Rafkin, Klateau 2004)

𝑀𝑃𝑀𝐡 𝑣, 𝑧 = βˆ‘π•{𝑧 = 𝑗} 1 βˆ’ 𝑣𝑗 + + 𝕁 𝑧 β‰  𝑗 1 + 𝑣𝑗 + πœ”π‘ƒπ‘€π΅ 𝑣 = ࡝ 𝑏𝑠𝑕𝑛𝑏𝑦1β‰€π‘—β‰€π‘œπ‘£π‘—, max

π‘˜

π‘£π‘˜ > 𝜐 βŠ₯, π‘π‘’β„Žπ‘“π‘ π‘₯𝑗𝑑𝑓

SLIDE 25

BEP Surrogate and Link

Ramaswamy et al. (2018), Section 4. Example distributions: p = (1/4, 1/4, 1/4, 1/4) and p = (7/10, 1/10, 1/10, 1/10).

SLIDE 26

Why BEP?

BEP, CS, and OvA are all β„“_{1/2}-calibrated for the abstain loss.

A convex surrogate takes reports u ∈ ℝ^d.

BEP: d = ⌈logβ‚‚(n)βŒ‰. CS and OvA: d = n.

Why does the dimension d matter?

SLIDE 27

Outline

  • Background
  • Surrogates
  • Calibration
  • Properties
  • Necessary and Sufficient Conditions
  • Case Study: Abstain
  • Dimensionality
  • Conclusion
SLIDE 28

Dimensionality

Want algorithms that are efficient and accurate.

Reducing the dimension makes the optimization problem more efficient. Calibration guarantees accuracy.

SLIDE 29

Elicitation Complexity

Elic(Ξ“) = the minimum dimension d such that Ξ“ is elicitable by a d-dimensional loss. Maybe Ξ“ isn't itself 1-elicitable, but it can be computed as g ∘ Γ̃, where Γ̃ is 1-elicitable.

Then Elic(Ξ“) = 1. This is called indirect elicitation.

Example: Ξ“(p) = (𝔼_p[Y])Β².

Elicit 𝔼_p[Y] and take g: x ↦ xΒ².

SLIDE 30

Convex Calibration Dimension

A special case of elicitation complexity. ccdim(β„“) = the minimum dimension d such that there is a convex β„“-calibrated surrogate L: ℝ^d Γ— 𝒴 β†’ ℝ.

Example: ccdim(β„“_{1/2}) ≀ ⌈logβ‚‚(n)βŒ‰, via the BEP surrogate.

From Ramaswamy et al. (2015), Definition 10.

SLIDE 31

Bounds on CC dimension

Understood through the feasible subspace dimension. There is a tight bound for properties where all the level sets intersect in the interior of the simplex: ccdim(β„“) = n βˆ’ 1.

This does not apply to abstain.

These results are from Ramaswamy et al. (2015).

SLIDE 32

Feasible Subspace Dimension

The feasible subspace dimension Ξ½_π’ž(p) of a convex set π’ž at a point p ∈ π’ž is the dimension of β„±_π’ž(p) ∩ βˆ’β„±_π’ž(p).

β„±_π’ž(p) is the cone of feasible directions of π’ž at p.

Essentially: the dimension of the smallest face of π’ž containing p.
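For the special case π’ž = Ξ”_𝒴, which is the case the lower bound below uses, the smallest face containing p is spanned by p's support, so Ξ½_π’ž(p) = β€–pβ€–β‚€ βˆ’ 1. A one-function sketch under that assumption:

```python
import numpy as np

# Sketch for the special case C = the probability simplex: the smallest face
# containing p is spanned by p's support, so nu_C(p) = ||p||_0 - 1.
def nu_simplex(p, tol=1e-12):
    return int((np.asarray(p) > tol).sum()) - 1

print(nu_simplex([0.5, 0.5, 0.0]))   # 1: an edge of the simplex
print(nu_simplex([1.0, 0.0, 0.0]))   # 0: a vertex
print(nu_simplex([0.2, 0.3, 0.5]))   # 2: the interior of Delta_3
```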

SLIDE 33

Feasible Subspace Dimension

SLIDE 34

Feasible Subspace Dimension

SLIDE 35

Lower bound on CC Dimension

Let β„“ be the discrete loss eliciting the finite property Ξ³: Ξ”_𝒴 β†’ β„›. For p ∈ Ξ”_𝒴 and r such that p ∈ Ξ³_r,

ccdim(β„“) β‰₯ β€–pβ€–β‚€ βˆ’ Ξ½_{Ξ³_r}(p) βˆ’ 1.

Ramaswamy et al. (2016), Theorem 16.

SLIDE 36

Proof Sketch

Consider π‘ž ∈ π‘ π‘“π‘šπ‘—π‘œπ‘’(Δ𝒡). Other case is a reduction to this. Suppose surrogate 𝑀: ℝ𝑒 β†’ ℝ+

π‘œ is β„“-calibrated.

Want to show 𝑒 β‰₯ ||π‘ž||0 βˆ’ πœˆπ›Ώπ‘  π‘ž βˆ’ 1.

Want to show there is β„‹ βŠ† Δ𝒡 and 𝑠′ ∈ β„› so that three things are true:

  • 1. π‘ž ∈ β„‹
  • 2. πœˆβ„‹ π‘ž = π‘œ βˆ’ 𝑒 βˆ’ 1
  • 3. β„‹ βŠ† 𝛿𝑠′

π‘ž ∈ 𝛿𝑠′ πœˆπ›Ώπ‘  π‘ž = πœˆπ›Ώπ‘ β€² π‘ž β‰₯ πœˆβ„‹ π‘ž = π‘œ βˆ’ 𝑒 βˆ’ 1

SLIDE 37

Proof Sketch: Conditions 1 and 2

Construct β„‹ and r'. Consider u ∈ ℝ^d (a sequence of points, if needed) approaching

inf_{z ∈ 𝒮_L} ⟨p, z⟩ = inf_{u' ∈ ℝ^d} ⟨p, L(u')⟩.

Claim: there exist w_y ∈ βˆ‚L_y(u) for all y ∈ 𝒴 such that Ξ£_y p_y w_y = 0. Set A = [w_1, w_2, …, w_n] ∈ ℝ^{d Γ— n} and

β„‹ = {q ∈ Ξ”_𝒴 : Aq = 0}.

Then Ξ½_β„‹(p) equals the nullity of A augmented with the all-ones row πŸ™α΅€, which is β‰₯ n βˆ’ d βˆ’ 1, by Ramaswamy et al., Lemma 15.

  β€’ Conditions 1 and 2 are satisfied.
SLIDE 38

Condition 3 Intuition

We construct β„‹ using sequences that converge to u; the Ο΅-subdifferential is used to pass to the infimum. They show q ∈ β„‹ β‡’ q ∈ 𝒩_L(u) with a fair amount of arithmetic.

The result follows since 𝒩_L(u) βŠ† Ξ³_{r'} for some r' ∈ β„› (the necessary condition above): β€œLet β„“ be the discrete loss eliciting the finite property Ξ³: Ξ”_𝒴 β†’ β„›. For p ∈ Ξ”_𝒴 and r such that p ∈ Ξ³_r, ccdim(β„“) β‰₯ β€–pβ€–β‚€ βˆ’ Ξ½_{Ξ³_r}(p) βˆ’ 1.”

SLIDE 39

Mode: CC dimension tight bound

Trivial upper bound: ccdim(β„“_{0-1}) ≀ n βˆ’ 1. Previous result: for p ∈ Ξ³_r, ccdim(β„“) β‰₯ β€–pβ€–β‚€ βˆ’ Ξ½_{Ξ³_r}(p) βˆ’ 1. Take p ∈ ∩_r Ξ³_r: for the mode this is the uniform distribution, where β€–pβ€–β‚€ = n and Ξ½_{Ξ³_r}(p) = 0, so ccdim(β„“_{0-1}) β‰₯ n βˆ’ 0 βˆ’ 1. Thus the bound is tight: ccdim(β„“_{0-1}) = n βˆ’ 1.

SLIDE 40

Abstain: Unknown bounds

ccdim β‰₯ 3 βˆ’ 1 βˆ’ 1 = 1.  ccdim β‰₯ 3 βˆ’ 2 βˆ’ 1 = 0.  ccdim β‰₯ 2 βˆ’ 0 βˆ’ 1 = 1.

With abstain, ∩_r Ξ³_r = βˆ…, so the tight lower bound above does not apply.

SLIDE 41

Calibrated surrogate papers

Ramaswamy, Agarwal (2015): Convex calibration dimension for multiclass loss matrices.

Reducing surrogate dimension is important. Here are some bounds on dimension of calibrated convex surrogate losses.

Ramaswamy et al. (2018): Consistent Algorithms for Multiclass classification with a reject option.

Abstain is cool, and here's a ⌈logβ‚‚(n)βŒ‰-dimensional consistent convex surrogate.

SLIDE 42

Other discrete examples

Ramaswamy et al. (2015): Convex calibrated surrogates for hierarchical classification.

Here’s a calibrated surrogate for this problem using abstain surrogate.

Lapin et al. (2015): Analysis and Optimization of Loss Functions for Multiclass, Top-k, and Multilabel Classification.

Here are some surrogates for these problems. Not sure if they’re calibrated.

Yu, Blaschko (2015): The LovΓ‘sz Hinge: A Novel Convex Surrogate for Submodular Losses.

Here’s how you go from submodular functions to convex surrogates.

SLIDE 43

Future Work

Embedding Dimension. Constructing Link Functions. Approximate Elicitation.

SLIDE 44

Summary

β„“_{1/2}(r, y) = 0 if r = y;  1/2 if r = βŠ₯;  1 if r β‰  y and r β‰  βŠ₯.

Thank you, questions?