

SLIDE 1

Decision Theory, and Loss Functions

CMSC 691 UMBC

Some slides adapted from Hamed Pirsiavash

SLIDE 2

Today’s Goal: learn about empirical risk minimization

$$\arg\min_{h} \sum_{j=1}^{N} \ell(y_j, h_\theta(x_j)) \quad \text{(the sum is the objective } F(\theta)\text{)}$$

Gradient descent on F:

Set t = 0
Pick a starting value ΞΈt
Until converged:
  1. Get value yt = F(ΞΈt)
  2. Get derivative gt = Fβ€²(ΞΈt)
  3. Get scaling factor ρt
  4. Set ΞΈt+1 = ΞΈt βˆ’ ρt Β· gt
  5. Set t += 1
SLIDE 3

Outline

Decision Theory
Loss Functions
Multiclass vs. Multilabel Prediction

SLIDE 4

Decision Theory

β€œDecision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch. 36

Input: x (β€œstate of the world”)
Output: a decision Ε·

SLIDE 5

Decision Theory

β€œDecision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch. 36

Input: x (β€œstate of the world”)
Output: a decision Ε·

Requirement 1: a decision (hypothesis) function h(x) to produce Ε·

SLIDE 6

Decision Theory

β€œDecision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch. 36

Input: x (β€œstate of the world”)
Output: a decision Ε·

Requirement 1: a decision (hypothesis) function h(x) to produce Ε·
Requirement 2: a function β„“(y, Ε·) telling us how wrong we are

SLIDE 7

Decision Theory

β€œDecision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch. 36

Input: x (β€œstate of the world”)
Output: a decision Ε·

Requirement 1: a decision (hypothesis) function h(x) to produce Ε·
Requirement 2: a loss function β„“(y, Ε·) telling us how wrong we are

Goal: minimize our expected loss across any possible input

SLIDE 8

Requirement 1: Decision Function

[Diagram: instances 1–4 flow into a machine-learning predictor h(x) (plus extra knowledge), which outputs scores; an evaluator compares the predictions against gold/correct labels.]

h(x) is our predictor (classifier, regression model, clustering model, etc.)
SLIDE 9

Requirement 2: Loss Function

β„“(y, Ε·) β‰₯ 0

y: the β€œcorrect” label/result; Ε·: the predicted label/result; β„“: β€œell” (a fancy l character)

loss: a function that tells you how much to penalize a prediction Ε· from the correct answer y

Optimize β„“? Minimize it: β„“ is a penalty, so smaller is better.
SLIDE 10

Requirement 2: Loss Function

β„“(y, Ε·) β‰₯ 0

y: the β€œcorrect” label/result; Ε·: the predicted label/result; β„“: β€œell” (a fancy l character)

loss: a function that tells you how much to penalize a prediction Ε· from the correct answer y

Negative β„“ (βˆ’β„“) is called a utility or reward function

SLIDE 11

Decision Theory

minimize expected loss across any possible input

$$\arg\min_{\hat{y}} \; \mathbb{E}[\ell(y, \hat{y})]$$

SLIDE 12

Risk Minimization

minimize expected loss across any possible input

a particular, unspecified input pair (x,y)… but we want any possible pair

$$\arg\min_{\hat{y}} \; \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \; \mathbb{E}[\ell(y, h(x))]$$

SLIDE 13

Decision Theory

minimize expected loss across any possible input

$$\arg\min_{\hat{y}} \; \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \; \mathbb{E}[\ell(y, h(x))] = \arg\min_{h} \; \mathbb{E}_{(x,y)\sim P}[\ell(y, h(x))]$$

Assumption: there exists some true (but likely unknown) distribution P over inputs x and outputs y

SLIDE 14

Risk Minimization

minimize expected loss across any possible input

$$\arg\min_{\hat{y}} \; \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \; \mathbb{E}_{(x,y)\sim P}[\ell(y, h(x))] = \arg\min_{h} \int \ell(y, h(x)) \, P(x, y) \, d(x, y)$$

SLIDE 15

Risk Minimization

minimize expected loss across any possible input

$$\arg\min_{\hat{y}} \; \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \; \mathbb{E}_{(x,y)\sim P}[\ell(y, h(x))] = \arg\min_{h} \int \ell(y, h(x)) \, P(x, y) \, d(x, y)$$

we don’t know this distribution*!

*we could try to approximate it analytically

SLIDE 16

(Posterior) Empirical Risk Minimization

minimize expected (posterior) loss across our observed input

$$\arg\min_{\hat{y}} \; \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \; \mathbb{E}_{(x,y)\sim P}[\ell(y, h(x))] \approx \arg\min_{h} \; \frac{1}{N} \sum_{j=1}^{N} \mathbb{E}_{y \sim P(\cdot \mid x_j)}[\ell(y, h(x_j))]$$

SLIDE 17

Empirical Risk Minimization

minimize expected loss across our observed input (& output)

$$\arg\min_{\hat{y}} \; \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \; \mathbb{E}_{(x,y)\sim P}[\ell(y, h(x))] \approx \arg\min_{h} \; \frac{1}{N} \sum_{j=1}^{N} \ell(y_j, h(x_j))$$

SLIDE 18

Empirical Risk Minimization

minimize expected loss across our observed input (& output)

$$\arg\min_{h} \sum_{j=1}^{N} \ell(y_j, h(x_j))$$

h is our classifier/predictor, controlled by our parameters ΞΈ

change ΞΈ β†’ change the behavior of the classifier
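To make the notation concrete, here is a minimal sketch of computing the empirical risk; the linear predictor and squared loss are illustrative assumptions, not the slide's choices:

```python
import numpy as np

# Illustrative empirical risk: mean loss of a parametrized predictor h_theta
# over observed pairs (x_j, y_j). Linear h and squared loss are assumptions.
def h(theta, x):
    return theta @ x                      # linear predictor h_theta(x)

def loss(y, y_hat):
    return (y - y_hat) ** 2               # squared loss ell(y, y_hat)

def empirical_risk(theta, X, Y):
    # (1/N) * sum_j ell(y_j, h_theta(x_j))
    return np.mean([loss(y, h(theta, x)) for x, y in zip(X, Y)])

X = np.array([[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]])
Y = np.array([5.0, 5.0, 1.5])
print(empirical_risk(np.array([1.0, 2.0]), X, Y))  # 0.0 for this theta
```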

SLIDE 19

Best Case: Optimize Empirical Risk with Gradients

$$\arg\min_{h} \sum_{j=1}^{N} \ell(y_j, h_\theta(x_j)) \;\Longrightarrow\; \arg\min_{\theta} \sum_{j=1}^{N} \ell(y_j, h_\theta(x_j))$$

change ΞΈ β†’ change the behavior of the classifier

SLIDE 20

Best Case: Optimize Empirical Risk with Gradients

differentiating might not always work: β€œβ€¦ apart from the computational details”

$$\arg\min_{\theta} \; F(\theta), \qquad F(\theta) = \sum_{j=1}^{N} \ell(y_j, h_\theta(x_j))$$

change ΞΈ β†’ change the behavior of the classifier

How? Use Gradient Descent on F(ΞΈ)!

SLIDE 21

Best Case: Optimize Empirical Risk with Gradients

π›Όπœ„πΊ = ෍

𝑗

πœ–β„“ 𝑧𝑗, ො 𝑧 = β„Žπœ„ π’šπ‘— πœ– ො 𝑧 π›Όπœ„β„Žπœ„ π’šπ’‹

differentiating might not always work: β€œβ€¦ apart from the computational details”

argmin

πœ„

෍

𝑗=1 𝑂

β„“ 𝑧𝑗, β„Žπœ„ π’šπ‘—

change ΞΈ β†’ change the behavior of the classifier

SLIDE 22

Best Case: Optimize Empirical Risk with Gradients

π›Όπœ„πΊ = ෍

𝑗

πœ–β„“ 𝑧𝑗, ො 𝑧 = β„Žπœ„ π’šπ‘— πœ– ො 𝑧 π›Όπœ„β„Žπœ„ π’šπ’‹

differentiating might not always work: β€œβ€¦ apart from the computational details”

argmin

πœ„

෍

𝑗=1 𝑂

β„“ 𝑧𝑗, β„Žπœ„ π’šπ‘—

change ΞΈ β†’ change the behavior of the classifier

Step 1: compute the gradient of the loss wrt the predicted value

SLIDE 23

Best Case: Optimize Empirical Risk with Gradients

π›Όπœ„πΊ = ෍

𝑗

πœ–β„“ 𝑧𝑗, ො 𝑧 = β„Žπœ„ π’šπ‘— πœ– ො 𝑧 π›Όπœ„β„Žπœ„ π’šπ’‹

differentiating might not always work: β€œβ€¦ apart from the computational details”

argmin

πœ„

෍

𝑗=1 𝑂

β„“ 𝑧𝑗, β„Žπœ„ π’šπ‘—

change ΞΈ β†’ change the behavior of the classifier

Step 1: compute the gradient of the loss wrt the predicted value
Step 2: compute the gradient of the predicted value wrt ΞΈ
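A small numeric sketch of these two steps, assuming squared loss and a linear predictor (both hypothetical choices for illustration):

```python
import numpy as np

# Chain rule for one example j, assuming squared loss and a linear h_theta.
# Step 1: d ell / d y_hat; Step 2: d y_hat / d theta; multiply them.
def grad_example(theta, x, y):
    y_hat = theta @ x                     # predicted value h_theta(x)
    dloss_dyhat = -2.0 * (y - y_hat)      # step 1: gradient of (y - y_hat)^2 wrt y_hat
    dyhat_dtheta = x                      # step 2: gradient of theta @ x wrt theta
    return dloss_dyhat * dyhat_dtheta     # chain rule

theta = np.array([0.0, 0.0])
x, y = np.array([1.0, 2.0]), 3.0
print(grad_example(theta, x, y))          # [-6. -12.]
```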

SLIDE 24

Outline

Decision Theory
Loss Functions
Multiclass vs. Multilabel Prediction

SLIDE 25

Loss Functions Serve a Task

the task: what kind of problem are you solving?
  Classification, Regression, Clustering

the data: amount of human input / number of labeled examples
  Fully-supervised, Semi-supervised, Un-supervised

the approach: how any data are being used
  Probabilistic, Generative, Conditional, Spectral, Neural, Memory-based, Exemplar, …

SLIDE 26

Classification: Supervised Machine Learning

Example tasks: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis; …

Input:
β€’ an instance d
β€’ a fixed set of classes C = {c1, c2, …, cJ}
β€’ a training set of m hand-labeled instances (d1, c1), …, (dm, cm)

Output:
β€’ a learned classifier Ξ³ that maps instances to classes

Ξ³ learns to associate certain features of instances with their labels

SLIDE 27

Classification Loss Function Example: 0-1 Loss

$$\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{if } y \neq \hat{y} \end{cases}$$

SLIDE 28

Classification Loss Function Example: 0-1 Loss

$$\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{if } y \neq \hat{y} \end{cases}$$

Problem 1: not differentiable wrt Ε· (or ΞΈ)

SLIDE 29

Convex surrogate loss functions

Surrogate loss: replace the zero/one loss with a smooth function. It is easier to optimize if the surrogate loss is convex.

[Figure: convex surrogate losses plotted as functions of the prediction Ε·j. Courtesy Hamed Pirsiavash, CIML]
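For concreteness, a sketch of some standard convex surrogates written as functions of the margin m = yΒ·score; these are common textbook forms and may not be exactly the set plotted on the slide:

```python
import numpy as np

# Common convex surrogates for 0/1 loss, written as functions of the
# margin m = y * score (y in {-1, +1}). Standard textbook forms; the
# slide's exact plotted set may differ.
def zero_one(m):     return np.where(m > 0, 0.0, 1.0)
def hinge(m):        return np.maximum(0.0, 1.0 - m)
def logistic(m):     return np.log1p(np.exp(-m))
def exponential(m):  return np.exp(-m)

margins = np.linspace(-2, 2, 5)
for f in (zero_one, hinge, logistic, exponential):
    print(f.__name__, np.round(f(margins), 3))
```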

SLIDE 30

Example: ERM with Exponential loss

[Figure: the objective. Courtesy Hamed Pirsiavash]
SLIDE 31

Example: ERM with Exponential loss

[Figure: the objective and its gradient. Courtesy Hamed Pirsiavash]
SLIDE 32

Example: ERM with Exponential loss

[Figure: the objective, its gradient, and the update; the loss term is high for misclassified points. Courtesy Hamed Pirsiavash]
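The slide's formulas live in the figure above; a standard instantiation (assumed here: labels y ∈ {βˆ’1, +1}, linear score ΞΈα΅€x, loss exp(βˆ’y ΞΈα΅€x)) can be sketched as:

```python
import numpy as np

# ERM with exponential loss (assumed standard form, y in {-1,+1}, linear score):
#   objective: F(theta) = sum_j exp(-y_j * theta @ x_j)
#   gradient:  grad F   = sum_j -y_j * exp(-y_j * theta @ x_j) * x_j
# The exp(-margin) factor is large for misclassified points, so they
# dominate the update.
def erm_exponential(X, Y, rho=0.1, iters=100):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = Y * (X @ theta)
        weights = np.exp(-margins)            # high for misclassified points
        grad = -(weights * Y) @ X             # sum_j -y_j exp(-m_j) x_j
        theta -= rho * grad                   # gradient-descent update
    return theta

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-0.5, -2.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
theta = erm_exponential(X, Y)
print(np.sign(X @ theta))  # should match Y on this separable toy data
```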

SLIDE 33

Structured Classification: Sequence & Structured Prediction

Courtesy Hamed Pirsiavash

SLIDE 34

Classification Loss Function Example: 0-1 Loss

$$\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{if } y \neq \hat{y} \end{cases}$$

Problem 1: not differentiable wrt Ε· (or ΞΈ)
Problem 2: too strict; structured prediction involves many individual decisions
Solution 1: specialize 0-1 loss to the structured problem at hand

SLIDE 35

Regression

Like classification, but real-valued

SLIDE 36

Regression Example: Stock Market Prediction

Courtesy Hamed Pirsiavash

SLIDE 37

Regression Loss Function Examples

$$\ell(y, \hat{y}) = (y - \hat{y})^2 \qquad \text{squared loss / MSE (mean squared error)}$$

Ε· is a real value β†’ nicely differentiable (generally) ☺

SLIDE 38

Regression Loss Function Examples

$$\ell(y, \hat{y}) = (y - \hat{y})^2 \qquad \text{squared loss / MSE (mean squared error)}$$
$$\ell(y, \hat{y}) = |y - \hat{y}| \qquad \text{absolute loss}$$

Ε· is a real value β†’ nicely differentiable (generally) ☺
The absolute value is mostly differentiable (everywhere except y = Ε·)

SLIDE 39

Regression Loss Function Examples

$$\ell(y, \hat{y}) = (y - \hat{y})^2 \qquad \text{squared loss / MSE (mean squared error)}$$
$$\ell(y, \hat{y}) = |y - \hat{y}| \qquad \text{absolute loss}$$

Ε· is a real value β†’ nicely differentiable (generally) ☺
The absolute value is mostly differentiable (everywhere except y = Ε·)

These loss functions prefer different behavior in the predictions (hint: look at the gradient of each)… we’ll get back to this
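One way to see this is to compare the gradients, as in this small sketch (the mean-seeking vs. median-seeking behavior is the standard interpretation):

```python
import numpy as np

# Gradients wrt y_hat: the squared-loss gradient scales with the error,
# while the absolute-loss gradient has constant magnitude. This is why
# squared loss chases outliers (mean-seeking) while absolute loss is
# more robust (median-seeking).
def grad_squared(y, y_hat):  return -2.0 * (y - y_hat)
def grad_absolute(y, y_hat): return -np.sign(y - y_hat)

for err in (0.1, 1.0, 10.0):
    y, y_hat = err, 0.0
    print(f"error={err:5}: squared grad={grad_squared(y, y_hat):6}, "
          f"absolute grad={grad_absolute(y, y_hat):4}")
```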

SLIDE 40

Outline

Decision Theory
Loss Functions
Multiclass vs. Multilabel Prediction

SLIDE 41

Multi-class Classification

Given input x, predict discrete label y

Multi-label Classification

SLIDE 42

Multi-class Classification

Given input x, predict discrete label y

If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task

Multi-label Classification

SLIDE 43

Multi-class Classification

Given input x, predict discrete label y

If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task
If y ∈ {0, 1, …, K βˆ’ 1} (for finite K), then a multi-class classification task

Q: What are some examples of multi-class classification?

Multi-label Classification

SLIDE 44

Multi-class Classification

Given input x, predict discrete label y

If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task
If y ∈ {0, 1, …, K βˆ’ 1} (for finite K), then a multi-class classification task

Q: What are some examples of multi-class classification?
A: Many possibilities. See A2, Q{1,2,4-7}

Multi-label Classification

SLIDE 45

Multi-class Classification

Given input x, predict discrete label y

If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task
If y ∈ {0, 1, …, K βˆ’ 1} (for finite K), then a multi-class classification task

Multi-label Classification

Single output vs. multi-output

If multiple y_m are predicted, then a multi-label classification task

SLIDE 46

Multi-class Classification

Given input x, predict discrete label y

If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task
If y ∈ {0, 1, …, K βˆ’ 1} (for finite K), then a multi-class classification task

Multi-label Classification

Single output vs. multi-output

Given input x, predict multiple discrete labels y = (y_1, …, y_M)

If multiple y_m are predicted, then a multi-label classification task

SLIDE 47

Multi-class Classification

Given input x, predict discrete label y

If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task
If y ∈ {0, 1, …, K βˆ’ 1} (for finite K), then a multi-class classification task

Multi-label Classification

Single output vs. multi-output

Given input x, predict multiple discrete labels y = (y_1, …, y_M)

If multiple y_m are predicted, then a multi-label classification task
Each y_m could be binary or multi-class
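A small sketch contrasting the two output types; softmax for multi-class and per-label sigmoids for multi-label are common modeling choices, not something the slide mandates:

```python
import numpy as np

# Multi-class: one label from K classes -> softmax over K scores, argmax.
# Multi-label: M independent labels -> a sigmoid per label, threshold each.
def softmax(scores):
    e = np.exp(scores - scores.max())     # numerically stabilized softmax
    return e / e.sum()

def sigmoid(scores):
    return 1.0 / (1.0 + np.exp(-scores))

scores = np.array([2.0, -1.0, 0.5])
print("multi-class pick:", int(np.argmax(softmax(scores))))       # one y in {0..K-1}
print("multi-label picks:", (sigmoid(scores) > 0.5).astype(int))  # one y_m per label
```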

SLIDE 48

Outline

Decision Theory
Loss Functions
Multiclass vs. Multilabel Prediction

SLIDE 49

Bring it all together: MAP, 0/1 loss, cross-entropy, log-likelihood

1. Show that MAP estimation under p(y | x) minimizes 0/1 loss
2. Show that minimizing cross-entropy loss is the same as maximizing the (conditional) log-likelihood
3. Consider: what is cross-entropy in a multi-label setting?
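As a hint for the first item, a standard sketch (assuming a finite label set):

```latex
% Expected 0/1 loss of a decision \hat{y} given x:
\mathbb{E}_{y \sim p(\cdot \mid x)}\!\left[\ell_{0/1}(y, \hat{y})\right]
  = \sum_{y} p(y \mid x)\, \mathbb{1}[y \neq \hat{y}]
  = 1 - p(\hat{y} \mid x)
% This is minimized by picking the most probable label, i.e. the MAP decision:
\hat{y}^{*} = \arg\max_{y} p(y \mid x)
```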