SLIDE 1 Decision Theory and Loss Functions
CMSC 691 UMBC
Some slides adapted from Hamed Pirsiavash
SLIDE 2 argmin
h
ΰ·
π=1 π
β π§π, βπ ππ
F
Today's Goal: learn about empirical risk minimization
Set t = 0. Pick a starting value θ_t. Until converged:
- 1. Get value y_t = F(θ_t)
- 2. Get derivative g_t = F′(θ_t)
- 3. Get scaling factor ρ_t
- 4. Set θ_{t+1} = θ_t − ρ_t * g_t (step against the gradient to minimize)
- 5. Set t += 1
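A minimal sketch of this loop in Python, assuming a one-dimensional θ, a hand-supplied derivative F′, and a fixed scaling factor ρ; all of these specifics are illustrative rather than prescribed by the slides.

    def gradient_descent(F, F_prime, theta0, rho=0.1, tol=1e-8, max_iters=10_000):
        """Minimize F starting from theta0 by repeatedly stepping against the gradient."""
        theta = theta0
        for _ in range(max_iters):
            y_t = F(theta)               # 1. current value of the objective
            g_t = F_prime(theta)         # 2. derivative at the current theta
            theta = theta - rho * g_t    # 3-4. scaled step against the gradient
            if abs(F(theta) - y_t) < tol:    # converged: objective barely changed
                break
        return theta

    # Example: F(theta) = (theta - 3)^2 has its minimum at theta = 3.
    print(gradient_descent(lambda t: (t - 3) ** 2, lambda t: 2 * (t - 3), theta0=0.0))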
SLIDE 3
Outline
Decision Theory
Loss Functions
Multiclass vs. Multilabel Prediction
SLIDE 4
Decision Theory
"Decision theory is trivial, apart from the computational details" – MacKay, ITILA, Ch. 36
Input: x ("state of the world")
Output: a decision ŷ
SLIDE 5
Decision Theory
"Decision theory is trivial, apart from the computational details" – MacKay, ITILA, Ch. 36
Input: x ("state of the world")
Output: a decision ŷ
Requirement 1: a decision (hypothesis) function h(x) to produce ŷ
SLIDE 6
Decision Theory
"Decision theory is trivial, apart from the computational details" – MacKay, ITILA, Ch. 36
Input: x ("state of the world")
Output: a decision ŷ
Requirement 1: a decision (hypothesis) function h(x) to produce ŷ
Requirement 2: a function ℓ(y, ŷ) telling us how wrong we are
SLIDE 7
Decision Theory
"Decision theory is trivial, apart from the computational details" – MacKay, ITILA, Ch. 36
Input: x ("state of the world")
Output: a decision ŷ
Requirement 1: a decision (hypothesis) function h(x) to produce ŷ
Requirement 2: a loss function ℓ(y, ŷ) telling us how wrong we are
Goal: minimize our expected loss across any possible input
SLIDE 8 Requirement 1: Decision Function
[Diagram: instances 1–4 are fed, along with extra knowledge, into a Machine Learning Predictor h(x), which outputs a score; an Evaluator compares the predictions against the gold/correct labels]
h(x) is our predictor (classifier, regression model, clustering model, etc.)
SLIDE 9 Requirement 2: Loss Function
ℓ(y, ŷ) ≥ 0
y: the "correct" label/result; ŷ: the predicted label/result; ℓ: "ell" (a fancy l character)
loss: A function that tells you how much to penalize a prediction yΜ from the correct answer y
Optimize ℓ?
- minimize
- maximize
SLIDE 10 Requirement 2: Loss Function
ℓ(y, ŷ) ≥ 0
y: the "correct" label/result; ŷ: the predicted label/result; ℓ: "ell" (a fancy l character)
loss: A function that tells you how much to penalize a prediction yΜ from the correct answer y
Negative ℓ (−ℓ) is called a utility or reward function
SLIDE 11
Decision Theory
minimize expected loss across any possible input
argmin_ŷ 𝔼[ℓ(y, ŷ)]
SLIDE 12 Risk Minimization
minimize expected loss across any possible input
a particular, unspecified input pair (x, y)… but we want any possible pair
argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))]
SLIDE 13 Decision Theory
minimize expected loss across any possible input
argmin_ŷ 𝔼[ℓ(y, ŷ)]
= argmin_h 𝔼[ℓ(y, h(x))]
= argmin_h 𝔼_{(x,y)∼D}[ℓ(y, h(x))]
Assumption: there exists some true (but likely unknown) distribution P over inputs x and outputs y
SLIDE 14
Risk Minimization
minimize expected loss across any possible input
argmin_ŷ 𝔼[ℓ(y, ŷ)]
= argmin_h 𝔼[ℓ(y, h(x))]
= argmin_h 𝔼_{(x,y)∼D}[ℓ(y, h(x))]
= argmin_h ∫ ℓ(y, h(x)) p(x, y) d(x, y)
SLIDE 15 Risk Minimization
minimize expected loss across any possible input
argmin_ŷ 𝔼[ℓ(y, ŷ)]
= argmin_h 𝔼[ℓ(y, h(x))]
= argmin_h 𝔼_{(x,y)∼D}[ℓ(y, h(x))]
= argmin_h ∫ ℓ(y, h(x)) p(x, y) d(x, y)
we don't know this distribution*!
*we could try to approximate it analytically
SLIDE 16
(Posterior) Empirical Risk Minimization
minimize expected (posterior) loss across our observed input
argmin_ŷ 𝔼[ℓ(y, ŷ)]
= argmin_h 𝔼[ℓ(y, h(x))]
= argmin_h 𝔼_{(x,y)∼D}[ℓ(y, h(x))]
≈ argmin_h (1/N) Σ_{i=1}^{N} 𝔼_{y∼p(·|x_i)}[ℓ(y, h(x_i))]
SLIDE 17
Empirical Risk Minimization
minimize expected loss across our observed input (& output)
argmin_ŷ 𝔼[ℓ(y, ŷ)]
= argmin_h 𝔼[ℓ(y, h(x))]
= argmin_h 𝔼_{(x,y)∼D}[ℓ(y, h(x))]
≈ argmin_h (1/N) Σ_{i=1}^{N} ℓ(y_i, h(x_i))
SLIDE 18 Empirical Risk Minimization
minimize expected loss across our observed input (& output)
argmin_h Σ_{i=1}^{N} ℓ(y_i, h(x_i))
classifier/predictor controlled by our parameters θ
change θ → change the behavior of the classifier
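A small Python sketch of the empirical risk for an observed dataset, assuming a generic predictor h(x; θ) and a pluggable loss ℓ; the function names and the squared-loss example are illustrative, not from the slides.

    def empirical_risk(loss, predict, theta, xs, ys):
        """Average loss of the predictor h(x; theta) over the observed (x, y) pairs."""
        return sum(loss(y, predict(x, theta)) for x, y in zip(xs, ys)) / len(xs)

    # e.g. squared loss with a 1-D linear predictor h(x; theta) = theta * x
    squared = lambda y, y_hat: (y - y_hat) ** 2
    linear = lambda x, theta: theta * x
    print(empirical_risk(squared, linear, theta=2.0, xs=[1.0, 2.0], ys=[2.0, 4.5]))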
SLIDE 19 Best Case: Optimize Empirical Risk with Gradients
argmin_h Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i))  →  argmin_θ Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i))
change θ → change the behavior of the classifier
SLIDE 20 Best Case: Optimize Empirical Risk with Gradients
differentiating might not always work: "… apart from the computational details"
argmin_θ Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i))
change θ → change the behavior of the classifier
How? Use Gradient Descent on F(θ)!
F(θ) = Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i))
SLIDE 21 Best Case: Optimize Empirical Risk with Gradients
∇_θ F = Σ_i [∂ℓ(y_i, ŷ)/∂ŷ]|_{ŷ = h_θ(x_i)} · ∇_θ h_θ(x_i)
differentiating might not always work: "… apart from the computational details"
argmin_θ Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i))
change θ → change the behavior of the classifier
SLIDE 22 Best Case: Optimize Empirical Risk with Gradients
∇_θ F = Σ_i [∂ℓ(y_i, ŷ)/∂ŷ]|_{ŷ = h_θ(x_i)} · ∇_θ h_θ(x_i)
differentiating might not always work: "… apart from the computational details"
argmin_θ Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i))
change θ → change the behavior of the classifier
Step 1: compute the gradient of the loss wrt the predicted value
SLIDE 23 Best Case: Optimize Empirical Risk with Gradients
∇_θ F = Σ_i [∂ℓ(y_i, ŷ)/∂ŷ]|_{ŷ = h_θ(x_i)} · ∇_θ h_θ(x_i)
differentiating might not always work: "… apart from the computational details"
argmin_θ Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i))
change θ → change the behavior of the classifier
Step 1: compute the gradient of the loss wrt the predicted value
Step 2: compute the gradient of the predicted value wrt θ
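A sketch of this two-step chain rule in Python, assuming squared loss and a 1-D linear predictor h_θ(x) = θ·x purely for illustration; any differentiable loss and predictor would follow the same pattern.

    def grad_empirical_risk(theta, xs, ys):
        """∇_θ Σ_i ℓ(y_i, h_θ(x_i)) for squared loss and h_θ(x) = θ * x."""
        grad = 0.0
        for x, y in zip(xs, ys):
            y_hat = theta * x
            dloss_dyhat = 2 * (y_hat - y)    # Step 1: ∂ℓ(y, ŷ)/∂ŷ evaluated at ŷ = h_θ(x)
            dyhat_dtheta = x                 # Step 2: ∂h_θ(x)/∂θ
            grad += dloss_dyhat * dyhat_dtheta
        return grad

    print(grad_empirical_risk(2.0, xs=[1.0, 2.0], ys=[2.0, 4.5]))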
SLIDE 24
Outline
Decision Theory
Loss Functions
Multiclass vs. Multilabel Prediction
SLIDE 25 Loss Functions Serve a Task
The task (what kind of problem are we solving?): Classification, Regression, Clustering
The data (how much human input / how many labeled examples?): Fully-supervised, Semi-supervised, Un-supervised
The approach (how are the data being used?): Probabilistic, Generative, Conditional, Spectral, Neural, Memory-based, Exemplar, …
SLIDE 26 Classification: Supervised Machine Learning
Assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis; …
Input:
- an instance d
- a fixed set of classes C = {c1, c2, …, cJ}
- a training set of m hand-labeled instances (d1, c1), …, (dm, cm)
Output:
- a learned classifier γ that maps instances to classes
γ learns to associate certain features of instances with their labels
SLIDE 27
Classification Loss Function Example: 0-1 Loss
ℓ(y, ŷ) = { 0, if y = ŷ;  1, if y ≠ ŷ }
SLIDE 28
Classification Loss Function Example: 0-1 Loss
ℓ(y, ŷ) = { 0, if y = ŷ;  1, if y ≠ ŷ }
Problem 1: not differentiable wrt ŷ (or θ)
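A minimal Python sketch of the zero-one loss; the piecewise definition makes clear why its derivative wrt ŷ is zero almost everywhere (and undefined at y = ŷ), so gradient-based optimization gets no useful signal.

    def zero_one_loss(y, y_hat):
        """Zero-one loss: 0 for a correct prediction, 1 otherwise."""
        return 0 if y == y_hat else 1

    print(zero_one_loss(1, 1), zero_one_loss(1, 0))   # 0 1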
SLIDE 29 Surrogate loss: replace the zero-one loss with a smooth function. Easier to optimize if the surrogate loss is convex.
Convex surrogate loss functions
[Plot: several convex surrogate losses as a function of the prediction ŷ_i]
Courtesy Hamed Pirsiavash, CIML
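A sketch of a few standard convex surrogates in Python/NumPy, assuming labels y ∈ {−1, +1} and a real-valued score ŷ so the margin is y·ŷ; these particular surrogates (hinge, logistic, exponential) are common textbook examples and are not necessarily the ones shown in the plot.

    import numpy as np

    def zero_one(margin):
        return (margin <= 0).astype(float)     # the loss we actually care about

    def hinge(margin):
        return np.maximum(0.0, 1.0 - margin)   # convex upper bound (used by SVMs)

    def logistic(margin):
        return np.log1p(np.exp(-margin))       # smooth and convex

    def exponential(margin):
        return np.exp(-margin)                 # smooth and convex (used by AdaBoost)

    margins = np.linspace(-2.0, 2.0, 5)        # margin = y * ŷ for y in {-1, +1}
    for surrogate in (zero_one, hinge, logistic, exponential):
        print(surrogate.__name__, surrogate(margins))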
SLIDE 30 Example: ERM with Exponential loss
Courtesy Hamed Pirsiavash
SLIDE 31 Example: ERM with Exponential loss
Courtesy Hamed Pirsiavash
[Derivation: gradient of the exponential-loss objective]
SLIDE 32 Example: ERM with Exponential loss
[Derivation: the per-example loss term is high for misclassified points; its gradient gives the parameter update]
Courtesy Hamed Pirsiavash
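A sketch of one ERM gradient step with the exponential loss exp(−y·ŷ), assuming labels in {−1, +1} and a linear scorer ŷ = w·x; the scorer, step size, and data are illustrative choices rather than details taken from these slides.

    import numpy as np

    def exp_loss_gradient_step(w, X, y, lr=0.1):
        """One gradient-descent step on Σ_i exp(-y_i * (w · x_i))."""
        scores = X @ w                        # ŷ_i = w · x_i
        loss_terms = np.exp(-y * scores)      # large for misclassified points (y_i * ŷ_i < 0)
        grad = -(loss_terms * y) @ X          # ∇_w Σ_i exp(-y_i * ŷ_i)
        return w - lr * grad                  # update: step against the gradient

    X = np.array([[1.0, 0.5], [-1.0, 1.0]])
    y = np.array([1.0, -1.0])
    print(exp_loss_gradient_step(np.zeros(2), X, y))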
SLIDE 33 Structured Classification: Sequence & Structured Prediction
Courtesy Hamed Pirsiavash
SLIDE 34
Classification Loss Function Example: 0-1 Loss
ℓ(y, ŷ) = { 0, if y = ŷ;  1, if y ≠ ŷ }
Problem 1: not differentiable wrt ŷ (or θ)
Problem 2: too strict. Structured prediction involves many individual decisions
Solution 1: specialize 0-1 loss to the structured problem at hand
SLIDE 35
Regression
Like classification, but real-valued
SLIDE 36 Regression Example: Stock Market Prediction
Courtesy Hamed Pirsiavash
SLIDE 37
Regression Loss Function Examples
ℓ(y, ŷ) = (y − ŷ)²   squared loss / MSE (mean squared error)
ŷ is a real value → nicely differentiable (generally) ☺
SLIDE 38
Regression Loss Function Examples
ℓ(y, ŷ) = (y − ŷ)²   squared loss / MSE (mean squared error)
ℓ(y, ŷ) = |y − ŷ|    absolute loss
ŷ is a real value → nicely differentiable (generally) ☺
Absolute value is mostly differentiable
SLIDE 39 Regression Loss Function Examples
ℓ(y, ŷ) = (y − ŷ)²   squared loss / MSE (mean squared error)
ℓ(y, ŷ) = |y − ŷ|    absolute loss
ŷ is a real value → nicely differentiable (generally) ☺
Absolute value is mostly differentiable
These loss functions prefer different behavior in the predictions (hint: look at the gradient of each)… we'll get back to this
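A small Python/NumPy sketch comparing the two gradients wrt ŷ, to make the hint concrete: the squared-loss gradient grows with the size of the error (so large errors dominate), while the absolute-loss gradient has constant magnitude. The sign convention below is one illustrative choice.

    import numpy as np

    def squared_grad(y, y_hat):
        return 2 * (y_hat - y)        # ∂ℓ/∂ŷ: grows linearly with the error

    def absolute_grad(y, y_hat):
        return np.sign(y_hat - y)     # ∂ℓ/∂ŷ: constant magnitude (undefined at y = ŷ)

    for err in (0.1, 1.0, 10.0):
        print(err, squared_grad(0.0, err), absolute_grad(0.0, err))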
SLIDE 40
Outline
Decision Theory
Loss Functions
Multiclass vs. Multilabel Prediction
SLIDE 41
Multi-class Classification
Given input x, predict discrete label y
Multi-label Classification
SLIDE 42 Multi-class Classification
Given input x, predict discrete label y
If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task
Multi-label Classification
SLIDE 43 Multi-class Classification
Given input x, predict discrete label y
If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task
If y ∈ {0, 1, …, K−1} (for finite K), then a multi-class classification task
Q: What are some examples of multi-class classification?
Multi-label Classification
SLIDE 44 Multi-class Classification
Given input x, predict discrete label y
If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task
If y ∈ {0, 1, …, K−1} (for finite K), then a multi-class classification task
Q: What are some examples of multi-class classification?
A: Many possibilities. See A2, Q{1,2,4-7}
Multi-label Classification
SLIDE 45 Multi-class Classification
Given input x, predict discrete label y
If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task
If y ∈ {0, 1, …, K−1} (for finite K), then a multi-class classification task
Multi-label Classification
(single-label vs. multi-label)
If multiple y_i are predicted, then a multi-label classification task
SLIDE 46 Multi-class Classification
Given input x, predict discrete label y
If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task
If y ∈ {0, 1, …, K−1} (for finite K), then a multi-class classification task
Multi-label Classification
(single-label vs. multi-label)
Given input x, predict multiple discrete labels y = (y_1, …, y_N)
If multiple y_i are predicted, then a multi-label classification task
SLIDE 47 Multi-class Classification
Given input x, predict discrete label y
If y ∈ {0, 1} (or y ∈ {True, False}), then a binary classification task
If y ∈ {0, 1, …, K−1} (for finite K), then a multi-class classification task
Multi-label Classification
(single-label vs. multi-label)
Given input x, predict multiple discrete labels y = (y_1, …, y_N)
If multiple y_i are predicted, then a multi-label classification task
Each y_i could be binary or multi-class
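A Python/NumPy sketch contrasting the two decision rules, assuming the model has already produced a vector of K per-label scores: multi-class prediction picks the single best label, while multi-label prediction makes an independent yes/no decision per label. The scores and the 0.5 threshold are illustrative.

    import numpy as np

    scores = np.array([0.1, 0.7, 0.2, 0.6])        # one score per label, K = 4

    # Multi-class: exactly one of the K labels is predicted.
    multiclass_prediction = int(np.argmax(scores))       # -> 1

    # Multi-label: each label is an independent yes/no decision.
    multilabel_prediction = (scores > 0.5).astype(int)   # -> [0, 1, 0, 1]

    print(multiclass_prediction, multilabel_prediction)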
SLIDE 48
Outline
Decision Theory
Loss Functions
Multiclass vs. Multilabel Prediction
SLIDE 49 Bring it all together: MAP, 0/1 loss, cross-entropy, log-likelihood
- 1. Show that MAP estimation, ŷ = argmax_y p(y | x), minimizes 0/1 loss
- 2. Show that minimizing cross-entropy loss is the same as maximizing the (conditional) log-likelihood
- 3. Consider: what is cross-entropy in a multi-label setting?
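As a concrete reference point for item 2, a small Python/NumPy sketch (an illustrative example, with made-up probabilities) showing that the per-example cross-entropy between a one-hot target and a predicted distribution equals the negative log-likelihood of the correct class.

    import numpy as np

    probs = np.array([0.1, 0.7, 0.2])   # model's predicted distribution over 3 classes
    y = 1                               # index of the correct class

    one_hot = np.zeros_like(probs)
    one_hot[y] = 1.0
    cross_entropy = -np.sum(one_hot * np.log(probs))   # H(one_hot, probs)
    neg_log_likelihood = -np.log(probs[y])             # -log p(y | x)

    print(cross_entropy, neg_log_likelihood)            # identical values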