

  1. Decision Theory, and Loss Functions CMSC 691 UMBC Some slides adapted from Hamed Pirsiavash

  2. Today's Goal: learn about empirical risk minimization, argmin_h Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i)), and how to optimize it: Set t = 0. Pick a starting value θ_t. Until converged: 1. Get value y_t = F(θ_t); 2. Get derivative g_t = F'(θ_t); 3. Get scaling factor ρ_t; 4. Set θ_{t+1} = θ_t + ρ_t·g_t; 5. Set t += 1.
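  A minimal sketch of this recipe in Python (the quadratic objective F and the fixed step size rho below are illustrative assumptions, not from the slides):

    def gradient_descent(F, F_prime, theta0, rho=0.1, tol=1e-8, max_iters=1000):
        # Follows the slide's recipe; the minus sign steps against the gradient,
        # i.e. the slide's "+ rho_t * g_t" update with a negative scaling factor,
        # so that F decreases.
        theta = theta0
        for t in range(max_iters):
            y_t = F(theta)                  # 1. get value y_t = F(theta_t)
            g_t = F_prime(theta)            # 2. get derivative g_t = F'(theta_t)
            theta = theta - rho * g_t       # 3.-4. pick scaling factor rho_t, update theta
            if abs(F(theta) - y_t) < tol:   # converged? (objective stopped changing)
                break
        return theta

    # illustrative objective (an assumption): F(theta) = (theta - 3)^2, minimized at 3
    print(gradient_descent(lambda th: (th - 3) ** 2, lambda th: 2 * (th - 3), theta0=0.0))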

  3. Outline Decision Theory Loss Functions Multiclass vs. Multilabel Prediction

  4. Decision Theory. "Decision theory is trivial, apart from the computational details" – MacKay, ITILA, Ch. 36. Input: x ("state of the world"). Output: a decision ŷ.

  5. Decision Theory. "Decision theory is trivial, apart from the computational details" – MacKay, ITILA, Ch. 36. Input: x ("state of the world"). Output: a decision ŷ. Requirement 1: a decision (hypothesis) function h(x) to produce ŷ.

  6. Decision Theory. "Decision theory is trivial, apart from the computational details" – MacKay, ITILA, Ch. 36. Input: x ("state of the world"). Output: a decision ŷ. Requirement 1: a decision (hypothesis) function h(x) to produce ŷ. Requirement 2: a function ℓ(y, ŷ) telling us how wrong we are.

  7. Decision Theory. "Decision theory is trivial, apart from the computational details" – MacKay, ITILA, Ch. 36. Input: x ("state of the world"). Output: a decision ŷ. Requirement 1: a decision (hypothesis) function h(x) to produce ŷ. Requirement 2: a loss function ℓ(y, ŷ) telling us how wrong we are. Goal: minimize our expected loss across any possible input.

  8. Requirement 1: Decision Function. [Diagram: instances feed into machine learning, which produces a predictor h(x); an evaluator scores predictions against gold/correct labels, optionally using extra knowledge.] h(x) is our predictor (classifier, regression model, clustering model, etc.).

  9. Requirement 2: Loss Function. ℓ(y, ŷ) ≥ 0, where y is the "correct" label/result and ŷ is the predicted label/result ("ℓ" is "ell", a fancy l character). Loss: a function that tells you how much to penalize a prediction ŷ that differs from the correct answer y. How do we optimize ℓ: minimize or maximize?

  10. Requirement 2: Loss Function. ℓ(y, ŷ) ≥ 0, where y is the "correct" label/result and ŷ is the predicted label/result. Negative ℓ (−ℓ) is called a utility or reward function. Loss: a function that tells you how much to penalize a prediction ŷ that differs from the correct answer y.

  11. Decision Theory. Minimize expected loss across any possible input: argmin_ŷ E[ℓ(y, ŷ)].

  12. Risk Minimization. Minimize expected loss across any possible input: argmin_ŷ E[ℓ(y, ŷ)] = argmin_h E[ℓ(y, h(x))]. So far this is for a particular, unspecified input pair (x, y)… but we want any possible pair.

  13. Decision Theory. Minimize expected loss across any possible input: argmin_ŷ E[ℓ(y, ŷ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x,y)~P}[ℓ(y, h(x))]. Assumption: there exists some true (but likely unknown) distribution P over inputs x and outputs y.

  14. Risk Minimization. Minimize expected loss across any possible input: argmin_ŷ E[ℓ(y, ŷ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x,y)~P}[ℓ(y, h(x))] = argmin_h ∫ ℓ(y, h(x)) P(x, y) d(x, y).

  15. Risk Minimization. Minimize expected loss across any possible input: argmin_ŷ E[ℓ(y, ŷ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x,y)~P}[ℓ(y, h(x))] = argmin_h ∫ ℓ(y, h(x)) P(x, y) d(x, y). We don't know this distribution*! (*we could try to approximate it analytically)

  16. (Posterior) Empirical Risk Minimization. Minimize expected (posterior) loss across our observed inputs: argmin_ŷ E[ℓ(y, ŷ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x,y)~P}[ℓ(y, h(x))] ≈ argmin_h (1/N) Σ_{i=1}^{N} E_{y~P(·|x_i)}[ℓ(y, h(x_i))].

  17. Empirical Risk Minimization. Minimize expected loss across our observed inputs (& outputs): argmin_ŷ E[ℓ(y, ŷ)] = argmin_h E[ℓ(y, h(x))] = argmin_h E_{(x,y)~P}[ℓ(y, h(x))] ≈ argmin_h (1/N) Σ_{i=1}^{N} ℓ(y_i, h(x_i)).

  18. Empirical Risk Minimization. Minimize expected loss across our observed inputs (& outputs): argmin_h Σ_{i=1}^{N} ℓ(y_i, h(x_i)). Our classifier/predictor h is controlled by our parameters θ; changing θ changes the behavior of the classifier.
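  The empirical risk is just an average (or sum) of per-example losses. A hedged sketch in Python; the linear predictor and squared loss here are hypothetical choices made only so the sum is concrete:

    def empirical_risk(loss, h, theta, xs, ys):
        # average per-example loss over the observed (x_i, y_i) pairs
        return sum(loss(y_i, h(theta, x_i)) for x_i, y_i in zip(xs, ys)) / len(xs)

    # hypothetical pieces, for illustration only
    h_linear = lambda theta, x: theta * x            # predictor h_theta(x)
    squared = lambda y, y_hat: (y - y_hat) ** 2      # loss ell(y, y_hat)
    print(empirical_risk(squared, h_linear, 2.0, xs=[1.0, 2.0, 3.0], ys=[2.1, 3.9, 6.2]))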

  19. Best Case: Optimize Empirical Risk with Gradients. argmin_h Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i)) becomes argmin_θ Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i)). Changing θ changes the behavior of the classifier.

  20. Best Case: Optimize Empirical Risk with Gradients. argmin_θ Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i)) = argmin_θ F(θ). Changing θ changes the behavior of the classifier. How? Use gradient descent on F(θ)! Differentiating might not always work: "… apart from the computational details".

  21. Best Case: Optimize Empirical Risk with Gradients. argmin_θ Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i)). Changing θ changes the behavior of the classifier. With ŷ_i = h_θ(x_i): ∇_θ F = Σ_i [∂ℓ(y_i, ŷ_i)/∂ŷ_i] ∇_θ h_θ(x_i). Differentiating might not always work: "… apart from the computational details".

  22. Best Case: Optimize Empirical Risk with Gradients. argmin_θ Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i)). Changing θ changes the behavior of the classifier. With ŷ_i = h_θ(x_i): ∇_θ F = Σ_i [∂ℓ(y_i, ŷ_i)/∂ŷ_i] ∇_θ h_θ(x_i). Step 1: compute the gradient of the loss wrt the predicted value. Differentiating might not always work: "… apart from the computational details".

  23. Best Case: Optimize Empirical Risk with Gradients. argmin_θ Σ_{i=1}^{N} ℓ(y_i, h_θ(x_i)). Changing θ changes the behavior of the classifier. With ŷ_i = h_θ(x_i): ∇_θ F = Σ_i [∂ℓ(y_i, ŷ_i)/∂ŷ_i] ∇_θ h_θ(x_i). Step 1: compute the gradient of the loss wrt the predicted value. Step 2: compute the gradient of the predicted value wrt θ. Differentiating might not always work: "… apart from the computational details".
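  Putting the two steps together, a sketch of the chain rule for ∇_θ F; the squared loss and the linear predictor h_θ(x) = θ·x are assumptions made only so the derivatives are concrete:

    def grad_empirical_risk(theta, xs, ys):
        grad = [0.0] * len(theta)
        for x_i, y_i in zip(xs, ys):
            y_hat = sum(t * x for t, x in zip(theta, x_i))   # prediction h_theta(x_i)
            dloss_dyhat = 2.0 * (y_hat - y_i)                # Step 1: d loss / d y_hat
            dyhat_dtheta = x_i                               # Step 2: grad of y_hat wrt theta
            grad = [g + dloss_dyhat * d for g, d in zip(grad, dyhat_dtheta)]
        return grad

    print(grad_empirical_risk([0.5, -1.0], xs=[[1.0, 2.0], [0.0, 1.0]], ys=[1.0, 0.5]))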

  24. Outline Decision Theory Loss Functions Multiclass vs. Multilabel Prediction

  25. Loss Functions Serve a Task. The task: what kind of problem are you solving? (classification, regression, clustering, …) The data: amount of human input / number of labeled examples (fully-supervised, semi-supervised, un-supervised). The approach: how any data are being used (probabilistic, generative, conditional, neural, memory-based, exemplar, spectral, …).

  26. Classification: Supervised Machine Learning. Examples: assigning subject categories, topics, or genres; age/gender identification; language identification; sentiment analysis; spam detection; authorship identification; … Input: an instance d, a fixed set of classes C = {c_1, c_2, …, c_J}, and a training set of m hand-labeled instances (d_1, c_1), …, (d_m, c_m). Output: a learned classifier γ that maps instances to classes; γ learns to associate certain features of instances with their labels.

  27. Classification Loss Function Example: 0-1 Loss. ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ.

  28. Classification Loss Function Example: 0-1 Loss. ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ. Problem 1: not differentiable wrt ŷ (or θ).
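  As a quick sketch (the spam / not-spam labels are just illustrative):

    def zero_one_loss(y, y_hat):
        # 0 when the prediction matches the correct label, 1 otherwise
        return 0 if y == y_hat else 1

    # The loss is piecewise constant in y_hat (and hence in theta): its derivative is
    # 0 almost everywhere and undefined at the jumps, so gradient descent gets no signal.
    print(zero_one_loss("spam", "spam"), zero_one_loss("spam", "not-spam"))   # 0 1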

  29. Convex Surrogate Loss Functions. Surrogate loss: replace the zero/one loss with a smooth function; easier to optimize if the surrogate loss is convex. [Figure: zero/one loss and several smooth surrogate losses plotted as functions of the prediction ŷ_i.] Courtesy Hamed Pirsiavash, CIML
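  For concreteness, a small sketch of a few commonly used convex surrogates; the specific hinge/logistic/exponential forms are the usual textbook definitions, assumed here rather than taken from the slide:

    import math

    # Standard convex surrogates for 0/1 loss, written as functions of the margin
    # m = y * y_hat with labels y in {-1, +1}.
    def zero_one(m):     return 0.0 if m > 0 else 1.0
    def hinge(m):        return max(0.0, 1.0 - m)
    def logistic(m):     return math.log(1.0 + math.exp(-m))
    def exponential(m):  return math.exp(-m)

    for m in (-1.0, 0.0, 0.5, 2.0):
        print(m, zero_one(m), hinge(m), round(logistic(m), 3), round(exponential(m), 3))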

  30. Example: ERM with exponential loss: the objective. Courtesy Hamed Pirsiavash

  31. Example: ERM with exponential loss: the objective and its gradient. Courtesy Hamed Pirsiavash

  32. Example: ERM with exponential loss: objective, gradient, and update. The loss term is high for misclassified points. Courtesy Hamed Pirsiavash
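  A sketch of what the objective, gradient, and update can look like in code. The linear score and {-1, +1} labels are assumptions for illustration; the slides' exact formulation is in the images:

    import math

    # One standard instantiation: labels y_i in {-1, +1}, linear score s_i = theta . x_i,
    # objective sum_i exp(-y_i * s_i).
    def exp_loss_gradient_step(theta, xs, ys, rho=0.1):
        grad = [0.0] * len(theta)
        for x_i, y_i in zip(xs, ys):
            s_i = sum(t * x for t, x in zip(theta, x_i))
            w = math.exp(-y_i * s_i)              # loss term: large for misclassified points
            grad = [g - y_i * x * w for g, x in zip(grad, x_i)]   # gradient of the objective
        return [t - rho * g for t, g in zip(theta, grad)]          # descent update on theta

    theta = [0.0, 0.0]
    xs, ys = [[1.0, 2.0], [-1.0, 0.5]], [+1, -1]
    for _ in range(5):
        theta = exp_loss_gradient_step(theta, xs, ys)
    print(theta)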

  33. Structured Classification: Sequence & Structured Prediction Courtesy Hamed Pirsiavash

  34. Classification Loss Function Example: 0-1 Loss. ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ. Problem 1: not differentiable wrt ŷ (or θ). Problem 2: too strict; structured prediction involves many individual decisions. Solution 1: specialize 0-1 to the structured problem at hand.
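  One common specialization (an illustrative choice, not named on the slide) is a per-decision Hamming loss over the structure's individual decisions:

    def hamming_loss(y_seq, y_hat_seq):
        # count how many of the individual decisions in a structured output are wrong
        return sum(1 for y, y_hat in zip(y_seq, y_hat_seq) if y != y_hat)

    # e.g. a part-of-speech tag sequence with one wrong decision out of three
    print(hamming_loss(["DET", "NOUN", "VERB"], ["DET", "VERB", "VERB"]))   # 1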

  35. Regression Like classification, but real-valued

  36. Regression Example: Stock Market Prediction Courtesy Hamed Pirsiavash

  37. Regression Loss Function Examples. Squared loss / MSE (mean squared error): ℓ(y, ŷ) = (y − ŷ)². ŷ is a real value → nicely differentiable (generally) ☺

  38. Regression Loss Function Examples. Squared loss / MSE (mean squared error): ℓ(y, ŷ) = (y − ŷ)²; ŷ is a real value → nicely differentiable (generally) ☺. Absolute loss: ℓ(y, ŷ) = |y − ŷ|; the absolute value is mostly differentiable.
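  Both are one-liners; a small sketch with an illustrative y and ŷ:

    def squared_loss(y, y_hat):
        return (y - y_hat) ** 2      # smooth everywhere: derivative wrt y_hat is 2 * (y_hat - y)

    def absolute_loss(y, y_hat):
        return abs(y - y_hat)        # differentiable everywhere except at y_hat == y

    print(squared_loss(3.0, 2.5), absolute_loss(3.0, 2.5))   # 0.25 0.5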
