Probability and Statistical Decision Theory


  1. Tufts COMP 135: Introduction to Machine Learning (https://www.cs.tufts.edu/comp/135/2019s/). Probability and Statistical Decision Theory. Many slides attributable to: Prof. Mike Hughes; Erik Sudderth (UCI); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books)

  2. Logistics • Recitation tonight: 7:30-8:30pm, Halligan 111B • More on pipelines and feature transforms • Cross validation

  3. Unit Objectives • Probability Basics • Discrete random variables • Continuous random variables • Decision Theory: Making optimal predictions • Limits of learning • The curse of dimensionality • The bias-variance tradeoff

  4. What will we learn? [Overview diagram: Supervised Learning maps training data of label pairs $\{x_n, y_n\}_{n=1}^N$ through a task and a performance measure to evaluation, producing a prediction of label y from data x; Unsupervised Learning and Reinforcement Learning shown alongside.]

  5. Task: Regression. y is a numeric variable, e.g. sales in $$. [Figure: scatter plot of y vs. x with a fitted regression curve.]

  6. Model Complexity vs Error [Figure: error vs. model complexity, annotated with underfitting and overfitting regions.]

  7. Today: Bias and Variance. Credit: Scott Fortmann-Roe, http://scott.fortmann-roe.com/docs/BiasVariance.html

  8. Model Complexity vs Error [Figure: error vs. model complexity, annotated with high bias at the low-complexity end and high variance at the high-complexity end.]

  9. Discrete Random Variable Examples: • Coin flip! Heads or tails? • Dice roll! 1 or 2 or ... 6? In general, a discrete random variable is defined by: • A countable set of all possible outcomes • A probability value for each outcome

  10. Probability Mass Function Notation: X is a random variable; x is a particular observed value; probability of observation: $p(X = x)$. The function p is a probability mass function (pmf): it maps possible values to probabilities in [0, 1] and must sum to one over the domain of X.
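As a concrete illustration (mine, not the slide's), a minimal Python sketch of a pmf for a fair six-sided die, checking both pmf properties:

```python
# A minimal sketch (not from the slides): a pmf for a fair six-sided die,
# represented as a dict mapping each outcome to its probability.
fair_die_pmf = {face: 1.0 / 6.0 for face in range(1, 7)}

# Property 1: every probability lies in [0, 1].
assert all(0.0 <= p <= 1.0 for p in fair_die_pmf.values())

# Property 2: probabilities sum to one over the domain of X.
assert abs(sum(fair_die_pmf.values()) - 1.0) < 1e-12

# Probability of a particular observed value: p(X = 3)
print(fair_die_pmf[3])  # 0.1666...
```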

  11. Pair exercise • Draw the pmf for a fair 6-sided die roll • Draw the pmf if instead there are: • 2 sides with 1 pip • 0 sides with 2 pips

  12. Expected Values What is the expected value of a dice roll? Expected means probability-weighted average: $E[X] = \sum_x p(X = x)\, x$
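Answering the slide's question with the standard worked computation (a known fact, not printed on the slide): for a fair six-sided die,

```latex
E[X] = \sum_{x=1}^{6} \frac{1}{6} \cdot x
     = \frac{1 + 2 + 3 + 4 + 5 + 6}{6}
     = \frac{21}{6} = 3.5
```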

  13. Joint Probability [Table: joint distribution over candidate preference X and voter age group Y.] p(X = candidate A AND Y = young)

  14. Marginal Probability [Same table, with the marginal p(Y) obtained by summing over X and the marginal p(X) obtained by summing over Y.]

  15. Conditional Probability What is the probability of support for candidate A, if we assume that the voter is young? Goal: p(X = candidate A | Y = young) [Same table, with the marginal p(Y) shown.] Try it with your partner!

  16. Conditional Probability What is the probability of support for candidate A, if we assume that the voter is young? Goal: p(X = candidate A | Y = young) [Answer worked out on the slide from the table counts.]

  17. The Rules of Probability Product rule: $p(X, Y) = p(X \mid Y)\, p(Y)$
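To make slides 13-17 concrete, here is a minimal sketch computing marginals and a conditional from a joint table with NumPy. The counts are invented for illustration, not the slide's actual table:

```python
import numpy as np

# Hypothetical joint counts (NOT the slide's table): rows = Y (age group),
# columns = X (candidate preference).
counts = np.array([[30, 10],   # young: [candidate A, candidate B]
                   [20, 40]])  # older: [candidate A, candidate B]

joint = counts / counts.sum()          # p(X, Y), sums to 1
p_Y = joint.sum(axis=1)                # marginal p(Y): sum over X
p_X = joint.sum(axis=0)                # marginal p(X): sum over Y

# Conditional via the product rule: p(X | Y=young) = p(X, Y=young) / p(Y=young)
p_X_given_young = joint[0] / p_Y[0]
print(p_X_given_young)  # [0.75 0.25] -> p(candidate A | young) = 0.75
```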

  18. Continuous Random Variables Any r.v. whose possible outcomes are not a discrete set, but instead take values on a number line. Examples: • a uniform draw between 0 and 1 • a draw from a Gaussian "bell curve" distribution

  19. Probability Density Function • Generalizes the pmf for discrete r.v.s to continuous ones • Any pdf p(x) must satisfy two properties: $\forall x : p(x) \geq 0$ and $\int_x p(x)\, dx = 1$

  20. Example Consider a uniform distribution over the entire real line (from -inf to +inf). Draw the pdf, and try to verify that it can meet the required conditions (nonnegative, integrates to one). Is there a problem here?

  21. Plots of Gaussian pdf [Figure: Gaussian pdf curves.] What do you notice about the y-axis values... Is there a problem here?

  22. Probability Density Function • Generalizes the pmf for discrete r.v.s to continuous ones • Any pdf p(x) must satisfy two properties: $\forall x : p(x) \geq 0$ and $\int_x p(x)\, dx = 1$ • The value of p(x) can be any nonnegative number, sometimes even larger than 1 • Should NOT be interpreted as "probability of drawing exactly x" • Should be interpreted as "density at a vanishingly small interval around x" • Remember: density = mass / volume
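A small added sketch (mine, not the slide's) showing why a density above 1 is legal: with a small standard deviation, the Gaussian density at its mean exceeds 1, yet the curve still integrates to one.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density at x: exp(-(x-mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# With sigma = 0.1, the density at the mean is well above 1 -- perfectly legal,
# since only the *integral* of the pdf must equal 1, not its pointwise values.
print(gaussian_pdf(0.0, mu=0.0, sigma=0.1))  # ~3.989
```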

  23. Continuous Expectations $E[X] = \int_{x \in \mathrm{domain}(X)} x\, p(x)\, dx$ $E[h(X)] = \int_{x \in \mathrm{domain}(X)} h(x)\, p(x)\, dx$

  24. Approximating Expectations Use "Monte Carlo": average over a sample! • 1) Draw S i.i.d. samples from the distribution: $x_1, x_2, \ldots, x_S \sim p(x)$ • 2) Compute the mean of these sampled values: $E[h(X)] \approx \frac{1}{S} \sum_{s=1}^{S} h(x_s)$ For any function h, this random estimator is unbiased (its mean equals the true expectation). As the number of samples S increases, the variance of the estimator decreases.
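A minimal Monte Carlo sketch (added for illustration; the target $E[X^2] = 1/3$ for X ~ Uniform(0, 1) is my example, not the slide's):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(h, sampler, S):
    """Monte Carlo estimate of E[h(X)]: mean of h over S i.i.d. samples."""
    samples = sampler(S)
    return np.mean(h(samples))

# Example: E[X^2] for X ~ Uniform(0, 1); the true value is 1/3.
for S in (100, 10_000, 1_000_000):
    est = mc_expectation(lambda x: x ** 2, lambda n: rng.uniform(0, 1, size=n), S)
    print(S, est)  # estimates tighten around 0.3333 as S grows
```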

  25. Statistical Decision Theory • See the ESL textbook, Ch. 2 and Ch. 7

  26. How to predict best if we know the conditional probability? Assume we have: a specific input x of interest; a known "true" conditional p(Y | X); an error metric we care about. How should we set our predictor $\hat{y}$? Minimize the expected error! $\min_{\hat{y}} E[\mathrm{err}(Y, \hat{y}) \mid X = x]$ Key ideas: the prediction will be a scalar; the conditional distribution p(Y | X) tells us everything we need to know.

  27. Expected y at a given fixed x $E[Y \mid X = x] = \int_y y\, p(y \mid X = x)\, dy$

  28. Recall from HW1 • Two constant value estimators • Mean of training set • Median of training set • Two possible error metrics • Squared error • Absolute error Which estimator did best under which error metric?

  29. Minimize expected squared error Assume we have: a specific input x of interest; a known "true" conditional p(y | x). $E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_y (y - \hat{y})^2\, p(y \mid X = x)\, dy$ What is your intuition from HW1? Express it in terms of p(Y | X = x)... How should we set our predictor $\hat{y}$ to minimize the expected error? $\min_{\hat{y}} E[\mathrm{err}(Y, \hat{y}) \mid X = x]$

  30. Minimize expected squared error Assume we have: a specific input x of interest; a known "true" conditional p(y | x). $E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_y (y - \hat{y})^2\, p(y \mid X = x)\, dy$ How should we set our predictor $\hat{y}$ to minimize the expected error? $\min_{\hat{y}} E[\mathrm{err}(Y, \hat{y}) \mid X = x]$ Optimal predictor for squared error: the mean y value under p(Y | X = x): $\hat{y}^* = E[Y \mid X = x]$ In practice: the mean of sampled y values at/around x
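A one-line justification of this claim (a standard result, not spelled out on the slide): differentiate the expected squared error with respect to $\hat{y}$ and set it to zero.

```latex
\frac{\partial}{\partial \hat{y}} \int_y (y - \hat{y})^2\, p(y \mid x)\, dy
  = \int_y -2\,(y - \hat{y})\, p(y \mid x)\, dy
  = 2\hat{y} - 2\,E[Y \mid X = x] = 0
\quad\Longrightarrow\quad \hat{y}^* = E[Y \mid X = x]
```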

  31. Minimize expected absolute error Assume we have: a specific input x of interest; a known "true" conditional p(y | x). $E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_y |y - \hat{y}|\, p(y \mid X = x)\, dy$ How should we set our predictor $\hat{y}$ to minimize the expected error? $\min_{\hat{y}} E[\mathrm{err}(Y, \hat{y}) \mid X = x]$ What is your intuition from HW1?

  32. Minimize expected absolute error Assume we have: a specific input x of interest; a known "true" conditional p(y | x). $E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_y |y - \hat{y}|\, p(y \mid X = x)\, dy$ How should we set our predictor $\hat{y}$ to minimize the expected error? $\min_{\hat{y}} E[\mathrm{err}(Y, \hat{y}) \mid X = x]$ Optimal predictor for absolute error: the median y value under p(Y | X = x): $\hat{y}^* = \mathrm{median}(p(Y \mid X = x))$ In practice: the median of sampled y values at/around x
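A quick numerical check of the mean-vs-median contrast (my own illustration, using a skewed distribution so the two differ):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=100_000)  # skewed: mean != median

for name, y_hat in [("mean", y.mean()), ("median", np.median(y))]:
    sq = np.mean((y - y_hat) ** 2)   # expected squared error
    ab = np.mean(np.abs(y - y_hat))  # expected absolute error
    print(f"{name:6s} squared={sq:.3f} absolute={ab:.3f}")

# The mean wins on squared error; the median wins on absolute error.
```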

  33. Minimizing error with K-NN • Ideal: know the "true" conditional p(y | x) • Approximation: use a neighborhood around x, and take the average of the y values in that neighborhood. If we have enough training data, K-NN is a good approximation. Some theorems say the K-NN estimate becomes ideal as the number of examples N gets infinitely large. Problem in practice: we never have enough data, especially if the feature dimension is large.
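A minimal sketch of this neighborhood-averaging idea with scikit-learn (the toy data is invented for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(200, 1))          # toy 1-D inputs (made up)
y = np.sin(x[:, 0]) + rng.normal(0, 0.2, 200)  # noisy targets

# K-NN regression: the prediction at x is the average y over the K nearest
# training points, approximating E[Y | X = x].
knn = KNeighborsRegressor(n_neighbors=10)
knn.fit(x, y)
print(knn.predict([[0.0]]))  # close to sin(0) = 0 given enough data
```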

  34. Curse of Dimensionality
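A quick demonstration of the curse (my own illustration, not from the slides): with a fixed number of points, nearest-neighbor distances blow up as the dimension grows, so K-NN neighborhoods stop being local.

```python
import numpy as np

rng = np.random.default_rng(4)

# With N fixed, the distance from a query point to its nearest neighbor
# grows with dimension d.
N = 1000
for d in (1, 2, 10, 100):
    X = rng.uniform(0, 1, size=(N, d))  # N points in the d-dimensional unit cube
    query = np.full(d, 0.5)             # query at the cube's center
    dists = np.linalg.norm(X - query, axis=1)
    print(d, dists.min())               # nearest-neighbor distance rises with d
```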

  35. MSE as dimension increases [Figure: test MSE as the number of feature dimensions grows; dashed line: linear regression, dots: K-neighbors regression.] Credit: ISL textbook, Fig 3.20

  36. Write MSE via Bias & Variance y is the known "true" response value at a given fixed input x. $\hat{y}$ is a random variable, obtained by fitting the estimator to a random sample of N training data examples, then predicting at the fixed x. $E[(\hat{y}(x^{tr}, y^{tr}) - y)^2] = E[(\hat{y} - y)^2] = E[\hat{y}^2 - 2\hat{y}y + y^2] = E[\hat{y}^2] - 2\bar{y}y + y^2$, where $\bar{y} \triangleq E[\hat{y}]$

  37. Write MSE via Bias & Variance $E[(\hat{y}(x^{tr}, y^{tr}) - y)^2] = E[(\hat{y} - y)^2] = E[\hat{y}^2 - 2\hat{y}y + y^2] = E[\hat{y}^2] - 2\bar{y}y + y^2 = E[\hat{y}^2] - \bar{y}^2 + \bar{y}^2 - 2\bar{y}y + y^2$ Add a net value of zero: pick $0 = -a + a$ with $a = \bar{y}^2$. Grouping the terms gives $\underbrace{E[\hat{y}^2] - \bar{y}^2}_{\mathrm{variance}(\hat{y})} + \underbrace{(\bar{y} - y)^2}_{\mathrm{bias}^2}$.
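To check the decomposition numerically, here is a small simulation of my own (the polynomial estimator and all settings are invented for illustration): refit an estimator on many random training sets, then compare the MSE at a fixed x against bias² + variance.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_and_predict(x0, N=30, degree=3):
    """Fit a degree-3 polynomial to one random training set, predict at fixed x0."""
    x = rng.uniform(-3, 3, N)
    y = np.sin(x) + rng.normal(0, 0.5, N)  # noisy training targets
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x0)

x0, y_true = 1.0, np.sin(1.0)  # fixed input, known "true" response value
preds = np.array([fit_and_predict(x0) for _ in range(5000)])

mse = np.mean((preds - y_true) ** 2)
variance = np.var(preds)                # E[y_hat^2] - y_bar^2
bias_sq = (preds.mean() - y_true) ** 2  # (y_bar - y)^2
print(mse, bias_sq + variance)          # the two match, up to float rounding
```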
