

SLIDE 1

Probability and Statistical Decision Theory

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), and James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

Logistics

  • Recitation tonight: 7:30-8:30pm, Halligan 111B
  • More on pipelines and feature transforms
  • Cross-validation


SLIDE 3

Unit Objectives

  • Probability Basics
  • Discrete random variables
  • Continuous random variables
  • Decision Theory: Making optimal predictions
  • Limits of learning
  • The curse of dimensionality
  • The bias-variance tradeoff


SLIDE 4

What will we learn?


[Diagram: the three machine learning paradigms (Supervised Learning, Unsupervised Learning, Reinforcement Learning). Supervised learning uses data-label pairs $\{x_n, y_n\}_{n=1}^{N}$ and a performance measure, with a Training / Prediction / Evaluation loop over task data x and label y.]

SLIDE 5


Task: Regression

[Diagram: regression is a supervised learning task in which the target y is a numeric variable, e.g. sales in $$.]

SLIDE 6

Model Complexity vs Error


[Figure: error vs. model complexity, with underfitting at low complexity and overfitting at high complexity.]

SLIDE 7

Today: Bias and Variance

Credit: Scott Fortmann-Roe, http://scott.fortmann-roe.com/docs/BiasVariance.html

SLIDE 8

Model Complexity vs Error


[Figure: error vs. model complexity, with high variance at the flexible extreme and high bias at the inflexible extreme.]

SLIDE 9

Discrete Random Variable

Examples:

  • Coin flip! Heads or tails?
  • Dice roll! 1 or 2 or … 6?

In general, a discrete random variable is defined by:

  • Countable set of all possible outcomes
  • Probability value for each outcome


SLIDE 10

Probability Mass Function

Notation:

  • X is a random variable
  • x is a particular observed value
  • Probability of observation: p(X = x)

The function p is a probability mass function (pmf): it maps each possible value to a probability in [0, 1], and it must sum to one over the domain of X.
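As a concrete sketch (mine, not from the slides), a pmf for a fair six-sided die can be written as a plain Python dict, and both defining properties checked directly:

```python
# A minimal sketch (not from the slides): the pmf of a fair six-sided die
# represented as a plain dict mapping outcomes to probabilities.
pmf = {x: 1.0 / 6.0 for x in range(1, 7)}

# Every probability lies in [0, 1] ...
assert all(0.0 <= p <= 1.0 for p in pmf.values())
# ... and the probabilities sum to one over the domain of X.
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```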


SLIDE 11

Pair exercise

  • Draw the pmf for a fair 6-sided die roll
  • Draw the pmf if instead there are:
  • 2 sides with 1 pip
  • 0 sides with 2 pips


SLIDE 12

Expected Values

What is the expected value of a die roll? "Expected" means the probability-weighted average.


$$E[X] = \sum_{x} p(X = x)\, x$$
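A quick sketch of this formula in code; the loaded-die numbers below come from the pair exercise, the rest is illustrative:

```python
# A minimal sketch: E[X] as a probability-weighted average of outcomes.
pmf = {x: 1.0 / 6.0 for x in range(1, 7)}            # fair six-sided die
print(sum(p * x for x, p in pmf.items()))            # 3.5

# Loaded die from the exercise: two sides show 1 pip, none show 2 pips.
loaded = {1: 2/6, 2: 0.0, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}
print(sum(p * x for x, p in loaded.items()))         # 20/6 ≈ 3.33
```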

SLIDE 13

Joint Probability


[Table: joint counts over X (candidate supported) and Y (age group). Example query: p(X = candidate A AND Y = young)]

SLIDE 14

Marginal Probability


[Table: joint probabilities over X and Y. Summing each row gives the marginal p(X); summing each column gives the marginal p(Y).]

SLIDE 15

Conditional Probability

What is the probability of support for candidate A, if we assume that the voter is young?


Goal: p(X = candidate A | Y = young)

Try it with your partner!

SLIDE 16

Conditional Probability

What is the probability of support for candidate A, if we assume that the voter is young?


Answer: condition on Y = young by dividing the joint probability by the marginal:

$$p(X = \text{candidate A} \mid Y = \text{young}) = \frac{p(X = \text{candidate A},\; Y = \text{young})}{p(Y = \text{young})}$$

SLIDE 17

The Rules of Probability


Product rule: $p(X, Y) = p(X \mid Y)\, p(Y)$
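To make the product rule and the marginals concrete, here is a small sketch with made-up numbers; the actual joint table from the slides is not reproduced, so the values below are purely illustrative:

```python
import numpy as np

# Hypothetical joint table p(X, Y); rows = candidate A/B, cols = young/old.
# These numbers are illustrative only, not the table from the slides.
joint = np.array([[0.30, 0.20],    # p(X=A, Y=young), p(X=A, Y=old)
                  [0.10, 0.40]])   # p(X=B, Y=young), p(X=B, Y=old)
assert abs(joint.sum() - 1.0) < 1e-12

p_X = joint.sum(axis=1)   # marginal p(X): sum out Y
p_Y = joint.sum(axis=0)   # marginal p(Y): sum out X

# Product rule rearranged: p(X=A | Y=young) = p(X=A, Y=young) / p(Y=young)
print(joint[0, 0] / p_Y[0])   # 0.75
```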

SLIDE 18

Continuous Random Variables

Any r.v. whose possible outcomes are not a discrete set, but instead take values on a continuous number line.

Examples:
  • a uniform draw between 0 and 1
  • a draw from a Gaussian “bell curve” distribution


SLIDE 19

Probability Density Function

  • Generalizes pmf for discrete r.v. to continuous
  • Any pdf p(x) must satisfy two properties:


$$\forall x :\; p(x) \ge 0 \qquad\qquad \int_{x} p(x)\, dx = 1$$

SLIDE 20

Example


Consider a uniform distribution over the entire real line (from −∞ to +∞). Draw the pdf, then check whether it can meet the required conditions (nonnegative, integrates to one). Is there a problem here?

SLIDE 21

Plots of Gaussian pdf


What do you notice about the y-axis values? Is there a problem here?

SLIDE 22


Probability Density Function

  • Generalizes pmf for discrete r.v. to continuous
  • Any pdf p(x) must satisfy two properties:

$$\forall x :\; p(x) \ge 0 \qquad\qquad \int_{x} p(x)\, dx = 1$$

The value of p(x) can be ANY value ≥ 0, sometimes even larger than 1. It should NOT be interpreted as “the probability of drawing exactly x”; it should be interpreted as “the density in a vanishingly small interval around x”. Remember: density = mass / volume.
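A quick numerical sketch of this point using scipy (assuming scipy is available; not from the slides): a narrow Gaussian has density above 1 at its peak, yet every interval still gets probability in [0, 1]:

```python
from scipy.stats import norm

narrow = norm(loc=0.0, scale=0.1)   # Gaussian with a small standard deviation
print(narrow.pdf(0.0))              # ≈ 3.989: the density exceeds 1

# But the probability of any interval is still a valid probability in [0, 1].
print(narrow.cdf(0.05) - narrow.cdf(-0.05))   # ≈ 0.383
```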

SLIDE 23

Continuous Expectations

$$E[X] = \int_{x \in \mathrm{domain}(X)} x\, p(x)\, dx$$

$$E[h(X)] = \int_{x \in \mathrm{domain}(X)} h(x)\, p(x)\, dx$$

SLIDE 24

Approximating Expectations

Use “Monte Carlo”: average of a sample!

  • 1) Draw S i.i.d. samples from distribution
  • 2) Compute mean of these sampled values


$$E[h(X)] \approx \frac{1}{S} \sum_{s=1}^{S} h(x_s), \qquad x_1, x_2, \ldots, x_S \sim p(x)$$

For any function h, this random estimator is unbiased: its mean equals the true expectation. As the number of samples S increases, the variance of the estimator decreases.
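A minimal numpy sketch of the two-step recipe, estimating E[X²] for a standard Gaussian, whose true value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(h, sampler, S):
    """Monte Carlo estimate of E[h(X)]: average h over S i.i.d. samples."""
    xs = sampler(S)       # step 1: draw S i.i.d. samples x_1..x_S from p(x)
    return h(xs).mean()   # step 2: compute the mean of the h(x_s) values

# Estimate E[X^2] for X ~ Normal(0, 1); the true value is 1.
for S in (100, 10_000, 1_000_000):
    print(S, mc_expectation(lambda x: x ** 2, rng.standard_normal, S))
```

Running this shows the estimates tightening around 1 as S grows, matching the unbiasedness and shrinking-variance claims above.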

SLIDE 25

Statistical Decision Theory

  • See Ch. 2 and Ch. 7 of the ESL textbook


SLIDE 26

How to predict best if we know conditional probability?


Assume we have:
  • a specific input x of interest
  • a known “true” conditional p(Y | X)
  • an error metric we care about

How should we set our predictor ŷ? Minimize the expected error!

$$\min_{\hat{y}}\; E[\mathrm{err}(Y, \hat{y}) \mid X = x]$$

Key ideas: the prediction ŷ will be a scalar, and the conditional distribution p(Y | X) tells us everything we need to know.

SLIDE 27

Expected y at a given fixed x


$$E[Y \mid X = x] = \int_{y} y\; p(y \mid X = x)\, dy$$

SLIDE 28

Recall from HW1

  • Two constant value estimators
  • Mean of training set
  • Median of training set
  • Two possible error metrics
  • Squared error
  • Absolute error

Which estimator did best under which error metric?


SLIDE 29

Minimize expected squared error


Assume we have: a specific input x of interest and a known “true” conditional p(y | x).

How should we set our predictor ŷ to minimize the expected error? What is your intuition from HW1? Express it in terms of p(Y | X = x)…

$$E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_{y} (y - \hat{y})^2\; p(y \mid X = x)\, dy$$

$$\min_{\hat{y}}\; E[\mathrm{err}(Y, \hat{y}) \mid X = x]$$

SLIDE 30

Minimize expected squared error


Assume we have: a specific input x of interest and a known “true” conditional p(y | x).

How should we set our predictor ŷ to minimize the expected error?

$$E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_{y} (y - \hat{y})^2\; p(y \mid X = x)\, dy$$

Optimal predictor for squared error: the mean y value under p(Y | X = x):

$$\hat{y}^{*} = \arg\min_{\hat{y}}\; E[\mathrm{err}(Y, \hat{y}) \mid X = x] = E[Y \mid X = x]$$

In practice: the mean of sampled y values at/around x.
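A simulation sketch of this claim (with an illustrative distribution, not from the slides): over a grid of candidate predictions against samples from a skewed conditional, the expected squared error is minimized at the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# Samples standing in for draws from a skewed "true" conditional p(y | x).
y = rng.gamma(shape=2.0, scale=1.5, size=100_000)

# Expected squared error for each candidate prediction on a grid.
candidates = np.linspace(0.0, 10.0, 1001)
sq_err = [np.mean((y - c) ** 2) for c in candidates]
print(candidates[np.argmin(sq_err)], y.mean())   # minimizer matches the mean
```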

SLIDE 31


Minimize expected absolute error

Assume we have: a specific input x of interest and a known “true” conditional p(y | x).

How should we set our predictor ŷ to minimize the expected error? What is your intuition from HW1?

$$E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_{y} |y - \hat{y}|\; p(y \mid X = x)\, dy$$

$$\min_{\hat{y}}\; E[\mathrm{err}(Y, \hat{y}) \mid X = x]$$

SLIDE 32


Minimize expected absolute error

Assume we have: a specific input x of interest and a known “true” conditional p(y | x).

How should we set our predictor ŷ to minimize the expected error?

$$E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_{y} |y - \hat{y}|\; p(y \mid X = x)\, dy$$

Optimal predictor for absolute error: the median y value under p(Y | X = x):

$$\hat{y}^{*} = \mathrm{median}(p(Y \mid X = x))$$

In practice: the median of sampled y values at/around x.
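The analogous sketch for absolute error, reusing the same skewed samples as before: the minimizer is now the median, which differs from the mean under skew:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.5, size=100_000)  # same skewed conditional

candidates = np.linspace(0.0, 10.0, 1001)
abs_err = [np.mean(np.abs(y - c)) for c in candidates]
print(candidates[np.argmin(abs_err)], np.median(y), y.mean())
# The minimizer matches the median, which sits below the mean for this skew.
```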

SLIDE 33

Minimizing error with K-NN

Ideal: we know the “true” conditional p(y | x).

Approximation: use a neighborhood around x and take the average of the y values in that neighborhood.

If we have enough training data, K-NN is a good approximation: some theorems say the K-NN estimate approaches the ideal as the number of examples N grows infinitely large. Problem in practice: we never have enough data, especially when the feature dimension is large.
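A short sketch of this approximation with scikit-learn's KNeighborsRegressor on synthetic data (all values below are illustrative): the prediction at x is the average y over the K nearest training points, which tracks E[Y | X = x] when data is plentiful:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic data: y = sin(x) + noise, so the ideal E[Y | X = x] is sin(x).
x_train = rng.uniform(0.0, 6.0, size=2000).reshape(-1, 1)
y_train = np.sin(x_train.ravel()) + rng.normal(0.0, 0.3, size=2000)

# Prediction at x = average of y over the K nearest training points.
knn = KNeighborsRegressor(n_neighbors=25).fit(x_train, y_train)

x_query = np.array([[1.0], [3.0], [5.0]])
print(knn.predict(x_query))      # approximations to E[Y | X = x]
print(np.sin(x_query.ravel()))   # the ideal answers: sin(1), sin(3), sin(5)
```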

SLIDE 34

Curse of Dimensionality

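The slide's figure is not reproduced here, but a small sketch conveys the core phenomenon: as the dimension grows, even the nearest of many uniformly scattered points is far from a query point, so "local" neighborhoods stop being local:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000  # number of uniformly distributed points in the unit hypercube
for d in (1, 2, 10, 100):
    pts = rng.uniform(0.0, 1.0, size=(N, d))
    # Distance from the cube's center to its nearest neighbor among the N points
    nearest = np.min(np.linalg.norm(pts - 0.5, axis=1))
    print(d, round(nearest, 3))   # the "nearest" point drifts away as d grows
```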

SLIDE 35

MSE as dimension increases


[Figure: test MSE as the number of feature dimensions increases, comparing Linear Regression vs. K Neighbors Regression. Credit: ISL textbook, Fig 3.20]

SLIDE 36

Write MSE via Bias & Variance


Setup: y is the known “true” response value at a given fixed input x. The prediction $\hat{y}(x_{tr}, y_{tr})$ is a random variable, obtained by fitting the estimator to a random sample of N training examples and then predicting at the fixed x. Define $\bar{y} \triangleq E[\hat{y}]$.

$$E\left[\left(\hat{y}(x_{tr}, y_{tr}) - y\right)^2\right] = E\left[(\hat{y} - y)^2\right] = E\left[\hat{y}^2 - 2\hat{y}y + y^2\right] = E\left[\hat{y}^2\right] - 2\bar{y}\,y + y^2$$

SLIDE 37


Write MSE via Bias & Variance

Continuing, add a net value of zero (pick $0 = -a + a$ with $a = \bar{y}^2$):

$$\cdots = E\left[\hat{y}^2\right] - \bar{y}^2 + \bar{y}^2 - 2\bar{y}\,y + y^2$$

SLIDE 38


Write MSE via Bias & Variance

The last three terms form a perfect square:

$$\cdots = E\left[\hat{y}^2\right] - \bar{y}^2 + \underbrace{\bar{y}^2 - 2\bar{y}\,y + y^2}_{(\bar{y} - y)^2}, \qquad \mathrm{bias} \triangleq \bar{y} - y$$

SLIDE 39


MSE = Variance + Bias^2

Recognizing $\mathrm{Var}[X] \triangleq E[X^2] - E[X]^2$, the first two terms are the variance:

$$\cdots = \underbrace{E\left[\hat{y}^2\right] - \bar{y}^2}_{\mathrm{Var}(\hat{y})} + (\bar{y} - y)^2 = \mathrm{Var}(\hat{y}) + \mathrm{bias}^2$$

SLIDE 40

Punchline

mean squared error = variance + bias^2

We can use this framing to explain tradeoffs of different prediction approaches on finite training datasets.
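A simulation sketch of the decomposition (a deliberately biased toy estimator of a known true value; all numbers illustrative): fit on many resampled training sets, then check MSE ≈ variance + bias² empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
true_y = 2.0                     # known "true" response at a fixed input x
n_train, n_trials = 20, 100_000

# Toy estimator: the training-set mean shrunk toward 0 (deliberately biased).
y_hat = np.array([
    0.8 * rng.normal(true_y, 1.0, size=n_train).mean()
    for _ in range(n_trials)
])

mse = np.mean((y_hat - true_y) ** 2)
variance = y_hat.var()
bias_sq = (y_hat.mean() - true_y) ** 2
print(mse, variance + bias_sq)   # the two quantities agree up to noise
```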


SLIDE 41

Toy example: ESL Fig. 7.3


SLIDE 42


[Figure: total error, bias, and variance plotted as the model becomes more flexible. Bias: error due to the inability of the average model to capture the true predictive relationship. Variance: error due to estimating from a single finite sample.]

SLIDE 43

Toy example: ISL Fig. 6.5


SLIDE 44


[Figure: total error, bias, and variance plotted as the model becomes more flexible. Bias: error due to the inability of the average fit to capture the true predictive relationship. Variance: error due to estimating from a single finite sample.]

SLIDE 45

Can Also Treat True Y as R.V.


$$Y = f(X) + \epsilon$$

where f is the true signal function and ε is a noise random variable: symmetric (zero mean), and often Gaussian.

SLIDE 46

The Final MSE decomposition


$$E[\mathrm{MSE}] = \mathrm{Var}(\hat{y}) + \mathrm{bias}^2 + \text{irreducible error}$$

For more, see Sec. 7.3 of ESL textbook…
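Extending the earlier toy simulation (same illustrative biased estimator, my assumptions throughout): when the observed target is itself noisy, Y = f(x) + ε, the noise variance shows up as an irreducible floor in the decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
f_x, noise_sd = 2.0, 0.5         # true signal f(x) and std of the noise epsilon
n_train, n_trials = 20, 50_000

y_hat, sq_errs = [], []
for _ in range(n_trials):
    y_train = rng.normal(f_x, noise_sd, size=n_train)  # noisy training targets
    pred = 0.8 * y_train.mean()                        # same biased toy estimator
    y_test = f_x + rng.normal(0.0, noise_sd)           # fresh noisy test target
    y_hat.append(pred)
    sq_errs.append((pred - y_test) ** 2)

y_hat = np.array(y_hat)
variance = y_hat.var()
bias_sq = (y_hat.mean() - f_x) ** 2
# Empirical MSE vs. variance + bias^2 + sigma^2 (the irreducible error)
print(np.mean(sq_errs), variance + bias_sq + noise_sd ** 2)
```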

SLIDE 47

Bias and Variance

Credit: Scott Fortmann-Roe, http://scott.fortmann-roe.com/docs/BiasVariance.html