

SLIDE 1

Probability and Statistical Decision Theory

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), and James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

Logistics

  • Recitation tonight: 7:30-8:30pm, Halligan 111B
  • More on pipelines and feature transforms
  • Cross-validation


SLIDE 3

Unit Objectives

  • Probability Basics
  • Discrete random variables
  • Continuous random variables
  • Decision Theory: Making optimal predictions
  • Limits of learning
  • The curse of dimensionality
  • The bias-variance tradeoff


SLIDE 4

What will we learn?


[Diagram: the three machine learning paradigms (Supervised Learning, Unsupervised Learning, Reinforcement Learning). Supervised learning uses data-label pairs $\{x_n, y_n\}_{n=1}^{N}$ and a performance measure, with a Training / Prediction / Evaluation loop over task data x and label y.]

SLIDE 5


Task: Regression

[Diagram: regression is a supervised learning task in which the target y is a numeric variable, e.g. sales in $$.]

SLIDE 6

Model Complexity vs Error


[Figure: error vs. model complexity, with underfitting at low complexity and overfitting at high complexity.]

SLIDE 7

Today: Bias and Variance

Credit: Scott Fortmann-Roe, http://scott.fortmann-roe.com/docs/BiasVariance.html

SLIDE 8

Model Complexity vs Error


[Figure: error vs. model complexity, with high variance at the flexible extreme and high bias at the inflexible extreme.]

SLIDE 9

Discrete Random Variable

Examples:

  • Coin flip! Heads or tails?
  • Dice roll! 1 or 2 or … 6?

In general, a discrete random variable is defined by:

  • Countable set of all possible outcomes
  • Probability value for each outcome


SLIDE 10

Probability Mass Function

Notation:

  • X is a random variable
  • x is a particular observed value
  • Probability of observation: p(X = x)

The function p is a probability mass function (pmf): it maps each possible value to a probability in [0, 1], and it must sum to one over the domain of X.
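As a concrete sketch (mine, not from the slides), a pmf for a fair six-sided die can be written as a plain Python dict, and both defining properties checked directly:

```python
# A minimal sketch (not from the slides): the pmf of a fair six-sided die
# represented as a plain dict mapping outcomes to probabilities.
pmf = {x: 1.0 / 6.0 for x in range(1, 7)}

# Every probability lies in [0, 1] ...
assert all(0.0 <= p <= 1.0 for p in pmf.values())
# ... and the probabilities sum to one over the domain of X.
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```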


SLIDE 11

Pair exercise

  • Draw the pmf for a fair 6-sided die roll
  • Draw the pmf if instead there are:
  • 2 sides with 1 pip
  • 0 sides with 2 pips


SLIDE 12

Expected Values

What is the expected value of a die roll? "Expected" means the probability-weighted average.


$$E[X] = \sum_{x} p(X = x)\, x$$
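A quick sketch of this formula in code; the loaded-die numbers below come from the pair exercise, the rest is illustrative:

```python
# A minimal sketch: E[X] as a probability-weighted average of outcomes.
pmf = {x: 1.0 / 6.0 for x in range(1, 7)}            # fair six-sided die
print(sum(p * x for x, p in pmf.items()))            # 3.5

# Loaded die from the exercise: two sides show 1 pip, none show 2 pips.
loaded = {1: 2/6, 2: 0.0, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}
print(sum(p * x for x, p in loaded.items()))         # 20/6 ≈ 3.33
```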

SLIDE 13

Joint Probability


[Table: joint counts over X (candidate supported) and Y (age group). Example query: p(X = candidate A AND Y = young)]

SLIDE 14

Marginal Probability


[Table: joint probabilities over X and Y. Summing each row gives the marginal p(X); summing each column gives the marginal p(Y).]

SLIDE 15

Conditional Probability

What is the probability of support for candidate A, if we assume that the voter is young?


Goal: p(X = candidate A | Y = young)

Try it with your partner!

SLIDE 16

Conditional Probability

What is the probability of support for candidate A, if we assume that the voter is young?


Answer: condition on Y = young by dividing the joint probability by the marginal:

$$p(X = \text{candidate A} \mid Y = \text{young}) = \frac{p(X = \text{candidate A},\; Y = \text{young})}{p(Y = \text{young})}$$

SLIDE 17

The Rules of Probability


Product rule: $p(X, Y) = p(X \mid Y)\, p(Y)$
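To make the product rule and the marginals concrete, here is a small sketch with made-up numbers; the actual joint table from the slides is not reproduced, so the values below are purely illustrative:

```python
import numpy as np

# Hypothetical joint table p(X, Y); rows = candidate A/B, cols = young/old.
# These numbers are illustrative only, not the table from the slides.
joint = np.array([[0.30, 0.20],    # p(X=A, Y=young), p(X=A, Y=old)
                  [0.10, 0.40]])   # p(X=B, Y=young), p(X=B, Y=old)
assert abs(joint.sum() - 1.0) < 1e-12

p_X = joint.sum(axis=1)   # marginal p(X): sum out Y
p_Y = joint.sum(axis=0)   # marginal p(Y): sum out X

# Product rule rearranged: p(X=A | Y=young) = p(X=A, Y=young) / p(Y=young)
print(joint[0, 0] / p_Y[0])   # 0.75
```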

SLIDE 18

Continuous Random Variables

Any r.v. whose possible outcomes are not a discrete set, but instead take values on a continuous number line.

Examples:
  • a uniform draw between 0 and 1
  • a draw from a Gaussian “bell curve” distribution


SLIDE 19

Probability Density Function

  • Generalizes pmf for discrete r.v. to continuous
  • Any pdf p(x) must satisfy two properties:


$$\forall x :\; p(x) \ge 0 \qquad\qquad \int_{x} p(x)\, dx = 1$$

SLIDE 20

Example


Consider a uniform distribution over the entire real line (from −∞ to +∞). Draw the pdf, then check whether it can meet the required conditions (nonnegative, integrates to one). Is there a problem here?

SLIDE 21

Plots of Gaussian pdf


What do you notice about the y-axis values? Is there a problem here?

SLIDE 22


Probability Density Function

  • Generalizes pmf for discrete r.v. to continuous
  • Any pdf p(x) must satisfy two properties:

$$\forall x :\; p(x) \ge 0 \qquad\qquad \int_{x} p(x)\, dx = 1$$

The value of p(x) can be ANY value ≥ 0, sometimes even larger than 1. It should NOT be interpreted as “the probability of drawing exactly x”; it should be interpreted as “the density in a vanishingly small interval around x”. Remember: density = mass / volume.
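A quick numerical sketch of this point using scipy (assuming scipy is available; not from the slides): a narrow Gaussian has density above 1 at its peak, yet every interval still gets probability in [0, 1]:

```python
from scipy.stats import norm

narrow = norm(loc=0.0, scale=0.1)   # Gaussian with a small standard deviation
print(narrow.pdf(0.0))              # ≈ 3.989: the density exceeds 1

# But the probability of any interval is still a valid probability in [0, 1].
print(narrow.cdf(0.05) - narrow.cdf(-0.05))   # ≈ 0.383
```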

SLIDE 23

Continuous Expectations

$$E[X] = \int_{x \in \mathrm{domain}(X)} x\, p(x)\, dx$$

$$E[h(X)] = \int_{x \in \mathrm{domain}(X)} h(x)\, p(x)\, dx$$

SLIDE 24

Approximating Expectations

Use “Monte Carlo”: average of a sample!

  • 1) Draw S i.i.d. samples from distribution
  • 2) Compute mean of these sampled values


$$E[h(X)] \approx \frac{1}{S} \sum_{s=1}^{S} h(x_s), \qquad x_1, x_2, \ldots, x_S \sim p(x)$$

For any function h, this random estimator is unbiased: its mean equals the true expectation. As the number of samples S increases, the variance of the estimator decreases.
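A minimal numpy sketch of the two-step recipe, estimating E[X²] for a standard Gaussian, whose true value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(h, sampler, S):
    """Monte Carlo estimate of E[h(X)]: average h over S i.i.d. samples."""
    xs = sampler(S)       # step 1: draw S i.i.d. samples x_1..x_S from p(x)
    return h(xs).mean()   # step 2: compute the mean of the h(x_s) values

# Estimate E[X^2] for X ~ Normal(0, 1); the true value is 1.
for S in (100, 10_000, 1_000_000):
    print(S, mc_expectation(lambda x: x ** 2, rng.standard_normal, S))
```

Running this shows the estimates tightening around 1 as S grows, matching the unbiasedness and shrinking-variance claims above.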

SLIDE 25

Statistical Decision Theory

  • See Ch. 2 and Ch. 7 of the ESL textbook


SLIDE 26

How to predict best if we know conditional probability?


Assume we have:
  • a specific input x of interest
  • a known “true” conditional p(Y | X)
  • an error metric we care about

How should we set our predictor ŷ? Minimize the expected error!

$$\min_{\hat{y}}\; E[\mathrm{err}(Y, \hat{y}) \mid X = x]$$

Key ideas: the prediction ŷ will be a scalar, and the conditional distribution p(Y | X) tells us everything we need to know.

SLIDE 27

Expected y at a given fixed x


$$E[Y \mid X = x] = \int_{y} y\; p(y \mid X = x)\, dy$$

SLIDE 28

Recall from HW1

  • Two constant value estimators
  • Mean of training set
  • Median of training set
  • Two possible error metrics
  • Squared error
  • Absolute error

Which estimator did best under which error metric?


SLIDE 29

Minimize expected squared error


Assume we have: a specific input x of interest and a known “true” conditional p(y | x).

How should we set our predictor ŷ to minimize the expected error? What is your intuition from HW1? Express it in terms of p(Y | X = x)…

$$E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_{y} (y - \hat{y})^2\; p(y \mid X = x)\, dy$$

$$\min_{\hat{y}}\; E[\mathrm{err}(Y, \hat{y}) \mid X = x]$$

SLIDE 30

Minimize expected squared error


Assume we have: a specific input x of interest and a known “true” conditional p(y | x).

How should we set our predictor ŷ to minimize the expected error?

$$E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_{y} (y - \hat{y})^2\; p(y \mid X = x)\, dy$$

Optimal predictor for squared error: the mean y value under p(Y | X = x):

$$\hat{y}^{*} = \arg\min_{\hat{y}}\; E[\mathrm{err}(Y, \hat{y}) \mid X = x] = E[Y \mid X = x]$$

In practice: the mean of sampled y values at/around x.
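A simulation sketch of this claim (with an illustrative distribution, not from the slides): over a grid of candidate predictions against samples from a skewed conditional, the expected squared error is minimized at the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# Samples standing in for draws from a skewed "true" conditional p(y | x).
y = rng.gamma(shape=2.0, scale=1.5, size=100_000)

# Expected squared error for each candidate prediction on a grid.
candidates = np.linspace(0.0, 10.0, 1001)
sq_err = [np.mean((y - c) ** 2) for c in candidates]
print(candidates[np.argmin(sq_err)], y.mean())   # minimizer matches the mean
```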

SLIDE 31


Minimize expected absolute error

Assume we have: a specific input x of interest and a known “true” conditional p(y | x).

How should we set our predictor ŷ to minimize the expected error? What is your intuition from HW1?

$$E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_{y} |y - \hat{y}|\; p(y \mid X = x)\, dy$$

$$\min_{\hat{y}}\; E[\mathrm{err}(Y, \hat{y}) \mid X = x]$$

SLIDE 32


Minimize expected absolute error

Assume we have: a specific input x of interest and a known “true” conditional p(y | x).

How should we set our predictor ŷ to minimize the expected error?

$$E[\mathrm{err}(Y, \hat{y}) \mid X = x] = \int_{y} |y - \hat{y}|\; p(y \mid X = x)\, dy$$

Optimal predictor for absolute error: the median y value under p(Y | X = x):

$$\hat{y}^{*} = \mathrm{median}(p(Y \mid X = x))$$

In practice: the median of sampled y values at/around x.
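The analogous sketch for absolute error, reusing the same skewed samples as before: the minimizer is now the median, which differs from the mean under skew:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.5, size=100_000)  # same skewed conditional

candidates = np.linspace(0.0, 10.0, 1001)
abs_err = [np.mean(np.abs(y - c)) for c in candidates]
print(candidates[np.argmin(abs_err)], np.median(y), y.mean())
# The minimizer matches the median, which sits below the mean for this skew.
```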

SLIDE 33

Minimizing error with K-NN

Ideal: we know the “true” conditional p(y | x).

Approximation: use a neighborhood around x and take the average of the y values in that neighborhood.

If we have enough training data, K-NN is a good approximation: some theorems say the K-NN estimate approaches the ideal as the number of examples N grows infinitely large. Problem in practice: we never have enough data, especially when the feature dimension is large.
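A short sketch of this approximation with scikit-learn's KNeighborsRegressor on synthetic data (all values below are illustrative): the prediction at x is the average y over the K nearest training points, which tracks E[Y | X = x] when data is plentiful:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic data: y = sin(x) + noise, so the ideal E[Y | X = x] is sin(x).
x_train = rng.uniform(0.0, 6.0, size=2000).reshape(-1, 1)
y_train = np.sin(x_train.ravel()) + rng.normal(0.0, 0.3, size=2000)

# Prediction at x = average of y over the K nearest training points.
knn = KNeighborsRegressor(n_neighbors=25).fit(x_train, y_train)

x_query = np.array([[1.0], [3.0], [5.0]])
print(knn.predict(x_query))      # approximations to E[Y | X = x]
print(np.sin(x_query.ravel()))   # the ideal answers: sin(1), sin(3), sin(5)
```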

SLIDE 34

Curse of Dimensionality

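The slide's figure is not reproduced here, but a small sketch conveys the core phenomenon: as the dimension grows, even the nearest of many uniformly scattered points is far from a query point, so "local" neighborhoods stop being local:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000  # number of uniformly distributed points in the unit hypercube
for d in (1, 2, 10, 100):
    pts = rng.uniform(0.0, 1.0, size=(N, d))
    # Distance from the cube's center to its nearest neighbor among the N points
    nearest = np.min(np.linalg.norm(pts - 0.5, axis=1))
    print(d, round(nearest, 3))   # the "nearest" point drifts away as d grows
```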

SLIDE 35

MSE as dimension increases


[Figure: test MSE as the number of feature dimensions increases, comparing Linear Regression vs. K Neighbors Regression. Credit: ISL textbook, Fig 3.20]

SLIDE 36

Write MSE via Bias & Variance


Setup: y is the known “true” response value at a given fixed input x. The prediction $\hat{y}(x_{tr}, y_{tr})$ is a random variable, obtained by fitting the estimator to a random sample of N training examples and then predicting at the fixed x. Define $\bar{y} \triangleq E[\hat{y}]$.

$$E\left[\left(\hat{y}(x_{tr}, y_{tr}) - y\right)^2\right] = E\left[(\hat{y} - y)^2\right] = E\left[\hat{y}^2 - 2\hat{y}y + y^2\right] = E\left[\hat{y}^2\right] - 2\bar{y}\,y + y^2$$

SLIDE 37


Write MSE via Bias & Variance

Continuing, add a net value of zero (pick $0 = -a + a$ with $a = \bar{y}^2$):

$$\cdots = E\left[\hat{y}^2\right] - \bar{y}^2 + \bar{y}^2 - 2\bar{y}\,y + y^2$$

SLIDE 38


Write MSE via Bias & Variance

The last three terms form a perfect square:

$$\cdots = E\left[\hat{y}^2\right] - \bar{y}^2 + \underbrace{\bar{y}^2 - 2\bar{y}\,y + y^2}_{(\bar{y} - y)^2}, \qquad \mathrm{bias} \triangleq \bar{y} - y$$

SLIDE 39


MSE = Variance + Bias^2

Recognizing $\mathrm{Var}[X] \triangleq E[X^2] - E[X]^2$, the first two terms are the variance:

$$\cdots = \underbrace{E\left[\hat{y}^2\right] - \bar{y}^2}_{\mathrm{Var}(\hat{y})} + (\bar{y} - y)^2 = \mathrm{Var}(\hat{y}) + \mathrm{bias}^2$$

SLIDE 40

Punchline

mean squared error = variance + bias^2

We can use this framing to explain tradeoffs of different prediction approaches on finite training datasets.
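A simulation sketch of the decomposition (a deliberately biased toy estimator of a known true value; all numbers illustrative): fit on many resampled training sets, then check MSE ≈ variance + bias² empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
true_y = 2.0                     # known "true" response at a fixed input x
n_train, n_trials = 20, 100_000

# Toy estimator: the training-set mean shrunk toward 0 (deliberately biased).
y_hat = np.array([
    0.8 * rng.normal(true_y, 1.0, size=n_train).mean()
    for _ in range(n_trials)
])

mse = np.mean((y_hat - true_y) ** 2)
variance = y_hat.var()
bias_sq = (y_hat.mean() - true_y) ** 2
print(mse, variance + bias_sq)   # the two quantities agree up to noise
```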


SLIDE 41

Toy example: ESL Fig. 7.3


SLIDE 42


[Figure: total error, bias, and variance plotted as the model becomes more flexible. Bias: error due to the inability of the average model to capture the true predictive relationship. Variance: error due to estimating from a single finite sample.]

SLIDE 43

Toy example: ISL Fig. 6.5


SLIDE 44


[Figure: total error, bias, and variance plotted as the model becomes more flexible. Bias: error due to the inability of the average fit to capture the true predictive relationship. Variance: error due to estimating from a single finite sample.]

SLIDE 45

Can Also Treat True Y as R.V.


$$Y = f(X) + \epsilon$$

where f is the true signal function and ε is a noise random variable: symmetric (zero mean), and often Gaussian.

SLIDE 46

The Final MSE decomposition


$$E[\mathrm{MSE}] = \mathrm{Var}(\hat{y}) + \mathrm{bias}^2 + \text{irreducible error}$$

For more, see Sec. 7.3 of ESL textbook…
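Extending the earlier toy simulation (same illustrative biased estimator, my assumptions throughout): when the observed target is itself noisy, Y = f(x) + ε, the noise variance shows up as an irreducible floor in the decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
f_x, noise_sd = 2.0, 0.5         # true signal f(x) and std of the noise epsilon
n_train, n_trials = 20, 50_000

y_hat, sq_errs = [], []
for _ in range(n_trials):
    y_train = rng.normal(f_x, noise_sd, size=n_train)  # noisy training targets
    pred = 0.8 * y_train.mean()                        # same biased toy estimator
    y_test = f_x + rng.normal(0.0, noise_sd)           # fresh noisy test target
    y_hat.append(pred)
    sq_errs.append((pred - y_test) ** 2)

y_hat = np.array(y_hat)
variance = y_hat.var()
bias_sq = (y_hat.mean() - f_x) ** 2
# Empirical MSE vs. variance + bias^2 + sigma^2 (the irreducible error)
print(np.mean(sq_errs), variance + bias_sq + noise_sd ** 2)
```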

SLIDE 47

Bias and Variance

Credit: Scott Fortmann-Roe, http://scott.fortmann-roe.com/docs/BiasVariance.html