
CS70: Lecture 35.

Regression (contd.): Linear and Beyond


  • 1. Review: Linear Regression (LR), LLSE
  • 2. LR: Examples
  • 3. Beyond LR: Quadratic Regression
  • 4. Conditional Expectation (CE) and properties
  • 5. Non-linear Regression: CE = Minimum Mean-Squared Error (MMSE)


Review: Linear Regression – Motivation

Example: 100 people. Let (Xn,Yn) = (height, weight) of person n, for n = 1,...,100:

[Figure: scatter plot of the 100 (X, Y) samples with the regression line.]

The blue line is Y = −114.3+106.5X. (X in meters, Y in kg.) Best linear fit: Linear Regression.


Review: Covariance

Definition: The covariance of X and Y is cov(X,Y) := E[(X − E[X])(Y − E[Y])].

Fact: cov(X,Y) = E[XY] − E[X]E[Y].
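The Fact is worth a one-line numeric check. A minimal Python sketch (the sample values below and the uniform-on-samples distribution are my own illustration, not from the slides):

```python
# Covariance computed two ways on a small invented sample,
# treating the empirical (uniform) distribution as the true one.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 5.0]

n = len(xs)
ex = sum(xs) / n                      # E[X]
ey = sum(ys) / n                      # E[Y]

# Definition: cov(X,Y) = E[(X - E[X])(Y - E[Y])]
cov_def = sum((x - ex) * (y - ey) for x, y in zip(xs, ys)) / n

# Fact: cov(X,Y) = E[XY] - E[X] E[Y]
exy = sum(x * y for x, y in zip(xs, ys)) / n
cov_fact = exy - ex * ey

print(cov_def, cov_fact)  # the two formulas agree
```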


Review: Examples of Covariance

Note that E[X] = 0 and E[Y] = 0 in these examples, so cov(X,Y) = E[XY].

When cov(X,Y) > 0, X and Y tend to be large or small together: X and Y are said to be positively correlated. When cov(X,Y) < 0, Y tends to be smaller when X is larger: X and Y are said to be negatively correlated. When cov(X,Y) = 0, we say that X and Y are uncorrelated.


Review: Linear Regression – Non-Bayesian

Definition: Given the samples {(Xn, Yn), n = 1,...,N}, the Linear Regression of Y over X is Ŷ = a + bX, where (a,b) minimize

∑_{n=1}^{N} (Yn − a − bXn)².

Thus, Ŷn = a + bXn is our guess about Yn given Xn. The squared error is (Yn − Ŷn)². The LR minimizes the sum of the squared errors. Note: this is a non-Bayesian formulation: there is no prior.


Review: Linear Least Squares Estimate (LLSE)

Definition: Given two RVs X and Y with known distribution Pr[X = x, Y = y], the Linear Least Squares Estimate of Y given X is Ŷ = a + bX =: L[Y|X], where (a,b) minimize g(a,b) := E[(Y − a − bX)²]. Thus, Ŷ = a + bX is our guess about Y given X. The squared error is (Y − Ŷ)². The LLSE minimizes the expected value of the squared error. Note: this is a Bayesian formulation: there is a prior.


Review: LR: Non-Bayesian or Uniform?

Observe that

(1/N) ∑_{n=1}^{N} (Yn − a − bXn)² = E[(Y − a − bX)²],

where one assumes that (X,Y) = (Xn, Yn) w.p. 1/N, for n = 1,...,N. That is, the non-Bayesian LR is equivalent to the Bayesian LLSE that assumes that (X,Y) is uniform on the set of observed samples.

Thus, we can study the two cases LR and LLSE in one shot. However, the interpretations are different!


Review: LLSE

Theorem: Consider two RVs X, Y with a given distribution Pr[X = x, Y = y]. Then

L[Y|X] = Ŷ = E[Y] + (cov(X,Y)/var(X)) (X − E[X]).

Non-Bayesian setting (all expectations are sample averages):

E[X] = (1/N) ∑_{n=1}^{N} Xn;   E[Y] = (1/N) ∑_{n=1}^{N} Yn;

var[X] = E[X²] − (E[X])² = (1/N) ∑_{n=1}^{N} Xn² − ((1/N) ∑_{n=1}^{N} Xn)²;

cov(X,Y) = E[XY] − E[X]E[Y] = (1/N) ∑_{n=1}^{N} XnYn − ((1/N) ∑_{n=1}^{N} Xn)((1/N) ∑_{n=1}^{N} Yn).
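In the non-Bayesian setting these sample averages are all the formula needs, so the whole regression fits in a few lines. A Python sketch (the data below is invented, chosen to lie exactly on a line so the fit is easy to check):

```python
# Linear regression via the LLSE formula:
#   L[Y|X] = E[Y] + (cov(X,Y)/var(X)) (X - E[X]),
# with expectations replaced by sample averages.
def linear_regression(xs, ys):
    n = len(xs)
    ex = sum(xs) / n
    ey = sum(ys) / n
    var_x = sum(x * x for x in xs) / n - ex ** 2
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - ex * ey
    b = cov_xy / var_x          # slope
    a = ey - b * ex             # intercept: line passes through (E[X], E[Y])
    return a, b

# Invented data lying exactly on y = 2x + 1, so the fit recovers (1, 2).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = linear_regression(xs, ys)
print(a, b)  # 1.0, 2.0
```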


LR: Illustration

Note that

◮ the LR line goes through (E[X], E[Y])
◮ its slope is cov(X,Y)/var(X).


Linear Regression: Examples


Linear Regression: Example 2

We find:

E[X] = 0; E[Y] = 0; E[X²] = 1/2; E[XY] = 1/2;
var[X] = E[X²] − E[X]² = 1/2; cov(X,Y) = E[XY] − E[X]E[Y] = 1/2;
LR: Ŷ = E[Y] + (cov(X,Y)/var[X]) (X − E[X]) = X.


Linear Regression: Example 3

We find:

E[X] = 0; E[Y] = 0; E[X²] = 1/2; E[XY] = −1/2;
var[X] = E[X²] − E[X]² = 1/2; cov(X,Y) = E[XY] − E[X]E[Y] = −1/2;
LR: Ŷ = E[Y] + (cov(X,Y)/var[X]) (X − E[X]) = −X.


Estimation Error

We saw that the LLSE of Y given X is L[Y|X] = Ŷ = E[Y] + (cov(X,Y)/var(X)) (X − E[X]). How good is this estimator? That is, what is the mean squared estimation error? We find

E[|Y − L[Y|X]|²] = E[(Y − E[Y] − (cov(X,Y)/var(X))(X − E[X]))²]
= E[(Y − E[Y])²] − 2(cov(X,Y)/var(X)) E[(Y − E[Y])(X − E[X])] + (cov(X,Y)/var(X))² E[(X − E[X])²]
= var(Y) − cov(X,Y)²/var(X).

Without observations, the best estimate is E[Y], and the error is var(Y). Observing X reduces the error.
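Under the uniform-on-samples distribution this identity holds exactly, which makes it easy to check in code. A Python sketch (data invented):

```python
# Check: mean squared residual of the fitted line equals
# var(Y) - cov(X,Y)^2 / var(X) for the empirical distribution.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 2.0, 5.0]
n = len(xs)
ex = sum(xs) / n
ey = sum(ys) / n
var_x = sum((x - ex) ** 2 for x in xs) / n
var_y = sum((y - ey) ** 2 for y in ys) / n
cov_xy = sum((x - ex) * (y - ey) for x, y in zip(xs, ys)) / n

b = cov_xy / var_x              # slope of the LLSE line
a = ey - b * ex                 # intercept
mse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n

print(mse, var_y - cov_xy ** 2 / var_x)  # the two agree
```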


Wrap-up of Linear Regression

Linear Regression

  • 1. Linear Regression: L[Y|X] = E[Y] + (cov(X,Y)/var(X)) (X − E[X])
  • 2. Non-Bayesian: minimize ∑n (Yn − a − bXn)²
  • 3. Bayesian: minimize E[(Y − a − bX)²]

Beyond Linear Regression: Discussion

Goal: guess the value of Y, in the expected squared error sense.

We know nothing about Y other than its distribution. Our best guess is? E[Y].

Now assume we make some observation X related to Y. How do we use that observation to improve our guess about Y? Idea: use a function g(X) of the observation to estimate Y.

LR: restriction to linear functions g(X) = a + bX. With no such constraints, what is the best g(X)? Answer: E[Y|X]. This is called the Conditional Expectation (CE).


Nonlinear Regression: Motivation

There are many situations where a good guess about Y given X is not linear. E.g., (diameter of object, weight), (school years, income), (PSA level, cancer risk). Our goal: explore estimates Ŷ = g(X) for nonlinear functions g(·).


Quadratic Regression

Let X, Y be two random variables defined on the same probability space.

Definition: The quadratic regression of Y over X is the random variable Q[Y|X] = a + bX + cX², where a, b, c are chosen to minimize E[(Y − a − bX − cX²)²].

Derivation: We set to zero the derivatives w.r.t. a, b, c. We get

0 = E[Y − a − bX − cX²]
0 = E[(Y − a − bX − cX²) X]
0 = E[(Y − a − bX − cX²) X²]

We solve these three equations in the three unknowns (a, b, c).
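The three conditions are linear in (a, b, c): rearranged, they say E[Y] = a + bE[X] + cE[X²], E[XY] = aE[X] + bE[X²] + cE[X³], and E[X²Y] = aE[X²] + bE[X³] + cE[X⁴], a 3×3 system in the moments. A Python sketch under the uniform-on-samples interpretation (data invented; Cramer's rule is just one way to solve such a small system):

```python
# Quadratic regression Q[Y|X] = a + bX + cX^2 from the first-order conditions:
#   E[Y]     = a        + b E[X]   + c E[X^2]
#   E[XY]    = a E[X]   + b E[X^2] + c E[X^3]
#   E[X^2 Y] = a E[X^2] + b E[X^3] + c E[X^4]
# Moments are sample averages; the 3x3 system is solved by Cramer's rule.

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def quadratic_regression(xs, ys):
    n = len(xs)
    mom = lambda k: sum(x ** k for x in xs) / n                   # E[X^k]
    mxy = lambda k: sum(x ** k * y for x, y in zip(xs, ys)) / n   # E[X^k Y]
    M = [[1.0,    mom(1), mom(2)],
         [mom(1), mom(2), mom(3)],
         [mom(2), mom(3), mom(4)]]
    rhs = [mxy(0), mxy(1), mxy(2)]
    d = det3(M)
    coeffs = []
    for j in range(3):                 # Cramer's rule, column by column
        Mj = [row[:] for row in M]
        for i in range(3):
            Mj[i][j] = rhs[i]
        coeffs.append(det3(Mj) / d)
    return tuple(coeffs)               # (a, b, c)

# Invented data lying exactly on y = 1 - 2x + 3x^2, so the fit recovers it.
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [1 - 2 * x + 3 * x ** 2 for x in xs]
a, b, c = quadratic_regression(xs, ys)
print(a, b, c)
```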


Conditional Expectation

Definition: Let X and Y be RVs on Ω. The conditional expectation of Y given X is defined as E[Y|X] = g(X), where

g(x) := E[Y|X = x] := ∑y y Pr[Y = y | X = x].
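For a distribution given as a table, g(x) is a direct sum: condition by restricting the table to X = x and renormalizing. A Python sketch with an invented joint pmf:

```python
# Conditional expectation g(x) = E[Y | X = x] from a joint distribution
# Pr[X = x, Y = y], given as a dictionary (probabilities invented).
joint = {(0, 0): 0.1, (0, 1): 0.3,
         (1, 0): 0.2, (1, 1): 0.4}

def cond_exp(joint, x):
    # Pr[X = x] by marginalizing; then E[Y|X=x] = sum_y y Pr[Y=y | X=x].
    px = sum(p for (xx, _), p in joint.items() if xx == x)
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / px

print(cond_exp(joint, 0), cond_exp(joint, 1))  # g(0) and g(1)
```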


Deja vu, all over again?

Have we seen this before? Yes. Is anything new? Yes: the idea of defining g(x) = E[Y|X = x] and then E[Y|X] = g(X). Big deal? Quite! Simple but most convenient. Recall that L[Y|X] = a + bX is a function of X. This is similar: E[Y|X] = g(X) for some function g(·). In general, g(X) is not linear, i.e., not a + bX. It could be that g(X) = a + bX + cX², or that g(X) = 2 sin(4X) + exp{−3X}, or something else.


Properties of CE

E[Y|X = x] = ∑y y Pr[Y = y | X = x]

Theorem
(a) X, Y independent ⇒ E[Y|X] = E[Y];
(b) E[aY + bZ | X] = a E[Y|X] + b E[Z|X];
(c) E[Y h(X) | X] = h(X) E[Y|X], ∀ h(·);
(d) E[E[Y|X]] = E[Y].


Calculating E[Y|X]

Let X, Y, Z be i.i.d. with mean 0 and variance 1. We want to calculate E[2 + 5X + 7XY + 11X² + 13X³Z² | X]. We find

E[2 + 5X + 7XY + 11X² + 13X³Z² | X]
= 2 + 5X + 7X E[Y|X] + 11X² + 13X³ E[Z²|X]
= 2 + 5X + 7X E[Y] + 11X² + 13X³ E[Z²]
= 2 + 5X + 11X² + 13X³ (var[Z] + E[Z]²)
= 2 + 5X + 11X² + 13X³.


CE = MMSE

(Conditional Expectation = Minimum Mean Squared Error)

Theorem: g(X) := E[Y|X] is the function of X that minimizes E[(Y − g(X))²]. That is, E[Y|X] is the 'best' guess about Y based on X, in the mean squared error sense.
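A concrete instance of the theorem: with Y = X² and X symmetric around 0, cov(X,Y) = 0, so the LLSE is the useless constant E[Y], while the CE recovers Y exactly. A Python sketch on an invented discrete distribution:

```python
# Compare mean squared errors of the CE E[Y|X] and the LLSE L[Y|X]
# on an invented joint distribution where Y = X^2 and X is symmetric.
joint = {(-1, 1): 0.25, (0, 0): 0.5, (1, 1): 0.25}

ex = sum(x * p for (x, _), p in joint.items())
ey = sum(y * p for (_, y), p in joint.items())
var_x = sum((x - ex) ** 2 * p for (x, _), p in joint.items())
cov = sum((x - ex) * (y - ey) * p for (x, y), p in joint.items())

def llse(x):
    return ey + (cov / var_x) * (x - ex)   # here cov = 0: constant E[Y]

def ce(x):                                 # E[Y | X = x]
    px = sum(p for (xx, _), p in joint.items() if xx == x)
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / px

mse_llse = sum((y - llse(x)) ** 2 * p for (x, y), p in joint.items())
mse_ce = sum((y - ce(x)) ** 2 * p for (x, y), p in joint.items())
print(mse_ce, mse_llse)  # CE error <= LLSE error
```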


Summary

Linear and Non-Linear Regression: Conditional Expectation

◮ Linear Regression: L[Y|X] = E[Y] + (cov(X,Y)/var(X)) (X − E[X])
◮ Non-linear Regression (MMSE): E[Y|X] minimizes E[(Y − g(X))²] over all g(·)
◮ Definition: E[Y|X] = g(X), where g(x) := ∑y y Pr[Y = y | X = x]