SLIDE 1
CS70: Lecture 35.
Regression (contd.): Linear and Beyond
SLIDE 2 CS70: Lecture 35.
Regression (contd.): Linear and Beyond
- 1. Review: Linear Regression (LR), LLSE
- 2. LR: Examples
- 3. Beyond LR: Quadratic Regression
- 4. Conditional Expectation (CE) and properties
- 5. Non-linear Regression: CE = Minimum Mean-Squared Error (MMSE)
SLIDE 3 Review: Linear Regression – Motivation
Example: 100 people. Let (Xn,Yn) = (height, weight) of person n, for n = 1,...,100:
[Scatter plot of the 100 (height, weight) points with the best linear fit; axes X and Y, with E[Y] marked.]
The blue line is Y = −114.3+106.5X. (X in meters, Y in kg.) Best linear fit: Linear Regression.
SLIDE 4
Review: Covariance
Definition: The covariance of X and Y is cov(X,Y) := E[(X − E[X])(Y − E[Y])].
Fact: cov(X,Y) = E[XY] − E[X]E[Y]. (Expand the product: E[XY − X E[Y] − E[X] Y + E[X]E[Y]] = E[XY] − 2E[X]E[Y] + E[X]E[Y] = E[XY] − E[X]E[Y].)
SLIDE 5 Review: Examples of Covariance
Note that E[X] = 0 and E[Y] = 0 in these examples, so cov(X,Y) = E[XY].
- When cov(X,Y) > 0, X and Y tend to be large or small together: X and Y are said to be positively correlated.
- When cov(X,Y) < 0, when X is larger, Y tends to be smaller: X and Y are said to be negatively correlated.
- When cov(X,Y) = 0, we say that X and Y are uncorrelated.
SLIDE 6
Review: Linear Regression – Non-Bayesian
Definition: Given the samples {(Xn,Yn), n = 1,...,N}, the Linear Regression of Y over X is Ŷ = a + bX, where (a,b) minimize
∑_{n=1}^{N} (Yn − a − bXn)².
Thus, Ŷn = a + bXn is our guess about Yn given Xn. The squared error is (Yn − Ŷn)². The LR minimizes the sum of the squared errors.
Note: This is a non-Bayesian formulation: there is no prior.
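For concreteness, here is a minimal Python sketch of this fit. The data arrays are hypothetical stand-ins for the samples; numpy.polyfit serves only as a cross-check.

import numpy as np

# Hypothetical (height, weight) samples standing in for (Xn, Yn).
X = np.array([1.5, 1.6, 1.7, 1.8, 1.9])
Y = np.array([52.0, 60.0, 65.0, 77.0, 84.0])

# Closed-form minimizer of sum_n (Yn - a - b*Xn)^2:
b = ((X * Y).mean() - X.mean() * Y.mean()) / ((X**2).mean() - X.mean()**2)
a = Y.mean() - b * X.mean()

# Cross-check against NumPy's least-squares polynomial fit.
b_np, a_np = np.polyfit(X, Y, deg=1)
assert np.allclose([a, b], [a_np, b_np])
print(a, b)

The expression used for b is exactly the cov(X,Y)/var(X) ratio that the LLSE theorem below gives.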
SLIDE 7
Review: Linear Least Squares Estimate (LLSE)
Definition: Given two RVs X and Y with a known distribution Pr[X = x, Y = y], the Linear Least Squares Estimate of Y given X is Ŷ = a + bX =: L[Y|X], where (a,b) minimize g(a,b) := E[(Y − a − bX)²].
Thus, Ŷ = a + bX is our guess about Y given X. The squared error is (Y − Ŷ)². The LLSE minimizes the expected value of the squared error.
Note: This is a Bayesian formulation: there is a prior.
SLIDE 8 Review: LR: Non-Bayesian or Uniform?
Observe that
(1/N) ∑_{n=1}^{N} (Yn − a − bXn)² = E[(Y − a − bX)²],
where one assumes that (X,Y) = (Xn,Yn) w.p. 1/N for n = 1,...,N. That is, the non-Bayesian LR is equivalent to the Bayesian LLSE that assumes that (X,Y) is uniform on the set of samples.
Thus, we can study the two cases LR and LLSE in one shot. However, the interpretations are different!
SLIDES 9–16
Review: LLSE
Theorem: Consider two RVs X, Y with a given distribution Pr[X = x, Y = y]. Then
L[Y|X] = Ŷ = E[Y] + (cov(X,Y)/var(X)) (X − E[X]).
Non-Bayesian setting (plug in the sample averages):
E[X] = (1/N) ∑_{n=1}^{N} Xn; E[Y] = (1/N) ∑_{n=1}^{N} Yn;
Var[X] = E[X²] − (E[X])² = (1/N) ∑_{n=1}^{N} Xn² − ((1/N) ∑_{n=1}^{N} Xn)²;
Cov(X,Y) = E[XY] − E[X]E[Y] = (1/N) ∑_{n=1}^{N} XnYn − ((1/N) ∑_{n=1}^{N} Xn)((1/N) ∑_{n=1}^{N} Yn).
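In the Bayesian setting, the same formula is evaluated under the given distribution rather than the empirical one. A sketch, assuming a small made-up joint pmf:

import numpy as np

# Hypothetical joint pmf Pr[X = x, Y = y] over a few points.
xs    = np.array([-1.0, 0.0, 1.0, 1.0])
ys    = np.array([-1.0, 0.0, 1.0, 2.0])
probs = np.array([0.25, 0.25, 0.25, 0.25])

EX, EY = probs @ xs, probs @ ys
var_X  = probs @ (xs - EX)**2
cov_XY = probs @ ((xs - EX) * (ys - EY))

slope = cov_XY / var_X
print(f"L[Y|X] = {EY:.3f} + {slope:.3f}*(X - {EX:.3f})")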
SLIDES 17–19
LR: Illustration
Note that
◮ the LR line goes through (E[X], E[Y]);
◮ its slope is cov(X,Y)/var(X).
SLIDE 20
Linear Regression: Examples
SLIDES 23–37
Linear Regression: Example 2
We find:
E[X] = 0; E[Y] = 0; E[X²] = 1/2; E[XY] = 1/2;
var[X] = E[X²] − E[X]² = 1/2; cov(X,Y) = E[XY] − E[X]E[Y] = 1/2;
LR: Ŷ = E[Y] + (cov(X,Y)/var[X]) (X − E[X]) = X.
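The actual sample points are on the slide's figure; the four equally likely points below are a hypothetical reconstruction chosen only to match the stated moments.

import numpy as np

# Hypothetical points with E[X] = 0, E[Y] = 0, E[X^2] = 1/2, E[XY] = 1/2.
X = np.array([-1.0, 0.0, 0.0, 1.0])
Y = np.array([-1.0, 0.0, 0.0, 1.0])

EX, EY = X.mean(), Y.mean()           # 0, 0
var_X  = (X**2).mean() - EX**2        # 1/2
cov_XY = (X * Y).mean() - EX * EY     # 1/2

slope = cov_XY / var_X                # 1
intercept = EY - slope * EX           # 0
print(intercept, slope)               # LR: Y_hat = X

Flipping the sign of every Yn makes E[XY] = −1/2, which is exactly Example 3 below: the same computation then gives Ŷ = −X.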
SLIDES 38–52
Linear Regression: Example 3
We find:
E[X] = 0; E[Y] = 0; E[X²] = 1/2; E[XY] = −1/2;
var[X] = E[X²] − E[X]² = 1/2; cov(X,Y) = E[XY] − E[X]E[Y] = −1/2;
LR: Ŷ = E[Y] + (cov(X,Y)/var[X]) (X − E[X]) = −X.
SLIDES 53–56
Estimation Error
We saw that the LLSE of Y given X is L[Y|X] = Ŷ = E[Y] + (cov(X,Y)/var(X)) (X − E[X]). How good is this estimator? That is, what is the mean squared estimation error? We find
E[|Y − L[Y|X]|²] = E[(Y − E[Y] − (cov(X,Y)/var(X))(X − E[X]))²]
= E[(Y − E[Y])²] − 2(cov(X,Y)/var(X)) E[(Y − E[Y])(X − E[X])] + (cov(X,Y)/var(X))² E[(X − E[X])²]
= var(Y) − cov(X,Y)²/var(X).
Without observations, the best guess is E[Y], and the error is var(Y). Observing X reduces the error.
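A quick Monte Carlo check of this error formula; the linear-plus-noise model below is an arbitrary assumption used only for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)
Y = 2 * X + rng.normal(size=100_000)   # hypothetical linear model plus unit-variance noise

var_X, var_Y = X.var(), Y.var()
cov_XY = np.cov(X, Y, bias=True)[0, 1]

Y_hat = Y.mean() + (cov_XY / var_X) * (X - X.mean())
print(np.mean((Y - Y_hat)**2))      # empirical mean squared error
print(var_Y - cov_XY**2 / var_X)    # var(Y) - cov(X,Y)^2/var(X), both ~1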
SLIDES 57–60
Wrap-up of Linear Regression
Linear Regression
- 1. Linear Regression: L[Y|X] = E[Y] + (cov(X,Y)/var(X)) (X − E[X])
- 2. Non-Bayesian: minimize ∑_{n=1}^{N} (Yn − a − bXn)²
- 3. Bayesian: minimize E[(Y − a − bX)²]
SLIDES 61–71
Beyond Linear Regression: Discussion
Goal: guess the value of Y in the expected-squared-error sense. We know nothing about Y other than its distribution. Our best guess is? E[Y].
Now assume we make some observation X related to Y. How do we use that observation to improve our guess about Y?
Idea: use a function g(X) of the observation to estimate Y.
LR: restriction to linear functions g(X) = a + bX.
With no such constraints, what is the best g(X)? Answer: E[Y|X]. This is called the Conditional Expectation (CE).
SLIDES 72–79
Nonlinear Regression: Motivation
There are many situations where a good guess about Y given X is not linear. E.g., (diameter of object, weight), (school years, income), (PSA level, cancer risk).
Our goal: explore estimates Ŷ = g(X) for nonlinear functions g(·).
SLIDES 80–92
Quadratic Regression
Let X, Y be two random variables defined on the same probability space.
Definition: The quadratic regression of Y over X is the random variable
Q[Y|X] = a + bX + cX²,
where a, b, c are chosen to minimize E[(Y − a − bX − cX²)²].
Derivation: We set to zero the derivatives w.r.t. a, b, c. We get
0 = E[Y − a − bX − cX²]
0 = E[(Y − a − bX − cX²) X]
0 = E[(Y − a − bX − cX²) X²].
We solve these three equations in the three unknowns (a, b, c); a sketch of that solve appears below.
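The three conditions are linear in (a, b, c), so they reduce to one 3×3 linear system in the moments E[X^k] and E[Y X^k]. A minimal sketch, with sample moments standing in for the expectations and a made-up quadratic dataset:

import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=10_000)
Y = X**2 + 0.1 * rng.normal(size=10_000)   # hypothetical quadratic relationship plus noise

# Normal equations E[(Y - a - bX - cX^2) X^k] = 0 for k = 0, 1, 2,
# i.e. M @ (a, b, c) = v with M[k, j] = E[X^(k+j)] and v[k] = E[Y X^k].
M = np.array([[np.mean(X**(k + j)) for j in range(3)] for k in range(3)])
v = np.array([np.mean(Y * X**k) for k in range(3)])
a, b, c = np.linalg.solve(M, v)
print(a, b, c)   # close to 0, 0, 1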
SLIDES 93–96
Conditional Expectation
Definition: Let X and Y be RVs on Ω. The conditional expectation of Y given X is defined as E[Y|X] = g(X), where
g(x) := E[Y|X = x] := ∑_y y Pr[Y = y | X = x].
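As a sketch of how the definition computes in practice (the joint pmf below is made up for illustration):

# Hypothetical joint pmf: (x, y) -> Pr[X = x, Y = y].
pmf = {(0, 0): 0.2, (0, 1): 0.2, (1, 1): 0.3, (1, 3): 0.3}

def g(x):
    """g(x) = E[Y | X = x] = sum_y y * Pr[Y = y | X = x]."""
    px = sum(p for (xx, _), p in pmf.items() if xx == x)            # Pr[X = x]
    return sum(y * p for (xx, y), p in pmf.items() if xx == x) / px

print(g(0), g(1))   # 0.5 2.0

Then E[Y|X] is the random variable g(X): it takes the value g(x) whenever X = x.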
SLIDES 97–110
Déjà vu, all over again?
Have we seen this before? Yes. Is anything new? Yes: the idea of defining g(x) = E[Y|X = x] and then E[Y|X] = g(X). Big deal? Quite! Simple but most convenient.
Recall that L[Y|X] = a + bX is a function of X. This is similar: E[Y|X] = g(X) for some function g(·). In general, g(X) is not linear, i.e., not a + bX. It could be that g(X) = a + bX + cX², or that g(X) = 2sin(4X) + exp{−3X}, or something else.
SLIDES 111–117
Properties of CE
E[Y|X = x] = ∑_y y Pr[Y = y | X = x]
Theorem
(a) X, Y independent ⇒ E[Y|X] = E[Y];
(b) E[aY + bZ | X] = a E[Y|X] + b E[Z|X];
(c) E[Y h(X) | X] = h(X) E[Y|X], ∀ h(·);
(d) E[E[Y|X]] = E[Y].
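Property (d), the tower property, is easy to check numerically; a sketch reusing the same style of made-up joint pmf as above:

# Hypothetical joint pmf: (x, y) -> Pr[X = x, Y = y].
pmf = {(0, 0): 0.2, (0, 1): 0.2, (1, 1): 0.3, (1, 3): 0.3}

EY = sum(y * p for (_, y), p in pmf.items())   # E[Y] directly

# E[E[Y|X]] = sum over x of Pr[X = x] * E[Y | X = x].
EEYX = 0.0
for x in {x for (x, _) in pmf}:
    px  = sum(p for (xx, _), p in pmf.items() if xx == x)
    g_x = sum(y * p for (xx, y), p in pmf.items() if xx == x) / px
    EEYX += px * g_x

print(EY, EEYX)   # both 1.4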
SLIDES 118–124
Calculating E[Y|X]
Let X, Y, Z be i.i.d. with mean 0 and variance 1. We want to calculate E[2 + 5X + 7XY + 11X² + 13X³Z² | X]. We find
E[2 + 5X + 7XY + 11X² + 13X³Z² | X]
= 2 + 5X + 7X E[Y|X] + 11X² + 13X³ E[Z²|X]   (by (b) and (c))
= 2 + 5X + 7X E[Y] + 11X² + 13X³ E[Z²]   (by (a), independence)
= 2 + 5X + 11X² + 13X³ (var[Z] + E[Z]²) = 2 + 5X + 11X² + 13X³.
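A Monte Carlo sketch of this calculation; standard normal RVs are one convenient (assumed) choice of i.i.d. mean-0, variance-1 RVs, and conditioning on X is mimicked by fixing x and sampling only Y and Z:

import numpy as np

rng = np.random.default_rng(2)

def mc_cond_exp(x, n=1_000_000):
    """Estimate E[2 + 5X + 7XY + 11X^2 + 13X^3 Z^2 | X = x] by sampling Y, Z."""
    Y = rng.normal(size=n)   # i.i.d., mean 0, variance 1
    Z = rng.normal(size=n)
    return np.mean(2 + 5*x + 7*x*Y + 11*x**2 + 13*x**3 * Z**2)

for x in (0.0, 1.0, -2.0):
    print(x, mc_cond_exp(x), 2 + 5*x + 11*x**2 + 13*x**3)   # estimate vs. formula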
SLIDES 125–128
CE = MMSE
(Conditional Expectation = Minimum Mean Squared Error)
Theorem: g(X) := E[Y|X] is the function of X that minimizes E[(Y − g(X))²].
That is, E[Y|X] is the 'best' guess about Y based on X: among all functions of X, it achieves the smallest expected squared error.
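To see the theorem at work, here is a sketch on a deliberately nonlinear (assumed) example: Y = X² with X standard normal, so E[Y|X] = X² exactly, while cov(X,Y) = 0 and the best linear guess is just the constant E[Y].

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=100_000)
Y = X**2                     # here E[Y|X] = X^2 exactly, and cov(X, Y) ~ 0

cov_XY = np.cov(X, Y, bias=True)[0, 1]
L = Y.mean() + (cov_XY / X.var()) * (X - X.mean())   # LLSE, ~ constant 1

print(np.mean((Y - L)**2))      # ~2: error of the best linear guess
print(np.mean((Y - X**2)**2))   # 0: error of the conditional expectation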
SLIDES 129–132
Summary
Linear and Non-Linear Regression: Conditional Expectation
◮ Linear Regression: L[Y|X] = E[Y] + (cov(X,Y)/var(X)) (X − E[X])
◮ Non-linear Regression (MMSE): E[Y|X] minimizes E[(Y − g(X))²] over all g(·)
◮ Definition: E[Y|X] = g(X), where g(x) := ∑_y y Pr[Y = y | X = x]