Expectation
DS GA 1002 Probability and Statistics for Data Science
Carlos Fernandez-Granda


slide-1
SLIDE 1

Expectation

DS GA 1002 Probability and Statistics for Data Science

http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17 Carlos Fernandez-Granda

slide-2
SLIDE 2

Aim

Describe random variables with a few numbers: mean, variance, covariance

slide-3
SLIDE 3

Expectation operator
Mean and variance
Covariance
Conditional expectation

slide-4
SLIDE 4

Discrete random variables

Average of the values of a function weighted by the pmf:

E(g(X)) = ∑_{x ∈ R} g(x) p_X(x)

E(g(X, Y)) = ∑_{x ∈ R_X} ∑_{y ∈ R_Y} g(x, y) p_{X,Y}(x, y)

E(g(X)) = ∑_{x_1} ∑_{x_2} · · · ∑_{x_n} g(x) p_X(x)   (X an n-dimensional random vector)

slide-5
SLIDE 5

Continuous random variables

Average of the values of a function weighted by the pdf:

E(g(X)) = ∫_{x=−∞}^{∞} g(x) f_X(x) dx

E(g(X, Y)) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy

E(g(X)) = ∫_{x_1=−∞}^{∞} ∫_{x_2=−∞}^{∞} · · · ∫_{x_n=−∞}^{∞} g(x) f_X(x) dx_1 dx_2 . . . dx_n   (X an n-dimensional random vector)

slide-6
SLIDE 6

Discrete and continuous random variables

E(g(C, D)) = ∫_{c=−∞}^{∞} ∑_{d ∈ R_D} g(c, d) f_C(c) p_{D|C}(d|c) dc = ∑_{d ∈ R_D} ∫_{c=−∞}^{∞} g(c, d) p_D(d) f_{C|D}(c|d) dc

slide-7
SLIDE 7

St Petersburg paradox

A casino offers you a game: flip an unbiased coin until it lands on heads. You get 2^k dollars, where k is the number of flips. Expected gain?

slide-8
SLIDE 8

St Petersburg paradox

E(Gain) = ∑_{k=1}^{∞} 2^k · (1/2^k)

slide-9
SLIDE 9

St Petersburg paradox

E(Gain) = ∑_{k=1}^{∞} 2^k · (1/2^k) = ∞
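The divergence can be checked numerically. Below is a quick Python sketch (not part of the slides; all names are illustrative): truncating the game at a maximum number of flips gives an expected gain equal to that maximum, since every term of the sum contributes exactly one dollar, and a Monte Carlo average of simulated plays is finite but keeps drifting upward as rare long runs of tails appear.

```python
import random

def truncated_expected_gain(max_flips: int) -> float:
    # Each term 2^k * (1/2)^k contributes exactly 1 dollar, so capping
    # the game at max_flips flips gives an expected gain of max_flips.
    return sum(2**k * 0.5**k for k in range(1, max_flips + 1))

def play_once(rng: random.Random) -> int:
    # Flip a fair coin until it lands on heads; the payoff is 2^k dollars,
    # where k is the total number of flips.
    k = 1
    while rng.random() < 0.5:  # tails with probability 1/2
        k += 1
    return 2**k

rng = random.Random(0)
empirical_mean = sum(play_once(rng) for _ in range(100_000)) / 100_000
```

The truncated sum grows without bound as the cap grows, which is exactly the statement E(Gain) = ∞.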

slide-10
SLIDE 10

Linearity of expectation

For any constants a and b and any functions g_1 and g_2:

E(a g_1(X, Y) + b g_2(X, Y)) = a E(g_1(X, Y)) + b E(g_2(X, Y))

Follows from linearity of sums and integrals:

∑_{x ∈ R_X} ∑_{y ∈ R_Y} (a g_1(x, y) + b g_2(x, y)) p_{X,Y}(x, y) = a ∑_{x ∈ R_X} ∑_{y ∈ R_Y} g_1(x, y) p_{X,Y}(x, y) + b ∑_{x ∈ R_X} ∑_{y ∈ R_Y} g_2(x, y) p_{X,Y}(x, y)

slide-11
SLIDE 11

Example: Coffee beans

◮ Company buys coffee beans from two local producers
◮ Beans from Colombia: C tons/year
◮ Beans from Vietnam: V tons/year
◮ Model:
  ◮ C uniform between 0 and 1
  ◮ V uniform between 0 and 2
  ◮ C and V independent
◮ What is the expected total amount of beans B?

slide-12
SLIDE 12

Example: Coffee beans

E (C + V )

slide-13
SLIDE 13

Example: Coffee beans

E (C + V ) = E (C) + E (V )

slide-14
SLIDE 14

Example: Coffee beans

E (C + V ) = E (C) + E (V ) = 0.5 + 1 = 1.5 tons

slide-15
SLIDE 15

Example: Coffee beans

E(C + V) = E(C) + E(V) = 0.5 + 1 = 1.5 tons. This holds even if C and V are not independent
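A simulation confirms the computation; this Python sketch (not part of the slides) draws from the model above and averages the totals.

```python
import random

rng = random.Random(1)
n = 200_000
# Model from the slides: C uniform on [0, 1], V uniform on [0, 2], independent
totals = [rng.uniform(0, 1) + rng.uniform(0, 2) for _ in range(n)]
mean_total = sum(totals) / n  # should be close to E(C) + E(V) = 0.5 + 1 = 1.5
```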

slide-16
SLIDE 16

Independence

If X, Y are independent then E (g (X) h (Y )) = E (g (X)) E (h (Y ))

slide-17
SLIDE 17

Independence

E(g(X) h(Y)) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x) h(y) f_{X,Y}(x, y) dx dy

slide-18
SLIDE 18

Independence

E(g(X) h(Y)) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x) h(y) f_{X,Y}(x, y) dx dy = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x) h(y) f_X(x) f_Y(y) dx dy

slide-19
SLIDE 19

Independence

E(g(X) h(Y)) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x) h(y) f_{X,Y}(x, y) dx dy = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} g(x) h(y) f_X(x) f_Y(y) dx dy = E(g(X)) E(h(Y))
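The product rule for independent variables is easy to check by simulation. In this Python sketch (not from the slides), g and h are arbitrary illustrative functions and X, Y are independent uniforms, so E(g(X)h(Y)) should match E(g(X))E(h(Y)).

```python
import random

rng = random.Random(2)
n = 200_000
xs = [rng.uniform(0, 1) for _ in range(n)]
ys = [rng.uniform(0, 1) for _ in range(n)]  # drawn independently of xs

g = lambda x: x * x     # illustrative choices of g and h
h = lambda y: 2 * y + 1

lhs = sum(g(x) * h(y) for x, y in zip(xs, ys)) / n       # E(g(X) h(Y))
rhs = (sum(map(g, xs)) / n) * (sum(map(h, ys)) / n)      # E(g(X)) E(h(Y))
```

Here the exact value is E(X²) E(2Y + 1) = (1/3) · 2 = 2/3, and both estimates land close to it.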

slide-20
SLIDE 20

Expectation operator
Mean and variance
Covariance
Conditional expectation

slide-21
SLIDE 21

Mean

The mean or first moment of X is E(X). It is the center of mass of the distribution

slide-22
SLIDE 22

Bernoulli

E (X) = 0 · pX (0) + 1 · pX (1) = p

slide-23
SLIDE 23

Binomial

A binomial is a sum of n Bernoulli random variables: X = ∑_{i=1}^{n} B_i

slide-24
SLIDE 24

Binomial

A binomial is a sum of n Bernoulli random variables: X = ∑_{i=1}^{n} B_i

E(X) = E(∑_{i=1}^{n} B_i)

slide-25
SLIDE 25

Binomial

A binomial is a sum of n Bernoulli random variables: X = ∑_{i=1}^{n} B_i

E(X) = E(∑_{i=1}^{n} B_i) = ∑_{i=1}^{n} E(B_i)

slide-26
SLIDE 26

Binomial

A binomial is a sum of n Bernoulli random variables: X = ∑_{i=1}^{n} B_i

E(X) = E(∑_{i=1}^{n} B_i) = ∑_{i=1}^{n} E(B_i) = np
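The np formula can be verified empirically by building binomial draws exactly as the slide does, as a sum of Bernoulli draws. A Python sketch (not from the slides; n and p are illustrative):

```python
import random

rng = random.Random(3)
n, p = 20, 0.3
trials = 100_000
# Draw Binomial(n, p) explicitly as a sum of n Bernoulli(p) variables
draws = [sum(rng.random() < p for _ in range(n)) for _ in range(trials)]
mean_x = sum(draws) / trials  # should be close to n * p = 6
```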

slide-27
SLIDE 27

Mean of important random variables

Random variable   Parameters   Mean
Bernoulli         p            p
Geometric         p            1/p
Binomial          n, p         np
Poisson           λ            λ
Uniform           a, b         (a + b)/2
Exponential       λ            1/λ
Gaussian          µ, σ         µ

slide-28
SLIDE 28

Cauchy random variable

(plot of the pdf f_X(x))  f_X(x) = 1 / (π(1 + x²))

slide-29
SLIDE 29

Cauchy random variable

E(X) = ∫_{−∞}^{∞} x / (π(1 + x²)) dx = ∫_{0}^{∞} x / (π(1 + x²)) dx − ∫_{0}^{∞} x / (π(1 + x²)) dx

slide-30
SLIDE 30

Cauchy random variable

E(X) = ∫_{−∞}^{∞} x / (π(1 + x²)) dx = ∫_{0}^{∞} x / (π(1 + x²)) dx − ∫_{0}^{∞} x / (π(1 + x²)) dx

∫_{0}^{∞} x / (π(1 + x²)) dx = ∫_{0}^{∞} 1 / (2π(1 + t)) dt = lim_{t→∞} log(1 + t) / (2π)

slide-31
SLIDE 31

Cauchy random variable

E(X) = ∫_{−∞}^{∞} x / (π(1 + x²)) dx = ∫_{0}^{∞} x / (π(1 + x²)) dx − ∫_{0}^{∞} x / (π(1 + x²)) dx

∫_{0}^{∞} x / (π(1 + x²)) dx = ∫_{0}^{∞} 1 / (2π(1 + t)) dt = lim_{t→∞} log(1 + t) / (2π) = ∞
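Because the Cauchy mean is undefined, the sample mean never stabilizes: a single extreme draw can move it arbitrarily far at any point. This Python sketch (not from the slides) draws Cauchy samples by inverse-CDF sampling and records the running mean.

```python
import math
import random

rng = random.Random(4)

def cauchy_sample() -> float:
    # Inverse-CDF sampling: if U ~ Uniform(0, 1), then tan(pi (U - 1/2))
    # is a standard Cauchy draw.
    return math.tan(math.pi * (rng.random() - 0.5))

# Track the running mean; unlike for distributions with a finite mean,
# it does not converge as the sample grows.
running_means = []
total = 0.0
for i in range(1, 100_001):
    total += cauchy_sample()
    if i % 10_000 == 0:
        running_means.append(total / i)
```

Plotting `running_means` for several seeds shows erratic jumps instead of convergence.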

slide-32
SLIDE 32

Mean of a random vector

Vector formed by the means of its components:

E(X) := ( E(X_1), E(X_2), . . . , E(X_n) )^T

By linearity of expectation, for any matrix A ∈ R^{m×n} and b ∈ R^m:

E(A X + b) = A E(X) + b

slide-33
SLIDE 33

The mean as a typical value

The mean is a typical value of the random variable. The probability that X equals E(X) can be zero. The mean can be severely distorted by a subset of extreme values

slide-34
SLIDE 34

Density with subset of extreme values

(plot of f_X) Uniform random variable X with support [−4.5, 4.5] ∪ [99.5, 100.5]

slide-35
SLIDE 35

Density with subset of extreme values

E(X) = ∫_{−4.5}^{4.5} x f_X(x) dx + ∫_{99.5}^{100.5} x f_X(x) dx = (1/10) · (100.5² − 99.5²)/2 = 10

slide-36
SLIDE 36

Density with subset of extreme values

(plot of f_X)

slide-37
SLIDE 37

Median

Midpoint of the distribution: a number m such that P(X ≤ m) ≥ 1/2 and P(X ≥ m) ≥ 1/2

For continuous random variables:

F_X(m) = ∫_{−∞}^{m} f_X(x) dx = 1/2

slide-38
SLIDE 38

Density with subset of extreme values

F_X(m) = ∫_{−4.5}^{m} f_X(x) dx = (m + 4.5)/10

slide-39
SLIDE 39

Density with subset of extreme values

F_X(m) = ∫_{−4.5}^{m} f_X(x) dx = (m + 4.5)/10 = 1/2  ⇒  m = 0.5

slide-40
SLIDE 40

Density with subset of extreme values

(plot of f_X with the mean and the median marked)
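The gap between mean and median is easy to reproduce by sampling from this two-piece uniform density. A Python sketch (not from the slides):

```python
import random
import statistics

rng = random.Random(5)

def sample() -> float:
    # Uniform over [-4.5, 4.5] ∪ [99.5, 100.5]: the first piece carries
    # 9/10 of the probability mass, the extreme piece only 1/10.
    if rng.random() < 0.9:
        return rng.uniform(-4.5, 4.5)
    return rng.uniform(99.5, 100.5)

data = [sample() for _ in range(200_000)]
mean_x = statistics.fmean(data)     # dragged toward the extreme piece (near 10)
median_x = statistics.median(data)  # insensitive to it (near 0.5)
```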

slide-41
SLIDE 41

Variance

The mean square or second moment of X is E(X²)

The variance of X is

Var(X) := E((X − E(X))²) = E(X² − 2X E(X) + E²(X)) = E(X²) − E²(X)

The standard deviation of X is σ_X := √Var(X)
slide-42
SLIDE 42

Bernoulli

E(X²) = 0 · p_X(0) + 1 · p_X(1) = p

Var(X) = E(X²) − E²(X) = p − p² = p(1 − p)

slide-43
SLIDE 43

Variance of common random variables

Random variable   Parameters   Variance
Bernoulli         p            p(1 − p)
Geometric         p            (1 − p)/p²
Binomial          n, p         np(1 − p)
Poisson           λ            λ
Uniform           a, b         (b − a)²/12
Exponential       λ            1/λ²
Gaussian          µ, σ         σ²

slide-44
SLIDE 44

Geometric (p = 0.2)

(plot of the pmf p_X(k))

slide-45
SLIDE 45

Binomial (n = 20, p = 0.5)

(plot of the pmf p_X(k))

slide-46
SLIDE 46

Poisson (λ = 25)

(plot of the pmf p_X(k))

slide-47
SLIDE 47

Uniform [0, 1]

(plot of the pdf f_X(x))

slide-48
SLIDE 48

Exponential (λ = 1)

(plot of the pdf f_X(x))

slide-49
SLIDE 49

Gaussian (µ = 0, σ = 1)

(plot of the pdf f_X(x))

slide-50
SLIDE 50

Variance

The variance operator is not linear, but

Var(aX + b) = E((aX + b − E(aX + b))²) = E((aX + b − a E(X) − b)²) = a² E((X − E(X))²) = a² Var(X)

slide-51
SLIDE 51

Bounding probabilities using expectations

Aim: Characterize behavior of X to some extent using E (X) and Var (X)

slide-52
SLIDE 52

Markov’s inequality

For any nonnegative random variable X and any a > 0: P(X ≥ a) ≤ E(X)/a

slide-53
SLIDE 53

Markov’s inequality

Consider the indicator variable 1_{X≥a}: X − a · 1_{X≥a} ≥ 0

slide-54
SLIDE 54

Markov’s inequality

Consider the indicator variable 1_{X≥a}: X − a · 1_{X≥a} ≥ 0, so E(X) ≥ a E(1_{X≥a})

slide-55
SLIDE 55

Markov’s inequality

Consider the indicator variable 1_{X≥a}: X − a · 1_{X≥a} ≥ 0, so E(X) ≥ a E(1_{X≥a}) = a P(X ≥ a)

slide-56
SLIDE 56

Age of students at NYU

Mean: 20 years. How many are younger than 30?

slide-57
SLIDE 57

Age of students at NYU

Mean: 20 years. How many are younger than 30? P(A ≥ 30) ≤ E(A)/30

slide-58
SLIDE 58

Age of students at NYU

Mean: 20 years. How many are younger than 30? P(A ≥ 30) ≤ E(A)/30 = 2/3. At least 1/3 are younger than 30
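Markov's inequality can be watched in action with any nonnegative model for the ages. In this Python sketch (not from the slides), the exponential distribution is only an illustrative choice; the bound requires nothing but nonnegativity and the mean.

```python
import random

rng = random.Random(6)
n = 100_000
# Toy "age" variable: exponential with mean 20. The exponential is an
# illustrative choice; Markov's inequality needs only nonnegativity.
ages = [rng.expovariate(1 / 20) for _ in range(n)]

mean_age = sum(ages) / n
frac_at_least_30 = sum(a >= 30 for a in ages) / n
markov_bound = mean_age / 30  # P(A >= 30) <= E(A) / 30
```

The empirical fraction sits well below the bound, as expected: Markov's inequality is loose but universal.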

slide-59
SLIDE 59

Chebyshev’s inequality

For any positive constant a > 0: P(|X − E(X)| ≥ a) ≤ Var(X)/a²

slide-60
SLIDE 60

Chebyshev’s inequality

For any positive constant a > 0: P(|X − E(X)| ≥ a) ≤ Var(X)/a². Corollary: if Var(X) = 0 then P(X = E(X)) = 1

slide-61
SLIDE 61

Chebyshev’s inequality

For any positive constant a > 0: P(|X − E(X)| ≥ a) ≤ Var(X)/a². Corollary: if Var(X) = 0 then P(X = E(X)) = 1, since for any ε > 0, P(|X − E(X)| ≥ ε) ≤ Var(X)/ε² = 0

slide-62
SLIDE 62

Chebyshev’s inequality

Define Y := (X − E(X))². By Markov's inequality, P(|X − E(X)| ≥ a) = P(Y ≥ a²)
slide-63
SLIDE 63

Chebyshev’s inequality

Define Y := (X − E(X))². By Markov's inequality, P(|X − E(X)| ≥ a) = P(Y ≥ a²) ≤ E(Y)/a²

slide-64
SLIDE 64

Chebyshev’s inequality

Define Y := (X − E(X))². By Markov's inequality, P(|X − E(X)| ≥ a) = P(Y ≥ a²) ≤ E(Y)/a² = Var(X)/a²

slide-65
SLIDE 65

Age of students at NYU

Mean: 20 years, standard deviation: 3 years. How many are younger than 30?

slide-66
SLIDE 66

Age of students at NYU

Mean: 20 years, standard deviation: 3 years. How many are younger than 30? P(A ≥ 30) ≤ P(|A − 20| ≥ 10)

slide-67
SLIDE 67

Age of students at NYU

Mean: 20 years, standard deviation: 3 years. How many are younger than 30? P(A ≥ 30) ≤ P(|A − 20| ≥ 10) ≤ Var(A)/100 = 9/100. At least 91% are younger than 30
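The Chebyshev bound can also be checked numerically. In this Python sketch (not from the slides), the Gaussian is only an illustrative model for the ages; the bound holds for any distribution with mean 20 and standard deviation 3.

```python
import random

rng = random.Random(7)
n = 100_000
# Toy age model with mean 20 and standard deviation 3 (illustrative Gaussian)
ages = [rng.gauss(20, 3) for _ in range(n)]

frac_far = sum(abs(a - 20) >= 10 for a in ages) / n
chebyshev_bound = 9 / 100  # Var(A) / 10^2
```

For the Gaussian the true probability of being 10 or more from the mean is far below 9/100, which illustrates that knowing the variance gives a much tighter bound than Markov's, yet Chebyshev is still conservative.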

slide-68
SLIDE 68

Expectation operator
Mean and variance
Covariance
Conditional expectation

slide-69
SLIDE 69

Covariance

The covariance of X and Y is

Cov(X, Y) := E((X − E(X))(Y − E(Y))) = E(XY − Y E(X) − X E(Y) + E(X) E(Y)) = E(XY) − E(X) E(Y)

If Cov(X, Y) = 0, X and Y are uncorrelated

slide-70
SLIDE 70

Covariance

(scatter plots for Cov(X, Y) = 0.5, 0.9, 0.99 and Cov(X, Y) = −0.9, −0.99)
slide-71
SLIDE 71

Variance of the sum

Var(X + Y) = E((X + Y − E(X + Y))²) = E((X − E(X))²) + E((Y − E(Y))²) + 2 E((X − E(X))(Y − E(Y))) = Var(X) + Var(Y) + 2 Cov(X, Y)

slide-72
SLIDE 72

Variance of the sum

Var(X + Y) = E((X + Y − E(X + Y))²) = E((X − E(X))²) + E((Y − E(Y))²) + 2 E((X − E(X))(Y − E(Y))) = Var(X) + Var(Y) + 2 Cov(X, Y)

If X and Y are uncorrelated, then Var(X + Y) = Var(X) + Var(Y)

slide-73
SLIDE 73

Independence implies uncorrelation

If X and Y are independent: Cov(X, Y) = E(XY) − E(X) E(Y) = E(X) E(Y) − E(X) E(Y) = 0

slide-74
SLIDE 74

Uncorrelation does not imply independence

X, Y are independent Bernoulli with parameter 1/2

Let U = X + Y and V = X − Y. Are U and V independent? Are they uncorrelated?

slide-75
SLIDE 75

Uncorrelation does not imply independence

Compute p_U(0), p_V(0), and p_{U,V}(0, 0)

slide-76
SLIDE 76

Uncorrelation does not imply independence

p_U(0) = P(X = 0, Y = 0) = 1/4

slide-77
SLIDE 77

Uncorrelation does not imply independence

p_U(0) = P(X = 0, Y = 0) = 1/4,  p_V(0) = P(X = 1, Y = 1) + P(X = 0, Y = 0) = 1/2

slide-78
SLIDE 78

Uncorrelation does not imply independence

p_U(0) = P(X = 0, Y = 0) = 1/4,  p_V(0) = P(X = 1, Y = 1) + P(X = 0, Y = 0) = 1/2,  p_{U,V}(0, 0) = P(X = 0, Y = 0) = 1/4

slide-79
SLIDE 79

Uncorrelation does not imply independence

p_U(0) = P(X = 0, Y = 0) = 1/4,  p_V(0) = P(X = 1, Y = 1) + P(X = 0, Y = 0) = 1/2,  p_{U,V}(0, 0) = P(X = 0, Y = 0) = 1/4 ≠ p_U(0) p_V(0) = 1/8

slide-80
SLIDE 80

Uncorrelation does not imply independence

Cov(U, V) = E(UV) − E(U) E(V) = E((X + Y)(X − Y)) − E(X + Y) E(X − Y) = E(X²) − E(Y²) − E²(X) + E²(Y)

slide-81
SLIDE 81

Uncorrelation does not imply independence

Cov(U, V) = E(UV) − E(U) E(V) = E((X + Y)(X − Y)) − E(X + Y) E(X − Y) = E(X²) − E(Y²) − E²(X) + E²(Y) = 0
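Because X and Y take only four equally likely joint values, the whole example can be verified by exact enumeration rather than simulation. A Python sketch (not from the slides):

```python
from itertools import product

# X, Y independent Bernoulli(1/2): the four outcomes are equally likely
outcomes = list(product([0, 1], repeat=2))

def expect(f):
    # Expectation under the uniform distribution over the four outcomes
    return sum(f(x, y) for x, y in outcomes) / len(outcomes)

# Covariance of U = X + Y and V = X - Y
cov_uv = (expect(lambda x, y: (x + y) * (x - y))
          - expect(lambda x, y: x + y) * expect(lambda x, y: x - y))

p_u0 = sum(1 for x, y in outcomes if x + y == 0) / 4               # P(U = 0)
p_v0 = sum(1 for x, y in outcomes if x - y == 0) / 4               # P(V = 0)
p_uv00 = sum(1 for x, y in outcomes if x + y == 0 and x - y == 0) / 4
```

The enumeration gives Cov(U, V) = 0 while p_{U,V}(0, 0) differs from p_U(0) p_V(0): uncorrelated but not independent.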

slide-82
SLIDE 82

Correlation coefficient

Pearson correlation coefficient of X and Y: ρ_{X,Y} := Cov(X, Y) / (σ_X σ_Y). It is the covariance between X/σ_X and Y/σ_Y

slide-83
SLIDE 83

Correlation coefficient

σ_Y = 1, Cov(X, Y) = 0.9, ρ_{X,Y} = 0.9
σ_Y = 3, Cov(X, Y) = 0.9, ρ_{X,Y} = 0.3
σ_Y = 3, Cov(X, Y) = 2.7, ρ_{X,Y} = 0.9

slide-84
SLIDE 84

Cauchy-Schwarz inequality

For any X and Y: |E(XY)| ≤ √(E(X²) E(Y²)), and

E(XY) = √(E(X²) E(Y²))  ⟺  Y = √(E(Y²)/E(X²)) X

E(XY) = −√(E(X²) E(Y²))  ⟺  Y = −√(E(Y²)/E(X²)) X

slide-85
SLIDE 85

Cauchy-Schwarz inequality

We have Cov(X, Y) ≤ σ_X σ_Y, or equivalently |ρ_{X,Y}| ≤ 1

In addition, |ρ_{X,Y}| = 1 ⟺ Y = cX + d, where

c := σ_Y/σ_X if ρ_{X,Y} = 1,  c := −σ_Y/σ_X if ρ_{X,Y} = −1,  d := E(Y) − c E(X)

slide-86
SLIDE 86

Covariance matrix of a random vector

The covariance matrix of X is defined as

Σ_X = [ Var(X_1)       Cov(X_1, X_2)  · · ·  Cov(X_1, X_n)
        Cov(X_2, X_1)  Var(X_2)       · · ·  Cov(X_2, X_n)
        . . .          . . .          . . .  . . .
        Cov(X_n, X_1)  Cov(X_n, X_2)  · · ·  Var(X_n) ]

    = E(X X^T) − E(X) E(X)^T

slide-87
SLIDE 87

Covariance matrix after a linear transformation

Σ_{A X + b}

slide-88
SLIDE 88

Covariance matrix after a linear transformation

Σ_{A X + b} = E((A X + b)(A X + b)^T) − E(A X + b) E(A X + b)^T

slide-89
SLIDE 89

Covariance matrix after a linear transformation

Σ_{A X + b} = E((A X + b)(A X + b)^T) − E(A X + b) E(A X + b)^T

= A E(X X^T) A^T + b E(X)^T A^T + A E(X) b^T + b b^T − A E(X) E(X)^T A^T − A E(X) b^T − b E(X)^T A^T − b b^T

slide-90
SLIDE 90

Covariance matrix after a linear transformation

Σ_{A X + b} = E((A X + b)(A X + b)^T) − E(A X + b) E(A X + b)^T

= A E(X X^T) A^T + b E(X)^T A^T + A E(X) b^T + b b^T − A E(X) E(X)^T A^T − A E(X) b^T − b E(X)^T A^T − b b^T

= A ( E(X X^T) − E(X) E(X)^T ) A^T

slide-91
SLIDE 91

Covariance matrix after a linear transformation

Σ_{A X + b} = E((A X + b)(A X + b)^T) − E(A X + b) E(A X + b)^T

= A E(X X^T) A^T + b E(X)^T A^T + A E(X) b^T + b b^T − A E(X) E(X)^T A^T − A E(X) b^T − b E(X)^T A^T − b b^T

= A ( E(X X^T) − E(X) E(X)^T ) A^T

= A Σ_X A^T
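The identity Σ_{AX+b} = A Σ_X A^T can be checked empirically. In this Python sketch (not from the slides; A, b, and the covariance are made-up illustrative values), the sample covariance of the transformed data matches the predicted matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative A, b, and covariance matrix (arbitrary values)
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.array([5.0, -3.0])
cov_x = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

x = rng.multivariate_normal(mean=np.zeros(3), cov=cov_x, size=500_000)
y = x @ A.T + b  # rows are samples of A x + b

sample_cov_y = np.cov(y, rowvar=False)
predicted_cov_y = A @ cov_x @ A.T  # the identity derived above
```

Note that adding b shifts the mean but, as the derivation shows, leaves the covariance untouched.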

slide-92
SLIDE 92

Variance in a fixed direction

For any unit vector u: Var(u^T X) = u^T Σ_X u

slide-93
SLIDE 93

Direction of maximum variance

To find the direction of maximum variance we must solve arg max_{||u||_2 = 1} u^T Σ_X u

slide-94
SLIDE 94

Linear algebra

Symmetric matrices have orthogonal eigenvectors:

Σ_X = U Λ U^T = [ u_1 u_2 · · · u_n ] diag(λ_1, λ_2, . . . , λ_n) [ u_1 u_2 · · · u_n ]^T

slide-95
SLIDE 95

Linear algebra

λ_1 = max_{||u||_2 = 1} u^T A u,  u_1 = arg max_{||u||_2 = 1} u^T A u

λ_k = max_{||u||_2 = 1, u ⊥ u_1, ..., u_{k−1}} u^T A u,  u_k = arg max_{||u||_2 = 1, u ⊥ u_1, ..., u_{k−1}} u^T A u

slide-96
SLIDE 96

Direction of maximum variance

(scatter plots with principal axes: √λ_1 = 1.22, √λ_2 = 0.71; √λ_1 = 1, √λ_2 = 1; √λ_1 = 1.38, √λ_2 = 0.32)

slide-97
SLIDE 97

Coloring

Goal: Transform uncorrelated samples with unit variance so that they have a prescribed covariance matrix Σ

1. Compute the eigendecomposition Σ = U Λ U^T
2. Set y := U √Λ x, where √Λ := diag(√λ_1, √λ_2, . . . , √λ_n)
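The two steps above translate directly into code. A Python sketch (not from the slides; the target covariance is an illustrative positive definite matrix): coloring standard Gaussian noise produces samples whose covariance matches the target.

```python
import numpy as np

def color(target_cov: np.ndarray, white: np.ndarray) -> np.ndarray:
    """Map uncorrelated unit-variance samples (rows of `white`) so that
    they acquire covariance `target_cov`, via y = U sqrt(Lambda) x."""
    eigvals, U = np.linalg.eigh(target_cov)  # Sigma = U Lambda U^T
    return white @ (U @ np.diag(np.sqrt(eigvals))).T

rng = np.random.default_rng(1)
target = np.array([[4.0, 1.2],
                   [1.2, 1.0]])             # illustrative positive definite Sigma
white = rng.standard_normal((500_000, 2))   # uncorrelated, unit variance
colored = color(target, white)
sample_cov = np.cov(colored, rowvar=False)
```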

slide-98
SLIDE 98

Coloring

Σ_Y

slide-99
SLIDE 99

Coloring

Σ_Y = U √Λ Σ_X √Λ^T U^T

slide-100
SLIDE 100

Coloring

Σ_Y = U √Λ Σ_X √Λ^T U^T = U √Λ I √Λ^T U^T

slide-101
SLIDE 101

Coloring

Σ_Y = U √Λ Σ_X √Λ^T U^T = U √Λ I √Λ^T U^T = Σ

slide-102
SLIDE 102

Coloring

(scatter plots of x, √Λ x, and U √Λ x)

slide-103
SLIDE 103

Generating Gaussian random vectors

Goal: Sampling from an n-dimensional Gaussian random vector with mean µ and covariance matrix Σ

1. Generate n independent standard Gaussian samples x
2. Compute the eigendecomposition Σ = U Λ U^T
3. Set y := U √Λ x + µ

For non-Gaussian random vectors, coloring does not necessarily preserve the distribution
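The three steps above can be sketched in Python (not from the slides; the mean and covariance are illustrative values): color standard Gaussian noise and shift it by the mean.

```python
import numpy as np

def sample_gaussian(mu, sigma, n, rng):
    """Draw n samples from N(mu, sigma) by coloring standard Gaussian noise."""
    eigvals, U = np.linalg.eigh(sigma)     # step 2: Sigma = U Lambda U^T
    x = rng.standard_normal((n, len(mu)))  # step 1: iid standard Gaussians
    return x @ (U @ np.diag(np.sqrt(eigvals))).T + mu  # step 3: y = U sqrt(L) x + mu

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])  # illustrative mean and covariance

samples = sample_gaussian(mu, sigma, 400_000, rng)
sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)
```

This works because a linear map of a Gaussian vector is Gaussian; for non-Gaussian noise the output covariance is still Σ, but the distribution is not preserved.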

slide-104
SLIDE 104

For Gaussian rvs uncorrelation implies mutual independence

Uncorrelation implies

Σ_X = diag(σ_1², σ_2², . . . , σ_n²)

which in turn implies

f_X(x) = 1/√((2π)^n |Σ|) · exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )

slide-105
SLIDE 105

For Gaussian rvs uncorrelation implies mutual independence

Uncorrelation implies

Σ_X = diag(σ_1², σ_2², . . . , σ_n²)

which in turn implies

f_X(x) = 1/√((2π)^n |Σ|) · exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) ) = ∏_{i=1}^{n} 1/(√(2π) σ_i) · exp( −(x_i − µ_i)² / (2σ_i²) )

slide-106
SLIDE 106

For Gaussian rvs uncorrelation implies mutual independence

Uncorrelation implies

Σ_X = diag(σ_1², σ_2², . . . , σ_n²)

which in turn implies

f_X(x) = 1/√((2π)^n |Σ|) · exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) ) = ∏_{i=1}^{n} 1/(√(2π) σ_i) · exp( −(x_i − µ_i)² / (2σ_i²) ) = ∏_{i=1}^{n} f_{X_i}(x_i)

slide-107
SLIDE 107

Expectation operator
Mean and variance
Covariance
Conditional expectation

slide-108
SLIDE 108

Conditional expectation

Expectation of g(X, Y) given X = x:

E(g(X, Y) | X = x) = ∫_{y=−∞}^{∞} g(x, y) f_{Y|X}(y|x) dy

Can be interpreted as a function h(x) := E(g(X, Y) | X = x)

The conditional expectation of g(X, Y) given X is E(g(X, Y) | X) := h(X). It is a random variable

slide-109
SLIDE 109

Iterated expectation

For any X and Y and any function g : R² → R: E(g(X, Y)) = E(E(g(X, Y) | X))

slide-110
SLIDE 110

Iterated expectation

h(x) := E(g(X, Y) | X = x) = ∫_{y=−∞}^{∞} g(x, y) f_{Y|X}(y|x) dy

slide-111
SLIDE 111

Iterated expectation

h(x) := E(g(X, Y) | X = x) = ∫_{y=−∞}^{∞} g(x, y) f_{Y|X}(y|x) dy

E(E(g(X, Y) | X)) = E(h(X))

slide-112
SLIDE 112

Iterated expectation

h(x) := E(g(X, Y) | X = x) = ∫_{y=−∞}^{∞} g(x, y) f_{Y|X}(y|x) dy

E(E(g(X, Y) | X)) = E(h(X)) = ∫_{x=−∞}^{∞} h(x) f_X(x) dx

slide-113
SLIDE 113

Iterated expectation

h(x) := E(g(X, Y) | X = x) = ∫_{y=−∞}^{∞} g(x, y) f_{Y|X}(y|x) dy

E(E(g(X, Y) | X)) = E(h(X)) = ∫_{x=−∞}^{∞} h(x) f_X(x) dx = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} f_X(x) f_{Y|X}(y|x) g(x, y) dy dx

slide-114
SLIDE 114

Iterated expectation

h(x) := E(g(X, Y) | X = x) = ∫_{y=−∞}^{∞} g(x, y) f_{Y|X}(y|x) dy

E(E(g(X, Y) | X)) = E(h(X)) = ∫_{x=−∞}^{∞} h(x) f_X(x) dx = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} f_X(x) f_{Y|X}(y|x) g(x, y) dy dx = E(g(X, Y))

slide-115
SLIDE 115

Example: Desert

◮ Car traveling through the desert
◮ Time until the car breaks down: T
◮ State of the motor: M
◮ State of the road: R
◮ Model:
  ◮ M uniform between 0 (no problem) and 1 (very bad)
  ◮ R uniform between 0 (no problem) and 1 (very bad)
  ◮ M and R independent
  ◮ T exponential with parameter M + R

slide-116
SLIDE 116

Example: Desert

E (T) = E (E (T|M, R))

slide-117
SLIDE 117

Example: Desert

E(T) = E(E(T | M, R)) = E(1/(M + R))

slide-118
SLIDE 118

Example: Desert

E(T) = E(E(T | M, R)) = E(1/(M + R)) = ∫_0^1 ∫_0^1 1/(m + r) dm dr

slide-119
SLIDE 119

Example: Desert

E(T) = E(E(T | M, R)) = E(1/(M + R)) = ∫_0^1 ∫_0^1 1/(m + r) dm dr = ∫_0^1 (log(r + 1) − log(r)) dr

slide-120
SLIDE 120

Example: Desert

E(T) = E(E(T | M, R)) = E(1/(M + R)) = ∫_0^1 ∫_0^1 1/(m + r) dm dr = ∫_0^1 (log(r + 1) − log(r)) dr = log 4 ≈ 1.39
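The iterated-expectation result is straightforward to verify by simulating the hierarchical model directly. A Python sketch (not from the slides):

```python
import math
import random

rng = random.Random(8)
n = 400_000
total = 0.0
for _ in range(n):
    m = rng.uniform(0, 1)            # state of the motor
    r = rng.uniform(0, 1)            # state of the road
    total += rng.expovariate(m + r)  # T | M, R is exponential with rate M + R
mean_t = total / n                   # should approach log 4, about 1.386
```

Note the simulation never uses the closed-form answer: it draws from the model exactly as specified, and the sample mean converges to log 4.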

slide-121
SLIDE 121

Grizzlies in Yellowstone

Model for the weight of grizzly bears in Yellowstone. Males: Gaussian with µ := 240 kg and σ := 40 kg. Females: Gaussian with µ := 140 kg and σ := 20 kg. There are about the same number of females and males

slide-122
SLIDE 122

Grizzlies in Yellowstone

E (W ) = E (E (W |S))

slide-123
SLIDE 123

Grizzlies in Yellowstone

E(W) = E(E(W | S)) = (E(W | S = 0) + E(W | S = 1)) / 2

slide-124
SLIDE 124

Grizzlies in Yellowstone

E(W) = E(E(W | S)) = (E(W | S = 0) + E(W | S = 1)) / 2 = (140 + 240) / 2 = 190 kg

slide-125
SLIDE 125

Bayesian coin flip

Bayesian methods often endow parameters of discrete distributions with a continuous marginal distribution

◮ You suspect a coin is biased
◮ You are uncertain about the bias, so you model it as a random variable B with pdf f_B(b) = 2b for b ∈ [0, 1]
◮ What is the expected value of the coin flip X?

slide-126
SLIDE 126

Bayesian coin flip

E (X) = E (E (X|B))

slide-127
SLIDE 127

Bayesian coin flip

E (X) = E (E (X|B)) = E (B)

slide-128
SLIDE 128

Bayesian coin flip

E(X) = E(E(X | B)) = E(B) = ∫_0^1 2b² db

slide-129
SLIDE 129

Bayesian coin flip

E(X) = E(E(X | B)) = E(B) = ∫_0^1 2b² db = 2/3
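The 2/3 answer can be reproduced by simulating the two-stage experiment: first draw the bias B, then flip the coin. A Python sketch (not from the slides); the bias is drawn by inverse-CDF sampling.

```python
import math
import random

rng = random.Random(9)
n = 200_000
heads = 0
for _ in range(n):
    # Sample the bias B with pdf f_B(b) = 2b on [0, 1] by inverse CDF:
    # F(b) = b^2, so F^{-1}(u) = sqrt(u).
    bias = math.sqrt(rng.random())
    heads += rng.random() < bias  # flip the coin once with this bias
mean_x = heads / n                # should approach E(B) = 2/3
```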