SLIDE 1

Applied Machine Learning

Multivariate Gaussian

Siamak Ravanbakhsh

COMP 551 (Fall 2020)

SLIDE 2

Admin

Late midterm exam: November 11th. The exam will be available for 72 hours; we will announce later whether it is timed.

Coding tutorial: Arnab will go over the NumPy code for the different methods, 2 pm on Wednesdays and Fridays, starting this Friday. The Zoom link will be posted.

SLIDE 3

Learning objectives

- Gaussian distribution: motivation and the functional form of its density
- covariance matrix
- correlation and dependence
- linear transformations of Gaussians
- marginalization, chain rule, and conditioning for Gaussians

SLIDE 4

Gaussian probability density function (pdf)

Univariate Gaussian density:

$$\mathcal{N}(x;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Its two parameters $\mu, \sigma^2$ turn out to be the mean and the variance:

$$\mathbb{E}[x] = \mu, \qquad \mathbb{E}[(x-\mu)^2] = \sigma^2$$

Here $x$ is a random variable; we are using the same notation for a random variable and a particular value of that variable.

SLIDE 5

Gaussian probability density function (pdf)

Univariate Gaussian density:

$$\mathcal{N}(x;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

[figure: 68.2% of the probability mass lies within one standard deviation of the mean, 95.4% within two, and 99.7% within three]

Given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$, the maximum likelihood estimates of $\mu, \sigma^2$ are the empirical mean and variance:

$$\mu_{\mathrm{MLE}} = \frac{1}{N}\sum_n x^{(n)}, \qquad \sigma^2_{\mathrm{MLE}} = \frac{1}{N}\sum_n \left(x^{(n)} - \mu_{\mathrm{MLE}}\right)^2$$

How can we derive this?
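As a quick sketch (not from the slides), the two estimates above are one line of NumPy each; the dataset here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)  # synthetic dataset D

# MLE of the mean: the empirical average
mu_mle = data.mean()

# MLE of the variance: mean squared deviation from mu_mle
# (note the 1/N normalization, not the unbiased 1/(N-1))
var_mle = ((data - mu_mle) ** 2).mean()

print(mu_mle, var_mle)  # close to 5.0 and 4.0
```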

SLIDE 6

Univariate Gaussian density

Two reasons why the Gaussian is an important distribution:

- it is the maximum-entropy distribution with a fixed variance
- the central limit theorem

Let's throw three dice repeatedly and plot the histogram of the average outcome. Looks familiar? The average (and the sum) of IID random variables tends to a Gaussian distribution; the same shape appears if we replace the dice with uniformly distributed values in [0, 1], or use 10 dice. This justifies using a Gaussian for observations that are the mean or sum of some random values. A simulation sketch follows.
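A minimal simulation of this demo, assuming matplotlib is available for the histograms:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# histogram of the average outcome of n_dice dice, thrown repeatedly
for n_dice in (1, 3, 10):
    throws = rng.integers(1, 7, size=(100_000, n_dice))
    plt.hist(throws.mean(axis=1), bins=30, density=True,
             alpha=0.5, label=f"{n_dice} dice")

plt.legend()
plt.title("Averages of IID dice throws approach a Gaussian")
plt.show()
```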

SLIDE 7

Multivariate Gaussian

$$\mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{|2\pi\Sigma|}} \exp\left( -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)$$

Compare with the univariate normal density:

$$\mathcal{N}(x;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$

Instead of $x \in \mathbb{R}$, here $\mathbf{x} \in \mathbb{R}^D$ is a $D$-dimensional (column) vector, the mean is $\boldsymbol{\mu} \in \mathbb{R}^D$, and $\Sigma$ is the $D \times D$ covariance matrix.

Determinant: for a $D \times D$ matrix we have $|cA| = c^D |A|$, so $\sqrt{|2\pi\Sigma|} = (2\pi)^{D/2}\, |\Sigma|^{1/2}$.
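A direct NumPy transcription of this density (a sketch; in practice one works with the log-density and a Cholesky factor for numerical stability):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x; mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.zeros(2)
Sigma = np.array([[9.0, 4.0], [4.0, 4.0]])
print(gaussian_pdf(np.array([1.0, 1.0]), mu, Sigma))
```

scipy.stats.multivariate_normal.pdf computes the same quantity.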

SLIDE 8

Covariance matrix

Recall the variance of a random variable and the covariance of two random variables:

$$\mathrm{Var}(x) = \mathbb{E}[(x - \mathbb{E}[x])^2] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$$

$$\mathrm{Cov}(x, y) = \mathbb{E}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])] = \mathbb{E}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]$$

For $\mathbf{x} \in \mathbb{R}^D$ we have the $D \times D$ covariance matrix, the expectation of the outer product of a $D \times 1$ vector with its $1 \times D$ transpose:

$$\Sigma = \mathbb{E}\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^\top\right] = \mathbb{E}[\mathbf{x}\mathbf{x}^\top] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{x}]^\top$$

$$\Sigma = \begin{bmatrix} \Sigma_{1,1} & \cdots & \Sigma_{1,D} \\ \vdots & \ddots & \vdots \\ \Sigma_{D,1} & \cdots & \Sigma_{D,D} \end{bmatrix}, \qquad \Sigma_{1,1} = \mathrm{Cov}(x_1, x_1) = \mathrm{Var}(x_1), \quad \Sigma_{1,D} = \mathrm{Cov}(x_1, x_D)$$

Given a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$, the empirical estimate is the sample covariance matrix:

$$\hat{\Sigma} = \Sigma_{\mathrm{MLE}} = \mathbb{E}_{\mathcal{D}}\left[(\mathbf{x} - \mathbb{E}_{\mathcal{D}}[\mathbf{x}])(\mathbf{x} - \mathbb{E}_{\mathcal{D}}[\mathbf{x}])^\top\right] = \frac{1}{N}\sum_{\mathbf{x} \in \mathcal{D}} (\mathbf{x} - \bar{\mathbf{x}})(\mathbf{x} - \bar{\mathbf{x}})^\top$$
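A sketch of the empirical estimate in NumPy, checked against np.cov (which defaults to the unbiased 1/(N-1) normalization; bias=True gives the MLE's 1/N):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[9, 4], [4, 4]], size=5_000)

x_bar = X.mean(axis=0)
Sigma_hat = (X - x_bar).T @ (X - x_bar) / len(X)  # 1/N normalization

assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))
print(Sigma_hat)  # close to [[9, 4], [4, 4]]
```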

SLIDE 9

Covariance matrix

Given a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$, the empirical estimate is the sample covariance matrix:

$$\hat{\Sigma} = \Sigma_{\mathrm{MLE}} = \frac{1}{N}\sum_{\mathbf{x} \in \mathcal{D}} (\mathbf{x} - \bar{\mathbf{x}})(\mathbf{x} - \bar{\mathbf{x}})^\top$$

Example: estimating the mean and the covariance of the Iris dataset. [figure: data with contour lines where $\mathcal{N}(\mathbf{x};\, \mu, \Sigma)$ is constant]

SLIDE 10

Covariance matrices

Examples (considering the bivariate case for visualization):

- Isotropic Gaussian: $\Sigma = \sigma^2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
- Axis-aligned: $\Sigma = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}$
- Full covariance: $\Sigma = \begin{bmatrix} 9 & 4 \\ 4 & 4 \end{bmatrix}$

SLIDE 11

Linear transformations

If $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_x, \Sigma_x)$ and $\mathbf{y} = Q\mathbf{x}$ for a $D \times D$ matrix $Q$, then $\mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_y, \Sigma_y)$ where

$$\boldsymbol{\mu}_y = \mathbb{E}[Q\mathbf{x}] = Q\,\mathbb{E}[\mathbf{x}] = Q\boldsymbol{\mu}_x$$

$$\Sigma_y = \mathbb{E}[Q\mathbf{x}\mathbf{x}^\top Q^\top] - \mathbb{E}[Q\mathbf{x}]\,\mathbb{E}[\mathbf{x}^\top Q^\top] = Q\left(\mathbb{E}[\mathbf{x}\mathbf{x}^\top] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{x}]^\top\right)Q^\top = Q\Sigma_x Q^\top$$

Example: with $\Sigma_x = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ and $Q = \begin{bmatrix} 1 & 0 \\ 2 & 4 \end{bmatrix}$ we get $\Sigma_y = Q\Sigma_x Q^\top = \begin{bmatrix} 1 & 2 \\ 2 & 20 \end{bmatrix}$.

Can we construct any multivariate Gaussian from axis-aligned Gaussians in this way?
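A quick empirical check of $\Sigma_y = Q\Sigma_x Q^\top$ by transforming samples (a sketch with $\Sigma_x = I$):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[1.0, 0.0],
              [2.0, 4.0]])

x = rng.standard_normal(size=(100_000, 2))  # samples with Sigma_x = I
y = x @ Q.T                                 # y = Qx, applied row-wise

print(np.cov(y.T, bias=True))  # close to Q @ Q.T = [[1, 2], [2, 20]]
```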

SLIDE 12

Decomposing the covariance matrix

covariance matrix is symmetric positive semi definite symmetric because positive semi definite because for any

Σ =

d,d′

Cov(x , x ) =

d d′

Cov(x , x ) =

d′ d

Σd ,d

y Σy =

(y E[(x −

E[x])(x − E[x]) ]y) =

Var(y x) ≥

y ∈ RD so we can produce any Gaussian by rotation and reflection of an axis-aligned Gaussian any symmetric positive semi-definite matrix can be decomposed as

Σ = QΛQ⊤

diagonal

  • rthogonal

D × D

QQ =

Q Q =

I (rotation and reflection)

SLIDE 13

Decomposing the covariance matrix

$$\Sigma = Q\Lambda Q^\top, \qquad QQ^\top = Q^\top Q = I$$

with $\Lambda$ diagonal ($D \times D$) and $Q$ orthogonal (rotation and reflection): we can produce any Gaussian by rotation and reflection of an axis-aligned Gaussian.

Example:

$$\Sigma = \begin{bmatrix} 10 & 5 \\ 5 & 5 \end{bmatrix} \approx \begin{bmatrix} -.85 & -.52 \\ -.52 & .85 \end{bmatrix} \begin{bmatrix} 13.09 & 0 \\ 0 & 1.90 \end{bmatrix} \begin{bmatrix} -.85 & -.52 \\ -.52 & .85 \end{bmatrix}^\top$$

The columns of $Q$ tell us where the original bases go; $\Lambda$ holds the variances of the "new axes" of the aligned Gaussian.

(more on this in the PCA lecture)
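The decomposition of this example via NumPy's symmetric eigensolver (a sketch; eigh returns eigenvalues in ascending order, so we reverse them here):

```python
import numpy as np

Sigma = np.array([[10.0, 5.0],
                  [5.0, 5.0]])

eigvals, Q = np.linalg.eigh(Sigma)      # ascending eigenvalues
Q, eigvals = Q[:, ::-1], eigvals[::-1]  # reorder to descending
Lambda = np.diag(eigvals)

assert np.allclose(Q @ Lambda @ Q.T, Sigma)
print(eigvals)  # [13.09..., 1.90...]
print(Q)        # columns: where the original bases go
```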

SLIDE 14

Marginalization

What is the distribution of IQ? Suppose people's height and IQ are jointly normally distributed; we need to marginalize over height:

$$p(y) = \int_z p(y, z)\, dz$$

$$\begin{bmatrix} x_H \\ x_{IQ} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_H \\ \mu_{IQ} \end{bmatrix}, \begin{bmatrix} \sigma_H^2 & \sigma_{H,IQ} \\ \sigma_{H,IQ} & \sigma_{IQ}^2 \end{bmatrix} \right)$$

For Gaussian distributions the marginal is also Gaussian, because marginalization corresponds to a linear transformation (the same idea extends to marginalizing out more than one variable):

$$\begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} x_H \\ x_{IQ} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} \mu_H \\ \mu_{IQ} \end{bmatrix},\; \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} \sigma_H^2 & \sigma_{H,IQ} \\ \sigma_{H,IQ} & \sigma_{IQ}^2 \end{bmatrix} \begin{bmatrix} 0 & 1 \end{bmatrix}^\top \right)$$

$$x_{IQ} \sim \mathcal{N}\left( \mu_{IQ},\ \sigma_{IQ}^2 \right)$$
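In code, marginalizing a Gaussian is just selecting the corresponding sub-vector of the mean and sub-block of the covariance; a sketch with made-up numbers:

```python
import numpy as np

mu = np.array([170.0, 100.0])   # [height, IQ], made-up values
Sigma = np.array([[49.0, 6.0],
                  [6.0, 225.0]])

# marginalization as the linear transformation Q = [0 1]
Q = np.array([[0.0, 1.0]])
print(Q @ mu, Q @ Sigma @ Q.T)  # [100.], [[225.]]

# equivalently: just index the sub-block
print(mu[1], Sigma[1, 1])
```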

SLIDE 15

Correlation and dependence

Correlation is normalized covariance:

$$\mathrm{Corr}(x_i, x_j) = \frac{\mathrm{Cov}(x_i, x_j)}{\sqrt{\mathrm{Var}(x_i)\,\mathrm{Var}(x_j)}} \in [-1, +1]$$

Two variables that are independent are uncorrelated as well:

$$p(x_i, x_j) = p(x_i)\,p(x_j) \;\Rightarrow\; \mathbb{E}[x_i x_j] = \mathbb{E}[x_i]\,\mathbb{E}[x_j] \;\Rightarrow\; \mathrm{Cov}(x_i, x_j) = 0$$

The inverse is generally not true (zero correlation does not mean independence). [figure from Wikipedia: in each example the correlation between the two coordinates is zero, but they are not independent]

SLIDE 16

Correlation and dependence

Two variables that are independent are uncorrelated (previous slide), and in general the inverse does not hold. For Gaussians, however, the inverse is true:

$$\mathrm{Corr}(x_i, x_j) = 0 \;\Leftrightarrow\; \Sigma_{i,j} = 0 \;\Rightarrow\; x_i, x_j \text{ independent}$$

Why? Marginalize out all variables except $x_i, x_j$:

$$\begin{bmatrix} x_i \\ x_j \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_i \\ \mu_j \end{bmatrix}, \begin{bmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_j^2 \end{bmatrix} \right)$$

But this is the product of two univariate Gaussian densities; therefore $x_i$ is independent of $x_j$.

SLIDE 17

Chain rule

Given

$$p(\mathbf{x}_A) = \mathcal{N}\left(\boldsymbol{\mu}_A, \Sigma_A\right), \qquad p(\mathbf{x}_B \mid \mathbf{x}_A) = \mathcal{N}\left(Q\mathbf{x}_A + \mathbf{c},\ \Sigma_{B|A}\right)$$

then the joint distribution is also normal (you probably guessed this based on the formula for linear transformations):

$$p(\mathbf{x}_B, \mathbf{x}_A) = \mathcal{N}\left( \begin{bmatrix} Q\boldsymbol{\mu}_A + \mathbf{c} \\ \boldsymbol{\mu}_A \end{bmatrix}, \begin{bmatrix} \Sigma_{B|A} + Q\Sigma_A Q^\top & Q\Sigma_A \\ \Sigma_A Q^\top & \Sigma_A \end{bmatrix} \right)$$

Example: a dragon's life-span is approximately normally distributed with $\mu_A = 1000,\ \sigma_A = 100$, and the heat of a dragon's breath is normal with $\mu_B = 2x_A - 273,\ \sigma_{B|A} = 30$. What is the probability that a random dragon at its death bed can melt stainless steel!?

SLIDE 18

Chain rule

Example: a dragon's life-span is approximately normally distributed with $\mu_A = 1000,\ \sigma_A = 100$; the heat of its breath is normal with $\mu_B = 2x_A - 273,\ \sigma_{B|A} = 30$. What is the probability that a random dragon at its death bed can melt stainless steel!?

Substituting into the chain rule formula:

$$p(x_B, x_A) = \mathcal{N}\left( \begin{bmatrix} 2000 - 273 \\ 1000 \end{bmatrix}, \begin{bmatrix} 900 + 40000 & 20000 \\ 20000 & 10000 \end{bmatrix} \right)$$

We just care about the marginal distribution over the heat of the dragon's breath:

$$p(x_B) = \mathcal{N}(1727,\ 40900)$$

Steel's melting point is 1500 °C, so we want $p(x_B > 1500)$. Using the CDF $\Phi$ of the standard normal:

$$\frac{1727 - 1500}{\sqrt{40900}} \approx 1.12, \qquad p(x_B > 1500) = \Phi(1.12) \approx 0.87$$

About 13% of dragons can't do it!
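The tail probability, computed here with SciPy (an assumption; any standard-normal CDF works):

```python
import numpy as np
from scipy.stats import norm

mu_B, var_B = 1727.0, 40900.0

# P(x_B > 1500): survival function of N(1727, 40900)
p_melt = norm.sf(1500.0, loc=mu_B, scale=np.sqrt(var_B))
print(p_melt)  # ~0.87, so about 13% of dragons can't melt steel
```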

SLIDE 19

Conditioning (optional)

Given that $x_W$ (birth weight) and $x_S$ (shoe size) are jointly normally distributed,

$$\begin{bmatrix} x_W \\ x_S \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_W \\ \mu_S \end{bmatrix}, \begin{bmatrix} \sigma_W^2 & \sigma_{W,S} \\ \sigma_{W,S} & \sigma_S^2 \end{bmatrix} \right)$$

and given an assignment $x_W = \bar{x}_W$, what is the distribution of shoe size? [figure: the joint $p(x_W, x_S)$, the marginal $p(x_S)$, and the conditional $p(x_S \mid x_W = .7)$]

SLIDE 20

Conditioning (optional)

Let $\mathbf{x}_A, \mathbf{x}_B$ denote a partitioning of $\mathbf{x}$ so that

$$\begin{bmatrix} \mathbf{x}_A \\ \mathbf{x}_B \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_A \\ \boldsymbol{\mu}_B \end{bmatrix}, \begin{bmatrix} \Sigma_A & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_B \end{bmatrix} \right), \qquad \Sigma_{BA} = \Sigma_{AB}^\top$$

Then

$$p(\mathbf{x}_A \mid \mathbf{x}_B = \bar{\mathbf{x}}_B) = \mathcal{N}\left(\boldsymbol{\mu}_{A|B},\ \Sigma_{A|B}\right)$$

$$\boldsymbol{\mu}_{A|B} = \boldsymbol{\mu}_A + \Sigma_{AB}\Sigma_B^{-1}\left(\bar{\mathbf{x}}_B - \boldsymbol{\mu}_B\right)$$

$$\Sigma_{A|B} = \Sigma_A - \Sigma_{AB}\Sigma_B^{-1}\Sigma_{BA}$$

Observations:

- after conditioning, the variance decreases
- the conditional variance is independent of the observation $\bar{\mathbf{x}}_B$
- by conditioning on the mean, $\bar{\mathbf{x}}_B = \boldsymbol{\mu}_B$, we get $\boldsymbol{\mu}_{A|B} = \boldsymbol{\mu}_A$
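A sketch of these formulas as a NumPy helper (condition_gaussian is a hypothetical name, not from the slides):

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_A, idx_B, x_B):
    """Return the mean and covariance of p(x_A | x_B) for N(mu, Sigma)."""
    S_A = Sigma[np.ix_(idx_A, idx_A)]
    S_B = Sigma[np.ix_(idx_B, idx_B)]
    S_AB = Sigma[np.ix_(idx_A, idx_B)]

    gain = S_AB @ np.linalg.inv(S_B)            # Sigma_AB Sigma_B^{-1}
    mu_cond = mu[idx_A] + gain @ (x_B - mu[idx_B])
    Sigma_cond = S_A - gain @ S_AB.T            # Sigma_A - gain Sigma_BA
    return mu_cond, Sigma_cond
```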

SLIDE 21

Conditioning (optional)

Example: a dragon's life-span is approximately normally distributed with $\mu_A = 1000,\ \sigma_A = 100$, and the heat of its breath is normal with $\mu_B = 2x_A - 273,\ \sigma_{B|A} = 30$. At its death bed a dragon claims that its breath is twice as hot as lava ($2 \times 1250\ °\mathrm{C} = 2500$). What is your best guess for its age?

We have $p(x_A)$ and $p(x_B \mid x_A)$; we want $p(x_A \mid x_B)$, so we need to use the Bayes rule:

1. calculate $p(x_A, x_B)$ using the chain rule
2. condition on $x_B = 2500$
3. report the MAP assignment, which is the same as the mean for a Gaussian
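Carrying out these three steps with the joint from the chain rule slide (a sketch; under the slide's numbers the conditional mean, about 1378, is the MAP guess for the dragon's age):

```python
import numpy as np

# joint over (x_A: life-span, x_B: breath heat), from the chain rule slide
mu = np.array([1000.0, 1727.0])
Sigma = np.array([[10000.0, 20000.0],
                  [20000.0, 40900.0]])

x_B = 2500.0
gain = Sigma[0, 1] / Sigma[1, 1]             # Sigma_AB Sigma_B^{-1}
mu_cond = mu[0] + gain * (x_B - mu[1])       # ~1378: best guess for the age
var_cond = Sigma[0, 0] - gain * Sigma[0, 1]  # ~220: much reduced variance
print(mu_cond, var_cond)
```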

SLIDE 22

Summary

- the Gaussian distribution is motivated by the central limit theorem
- the expression for the multivariate Gaussian
- the maximum-likelihood estimates of its parameters
- the covariance matrix and its decomposition
- zero covariance means independence in Gaussians
- linear transformations of Gaussians produce Gaussians
- marginalization and conditioning produce Gaussians
- the sum of independent Gaussian random variables is Gaussian