Applied Machine Learning
Multivariate Gaussian
Siamak Ravanbakhsh
COMP 551 (Fall 2020)
Admin
Late midterm exam: November 11th. The exam will be available for 72 hours; we will announce later whether or not it is timed.
Coding tutorial: Arnab will go over the NumPy code for different methods, at 2 pm on Wednesdays and Fridays starting this Friday. The Zoom link will be posted.
Gaussian probability density function (pdf)

$$\mathcal{N}(x;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$

its two parameters $\mu, \sigma^2$ turn out to be the mean and variance:

$$\mathbb{E}[x] = \mu \qquad \mathbb{E}[(x - \mu)^2] = \sigma^2$$

here $x$ is a random variable; we are using the same notation for a random variable and a particular value of that variable
[figure: the Gaussian pdf, showing the probability mass within 1, 2, and 3 standard deviations of the mean: 68.2%, 95.4%, and 99.7%]
given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$, the maximum likelihood estimates of $\mu, \sigma^2$ are the empirical mean and variance:

$$\mu_{\mathrm{MLE}} = \frac{1}{N} \sum_n x^{(n)} \qquad \sigma^2_{\mathrm{MLE}} = \frac{1}{N} \sum_n \left( x^{(n)} - \mu_{\mathrm{MLE}} \right)^2$$

how can we derive this?
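as a quick numerical check, here is a minimal NumPy sketch of these estimates (the synthetic dataset is an assumption for illustration); note that np.var with its default ddof=0 computes exactly the MLE formula:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)   # synthetic dataset D with mu=5, sigma=2

mu_mle = x.mean()                                 # (1/N) sum_n x^(n)
sigma2_mle = ((x - mu_mle) ** 2).mean()           # (1/N) sum_n (x^(n) - mu_MLE)^2

# np.var with the default ddof=0 divides by N, i.e. the (biased) MLE estimate
assert np.isclose(sigma2_mle, x.var())
print(mu_mle, sigma2_mle)                         # close to 5 and 4
```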
two reasons why the Gaussian is an important distribution:
- it is the maximum entropy distribution with a fixed variance
- the central limit theorem

let's throw three dice repeatedly and plot the histogram of the average outcome. looks familiar? the average (and sum) of IID random variables has an approximately Gaussian distribution, and using 10 dice, or replacing the dice with uniformly distributed values in [0,1], only brings the histogram closer to a Gaussian. this justifies using a Gaussian for observations that are the mean or sum of some random values (see the simulation sketch below).
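a small simulation sketch of this experiment (the numbers of dice and repetitions here are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# average of k dice, repeated 100k times; the histogram approaches a Gaussian as k grows
for k in [1, 3, 10]:
    rolls = rng.integers(1, 7, size=(100_000, k))   # fair six-sided dice
    plt.hist(rolls.mean(axis=1), bins=60, density=True, alpha=0.5, label=f"{k} dice")

plt.xlabel("average outcome")
plt.legend()
plt.show()
```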
the multivariate Gaussian density is

$$\mathcal{N}(x;\, \mu, \Sigma) = \frac{1}{|2\pi\Sigma|^{\frac{1}{2}}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$$

compare with the univariate normal density

$$\mathcal{N}(x;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$

instead of $x \in \mathbb{R}$, $x \in \mathbb{R}^D$ is now a $D$-dimensional (column) vector, and $\Sigma$ is a $D \times D$ covariance matrix

determinant: for a $D \times D$ matrix we have $|cA| = c^D |A|$, so $|2\pi\Sigma|^{\frac{1}{2}} = (2\pi)^{\frac{D}{2}} |\Sigma|^{\frac{1}{2}}$
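to connect the two normalizers, a sketch evaluating the multivariate density directly from the formula, checked against scipy.stats.multivariate_normal (the particular mu and Sigma are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, 0.8])

D = len(mu)
diff = x - mu
# |2 pi Sigma|^{1/2} = (2 pi)^{D/2} |Sigma|^{1/2}
norm_const = (2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5
pdf = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

assert np.isclose(pdf, multivariate_normal(mu, Sigma).pdf(x))
```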
variance of a random variable: $\mathrm{Var}(x) = \mathbb{E}[(x - \mathbb{E}[x])^2] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$

covariance of two random variables: $\mathrm{Cov}(x, y) = \mathbb{E}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]$

for $x \in \mathbb{R}^D$ we have the $D \times D$ covariance matrix (note the outer product of a $D \times 1$ and a $1 \times D$ vector is $D \times D$):

$$\Sigma = \mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top] = \mathbb{E}[xx^\top] - \mathbb{E}[x]\mathbb{E}[x]^\top = \begin{bmatrix} \Sigma_{1,1} & \ldots & \Sigma_{1,D} \\ \vdots & \ddots & \vdots \\ \Sigma_{D,1} & \ldots & \Sigma_{D,D} \end{bmatrix}$$

the diagonal entries are variances, e.g. $\Sigma_{1,1} = \mathrm{Cov}(x_1, x_1) = \mathrm{Var}(x_1)$, and the off-diagonal entries are covariances, e.g. $\Sigma_{1,D} = \mathrm{Cov}(x_1, x_D)$
given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$, the empirical estimate (sample covariance matrix) is

$$\hat{\Sigma} = \Sigma_{\mathrm{MLE}} = \mathbb{E}_\mathcal{D}\!\left[ (x - \mathbb{E}_\mathcal{D}[x])(x - \mathbb{E}_\mathcal{D}[x])^\top \right] = \frac{1}{N} \sum_{x \in \mathcal{D}} (x - \bar{x})(x - \bar{x})^\top$$
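in NumPy this is a one-liner; a sketch with a made-up dataset, noting that np.cov divides by N-1 by default, so bias=True is needed to match the MLE:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))        # N=500 points in D=3 dimensions, one per row

x_bar = X.mean(axis=0)
Sigma_mle = (X - x_bar).T @ (X - x_bar) / len(X)   # (1/N) sum (x - x_bar)(x - x_bar)^T

# np.cov expects variables in rows; bias=True divides by N rather than N-1
assert np.allclose(Sigma_mle, np.cov(X.T, bias=True))
```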
example

estimating the mean and the covariance of the Iris dataset; the contour lines show $\mathcal{N}(\mu, \Sigma) = \text{const}$
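a sketch of this fit; the choice of the two Iris features below is an assumption (the lecture's plot may use different ones):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data[:, :2]     # sepal length and sepal width
mu = X.mean(axis=0)
Sigma = np.cov(X.T, bias=True)  # MLE covariance

print(mu)                       # empirical mean of the two features
print(Sigma)                    # 2x2 sample covariance
# the contour lines of N(mu, Sigma) = const are ellipses centered at mu
```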
example

considering the bivariate case for visualization:

isotropic Gaussian: $\Sigma = \sigma^2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

axis-aligned: $\Sigma = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}$

full covariance: $\Sigma = \begin{bmatrix} 9 & 4 \\ 4 & 4 \end{bmatrix}$
if $x \sim \mathcal{N}(\mu_x, \Sigma_x)$ and $y = Qx$ for a $D' \times D$ matrix $Q$, then $y \sim \mathcal{N}(\mu_y, \Sigma_y)$, where

$$\mu_y = \mathbb{E}[Qx] = Q\mathbb{E}[x] = Q\mu_x$$

$$\Sigma_y = \mathbb{E}[Qxx^\top Q^\top] - \mathbb{E}[Qx]\mathbb{E}[x^\top Q^\top] = Q\left( \mathbb{E}[xx^\top] - \mathbb{E}[x]\mathbb{E}[x]^\top \right) Q^\top = Q \Sigma_x Q^\top$$
example

$$\Sigma_x = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad Q = \begin{bmatrix} 1 & 0 \\ 2 & 4 \end{bmatrix} \quad\Rightarrow\quad \Sigma_y = Q\Sigma_x Q^\top = \begin{bmatrix} 1 & 2 \\ 2 & 20 \end{bmatrix}$$

can we construct any multivariate Gaussian from axis-aligned Gaussians in this way?
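a quick empirical check of this example: sample from the axis-aligned Gaussian, transform, and compare the sample covariance with $Q\Sigma_x Q^\top$:

```python
import numpy as np

rng = np.random.default_rng(3)
Q = np.array([[1.0, 0.0],
              [2.0, 4.0]])

x = rng.normal(size=(100_000, 2))   # x ~ N(0, I), one sample per row
y = x @ Q.T                         # y = Q x applied to every sample

print(np.cov(y.T, bias=True))       # approximately Q I Q^T = [[1, 2], [2, 20]]
```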
the covariance matrix is symmetric positive semi-definite:

symmetric because $\Sigma_{d,d'} = \mathrm{Cov}(x_d, x_{d'}) = \mathrm{Cov}(x_{d'}, x_d) = \Sigma_{d',d}$

positive semi-definite because for any $y \in \mathbb{R}^D$

$$y^\top \Sigma y = y^\top \mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top] y = \mathrm{Var}(y^\top x) \geq 0$$

any symmetric positive semi-definite matrix can be decomposed as $\Sigma = Q \Lambda Q^\top$, where $\Lambda$ is diagonal ($D \times D$) and $Q$ is orthogonal, $QQ^\top = Q^\top Q = I$ (a rotation and reflection)

so we can produce any Gaussian by rotation and reflection of an axis-aligned Gaussian
example

$$\Sigma = \begin{bmatrix} 10 & 5 \\ 5 & 5 \end{bmatrix} \approx \begin{bmatrix} -.85 & -.52 \\ -.52 & .85 \end{bmatrix} \begin{bmatrix} 13.09 & 0 \\ 0 & 1.90 \end{bmatrix} \begin{bmatrix} -.85 & -.52 \\ -.52 & .85 \end{bmatrix}^\top$$

the columns of $Q$ tell us where the original bases go, and the diagonal of $\Lambda$ gives the variance along the "new axes" of the corresponding axis-aligned Gaussian

(more on this in the PCA lecture)
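np.linalg.eigh recovers exactly this decomposition (up to the ordering and sign of the eigenvectors):

```python
import numpy as np

Sigma = np.array([[10.0, 5.0],
                  [5.0, 5.0]])

lam, Q = np.linalg.eigh(Sigma)   # eigenvalues in ascending order: ~[1.91, 13.09]
print(lam)                       # variances along the new axes
print(Q)                         # orthogonal matrix; columns are the new axes

assert np.allclose(Q @ np.diag(lam) @ Q.T, Sigma)   # Sigma = Q Lambda Q^T
assert np.allclose(Q @ Q.T, np.eye(2))              # Q Q^T = I
```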
what is the distribution of IQ? people's height and IQ are jointly normally distributed:

$$\begin{bmatrix} x_H \\ x_{IQ} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_H \\ \mu_{IQ} \end{bmatrix}, \begin{bmatrix} \sigma_H^2 & \sigma_{H,IQ} \\ \sigma_{H,IQ} & \sigma_{IQ}^2 \end{bmatrix} \right)$$

we need to marginalize over height: $p(y) = \int_z p(y, z)\, dz$

for Gaussian distributions the marginal is also Gaussian, and the same idea extends to marginalizing over more than one variable. marginalization corresponds to a linear transformation:

$$\begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} x_H \\ x_{IQ} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} \mu_H \\ \mu_{IQ} \end{bmatrix}, \; \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} \sigma_H^2 & \sigma_{H,IQ} \\ \sigma_{H,IQ} & \sigma_{IQ}^2 \end{bmatrix} \begin{bmatrix} 0 & 1 \end{bmatrix}^\top \right)$$

$$x_{IQ} \sim \mathcal{N}\left( \mu_{IQ}, \sigma_{IQ}^2 \right)$$
correlation is normalized covariance:

$$\mathrm{Corr}(x_i, x_j) = \frac{\mathrm{Cov}(x_i, x_j)}{\sqrt{\mathrm{Var}(x_i)\,\mathrm{Var}(x_j)}} \in [-1, +1]$$

two variables that are independent are uncorrelated as well:

$$p(x_i, x_j) = p(x_i)\,p(x_j) \;\Rightarrow\; \mathbb{E}[x_i x_j] = \mathbb{E}[x_i]\,\mathbb{E}[x_j] \;\Rightarrow\; \mathrm{Cov}(x_i, x_j) = 0$$

the inverse is generally not true (zero correlation does not imply independence): in each of the examples in the figure, the correlation between the two coordinates is zero, but they are not independent (image from wikipedia)
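a quick numerical illustration of one such case: with $x \sim \mathcal{N}(0, 1)$, the variable $y = x^2$ is completely determined by $x$, yet uncorrelated with it:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1_000_000)
y = x ** 2                      # fully dependent on x

print(np.corrcoef(x, y)[0, 1])  # approximately 0, since Cov(x, x^2) = E[x^3] = 0
```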
the inverse is true for Gaussians!

$$\mathrm{Corr}(x_i, x_j) = 0 \;\Leftrightarrow\; \Sigma_{i,j} = 0$$

why does zero covariance imply independence here? marginalize out all variables except $x_i, x_j$:

$$\begin{bmatrix} x_i \\ x_j \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_i \\ \mu_j \end{bmatrix}, \begin{bmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_j^2 \end{bmatrix} \right)$$

but this is the product of two univariate Gaussian densities, therefore $x_i$ is independent of $x_j$
given $p(x_A) = \mathcal{N}(\mu_A, \Sigma_A)$ and a conditional of the form $p(x_B | x_A) = \mathcal{N}(Qx_A + c, \Sigma_{B|A})$ (you probably guessed this form based on the formula for linear transformations), we can use the chain rule $p(x_B, x_A) = p(x_B | x_A)\,p(x_A)$; the joint distribution is then also normal:

$$p(x_B, x_A) = \mathcal{N}\left( \begin{bmatrix} Q\mu_A + c \\ \mu_A \end{bmatrix}, \begin{bmatrix} \Sigma_{B|A} + Q\Sigma_A Q^\top & Q\Sigma_A \\ \Sigma_A Q^\top & \Sigma_A \end{bmatrix} \right)$$
example

a dragon's life-span is approximately normally distributed with $\mu_A = 1000$, $\sigma_A = 100$, and the heat of a dragon's breath is normal with mean $\mu_B = 2x_A - 273$ and $\sigma_{B|A} = 30$. what is the probability that a random dragon at its death bed can melt stainless steel!?

substituting into the chain rule formula,

$$p(x_B, x_A) = \mathcal{N}\left( \begin{bmatrix} 2000 - 273 \\ 1000 \end{bmatrix}, \begin{bmatrix} 900 + 40000 & 20000 \\ 20000 & 10000 \end{bmatrix} \right)$$

we just care about the marginal distribution over the heat of the dragon's breath, $p(x_B) = \mathcal{N}(1727, 40900)$; steel's melting point is 1500 C, so we want $p(x_B > 1500)$
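as a numeric check, a minimal sketch of the chain-rule substitution above (here $Q$ and $c$ are the scalars 2 and $-273$):

```python
import numpy as np

mu_A, var_A = 1000.0, 100.0 ** 2            # life-span: N(1000, 100^2)
Q, c, var_BgA = 2.0, -273.0, 30.0 ** 2      # x_B | x_A ~ N(2 x_A - 273, 30^2)

mu_joint = np.array([Q * mu_A + c, mu_A])                     # [1727, 1000]
Sigma_joint = np.array([[var_BgA + Q * var_A * Q, Q * var_A],
                        [Q * var_A,               var_A]])
print(mu_joint)        # [1727. 1000.]
print(Sigma_joint)     # [[40900. 20000.], [20000. 10000.]]
```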
use the CDF of the standard normal:

$$\frac{1727 - 1500}{\sqrt{40900}} \approx 1.12, \qquad p(x_B > 1500) = p(z > -1.12) = \Phi(1.12) \approx 0.87$$

about 13% of dragons can't do it!
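the tail probability itself, computed with scipy's normal CDF:

```python
from scipy.stats import norm

p_melt = 1 - norm.cdf(1500, loc=1727, scale=40900 ** 0.5)
print(p_melt)   # ~0.87, so roughly 13% of dragons can't do it
```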
given that $x_W$ (birth weight) and $x_S$ (shoe size) are jointly normally distributed,

$$\begin{bmatrix} x_W \\ x_S \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_W \\ \mu_S \end{bmatrix}, \begin{bmatrix} \sigma_W^2 & \sigma_{W,S} \\ \sigma_{W,S} & \sigma_S^2 \end{bmatrix} \right)$$

given an assignment of birth weight, $x_W = \bar{x}_W$, what is the distribution of shoe size? for example, at $x_W = .7$ we want the conditional distribution $p(x_S | x_W = .7)$, a slice of the joint $p(x_W, x_S)$, rather than the marginal distribution $p(x_S)$
let $x_A, x_B$ denote a partitioning of $x$, so that

$$\begin{bmatrix} x_A \\ x_B \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \begin{bmatrix} \Sigma_A & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_B \end{bmatrix} \right), \qquad \Sigma_{BA} = \Sigma_{AB}^\top$$

then $p(x_A | x_B = \bar{x}_B) = \mathcal{N}(\mu_{A|B}, \Sigma_{A|B})$, where

$$\mu_{A|B} = \mu_A + \Sigma_{AB}\Sigma_B^{-1}(\bar{x}_B - \mu_B)$$

$$\Sigma_{A|B} = \Sigma_A - \Sigma_{AB}\Sigma_B^{-1}\Sigma_{BA}$$

after conditioning the variance decreases, and the conditional variance is independent of the observation $\bar{x}_B$; by conditioning on the mean, $\bar{x}_B = \mu_B$, we get $\mu_{A|B} = \mu_A$
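a generic sketch of these two formulas as a small helper function (the function name and interface are mine, not from the lecture):

```python
import numpy as np

def condition_gaussian(mu_A, mu_B, Sigma_A, Sigma_AB, Sigma_B, x_B_bar):
    """Parameters of p(x_A | x_B = x_B_bar) for jointly Gaussian (x_A, x_B)."""
    K = Sigma_AB @ np.linalg.inv(Sigma_B)     # Sigma_AB Sigma_B^{-1}
    mu_AgB = mu_A + K @ (x_B_bar - mu_B)      # mean shifts with the observation
    Sigma_AgB = Sigma_A - K @ Sigma_AB.T      # variance shrinks, independent of x_B_bar
    return mu_AgB, Sigma_AgB
```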
example

a dragon's life-span is approximately normally distributed with $\mu_A = 1000$, $\sigma_A = 100$, and the heat of a dragon's breath is normal with mean $\mu_B = 2x_A - 273$ and $\sigma_{B|A} = 30$. at its death bed a dragon claims that its breath is twice hotter than lava ($2 \times 1250$ C). what is your best guess for its age?

we have $p(x_A)$ and $p(x_B | x_A)$, but we want $p(x_A | x_B)$, so we need to use Bayes rule: form the joint $p(x_A, x_B)$ as before and condition on $x_B = 2500$
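plugging the dragon's numbers into the conditioning formulas (scalar case, so the matrix inverses become divisions):

```python
mu_A, mu_B = 1000.0, 1727.0                    # from the joint computed earlier
Sigma_A, Sigma_AB, Sigma_B = 10000.0, 20000.0, 40900.0
x_B_bar = 2500.0                               # claimed breath heat: 2 x 1250 C

mu_AgB = mu_A + Sigma_AB / Sigma_B * (x_B_bar - mu_B)   # ~1378: best guess for its age
Sigma_AgB = Sigma_A - Sigma_AB / Sigma_B * Sigma_AB     # ~220: much smaller than 10000
print(mu_AgB, Sigma_AgB)
```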
summary:
- the Gaussian distribution is motivated by the central limit theorem
- the expression for the multivariate Gaussian
- the maximum-likelihood estimate of its parameters
- the covariance matrix and its decomposition
- zero covariance means independence in Gaussians
- linear transformations of Gaussians produce Gaussians
- marginalization and conditioning produce Gaussians
- the sum of independent Gaussian random variables is Gaussian