

  1. Applied Machine Learning: Multivariate Gaussian. Siamak Ravanbakhsh, COMP 551 (Fall 2020)

  2. Admin. Late midterm exam: November 11th; the exam will be available for 72 hours; we will announce later whether or not it is a timed exam. Coding tutorial: Arnab will go over the NumPy code for the different methods, 2 pm on Wednesdays and Fridays starting this Friday; the Zoom link will be posted.

  3. Learning objectives. Gaussian distribution: motivation and the functional form of its density; the covariance matrix; correlation and dependence; linear transformations of Gaussians; marginalization, chain rule, and conditioning for Gaussians.

  4. Univariate Gaussian density. The Gaussian probability density function (pdf) is $\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. Its two parameters $\mu, \sigma^2$ turn out to be the mean and variance: $\mathbb{E}[x] = \mu$ and $\mathbb{E}[(x-\mu)^2] = \sigma^2$. Here $x$ is a random variable; we are using the same notation for a random variable and a particular value of that variable.

  5. Univariate Gaussian density. (Figure: about 68.2% of the probability mass lies within one standard deviation of the mean, 95.4% within two, and 99.7% within three.) Given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$, the maximum likelihood estimates of $\mu, \sigma^2$ are the empirical mean and variance: $\mu_{\text{MLE}} = \frac{1}{N}\sum_n x^{(n)}$ and $\sigma^2_{\text{MLE}} = \frac{1}{N}\sum_n (x^{(n)} - \mu_{\text{MLE}})^2$. How can we derive this?
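A minimal NumPy sketch of these MLE formulas; the dataset here is synthetic, generated only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)   # synthetic dataset D = {x^(1), ..., x^(N)}

mu_mle = x.mean()                            # (1/N) sum_n x^(n)
sigma2_mle = ((x - mu_mle) ** 2).mean()      # (1/N) sum_n (x^(n) - mu_MLE)^2, same as np.var(x, ddof=0)

print(mu_mle, sigma2_mle)                    # should be close to 5.0 and 4.0
```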

  6. Univariate Gaussian density. Two reasons why the Gaussian is an important distribution: it is the maximum entropy distribution with a fixed variance, and it appears in the central limit theorem. Let's throw three dice repeatedly and plot the histogram of the average outcome; does it look familiar? Let's use 10 dice, and then replace the dice with uniformly distributed values in [0, 1]. The average (and sum) of IID random variables has an approximately Gaussian distribution, which justifies using a Gaussian for observations that are means or sums of many random values.
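A quick simulation of the experiment described above (the number of repetitions and bin counts are arbitrary choices, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

for n_dice in (3, 10):
    # average outcome of n_dice dice, repeated many times
    rolls = rng.integers(1, 7, size=(100_000, n_dice))
    plt.hist(rolls.mean(axis=1), bins=50, density=True, alpha=0.5,
             label=f"{n_dice} dice")

# same experiment with uniform [0, 1] values instead of dice
uniforms = rng.uniform(0.0, 1.0, size=(100_000, 10))
plt.hist(uniforms.mean(axis=1), bins=50, density=True, alpha=0.5,
         label="10 uniforms")

plt.legend()
plt.show()   # all three histograms look approximately Gaussian
```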

  7. Multivariate Gaussian. Univariate normal density: $\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ for $x \in \mathbb{R}$. Now, instead of $x \in \mathbb{R}$, $x \in \mathbb{R}^D$ is a (column) vector, $\mu$ is $D$-dimensional, $\Sigma$ is $D \times D$, and $\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{|2\pi\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$. Determinant: for a $D \times D$ matrix we have $|cA| = c^D|A|$, so $|2\pi\Sigma| = (2\pi)^D|\Sigma|$.
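A direct NumPy implementation of this density, checked against scipy.stats.multivariate_normal, which computes the same quantity; the particular μ, Σ, and x values are made up for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """N(x; mu, Sigma) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) / sqrt(|2 pi Sigma|)."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)                      # (x-mu)^T Sigma^{-1} (x-mu)
    norm_const = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))   # sqrt(|2 pi Sigma|)
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, 0.7])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # same value
```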

  8. Covariance matrix. Recall the variance of a random variable, $\mathrm{Var}(x) = \mathbb{E}[(x - \mathbb{E}[x])^2] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$, and the covariance of two random variables, $\mathrm{Cov}(x, y) = \mathbb{E}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]$. For $x \in \mathbb{R}^D$ we have the $D \times D$ covariance matrix $\Sigma = \mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top] = \mathbb{E}[xx^\top] - \mathbb{E}[x]\mathbb{E}[x]^\top$, with entries $\Sigma_{d,d'} = \mathrm{Cov}(x_d, x_{d'})$ and diagonal entries $\Sigma_{d,d} = \mathrm{Var}(x_d)$. Given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$, the sample covariance matrix is the empirical estimate $\hat{\Sigma}_{\text{MLE}} = \frac{1}{N}\sum_n (x^{(n)} - \hat{\mu})(x^{(n)} - \hat{\mu})^\top$.

  9. Covariance matrix. Sample covariance matrix: given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$, the empirical estimate is $\hat{\Sigma}_{\text{MLE}} = \frac{1}{N}\sum_n (x^{(n)} - \hat{\mu})(x^{(n)} - \hat{\mu})^\top$. Example: estimating the mean and the covariance of the Iris dataset; the figure shows the contour lines $\mathcal{N}(x; \mu, \Sigma) = \text{const}$.
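A sketch of this example, assuming the Iris data is loaded via sklearn.datasets and using only the first two features so that Σ is a 2x2 matrix:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data[:, :2]        # sepal length and sepal width, shape (150, 2)
N = X.shape[0]

mu_hat = X.mean(axis=0)                              # empirical mean
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / N        # (1/N) sum_n (x - mu)(x - mu)^T

print(mu_hat)
print(Sigma_hat)
print(np.cov(X, rowvar=False, bias=True))            # same as Sigma_hat
```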

  10. Covariance matrices. Example, considering the bivariate case for visualization. Isotropic Gaussian: $\Sigma = \sigma^2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$; axis-aligned: $\Sigma = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}$; full covariance: $\Sigma = \begin{bmatrix} 9 & 4 \\ 4 & 4 \end{bmatrix}$.

  11. Linear transformations. If $x \sim \mathcal{N}(\mu_x, \Sigma_x)$ and $y = Qx$ with $Q$ a $D \times D$ matrix, then $y \sim \mathcal{N}(\mu_y, \Sigma_y)$, where $\mu_y = \mathbb{E}[Qx] = Q\mathbb{E}[x] = Q\mu_x$ and $\Sigma_y = \mathbb{E}[Qxx^\top Q^\top] - \mathbb{E}[Qx]\mathbb{E}[x^\top Q^\top] = Q\left(\mathbb{E}[xx^\top] - \mathbb{E}[x]\mathbb{E}[x]^\top\right)Q^\top = Q\Sigma_x Q^\top$. Example: $\Sigma_x = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ and $Q = \begin{bmatrix} 1 & 0 \\ 2 & 4 \end{bmatrix}$ give $\Sigma_y = Q\Sigma_x Q^\top = \begin{bmatrix} 1 & 2 \\ 2 & 20 \end{bmatrix}$. Can we construct any multivariate Gaussian from axis-aligned Gaussians in this way?
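The example can be verified numerically; a sketch assuming $\Sigma_x = I$ and the $Q$ above, using sampling as an empirical check:

```python
import numpy as np

rng = np.random.default_rng(0)

mu_x = np.zeros(2)
Sigma_x = np.eye(2)                     # axis-aligned (here isotropic) Gaussian
Q = np.array([[1.0, 0.0],
              [2.0, 4.0]])

# closed form: mu_y = Q mu_x, Sigma_y = Q Sigma_x Q^T
print(Q @ Sigma_x @ Q.T)                # [[1, 2], [2, 20]]

# empirical check by transforming samples
x = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
y = x @ Q.T                             # y^(n) = Q x^(n) for each sample (rows)
print(np.cov(y, rowvar=False))          # close to [[1, 2], [2, 20]]
```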

  12. Decomposing the covariance matrix. The covariance matrix is symmetric positive semi-definite: symmetric because $\Sigma_{d,d'} = \mathrm{Cov}(x_d, x_{d'}) = \mathrm{Cov}(x_{d'}, x_d) = \Sigma_{d',d}$, and positive semi-definite because for any $y \in \mathbb{R}^D$, $y^\top \Sigma y = y^\top \mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top] y = \mathrm{Var}(y^\top x) \geq 0$. Any symmetric positive semi-definite matrix can be decomposed as $\Sigma = Q\Lambda Q^\top$, where $\Lambda$ is $D \times D$ diagonal and $Q$ is orthogonal (a rotation and reflection), $Q^\top Q = QQ^\top = I$. So we can produce any Gaussian by rotation and reflection of an axis-aligned Gaussian.

  13. Decomposing the covariance matrix. $\Sigma = Q\Lambda Q^\top$ with $\Lambda$ diagonal and $Q$ orthogonal (rotation and reflection), so we can produce any Gaussian by rotation and reflection of an axis-aligned Gaussian. Example: $\Sigma = \begin{bmatrix} 10 & 5 \\ 5 & 5 \end{bmatrix} \approx \begin{bmatrix} -.85 & -.52 \\ -.52 & .85 \end{bmatrix} \begin{bmatrix} 13.09 & 0 \\ 0 & 1.90 \end{bmatrix} \begin{bmatrix} -.85 & -.52 \\ -.52 & .85 \end{bmatrix}^\top$. The diagonal entries of $\Lambda$ are the variances along the "new axes" of the aligned Gaussian, and the columns of $Q$ tell us where the original bases go (more on this in the PCA lecture).
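The decomposition in this example can be reproduced with np.linalg.eigh, which returns Λ and Q for a symmetric matrix (the sign and ordering of the eigenvectors may differ from the slide):

```python
import numpy as np

Sigma = np.array([[10.0, 5.0],
                  [ 5.0, 5.0]])

eigvals, Q = np.linalg.eigh(Sigma)      # Sigma = Q diag(eigvals) Q^T, Q orthogonal
Lambda = np.diag(eigvals)

print(eigvals)                          # approximately [1.91, 13.09]
print(Q)                                # columns are the new axes (up to sign/order)
print(Q @ Lambda @ Q.T)                 # reconstructs Sigma
```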

  14. Marginalization. Suppose people's height and IQ are jointly normally distributed: $\begin{bmatrix} x_H \\ x_{IQ} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_H \\ \mu_{IQ} \end{bmatrix}, \begin{bmatrix} \sigma_H^2 & \sigma_{H,IQ} \\ \sigma_{H,IQ} & \sigma_{IQ}^2 \end{bmatrix}\right)$. What is the distribution of IQ? We need to marginalize over height: $p(y) = \int_z p(y, z)\,\mathrm{d}z$. For Gaussian distributions the marginal is also Gaussian, because marginalization corresponds to a linear transformation with $Q = [0\ 1]$: $[0\ 1]\begin{bmatrix} x_H \\ x_{IQ} \end{bmatrix} \sim \mathcal{N}\left([0\ 1]\begin{bmatrix} \mu_H \\ \mu_{IQ} \end{bmatrix}, [0\ 1]\begin{bmatrix} \sigma_H^2 & \sigma_{H,IQ} \\ \sigma_{H,IQ} & \sigma_{IQ}^2 \end{bmatrix}[0\ 1]^\top\right)$, i.e. $x_{IQ} \sim \mathcal{N}(\mu_{IQ}, \sigma_{IQ}^2)$. The same idea extends to marginalizing more than one variable.
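In code, marginalizing a Gaussian simply selects the corresponding entries of μ and Σ; a sketch with made-up height/IQ parameters:

```python
import numpy as np

# hypothetical joint parameters for (height, IQ), chosen only for illustration
mu = np.array([170.0, 100.0])
Sigma = np.array([[100.0,  30.0],
                  [ 30.0, 225.0]])

# marginal over IQ: keep the IQ entries only
mu_iq, var_iq = mu[1], Sigma[1, 1]
print(mu_iq, var_iq)                    # x_IQ ~ N(100, 225)

# same result via the linear-transformation rule with Q = [0 1]
Q = np.array([[0.0, 1.0]])
print(Q @ mu, Q @ Sigma @ Q.T)
```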

  15. Correlation and dependence. Correlation is normalized covariance: $\mathrm{Corr}(x_i, x_j) = \frac{\mathrm{Cov}(x_i, x_j)}{\sqrt{\mathrm{Var}(x_i)\mathrm{Var}(x_j)}} \in [-1, +1]$. Two variables that are independent are uncorrelated as well: $p(x_i, x_j) = p(x_i)p(x_j)$ implies $\mathbb{E}[x_i x_j] = \mathbb{E}[x_i]\mathbb{E}[x_j]$, i.e. $\mathrm{Cov}(x_i, x_j) = 0$. The converse is generally not true (zero correlation does not imply independence): in each example in the figure (image from Wikipedia) the correlation between the two coordinates is zero, but they are not independent.
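A small numerical illustration of this last point: $y = x^2$ is completely determined by $x$, yet their correlation is (nearly) zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = x ** 2                         # deterministic function of x, so clearly dependent

# Cov(x, y) = E[x^3] - E[x] E[x^2] = 0 for a symmetric distribution like N(0, 1)
print(np.corrcoef(x, y)[0, 1])     # approximately 0
```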

  16. Correlation and dependence (continued). For Gaussians the converse is also true: $\mathrm{Corr}(x_i, x_j) = 0 \Leftrightarrow \Sigma_{i,j} = 0$ implies independence. Why? Marginalize out all variables except $x_i, x_j$: $\begin{bmatrix} x_i \\ x_j \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_i \\ \mu_j \end{bmatrix}, \begin{bmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_j^2 \end{bmatrix}\right)$, but this is the product of two univariate Gaussian densities, therefore $x_i$ is independent of $x_j$.

  17. Chain rule. Example: a dragon's life-span $x_A$ is approximately normally distributed with $\mu_A = 1000, \sigma_A = 100$, and the average heat of a dragon's breath $x_B$ is normal with $\mu_{B|A} = 2x_A - 273, \sigma_{B|A} = 30$. What is the probability that a random dragon at its death bed can melt stainless steel? For this we need the chain rule: given $p(x_A) = \mathcal{N}(\mu_A, \Sigma_A)$ and $p(x_B \mid x_A) = \mathcal{N}(Qx_A + c, \Sigma_{B|A})$, the joint distribution is also normal: $p(x_B, x_A) = \mathcal{N}\left(\begin{bmatrix} Q\mu_A + c \\ \mu_A \end{bmatrix}, \begin{bmatrix} \Sigma_{B|A} + Q\Sigma_A Q^\top & Q\Sigma_A \\ \Sigma_A Q^\top & \Sigma_A \end{bmatrix}\right)$. You probably guessed this based on the formula for linear transformations.

  18. Chain rule. Substituting into the chain rule formula: $p(x_B, x_A) = \mathcal{N}\left(\begin{bmatrix} 2000 - 273 \\ 1000 \end{bmatrix}, \begin{bmatrix} 900 + 40000 & 20000 \\ 20000 & 10000 \end{bmatrix}\right)$. We just care about the marginal distribution over the heat of the dragon's breath, $p(x_B) = \mathcal{N}(1727, 40900)$. Stainless steel's melting point is about 1500 C, so we want $p(x_B > 1500)$: $\frac{1727 - 1500}{\sqrt{40900}} \approx 1.12$, and using the CDF of the standard normal, $\Phi(1.12) \approx 0.87$, so about 13% of dragons can't do it!
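The last step, evaluating the standard normal CDF, can be done with scipy.stats.norm:

```python
import numpy as np
from scipy.stats import norm

mu_B, var_B = 1727.0, 40900.0           # marginal over breath heat: N(1727, 40900)
threshold = 1500.0                      # melting point of stainless steel (deg C)

z = (threshold - mu_B) / np.sqrt(var_B)          # approximately -1.12
p_melt = 1.0 - norm.cdf(z)                       # P(x_B > 1500)
print(p_melt)                                    # approximately 0.87, so ~13% of dragons can't do it
```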

  19. Conditioning (optional). Given that $x_W$ (birth weight) and $x_S$ (shoe size) are jointly normally distributed, $\begin{bmatrix} x_W \\ x_S \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_W \\ \mu_S \end{bmatrix}, \begin{bmatrix} \sigma_W^2 & \sigma_{W,S} \\ \sigma_{W,S} & \sigma_S^2 \end{bmatrix}\right)$, and given an assignment $x_W = \bar{x}_W$, e.g. $x_W = .7$, what is the distribution of the shoe size? The figure contrasts the conditional distribution $p(x_S \mid x_W = .7)$, obtained from the joint $p(x_W, x_S)$, with the marginal distribution $p(x_S)$.

  20. Conditioning (optional). Let $x_A, x_B$ denote a partitioning of $x$ so that $\begin{bmatrix} x_A \\ x_B \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \begin{bmatrix} \Sigma_A & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_B \end{bmatrix}\right)$ with $\Sigma_{AB} = \Sigma_{BA}^\top$. Then $p(x_A \mid x_B = \bar{x}_B) = \mathcal{N}(\mu_{A|B}, \Sigma_{A|B})$, where $\Sigma_{A|B} = \Sigma_A - \Sigma_{AB}\Sigma_B^{-1}\Sigma_{BA}$ and $\mu_{A|B} = \mu_A + \Sigma_{AB}\Sigma_B^{-1}(\bar{x}_B - \mu_B)$. After conditioning the variance decreases; the conditional variance is independent of the observation $\bar{x}_B$; and by conditioning on the mean, $\bar{x}_B = \mu_B$, we get $\mu_{A|B} = \mu_A$.
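A sketch of these conditioning formulas applied to the bivariate weight/shoe-size example; all parameter values below are made up for illustration:

```python
import numpy as np

# hypothetical joint parameters for (x_W, x_S): birth weight and shoe size
mu = np.array([3.5, 40.0])                     # [mu_W, mu_S]
Sigma = np.array([[0.25, 0.8],
                  [0.8,  9.0]])                # [[var_W, cov], [cov, var_S]]

x_W_obs = 3.7                                  # observed birth weight (x_bar_B in the formulas)

# condition shoe size (A) on weight (B) = x_W_obs
Sigma_A, Sigma_B, Sigma_AB = Sigma[1, 1], Sigma[0, 0], Sigma[1, 0]
mu_cond = mu[1] + Sigma_AB / Sigma_B * (x_W_obs - mu[0])   # mu_{A|B}
var_cond = Sigma_A - Sigma_AB ** 2 / Sigma_B               # Sigma_{A|B}, smaller than var_S

print(mu_cond, var_cond)
```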
