Applied Machine Learning
Multivariate Gaussian
Siamak Ravanbakhsh
COMP 551 (Fall 2020)
Admin
Late midterm exam: November 11th. The exam will be available for 72 hours; we will announce later whether or not it is timed.
Coding tutorial: Arnab will go over the NumPy code for different methods, at 2 pm on Wednesdays and Fridays starting this Friday. The Zoom link will be posted.
Gaussian probability density function (pdf)

$$\mathcal{N}(x;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$

its two parameters $\mu, \sigma^2$ turn out to be the mean and variance:

$$\mathbb{E}[x] = \mu \qquad \mathbb{E}[(x - \mu)^2] = \sigma^2$$

here $x$ is a random variable; we are using the same notation for a random variable and a particular value of that variable
[figure: the Gaussian pdf, showing the probability mass within 1, 2, and 3 standard deviations of the mean: 68.2%, 95.4%, and 99.7%]
given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$, the maximum likelihood estimates of $\mu, \sigma^2$ are the empirical mean and variance:

$$\mu_{\mathrm{MLE}} = \frac{1}{N} \sum_n x^{(n)} \qquad \sigma^2_{\mathrm{MLE}} = \frac{1}{N} \sum_n \left( x^{(n)} - \mu_{\mathrm{MLE}} \right)^2$$

how can we derive this?
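as a quick numerical check, here is a minimal NumPy sketch of these estimates (the synthetic dataset is an assumption for illustration); note that np.var with its default ddof=0 computes exactly the MLE formula:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)   # synthetic dataset D with mu=5, sigma=2

mu_mle = x.mean()                                 # (1/N) sum_n x^(n)
sigma2_mle = ((x - mu_mle) ** 2).mean()           # (1/N) sum_n (x^(n) - mu_MLE)^2

# np.var with the default ddof=0 divides by N, i.e. the (biased) MLE estimate
assert np.isclose(sigma2_mle, x.var())
print(mu_mle, sigma2_mle)                         # close to 5 and 4
```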
two reasons why the Gaussian is an important distribution:
- it is the maximum entropy distribution with a fixed variance
- the central limit theorem

let's throw three dice repeatedly and plot the histogram of the average outcome. looks familiar? the average (and sum) of IID random variables has an approximately Gaussian distribution, and using 10 dice, or replacing the dice with uniformly distributed values in [0,1], only brings the histogram closer to a Gaussian. this justifies using a Gaussian for observations that are the mean or sum of some random values (see the simulation sketch below).
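a small simulation sketch of this experiment (the numbers of dice and repetitions here are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# average of k dice, repeated 100k times; the histogram approaches a Gaussian as k grows
for k in [1, 3, 10]:
    rolls = rng.integers(1, 7, size=(100_000, k))   # fair six-sided dice
    plt.hist(rolls.mean(axis=1), bins=60, density=True, alpha=0.5, label=f"{k} dice")

plt.xlabel("average outcome")
plt.legend()
plt.show()
```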
the multivariate Gaussian density is

$$\mathcal{N}(x;\, \mu, \Sigma) = \frac{1}{|2\pi\Sigma|^{\frac{1}{2}}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$$

compare with the univariate normal density

$$\mathcal{N}(x;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$

instead of $x \in \mathbb{R}$, $x \in \mathbb{R}^D$ is now a $D$-dimensional (column) vector, and $\Sigma$ is a $D \times D$ covariance matrix

determinant: for a $D \times D$ matrix we have $|cA| = c^D |A|$, so $|2\pi\Sigma|^{\frac{1}{2}} = (2\pi)^{\frac{D}{2}} |\Sigma|^{\frac{1}{2}}$
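to connect the two normalizers, a sketch evaluating the multivariate density directly from the formula, checked against scipy.stats.multivariate_normal (the particular mu and Sigma are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, 0.8])

D = len(mu)
diff = x - mu
# |2 pi Sigma|^{1/2} = (2 pi)^{D/2} |Sigma|^{1/2}
norm_const = (2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5
pdf = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

assert np.isclose(pdf, multivariate_normal(mu, Sigma).pdf(x))
```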
variance of a random variable: $\mathrm{Var}(x) = \mathbb{E}[(x - \mathbb{E}[x])^2] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$

covariance of two random variables: $\mathrm{Cov}(x, y) = \mathbb{E}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]$

for $x \in \mathbb{R}^D$ we have the $D \times D$ covariance matrix (note the outer product of a $D \times 1$ and a $1 \times D$ vector is $D \times D$):

$$\Sigma = \mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top] = \mathbb{E}[xx^\top] - \mathbb{E}[x]\mathbb{E}[x]^\top = \begin{bmatrix} \Sigma_{1,1} & \ldots & \Sigma_{1,D} \\ \vdots & \ddots & \vdots \\ \Sigma_{D,1} & \ldots & \Sigma_{D,D} \end{bmatrix}$$

the diagonal entries are variances, e.g. $\Sigma_{1,1} = \mathrm{Cov}(x_1, x_1) = \mathrm{Var}(x_1)$, and the off-diagonal entries are covariances, e.g. $\Sigma_{1,D} = \mathrm{Cov}(x_1, x_D)$
given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$, the empirical estimate (sample covariance matrix) is

$$\hat{\Sigma} = \Sigma_{\mathrm{MLE}} = \mathbb{E}_\mathcal{D}\!\left[ (x - \mathbb{E}_\mathcal{D}[x])(x - \mathbb{E}_\mathcal{D}[x])^\top \right] = \frac{1}{N} \sum_{x \in \mathcal{D}} (x - \bar{x})(x - \bar{x})^\top$$
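in NumPy this is a one-liner; a sketch with a made-up dataset, noting that np.cov divides by N-1 by default, so bias=True is needed to match the MLE:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))        # N=500 points in D=3 dimensions, one per row

x_bar = X.mean(axis=0)
Sigma_mle = (X - x_bar).T @ (X - x_bar) / len(X)   # (1/N) sum (x - x_bar)(x - x_bar)^T

# np.cov expects variables in rows; bias=True divides by N rather than N-1
assert np.allclose(Sigma_mle, np.cov(X.T, bias=True))
```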
example

estimating the mean and the covariance of the Iris dataset; the contour lines show $\mathcal{N}(\mu, \Sigma) = \text{const}$
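a sketch of this fit; the choice of the two Iris features below is an assumption (the lecture's plot may use different ones):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data[:, :2]     # sepal length and sepal width
mu = X.mean(axis=0)
Sigma = np.cov(X.T, bias=True)  # MLE covariance

print(mu)                       # empirical mean of the two features
print(Sigma)                    # 2x2 sample covariance
# the contour lines of N(mu, Sigma) = const are ellipses centered at mu
```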
example

considering the bivariate case for visualization:

isotropic Gaussian: $\Sigma = \sigma^2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

axis-aligned: $\Sigma = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}$

full covariance: $\Sigma = \begin{bmatrix} 9 & 4 \\ 4 & 4 \end{bmatrix}$
if $x \sim \mathcal{N}(\mu_x, \Sigma_x)$ and $y = Qx$ for a $D' \times D$ matrix $Q$, then $y \sim \mathcal{N}(\mu_y, \Sigma_y)$, where

$$\mu_y = \mathbb{E}[Qx] = Q\mathbb{E}[x] = Q\mu_x$$

$$\Sigma_y = \mathbb{E}[Qxx^\top Q^\top] - \mathbb{E}[Qx]\mathbb{E}[x^\top Q^\top] = Q\left( \mathbb{E}[xx^\top] - \mathbb{E}[x]\mathbb{E}[x]^\top \right) Q^\top = Q \Sigma_x Q^\top$$
example

$$\Sigma_x = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad Q = \begin{bmatrix} 1 & 0 \\ 2 & 4 \end{bmatrix} \quad\Rightarrow\quad \Sigma_y = Q\Sigma_x Q^\top = \begin{bmatrix} 1 & 2 \\ 2 & 20 \end{bmatrix}$$

can we construct any multivariate Gaussian from axis-aligned Gaussians in this way?
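a quick empirical check of this example: sample from the axis-aligned Gaussian, transform, and compare the sample covariance with $Q\Sigma_x Q^\top$:

```python
import numpy as np

rng = np.random.default_rng(3)
Q = np.array([[1.0, 0.0],
              [2.0, 4.0]])

x = rng.normal(size=(100_000, 2))   # x ~ N(0, I), one sample per row
y = x @ Q.T                         # y = Q x applied to every sample

print(np.cov(y.T, bias=True))       # approximately Q I Q^T = [[1, 2], [2, 20]]
```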
the covariance matrix is symmetric positive semi-definite:

symmetric because $\Sigma_{d,d'} = \mathrm{Cov}(x_d, x_{d'}) = \mathrm{Cov}(x_{d'}, x_d) = \Sigma_{d',d}$

positive semi-definite because for any $y \in \mathbb{R}^D$

$$y^\top \Sigma y = y^\top \mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top] y = \mathrm{Var}(y^\top x) \geq 0$$

any symmetric positive semi-definite matrix can be decomposed as $\Sigma = Q \Lambda Q^\top$, where $\Lambda$ is diagonal ($D \times D$) and $Q$ is orthogonal, $QQ^\top = Q^\top Q = I$ (a rotation and reflection)

so we can produce any Gaussian by rotation and reflection of an axis-aligned Gaussian
example

$$\Sigma = \begin{bmatrix} 10 & 5 \\ 5 & 5 \end{bmatrix} \approx \begin{bmatrix} -.85 & -.52 \\ -.52 & .85 \end{bmatrix} \begin{bmatrix} 13.09 & 0 \\ 0 & 1.90 \end{bmatrix} \begin{bmatrix} -.85 & -.52 \\ -.52 & .85 \end{bmatrix}^\top$$

the columns of $Q$ tell us where the original bases go, and the diagonal of $\Lambda$ gives the variance along the "new axes" of the corresponding axis-aligned Gaussian

(more on this in the PCA lecture)
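np.linalg.eigh recovers exactly this decomposition (up to the ordering and sign of the eigenvectors):

```python
import numpy as np

Sigma = np.array([[10.0, 5.0],
                  [5.0, 5.0]])

lam, Q = np.linalg.eigh(Sigma)   # eigenvalues in ascending order: ~[1.91, 13.09]
print(lam)                       # variances along the new axes
print(Q)                         # orthogonal matrix; columns are the new axes

assert np.allclose(Q @ np.diag(lam) @ Q.T, Sigma)   # Sigma = Q Lambda Q^T
assert np.allclose(Q @ Q.T, np.eye(2))              # Q Q^T = I
```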
what is the distribution of IQ? people's height and IQ are jointly normally distributed:

$$\begin{bmatrix} x_H \\ x_{IQ} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_H \\ \mu_{IQ} \end{bmatrix}, \begin{bmatrix} \sigma_H^2 & \sigma_{H,IQ} \\ \sigma_{H,IQ} & \sigma_{IQ}^2 \end{bmatrix} \right)$$

we need to marginalize over height: $p(y) = \int_z p(y, z)\, dz$

for Gaussian distributions the marginal is also Gaussian, and the same idea extends to marginalizing over more than one variable. marginalization corresponds to a linear transformation:

$$\begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} x_H \\ x_{IQ} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} \mu_H \\ \mu_{IQ} \end{bmatrix}, \; \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} \sigma_H^2 & \sigma_{H,IQ} \\ \sigma_{H,IQ} & \sigma_{IQ}^2 \end{bmatrix} \begin{bmatrix} 0 & 1 \end{bmatrix}^\top \right)$$

$$x_{IQ} \sim \mathcal{N}\left( \mu_{IQ}, \sigma_{IQ}^2 \right)$$
correlation is normalized covariance:

$$\mathrm{Corr}(x_i, x_j) = \frac{\mathrm{Cov}(x_i, x_j)}{\sqrt{\mathrm{Var}(x_i)\,\mathrm{Var}(x_j)}} \in [-1, +1]$$

two variables that are independent are uncorrelated as well:

$$p(x_i, x_j) = p(x_i)\,p(x_j) \;\Rightarrow\; \mathbb{E}[x_i x_j] = \mathbb{E}[x_i]\,\mathbb{E}[x_j] \;\Rightarrow\; \mathrm{Cov}(x_i, x_j) = 0$$

the inverse is generally not true (zero correlation does not imply independence): in each of the examples in the figure, the correlation between the two coordinates is zero, but they are not independent (image from wikipedia)
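a quick numerical illustration of one such case: with $x \sim \mathcal{N}(0, 1)$, the variable $y = x^2$ is completely determined by $x$, yet uncorrelated with it:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1_000_000)
y = x ** 2                      # fully dependent on x

print(np.corrcoef(x, y)[0, 1])  # approximately 0, since Cov(x, x^2) = E[x^3] = 0
```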
the inverse is true for Gaussians!

$$\mathrm{Corr}(x_i, x_j) = 0 \;\Leftrightarrow\; \Sigma_{i,j} = 0$$

why does zero covariance imply independence here? marginalize out all variables except $x_i, x_j$:

$$\begin{bmatrix} x_i \\ x_j \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_i \\ \mu_j \end{bmatrix}, \begin{bmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_j^2 \end{bmatrix} \right)$$

but this is the product of two univariate Gaussian densities, therefore $x_i$ is independent of $x_j$
given $p(x_A) = \mathcal{N}(\mu_A, \Sigma_A)$ and a conditional of the form $p(x_B | x_A) = \mathcal{N}(Qx_A + c, \Sigma_{B|A})$ (you probably guessed this form based on the formula for linear transformations), we can use the chain rule $p(x_B, x_A) = p(x_B | x_A)\,p(x_A)$; the joint distribution is then also normal:

$$p(x_B, x_A) = \mathcal{N}\left( \begin{bmatrix} Q\mu_A + c \\ \mu_A \end{bmatrix}, \begin{bmatrix} \Sigma_{B|A} + Q\Sigma_A Q^\top & Q\Sigma_A \\ \Sigma_A Q^\top & \Sigma_A \end{bmatrix} \right)$$
example

a dragon's life-span is approximately normally distributed with $\mu_A = 1000$, $\sigma_A = 100$, and the heat of a dragon's breath is normal with mean $\mu_B = 2x_A - 273$ and $\sigma_{B|A} = 30$. what is the probability that a random dragon at its death bed can melt stainless steel!?

substituting into the chain rule formula,

$$p(x_B, x_A) = \mathcal{N}\left( \begin{bmatrix} 2000 - 273 \\ 1000 \end{bmatrix}, \begin{bmatrix} 900 + 40000 & 20000 \\ 20000 & 10000 \end{bmatrix} \right)$$

we just care about the marginal distribution over the heat of the dragon's breath, $p(x_B) = \mathcal{N}(1727, 40900)$; steel's melting point is 1500 C, so we want $p(x_B > 1500)$
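as a numeric check, a minimal sketch of the chain-rule substitution above (here $Q$ and $c$ are the scalars 2 and $-273$):

```python
import numpy as np

mu_A, var_A = 1000.0, 100.0 ** 2            # life-span: N(1000, 100^2)
Q, c, var_BgA = 2.0, -273.0, 30.0 ** 2      # x_B | x_A ~ N(2 x_A - 273, 30^2)

mu_joint = np.array([Q * mu_A + c, mu_A])                     # [1727, 1000]
Sigma_joint = np.array([[var_BgA + Q * var_A * Q, Q * var_A],
                        [Q * var_A,               var_A]])
print(mu_joint)        # [1727. 1000.]
print(Sigma_joint)     # [[40900. 20000.], [20000. 10000.]]
```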
use the CDF of the standard normal:

$$\frac{1727 - 1500}{\sqrt{40900}} \approx 1.12, \qquad p(x_B > 1500) = p(z > -1.12) = \Phi(1.12) \approx 0.87$$

about 13% of dragons can't do it!
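the tail probability itself, computed with scipy's normal CDF:

```python
from scipy.stats import norm

p_melt = 1 - norm.cdf(1500, loc=1727, scale=40900 ** 0.5)
print(p_melt)   # ~0.87, so roughly 13% of dragons can't do it
```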
given that $x_W$ (birth weight) and $x_S$ (shoe size) are jointly normally distributed,

$$\begin{bmatrix} x_W \\ x_S \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_W \\ \mu_S \end{bmatrix}, \begin{bmatrix} \sigma_W^2 & \sigma_{W,S} \\ \sigma_{W,S} & \sigma_S^2 \end{bmatrix} \right)$$

given an assignment of birth weight, $x_W = \bar{x}_W$, what is the distribution of shoe size? for example, at $x_W = .7$ we want the conditional distribution $p(x_S | x_W = .7)$, a slice of the joint $p(x_W, x_S)$, rather than the marginal distribution $p(x_S)$
let $x_A, x_B$ denote a partitioning of $x$, so that

$$\begin{bmatrix} x_A \\ x_B \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \begin{bmatrix} \Sigma_A & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_B \end{bmatrix} \right), \qquad \Sigma_{BA} = \Sigma_{AB}^\top$$

then $p(x_A | x_B = \bar{x}_B) = \mathcal{N}(\mu_{A|B}, \Sigma_{A|B})$, where

$$\mu_{A|B} = \mu_A + \Sigma_{AB}\Sigma_B^{-1}(\bar{x}_B - \mu_B)$$

$$\Sigma_{A|B} = \Sigma_A - \Sigma_{AB}\Sigma_B^{-1}\Sigma_{BA}$$

after conditioning the variance decreases, and the conditional variance is independent of the observation $\bar{x}_B$; by conditioning on the mean, $\bar{x}_B = \mu_B$, we get $\mu_{A|B} = \mu_A$
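a generic sketch of these two formulas as a small helper function (the function name and interface are mine, not from the lecture):

```python
import numpy as np

def condition_gaussian(mu_A, mu_B, Sigma_A, Sigma_AB, Sigma_B, x_B_bar):
    """Parameters of p(x_A | x_B = x_B_bar) for jointly Gaussian (x_A, x_B)."""
    K = Sigma_AB @ np.linalg.inv(Sigma_B)     # Sigma_AB Sigma_B^{-1}
    mu_AgB = mu_A + K @ (x_B_bar - mu_B)      # mean shifts with the observation
    Sigma_AgB = Sigma_A - K @ Sigma_AB.T      # variance shrinks, independent of x_B_bar
    return mu_AgB, Sigma_AgB
```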
example

a dragon's life-span is approximately normally distributed with $\mu_A = 1000$, $\sigma_A = 100$, and the heat of a dragon's breath is normal with mean $\mu_B = 2x_A - 273$ and $\sigma_{B|A} = 30$. at its death bed a dragon claims that its breath is twice hotter than lava ($2 \times 1250$ C). what is your best guess for its age?

we have $p(x_A)$ and $p(x_B | x_A)$, but we want $p(x_A | x_B)$, so we need to use Bayes rule: form the joint $p(x_A, x_B)$ as before and condition on $x_B = 2500$
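plugging the dragon's numbers into the conditioning formulas (scalar case, so the matrix inverses become divisions):

```python
mu_A, mu_B = 1000.0, 1727.0                    # from the joint computed earlier
Sigma_A, Sigma_AB, Sigma_B = 10000.0, 20000.0, 40900.0
x_B_bar = 2500.0                               # claimed breath heat: 2 x 1250 C

mu_AgB = mu_A + Sigma_AB / Sigma_B * (x_B_bar - mu_B)   # ~1378: best guess for its age
Sigma_AgB = Sigma_A - Sigma_AB / Sigma_B * Sigma_AB     # ~220: much smaller than 10000
print(mu_AgB, Sigma_AgB)
```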
summary:
- the Gaussian distribution is motivated by the central limit theorem
- the expression for the multivariate Gaussian
- the maximum-likelihood estimate of its parameters
- the covariance matrix and its decomposition
- zero covariance means independence in Gaussians
- linear transformations of Gaussians produce Gaussians
- marginalization and conditioning produce Gaussians
- the sum of independent Gaussian random variables is Gaussian