Applied Machine Learning
Expectation Maximization
for Mixture of Gaussians
Siamak Ravanbakhsh
COMP 551 (Fall 2020)
Learning objectives

what is a latent variable model?
Gaussian mixture model
the intuition behind the Expectation-Maximization algorithm
relationship to k-means
given data D = {x^(1), …, x^(N)} and a model p(x; θ), e.g., multivariate Gaussian, Bernoulli
we saw generative models for classification: D = {(x^(1), y^(1)), …, (x^(N), y^(N))}
p(x, y; θ) = p(y; θ) p(x∣y; θ)
Learning: use maximum likelihood to fit the data
θ̂ = arg max_θ ∑_n log p(x^(n), y^(n); θ)
e.g., we used this to fit naive Bayes
or use Bayesian inference: p(θ∣D) ∝ p(θ)p(D∣θ)
sometimes we do not observe all the variables that we wish to model; these are called hidden variables or latent variables

examples:
bias (unobserved) leading to a hiring practice (observed)
3D scene (unobserved) producing a 2D photograph (observed)
gravity (unobserved) leading to an apple falling (observed)
genotype (unobserved) leading to some phenotype (observed)
input features (observed) having some unobserved class labels
...
the data D = {x^(1), …, x^(N)} is partial or incomplete
the model p(x, z; θ) accounts for both observed (x) and latent (z) variables
p(x, z; θ) = p(z; θ) p(x∣z; θ)

a latent variable model gives us a lot of flexibility in modeling the data: we can find hidden factors and learn how they lead to our observations. It is both a natural and powerful way to model complex observations, but it is difficult to "learn" the model from partial observations.
if the latent variable is the class label, this resembles generative classification:
p(x, y; θ) = p(y; θ) p(x∣y; θ)
but here we do not observe the labels. We saw that clustering performs classification without having labels, so can we use latent variable models for clustering?
suppose the latent variable has a categorical distribution p(z; π) (an unobserved class label):
p(x, z; θ, π) = Categorical(z; π) p(x∣z; θ)
each datapoint x comes from the component p(x∣z = k; θ_k) with probability π_k
we only observe x; marginalizing out z gives the data distribution, which is a mixture of K distributions:
p(x; θ, π) = ∑_k Categorical(z = k; π) p(x∣z = k; θ_k) = ∑_k π_k p(x∣z = k; θ_k)
let's consider the case where each component is Gaussian: model the data as a mixture of K Gaussian distributions
p(x; π, {μ_k, Σ_k}) = ∑_k π_k N(x; μ_k, Σ_k)
where π are the mixture weights
[figure: a Gaussian mixture model for D=2]
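As a concrete illustration, here is a minimal numpy/scipy sketch of evaluating this mixture density; the weights, means, and covariances below are made-up values for a K=2, D=2 example, not parameters from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

# illustrative parameters for a K=2, D=2 Gaussian mixture (made-up values)
pi = np.array([0.3, 0.7])                              # mixture weights, sum to 1
mus = [np.array([0., 0.]), np.array([3., 3.])]         # component means
Sigmas = [np.eye(2), np.array([[1., .5], [.5, 1.]])]   # component covariances

def gmm_density(x, pi, mus, Sigmas):
    """p(x) = sum_k pi_k N(x; mu_k, Sigma_k)"""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mus[k], cov=Sigmas[k])
               for k in range(len(pi)))

print(gmm_density(np.array([1., 1.]), pi, mus, Sigmas))
```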
we can calculate the probability of each datapoint x^(n) belonging to a cluster k:
p(z = k ∣ x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)
this is also called the responsibility of cluster k for data point (n): the weighted density of the k'th Gaussian at x^(n), divided by the density of the whole mixture at that point
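A small sketch of computing these responsibilities for a whole dataset at once, assuming X is an (N, D) numpy array and the parameter containers (pi, mus, Sigmas) follow the previous snippet; the function name is just a choice made here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """r[n, k] = pi_k N(x_n; mu_k, Sigma_k) / sum_c pi_c N(x_n; mu_c, Sigma_c)"""
    K = len(pi)
    # weighted density of each component at each point, shape (N, K)
    dens = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                     for k in range(K)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)   # normalize by the mixture density
```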
complete data (we have both x and z) vs. incomplete data (we only have x)
marginal distribution: p(x) = ∑_k p(x∣z = k) p(z = k)
visualizing samples from the joint distribution p(x, z):
z^(n) ∼ p(z; π), then x^(n) ∼ p(x∣z^(n); θ)
[figure: samples from the joint distribution; colors show the value of z]
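Such samples can be drawn by ancestral sampling: first the cluster label, then the point from that cluster's Gaussian. A minimal sketch, assuming the same parameter containers as above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(N, pi, mus, Sigmas):
    """Draw z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z), for N samples."""
    z = rng.choice(len(pi), size=N, p=pi)                              # latent cluster labels
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z
```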
responsibilities: p(z = k ∣ x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)
mixture of Gaussians: p(x; π, {μ_k, Σ_k}) = ∑_k π_k N(x; μ_k, Σ_k)
we can calculate the probability of each datapoint belonging to a cluster k:
p(z = k ∣ x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)

a probabilistic alternative to K-means:
K-means: hard cluster membership r_{n,k} ∈ {0, 1} and cluster mean μ_k
mixture of Gaussians: soft cluster membership (responsibility) r_{n,k} = p(z = k ∣ x^(n)), cluster mean μ_k, and cluster covariance matrix Σ_k
maximize the marginal likelihood of the observations p(x) under our model:
ℓ(π, {μ_k, Σ_k}) = ∑_n log( ∑_k π_k N(x^(n); μ_k, Σ_k) )
set the derivatives to zero (see our references for a step-by-step calculation):

∂ℓ/∂μ_k = 0 gives μ_k = (1 / ∑_n r_{n,k}) ∑_n r_{n,k} x^(n)
a weighted mean, where the weight r_{n,k} = p(z = k ∣ x^(n)) is the responsibility: the probability of sample (n) belonging to cluster k

∂ℓ/∂Σ_k = 0 gives Σ_k = (1 / ∑_{n'} r_{n',k}) ∑_n r_{n,k} (x^(n) − μ_k)(x^(n) − μ_k)^⊤
a weighted covariance

∂ℓ/∂π_k = 0 gives π_k = (1/N) ∑_n r_{n,k}
the total amount of responsibility accepted by cluster k (normalized by N)
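These closed-form updates are just weighted averages, which is easy to see in code. Below is a minimal numpy sketch of the three update equations, assuming the responsibilities r have already been computed as an (N, K) array; the helper name m_step and the array layout are choices made here, not code from the course.

```python
import numpy as np

def m_step(X, r):
    """Closed-form updates given responsibilities r of shape (N, K) and data X of shape (N, D)."""
    N, K = r.shape
    Nk = r.sum(axis=0)                               # total responsibility per cluster
    pi = Nk / N                                      # mixture weights
    mus = (r.T @ X) / Nk[:, None]                    # weighted means, shape (K, D)
    Sigmas = []
    for k in range(K):
        diff = X - mus[k]                            # (N, D)
        Sigmas.append((r[:, k, None] * diff).T @ diff / Nk[k])   # weighted covariance
    return pi, mus, Sigmas
```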
problem: the model parameters depend on the responsibilities, and the responsibilities depend on the model parameters
solution: iteratively update both the parameters and the responsibilities until convergence
EM for the Gaussian mixture:
start from some initial model {μ_k, Σ_k}, π
repeat until convergence:
  expectation step: update the responsibilities given the model, ∀n, k
    r_{n,k} ← π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)
  maximization step: update the model given the responsibilities, ∀k
    μ_k ← (1 / ∑_n r_{n,k}) ∑_n r_{n,k} x^(n)
    Σ_k ← (1 / ∑_{n'} r_{n',k}) ∑_n r_{n,k} (x^(n) − μ_k)(x^(n) − μ_k)^⊤
    π_k ← (1/N) ∑_n r_{n,k}
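Putting the E-step and M-step together gives a compact EM loop. The following is a sketch, not the course's reference implementation: the initialization (uniform weights, random data points as means, identity covariances) and the stopping rule (small change in the marginal log-likelihood) are simple choices made here for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture; returns weights, means, covariances, responsibilities."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # simple initialization: uniform weights, random data points as means, identity covariances
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]
    Sigmas = np.stack([np.eye(D)] * K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r[n, k] proportional to pi_k N(x_n; mu_k, Sigma_k)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted means, covariances, and mixture weights
        Nk = r.sum(axis=0)
        pi = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
        # check convergence of the marginal log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if np.abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mus, Sigmas, r
```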
example for the Gaussian mixture
[figure: initialization, expectation step (finding responsibilities), maximization step (finding model parameters), and the fit after iterations 2, 5, and 20; EM converges after 20 iterations (D=2, K=2)]
example for the Gaussian mixture: Iris flowers dataset, multiple runs
converged after 34 iterations, average log-likelihood: -1.49
converged after 120 iterations, average log-likelihood: -1.47
converged after 50 iterations, average log-likelihood: -1.45
converged after 43 iterations, average log-likelihood: -1.45
which model is better? under the maximum-likelihood objective, the runs with the higher average log-likelihood (-1.45) give the better fit
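One way to reproduce this kind of experiment is scikit-learn's GaussianMixture with different random initializations; the iteration counts and log-likelihoods on the slide come from the course's own runs, so the numbers printed below will differ.

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
# several random restarts; each run may converge to a different local optimum
for seed in range(4):
    gm = GaussianMixture(n_components=3, init_params='random', random_state=seed).fit(X)
    print(seed, gm.n_iter_, gm.score(X))   # iterations and average log-likelihood per sample
```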
K-Means vs. EM for the Gaussian mixture model

objective: K-Means minimizes the sum of squared Euclidean distances to the cluster centers; EM minimizes the negative log-(marginal) likelihood
parameters: K-Means learns the cluster centers; EM learns the means, covariances, and mixture weights
responsibilities: K-Means uses hard cluster memberships; EM uses soft cluster memberships
algorithm: both alternate minimization with respect to the parameters and the responsibilities
feature scaling: K-Means is sensitive; EM is robust, because it learns the covariance
efficiency: K-Means converges faster; EM converges more slowly

both converge to a local optimum, and in both, swapping the cluster indices makes no difference in the objective
we saw an application of EM to the Gaussian mixture, but EM is a general algorithm for learning latent variable models: we have a model p(x, z; θ) and partial observations D = {x^(1), …, x^(N)}
to learn the model parameters and infer the latent variables, use EM:
start from some initial model θ
repeat until convergence:
  E-step: do a probabilistic completion p(z^(n) ∣ x^(n); θ) ∀n
  M-step: fit the model θ to the (probabilistically) completed data
a simple variation is called the hard EM algorithm:
start from some initial model θ
repeat until convergence:
  E-step: do a deterministic completion z^(n) = arg max_z p(z ∣ x^(n); θ) ∀n
  M-step: fit the model θ to the completed data using maximum likelihood

K-means is performing hard EM with a fixed covariance and fixed mixture weights:
  E-step: find the closest center (i.e., find the Gaussian with the highest probability)
  M-step: fit the Gaussians to the completed data (x, z)
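The only change from soft EM is in the E-step: the responsibilities are replaced by a one-hot assignment to the most probable cluster. A tiny sketch of that replacement, assuming r is the (N, K) responsibility array from before:

```python
import numpy as np

def hard_assign(r):
    """z[n] = argmax_k r[n, k]; with equal, fixed covariances and mixture weights
    this is exactly the K-means rule of picking the closest center."""
    z = r.argmax(axis=1)
    r_hard = np.zeros_like(r)
    r_hard[np.arange(len(z)), z] = 1.0   # one-hot version of the soft responsibilities
    return r_hard
```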
Latent variable models: a general and powerful type of probabilistic model
  we have only partial observations
  we can use EM to learn the parameters and infer the hidden values
Expectation Maximization (EM):
  useful when we have hidden variables or missing values
  tries to maximize the log-likelihood of the observations
  iterates between learning the model parameters and inferring the latents
  converges to a local optimum (performance depends on initialization)
The only concrete example that we saw: the Gaussian mixture model (GMM)
  EM in the GMM for soft clustering
  relationship to K-means