CSC 311: Introduction to Machine Learning Lecture 8 - Probabilistic - PowerPoint PPT Presentation

CSC 311: Introduction to Machine Learning Lecture 8 - Probabilistic Models Pt. II, PCA Roger Grosse Chris Maddison Juhan Bae Silviu Pitis University of Toronto, Fall 2020 Intro ML (UofT) CSC311-Lec8 1 / 44

Recap Last week took a probabilistic perspective on parameter estimation. We modeled a biased coin as a Bernoulli random variable with parameter θ , which we estimated using: ◮ maximum likelihood estimation: ˆ θ ML = max θ p ( D | θ ) ◮ expected Bayesian posterior: E [ θ | D ] where p ( θ | D ) ∝ p ( θ ) p ( D | θ ) by Bayes’ Rule. ◮ Maximum a-posteriori (MAP) estimation: ˆ θ MAP = arg max θ p ( θ | D ) We also saw parameter estimation in context of a Na¨ ıve Bayes classifier. Today we will continuing developing the probabilistic perspective: ◮ Gaussian Discriminant Analysis: Use Gaussian generative model of the data for classification ◮ Principal Component Analysis: Simplify a Gaussian model by projected it onto a lower dimensional subspace Intro ML (UofT) CSC311-Lec8 2 / 44

Gaussian Discriminant Analysis Generative model for classification Instead of trying to separate classes, try to model what each class “looks like”: p ( x | t = k ). Recall p ( x | t = k ) may be very complex for high dimensional data: p ( x 1 , · · · , x d , t ) = p ( x 1 | x 2 , · · · , x d , t ) · · · p ( x d − 1 | x d , t ) p ( x d , t ) Naive bayes used a conditional independence assumption. What else could we do? Choose a simple distribution. Next, we will discuss fitting Gaussian distributions to our data. Intro ML (UofT) CSC311-Lec8 3 / 44

Classification: Diabetes Example Observation per patient: White blood cell count & glucose value. p ( x | t = k ) for each class is shaped like an ellipse = ⇒ we model each class as a multivariate Gaussian Intro ML (UofT) CSC311-Lec8 4 / 44

Univariate Gaussian distribution Recall the Gaussian, or normal, distribution: − ( x − µ ) 2 1 � � N ( x ; µ, σ 2 ) = √ 2 πσ exp 2 σ 2 The Central Limit Theorem says that sums of lots of independent random variables are approximately Gaussian. In machine learning, we use Gaussians a lot because they make the calculations easy. Intro ML (UofT) CSC311-Lec8 5 / 44

Multivariate Data Multiple measurements (sensors) D inputs/features/attributes N instances/observations/examples  x (1) x (1) x (1)  [ x (1) ] ⊤ · · ·   1 2 D x (2) x (2) x (2) [ x (2) ] ⊤ · · ·     1 2 D   X =  =   . . . . ...   . . . .   . . . .      [ x ( N ) ] ⊤ x ( N ) x ( N ) x ( N ) · · · 1 2 D Intro ML (UofT) CSC311-Lec8 6 / 44

Multivariate Mean and Covariance Mean   µ 1 . . µ = E [ x ] =   .   µ d Covariance  σ 2  σ 12 · · · σ 1 D 1 σ 2 σ 12 · · · σ 2 D  2  Σ = Cov( x ) = E [( x − µ )( x − µ ) ⊤ ] =  . . .  ... . . .   . . .   σ 2 · · · σ D 1 σ D 2 D The statistics ( µ and Σ ) uniquely define a multivariate Gaussian (or multivariate Normal) distribution, denoted N ( µ , Σ ) or N ( x ; µ , Σ ) ◮ This is not true for distributions in general! Intro ML (UofT) CSC311-Lec8 7 / 44

Multivariate Gaussian Distribution Normally distributed variable x ∼ N ( µ , Σ ) has distribution: 1 � − 1 � 2( x − µ ) T Σ − 1 ( x − µ ) p ( x ) = (2 π ) d/ 2 | Σ | 1 / 2 exp Intro ML (UofT) CSC311-Lec8 8 / 44

Gaussian Intuition: (Multivariate) Shift + Scale Recall that in the univariate case, all normal distributions are shaped like the standard normal distribution The densities are related to the standard normal by a shift ( µ ), a scale (or stretch, or dilation) σ , and a normalization factor Intro ML (UofT) CSC311-Lec8 9 / 44

Gaussian Intuition: (Multivariate) Shift + Scale The same intuition applies in the multivariate case. We can think of the multivariate Gaussian as a shifted and “scaled” version of the standard multivariate normal distribution. ◮ The standard multivariate normal has µ = 0 and Σ = I Multivariate analog of the shift is simple: it’s a vector µ But what about the scale? ◮ In the univariate case, the scale factor was the square root of the √ σ 2 variance: σ = ◮ But in the multivariate case, the covariance Σ is a matrix! 1 2 exist, and can we scale by it? Does Σ Intro ML (UofT) CSC311-Lec8 10 / 44

Multivariate Scaling (Intuitive) (optional draw-on slide for intuition) We call a matrix “positive definite” if it scales the space in orthogonal directions. The univariate analog is positive scalar α > 0. Consider, e.g., how these two matrices transform the orthogonal vectors: � 1 � 2 � � Consider 0 0 . 5 matrix: 0 0 . 5 0 . 5 1 � 1 � 1 � � 0 � � 1 � � Consider action on: ⊥ ⊥ 0 1 1 − 1 Draw action on slide: Notice : both matrices are symmetric! Intro ML (UofT) CSC311-Lec8 11 / 44

Multivariate Scaling (Formal) (details optional) We summarize some definitions/results from linear algebra (without proof). Knowing them is optional, but they may help with intuition (esp. for PCA). Definition. Symmetric matrix A is positive semidefinite if x ⊤ Ax ≥ 0 for all non-zero x . It is positive definite if x ⊤ Ax > 0 for all non-zero x . ◮ Any positive definite matrix is positive semidefinite. ◮ Positive definite matrices have positive eigenvalues, and positive semidefinite matrices have non-negative eigenvalues. ◮ For any matrix X , X ⊤ X and XX ⊤ are positive semidefinite. Theorem ( Unique Positive Square Root ) . Let A be a positive semidefinite real matrix. Then there is a unique positive semidefinite matrix B such that 1 A = B ⊤ B = BB . We call A 2 � B the positive square root of A . Theorem ( Spectral Theorem ) . The following are equivalent for A ∈ R d × d : 1. A is symmetric. 2. R D has an orthonormal basis consisting of the eigenvectors of A . 3. There exists orthogonal matrix Q and diagonal matrix Λ such that A = QΛQ T . This is called the spectral decomposition of A . ◮ The columns of Q are (unit) eigenvectors of A . Intro ML (UofT) CSC311-Lec8 12 / 44

Properties of Σ Key properties of Σ : 1. Σ is positive semidefinite (and therefore symmetric). 2. For a distribution with density, Σ is positive definite. Other properties (optional / for reference): 3. Σ = E [ xx ⊤ ] − µµ ⊤ (generalizes Var( x ) = E [ x 2 ] − µ 2 )) 4. Cov( Ax + b ) = AΣA ⊤ (generalizes Var( ax + b ) = a 2 Var( x )) So here is the “scale” intuition: 1 2 . For positive definite Σ , consider its unique positive square root Σ 1 2 is also positive definite, so by the Real Spectral Theorem, it “scales” Σ the space in orthogonal directions (its eigenvectors) by its eigenvalues. 1 2 ! So we can think of N ( µ , Σ ) as N ( 0 , I ) shifted by µ and “scaled” by Σ ◮ Note that if Σ = QΛQ T , Σ 1 1 2 = QΛ 2 Q T Lets see some examples... Intro ML (UofT) CSC311-Lec8 13 / 44

Bivariate Gaussian � 1 � � 1 � � 1 � 0 0 0 Σ = Σ = 0 . 5 Σ = 2 0 1 0 1 0 1 Figure: Probability density function Figure: Contour plot of the pdf Intro ML (UofT) CSC311-Lec8 14 / 44

Bivariate Gaussian � 1 0 � � 2 0 � � 1 0 � Σ = Σ = Σ = 0 1 0 1 0 2 Figure: Probability density function Figure: Contour plot of the pdf Intro ML (UofT) CSC311-Lec8 15 / 44

Bivariate Gaussian � 1 � 1 � 1 � � � 0 0 . 5 0 . 8 Σ = Σ = Σ = 0 1 0 . 5 1 0 . 8 1 � 1 . 5 � � 1 . 8 � 0 . 0 . Q ⊤ Q ⊤ = Q 1 = Q 2 1 2 0 . 0 . 5 0 . 0 . 2 Test your intuition: Does Q 1 = Q 2 ? Figure: Probability density function Figure: Contour plot of the pdf Intro ML (UofT) CSC311-Lec8 16 / 44

Bivariate Gaussian � 1 � 1 � 1 � � � 0 0 . 5 − 0 . 5 Σ = Σ = Σ = 0 1 0 . 5 1 − 0 . 5 1 � 1 . 5 � � λ 1 � 0 . 0 . Q ⊤ Q ⊤ = Q 1 = Q 2 1 2 0 . 0 . 5 0 . λ 2 Test your intuition: Does Q 1 = Q 2 ? What are λ 1 and λ 2 ? Figure: Probability density function Figure: Contour plot of the pdf Intro ML (UofT) CSC311-Lec8 17 / 44

Bivariate Gaussian Intro ML (UofT) CSC311-Lec8 18 / 44

Bivariate Gaussian Intro ML (UofT) CSC311-Lec8 19 / 44

Gaussian Maximum Likelihood Suppose we want to model the distribution of highest and lowest temperatures in Toronto in March, and we’ve recorded the following observations � (-2.5,-7.5) (-9.9,-14.9) (-12.1,-17.5) (-8.9,-13.9) (-6.0,-11.1) Assume they’re drawn from a Gaussian distribution with mean µ , and covariance Σ . We want to estimate these using data. Log-likelihood function: N � � �� 1 − 1 2( x ( i ) − µ ) T Σ − 1 ( x ( i ) − µ ) ℓ ( µ , Σ ) = log (2 π ) d/ 2 | Σ | 1 / 2 exp i =1 N � � �� 1 − 1 � 2( x ( i ) − µ ) T Σ − 1 ( x ( i ) − µ ) = log (2 π ) d/ 2 | Σ | 1 / 2 exp i =1 N − log | Σ | 1 / 2 − 1 � 2( x ( i ) − µ ) T Σ − 1 ( x ( i ) − µ ) − log(2 π ) d/ 2 = � �� i =1 constant Optional intuition building: why does | Σ | 1 / 2 show up in the Gaussian density p ( x )? Hint: determinant is product of eigenvalues Intro ML (UofT) CSC311-Lec8 20 / 44

CSC 311: Introduction to Machine Learning Lecture 8 - Probabilistic - PowerPoint PPT Presentation

CSC 311: Introduction to Machine Learning Lecture 8 - Probabilistic Models Pt. II, PCA Roger Grosse Chris Maddison Juhan Bae Silviu Pitis University of Toronto, Fall 2020 Intro ML (UofT) CSC311-Lec8 1 / 44 Recap Last week took a

311: It was here then it was gone and now its back 311 call center closed doors at the

Unification of CSC and SE ABET Effor ts Similarity of CSC and SE Programs Similarity of CSC and

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

CSC 311: Introduction to Machine Learning Lecture 1 - Introduction and Nearest Neighbors Roger

CSC 311: Introduction to Machine Learning Lecture 4 - Neural Networks Roger Grosse Chris

CSC 311: Introduction to Machine Learning Lecture 2 - Linear Methods for Regression, Optimization

CSC 311: Introduction to Machine Learning Lecture 5 - Decision Trees & Bias-Variance

CSC 311: Introduction to Machine Learning Lecture 3 - Linear Classifiers, Logistic Regression,

CSC 311: Introduction to Machine Learning Lecture 7 - Probabilistic Models Roger Grosse Chris

CSC 311: Introduction to Machine Learning Lecture 6 - Bagging, Boosting Roger Grosse Chris

CSC Effectiveness Review CSC Effectiveness Review Team October 2018 ICANN63 Need for Review of

CMPSC 311- Introduction to Systems Programming Module: Systems Programming Professor Patrick

CMPSC 311- Introduction to Systems Programming Module: Studying Professor Patrick McDaniel Fall

Quantum Machine Learning Adam Brown, HEP-AI Quantum Computing Machine Learning Quantum

MICROSOFT AZURE MACHINE LEARNING Oscar Naim Microsoft Microsoft Azure Machine Learning What is

MACHINE LEARNING Overview 1 1 APPLIED MACHINE LEARNING 2011-2012 APPLIED MACHINE LEARNING

Symbolic Verification of Epistemic Properties in Programs Ioana Boureanu (Univ. of Surrey, SCCS)

HLSM & Time Constraints on Sequential Circuits Prof. Usagi RTL(Register Transfer Level)

E906/SeaQuest Report Michelle M. Medeiros for the SeaQuest Collaboration Mar. 27, 2017. 1/4

So Matsuura (Department of Physics Hiyoshi,

Optimizing Convex Functions over Non-Convex Domains Dan Bienstock and Alex Michalka

YAGO: Yet Another Great Ontology Fabian M. Suchanek (joint work with Gjergji Kasneci, Mauro Sozio

Virtual Forensics 2.0 Investigating virtual environments Christiaan Beek Agenda Who am I?

Evaluation of MapReduce for Gridding LIDAR Data Sriram Krishnan, Ph.D., Senior Distributed

CSC 311: Introduction to Machine Learning Lecture 8 - Probabilistic - PowerPoint PPT Presentation

CSC 311: Introduction to Machine Learning Lecture 8 - Probabilistic Models Pt. II, PCA Roger Grosse Chris Maddison Juhan Bae Silviu Pitis University of Toronto, Fall 2020 Intro ML (UofT) CSC311-Lec8 1 / 44 Recap Last week took a

311: It was here then it was gone and now its back 311 call center closed doors at the

Unification of CSC and SE ABET Effor ts Similarity of CSC and SE Programs Similarity of CSC and

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

CSC 311: Introduction to Machine Learning Lecture 1 - Introduction and Nearest Neighbors Roger

CSC 311: Introduction to Machine Learning Lecture 4 - Neural Networks Roger Grosse Chris

CSC 311: Introduction to Machine Learning Lecture 2 - Linear Methods for Regression, Optimization

CSC 311: Introduction to Machine Learning Lecture 5 - Decision Trees &amp; Bias-Variance

CSC 311: Introduction to Machine Learning Lecture 3 - Linear Classifiers, Logistic Regression,

CSC 311: Introduction to Machine Learning Lecture 7 - Probabilistic Models Roger Grosse Chris

CSC 311: Introduction to Machine Learning Lecture 6 - Bagging, Boosting Roger Grosse Chris

CSC Effectiveness Review CSC Effectiveness Review Team October 2018 ICANN63 Need for Review of

CMPSC 311- Introduction to Systems Programming Module: Systems Programming Professor Patrick

CMPSC 311- Introduction to Systems Programming Module: Studying Professor Patrick McDaniel Fall

Quantum Machine Learning Adam Brown, HEP-AI Quantum Computing Machine Learning Quantum

MICROSOFT AZURE MACHINE LEARNING Oscar Naim Microsoft Microsoft Azure Machine Learning What is

MACHINE LEARNING Overview 1 1 APPLIED MACHINE LEARNING 2011-2012 APPLIED MACHINE LEARNING

Symbolic Verification of Epistemic Properties in Programs Ioana Boureanu (Univ. of Surrey, SCCS)

HLSM &amp; Time Constraints on Sequential Circuits Prof. Usagi RTL(Register Transfer Level)

E906/SeaQuest Report Michelle M. Medeiros for the SeaQuest Collaboration Mar. 27, 2017. 1/4

So Matsuura (Department of Physics Hiyoshi,

Optimizing Convex Functions over Non-Convex Domains Dan Bienstock and Alex Michalka

YAGO: Yet Another Great Ontology Fabian M. Suchanek (joint work with Gjergji Kasneci, Mauro Sozio

Virtual Forensics 2.0 Investigating virtual environments Christiaan Beek Agenda Who am I?

Evaluation of MapReduce for Gridding LIDAR Data Sriram Krishnan, Ph.D., Senior Distributed

CSC 311: Introduction to Machine Learning Lecture 5 - Decision Trees & Bias-Variance

HLSM & Time Constraints on Sequential Circuits Prof. Usagi RTL(Register Transfer Level)