

SLIDE 1

Statistical Machine Learning

Lecture 06: Probability Density Estimation

Kristian Kersting, TU Darmstadt
Summer Term 2020

SLIDE 2

Today's Objectives

Make you understand how to find p(x).

Covered Topics
  • Density Estimation
  • Maximum Likelihood Estimation
  • Non-Parametric Methods
  • Mixture Models
  • Expectation Maximization

SLIDE 3

Outline

  • 1. Probability Density
  • 2. Parametric models
        Maximum Likelihood Method
  • 3. Non-Parametric Models
        Histograms
        Kernel Density Estimation
        K-nearest Neighbors
  • 4. Mixture models
  • 5. Wrap-Up

SLIDE 4

Outline (repeated; entering Section 1: Probability Density)

SLIDE 5

Training Data

[Figure: scatter plot of training data]

How do we get the probability distributions from this so that we can classify with them?

SLIDE 6

Probability Density Estimation

So far we have seen: Bayes optimal classification, based on probability distributions p(x | Ck) p(Ck).

The prior p(Ck) is easy to deal with: we can "just count" the number of occurrences of each class in the training data. We need to estimate (learn) the class-conditional probability density p(x | Ck).

Supervised training: we know the input data points and their true labels (classes). Estimate the density separately for each class Ck. "Abbreviation": p(x) = p(x | Ck)

SLIDE 7

Probability Density Estimation

Training data: x1, x2, x3, . . .
Estimation: p(x)

Methods
  • Parametric models
  • Non-parametric models
  • Mixture models

SLIDE 8

Outline (repeated; entering Section 2: Parametric Models)

SLIDE 9

Parametric Models

Simple case: Gaussian distribution

p(x | µ, σ) = (1/√(2πσ²)) exp(−(x − µ)² / (2σ²))

It is governed by two parameters: mean and variance. That is, if we know these parameters we can fully describe p(x).

SLIDE 10

Parametric Models

Notation for parametric density models: x ∼ p(x | θ)

For the Gaussian distribution: θ = (µ, σ), so x ∼ p(x | µ, σ)

SLIDE 11

Maximum Likelihood Method

Learning means to estimate the parameters θ given the training data X = {x1, x2, . . .}.

The likelihood of θ is defined as the probability that the data X was generated from the probability density function with parameters θ:

L(θ) = p(X | θ)

SLIDE 12

Maximum Likelihood Method

Consider a set of points X = {x1, . . . , xN}. Computing the likelihood:
  • Of a single datum: p(xn | θ)
  • Of all data?

Assumption: the data is i.i.d. (independent and identically distributed)
  • The random variables x1 and x2 are independent if P(x1 ≤ α, x2 ≤ β) = P(x1 ≤ α) P(x2 ≤ β) ∀ α, β ∈ R
  • The random variables x1 and x2 are identically distributed if P(x1 ≤ α) = P(x2 ≤ α) ∀ α ∈ R

SLIDE 13

Maximum Likelihood Method

Likelihood:

L(θ) = p(X | θ) = p(x1, . . . , xN | θ)
     = p(x1 | θ) · . . . · p(xN | θ)   (using the i.i.d. assumption)
     = ∏_{n=1}^N p(xn | θ)

SLIDE 14

Maximum log-Likelihood Method

Maximize the (log-)likelihood w.r.t. θ:

log L(θ) = log p(X | θ) = log ∏_{n=1}^N p(xn | θ) = Σ_{n=1}^N log p(xn | θ)

SLIDE 15

Maximum Likelihood Method - Gaussian

Maximum likelihood estimation of a Gaussian:

µ̂, σ̂ = argmax_{µ,σ} log L(θ) = argmax_{µ,σ} Σ_{n=1}^N log p(xn | µ, σ)

Take the partial derivatives and set them to zero:

∂L/∂µ = 0,   ∂L/∂σ = 0

This leads to a closed-form solution:

µ̂ = (1/N) Σ_{n=1}^N xn,   σ̂² = (1/N) Σ_{n=1}^N (xn − µ̂)²
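A minimal NumPy sketch of the two closed-form estimators above; the synthetic data, function names, and printout are illustrative additions, not from the slides:

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    # sum_n log N(x_n | mu, sigma^2), using the i.i.d. assumption
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

def gaussian_mle(x):
    # closed-form ML estimates from the slide (note the 1/N variance, not 1/(N-1))
    mu_hat = np.mean(x)
    sigma_hat = np.sqrt(np.mean((x - mu_hat)**2))
    return mu_hat, sigma_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)   # synthetic data for illustration
mu_hat, sigma_hat = gaussian_mle(x)
print(mu_hat, sigma_hat, gaussian_log_likelihood(x, mu_hat, sigma_hat))
```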

SLIDE 16

Maximum Likelihood Method - Gaussian

[Figure: Gaussian fitted to data by maximum likelihood]

SLIDE 17

Likelihood

L(θ) = p(X | θ) = ∏_{n=1}^N p(xn | θ)

SLIDE 18

Degenerate Case

If N = 1, X = {x1}, the resulting Gaussian looks like a spike at x1: the ML estimates are µ̂ = x1 and σ̂² = 0, a degenerate density.

SLIDE 19

Degenerate Case

What can we do to still get a useful estimate? We can put a prior on the mean!

SLIDE 20

Bayesian Estimation

Bayesian estimation / learning of parametric distributions assumes that the parameters are not fixed but are random variables too. This allows us to use prior knowledge about the parameters. How do we achieve that?

What do we want? A density model for x, p(x)
What do we have? Data X

SLIDE 21

Bayesian Estimation

Formalize this as a conditional probability:

p(x | X) = ∫ p(x, θ | X) dθ

p(x, θ | X) = p(x | θ, X) p(θ | X)

p(x) can be fully determined with the parameters θ, i.e., θ is a sufficient statistic. Hence, we have

p(x | θ, X) = p(x | θ)

p(x | X) = ∫ p(x | θ) p(θ | X) dθ

SLIDE 22

Bayesian Estimation

p(x | X) = ∫ p(x | θ) p(θ | X) dθ

p(θ | X) = p(X | θ) p(θ) / p(X) = L(θ) p(θ) / p(X)

p(X) = ∫ p(X | θ) p(θ) dθ = ∫ L(θ) p(θ) dθ

p(x | X) = (1 / p(X)) ∫ p(x | θ) L(θ) p(θ) dθ

SLIDE 23

Bayesian Estimation

p(x | X) = ∫ p(x | θ) p(θ | X) dθ

The probability p(θ | X) makes it explicit how the parameter estimation depends on the training data.

If p(θ | X) is small in most places, but large for a single θ̂, then we can approximate

p(x | X) ≈ p(x | θ̂)

Sometimes θ̂ is referred to as the Bayes point. The more uncertain we are about estimating θ̂, the more we average.

SLIDE 24

Bayesian Estimation

Problem: in general, it is intractable to integrate out the parameters θ (or only possible to do so numerically).

Example with a closed-form solution: Gaussian data distribution, where the variance is known and fixed; we estimate the distribution of the mean

p(µ | X) = p(X | µ) p(µ) / p(X)

with prior p(µ) = N(µ0, σ0²)

SLIDE 25

Bayesian Estimation

Sample mean: x̄ = (1/N) Σ_{n=1}^N xn

Bayesian estimation:

p(µ | X) = N(µN, σN²)

µN = (Nσ0² x̄ + σ² µ0) / (Nσ0² + σ²),   1/σN² = N/σ² + 1/σ0²

Check what happens when N grows to infinity...
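Because the posterior is available in closed form, it is short to check numerically. A sketch under the slide's assumptions (known σ, prior N(µ0, σ0²)); the function name and data are mine:

```python
import numpy as np

def posterior_mean_params(x, sigma, mu0, sigma0):
    # posterior N(mu_N, sigma_N^2) over the mean, formulas from the slide
    N = len(x)
    xbar = np.mean(x)
    mu_N = (N * sigma0**2 * xbar + sigma**2 * mu0) / (N * sigma0**2 + sigma**2)
    var_N = 1.0 / (N / sigma**2 + 1.0 / sigma0**2)
    return mu_N, var_N

rng = np.random.default_rng(1)
for N in (1, 10, 1000):   # posterior concentrates around the sample mean as N grows
    x = rng.normal(loc=2.0, scale=0.5, size=N)
    print(N, posterior_mean_params(x, sigma=0.5, mu0=0.0, sigma0=1.0))
```

Running it with growing N shows µN → x̄ and σN² → 0, which answers the slide's closing question.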

SLIDE 26

Conjugate Priors

Conjugate priors are prior distributions for the parameters that do not "change" the type of the parametric model. For example, as we saw, a Gaussian prior on the mean is conjugate to the Gaussian model. This works here because:
  • The product of two Gaussians is a Gaussian
  • The marginal of a Gaussian is a Gaussian

In general, it is not as easy!

SLIDE 27

Outline (repeated; entering Section 3: Non-Parametric Models)

SLIDE 28

Non-Parametric Models

Why use non-parametric representations? Often we do not know what functional form the class-conditional density takes (or we do not know what class of function we need). The probability density is estimated directly from the data (i.e., without an explicit parametric model):
  • Histograms
  • Kernel density estimation (Parzen windows)
  • K-nearest neighbors

SLIDE 29

Histograms

Discretize the feature space into bins.

[Figure: histograms with three bin sizes: not smooth enough, about right, too smooth]

SLIDE 30

Histograms

Properties
  • They are very general: in the infinite-data limit, any probability density can be approximated arbitrarily well
  • At the same time, it is a brute-force method

Problems
  • High-dimensional feature spaces: an exponential increase in the number of bins, hence exponentially much data is required (commonly known as the curse of dimensionality)
  • How to choose the size of the bins?
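A small sketch of a histogram used as a density estimate: bin counts are normalized so the estimate integrates to 1, which directly instantiates the p(x) ≈ K/(NV) idea formalized two slides below. Data and function names are illustrative:

```python
import numpy as np

def histogram_density(x, bins):
    # normalized histogram: per bin, p(x) ≈ K / (N * V) with V the bin width
    counts, edges = np.histogram(x, bins=bins)
    widths = np.diff(edges)
    p = counts / (counts.sum() * widths)
    return p, edges

rng = np.random.default_rng(2)
x = rng.normal(size=500)
for bins in (5, 30, 200):   # too smooth / about right / not smooth enough
    p, edges = histogram_density(x, bins)
    print(bins, np.sum(p * np.diff(edges)))   # each estimate integrates to 1.0
```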

SLIDE 31

Curse of Dimensionality

Shown here for histograms, but we will see that it is a general issue that we have to keep in mind.

SLIDE 32

More Formally

A data point x is sampled from probability density p(x). The probability that x falls in region R is

P(x ∈ R) = ∫_R p(x) dx

If R is sufficiently small, with volume V, then p(x) is almost constant inside R:

P(x ∈ R) = ∫_R p(x) dx ≈ p(x) V

If R is sufficiently large (so that enough samples fall into it):

P(x ∈ R) ≈ K/N   ⇒   p(x) ≈ K/(NV)

where N is the total number of points and K is the number of points falling in the region R.

SLIDE 33

More Formally

p(x) ≈ K/(NV)

Kernel density estimation: fix V and determine K
  • Example: determine the number of data points K in a fixed hypercube

K-nearest neighbor: fix K and determine V
  • Example: increase the size of a sphere until K data points fall into the sphere

SLIDE 34

Parzen Window

Hypercubes in d dimensions with edge length h:

H(u) = 1 if |uj| ≤ h/2 for all j = 1, . . . , d, and 0 otherwise

V = ∫ H(u) du = h^d

K(x) = Σ_{n=1}^N H(x − x(n))

p(x) ≈ K(x)/(NV) = (1/(N h^d)) Σ_{n=1}^N H(x − x(n))

SLIDE 35

Gaussian Kernel

H(u) = (1/√(2πh²))^d exp(−‖u‖²/(2h²))

V = ∫ H(u) du = 1

K(x) = Σ_{n=1}^N H(x − x(n))

p(x) ≈ K(x)/(NV) = (1/(N (√(2πh²))^d)) Σ_{n=1}^N exp(−‖x − x(n)‖²/(2h²))

SLIDE 36

General Formulation - Arbitrary Kernel

k(u) ≥ 0,   ∫ k(u) du = 1

V = h^d,   K(x) = Σ_{n=1}^N k((x − x(n))/h)

p(x) ≈ K(x)/(NV) = (1/(N h^d)) Σ_{n=1}^N k((x − x(n))/h)
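The general formula translates directly to code. A 1-d sketch using the Gaussian kernel and the box (Parzen) kernel discussed on the next slide; function names and data are my own illustrative choices:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def box_kernel(u):   # Parzen window
    return (np.abs(u) <= 0.5).astype(float)

def kde(x_query, data, h, kernel=gaussian_kernel):
    # 1-d kernel density estimate: p(x) ≈ 1/(N h) * sum_n k((x - x_n)/h)
    u = (x_query[:, None] - data[None, :]) / h
    return kernel(u).sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(3)
data = rng.normal(size=300)
grid = np.linspace(-4, 4, 9)
print(kde(grid, data, h=0.3))   # try h = 0.05 / 0.3 / 2.0 to see the bandwidth effect
```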

SLIDE 37

Common Kernels

Gaussian kernel:

k(u) = (1/√(2π)) exp(−u²/2)

Problem: the kernel has infinite support, which requires a lot of computation.

Parzen window:

k(u) = 1 if |u| ≤ 1/2, and 0 otherwise

Problem: not very smooth results.

SLIDE 38

Common Kernels

Epanechnikov kernel:

k(u) = max(0, (3/4)(1 − u²))

Smoother, but finite support.

Problem with kernel methods: we have to select the kernel bandwidth h appropriately.

SLIDE 39

Gaussian KDE Example

[Figure: Gaussian KDE with three bandwidths: not smooth enough, about right, too smooth]

SLIDE 40

Again to Our Definition

p(x) ≈ K/(NV)

Kernel density estimation: fix V and determine K
  • Example: determine the number of data points K in a fixed hypercube

K-nearest neighbor: fix K and determine V
  • Example: increase the size of a sphere until K data points fall into the sphere
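A 1-d sketch of the "fix K, grow V" idea: the volume is the interval that just reaches the K-th nearest neighbor of the query point. Names and data are illustrative:

```python
import numpy as np

def knn_density(x_query, data, K):
    # kNN density estimate in 1-d: p(x) ≈ K / (N * V), V = length of the
    # interval [x - r, x + r] reaching the K-th nearest neighbor
    N = len(data)
    dists = np.sort(np.abs(x_query[:, None] - data[None, :]), axis=1)
    r = dists[:, K - 1]
    V = 2 * r
    return K / (N * V)

rng = np.random.default_rng(4)
data = rng.normal(size=500)
grid = np.linspace(-3, 3, 7)
print(knn_density(grid, data, K=20))
```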

SLIDE 41

K-Nearest Neighbors (kNN)

[Figure: kNN density estimates for three values of K: not smooth enough, about right, too smooth. Note: blue curve rescaled for visualization.]

SLIDE 42

K-Nearest Neighbors (kNN)

Bayesian classification:

P(Cj | x) = p(x | Cj) P(Cj) / p(x)

k-Nearest Neighbors classification: assume we have a dataset of N points, where Nj is the number of data points in class Cj and Σj Nj = N. To classify a point x, we draw a sphere centered at x that contains K points (from any class). Assume the sphere has volume V and contains Kj points of class Cj. Then

p(x) ≈ K/(NV),   p(x | Cj) ≈ Kj/(Nj V),   P(Cj) ≈ Nj/N

P(Cj | x) ≈ (Kj/(Nj V)) · (Nj/N) · (NV/K) = Kj/K
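The Kj/K rule above in code, for 1-d features; the dataset and function name are illustrative:

```python
import numpy as np

def knn_classify(x_query, data, labels, K):
    # kNN classification via the K_j / K rule from the slide
    dists = np.abs(data - x_query)
    nearest = labels[np.argsort(dists)[:K]]   # labels of the K nearest points
    classes, counts = np.unique(nearest, return_counts=True)
    # counts / K are the empirical P(Cj | x) ≈ Kj / K for the classes present
    return classes[np.argmax(counts)], counts / K

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
labels = np.array([0] * 100 + [1] * 100)
print(knn_classify(1.5, data, labels, K=15))
```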

SLIDE 43

Bias-Variance Problem

Nonparametric probability density estimation:
  • Histograms: size of the bins? Too large: too smooth; too small: not smooth enough
  • Kernel density estimation: kernel bandwidth? h too large: too smooth; h too small: not smooth enough
  • K-nearest neighbor: number of neighbors? K too large: too smooth; K too small: not smooth enough

This is a general problem of many density estimation approaches, including parametric and mixture models.

SLIDE 44

Outline (repeated; entering Section 4: Mixture Models)

SLIDE 45

Mixture Models

Parametric models (Gaussian, neural networks, ...)
  • Good analytic properties
  • Simple, small memory requirements, fast

Nonparametric models (kernel density estimation, k-nearest neighbors, ...)
  • General
  • Large memory requirements, slow

Mixture models are a mix of parametric and nonparametric models.

SLIDE 46

Mixture of Gaussians (MoG)

Sum of individual Gaussian distributions.

SLIDE 47

Mixture of Gaussians

Sum of individual Gaussian distributions. In the limit (i.e., with many mixture components) this can approximate every (smooth) density:

p(x) = Σ_{j=1}^M p(x | j) p(j)

SLIDE 48

Mixture of Gaussians

p(x) = Σ_{j=1}^M p(x | j) p(j)

p(x | j) = N(x | µj, σj) = (1/√(2πσj²)) exp(−(x − µj)²/(2σj²))

p(j) = πj with 0 ≤ πj ≤ 1, Σ_{j=1}^M πj = 1

Remarks
  • The mixture density integrates to 1: ∫ p(x) dx = 1
  • The mixture parameters are θ = {µ1, σ1, π1, . . . , µM, σM, πM}
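Evaluating this mixture density is one line per component; a sketch with hypothetical parameters:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def mog_density(x, mus, sigmas, pis):
    # p(x) = sum_j pi_j * N(x | mu_j, sigma_j), evaluated pointwise
    return sum(pi * gaussian_pdf(x, mu, s) for mu, s, pi in zip(mus, sigmas, pis))

x = np.linspace(-5, 8, 5)
print(mog_density(x, mus=[0.0, 4.0], sigmas=[1.0, 0.5], pis=[0.7, 0.3]))
```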

SLIDE 49

Mixture of Gaussians - MLE

Maximum (log-)likelihood estimation with a dataset of N i.i.d. points {x1, . . . , xN}:

L = log L(θ) = Σ_{n=1}^N log p(xn | θ)

∂L/∂µj = 0   ⇒   µj = (Σ_{n=1}^N p(j | xn) xn) / (Σ_{n=1}^N p(j | xn))

What is the problem with this approach? A circular dependency: p(j | xn) itself depends on µj, so there is no analytical solution!

SLIDE 50

Mixture of Gaussians - MLE Gradient Ascent

Maximum (log-)likelihood estimation with a dataset of N i.i.d. points {x1, . . . , xN}:

L = log L(θ) = Σ_{n=1}^N log p(xn | θ),   ∂L/∂µj = 0

Gradient ascent:
  • Complex gradient (nonlinear, circular dependencies)
  • Optimization of one Gaussian component depends on all other components

SLIDE 51

Mixture of Gaussians - Different Strategy

Unobserved := hidden or latent variables (j | x)

SLIDE 52

Mixture of Gaussians - Different Strategy

Suppose we knew the observed and unobserved dataset (also called the complete dataset). Then we can compute the maximum likelihood solution of components 1 and 2:

µ1 = (Σ_{n=1}^N p(1 | xn) xn) / (Σ_{n=1}^N p(1 | xn))

µ2 = (Σ_{n=1}^N p(2 | xn) xn) / (Σ_{n=1}^N p(2 | xn))

SLIDE 53

Mixture of Gaussians - Different Strategy

Suppose we knew the distributions. We can infer the unobserved data using Bayes' decision rule, namely we decide component 1 if

p(j = 1 | x) > p(j = 2 | x)

SLIDE 54

Mixture of Gaussians - Chicken and Egg Problem

We have a big problem at hand... we neither know the distribution nor the unobserved data! To break this loop, we need some estimate of the unobserved data j. Temporary solution: clustering (to be replaced soon).

SLIDE 55

Estimation using Clustering

Clustering with hard assignments: somehow assign mixture labels to each data point, then estimate each mixture component only from its own data.

SLIDE 56

Mixture of Gaussians

Suppose we had a guess about the distribution, but did not know the unobserved data. Compute the probability for each mixture component:

p(j = 1 | x) = p(x | 1) p(1) / p(x) = p(x | 1) π1 / Σ_{j=1}^M p(x | j) πj

p(j = 2 | x) = p(x | 2) p(2) / p(x) = p(x | 2) π2 / Σ_{j=1}^M p(x | j) πj

SLIDE 57

Expectation Maximization - Clustering

Clustering with soft assignments: this is the expectation step of the EM algorithm (shortly). We can determine the means by maximum likelihood estimation:

µj = (Σ_{n=1}^N p(j | xn) xn) / (Σ_{n=1}^N p(j | xn))

SLIDE 58

Expectation Maximization Algorithm

Algorithm: initialize with (random) means µ1, µ2, . . . , µM. While the stop condition is not met:
  • E-step: compute the posterior distribution p(j | xn) for each mixture component and for all data points
  • M-step: compute the new means as the weighted means of all data points:

µj = (Σ_{n=1}^N p(j | xn) xn) / (Σ_{n=1}^N p(j | xn))
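A sketch of exactly this means-only loop for a 1-d mixture, assuming fixed equal weights and a fixed shared variance (the slide's algorithm only updates the means); initialization, data, and names are my own:

```python
import numpy as np

def em_means_only(x, M, iters=50, sigma=1.0, seed=0):
    # means-only EM for a 1-d MoG with fixed equal variances and weights
    rng = np.random.default_rng(seed)
    mus = rng.choice(x, size=M, replace=False)        # random initialization
    for _ in range(iters):
        # E-step: responsibilities p(j | x_n), uniform pi_j and shared sigma
        logp = -(x[None, :] - mus[:, None])**2 / (2 * sigma**2)
        resp = np.exp(logp - logp.max(axis=0))
        resp /= resp.sum(axis=0)
        # M-step: weighted means
        mus = (resp * x[None, :]).sum(axis=1) / resp.sum(axis=1)
    return mus

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
print(em_means_only(x, M=2))
```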

SLIDE 59

Expectation Maximization

[Figure: EM iterations on example data]

SLIDE 60

Expectation Maximization (EM) Algorithm

A method for performing maximum likelihood estimation even when the data is incomplete (i.e., we only have access to the observed variables). Idea: if we have unknown values in our estimation problem (so-called hidden variables), we can use EM.

Assume:
  • Observed (incomplete) data: X = {x1, . . . , xN}
  • Unobserved (hidden) data: Y = {y1, . . . , yN}

In the case of Gaussian mixtures, the hidden data is the association of every data point with one of the mixture components.

SLIDE 61

Properties of EM

Incomplete (observed) data: X = {x1, . . . , xN}
Hidden (unobserved) data: Y = {y1, . . . , yN}
Complete data: Z = (X, Y)

Joint density: p(Z) = p(X, Y) = p(Y | X) p(X)

With parameters: p(Z | θ) = p(X, Y | θ) = p(Y | X, θ) p(X | θ)

In the case of Gaussian mixtures:
  • p(X | θ): likelihood of the mixture model
  • p(Y | X, θ): predictions of the mixture components

SLIDE 62

Properties of EM

Incomplete likelihood:

L(θ | X) = p(X | θ) = ∏_{n=1}^N p(xn | θ)

Complete likelihood:

L(θ | Z) = p(Z | θ) = p(X, Y | θ) = p(Y | X, θ) p(X | θ) = ∏_{n=1}^N p(yn | xn, θ) p(xn | θ)

SLIDE 63

EM Algorithm

We don't know Y, but if we have the current guess θ^{i−1} of the parameters θ, we can use it to predict Y. Formally, we compute the expected value of the (complete) log-likelihood given the data X and the current estimate of θ:

E_Y[ log p(X, Y | θ) | X, θ^{i−1} ] =: Q(θ, θ^{i−1})

Here X is fixed; Y is a random variable; θ is a variable; θ^{i−1} is the current estimate of the parameters (fixed).

SLIDE 64

Properties of the EM Algorithm

Maximize the expected complete log-likelihood:

Q(θ, θ^{i−1}) = E_Y[ log p(X, Y | θ) | X, θ^{i−1} ] = ∫ p(y | X, θ^{i−1}) log p(X, y | θ) dy

SLIDE 65

Properties of the EM Algorithm

Q(θ, θ^{i−1}) = ∫ p(y | X, θ^{i−1}) log p(X, y | θ) dy

E-step (expectation): compute p(y | X, θ^{i−1}) to be able to compute the expectation Q(θ, θ^{i−1})

M-step (maximization): maximize the expected value of the complete log-likelihood:

θ^i = argmax_θ Q(θ, θ^{i−1})

SLIDE 66

Formal Properties of the EM Algorithm

Main result from Dempster et al., "Maximum Likelihood from Incomplete Data via the EM Algorithm", 1977:

The expected complete log-likelihood of the i-th iteration is at least as good as that of the (i−1)-th iteration:

Q(θ^i, θ^{i−1}) ≥ Q(θ^{i−1}, θ^{i−1})

If this expectation is maximized w.r.t. θ^i, then it holds that:

L(θ^i | X) ≥ L(θ^{i−1} | X)

[Portraits: Dempster (1929-), Laird (1943-), Rubin (1942-)]

SLIDE 67

Formal Properties of the EM Algorithm

Consequences of the previous statements:
  • The incomplete log-likelihood increases in every iteration (or at least stays the same)
  • The incomplete log-likelihood is maximized (locally)

In practice:
  • The quality of the results depends on the initialization
  • If we initialize poorly, we may get stuck in poor local optima
  • EM relies on good initialization of the parameters

SLIDE 68

Special Case - Gaussian Mixtures

For mixtures of Gaussians there is a closed-form solution for this maximization. Look at the fully general case: also estimate the variances of the mixture components and the prior distribution over the mixture components.

θ^i = argmax_θ Q(θ, θ^{i−1})

SLIDE 69

EM for Gaussian Mixtures

Algorithm: initialize the parameters µ1, σ1, π1, . . . While the stop condition is not met:

E-step: compute the posterior distribution, also called the responsibility, for each mixture component and for all data points:

αnj = p(j | xn) = πj N(xn | µj, σj) / Σ_{i=1}^M πi N(xn | µi, σi)

M-step: compute the new parameters using weighted estimates:

µj^new = (1/Nj) Σ_{n=1}^N αnj xn   with Nj = Σ_{n=1}^N αnj

(σj^new)² = (1/Nj) Σ_{n=1}^N αnj (xn − µj^new)²

πj^new = Nj / N
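The E- and M-step formulas above transcribe directly to code. A 1-d sketch (synthetic data, my own function names):

```python
import numpy as np

def em_mog(x, M, iters=100, seed=0):
    # full EM for a 1-d Gaussian mixture: means, variances, and weights
    rng = np.random.default_rng(seed)
    N = len(x)
    mus = rng.choice(x, size=M, replace=False)
    sigmas = np.full(M, x.std())
    pis = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibilities alpha_nj, shape (N, M)
        dens = (np.exp(-(x[:, None] - mus)**2 / (2 * sigmas**2))
                / np.sqrt(2 * np.pi * sigmas**2))
        alpha = pis * dens
        alpha /= alpha.sum(axis=1, keepdims=True)
        # M-step: weighted updates from the slide
        Nj = alpha.sum(axis=0)
        mus = (alpha * x[:, None]).sum(axis=0) / Nj
        sigmas = np.sqrt((alpha * (x[:, None] - mus)**2).sum(axis=0) / Nj)
        pis = Nj / N
    return mus, sigmas, pis

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 2, 300)])
print(em_mog(x, M=2))
```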

SLIDE 70

Expectation Maximization

[Figure: EM iterations fitting a Gaussian mixture]

SLIDE 71

How Many Components?

How many mixture components do we need? More components will typically lead to a better likelihood. But are more components necessarily better? Not always, because of overfitting!

(Simple) automatic selection:

Find the number of components that maximizes the Akaike information criterion (AIC)

log p(X | θML) − K

where K is the number of parameters. Or find the number of components that maximizes the Bayesian information criterion (BIC)

log p(X | θML) − (1/2) K log N

where N is the number of data points.
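A sketch of this selection loop. It reuses em_mog from the EM sketch above and assumes each 1-d component contributes a mean, a standard deviation, and a weight (with one constraint, since the πj sum to 1); the data and scoring loop are illustrative:

```python
import numpy as np

def mog_loglik(x, mus, sigmas, pis):
    # log p(X | theta) for a 1-d Gaussian mixture
    dens = (np.exp(-(x[:, None] - mus)**2 / (2 * sigmas**2))
            / np.sqrt(2 * np.pi * sigmas**2))
    return np.log((pis * dens).sum(axis=1)).sum()

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 2, 300)])
for M in (1, 2, 3, 4):
    mus, sigmas, pis = em_mog(x, M)          # em_mog from the sketch after SLIDE 69
    ll = mog_loglik(x, mus, sigmas, pis)
    K = 3 * M - 1                            # free parameters: M means, M sigmas, M-1 weights
    aic = ll - K
    bic = ll - 0.5 * K * np.log(len(x))
    print(M, round(aic, 1), round(bic, 1))
```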

SLIDE 72

Before We Move On...

It is important to understand:

Mixture models are much more general than mixtures of Gaussians
  • One can have mixtures of any parametric distribution, and even mixtures of different parametric distributions
  • Gaussian mixtures are only one of many possibilities, though by far the most common one

Expectation maximization is not just for fitting mixtures of Gaussians
  • One can fit other mixture models with EM
  • EM is still more general, in that it applies to many other hidden-variable models

SLIDE 73

Outline (repeated; entering Section 5: Wrap-Up)

SLIDE 74

Wrap-Up

You now know:
  • The difference between parametric and non-parametric models
  • More about the likelihood function and how to derive the maximum likelihood estimators for the Gaussian distribution
  • What Bayesian estimation is
  • Different non-parametric models (histograms, kernel density estimation, and k-nearest neighbors)
  • What mixture models are
  • What the Expectation-Maximization idea and algorithm are

SLIDE 75

Self-Test Questions

  • Where do we get the probability of data from?
  • What are parametric methods and how do we obtain their parameters?
  • How many parameters do non-parametric methods have?
  • What are mixture models?
  • Should gradient methods be used for training mixture models?
  • How does the EM algorithm work?
  • What is the biggest problem of mixture models?

SLIDE 76

Homework

Reading assignment for the next lecture:
  • Clustering: Murphy ch. 25
  • Bias & Variance: Bishop ch. 3.2, Murphy ch. 6.4

SLIDE 77

References

EM standard reference:
A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, Vol. 39, 1977.

EM tutorial:
Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", TR-97-021, ICSI, U.C. Berkeley, CA, USA.

Modern interpretation:
R.M. Neal and G.E. Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants", in Learning in Graphical Models, M.I. Jordan (editor).