Applied Machine Learning
Maximum Likelihood and Bayesian Reasoning
Siamak Ravanbakhsh
COMP 551 (fall 2020)
Objectives: understand what it means to learn a probabilistic model of the data, using the maximum likelihood principle and using Bayesian inference.
a thumbtack's head/tail outcome has a Bernoulli distribution

Bernoulli(x∣θ) = θ^x (1 − θ)^(1−x),   x ∈ {0, 1}

this is our probabilistic model of some head/tail IID data D = {0, 0, 1, 1, 0, 0, 1, 0, 0, 1}

Objective: learn the model parameter θ

since we are only interested in the counts, we can also use the Binomial distribution

Binomial(N_h ∣ N, θ) = (N choose N_h) θ^(N_h) (1 − θ)^(N−N_h)

where N = ∣D∣ is the number of samples, N_h = Σ_{x∈D} x is the number of heads, and N_t = N − N_h the number of tails
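As a sketch of this model, the two densities can be written directly from the formulas above (the function names are my own, not from the slides):

```python
# Bernoulli and Binomial models of the thumbtack data.
from math import comb

def bernoulli(x, theta):
    # Bernoulli(x|theta) = theta^x * (1 - theta)^(1 - x), x in {0, 1}
    return theta**x * (1 - theta)**(1 - x)

def binomial(n_h, n, theta):
    # Binomial(n_h|n, theta) = C(n, n_h) * theta^n_h * (1 - theta)^(n - n_h)
    return comb(n, n_h) * theta**n_h * (1 - theta)**(n - n_h)

D = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]
N = len(D)          # 10 samples
N_h = sum(D)        # 4 heads
print(bernoulli(1, 0.4))      # 0.4
print(binomial(N_h, N, 0.4))  # ≈ 0.251: probability of exactly 4 heads in 10 flips
```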
Idea: find the parameter that maximizes the probability of observing D

L(θ; D) = ∏_{x∈D} Bernoulli(x∣θ) = θ^4 (1 − θ)^6

the maximizing θ is the max-likelihood assignment
note that the likelihood is a function of θ and is not a probability density!
in general, the likelihood is

L(θ; D) = ∏_{x∈D} p(x; θ)

using a product here creates extreme values: for 100 samples in our example, the likelihood shrinks below 1e-30

the log-likelihood has the same maximum but is well-behaved

ℓ(θ; D) = log L(θ; D) = Σ_{x∈D} log p(x; θ)
how do we find the max-likelihood parameter?

θ* = arg max_θ ℓ(θ; D)

for some simple models we can get a closed-form solution; for complex models we need to use numerical optimization
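The underflow problem can be seen directly. The sketch below uses a made-up 100-sample dataset with the same 0.4 head ratio as our example:

```python
# Why we prefer the log-likelihood: with 100 samples the raw likelihood
# underflows toward zero, while its log stays well-scaled.
import math

D = [0] * 60 + [1] * 40   # 100 made-up samples, 40 heads
theta = 0.4

L = 1.0
for x in D:
    L *= theta**x * (1 - theta)**(1 - x)
print(L)                  # ~6e-30: tiny, prone to underflow

ll = sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in D)
print(ll)                 # ~-67.3: a moderate negative number
assert math.isclose(math.log(L), ll)
```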
COMP 551 | Fall 2020
max-likelihood for the Bernoulli example

ℓ(θ; D) = log L(θ; D) = Σ_{x∈D} log Bernoulli(x; θ) = Σ_{x∈D} ( x log θ + (1 − x) log(1 − θ) )

idea: set the derivative to zero and solve for θ

∂/∂θ ℓ(θ; D) = Σ_{x∈D} ( x/θ − (1 − x)/(1 − θ) ) = 0

which gives

θ_MLE = Σ_{x∈D} x / ∣D∣

i.e., simply the portion of heads in our dataset
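The closed form can be checked against a brute-force search; the grid resolution below is an arbitrary choice of mine:

```python
# Check the closed-form Bernoulli MLE against a grid search over theta.
import math

D = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]

def log_lik(theta):
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in D)

theta_mle = sum(D) / len(D)            # closed form: fraction of heads = 0.4

grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_lik)    # numerical argmax over (0, 1)
print(theta_mle, theta_grid)           # both 0.4
```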
the max-likelihood estimate does not reflect our uncertainty: e.g., for both 1/5 heads and 1000/5000 heads, θ* = .2

in the Bayesian approach we maintain a distribution over parameters: a prior p(θ), which after observing D we update to the posterior p(θ∣D)

how to do this update? using Bayes rule:

p(θ∣D) = p(θ) p(D∣θ) / p(D)

here p(D∣θ) is the likelihood of the data (previously denoted by L(θ; D)) and the evidence p(D) = ∫ p(θ) p(D∣θ) dθ is a normalization constant
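The evidence can be approximated numerically; the sketch below assumes a uniform prior p(θ) = 1 (my choice for illustration) and a simple midpoint Riemann sum:

```python
# Numerically approximate the evidence p(D) = ∫ p(θ) p(D|θ) dθ
# for our coin data under a uniform prior.
D = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]
N_h, N_t = sum(D), len(D) - sum(D)    # 4 heads, 6 tails

def lik(theta):
    return theta**N_h * (1 - theta)**N_t

n = 10_000
p_D = sum(lik((i + 0.5) / n) for i in range(n)) / n
print(p_D)  # ≈ 1/2310 ≈ 4.33e-4, the Beta function B(N_h+1, N_t+1)
```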
in our running example, we know the form of the likelihood:

p(D∣θ) = ∏_{x∈D} Bernoulli(x; θ) = θ^(N_h) (1 − θ)^(N_t)

what should the prior p(θ) and the posterior p(θ∣D) be?

we want the prior and posterior to have the same form (so that we can easily update our belief with new observations); this gives us the following form

p(θ∣a, b) ∝ θ^a (1 − θ)^b

where ∝ means there is a normalization constant that does not depend on θ

a distribution of this form has a name, the Beta distribution; we say the Beta distribution is a conjugate prior to the Bernoulli likelihood
the Beta distribution has the following density

Beta(θ∣α, β) = Γ(α+β)/(Γ(α)Γ(β)) θ^(α−1) (1 − θ)^(β−1),   α, β > 0

where Γ(α+β)/(Γ(α)Γ(β)) is the normalization constant

Beta(θ∣α = β = 1) is uniform
the mean of the distribution is E[θ] = α/(α+β)
for α, β > 1 the distribution is unimodal; its mode is (α−1)/(α+β−2)
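The mean and mode formulas can be verified numerically; the α, β values below are my own example:

```python
# Verify the Beta mean and mode formulas by brute force.
from math import gamma

def beta_pdf(theta, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * theta**(a - 1) * (1 - theta)**(b - 1)

a, b = 5.0, 3.0
n = 10_000
xs = [(i + 0.5) / n for i in range(n)]
mean = sum(x * beta_pdf(x, a, b) for x in xs) / n   # midpoint integration of θ·pdf
mode = max(xs, key=lambda x: beta_pdf(x, a, b))     # grid argmax of the density

print(mean)  # ≈ a/(a+b) = 0.625
print(mode)  # ≈ (a-1)/(a+b-2) = 4/6 ≈ 0.667
```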
prior: p(θ) = Beta(θ∣α, β) ∝ θ^(α−1) (1 − θ)^(β−1)

likelihood: p(D∣θ) = θ^(N_h) (1 − θ)^(N_t)
(the product of Bernoulli likelihoods, equivalent to the Binomial likelihood)

posterior: p(θ∣D) = Beta(θ∣α + N_h, β + N_t) ∝ θ^(α+N_h−1) (1 − θ)^(β+N_t−1)

α and β are called pseudo-counts: their effect is similar to imaginary observations of α heads and β tails
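Conjugacy makes the update trivial: just add observed counts to the pseudo-counts. A minimal sketch (the prior hyperparameters are my own example):

```python
# Conjugate Beta-Bernoulli update: posterior = pseudo-counts + observed counts.
def update_beta(alpha, beta, data):
    n_h = sum(data)
    n_t = len(data) - n_h
    return alpha + n_h, beta + n_t   # posterior Beta(alpha + N_h, beta + N_t)

D = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]
post = update_beta(2.0, 2.0, D)      # prior Beta(2, 2)
print(post)                          # (6.0, 8.0): 2+4 heads, 2+6 tails

# updating one observation at a time gives the same result as one batch update
a, b = 2.0, 2.0
for x in D:
    a, b = update_beta(a, b, [x])
assert (a, b) == post
```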
as we increase the number of observations N = ∣D∣, the effect of the prior diminishes:
with few observations the prior has a high influence; with many observations the likelihood term dominates the posterior

example: starting from the prior Beta(θ∣5, 5), after observing H heads in N flips the posterior is Beta(θ∣5 + H, 5 + N − H)

[figure: plots of the posterior density as the number of observations grows]
posterior predictive: rather than using a single parameter in p(x∣θ), we now have a (posterior) distribution over parameters p(θ∣D), so we calculate the average prediction

p(x∣D) = ∫_θ p(θ∣D) p(x∣θ) dθ

for each possible θ, we weight the prediction by the posterior probability of that parameter being true

in our example:

p(x = 1∣D) = ∫_θ Bernoulli(x = 1∣θ) Beta(θ∣α + N_h, β + N_t) dθ
what is the probability that the next coin flip is heads? start from a Beta prior p(θ) = Beta(θ∣α, β); after observing N_h heads and N_t tails, the posterior is p(θ∣D) = Beta(θ∣α + N_h, β + N_t)

p(x = 1∣D) = ∫_θ θ Beta(θ∣α + N_h, β + N_t) dθ = (α + N_h)/(α + β + N)

(this is the mean of the Beta distribution)

compare with the prediction of maximum likelihood: p(x = 1∣D) = N_h / N

Laplace smoothing: if we assume a uniform prior (α = β = 1), the posterior predictive is p(x = 1∣D) = (N_h + 1)/(N + 2)
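The difference matters most for extreme data; the tiny dataset below is my own illustrative choice:

```python
# Posterior predictive vs. the max-likelihood prediction.
def predictive(n_h, n, alpha, beta):
    # p(x=1|D) = (alpha + N_h) / (alpha + beta + N), the posterior Beta mean
    return (alpha + n_h) / (alpha + beta + n)

# extreme data: 0 heads out of 3 flips
n_h, n = 0, 3
print(n_h / n)                      # MLE prediction: 0.0 (heads deemed impossible!)
print(predictive(n_h, n, 1, 1))     # Laplace smoothing: (0+1)/(3+2) = 0.2
print(predictive(n_h, n, 10, 10))   # strong prior pulls the estimate toward 0.5
```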
with a strong prior we need many samples to really change the posterior; for the Beta distribution, α + β decides how strong the prior is
[figure: posterior predictive p(x = 1∣D) as a function of N for priors with different means α/(α+β) and different strengths α + β; as our dataset grows, our estimate approaches the true value. example from Koller & Friedman]
sometimes it is difficult to work with the posterior distribution over parameters; an alternative is to use the parameter with the highest posterior probability

MAP estimate: θ_MAP = arg max_θ p(θ∣D) = arg max_θ p(θ) p(D∣θ)

compare with the max-likelihood estimate θ_MLE = arg max_θ p(D∣θ)
(the only difference is the prior term)
example: for the posterior p(θ∣D) = Beta(θ∣α + N_h, β + N_t), the MAP estimate is the mode of the posterior

θ_MAP = (α + N_h − 1)/(α + β + N_h + N_t − 2)

compare with the MLE

θ_MLE = N_h/(N_h + N_t)

they are equal for the uniform prior α = β = 1
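Both closed forms fit in a few lines; the data and non-uniform prior below are my own illustrative choices:

```python
# MAP vs. MLE for the Beta-Bernoulli model.
def theta_mle(n_h, n_t):
    return n_h / (n_h + n_t)

def theta_map(n_h, n_t, alpha, beta):
    # mode of the posterior Beta(alpha + N_h, beta + N_t)
    return (alpha + n_h - 1) / (alpha + beta + n_h + n_t - 2)

n_h, n_t = 4, 6
print(theta_mle(n_h, n_t))           # 0.4
print(theta_map(n_h, n_t, 1, 1))     # 0.4  -- uniform prior: MAP equals MLE
print(theta_map(n_h, n_t, 5, 5))     # 8/18 ≈ 0.444 -- prior pulls toward 0.5
```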
what if we have more than two categories (e.g., a loaded die instead of a coin)? instead of Bernoulli we have the multinoulli or categorical distribution

Cat(x∣θ) = ∏_{k=1}^{K} θ_k^(I(x=k))      K = # categories

where θ belongs to the probability simplex; e.g., for K = 3: θ_1 + θ_2 + θ_3 = 1

log-likelihood:

ℓ(θ; D) = Σ_{x∈D} Σ_k I(x = k) log θ_k

similar to the binary case, the max-likelihood estimate is given by the data frequencies

θ_k^MLE = N_k / N
example: a categorical distribution with K = 8; the observed frequencies are the max-likelihood parameter estimates, e.g., θ_5^MLE = .149

to derive this, we need to solve

∂/∂θ_k ℓ(θ; D) = 0   subject to   Σ_k θ_k = 1
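As a sketch, the categorical MLE is just a frequency count; the K = 6 die rolls below are made-up data:

```python
# Categorical MLE: parameter estimates are the category frequencies.
from collections import Counter

K = 6
rolls = [1, 3, 3, 6, 2, 3, 5, 1, 6, 3]   # made-up loaded-die data
counts = Counter(rolls)
N = len(rolls)

theta_mle = {k: counts.get(k, 0) / N for k in range(1, K + 1)}
print(theta_mle)                          # e.g. theta_3 = 4/10 = 0.4
assert abs(sum(theta_mle.values()) - 1.0) < 1e-12   # stays on the simplex
```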
likelihood: p(D∣θ) = ∏_{x∈D} Cat(x∣θ)

the conjugate prior should be a distribution over the parameters of a categorical distribution, i.e., a distribution over the probability simplex Σ_k θ_k = 1

the Dirichlet distribution is a generalization of the Beta distribution to K categories; for K = 2 it reduces to the Beta distribution

Dir(θ∣α) = Γ(Σ_k α_k)/(∏_k Γ(α_k)) ∏_k θ_k^(α_k−1),   α_k > 0 ∀k

where Γ(Σ_k α_k)/(∏_k Γ(α_k)) is the normalization constant and α is a vector of pseudo-counts for the K categories (aka concentration parameters)

for α = [1, … , 1] we get the uniform distribution

[figure: density of Dir(θ∣[.2, .2, .2]) over the K = 3 simplex]
the Dirichlet distribution is a conjugate prior for the categorical distribution Cat(x∣θ) = ∏_k θ_k^(I(x=k)):

prior: p(θ) = Dir(θ∣α) ∝ ∏_k θ_k^(α_k−1)

likelihood: we observe N_1, … , N_K values from each category, so p(D∣θ) = ∏_k θ_k^(N_k)

posterior: p(θ∣D) = Dir(θ∣α + η) ∝ ∏_k θ_k^(N_k+α_k−1), where η = (N_1, … , N_K) is the vector of observed counts

again, we add the real counts to the pseudo-counts

posterior predictive: p(x = k∣D) = (α_k + N_k) / Σ_{k′} (α_{k′} + N_{k′})

MAP estimate: θ_k^MAP = (α_k + N_k − 1) / (Σ_{k′} (α_{k′} + N_{k′}) − K)
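A minimal sketch of the Dirichlet-categorical update; the concentration parameters and counts below are my own example:

```python
# Dirichlet-categorical update: add observed counts to the pseudo-counts.
def dirichlet_update(alpha, counts):
    return [a + n for a, n in zip(alpha, counts)]

def predictive(alpha_post):
    # p(x=k|D) = (alpha_k + N_k) / sum over categories
    s = sum(alpha_post)
    return [a / s for a in alpha_post]

def theta_map(alpha_post):
    # mode of the Dirichlet posterior
    K = len(alpha_post)
    s = sum(alpha_post)
    return [(a - 1) / (s - K) for a in alpha_post]

alpha = [1, 1, 1]             # uniform prior over the K = 3 simplex
counts = [2, 5, 3]            # observed N_1, N_2, N_3
post = dirichlet_update(alpha, counts)
print(post)                   # [3, 6, 4]
print(predictive(post))       # [3/13, 6/13, 4/13]
print(theta_map(post))        # [0.2, 0.5, 0.3] == the MLE frequencies (uniform prior)
```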
summary

in ML we often build a probabilistic model of the data p(x; θ)

learning a good model could mean maximizing the likelihood of the data: max_θ log p(D∣θ)
sometimes there is a closed-form solution; for more complex p, we use numerical methods

an alternative is a Bayesian approach: maintain a distribution over model parameters
the prior p(θ) can specify our prior knowledge
we can use Bayes rule to update our belief p(θ∣D) after new observations
we can make predictions using the posterior predictive p(x∣D)
this can be computationally expensive (though not in our examples so far)

a middle path is the MAP estimate max_θ log p(D∣θ)p(θ): it models our prior belief but uses a single point estimate, picking the model with the highest posterior probability