Introduction to Machine Learning CMU-10701: 2. MLE, MAP, Bayes classification
SLIDE 1

Introduction to Machine Learning CMU-10701

  • 2. MLE, MAP, Bayes classification

Barnabás Póczos & Aarti Singh, Spring 2014

SLIDE 2

Administration


http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/index.html

  • Blackboard manager & peer grading: Dani
  • Webpage manager and Autolab: Pulkit
  • Camera man: Pengtao
  • Homework manager: Jit
  • Piazza manager: Prashant
  • Recitation: Wean 7500, 6pm-7pm, on Wednesdays

SLIDE 3

Outline

Theory:

Probabilities:

  • Dependence, Independence, Conditional Independence

Parameter estimation:

  • Maximum Likelihood Estimation (MLE)
  • Maximum a posteriori (MAP)

Bayes rule

  • Naïve Bayes Classifier

Application:

Naive Bayes Classifier for


  • Spam filtering
  • “Mind reading” = fMRI data processing
SLIDE 4

Independence

SLIDE 5

Independence


X and Y don’t contain information about each other: observing Y doesn’t help predict X, and observing X doesn’t help predict Y.

Examples:

  • Independent: winning on roulette this week and next week.
  • Dependent: Russian roulette.

Independent random variables: P(X, Y) = P(X) P(Y)

SLIDE 6

Dependent / Independent

[Figure: scatter plots of (X, Y) pairs, one with independent X, Y and one with dependent X, Y]


SLIDE 7

Conditionally Independent


Conditionally independent: knowing Z makes X and Y independent.

Examples:

  • Dependent: shoe size and reading skills. Conditionally independent: shoe size and reading skills given … age.
  • Storks deliver babies: a highly statistically significant correlation exists between stork populations and human birth rates across Europe.

SLIDE 8

Conditionally Independent

London taxi drivers: a survey pointed out a positive and significant correlation between the number of accidents and wearing coats. They concluded that coats could hinder the movements of drivers and be the cause of accidents. A new law was prepared to prohibit drivers from wearing coats when driving. Finally, another study pointed out that people wear coats when it rains…

SLIDE 9

Correlation ≠ Causation

xkcd.com


SLIDE 10

Conditional Independence

Formally: X is conditionally independent of Y given Z:

P(X | Y, Z) = P(X | Z)

Equivalent to:

P(X, Y | Z) = P(X | Z) P(Y | Z)


Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Note: this does NOT mean Thunder is independent of Rain. But given Lightning, knowing Rain doesn’t give more info about Thunder.

SLIDE 11

Conditional vs. Marginal Independence

C calls A and B separately and tells them a number n ∈ {1,…,10}. Due to noise in the phone, A and B each imperfectly (and independently) draw a conclusion about what the number was. A thinks the number was na and B thinks it was nb.

Are na and nb marginally independent?
  • No; we expect e.g. P(na = 1 | nb = 1) > P(na = 1)

Are na and nb conditionally independent given n?
  • Yes, because if we know the true number, the outcomes na and nb are purely determined by the noise in each phone: P(na = 1 | nb = 1, n = 2) = P(na = 1 | n = 2)

[Figure: graphical model in which n is a parent of both na and nb]
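To make the contrast concrete, here is a small simulation sketch (the code, including the flip_prob noise model, is my own illustration, not from the slides): the marginal probabilities differ, while the conditional ones match.

```python
import random

random.seed(0)

def noisy_hearing(n, flip_prob=0.3):
    # With probability flip_prob the listener mishears a uniform random number.
    return random.randint(1, 10) if random.random() < flip_prob else n

samples = []
for _ in range(200_000):
    n = random.randint(1, 10)  # the number C announces
    samples.append((n, noisy_hearing(n), noisy_hearing(n)))  # (n, na, nb)

# Marginal check: P(na=1 | nb=1) vs P(na=1) -- these differ (dependence).
p_na1 = sum(na == 1 for _, na, _ in samples) / len(samples)
given_nb1 = [na for _, na, nb in samples if nb == 1]
p_na1_nb1 = sum(na == 1 for na in given_nb1) / len(given_nb1)

# Conditional check: P(na=1 | nb=1, n=2) vs P(na=1 | n=2) -- these match.
given_n2 = [(na, nb) for n, na, nb in samples if n == 2]
p_na1_n2 = sum(na == 1 for na, _ in given_n2) / len(given_n2)
given_n2_nb1 = [na for na, nb in given_n2 if nb == 1]
p_na1_n2_nb1 = sum(na == 1 for na in given_n2_nb1) / len(given_n2_nb1)

print(f"P(na=1) = {p_na1:.3f},  P(na=1|nb=1) = {p_na1_nb1:.3f}")
print(f"P(na=1|n=2) = {p_na1_n2:.3f},  P(na=1|n=2,nb=1) = {p_na1_n2_nb1:.3f}")
```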

SLIDE 12

Parameter estimation: MLE, MAP

Estimating Probabilities


Our first machine learning problem:

SLIDE 13

Flipping a Coin

I have a coin; if I flip it, what’s the probability that it will fall with the head up?

Let us flip it a few times to estimate the probability: say we see 3 heads in 5 flips.

The estimated probability is 3/5, the “frequency of heads”.

SLIDE 14

Flipping a Coin

The estimated probability is 3/5, the “frequency of heads”.

Questions:

  (1) Why frequency of heads???
  (2) How good is this estimation???
  (3) Why is this a machine learning problem???

We are going to answer these questions.

SLIDE 15

Question (1)


Why frequency of heads???

  • Frequency of heads is exactly the maximum likelihood estimator for this problem
  • MLE has nice properties (interpretation, statistical guarantees, simple)

SLIDE 16

Maximum Likelihood Estimation


SLIDE 17

MLE for Bernoulli distribution

Data, D = the observed sequence of flips

P(Heads) = θ, P(Tails) = 1-θ

Flips are i.i.d.:

  – Independent events
  – Identically distributed according to Bernoulli distribution

MLE: Choose θ that maximizes the probability of observed data:

θ̂_MLE = argmax_θ P(D | θ)

SLIDE 18

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of observed data:

P(D | θ) = ∏ᵢ P(xᵢ | θ)          [independent draws]
         = θ^(αH) (1-θ)^(αT)      [identically distributed]

where αH is the number of heads and αT the number of tails.

SLIDE 19

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of observed data:

θ̂ = argmax_θ ln P(D | θ) = argmax_θ [ αH ln θ + αT ln(1-θ) ]

Setting the derivative to zero, αH/θ - αT/(1-θ) = 0, gives

θ̂ = αH / (αH + αT)

That’s exactly the “frequency of heads”.
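As a sanity check, a minimal sketch of this result (my own code, not from the course): the closed-form maximizer matches a brute-force search over the log-likelihood.

```python
import math

def bernoulli_log_likelihood(theta, heads, tails):
    # ln P(D | theta) = aH * ln(theta) + aT * ln(1 - theta)
    return heads * math.log(theta) + tails * math.log(1 - theta)

heads, tails = 3, 2
theta_mle = heads / (heads + tails)   # closed form: frequency of heads = 0.6

# Grid search over (0, 1) recovers the same maximizer.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_log_likelihood(t, heads, tails))
print(theta_mle, theta_grid)          # 0.6 0.6
```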

SLIDE 20

Question (2)


How good is this MLE estimation???

SLIDE 21

How many flips do I need?

I flipped the coin 5 times: 3 heads, 2 tails.

What if I flipped 30 heads and 20 tails?

  • Which estimator should we trust more?
  • The more the merrier???

SLIDE 22

Simple bound

Let θ* be the true parameter. For n = αH + αT flips and any ε > 0, Hoeffding’s inequality gives the simple bound:

P(|θ̂ - θ*| ≥ ε) ≤ 2 exp(-2nε²)

SLIDE 23

Probably Approximately Correct (PAC) Learning

I want to know the coin parameter θ within ε = 0.1 error, with probability at least 1-δ = 0.95. How many flips do I need?

Sample complexity: require 2 exp(-2nε²) ≤ δ, which gives n ≥ ln(2/δ) / (2ε²).
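A one-function sketch of this computation (my own code; the function name coin_sample_complexity is mine):

```python
import math

def coin_sample_complexity(eps, delta):
    # Smallest n with 2 * exp(-2 * n * eps**2) <= delta.
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(coin_sample_complexity(eps=0.1, delta=0.05))   # 185 flips
```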

SLIDE 24

Question (3)


Why is this a machine learning problem???

It fits the definition of learning: algorithms improve their performance (the accuracy of the predicted probability) at some task (predicting the probability of heads) with experience (the more coins we flip, the better we are).

SLIDE 25

What about continuous features?

Let us try Gaussians…

[Figure: Gaussian densities with mean µ = 0 and variance σ²; a number line of observed continuous values]

SLIDE 26

MLE for Gaussian mean and variance

Choose θ = (µ, σ²) that maximizes the probability of observed data:

P(D | θ) = ∏ᵢ P(xᵢ | θ)    [independent, identically distributed draws]

with P(x | µ, σ²) = (1 / √(2πσ²)) exp(-(x-µ)² / (2σ²)).

SLIDE 27

MLE for Gaussian mean and variance

The maximizers are:

µ̂_MLE = (1/n) Σᵢ xᵢ

σ̂²_MLE = (1/n) Σᵢ (xᵢ - µ̂)²

Note: the MLE for the variance of a Gaussian is biased [the expected result of the estimation is not the true parameter!]

Unbiased variance estimator:

σ̂²_unbiased = (1/(n-1)) Σᵢ (xᵢ - µ̂)²
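A minimal sketch of both estimators (my own code; the sample values are made up):

```python
def gaussian_mle(xs):
    n = len(xs)
    mu = sum(xs) / n                                   # MLE of the mean
    var_mle = sum((x - mu) ** 2 for x in xs) / n       # biased: E = (n-1)/n * sigma^2
    var_unbiased = sum((x - mu) ** 2 for x in xs) / (n - 1)
    return mu, var_mle, var_unbiased

print(gaussian_mle([3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]))
# mu = 6.0, var_mle = 4.0, var_unbiased ~ 4.667
```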

SLIDE 28

What about prior knowledge? (MAP Estimation)

SLIDE 29

What about prior knowledge?

We know the coin is “close” to 50-50. What can we do now?

The Bayesian way…

Rather than estimating a single θ, we obtain a distribution over possible values of θ

[Figure: prior distribution concentrated near 50-50 before data; posterior distribution after data]

SLIDE 30

Prior distribution

What prior? What distribution do we want for a prior?

  • Represents expert knowledge (philosophical approach)
  • Simple posterior form (engineer’s approach)

Uninformative priors:

  • Uniform distribution

Conjugate priors:

  • Closed-form representation of posterior
  • P(θ) and P(θ|D) have the same form


SLIDE 31

Bayes Rule


In order to proceed we will need Bayes rule.

SLIDE 32

Chain Rule & Bayes Rule


Chain rule: P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)

Bayes rule: P(X | Y) = P(Y | X) P(X) / P(Y)

Bayes rule is important for reverse conditioning.
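To make reverse conditioning concrete, here is a tiny worked example (the numbers and the spam setting are my own illustration, chosen to foreshadow the spam-filtering application; they are not from the slides):

```python
# Suppose P(spam) = 0.2, and a word appears with P(word|spam) = 0.5
# and P(word|ham) = 0.05. We observe the word; what is P(spam|word)?
p_spam = 0.2
p_word_given_spam = 0.5
p_word_given_ham = 0.05

# Law of total probability (chain rule) for the denominator:
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes rule: P(spam|word) = P(word|spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)   # ~0.714
```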

SLIDE 33

Bayesian Learning

  • Use Bayes rule:  P(θ | D) = P(D | θ) P(θ) / P(D)
  • Or equivalently:  P(θ | D) ∝ P(D | θ) P(θ)

    posterior ∝ likelihood × prior

SLIDE 34

MLE vs. MAP

  • Maximum Likelihood estimation (MLE): choose the value that maximizes the probability of observed data

    θ̂_MLE = argmax_θ P(D | θ)

  • Maximum a posteriori (MAP) estimation: choose the value that is most probable given observed data and prior belief

    θ̂_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)

When is MAP the same as MLE? (When the prior P(θ) is uniform, the two coincide.)

SLIDE 35

MAP estimation for Binomial distribution

Coin flip problem: the likelihood is Binomial

P(D | θ) = C(n, αH) θ^(αH) (1-θ)^(αT)

If the prior is a Beta distribution,

P(θ) = θ^(βH-1) (1-θ)^(βT-1) / B(βH, βT)

⇒ the posterior is a Beta distribution. Here B(βH, βT) is the Beta function, the normalizing constant of the Beta distribution.

SLIDE 36

MAP estimation for Binomial distribution

Likelihood is Binomial: P(D | θ) ∝ θ^(αH) (1-θ)^(αT)

Prior is Beta distribution: P(θ) ∝ θ^(βH-1) (1-θ)^(βT-1)

⇒ Posterior is Beta distribution: P(θ | D) ∝ θ^(αH+βH-1) (1-θ)^(αT+βT-1), i.e., Beta(αH+βH, αT+βT)

P(θ) and P(θ|D) have the same form! [Conjugate prior]
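A minimal sketch (my own code and variable names, not from the slides) of the resulting MAP estimate, the mode of the Beta posterior; a strong Beta(50, 50) prior encodes “close to 50-50”:

```python
def beta_binomial_map(heads, tails, b_heads, b_tails):
    # Posterior is Beta(b_heads + heads, b_tails + tails); its mode is the
    # MAP estimate (valid when both posterior parameters exceed 1).
    post_h, post_t = b_heads + heads, b_tails + tails
    return (post_h - 1) / (post_h + post_t - 2)

# 3 heads, 2 tails barely moves a Beta(50, 50) prior:
print(beta_binomial_map(3, 2, 50, 50))       # ~0.505 (vs. the MLE 0.6)
# With more data the prior is washed out:
print(beta_binomial_map(300, 200, 50, 50))   # ~0.584, moving toward 0.6
```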

SLIDE 37

Beta distribution

More concentrated as values of α, β increase


SLIDE 38

Beta conjugate prior

As n = αH + αT increases (as we get more samples), the effect of the prior is “washed out”.


SLIDE 39

From Binomial to Multinomial

Example: dice roll problem (6 outcomes instead of 2)

Likelihood is Multinomial(θ = {θ1, θ2, … , θk})

For the Multinomial, the conjugate prior is the Dirichlet distribution: if the prior is a Dirichlet distribution, then the posterior is also a Dirichlet distribution.

http://en.wikipedia.org/wiki/Dirichlet_distribution

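A sketch of how little machinery this conjugacy needs (my own code; the prior and counts are made up for illustration): the Dirichlet update just adds the observed counts to the prior parameters, exactly as in the Beta-Binomial case.

```python
def dirichlet_posterior(prior_alphas, counts):
    # Prior Dirichlet(alpha_1..alpha_k) plus counts n_1..n_k
    # gives posterior Dirichlet(alpha_i + n_i).
    return [a + n for a, n in zip(prior_alphas, counts)]

prior = [2, 2, 2, 2, 2, 2]          # mild belief that the die is fair
counts = [10, 8, 12, 9, 11, 10]     # observed rolls of each face
posterior = dirichlet_posterior(prior, counts)

# Posterior mean for each face: (alpha_i + n_i) / sum_j (alpha_j + n_j)
total = sum(posterior)
print([round(a / total, 3) for a in posterior])
```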

SLIDE 40

Conjugate prior for Gaussian?


  • Conjugate prior on the mean: Gaussian
  • Conjugate prior on the covariance matrix: Inverse Wishart
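A minimal sketch (my own code) for the simplest case, a Gaussian prior on the mean with the variance σ² assumed known: the posterior mean interpolates between the prior mean and the sample mean.

```python
def gaussian_posterior_mean(xs, sigma2, mu0, s0_2):
    # Prior N(mu0, s0_2) on the mean, data x_1..x_n with known variance sigma2.
    # Posterior is Gaussian with the standard precision-weighted combination.
    n = len(xs)
    xbar = sum(xs) / n
    post_mean = (sigma2 * mu0 + n * s0_2 * xbar) / (sigma2 + n * s0_2)
    post_var = (sigma2 * s0_2) / (sigma2 + n * s0_2)
    return post_mean, post_var

print(gaussian_posterior_mean([4.1, 5.3, 4.8, 5.0], sigma2=1.0, mu0=0.0, s0_2=10.0))
# posterior mean ~4.68 (pulled slightly toward the prior mean 0), variance ~0.24
```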

SLIDE 41

Bayesians vs. Frequentists

The criticisms the two camps trade:

  • “You are no good when the sample is small!”
  • “You give a different answer for different priors!”