Lecture 8:
− Maximum Likelihood Estimation (MLE) (cont’d.)
− Maximum a posteriori (MAP) estimation
− Naïve Bayes Classifier
Aykut Erdem
March 2016 Hacettepe University
Last time… Flipping a Coin
I have a coin; if I flip it, what's the probability that it will fall with the head up?

Let us flip it a few times to estimate the probability.

The estimated probability is 3/5: the "frequency of heads".

slide by Barnabás Póczos & Alex Smola

Last time… Flipping a Coin
Questions:
(1) Why frequency of heads??? (2) How good is this estimation??? (3) Why is this a machine learning problem???
We are going to answer these questions
slide by Barnabás Póczos & Alex Smola

Question (1)

Why frequency of heads???
Because it is the maximum likelihood estimator for this problem
(clear interpretation, statistical guarantees, simple to compute).

slide by Barnabás Póczos & Alex Smola

MLE for Bernoulli distribution
Flips are i.i.d.:
– Independent events
– Identically distributed according to a Bernoulli distribution

Data: D = (x1, …, xn), a sequence of flips with αH heads and αT tails
P(Heads) = θ, P(Tails) = 1 − θ

MLE: Choose θ that maximizes the probability of the observed data:
P(D | θ) = θ^αH (1 − θ)^αT

slide by Barnabás Póczos & Alex Smola

Maximum Likelihood Estimation
MLE: Choose θ that maximizes the probability of the observed data
(independent draws, identically distributed):

θ̂_MLE = argmax_θ P(D | θ) = argmax_θ θ^αH (1 − θ)^αT

It is easier to maximize the log-likelihood:
log P(D | θ) = αH log θ + αT log(1 − θ)

Setting the derivative to zero:
αH/θ − αT/(1 − θ) = 0  ⇒  θ̂_MLE = αH / (αH + αT)

That's exactly the "Frequency of heads"

slide by Barnabás Póczos & Alex Smola
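A minimal numerical sketch of the MLE above (not from the original slides): simulate i.i.d. Bernoulli coin flips and check that the closed-form "frequency of heads" matches a brute-force maximization of the log-likelihood. The true parameter value and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.6                       # assumed "true" P(Heads), for illustration only
flips = rng.random(50) < theta_true    # 50 i.i.d. Bernoulli draws (True = heads)

alpha_H, alpha_T = flips.sum(), (~flips).sum()

def log_likelihood(theta):
    # log P(D | theta) = alpha_H log(theta) + alpha_T log(1 - theta)
    return alpha_H * np.log(theta) + alpha_T * np.log(1.0 - theta)

# Closed-form MLE: frequency of heads
theta_mle = alpha_H / (alpha_H + alpha_T)

# Sanity check: the closed form matches a grid search over theta
grid = np.linspace(0.01, 0.99, 981)
theta_grid = grid[np.argmax(log_likelihood(grid))]

print(f"alpha_H={alpha_H}, alpha_T={alpha_T}")
print(f"MLE (frequency of heads): {theta_mle:.3f}, grid-search argmax: {theta_grid:.3f}")
```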
Question (2)
How many flips do I need?
I flipped the coin 5 times: 3 heads, 2 tails. What if I had flipped 30 heads and 20 tails? Both give the same estimate, θ̂ = 3/5, but intuitively the larger sample should be more reliable.
Simple bound
Let θ* be the true parameter. For n = αH + αT flips and any ε > 0,
Hoeffding's inequality gives:

P(|θ̂ − θ*| ≥ ε) ≤ 2 e^(−2nε²)

slide by Barnabás Póczos & Alex Smola

Probably Approximately Correct (PAC) Learning

I want to know the coin parameter θ within ε = 0.1 error, with probability at least 1 − δ = 0.95. How many flips do I need?

Sample complexity: require 2 e^(−2nε²) ≤ δ, i.e. n ≥ ln(2/δ) / (2ε²)
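A small sketch (assuming the two-sided Hoeffding bound stated above) that solves 2·exp(−2nε²) ≤ δ for n and reproduces the ε = 0.1, δ = 0.05 setting:

```python
import math

def sample_complexity(eps, delta):
    # Smallest n with 2*exp(-2*n*eps^2) <= delta, i.e. n >= ln(2/delta) / (2*eps^2)
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

print(sample_complexity(eps=0.1, delta=0.05))  # ~185 flips
```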
slide by Barnabás Póczos & Alex Smola

Question (3)

Why is this a machine learning problem???
It fits the definition of learning: we improve performance (the accuracy of the predicted prob.) at some task (predicting the probability of heads) with experience (the more flips we see, the better we are).
21 slide by Barnabás Póczos & Alex SmolaWhat about continuous features?
22µ µ µ µ=0 µ µ µ µ=0 σ σ σ σ2
2 2 2σ σ σ σ2
2 2 2Let us try Gaussians…
6 5 4 3 7 8 9
slide by Barnabás Póczos & Alex SmolaMLE for Gaussian mean and variance
Choose θ = (µ, σ²) that maximizes the probability of the observed data
(independent draws, identically distributed):

µ̂_MLE = (1/n) Σ_i x_i
σ̂²_MLE = (1/n) Σ_i (x_i − µ̂)²

Note: the MLE for the variance of a Gaussian is biased
[the expected result of the estimation is not the true parameter!]

Unbiased variance estimator:
σ̂²_unbiased = (1/(n−1)) Σ_i (x_i − µ̂)²

slide by Barnabás Póczos & Alex Smola
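A quick sketch of the two variance estimators above on synthetic data (the sample values and size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10)               # small synthetic sample

mu_mle = x.mean()
var_mle = ((x - mu_mle) ** 2).mean()                      # divides by n   -> biased MLE
var_unbiased = ((x - mu_mle) ** 2).sum() / (len(x) - 1)   # divides by n-1 -> unbiased

print(mu_mle, var_mle, var_unbiased)
# Equivalently: np.var(x, ddof=0) vs. np.var(x, ddof=1)
```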
slide by Barnabás Póczos & Alex SmolaWhat about prior knowledge? (MAP Estimation)
slide by Barnabás Póczos & Aarti Singh

What about prior knowledge?

We know the coin is "close" to 50-50. What can we do now?

The Bayesian way…
Rather than estimating a single θ, we obtain a distribution over possible values of θ.
[illustration: a prior concentrated around 50-50 before data; a sharper posterior after data]

slide by Barnabás Póczos & Aarti Singh

Prior distribution
What prior? What distribution do we want for a prior?

– Uninformative priors: e.g., the uniform distribution
– Conjugate priors: P(θ) and P(θ | D) have the same form, giving a closed-form representation of the posterior

Bayes Rule

In order to proceed we will need Bayes rule and the chain rule.

slide by Barnabás Póczos & Aarti Singh

Chain Rule & Bayes Rule

Chain rule: P(A, B) = P(A | B) P(B) = P(B | A) P(A)
Bayes rule: P(A | B) = P(B | A) P(A) / P(B)

Bayes rule is important for reverse conditioning.

slide by Barnabás Póczos & Aarti Singh

Bayesian Learning

D is the measured data; our goal is to estimate the parameter θ:
P(θ | D) ∝ P(D | θ) P(θ)     (posterior ∝ likelihood × prior)
slide by Barnabás Póczos & Aarti Singh

MAP estimation for Binomial distribution

Coin flip problem. The likelihood is Binomial:
P(D | θ) ∝ θ^αH (1 − θ)^αT

If the prior is a Beta distribution, P(θ) ∝ θ^(βH − 1) (1 − θ)^(βT − 1),
then the posterior is also a Beta distribution:
P(θ | D) ∝ θ^(αH + βH − 1) (1 − θ)^(αT + βT − 1)

P(θ) and P(θ | D) have the same form! [Conjugate prior]

slide by Barnabás Póczos & Aarti Singh

Beta distribution

Beta(α, β): P(θ) ∝ θ^(α − 1) (1 − θ)^(β − 1)
More concentrated as the values of α, β increase.

slide by Barnabás Póczos & Aarti Singh

Beta conjugate prior

As we get more samples, the effect of the prior is "washed out": as n = αH + αT increases, the posterior is dominated by the data.
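A minimal sketch of the Beta-Bernoulli update above: the posterior is Beta(αH + βH, αT + βT) and the MAP estimate is its mode. The prior hyperparameters below are illustrative assumptions encoding a belief that the coin is close to 50-50.

```python
def beta_bernoulli(alpha_H, alpha_T, beta_H=50, beta_T=50):
    """Posterior and point estimates for a coin with a Beta(beta_H, beta_T) prior.

    beta_H, beta_T: assumed prior pseudo-counts expressing 'close to 50-50'.
    """
    post_H, post_T = alpha_H + beta_H, alpha_T + beta_T
    mle = alpha_H / (alpha_H + alpha_T)
    # MAP = mode of Beta(post_H, post_T), defined for post_H, post_T > 1
    map_est = (post_H - 1) / (post_H + post_T - 2)
    return mle, map_est

print(beta_bernoulli(3, 2))      # few flips: MAP stays near 0.5
print(beta_bernoulli(300, 200))  # many flips: prior washed out, MAP -> 3/5
```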
slide by Barnabás Póczos & Aarti Singh

Han Solo and Bayesian Priors

C3PO: Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!
Han: Never tell me the odds!

https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors
MLE vs. MAP

When is MAP the same as MLE?

– Maximum Likelihood estimation (MLE): choose the value that maximizes the probability of the observed data
  θ̂_MLE = argmax_θ P(D | θ)
– Maximum a posteriori (MAP) estimation: choose the value that is most probable given the observed data and the prior belief
  θ̂_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)

MAP coincides with MLE when the prior P(θ) is uniform.

slide by Barnabás Póczos & Aarti Singh
From Binomial to Multinomial

Example: dice roll problem (6 outcomes instead of 2)
– Likelihood is Multinomial(θ = {θ1, θ2, …, θk})
– If the prior is a Dirichlet distribution, then the posterior is a Dirichlet distribution
– For the Multinomial, the conjugate prior is the Dirichlet distribution
http://en.wikipedia.org/wiki/Dirichlet_distribution

slide by Barnabás Póczos & Aarti Singh

Bayesians vs. Frequentists

The two camps' mutual criticisms: "You are no good when the sample is small" vs. "You give a different answer for different priors."
slide by Barnabás Póczos & Aarti Singh

Recap: What about prior knowledge? (MAP Estimation)

Recap: What about prior knowledge?

We know the coin is "close" to 50-50. What can we do now?

The Bayesian way…
Rather than estimating a single θ, we obtain a distribution over possible values of θ.
[illustration: a prior concentrated around 50-50 before data; a sharper posterior after data]

slide by Barnabás Póczos & Aarti Singh

Recap: Chain Rule & Bayes Rule

Chain rule: P(A, B) = P(A | B) P(B) = P(B | A) P(A)
Bayes rule: P(A | B) = P(B | A) P(A) / P(B)

slide by Barnabás Póczos & Aarti Singh

Recap: Bayesian Learning

D is the measured data; our goal is to estimate the parameter θ:
P(θ | D) ∝ P(D | θ) P(θ)     (posterior ∝ likelihood × prior)

Recap: MAP estimation for Binomial distribution

In the coin flip problem:
– The likelihood is Binomial: P(D | θ) ∝ θ^αH (1 − θ)^αT
– If the prior is Beta, then the posterior is a Beta distribution

slide by Barnabás Póczos & Aarti Singh

Recap: Beta conjugate prior

As we get more samples, the effect of the prior is "washed out" as n = αH + αT increases.
slide by Barnabás Póczos & Aarti Singh

Application of Bayes Rule

AIDS test (Bayes rule)

Data: the prior probability of having AIDS and the test's true- and false-positive rates.

Probability of having AIDS if the test is positive: by Bayes rule,
P(AIDS | positive) = P(positive | AIDS) P(AIDS) / P(positive)

Only 9%!...

slide by Barnabás Póczos & Aarti Singh

Improving the diagnosis

Use a weaker follow-up test!

AIDS test (Bayes rule)

Why can't we use Test 1 twice?
The outcomes of the same test are not independent; Test 1 and Test 2 are conditionally independent (by assumption).
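The exact rates are not readable from the extracted slide, so the numbers below are assumptions chosen to reproduce the "only 9%" figure: a 0.1% prevalence, a test that always fires when the disease is present, and a 1% false-positive rate.

```python
def posterior_positive(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(disease | test positive) via Bayes rule."""
    p_pos = (p_pos_given_disease * prior
             + p_pos_given_healthy * (1.0 - prior))
    return p_pos_given_disease * prior / p_pos

# Assumed numbers (illustrative only): prevalence 0.1%, sensitivity 100%, false positives 1%
print(posterior_positive(prior=0.001, p_pos_given_disease=1.0, p_pos_given_healthy=0.01))
# ~0.09: even a fairly accurate test yields only about a 9% posterior probability
```

A second, conditionally independent test multiplies in another likelihood ratio, which is why even a weaker follow-up test helps, while running the same test twice adds no new information.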
The Naïve Bayes Classifier

slide by Barnabás Póczos & Aarti Singh

Data for spam filtering
Naïve Bayes Assumption

Naïve Bayes assumption: features X1 and X2 are conditionally independent given the class label Y:
P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
More generally: P(X1, …, Xd | Y) = ∏_i P(Xi | Y)

slide by Barnabás Póczos & Aarti Singh

Naïve Bayes Assumption, Example

Task: Predict whether or not a picnic spot is enjoyable.
Training data: X = (X1, X2, X3, …, Xd) and label Y, with n rows.

How many parameters to estimate? (X is composed of d binary features, Y has K possible class labels)
– Without any assumption: (2^d − 1)K parameters
– With the Naïve Bayes assumption: (2 − 1)dK = dK parameters

slide by Barnabás Póczos & Aarti Singh

Naïve Bayes Classifier

Given:
– Class prior P(Y)
– d conditionally independent features X1, …, Xd given the class label Y
– For each feature Xi, the conditional likelihood P(Xi | Y)

Naïve Bayes decision rule:
ŷ = argmax_y P(Y = y) ∏_{i=1..d} P(Xi = xi | Y = y)
slide by Barnabás Póczos & Aarti Singh

Naïve Bayes Algorithm for discrete features

Training data: n d-dimensional discrete feature vectors + K class labels.

We need to estimate the class prior P(Y) and the likelihoods P(Xi | Y).
Estimate them with MLE (relative frequencies)!

Estimators:
– Class prior: P̂(Y = b) = #{j : y^j = b} / n
– Likelihood: P̂(Xi = a | Y = b) = #{j : x_i^j = a and y^j = b} / #{j : y^j = b}

NB prediction for test data x = (x1, …, xd):
ŷ = argmax_b P̂(Y = b) ∏_i P̂(Xi = xi | Y = b)
slide by Barnabás Póczos & Aarti Singh

Subtlety: Insufficient training data

For example, what if you never see a training example with Xi = a and Y = b?
Then P̂(Xi = a | Y = b) = 0, and the whole product in the decision rule becomes 0, no matter what the other features say. What now???

Naïve Bayes Algorithm for discrete features

Use your expert knowledge & apply prior distributions:
– Assume priors: add m "virtual" examples of each value of Xi with Y = b
– MAP estimate: P̂(Xi = a | Y = b) = (#{j : x_i^j = a and y^j = b} + m) / (#{j : y^j = b} + m · (number of values Xi can take))

Adding one virtual example per value (m = 1) is called Laplace smoothing.
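A compact sketch of the discrete Naïve Bayes training and prediction procedure above, with the m = 1 pseudo-count (Laplace smoothing); the toy data and feature names are assumptions for illustration.

```python
from collections import Counter
import math

def train_nb(X, y, m=1):
    """X: list of tuples of discrete feature values; y: list of class labels."""
    n, d = len(X), len(X[0])
    class_counts = Counter(y)
    feature_values = [set(x[i] for x in X) for i in range(d)]
    # counts[c][i][a] = #{j : x_i^j = a and y^j = c}
    counts = {c: [Counter() for _ in range(d)] for c in class_counts}
    for x, c in zip(X, y):
        for i, a in enumerate(x):
            counts[c][i][a] += 1

    def predict(x):
        best, best_score = None, -math.inf
        for c, nc in class_counts.items():
            score = math.log(nc / n)                       # log class prior
            for i, a in enumerate(x):
                num = counts[c][i][a] + m                  # Laplace smoothing
                den = nc + m * len(feature_values[i])
                score += math.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best

    return predict

# Toy picnic-style data (illustrative): (sunny?, warm?) -> enjoyable?
X = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "yes")]
y = ["enjoy", "enjoy", "avoid", "avoid"]
predict = train_nb(X, y)
print(predict(("yes", "yes")))   # -> "enjoy"
```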
slide by Barnabás Póczos & Aarti Singh

Case Study: Text Classification

Is this spam?

Positive or negative movie review?
[example review snippets: "…and some great plot twists", "…filmed", "…boxing scenes"]

slide by Dan Jurafsky

What is the subject of this article?
MeSH Subject Category Hierarchy: which category does a MEDLINE article belong to?

slide by Dan Jurafsky

Text Classification

Text Classification: definition
– Input: a document d and a fixed set of classes C = {c1, …, cJ}
– Output: a predicted class c ∈ C

Classification method: hand-coded rules
– Rules based on combinations of words or other features (e.g., spam: sender on a black list OR ("dollars" AND "have been selected"))
– Accuracy can be high if the rules are carefully refined by an expert
– But building and maintaining these rules is expensive

slide by Dan Jurafsky

Text Classification and Naive Bayes
What are the features X? The text! Let Xi represent the ith word in the document.

slide by Barnabás Póczos & Aarti Singh

NB for Text Classification

A problem: the support of P(X|Y) is huge!
– An article has at least 1000 words: X = {X1, …, X1000}
– Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster's Dictionary (or more): Xi ∈ {1, …, 50000}
⇒ K(50000^1000 − 1) parameters to estimate without the NB assumption…

slide by Barnabás Póczos & Aarti Singh

NB for Text Classification

The NB assumption helps a lot!!!
If P(Xi = xi | Y = y) is the probability of observing word xi at the ith position in a document on topic y,
⇒ 1000·K·(50000 − 1) parameters to estimate with the NB assumption.
The NB assumption helps, but that is still a lot of parameters to estimate.

slide by Barnabás Póczos & Aarti Singh

Bag of words model

Typical additional assumption: position in the document doesn't matter:
P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
– "Bag of words" model: the order of words on the page is ignored
– The document is just a bag of words: i.i.d. words
– Sounds really silly, but often works very well!

The probability of a document with words x1, x2, … on topic y:
P(x1, …, xn | y) = ∏_i P(w = xi | y)
⇒ K(50000 − 1) parameters to estimate.
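A tiny sketch of the bag-of-words reduction: the document is represented only by its word counts, so word order is discarded (the example sentence is an assumption).

```python
from collections import Counter

doc = "I love this movie I recommend it I love it"
bag = Counter(doc.lower().split())
print(bag)   # order is gone; only counts remain, e.g. {'i': 3, 'love': 2, 'it': 2, ...}
```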
slide by Barnabás Póczos & Aarti Singh

The bag of words representation

"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale… just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet."

slide by Dan Jurafsky

The bag of words representation

x love xxxxxxxxxxxxxxxx sweet xxxxxxx satirical xxxxxxxxxx xxxxxxxxxxx great xxxxxxx xxxxxxxxxxxxxxxxxxx fun xxxx xxxxxxxxxxxxx whimsical xxxx romantic xxxx laughing xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxx recommend xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx x several xxxxxxxxxxxxxxxxx xxxxx happy xxxxxxxxx again xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx

slide by Dan Jurafsky

The bag of words representation: using a subset of words

great 2, love 2, recommend 1, laugh 1, happy 1, …

slide by Dan Jurafsky
The bag of words representation
slide by Dan Jurafsky

Worked example: choosing a class

Doc  Words                                 Class
Training
1    Chinese Beijing Chinese               c
2    Chinese Chinese Shanghai              c
3    Chinese Macao                         c
4    Tokyo Japan Chinese                   j
Test
5    Chinese Chinese Chinese Tokyo Japan   ?

Estimators (with Laplace smoothing):
P̂(c) = Nc / N
P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|)

Priors: P(c) = 3/4, P(j) = 1/4

Conditional probabilities:
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c)   = (0+1) / (8+6) = 1/14
P(Japan|c)   = (0+1) / (8+6) = 1/14
P(Chinese|j) = (1+1) / (3+6) = 2/9
P(Tokyo|j)   = (1+1) / (3+6) = 2/9
P(Japan|j)   = (1+1) / (3+6) = 2/9

Choosing a class:
P(c|d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001
⇒ predict class c
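A short sketch reproducing the worked example above with the smoothed estimators P̂(c) = Nc/N and P̂(w|c) = (count(w,c)+1)/(count(c)+|V|):

```python
from collections import Counter

train = [("Chinese Beijing Chinese", "c"),
         ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"),
         ("Tokyo Japan Chinese", "j")]
test = "Chinese Chinese Chinese Tokyo Japan"

docs_per_class = Counter(c for _, c in train)
words_per_class = {c: Counter() for c in docs_per_class}
for text, c in train:
    words_per_class[c].update(text.split())
vocab = {w for text, _ in train for w in text.split()}   # |V| = 6

scores = {}
for c, n_c in docs_per_class.items():
    prior = n_c / len(train)                             # P(c) = Nc / N
    likelihood = 1.0
    for w in test.split():
        # Laplace-smoothed P(w | c) = (count(w,c)+1) / (count(c)+|V|)
        likelihood *= (words_per_class[c][w] + 1) / (sum(words_per_class[c].values()) + len(vocab))
    scores[c] = prior * likelihood

print(scores)   # {'c': ~0.0003, 'j': ~0.0001} -> predict class c
```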
slide by Dan Jurafsky

What if features are continuous?
E.g., image classification: Xi is the ith pixel.

Gaussian Naïve Bayes (GNB):
P(Xi = x | Y = k) = N(x; µ_ik, σ_ik²)
Different mean and variance for each class k and each pixel i.
Sometimes the variance is assumed to be independent of the class, of the pixel, or of both.

slide by Barnabás Póczos & Aarti Singh

Estimating parameters: Y discrete, Xi continuous

Maximum likelihood estimates:
µ̂_ik = mean of x_i^j over the training images j with class label y^j = k
σ̂²_ik = variance of x_i^j over the training images j with class label y^j = k
(x_i^j: ith pixel in the jth training image; k: class index)
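A minimal sketch of the Gaussian NB estimators above: per-class, per-feature means and variances, with synthetic data standing in for pixels (all values below are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "images": 100 examples, 5 features (pixels), 2 classes
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
X[y == 1] += 1.0              # shift class-1 means so there is something to learn

mu = np.stack([X[y == k].mean(axis=0) for k in (0, 1)])   # mu[k, i]
var = np.stack([X[y == k].var(axis=0) for k in (0, 1)])   # sigma^2[k, i] (MLE)

def log_gaussian(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def predict(x, prior=(0.5, 0.5)):
    # NB decision rule: argmax_k  log P(Y=k) + sum_i log N(x_i; mu_ki, sigma_ki^2)
    scores = [np.log(prior[k]) + log_gaussian(x, mu[k], var[k]).sum() for k in (0, 1)]
    return int(np.argmax(scores))

# Compare the prediction with the true label for one training example
print(predict(X[0]), y[0])
```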
slide by Barnabás Póczos & Aarti Singh

Twenty Newsgroups results

Naïve Bayes: 89% accuracy

slide by Barnabás Póczos & Aarti Singh

Case Study: Classifying Mental States
Example: GNB for classifying mental states [Mitchell et al.]

fMRI:
– ~1 mm resolution
– ~2 images per sec.
– 15,000 voxels/image
– non-invasive, safe
– measures Blood Oxygen Level Dependent (BOLD) response
– can track activation with precision and sensitivity

slide by Barnabás Póczos & Aarti Singh

Learned Naïve Bayes Models – Means for P(BrainActivity | WordCategory)

Pairwise classification accuracy: 78-99%, 12 participants
[figure: mean activation maps for "Tool words" vs. "Building words"] [Mitchell et al.]
slide by Barnabás Póczos & Aarti Singh

What you should know…

– Naïve Bayes classifier
– Text classification
– Gaussian NB