CMPT 726: Machine Learning
Simon Fraser University
Binomial Parameter Estimation
Outline
• Maximum Likelihood Estimation
• Smoothed Frequencies, Laplace Correction
• Bayesian Approach
  – Conjugate Prior
  – Uniform Prior
Coin Tossing
• Let's say you're given a coin, and you want to find out P(heads), the probability that if you flip it, it lands as "heads".
• Flip it a few times: H H T
• P(heads) = 2/3, no need for CMPT 726
• Hmm... is this rigorous? Does this make sense?
Coin Tossing – Model
• Bernoulli distribution: P(heads) = μ, P(tails) = 1 − μ
• Assume coin flips are independent and identically distributed (i.i.d.), i.e. all are separate samples from the Bernoulli distribution
• Given data $\mathcal{D} = \{x_1, \ldots, x_N\}$, with heads $x_n = 1$ and tails $x_n = 0$, the likelihood of the data is:
$$p(\mathcal{D}|\mu) = \prod_{n=1}^{N} p(x_n|\mu) = \prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n}$$
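As a sanity check, the Bernoulli likelihood above can be computed directly; a minimal sketch (the function name and encoding of flips as 0/1 are our own choices):

```python
def likelihood(mu, flips):
    # p(D|mu) = product over flips of mu^x * (1-mu)^(1-x)
    # flips: list of outcomes, 1 = heads, 0 = tails
    prod = 1.0
    for x in flips:
        prod *= mu ** x * (1 - mu) ** (1 - x)
    return prod

# H H T at mu = 2/3: (2/3) * (2/3) * (1/3)
```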
Maximum Likelihood Estimation
• Given $\mathcal{D}$ with h heads and t tails
• What should μ be?
• Maximum Likelihood Estimation (MLE): choose the μ which maximizes the likelihood of the data:
$$\mu_{ML} = \arg\max_{\mu} p(\mathcal{D}|\mu)$$
• Since ln(·) is monotone increasing:
$$\mu_{ML} = \arg\max_{\mu} \ln p(\mathcal{D}|\mu)$$
Maximum Likelihood Estimation
• Likelihood:
$$p(\mathcal{D}|\mu) = \prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n}$$
• Log-likelihood:
$$\ln p(\mathcal{D}|\mu) = \sum_{n=1}^{N} x_n \ln \mu + (1-x_n) \ln(1-\mu)$$
• Take derivative, set to 0:
$$\frac{d}{d\mu} \ln p(\mathcal{D}|\mu) = \sum_{n=1}^{N} x_n \frac{1}{\mu} - (1-x_n) \frac{1}{1-\mu} = \frac{1}{\mu} h - \frac{1}{1-\mu} t = 0$$
$$\Rightarrow \mu = \frac{h}{t+h}$$
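The derivation above can be checked numerically: the closed-form MLE h/(h+t) should agree with a brute-force maximization of the log-likelihood. A small sketch (function names are our own):

```python
import math

def log_likelihood(mu, flips):
    # sum over flips of x*ln(mu) + (1-x)*ln(1-mu); flips: 1 = heads, 0 = tails
    return sum(x * math.log(mu) + (1 - x) * math.log(1 - mu) for x in flips)

def mle(flips):
    # closed form from the derivative: mu_ML = h / (h + t)
    return sum(flips) / len(flips)

flips = [1, 1, 0]  # H H T
mu_hat = mle(flips)  # 2/3, matching the earlier informal estimate

# grid search over (0, 1) should land at (approximately) the same value
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda mu: log_likelihood(mu, flips))
```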
MLE Estimate: The 0 Problem
• h heads, t tails, n = h + t; the MLE is $\mu_{ML} = \frac{h}{n}$.
• Practical problems with using the MLE:
  – If h or t is 0, the resulting 0 probability may be multiplied with other nonzero probabilities, zeroing out the product (singularity).
  – If n = 0, there is no estimate at all.
  – This happens quite often in high-dimensional spaces.
Smoothing Frequency Estimates
• h heads, t tails, n = h + t.
• Prior probability estimate p.
• Equivalent sample size m.
• m-estimate: $\frac{h + mp}{n + m}$
• Interpretation: we started with a "virtual" sample of m tosses with mp heads.
• With p = 1/2, m = 2: Laplace correction $= \frac{h+1}{n+2}$
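The m-estimate is a one-line formula; a minimal sketch (the defaults p = 1/2, m = 2 give the Laplace correction):

```python
def m_estimate(h, t, p=0.5, m=2):
    # smoothed frequency: (h + m*p) / (n + m)
    # p = prior probability estimate, m = equivalent sample size
    # (interpreted as m virtual tosses, m*p of them heads)
    n = h + t
    return (h + m * p) / (n + m)

# H H T: plain MLE gives 2/3; Laplace correction gives (2+1)/(3+2) = 3/5
# n = 0: the estimate falls back to p, avoiding the "no estimate" problem
```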
Bayesian Approach
• Key idea: don't even try to pick a specific parameter value μ – use a probability distribution over parameter values.
• Learning = use Bayes' theorem to update the probability distribution.
• Prediction = model averaging.
Prior Distribution over Parameters
• Could use a uniform distribution.
  – Exercise: what does uniform over [0, 1] look like?
• What if we don't think the prior distribution is uniform?
• Use a conjugate prior.
  – Prior has parameters a, b – "hyperparameters".
  – Prior P(μ|a, b) = f(a, b) is some function of the hyperparameters.
  – Posterior has the same functional form f(a', b'), where a', b' are updated by Bayes' theorem.
Beta Distribution
• We will use the Beta distribution to express our prior knowledge about coins:
$$\mathrm{Beta}(\mu|a,b) = \underbrace{\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}}_{\text{normalization}} \mu^{a-1} (1-\mu)^{b-1}$$
• Parameters a and b control the shape of this distribution
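The Beta density can be evaluated directly with the standard-library gamma function; a minimal sketch:

```python
import math

def beta_pdf(mu, a, b):
    # Beta(mu | a, b) = Gamma(a+b) / (Gamma(a) * Gamma(b)) * mu^(a-1) * (1-mu)^(b-1)
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * mu ** (a - 1) * (1 - mu) ** (b - 1)

# a = b = 1 gives the uniform distribution on [0, 1] (density 1 everywhere);
# a > b puts more mass toward mu = 1, i.e. a heads-biased prior
```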
Posterior
$$P(\mu|\mathcal{D}) \propto P(\mathcal{D}|\mu)\, P(\mu)$$
$$\propto \underbrace{\prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n}}_{\text{likelihood}}\; \underbrace{\mu^{a-1} (1-\mu)^{b-1}}_{\text{prior}}$$
$$\propto \mu^{h} (1-\mu)^{t}\, \mu^{a-1} (1-\mu)^{b-1}$$
$$\propto \mu^{h+a-1} (1-\mu)^{t+b-1}$$
• Simple form for the posterior is due to the use of a conjugate prior
• Parameters a and b act as extra observations
• Note that as N = h + t → ∞, the prior is ignored
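Because the posterior is again a Beta distribution, the whole Bayesian update reduces to adding counts to the hyperparameters; a minimal sketch:

```python
def posterior_params(a, b, flips):
    # conjugate update: Beta(a, b) prior + Bernoulli data with h heads, t tails
    # -> Beta(a + h, b + t) posterior; a and b act as extra observations
    h = sum(flips)
    t = len(flips) - h
    return a + h, b + t

# prior Beta(2, 2), data H H T -> posterior Beta(4, 3)
a_post, b_post = posterior_params(2, 2, [1, 1, 0])
```

With no data the posterior equals the prior, and as the number of flips grows, h and t dominate a and b, which is why the prior is eventually ignored.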
Bayesian Point Estimation
• What if a Bayesian had to guess a single parameter value, given a distribution P over parameter values?
• Use the expected value $E_P(\mu)$.
  – E.g., for P = Beta(μ|a, b) we have $E_P(\mu) = \frac{a}{a+b}$.
• If we use a uniform prior P, what is $E_P(\mu|\mathcal{D})$?
• The Laplace correction!
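The claim above can be verified directly: a uniform prior is Beta(1, 1), the posterior after h heads and t tails is Beta(h + 1, t + 1), and its mean is the Laplace correction. A minimal sketch:

```python
def posterior_mean(a, b, h, t):
    # posterior is Beta(a + h, b + t); the mean of Beta(a', b') is a' / (a' + b')
    return (a + h) / (a + h + b + t)

# uniform prior Beta(1, 1): mean = (h + 1) / (h + t + 2)
# -- exactly the Laplace correction from the smoothing slide
```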