

SLIDE 1

MLE vs. MAP

Aarti Singh

Machine Learning 10-701/15-781 Sept 15, 2010

SLIDE 2

MLE vs. MAP


When is MAP same as MLE?

  • Maximum Likelihood estimation (MLE)
    – Choose value that maximizes the probability of observed data
  • Maximum a posteriori (MAP) estimation
    – Choose value that is most probable given observed data and prior belief
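
In symbols (standard notation, not copied from the slide; θ is the parameter, D the observed data), and note that MAP coincides with MLE when the prior P(θ) is uniform:

```latex
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \, P(D \mid \theta)
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \, P(\theta \mid D)
                            = \arg\max_{\theta} \, P(D \mid \theta)\, P(\theta)
```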

SLIDE 3

MAP using Conjugate Prior

  • Coin flip problem
  • Likelihood is ~ Binomial
  • If prior is Beta distribution, then posterior is Beta distribution
  • For Binomial, conjugate prior is Beta distribution
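
A sketch of the update being described, in standard Beta–Binomial notation (the symbols α_H, α_T for observed heads/tails and β_H, β_T for the prior's pseudo-counts are my labels, not necessarily the slide's):

```latex
P(\theta) = \mathrm{Beta}(\beta_H, \beta_T) \propto \theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1},
\qquad
P(D \mid \theta) \propto \theta^{\alpha_H}(1-\theta)^{\alpha_T}
```
```latex
P(\theta \mid D) \propto \theta^{\alpha_H + \beta_H - 1}(1-\theta)^{\alpha_T + \beta_T - 1}
\;\;\Rightarrow\;\;
\theta \mid D \sim \mathrm{Beta}(\alpha_H + \beta_H,\; \alpha_T + \beta_T)
```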


SLIDE 4

MLE vs. MAP


  • Beta prior equivalent to extra coin flips (regularization)
  • As n → ∞, prior is “forgotten”
  • But, for small sample size, prior is important!

What if we toss the coin too few times?

  • You say: Probability next toss is a head = 0
  • Billionaire says: You’re fired! …with prob 1
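
A minimal numeric sketch of this point (my own illustration, not from the slides; the Beta(2, 2) prior is an arbitrary choice): with only a few tosses the MLE can hit 0, while the prior keeps the MAP estimate away from the extremes.

```python
import numpy as np

# Hypothetical small sample: 3 tosses, all tails (0 = tail, 1 = head)
tosses = np.array([0, 0, 0])
n_heads, n_tails = tosses.sum(), len(tosses) - tosses.sum()

# MLE: fraction of heads observed
theta_mle = n_heads / len(tosses)          # = 0.0, "next toss is a head with prob 0"

# MAP with a Beta(2, 2) prior: acts like one extra pseudo-head and one extra pseudo-tail
beta_h, beta_t = 2, 2
theta_map = (n_heads + beta_h - 1) / (n_heads + beta_h + n_tails + beta_t - 2)

print(f"MLE estimate: {theta_mle:.3f}")    # 0.000
print(f"MAP estimate: {theta_map:.3f}")    # 0.200 -- prior pulls the estimate off 0
```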

SLIDE 5

Bayesians vs. Frequentists


  • Bayesian to frequentist: “You are no good when the sample is small.”
  • Frequentist to Bayesian: “You give a different answer for different priors.”

SLIDE 6

What about continuous variables?


  • Billionaire says: If I am measuring a continuous variable, what can you do for me?
  • You say: Let me tell you about Gaussians…

P(x | μ, σ²) = (1 / (σ√(2π))) · exp( −(x − μ)² / (2σ²) )  =  N(μ, σ²)

[Figure: Gaussian density curves with mean μ = 0 and two different variances σ²]

SLIDE 7

Gaussian distribution


Data, D = {x₁, x₂, …, xₙ}  (n observed sleep-hour measurements)

  • Parameters: μ – mean, σ² – variance
  • Sleep hrs are i.i.d.:
    – Independent events
    – Identically distributed according to Gaussian distribution

[Figure: observed sleep-hour data points marked on a number line, roughly 3–9 hrs]

SLIDE 8

Properties of Gaussians


  • Affine transformation (multiplying by a scalar and adding a constant)
    – X ~ N(μ, σ²)
    – Y = aX + b  ⟹  Y ~ N(aμ + b, a²σ²)

  • Sum of (independent) Gaussians
    – X ~ N(μ_X, σ²_X)
    – Y ~ N(μ_Y, σ²_Y)
    – Z = X + Y  ⟹  Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
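
A quick empirical check of both properties (illustrative only; the constants a, b and the means and variances are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Affine transformation: X ~ N(2, 3^2), so Y = aX + b should be N(a*2 + b, a^2 * 9)
a, b = 1.5, -4.0
X = rng.normal(loc=2.0, scale=3.0, size=n)
Y = a * X + b
print(Y.mean(), Y.var())        # ≈ -1.0 and 20.25

# Sum of independent Gaussians: means and variances add
X1 = rng.normal(loc=1.0, scale=2.0, size=n)   # N(1, 4)
X2 = rng.normal(loc=-3.0, scale=1.0, size=n)  # N(-3, 1)
Z = X1 + X2
print(Z.mean(), Z.var())        # ≈ -2.0 and 5.0
```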

SLIDE 9

MLE for Gaussian mean and variance
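
The derivations on this slide did not survive extraction; for D = {x₁, …, xₙ} drawn i.i.d. from N(μ, σ²), the standard MLE results they lead to are:

```latex
\hat{\mu}_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n}\bigl(x_i - \hat{\mu}_{\mathrm{MLE}}\bigr)^2
```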


SLIDE 10

MLE for Gaussian mean and variance

Note: MLE for the variance of a Gaussian is biased

  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator:  σ̂²_unbiased = (1 / (n − 1)) · Σᵢ (xᵢ − μ̂)²
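
A small numeric illustration of the bias (my own example): np.var uses the 1/n (MLE) form by default, and passing ddof=1 gives the 1/(n − 1) unbiased form.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_var, n = 7.0, 4.0, 5      # a tiny sample size exaggerates the bias

# Average each estimator over many repeated small samples
trials = 100_000
mle_est = np.empty(trials)
unbiased_est = np.empty(trials)
for t in range(trials):
    x = rng.normal(true_mu, np.sqrt(true_var), size=n)
    mle_est[t] = np.var(x)               # divides by n     -> biased low
    unbiased_est[t] = np.var(x, ddof=1)  # divides by n - 1 -> unbiased

print(mle_est.mean())       # ≈ (n-1)/n * true_var = 3.2
print(unbiased_est.mean())  # ≈ true_var = 4.0
```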


SLIDE 11

MAP for Gaussian mean and variance


  • Conjugate priors
    – Mean: Gaussian prior
    – Variance: Wishart distribution
  • Prior for mean:  P(μ) = N(η, λ²)

SLIDE 12

MAP for Gaussian Mean


MAP under Gauss-Wishart prior - Homework

(Assuming known variance σ²)
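
The slide's formula did not extract; assuming the prior P(μ) = N(η, λ²) from slide 11 and known variance σ², the standard posterior-mode result is:

```latex
\hat{\mu}_{\mathrm{MAP}} = \frac{\sigma^2\,\eta + \lambda^2 \sum_{i=1}^{n} x_i}{\sigma^2 + n\,\lambda^2}
```

As n grows, the data term dominates and this approaches the sample mean (the MLE), matching the "prior is forgotten" point from slide 4.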

SLIDE 13

What you should know…

  • Learning parametric distributions: form known, parameters unknown
    – Bernoulli (θ, probability of flip)
    – Gaussian (μ, mean and σ², variance)

  • MLE
  • MAP


SLIDE 14

What loss function are we minimizing?

  • Learning distributions/densities – Unsupervised learning
  • Task: Learn P(X; θ)  (know form of P, except θ)
  • Experience: D = {x₁, x₂, …, xₙ}
  • Performance: Negative log-likelihood loss
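
Written out (standard form, not recovered from the slide), minimizing the negative log-likelihood over D is exactly maximum likelihood estimation:

```latex
\hat{\theta} = \arg\min_{\theta} \; -\sum_{i=1}^{n} \log P(x_i ; \theta)
             = \arg\max_{\theta} \; \prod_{i=1}^{n} P(x_i ; \theta)
```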

SLIDE 15

Recitation Tomorrow!

  • Linear Algebra and Matlab
  • Strongly recommended!!
  • Place: NSH 1507 (Note: change from last time)
  • Time: 5-6 pm


Leman

SLIDE 16

Bayes Optimal Classifier

Aarti Singh

Machine Learning 10-701/15-781 Sept 15, 2010

SLIDE 17


Goal: Classification

  • Features, X
  • Labels, Y (e.g., Sports / Science / News)
  • Performance measure: Probability of Error

SLIDE 18

Optimal Classification

Optimal predictor: (Bayes classifier)


  • Even the optimal classifier makes mistakes: R(f*) > 0
  • Optimal classifier depends on the unknown distribution

Bayes risk
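
The equations on this slide did not extract; under 0/1 loss the standard definitions are:

```latex
f^*(x) = \arg\max_{y} \, P(Y = y \mid X = x),
\qquad
R(f) = P\bigl(f(X) \neq Y\bigr),
\qquad
R^* = R(f^*) \;\; \text{(Bayes risk)}
```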

SLIDE 19

Optimal Classifier

Bayes Rule: Optimal classifier:


Class conditional density Class prior
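
A reconstruction of the missing equations (standard forms): Bayes rule expresses the posterior in terms of the class-conditional density and the class prior named above, and the denominator can be dropped inside the argmax:

```latex
P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\,P(Y = y)}{P(X = x)}
\qquad\Rightarrow\qquad
f^*(x) = \arg\max_{y} \, P(X = x \mid Y = y)\,P(Y = y)
```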

SLIDE 20

Example Decision Boundaries

  • Gaussian class conditional densities (1-dimension/feature)


[Figure: two 1-D Gaussian class-conditional densities and the resulting decision boundary]
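
A small sketch of how such a boundary can be computed (illustrative parameters only, not the slide's): the boundary is where the prior-weighted class-conditional densities are equal.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# Hypothetical 1-D class-conditional densities and priors
mu0, sigma0, prior0 = 0.0, 1.0, 0.5     # class 0
mu1, sigma1, prior1 = 3.0, 1.0, 0.5     # class 1

# Decision boundary: x where prior0 * p(x|y=0) == prior1 * p(x|y=1)
g = lambda x: prior0 * norm.pdf(x, mu0, sigma0) - prior1 * norm.pdf(x, mu1, sigma1)
boundary = brentq(g, mu0, mu1)
print(boundary)   # 1.5 here: equal priors and variances -> midpoint of the means
```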

SLIDE 21

Example Decision Boundaries

  • Gaussian class conditional densities (2-dimensions/features)


[Figure: contours of two 2-D Gaussian class-conditional densities and the resulting decision boundary]

SLIDE 22

Learning the Optimal Classifier

Optimal classifier:


Need to know:
  – Prior: P(Y = y) for all y  (class prior)
  – Likelihood: P(X = x | Y = y) for all x, y  (class-conditional density)

SLIDE 23

Learning the Optimal Classifier

Task: Predict whether or not a picnic spot is enjoyable

Let’s learn P(Y|X) – how many parameters?


Training data: n rows of X = (X₁ X₂ X₃ … X_d) with label Y

  – Prior: P(Y = y) for all y  →  K − 1 parameters if K labels
  – Likelihood: P(X = x | Y = y) for all x, y  →  (2^d − 1)K parameters if d binary features
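
A worked instance of the count (my own numbers, not the slide's), with K = 2 labels and d = 30 binary features:

```latex
\underbrace{(K-1)}_{\text{prior}} + \underbrace{(2^d - 1)K}_{\text{likelihood}}
 = 1 + (2^{30}-1)\cdot 2 = 2^{31} - 1 \approx 2.1 \times 10^{9}\ \text{parameters}
```

The total simplifies to 2^d·K − 1, which is exactly the figure quoted on the next slide.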

SLIDE 24

Learning the Optimal Classifier

Task: Predict whether or not a picnic spot is enjoyable

Let’s learn P(Y|X) – how many parameters?


Training data: n rows of X = (X₁ X₂ X₃ … X_d) with label Y

  – Total: 2^d·K − 1 parameters (K classes, d binary features)
  – Need n >> 2^d·K − 1 training examples to learn all parameters