MLE/MAP, Matt Gormley, Lecture 20, Oct 29, 2018


SLIDE 1

MLE/MAP

1

10-601 Introduction to Machine Learning

Matt Gormley, Lecture 20, Oct 29, 2018

Machine Learning Department, School of Computer Science, Carnegie Mellon University
SLIDE 2

Q&A

9
SLIDE 3

PROBABILISTIC LEARNING

11
SLIDE 4

Probabilistic Learning

Function Approximation

Previously, we assumed that our output was generated using a deterministic target function: y = c*(x). Our goal was to learn a hypothesis h(x) that best approximates c*(x).

Probabilistic Learning

Today, we assume that our output is sampled from a conditional probability distribution: y ~ p*(y|x). Our goal is to learn a probability distribution p(y|x) that best approximates p*(y|x).

12
SLIDE 5

Robotic Farming

13

Classification (binary output):
  • Deterministic: Is this a picture of a wheat kernel?
  • Probabilistic: Is this plant drought resistant?

Regression (continuous output):
  • Deterministic: How many wheat kernels are in this picture?
  • Probabilistic: What will the yield of this plant be?
SLIDE 6

Oracles and Sampling

Whiteboard

– Sampling from common probability distributions

  • Bernoulli
  • Categorical
  • Uniform
  • Gaussian

– Pretending to be an Oracle (Regression)

  • Case 1: Deterministic outputs
  • Case 2: Probabilistic outputs

– Probabilistic Interpretation of Linear Regression (a code sketch follows after this list)

  • Adding Gaussian noise to linear function
  • Sampling from the noise model

– Pretending to be an Oracle (Classification)

  • Case 1: Deterministic labels
  • Case 2: Probabilistic outputs (Logistic Regression)
  • Case 3: Probabilistic outputs (Gaussian Naïve Bayes)
15
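Below is a minimal code sketch of the probabilistic-oracle view of linear regression referenced in the list above: sample y by adding Gaussian noise to a linear function of x. The weights w_true, bias b_true, and noise level sigma are illustrative assumptions, not values from the lecture.

```python
# Sketch: a data-generating "oracle" for probabilistic linear regression,
# y = w . x + b + epsilon with epsilon ~ Gaussian(0, sigma^2). Assumes NumPy.
import numpy as np

rng = np.random.default_rng(0)
w_true, b_true, sigma = np.array([2.0, -1.0]), 0.5, 0.1   # illustrative choices

def oracle_sample(x):
    """Return one noisy output y for the input vector x."""
    mean = w_true @ x + b_true              # deterministic part: linear function of x
    return mean + rng.normal(0.0, sigma)    # probabilistic part: Gaussian noise

X = rng.uniform(-1, 1, size=(5, 2))          # a few random inputs
y = np.array([oracle_sample(x) for x in X])  # sampled outputs
print(y)
```

Calling the oracle repeatedly on the same x returns different y values, which is exactly the Case 2 (probabilistic outputs) behavior contrasted with the deterministic Case 1.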
SLIDE 7

In-Class Exercise

  • 1. With your neighbor, write a function which returns samples from a Categorical distribution.
    – Assume access to the rand() function.
    – The function signature should be: categorical_sample(theta), where theta is the array of parameters.
    – Make your implementation as efficient as possible!
  • 2. What is the expected runtime of your function?

(One possible sketch follows below.)

16
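One possible sketch of such a function is shown below (this is not the official solution; it assumes rand() returns a uniform draw from [0, 1), as in the exercise):

```python
# Sketch: sample an index k from a Categorical(theta) distribution
# using a single uniform draw (inverse-CDF method).
from random import random as rand   # rand() ~ Uniform[0, 1)

def categorical_sample(theta):
    """theta: array of K nonnegative parameters summing to 1."""
    u = rand()
    cumulative = 0.0
    for k, p in enumerate(theta):
        cumulative += p
        if u < cumulative:
            return k              # index of the sampled category
    return len(theta) - 1         # guard against floating-point round-off

print(categorical_sample([0.2, 0.5, 0.3]))
```

This version runs in O(K) time per sample; precomputing the cumulative sums and binary-searching them would bring each draw down to O(log K).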
SLIDE 8

Generative vs. Discriminative

Whiteboard

– Generative vs. Discriminative Models

  • Chain rule of probability
  • Maximum (Conditional) Likelihood Estimation for Discriminative models
  • Maximum Likelihood Estimation for Generative models

17
SLIDE 9

Categorical Distribution

Whiteboard

– Categorical distribution details

  • Independent and Identically Distributed (i.i.d.)
  • Example: Dice Rolls
18
SLIDE 10

Takeaways

  • One view of what ML is trying to accomplish is function approximation
  • The principle of maximum likelihood estimation provides an alternate view of learning
  • Synthetic data can help debug ML algorithms
  • Probability distributions can be used to model real data that occurs in the world (don’t worry, we’ll make our distributions more interesting soon!)

19
SLIDE 11

Learning Objectives

Oracles, Sampling, Generative vs. Discriminative

You should be able to…
  1. Sample from common probability distributions
  2. Write a generative story for a generative or discriminative classification or regression model
  3. Pretend to be a data generating oracle
  4. Provide a probabilistic interpretation of linear regression
  5. Use the chain rule of probability to contrast generative vs. discriminative modeling
  6. Define maximum likelihood estimation (MLE) and maximum conditional likelihood estimation (MCLE)

20
SLIDE 12

PROBABILITY

21
SLIDE 13

Random Variables: Definitions

22

Discrete Random Variable: Random variable X whose values come from a countable set (e.g., the natural numbers or {True, False}).

Probability mass function (pmf): Function p(x) giving the probability that the discrete r.v. X takes value x: p(x) := P(X = x).

SLIDE 14

Random Variables: Definitions

23

Continuous Random Variable: Random variable X whose values come from an interval or a collection of intervals (e.g., the real numbers or the range (3, 5)).

Probability density function (pdf): Function f(x) that returns a nonnegative real indicating the relative likelihood that the continuous r.v. X takes value x.

  • For any continuous random variable: P(X = x) = 0
  • Nonzero probabilities are only assigned to intervals:
    P(a ≤ X ≤ b) = ∫_a^b f(x) dx

SLIDE 15

Random Variables: Definitions

24

Cumulative distribution function (cdf): Function F(x) that returns the probability that a random variable X is less than or equal to x: F(x) = P(X ≤ x).

  • For discrete random variables:
    F(x) = P(X ≤ x) = Σ_{x′ ≤ x} P(X = x′) = Σ_{x′ ≤ x} p(x′)
  • For continuous random variables:
    F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(x′) dx′
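As a quick worked example (not on the slide), applying the continuous-case definition to the Exponential(λ) distribution used later in the lecture gives its cdf:

```latex
F(x) = P(X \le x)
     = \int_{0}^{x} \lambda e^{-\lambda x'}\, dx'
     = \left[\, -e^{-\lambda x'} \,\right]_{0}^{x}
     = 1 - e^{-\lambda x}, \qquad x \ge 0 .
```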
SLIDE 16

Notational Shortcuts

25

A convenient shorthand:

P(A|B) = P(A, B) / P(B)   ⇒   for all values of a and b:   P(A = a | B = b) = P(A = a, B = b) / P(B = b)

SLIDE 17

Notational Shortcuts

But then how do we tell P(E) apart from P(X)? (Here E is an event and X is a random variable.)

26

Instead of writing:   P(A|B) = P(A, B) / P(B)

We should write:   P_{A|B}(A|B) = P_{A,B}(A, B) / P_B(B)

…but only probability theory textbooks go to such lengths.

SLIDE 18

COMMON PROBABILITY DISTRIBUTIONS

27
SLIDE 19

Common Probability Distributions

  • For Discrete Random Variables:
    – Bernoulli
    – Binomial
    – Multinomial
    – Categorical
    – Poisson

  • For Continuous Random Variables:
    – Exponential
    – Gamma
    – Beta
    – Dirichlet
    – Laplace
    – Gaussian (1D)
    – Multivariate Gaussian

28
SLIDE 20

Common Probability Distributions

Beta Distribution

probability density function:

f(φ | α, β) = (1 / B(α, β)) φ^(α−1) (1 − φ)^(β−1)

[Plot: the Beta pdf f(φ | α, β) as a function of φ ∈ (0, 1) for (α, β) = (0.1, 0.9), (0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (10.0, 5.0).]
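A minimal sketch for reproducing the curves described in the plot above by evaluating the Beta pdf numerically; it assumes SciPy is available and uses the (α, β) pairs listed for the plot:

```python
# Sketch: evaluate the Beta pdf f(phi | alpha, beta) on a grid of phi values.
import numpy as np
from scipy.stats import beta   # beta.pdf(x, a, b) evaluates the density

phi = np.linspace(0.01, 0.99, 99)
for a, b in [(0.1, 0.9), (0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (10.0, 5.0)]:
    density = beta.pdf(phi, a, b)
    print(f"alpha={a}, beta={b}: max density on grid = {density.max():.2f}")
```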

SLIDE 21

Common Probability Distributions

Dirichlet Distribution

SLIDE 22

Common Probability Distributions

Dirichlet Distribution

probability density function:

p(φ⃗ | α⃗) = (1 / B(α⃗)) ∏_{k=1}^{K} φ_k^(α_k − 1),   where   B(α⃗) = ( ∏_{k=1}^{K} Γ(α_k) ) / Γ( Σ_{k=1}^{K} α_k )

[Plots: the Dirichlet density p(φ⃗ | α⃗) over the probability simplex for two settings of α⃗.]
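Since each Dirichlet sample φ⃗ is itself a probability vector on the simplex, a short sampling sketch makes the definition concrete; this assumes NumPy and an illustrative choice of α⃗:

```python
# Sketch: draw samples from a Dirichlet distribution and verify they lie on the simplex.
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([5.0, 2.0, 1.0])        # illustrative concentration parameters
samples = rng.dirichlet(alpha, size=4)   # each row is a probability vector phi

print(samples)
print(samples.sum(axis=1))               # every row sums to 1
```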

SLIDE 23

EXPECTATION AND VARIANCE

32
SLIDE 24

Expectation and Variance

33
The expected value of X is E[X]. It is also called the mean. Suppose X can take any value in the set 𝒳.

  • Discrete random variables:
    E[X] = Σ_{x∈𝒳} x p(x)

  • Continuous random variables:
    E[X] = ∫_{−∞}^{+∞} x f(x) dx

SLIDE 25

Expectation and Variance

34

The variance of X is Var(X): Var(X) = E[(X − E[X])²]. Let µ = E[X].

  • Discrete random variables:
    Var(X) = Σ_{x∈𝒳} (x − µ)² p(x)

  • Continuous random variables:
    Var(X) = ∫_{−∞}^{+∞} (x − µ)² f(x) dx
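To connect the two formulas above to something executable, here is a small sketch computing E[X] and Var(X) for a discrete pmf; the fair six-sided die is an illustrative choice, not from the slides:

```python
# Sketch: expectation and variance of a discrete random variable from its pmf.
import numpy as np

values = np.arange(1, 7)       # X takes values 1..6 (a fair die)
pmf = np.full(6, 1 / 6)        # p(x) = 1/6 for each value

mean = np.sum(values * pmf)                 # E[X] = sum_x x p(x)
var = np.sum((values - mean) ** 2 * pmf)    # Var(X) = sum_x (x - mu)^2 p(x)

print(mean, var)               # 3.5 and about 2.9167
```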

SLIDE 26

MULTIPLE RANDOM VARIABLES

Joint probability Marginal probability Conditional probability

35
SLIDE 27

Joint Probability

36
  • Key concept: two or more random variables may interact. Thus, the probability of one taking on a certain value depends on which value(s) the others are taking.
  • We call this a joint ensemble and write p(x, y) = prob(X = x and Y = y)

[Figure: a three-dimensional table of joint probabilities p(x, y, z) over variables x, y, z.]

Slide from Sam Roweis (MLSS, 2005)

SLIDE 28

Marginal Probabilities

37
  • We can ”sum out” part of a joint distribution to get the marginal distribution of a subset of variables: p(x) = Σ_y p(x, y)
  • This is like adding slices of the table together.
  • Another equivalent definition: p(x) = Σ_y p(x|y) p(y)

[Figure: summing the joint table p(x, y, z) over z to give the marginal p(x, y).]

Slide from Sam Roweis (MLSS, 2005)
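A small numerical sketch of ”summing out” a joint table (and of the ”slicing” operation on the next slide); the table entries are made up for illustration:

```python
# Sketch: marginals and a conditional "slice" from a joint table p(x, y).
import numpy as np

# Rows index x, columns index y; the entries sum to 1.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

p_x = p_xy.sum(axis=1)               # marginal: p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)               # marginal: p(y) = sum_x p(x, y)
p_x_given_y0 = p_xy[:, 0] / p_y[0]   # conditional: p(x | y=0) = p(x, y=0) / p(y=0)

print(p_x, p_y, p_x_given_y0)
```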

SLIDE 29

Conditional Probability

38

Conditional Probability

  • If we know that some event has occurred, it changes our belief about the probability of other events.
  • This is like taking a ”slice” through the joint table: p(x|y) = p(x, y) / p(y)

[Figure: a slice of the joint table, giving p(x, y | z) for a fixed z.]

Slide from Sam Roweis (MLSS, 2005)
SLIDE 30

Independence and Conditional Independence

39

Independence & Conditional Independence

  • Two variables are independent iff their joint factors: p(x, y) = p(x) p(y)

[Figure: the joint table p(x, y) represented as the product of p(x) and p(y).]

  • Two variables are conditionally independent given a third one if, for all values of the conditioning variable, the resulting slice factors: p(x, y|z) = p(x|z) p(y|z) ∀z

Slide from Sam Roweis (MLSS, 2005)

SLIDE 31

MLE AND MAP

40
SLIDE 32

MLE

41

Suppose we have data D = {x^(i)}_{i=1}^N.

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

θ_MLE = argmax_θ ∏_{i=1}^{N} p(x^(i) | θ)

Maximum Likelihood Estimate (MLE)
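Because the log is monotone increasing, maximizing the likelihood is equivalent to maximizing the log-likelihood; this standard identity (not shown on the slide) is the form used in the derivations that follow:

```latex
\theta_{\mathrm{MLE}}
  = \operatorname*{argmax}_{\theta} \prod_{i=1}^{N} p(x^{(i)} \mid \theta)
  = \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} \log p(x^{(i)} \mid \theta)
  = \operatorname*{argmax}_{\theta} \ell(\theta).
```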

SLIDE 33

MLE

What does maximizing likelihood accomplish?

  • There is only a finite amount of probability mass (i.e. the sum-to-one constraint)
  • MLE tries to allocate as much probability mass as possible to the things we have observed…

…at the expense of the things we have not observed
42
SLIDE 34

MLE

Example: MLE of Exponential Distribution

43
  • pdf of Exponential(λ): f(x) = λ e^(−λx)
  • Suppose X_i ∼ Exponential(λ) for 1 ≤ i ≤ N.
  • Find the MLE for data D = {x^(i)}_{i=1}^N.
  • First write down the log-likelihood of the sample.
  • Compute the first derivative, set it to zero, solve for λ.
  • Compute the second derivative and check that it is concave down at λ_MLE.

SLIDE 35

MLE

Example: MLE of Exponential Distribution

44
  • First write down the log-likelihood of the sample.

ℓ(λ) = Σ_{i=1}^{N} log f(x^(i))                 (1)
     = Σ_{i=1}^{N} log( λ exp(−λ x^(i)) )       (2)
     = Σ_{i=1}^{N} log(λ) − λ x^(i)             (3)
     = N log(λ) − λ Σ_{i=1}^{N} x^(i)           (4)

SLIDE 36

MLE

Example: MLE of Exponential Distribution

45
  • Compute the first derivative, set it to zero, and solve for λ.

dℓ(λ)/dλ = d/dλ [ N log(λ) − λ Σ_{i=1}^{N} x^(i) ]   (1)
         = N/λ − Σ_{i=1}^{N} x^(i) = 0               (2)

⇒ λ_MLE = N / Σ_{i=1}^{N} x^(i)                      (3)
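A minimal numerical sanity check of the closed-form result above (not from the slides); it assumes NumPy, and the true rate and grid are illustrative choices:

```python
# Sketch: verify lambda_MLE = N / sum_i x(i) on synthetic Exponential data.
import numpy as np

rng = np.random.default_rng(0)
true_lam = 2.5
x = rng.exponential(scale=1.0 / true_lam, size=10_000)   # x(i) ~ Exponential(true_lam)

lam_mle = len(x) / x.sum()            # closed-form MLE from the derivation above

# Brute-force check: maximize l(lam) = N log(lam) - lam * sum(x) over a grid.
grid = np.linspace(0.1, 5.0, 10_000)
loglik = len(x) * np.log(grid) - grid * x.sum()
lam_grid = grid[np.argmax(loglik)]

print(lam_mle, lam_grid)              # both should be close to true_lam
```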

SLIDE 37

MLE

Example: MLE of Exponential Distribution

46
  • pdf of Exponential(λ): f(x) = λ e^(−λx)
  • Suppose X_i ∼ Exponential(λ) for 1 ≤ i ≤ N.
  • Find the MLE for data D = {x^(i)}_{i=1}^N.
  • First write down the log-likelihood of the sample.
  • Compute the first derivative, set it to zero, solve for λ.
  • Compute the second derivative and check that it is concave down at λ_MLE.

SLIDE 38

MLE

In-Class Exercise

Show that the MLE of parameter ɸ for N samples drawn from Bernoulli(ɸ) is the fraction of samples equal to 1:   ɸ_MLE = (Σ_{i=1}^{N} x^(i)) / N

47

Steps to answer:

  • 1. Write the log-likelihood of the sample
  • 2. Compute the derivative w.r.t. ɸ
  • 3. Set the derivative to zero and solve for ɸ

(A worked sketch follows below.)
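A possible worked sketch following the three steps above (for checking your answer; it is not on the original slide), writing N_1 = Σ_i x^(i) for the number of samples equal to 1:

```latex
\ell(\phi) = \sum_{i=1}^{N} \log p(x^{(i)} \mid \phi)
           = \sum_{i=1}^{N} \big[\, x^{(i)} \log\phi + (1 - x^{(i)})\log(1-\phi) \,\big]
           = N_1 \log\phi + (N - N_1)\log(1-\phi)

\frac{d\ell}{d\phi} = \frac{N_1}{\phi} - \frac{N - N_1}{1-\phi} = 0
\quad\Rightarrow\quad
\phi_{\mathrm{MLE}} = \frac{N_1}{N} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)} .
```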

SLIDE 39

Learning from Data (Frequentist)

Whiteboard

– Optimization for MLE
– Examples: 1D and 2D optimization
– Example: MLE of Bernoulli
– Example: MLE of Categorical
– Aside: Method of Lagrange Multipliers

48
SLIDE 40

MLE vs. MAP

49

Suppose we have data D = {x^(i)}_{i=1}^N.

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

θ_MLE = argmax_θ ∏_{i=1}^{N} p(x^(i) | θ)

Maximum Likelihood Estimate (MLE)

Principle of Maximum a posteriori (MAP) Estimation:
Choose the parameters that maximize the posterior of the parameters given the data.

Maximum a posteriori (MAP) estimate

SLIDE 41

MLE vs. MAP

50

Suppose we have data D = {x^(i)}_{i=1}^N.

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

θ_MLE = argmax_θ ∏_{i=1}^{N} p(x^(i) | θ)

Maximum Likelihood Estimate (MLE)

Principle of Maximum a posteriori (MAP) Estimation:
Choose the parameters that maximize the posterior of the parameters given the data.

θ_MAP = argmax_θ [ ∏_{i=1}^{N} p(x^(i) | θ) ] p(θ)

Maximum a posteriori (MAP) estimate; p(θ) is the prior.
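To make the MLE-versus-MAP difference concrete, here is a small sketch (not from the slides) comparing the two estimates of a Bernoulli parameter ɸ under a Beta(α, β) prior; the closed form used for the MAP estimate is the mode of the Beta posterior, the example worked on the whiteboard for the next slide. The sample size and hyperparameters are illustrative assumptions.

```python
# Sketch: MLE vs. MAP for a Bernoulli parameter with a Beta(alpha, beta) prior.
import numpy as np

rng = np.random.default_rng(1)
phi_true = 0.3
x = rng.binomial(n=1, p=phi_true, size=20)   # a small sample of 0/1 outcomes

N, N1 = len(x), x.sum()
alpha, beta = 2.0, 2.0                       # illustrative prior (mild pull toward 0.5)

phi_mle = N1 / N                                        # maximizes the likelihood
phi_map = (N1 + alpha - 1) / (N + alpha + beta - 2)     # maximizes likelihood * prior

print(phi_mle, phi_map)   # with few samples the prior noticeably shifts the MAP estimate
```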

SLIDE 42

Learning from Data (Bayesian)

Whiteboard

– Maximum a posteriori (MAP) estimation
– Optimization for MAP
– Example: MAP of Bernoulli–Beta

51
SLIDE 43

Takeaways

  • One view of what ML is trying to accomplish is function approximation
  • The principle of maximum likelihood estimation provides an alternate view of learning
  • Synthetic data can help debug ML algorithms
  • Probability distributions can be used to model real data that occurs in the world (don’t worry, we’ll make our distributions more interesting soon!)

52
SLIDE 44

Learning Objectives

MLE / MAP

You should be able to…
  1. Recall probability basics, including but not limited to: discrete and continuous random variables, probability mass functions, probability density functions, events vs. random variables, expectation and variance, joint probability distributions, marginal probabilities, conditional probabilities, independence, conditional independence
  2. Describe common probability distributions such as the Beta, Dirichlet, Multinomial, Categorical, Gaussian, Exponential, etc.
  3. State the principle of maximum likelihood estimation and explain what it tries to accomplish
  4. State the principle of maximum a posteriori estimation and explain why we use it
  5. Derive the MLE or MAP parameters of a simple model in closed form

53