

SLIDE 1

MLE/MAP + Naïve Bayes

10-601 Introduction to Machine Learning

Matt Gormley
Lecture 17
Mar. 20, 2020

Machine Learning Department
School of Computer Science
Carnegie Mellon University

SLIDE 2

Reminders

Homework 5: Neural Networks
Out: Fri, Feb 28
Due: Sun, Mar 22 at 11:59pm

Homework 6: Learning Theory / Generative Models
Out: Fri, Mar 20
Due: Fri, Mar 27 at 11:59pm

TIP: Do the readings!

Today's In-Class Poll: http://poll.mlcourse.org

Matt's new after-class office hours (on Zoom)

SLIDE 3

MLE AND MAP

SLIDE 4

Likelihood Function

Suppose we have N samples D = {x(1), x(2), …, x(N)} from a random variable X.

The likelihood function:
Case 1: X is discrete with pmf p(x|θ):
L(θ) = p(x(1)|θ) p(x(2)|θ) … p(x(N)|θ)
Case 2: X is continuous with pdf f(x|θ):
L(θ) = f(x(1)|θ) f(x(2)|θ) … f(x(N)|θ)

The log-likelihood function:
Case 1: X is discrete with pmf p(x|θ):
ℓ(θ) = log p(x(1)|θ) + … + log p(x(N)|θ)
Case 2: X is continuous with pdf f(x|θ):
ℓ(θ) = log f(x(1)|θ) + … + log f(x(N)|θ)

In both cases (discrete/continuous), the likelihood tells us how likely one sample is relative to another.

(One R.V.)
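As a concrete companion to these definitions (not from the original deck), here is a minimal Python sketch that evaluates the log-likelihood in both cases, using a Bernoulli pmf for the discrete case and an exponential pdf for the continuous case; all data and parameter values are made up:

```python
import numpy as np

# Case 1 (discrete): Bernoulli pmf p(x | phi) = phi^x * (1 - phi)^(1 - x)
def bernoulli_log_likelihood(samples, phi):
    x = np.asarray(samples)
    return np.sum(x * np.log(phi) + (1 - x) * np.log(1 - phi))

# Case 2 (continuous): exponential pdf f(x | lam) = lam * exp(-lam * x)
def exponential_log_likelihood(samples, lam):
    x = np.asarray(samples)
    return np.sum(np.log(lam) - lam * x)

print(bernoulli_log_likelihood([1, 0, 1, 1, 0], phi=0.6))
print(exponential_log_likelihood([0.3, 1.2, 0.7], lam=1.5))
```

Parameter values that match the data better yield a larger (less negative) log-likelihood, which is exactly what MLE exploits in the slides that follow.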

SLIDE 5

Likelihood Function

• Suppose we have N samples D = {(x(1), y(1)), …, (x(N), y(N))} from a pair of random variables X, Y.

• The conditional likelihood function:
Case 1: Y is discrete with pmf p(y|x, θ):
L(θ) = p(y(1)|x(1), θ) … p(y(N)|x(N), θ)
Case 2: Y is continuous with pdf f(y|x, θ):
L(θ) = f(y(1)|x(1), θ) … f(y(N)|x(N), θ)

• The joint likelihood function:
Case 1: X and Y are discrete with pmf p(x, y|θ):
L(θ) = p(x(1), y(1)|θ) … p(x(N), y(N)|θ)
Case 2: X and Y are continuous with pdf f(x, y|θ):
L(θ) = f(x(1), y(1)|θ) … f(x(N), y(N)|θ)

(Two R.V.s)

SLIDE 6

Likelihood Function

• Suppose we have N samples D = {(x(1), y(1)), …, (x(N), y(N))} from a pair of random variables X, Y.

• The joint likelihood function:
Case 1: X and Y are discrete with pmf p(x, y|θ):
L(θ) = p(x(1), y(1)|θ) … p(x(N), y(N)|θ)
Case 2: X and Y are continuous with pdf f(x, y|θ):
L(θ) = f(x(1), y(1)|θ) … f(x(N), y(N)|θ)
Case 3: Y is discrete with pmf p(y|φ) and X is continuous with pdf f(x|y, θ):
L(φ, θ) = f(x(1)|y(1), θ) p(y(1)|φ) … f(x(N)|y(N), θ) p(y(N)|φ)
Case 4: Y is continuous with pdf f(y|φ) and X is discrete with pmf p(x|y, θ):
L(φ, θ) = p(x(1)|y(1), θ) f(y(1)|φ) … p(x(N)|y(N), θ) f(y(N)|φ)

(Two R.V.s; mixed discrete/continuous!)
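Case 3 is exactly the factorization Naïve Bayes uses later in this lecture. As an illustration (not from the deck), here is a minimal Python sketch of the mixed-case log-likelihood with Y ~ Bernoulli(φ) and X | Y = y Gaussian; all data and parameter values are made up:

```python
import numpy as np

def mixed_log_likelihood(x, y, phi, mu, sigma):
    """Case 3: Y ~ Bernoulli(phi) (discrete), X | Y = y ~ Normal(mu[y], sigma[y]) (continuous).
    Returns log L(phi, theta) = sum_i [ log f(x_i | y_i, theta) + log p(y_i | phi) ]."""
    x, y = np.asarray(x), np.asarray(y)
    log_py = y * np.log(phi) + (1 - y) * np.log(1 - phi)                 # log p(y | phi)
    m, s = mu[y], sigma[y]                                               # class-conditional params
    log_fx = -0.5 * np.log(2 * np.pi * s**2) - (x - m)**2 / (2 * s**2)   # log Gaussian pdf
    return np.sum(log_fx + log_py)

# Made-up samples: two classes with different means
print(mixed_log_likelihood(x=[0.9, 2.1, -0.2], y=[1, 1, 0],
                           phi=0.5, mu=np.array([0.0, 2.0]), sigma=np.array([1.0, 1.0])))
```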

SLIDE 7

MLE

Principle of Maximum Likelihood Estimation: Choose the parameters that maximize the likelihood of the data.

Maximum Likelihood Estimate (MLE)

[Figure: a 1-D likelihood curve L(θ) with θ_MLE marked, and a contour plot of L(θ1, θ2) with its MLE marked.]

SLIDE 8

MLE

What does maximizing likelihood accomplish?
There is only a finite amount of probability mass (i.e. the sum-to-one constraint).
MLE tries to allocate as much probability mass as possible to the things we have observed…
…at the expense of the things we have not observed.

SLIDE 9

Recipe for Closed-form MLE

1. Assume data was generated i.i.d. from some model (i.e. write the generative story):
   x(i) ~ p(x|θ)
2. Write the log-likelihood:
   ℓ(θ) = log p(x(1)|θ) + … + log p(x(N)|θ)
3. Compute partial derivatives (i.e. the gradient):
   ∂ℓ(θ)/∂θ1 = …
   ∂ℓ(θ)/∂θ2 = …
   …
   ∂ℓ(θ)/∂θM = …
4. Set the derivatives to zero and solve for θ:
   ∂ℓ(θ)/∂θm = 0 for all m ∈ {1, …, M}
   θ_MLE = solution to the system of M equations in M variables
5. Compute the second derivative and check that ℓ(θ) is concave down at θ_MLE.

SLIDE 10

MLE

Example: MLE of Exponential Distribution

Goal: derive the closed-form MLE of the rate parameter λ.
Steps: follow the recipe above. (Worked on the whiteboard; see the sketch after Slide 12.)

SLIDE 11

MLE

Example: MLE of Exponential Distribution (continued)

SLIDE 12

MLE

Example: MLE of Exponential Distribution (continued)
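The worked example lives on the whiteboard, so these slides are blank in the deck. A standard sketch of the derivation, assuming the parameterization f(x|λ) = λe^(−λx):

```latex
\begin{align*}
\ell(\lambda) &= \sum_{i=1}^{N} \log\left(\lambda e^{-\lambda x^{(i)}}\right)
              = N \log \lambda - \lambda \sum_{i=1}^{N} x^{(i)} \\
\frac{d\ell}{d\lambda} &= \frac{N}{\lambda} - \sum_{i=1}^{N} x^{(i)} = 0
\quad\Longrightarrow\quad
\lambda_{\text{MLE}} = \frac{N}{\sum_{i=1}^{N} x^{(i)}} = \frac{1}{\bar{x}}
\end{align*}
```

The second derivative, −N/λ², is negative everywhere, so ℓ(λ) is concave and λ_MLE is indeed a maximum (step 5 of the recipe).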

SLIDE 13

MLE

In-Class Exercise: Show that the MLE of parameter φ for N samples drawn from Bernoulli(φ) is φ_MLE = N1/N, where N1 = # of (x(i) = 1).

Steps to answer:
1. Write the log-likelihood of the sample
2. Compute the derivative w.r.t. φ
3. Set the derivative to zero and solve for φ

SLIDE 14

MLE

Question: Assume we have N samples x(1), x(2), …, x(N) drawn from a Bernoulli(φ). What is the log-likelihood of the data, ℓ(φ)?
Assume N1 = # of (x(i) = 1) and N0 = # of (x(i) = 0).

Answer:
A. ℓ(φ) = N1 log(φ) + N0 (1 − log(φ))
B. ℓ(φ) = N1 log(φ) + N0 log(1 − φ)
C. ℓ(φ) = log(φ)^N1 + (1 − log(φ))^N0
D. ℓ(φ) = log(φ)^N1 + log(1 − φ)^N0
E. ℓ(φ) = N0 log(φ) + N1 (1 − log(φ))
F. ℓ(φ) = N0 log(φ) + N1 log(1 − φ)
G. ℓ(φ) = log(φ)^N0 + (1 − log(φ))^N1
H. ℓ(φ) = log(φ)^N0 + log(1 − φ)^N1
I. ℓ(φ) = the most likely answer

SLIDE 15

MLE

Question: Assume we have N samples x(1), x(2), …, x(N) drawn from a Bernoulli(φ). What is the derivative of the log-likelihood, ∂ℓ(φ)/∂φ?
Assume N1 = # of (x(i) = 1) and N0 = # of (x(i) = 0).

Answer:
A. ∂ℓ(φ)/∂φ = N1 φ + (1 − φ) N0
B. ∂ℓ(φ)/∂φ = φ/N1 + (1 − φ)/N0
C. ∂ℓ(φ)/∂φ = N1/φ + N0/(1 − φ)
D. ∂ℓ(φ)/∂φ = log(φ)/N1 + log(1 − φ)/N0
E. ∂ℓ(φ)/∂φ = N1/log(φ) + N0/log(1 − φ)
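For reference, differentiating ℓ(φ) = N1 log(φ) + N0 log(1 − φ) gives ∂ℓ(φ)/∂φ = N1/φ − N0/(1 − φ). A minimal Python sketch (with made-up samples) checks the analytic derivative against a central finite difference:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])                   # made-up Bernoulli samples
N1, N0 = x.sum(), len(x) - x.sum()

def loglik(phi):                                   # l(phi) = N1 log(phi) + N0 log(1 - phi)
    return N1 * np.log(phi) + N0 * np.log(1 - phi)

def dloglik(phi):                                  # analytic derivative
    return N1 / phi - N0 / (1 - phi)

phi, eps = 0.4, 1e-6
finite_diff = (loglik(phi + eps) - loglik(phi - eps)) / (2 * eps)
print(dloglik(phi), finite_diff)                   # the two values should agree closely
```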

SLIDE 16

Learning from Data (Frequentist)

Whiteboard:
Example: MLE of Bernoulli
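The whiteboard derivation is not reproduced in the deck; a standard sketch, with N1 = # of (x(i) = 1) and N0 = N − N1:

```latex
\begin{align*}
\ell(\phi) &= N_1 \log \phi + N_0 \log(1 - \phi) \\
\frac{d\ell}{d\phi} &= \frac{N_1}{\phi} - \frac{N_0}{1 - \phi} = 0
\quad\Longrightarrow\quad
N_1 (1 - \phi) = N_0 \phi
\quad\Longrightarrow\quad
\phi_{\text{MLE}} = \frac{N_1}{N_1 + N_0} = \frac{N_1}{N}
\end{align*}
```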

SLIDE 17

MLE vs. MAP

Principle of Maximum a posteriori (MAP) Estimation: Choose the parameters that maximize the posterior of the parameters given the data.

Principle of Maximum Likelihood Estimation: Choose the parameters that maximize the likelihood of the data.

Maximum Likelihood Estimate (MLE)
Maximum a posteriori (MAP) estimate

SLIDE 18

MLE vs. MAP

Principle of Maximum a posteriori (MAP) Estimation: Choose the parameters that maximize the posterior of the parameters given the data.

Principle of Maximum Likelihood Estimation: Choose the parameters that maximize the likelihood of the data.

Maximum Likelihood Estimate (MLE)
Maximum a posteriori (MAP) estimate
[Equations shown as images; the prior term in the MAP objective is highlighted.]

SLIDE 19

MLE vs. MAP

Principle of Maximum a posteriori (MAP) Estimation: Choose the parameters that maximize the posterior of the parameters given the data.

Principle of Maximum Likelihood Estimation: Choose the parameters that maximize the likelihood of the data.

Maximum Likelihood Estimate (MLE)
Maximum a posteriori (MAP) estimate
[Equations shown as images; the prior term in the MAP objective is highlighted.]

Important! Usually the parameters are continuous, so the prior is a probability density function.
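The objective functions on these three slides are equation images that did not survive extraction; in standard notation, the two estimates being contrasted are:

```latex
\begin{align*}
\theta_{\text{MLE}} &= \operatorname*{argmax}_{\theta} \; p(D \mid \theta) \\
\theta_{\text{MAP}} &= \operatorname*{argmax}_{\theta} \; p(\theta \mid D)
 = \operatorname*{argmax}_{\theta} \; p(D \mid \theta)\,\underbrace{p(\theta)}_{\text{prior}}
\end{align*}
```

(The second equality uses Bayes' rule and drops the constant p(D), which does not affect the argmax.)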

SLIDE 20

Learning from Data (Bayesian)

Whiteboard:
maximum a posteriori (MAP) estimation
Example: MAP of Bernoulli—Beta
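The whiteboard example is not reproduced in the deck; a standard sketch, assuming a Beta(α, β) prior on φ:

```latex
\begin{align*}
p(\phi) &\propto \phi^{\alpha - 1} (1 - \phi)^{\beta - 1}
&&\text{(Beta prior)} \\
p(\phi \mid D) &\propto \phi^{N_1 + \alpha - 1} (1 - \phi)^{N_0 + \beta - 1}
&&\text{(posterior is again a Beta)} \\
\phi_{\text{MAP}} &= \frac{N_1 + \alpha - 1}{N + \alpha + \beta - 2}
&&\text{vs.}\quad \phi_{\text{MLE}} = \frac{N_1}{N}
\end{align*}
```

The prior acts like α − 1 imaginary heads and β − 1 imaginary tails added to the observed counts; with α = β = 1 (a uniform prior), the MAP estimate reduces to the MLE.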

SLIDE 21

Recipe for Closed-form MLE

(Same recipe as Slide 9, shown again for reference.)

SLIDE 22

Learning from Data (Bayesian)

Whiteboard:
maximum a posteriori (MAP) estimation
Example: MAP of Bernoulli—Beta (continued)

SLIDE 23

Takeaways

• One view of what ML is trying to accomplish is function approximation.
• The principle of maximum likelihood estimation provides an alternate view of learning.
• Synthetic data can help debug ML algorithms.
• Probability distributions can be used to model real data that occurs in the world. (Don't worry, we'll make our distributions more interesting soon!)

SLIDE 24

Learning Objectives

MLE/MAP: You should be able to…
1. Recall probability basics, including but not limited to: discrete and continuous random variables, probability mass functions, probability density functions, events vs. random variables, expectation and variance, joint probability distributions, marginal probabilities, conditional probabilities, independence, conditional independence
2. Describe common probability distributions such as the Beta, Dirichlet, Multinomial, Categorical, Gaussian, Exponential, etc.
3. State the principle of maximum likelihood estimation and explain what it tries to accomplish
4. State the principle of maximum a posteriori estimation and explain why we use it
5. Derive the MLE or MAP parameters of a simple model in closed form

SLIDE 25

NAÏVE BAYES

SLIDE 26

Naïve Bayes Outline

• Real-World Dataset
  Economist vs. Onion articles
  Document → bag of words → binary feature vector
• Naive Bayes: Model
  Generating synthetic "labeled documents"
  Definition of model
  Naive Bayes assumption
  Counting # of parameters with/without NB assumption
• Naïve Bayes: Learning from Data
  Data likelihood
  MLE for Naive Bayes
  MAP for Naive Bayes
• Visualizing Gaussian Naive Bayes

SLIDE 27

Naïve Bayes

Why are we talking about Naïve Bayes?
• It's just another decision function that fits into our "big picture" recipe from last time.
• But it's our first example of a Bayesian Network, and it provides a clearer picture of probabilistic learning.
• Just like the other Bayes Nets we'll see, it admits a closed-form solution for MLE and MAP, so learning is extremely efficient (just counting).

SLIDE 28

Fake News Detector

[Figure: example articles from CNN and The Onion.]

Today's Goal: To define a generative model of emails of two different classes (e.g. real vs. fake news)
SLIDE 29

Fake News Detector

[Figure: the CNN and The Onion articles represented as feature vectors.]

We can pretend the natural process generating these vectors is stochastic…

SLIDE 30

Naive Bayes: Model

Whiteboard:
Document → bag of words → binary feature vector
Generating synthetic "labeled documents"
Definition of model
Naive Bayes assumption
Counting # of parameters with/without NB assumption

SLIDE 31

Model 1: Bernoulli Naïve Bayes

Generative story: flip a weighted coin to choose y. If HEADS, flip each red coin; if TAILS, flip each blue coin. Each red coin corresponds to an xm.

[Table residue: sampled binary values in columns y, x1, x2, x3, …, xM.]

We can generate data in this fashion. Though in practice we never would, since our data is given. Instead, this provides an explanation of how the data was generated (albeit a terrible one).
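A minimal Python sketch of this generative story (not from the deck; the class prior phi, the coin weights theta, and M = 3 features are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)

phi = 0.6                                    # the weighted coin: P(y = 1)
theta = np.array([[0.1, 0.7, 0.4],           # "blue" coins: P(x_m = 1 | y = 0)
                  [0.8, 0.2, 0.5]])          # "red"  coins: P(x_m = 1 | y = 1)

def generate(n):
    """Sample n (x, y) pairs from the Bernoulli Naive Bayes generative story."""
    y = (rng.random(n) < phi).astype(int)                         # flip the weighted coin
    x = (rng.random((n, theta.shape[1])) < theta[y]).astype(int)  # flip each coin for class y
    return x, y

X, y = generate(5)
print(np.column_stack([y, X]))               # columns: y, x1, x2, x3
```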

SLIDE 32

What's wrong with the Naïve Bayes Assumption?

The features might not be independent!!

Example 1: If a document contains the word "Donald", it's extremely likely to contain the word "Trump". These are not independent!

Example 2: If the petal width is very high, the petal length is also likely to be very high.

SLIDE 33

Naïve Bayes: Learning from Data

Whiteboard:
Data likelihood
MLE for Naive Bayes
Example: MLE for Naïve Bayes with Two Features
MAP for Naive Bayes
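The whiteboard derivation ends in closed-form count ratios; here is a minimal sketch of the resulting MLE for Bernoulli Naïve Bayes (function and variable names are mine, not the lecture's):

```python
import numpy as np

def nb_mle(X, y):
    """MLE for Bernoulli Naive Bayes is just counting:
    phi      = fraction of examples with y = 1
    theta[c] = per-feature fraction of x_m = 1 among examples with y = c
    """
    X, y = np.asarray(X), np.asarray(y)
    phi = y.mean()
    theta = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
    return phi, theta

# Tiny made-up dataset: rows are documents, columns are word indicators
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
print(nb_mle(X, y))
```

The MAP variant with a Beta prior on each parameter simply adds pseudocounts to the numerators and denominators of these ratios, which also avoids zero estimates for words never seen in a class.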

SLIDE 34

Recipe for Closed-form MLE

(Same recipe as Slide 9, shown again in the Naïve Bayes context.)