Lecture 8:
− Maximum Likelihood Estimation (MLE) (cont’d.)
− Maximum a posteriori (MAP) estimation
− Naïve Bayes Classifier
Aykut Erdem
March 2016 Hacettepe University
Last time… Flipping a Coin
I have a coin; if I flip it, what's the probability that it will fall with the head up?

Let us flip it a few times to estimate the probability.

The estimated probability is 3/5: the "frequency of heads".

slide by Barnabás Póczos & Alex Smola

Last time… Flipping a Coin
Questions:
(1) Why frequency of heads??? (2) How good is this estimation??? (3) Why is this a machine learning problem???
We are going to answer these questions
slide by Barnabás Póczos & Alex Smola

Question (1)

Why frequency of heads???
Because it is the maximum likelihood estimator for this problem
(clear interpretation, statistical guarantees, simple to compute).

slide by Barnabás Póczos & Alex Smola

MLE for Bernoulli distribution
Flips are i.i.d.:
– Independent events
– Identically distributed according to a Bernoulli distribution

Data: D = (x1, …, xn), a sequence of flips with αH heads and αT tails
P(Heads) = θ, P(Tails) = 1 − θ

MLE: Choose θ that maximizes the probability of the observed data:
P(D | θ) = θ^αH (1 − θ)^αT

slide by Barnabás Póczos & Alex Smola

Maximum Likelihood Estimation
MLE: Choose θ that maximizes the probability of the observed data
(independent draws, identically distributed):

θ̂_MLE = argmax_θ P(D | θ) = argmax_θ θ^αH (1 − θ)^αT

It is easier to maximize the log-likelihood:
log P(D | θ) = αH log θ + αT log(1 − θ)

Setting the derivative to zero:
αH/θ − αT/(1 − θ) = 0  ⇒  θ̂_MLE = αH / (αH + αT)

That's exactly the "Frequency of heads"

slide by Barnabás Póczos & Alex Smola
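A minimal numerical sketch of the MLE above (not from the original slides): simulate i.i.d. Bernoulli coin flips and check that the closed-form "frequency of heads" matches a brute-force maximization of the log-likelihood. The true parameter value and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.6                       # assumed "true" P(Heads), for illustration only
flips = rng.random(50) < theta_true    # 50 i.i.d. Bernoulli draws (True = heads)

alpha_H, alpha_T = flips.sum(), (~flips).sum()

def log_likelihood(theta):
    # log P(D | theta) = alpha_H log(theta) + alpha_T log(1 - theta)
    return alpha_H * np.log(theta) + alpha_T * np.log(1.0 - theta)

# Closed-form MLE: frequency of heads
theta_mle = alpha_H / (alpha_H + alpha_T)

# Sanity check: the closed form matches a grid search over theta
grid = np.linspace(0.01, 0.99, 981)
theta_grid = grid[np.argmax(log_likelihood(grid))]

print(f"alpha_H={alpha_H}, alpha_T={alpha_T}")
print(f"MLE (frequency of heads): {theta_mle:.3f}, grid-search argmax: {theta_grid:.3f}")
```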
Question (2)
How many flips do I need?
I flipped the coin 5 times: 3 heads, 2 tails. What if I had flipped 30 heads and 20 tails? Both give the same estimate, θ̂ = 3/5, but intuitively the larger sample should be more reliable.
Simple bound
Let θ* be the true parameter. For n = αH + αT flips and any ε > 0,
Hoeffding's inequality gives:

P(|θ̂ − θ*| ≥ ε) ≤ 2 e^(−2nε²)

slide by Barnabás Póczos & Alex Smola

Probably Approximately Correct (PAC) Learning

I want to know the coin parameter θ within ε = 0.1 error, with probability at least 1 − δ = 0.95. How many flips do I need?

Sample complexity: require 2 e^(−2nε²) ≤ δ, i.e. n ≥ ln(2/δ) / (2ε²)
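A small sketch (assuming the two-sided Hoeffding bound stated above) that solves 2·exp(−2nε²) ≤ δ for n and reproduces the ε = 0.1, δ = 0.05 setting:

```python
import math

def sample_complexity(eps, delta):
    # Smallest n with 2*exp(-2*n*eps^2) <= delta, i.e. n >= ln(2/delta) / (2*eps^2)
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

print(sample_complexity(eps=0.1, delta=0.05))  # ~185 flips
```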
slide by Barnabás Póczos & Alex Smola

Question (3)

Why is this a machine learning problem???
It fits the definition of learning: we improve performance (the accuracy of the predicted prob.) at some task (predicting the probability of heads) with experience (the more flips we see, the better we are).
21 slide by Barnabás Póczos & Alex SmolaWhat about continuous features?
22µ µ µ µ=0 µ µ µ µ=0 σ σ σ σ2
2 2 2σ σ σ σ2
2 2 2Let us try Gaussians…
6 5 4 3 7 8 9
slide by Barnabás Póczos & Alex SmolaMLE for Gaussian mean and variance
Choose θ = (µ, σ²) that maximizes the probability of the observed data
(independent draws, identically distributed):

µ̂_MLE = (1/n) Σ_i x_i
σ̂²_MLE = (1/n) Σ_i (x_i − µ̂)²

Note: the MLE for the variance of a Gaussian is biased
[the expected result of the estimation is not the true parameter!]

Unbiased variance estimator:
σ̂²_unbiased = (1/(n−1)) Σ_i (x_i − µ̂)²

slide by Barnabás Póczos & Alex Smola
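A quick sketch of the two variance estimators above on synthetic data (the sample values and size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10)               # small synthetic sample

mu_mle = x.mean()
var_mle = ((x - mu_mle) ** 2).mean()                      # divides by n   -> biased MLE
var_unbiased = ((x - mu_mle) ** 2).sum() / (len(x) - 1)   # divides by n-1 -> unbiased

print(mu_mle, var_mle, var_unbiased)
# Equivalently: np.var(x, ddof=0) vs. np.var(x, ddof=1)
```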
slide by Barnabás Póczos & Alex SmolaWhat about prior knowledge? (MAP Estimation)
slide by Barnabás Póczos & Aarti Singh

What about prior knowledge?

We know the coin is "close" to 50-50. What can we do now?

The Bayesian way…
Rather than estimating a single θ, we obtain a distribution over possible values of θ.
[illustration: a prior concentrated around 50-50 before data; a sharper posterior after data]

slide by Barnabás Póczos & Aarti Singh

Prior distribution
What prior? What distribution do we want for a prior?

– Uninformative priors: e.g., the uniform distribution
– Conjugate priors: P(θ) and P(θ | D) have the same form, giving a closed-form representation of the posterior

Bayes Rule

In order to proceed we will need Bayes rule and the chain rule.

slide by Barnabás Póczos & Aarti Singh

Chain Rule & Bayes Rule

Chain rule: P(A, B) = P(A | B) P(B) = P(B | A) P(A)
Bayes rule: P(A | B) = P(B | A) P(A) / P(B)

Bayes rule is important for reverse conditioning.

slide by Barnabás Póczos & Aarti Singh

Bayesian Learning

D is the measured data; our goal is to estimate the parameter θ:
P(θ | D) ∝ P(D | θ) P(θ)     (posterior ∝ likelihood × prior)
slide by Barnabás Póczos & Aarti Singh

MAP estimation for Binomial distribution

Coin flip problem. The likelihood is Binomial:
P(D | θ) ∝ θ^αH (1 − θ)^αT

If the prior is a Beta distribution, P(θ) ∝ θ^(βH − 1) (1 − θ)^(βT − 1),
then the posterior is also a Beta distribution:
P(θ | D) ∝ θ^(αH + βH − 1) (1 − θ)^(αT + βT − 1)

P(θ) and P(θ | D) have the same form! [Conjugate prior]

slide by Barnabás Póczos & Aarti Singh

Beta distribution

Beta(α, β): P(θ) ∝ θ^(α − 1) (1 − θ)^(β − 1)
More concentrated as the values of α, β increase.

slide by Barnabás Póczos & Aarti Singh

Beta conjugate prior

As we get more samples, the effect of the prior is "washed out": as n = αH + αT increases, the posterior is dominated by the data.
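A minimal sketch of the Beta-Bernoulli update above: the posterior is Beta(αH + βH, αT + βT) and the MAP estimate is its mode. The prior hyperparameters below are illustrative assumptions encoding a belief that the coin is close to 50-50.

```python
def beta_bernoulli(alpha_H, alpha_T, beta_H=50, beta_T=50):
    """Posterior and point estimates for a coin with a Beta(beta_H, beta_T) prior.

    beta_H, beta_T: assumed prior pseudo-counts expressing 'close to 50-50'.
    """
    post_H, post_T = alpha_H + beta_H, alpha_T + beta_T
    mle = alpha_H / (alpha_H + alpha_T)
    # MAP = mode of Beta(post_H, post_T), defined for post_H, post_T > 1
    map_est = (post_H - 1) / (post_H + post_T - 2)
    return mle, map_est

print(beta_bernoulli(3, 2))      # few flips: MAP stays near 0.5
print(beta_bernoulli(300, 200))  # many flips: prior washed out, MAP -> 3/5
```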
slide by Barnabás Póczos & Aarti Singh

Han Solo and Bayesian Priors

C3PO: Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!
Han: Never tell me the odds!

https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors
MLE vs. MAP

When is MAP the same as MLE?

– Maximum Likelihood estimation (MLE): choose the value that maximizes the probability of the observed data
  θ̂_MLE = argmax_θ P(D | θ)
– Maximum a posteriori (MAP) estimation: choose the value that is most probable given the observed data and the prior belief
  θ̂_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)

MAP coincides with MLE when the prior P(θ) is uniform.

slide by Barnabás Póczos & Aarti Singh
From Binomial to Multinomial

Example: dice roll problem (6 outcomes instead of 2)
– Likelihood is Multinomial(θ = {θ1, θ2, …, θk})
– If the prior is a Dirichlet distribution, then the posterior is a Dirichlet distribution
– For the Multinomial, the conjugate prior is the Dirichlet distribution
http://en.wikipedia.org/wiki/Dirichlet_distribution

slide by Barnabás Póczos & Aarti Singh

Bayesians vs. Frequentists

The two camps' mutual criticisms: "You are no good when the sample is small" vs. "You give a different answer for different priors."
slide by Barnabás Póczos & Aarti Singh

Recap: What about prior knowledge? (MAP Estimation)

Recap: What about prior knowledge?

We know the coin is "close" to 50-50. What can we do now?

The Bayesian way…
Rather than estimating a single θ, we obtain a distribution over possible values of θ.
[illustration: a prior concentrated around 50-50 before data; a sharper posterior after data]

slide by Barnabás Póczos & Aarti Singh

Recap: Chain Rule & Bayes Rule

Chain rule: P(A, B) = P(A | B) P(B) = P(B | A) P(A)
Bayes rule: P(A | B) = P(B | A) P(A) / P(B)

slide by Barnabás Póczos & Aarti Singh

Recap: Bayesian Learning

D is the measured data; our goal is to estimate the parameter θ:
P(θ | D) ∝ P(D | θ) P(θ)     (posterior ∝ likelihood × prior)

Recap: MAP estimation for Binomial distribution

In the coin flip problem:
– The likelihood is Binomial: P(D | θ) ∝ θ^αH (1 − θ)^αT
– If the prior is Beta, then the posterior is a Beta distribution

slide by Barnabás Póczos & Aarti Singh

Recap: Beta conjugate prior

As we get more samples, the effect of the prior is "washed out" as n = αH + αT increases.
slide by Barnabás Póczos & Aarti Singh

Application of Bayes Rule

AIDS test (Bayes rule)

Data: the prior probability of having AIDS and the test's true- and false-positive rates.

Probability of having AIDS if the test is positive: by Bayes rule,
P(AIDS | positive) = P(positive | AIDS) P(AIDS) / P(positive)

Only 9%!...

slide by Barnabás Póczos & Aarti Singh

Improving the diagnosis

Use a weaker follow-up test!

AIDS test (Bayes rule)

Why can't we use Test 1 twice?
The outcomes of the same test are not independent; Test 1 and Test 2 are conditionally independent (by assumption).
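The exact rates are not readable from the extracted slide, so the numbers below are assumptions chosen to reproduce the "only 9%" figure: a 0.1% prevalence, a test that always fires when the disease is present, and a 1% false-positive rate.

```python
def posterior_positive(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(disease | test positive) via Bayes rule."""
    p_pos = (p_pos_given_disease * prior
             + p_pos_given_healthy * (1.0 - prior))
    return p_pos_given_disease * prior / p_pos

# Assumed numbers (illustrative only): prevalence 0.1%, sensitivity 100%, false positives 1%
print(posterior_positive(prior=0.001, p_pos_given_disease=1.0, p_pos_given_healthy=0.01))
# ~0.09: even a fairly accurate test yields only about a 9% posterior probability
```

A second, conditionally independent test multiplies in another likelihood ratio, which is why even a weaker follow-up test helps, while running the same test twice adds no new information.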
The Naïve Bayes Classifier

slide by Barnabás Póczos & Aarti Singh

Data for spam filtering
Naïve Bayes Assumption

Naïve Bayes assumption: features X1 and X2 are conditionally independent given the class label Y:
P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
More generally: P(X1, …, Xd | Y) = ∏_i P(Xi | Y)

slide by Barnabás Póczos & Aarti Singh

Naïve Bayes Assumption, Example

Task: Predict whether or not a picnic spot is enjoyable.
Training data: X = (X1, X2, X3, …, Xd) and label Y, with n rows.

How many parameters to estimate? (X is composed of d binary features, Y has K possible class labels)
– Without any assumption: (2^d − 1)K parameters
– With the Naïve Bayes assumption: (2 − 1)dK = dK parameters

slide by Barnabás Póczos & Aarti Singh

Naïve Bayes Classifier

Given:
– Class prior P(Y)
– d conditionally independent features X1, …, Xd given the class label Y
– For each feature Xi, the conditional likelihood P(Xi | Y)

Naïve Bayes decision rule:
ŷ = argmax_y P(Y = y) ∏_{i=1..d} P(Xi = xi | Y = y)
slide by Barnabás Póczos & Aarti Singh

Naïve Bayes Algorithm for discrete features

Training data: n d-dimensional discrete feature vectors + K class labels.

We need to estimate the class prior P(Y) and the likelihoods P(Xi | Y).
Estimate them with MLE (relative frequencies)!

Estimators:
– Class prior: P̂(Y = b) = #{j : y^j = b} / n
– Likelihood: P̂(Xi = a | Y = b) = #{j : x_i^j = a and y^j = b} / #{j : y^j = b}

NB prediction for test data x = (x1, …, xd):
ŷ = argmax_b P̂(Y = b) ∏_i P̂(Xi = xi | Y = b)
slide by Barnabás Póczos & Aarti Singh

Subtlety: Insufficient training data

For example, what if you never see a training example with Xi = a and Y = b?
Then P̂(Xi = a | Y = b) = 0, and the whole product in the decision rule becomes 0, no matter what the other features say. What now???

Naïve Bayes Algorithm for discrete features

Use your expert knowledge & apply prior distributions:
– Assume priors: add m "virtual" examples of each value of Xi with Y = b
– MAP estimate: P̂(Xi = a | Y = b) = (#{j : x_i^j = a and y^j = b} + m) / (#{j : y^j = b} + m · (number of values Xi can take))

Adding one virtual example per value (m = 1) is called Laplace smoothing.
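A compact sketch of the discrete Naïve Bayes training and prediction procedure above, with the m = 1 pseudo-count (Laplace smoothing); the toy data and feature names are assumptions for illustration.

```python
from collections import Counter
import math

def train_nb(X, y, m=1):
    """X: list of tuples of discrete feature values; y: list of class labels."""
    n, d = len(X), len(X[0])
    class_counts = Counter(y)
    feature_values = [set(x[i] for x in X) for i in range(d)]
    # counts[c][i][a] = #{j : x_i^j = a and y^j = c}
    counts = {c: [Counter() for _ in range(d)] for c in class_counts}
    for x, c in zip(X, y):
        for i, a in enumerate(x):
            counts[c][i][a] += 1

    def predict(x):
        best, best_score = None, -math.inf
        for c, nc in class_counts.items():
            score = math.log(nc / n)                       # log class prior
            for i, a in enumerate(x):
                num = counts[c][i][a] + m                  # Laplace smoothing
                den = nc + m * len(feature_values[i])
                score += math.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best

    return predict

# Toy picnic-style data (illustrative): (sunny?, warm?) -> enjoyable?
X = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "yes")]
y = ["enjoy", "enjoy", "avoid", "avoid"]
predict = train_nb(X, y)
print(predict(("yes", "yes")))   # -> "enjoy"
```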
slide by Barnabás Póczos & Aarti Singh

Case Study: Text Classification

Is this spam?

Positive or negative movie review?
[example review snippets: "…and some great plot twists", "…filmed", "…boxing scenes"]

slide by Dan Jurafsky

What is the subject of this article?
MeSH Subject Category Hierarchy: which category does a MEDLINE article belong to?

slide by Dan Jurafsky

Text Classification

Text Classification: definition
– Input: a document d and a fixed set of classes C = {c1, …, cJ}
– Output: a predicted class c ∈ C

Classification method: hand-coded rules
– Rules based on combinations of words or other features (e.g., spam: sender on a black list OR ("dollars" AND "have been selected"))
– Accuracy can be high if the rules are carefully refined by an expert
– But building and maintaining these rules is expensive

slide by Dan Jurafsky

Text Classification and Naive Bayes
What are the features X? The text! Let Xi represent the ith word in the document.

slide by Barnabás Póczos & Aarti Singh

NB for Text Classification

A problem: the support of P(X|Y) is huge!
– An article has at least 1000 words: X = {X1, …, X1000}
– Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster's Dictionary (or more): Xi ∈ {1, …, 50000}
⇒ K(50000^1000 − 1) parameters to estimate without the NB assumption…

slide by Barnabás Póczos & Aarti Singh

NB for Text Classification

The NB assumption helps a lot!!!
If P(Xi = xi | Y = y) is the probability of observing word xi at the ith position in a document on topic y,
⇒ 1000·K·(50000 − 1) parameters to estimate with the NB assumption.
The NB assumption helps, but that is still a lot of parameters to estimate.

slide by Barnabás Póczos & Aarti Singh

Bag of words model

Typical additional assumption: position in the document doesn't matter:
P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
– "Bag of words" model: the order of words on the page is ignored
– The document is just a bag of words: i.i.d. words
– Sounds really silly, but often works very well!

The probability of a document with words x1, x2, … on topic y:
P(x1, …, xn | y) = ∏_i P(w = xi | y)
⇒ K(50000 − 1) parameters to estimate.
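A tiny sketch of the bag-of-words reduction: the document is represented only by its word counts, so word order is discarded (the example sentence is an assumption).

```python
from collections import Counter

doc = "I love this movie I recommend it I love it"
bag = Counter(doc.lower().split())
print(bag)   # order is gone; only counts remain, e.g. {'i': 3, 'love': 2, 'it': 2, ...}
```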
slide by Barnabás Póczos & Aarti Singh

The bag of words representation

"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale… just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet."

slide by Dan Jurafsky

The bag of words representation

x love xxxxxxxxxxxxxxxx sweet xxxxxxx satirical xxxxxxxxxx xxxxxxxxxxx great xxxxxxx xxxxxxxxxxxxxxxxxxx fun xxxx xxxxxxxxxxxxx whimsical xxxx romantic xxxx laughing xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxx recommend xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx x several xxxxxxxxxxxxxxxxx xxxxx happy xxxxxxxxx again xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx

slide by Dan Jurafsky

The bag of words representation: using a subset of words

great 2, love 2, recommend 1, laugh 1, happy 1, …

slide by Dan Jurafsky
The bag of words representation
slide by Dan Jurafsky

Worked example: choosing a class

Doc  Words                                 Class
Training
1    Chinese Beijing Chinese               c
2    Chinese Chinese Shanghai              c
3    Chinese Macao                         c
4    Tokyo Japan Chinese                   j
Test
5    Chinese Chinese Chinese Tokyo Japan   ?

Estimators (with Laplace smoothing):
P̂(c) = Nc / N
P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|)

Priors: P(c) = 3/4, P(j) = 1/4

Conditional probabilities:
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c)   = (0+1) / (8+6) = 1/14
P(Japan|c)   = (0+1) / (8+6) = 1/14
P(Chinese|j) = (1+1) / (3+6) = 2/9
P(Tokyo|j)   = (1+1) / (3+6) = 2/9
P(Japan|j)   = (1+1) / (3+6) = 2/9

Choosing a class:
P(c|d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001
⇒ predict class c
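A short sketch reproducing the worked example above with the smoothed estimators P̂(c) = Nc/N and P̂(w|c) = (count(w,c)+1)/(count(c)+|V|):

```python
from collections import Counter

train = [("Chinese Beijing Chinese", "c"),
         ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"),
         ("Tokyo Japan Chinese", "j")]
test = "Chinese Chinese Chinese Tokyo Japan"

docs_per_class = Counter(c for _, c in train)
words_per_class = {c: Counter() for c in docs_per_class}
for text, c in train:
    words_per_class[c].update(text.split())
vocab = {w for text, _ in train for w in text.split()}   # |V| = 6

scores = {}
for c, n_c in docs_per_class.items():
    prior = n_c / len(train)                             # P(c) = Nc / N
    likelihood = 1.0
    for w in test.split():
        # Laplace-smoothed P(w | c) = (count(w,c)+1) / (count(c)+|V|)
        likelihood *= (words_per_class[c][w] + 1) / (sum(words_per_class[c].values()) + len(vocab))
    scores[c] = prior * likelihood

print(scores)   # {'c': ~0.0003, 'j': ~0.0001} -> predict class c
```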
slide by Dan Jurafsky

What if features are continuous?
E.g., image classification: Xi is the ith pixel.

Gaussian Naïve Bayes (GNB):
P(Xi = x | Y = k) = N(x; µ_ik, σ_ik²)
Different mean and variance for each class k and each pixel i.
Sometimes the variance is assumed to be independent of the class, of the pixel, or of both.

slide by Barnabás Póczos & Aarti Singh

Estimating parameters: Y discrete, Xi continuous

Maximum likelihood estimates:
µ̂_ik = mean of x_i^j over the training images j with class label y^j = k
σ̂²_ik = variance of x_i^j over the training images j with class label y^j = k
(x_i^j: ith pixel in the jth training image; k: class index)
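A minimal sketch of the Gaussian NB estimators above: per-class, per-feature means and variances, with synthetic data standing in for pixels (all values below are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "images": 100 examples, 5 features (pixels), 2 classes
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
X[y == 1] += 1.0              # shift class-1 means so there is something to learn

mu = np.stack([X[y == k].mean(axis=0) for k in (0, 1)])   # mu[k, i]
var = np.stack([X[y == k].var(axis=0) for k in (0, 1)])   # sigma^2[k, i] (MLE)

def log_gaussian(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def predict(x, prior=(0.5, 0.5)):
    # NB decision rule: argmax_k  log P(Y=k) + sum_i log N(x_i; mu_ki, sigma_ki^2)
    scores = [np.log(prior[k]) + log_gaussian(x, mu[k], var[k]).sum() for k in (0, 1)]
    return int(np.argmax(scores))

# Compare the prediction with the true label for one training example
print(predict(X[0]), y[0])
```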
slide by Barnabás Póczos & Aarti Singh

Twenty Newsgroups results

Naïve Bayes: 89% accuracy

slide by Barnabás Póczos & Aarti Singh

Case Study: Classifying Mental States
Example: GNB for classifying mental states [Mitchell et al.]

fMRI:
– ~1 mm resolution
– ~2 images per sec.
– 15,000 voxels/image
– non-invasive, safe
– measures Blood Oxygen Level Dependent (BOLD) response
– can track activation with precision and sensitivity

slide by Barnabás Póczos & Aarti Singh

Learned Naïve Bayes Models – Means for P(BrainActivity | WordCategory)

Pairwise classification accuracy: 78-99%, 12 participants
[figure: mean activation maps for "Tool words" vs. "Building words"] [Mitchell et al.]
slide by Barnabás Póczos & Aarti Singh

What you should know…

– Naïve Bayes classifier
– Text classification
– Gaussian NB