Machine Learning: Foundations
Lecturer: Yishay Mansour
Lecture 2: Bayesian Inference
Scribes: Kfir Bar, Yaniv Bar, Marcelo Bacher
Based on notes by Shahar Yifrah, Keren Yizhak, Hadas Zur (2012)
Bayesian Inference
Bayesian inference is a method of statistical inference that uses a prior
probability over hypotheses to determine the likelihood that a hypothesis is true given observed evidence.
Three methods:
- ML - Maximum Likelihood rule
- MAP - Maximum A Posteriori rule
- Bayes - Posterior rule
Bayes Rule
In Bayesian inference:
data - the known information
h - a hypothesis/classification regarding the data distribution
We use Bayes rule to compute the likelihood that our hypothesis is true:
$$\Pr[h \mid \text{data}] = \frac{\Pr[\text{data} \mid h]\,\Pr[h]}{\Pr[\text{data}]}$$
In general, for any two events A and B:
$$\Pr[A \mid B] = \frac{\Pr[B \mid A]\,\Pr[A]}{\Pr[B]}$$
Example 1: Cancer Detection
A hospital is examining a new cancer detection kit.
The known information (prior) is as follows:
- A patient with cancer has a 98% chance of a positive result
- A healthy patient has a 97% chance of a negative result
- The probability of cancer in the general population is 1%
How reliable is this kit?
Example 1: Cancer Detection
Let’s calculate Pr[cancer | +].
According to Bayes rule we get:
$$\Pr[\text{cancer} \mid +] = \frac{\Pr[+ \mid \text{cancer}]\,\Pr[\text{cancer}]}{\Pr[+]} = \frac{0.98 \cdot 0.01}{0.98 \cdot 0.01 + 0.03 \cdot 0.99} = \frac{0.0098}{0.0395} \approx 0.25$$
Example 1: Cancer Detection
Surprisingly, the test, although it seems very accurate with high detection
probabilities of 97-98%, is almost useless:
3 out of 4 patients found sick by the test are actually healthy. For a low error,
we could simply tell everyone they do not have cancer, which is right in 99% of the cases.
The low posterior probability comes from the low prior probability
of cancer in the general population (1%).
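To make the arithmetic concrete, here is a minimal Python sketch (not part of the original notes; the function name and default values are illustrative) that plugs the stated priors into Bayes rule:

```python
def posterior_positive(p_pos_given_cancer=0.98, p_neg_given_healthy=0.97, p_cancer=0.01):
    """Pr[cancer | positive test] for the kit described above."""
    p_pos_given_healthy = 1.0 - p_neg_given_healthy          # 3% false-positive rate
    p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1.0 - p_cancer)
    return p_pos_given_cancer * p_cancer / p_pos

print(posterior_positive())   # ~0.248: only about 1 in 4 positive results is a true cancer case
```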
Example 2: Normal Distribution
A random variable Z is distributed normally with mean $\mu$ and variance $\sigma^2$, i.e., $Z \sim N(\mu, \sigma^2)$.
Recall:
$$p(z) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(z-\mu)^2}{2\sigma^2}\right)$$
Example 2: Normal Distribution
We have $m$ i.i.d. samples $z_1, \ldots, z_m$ of the random variable Z. The likelihood of the data D given a hypothesis $h = (\mu, \sigma)$ is:
$$\Pr[D \mid h] = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(z_i-\mu)^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{m/2}}\,\exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(z_i-\mu)^2\right)$$
where $(2\pi\sigma^2)^{-m/2}$ is a normalization factor.
Example 2: Normal Distribution
Maximum Likelihood (ML):
We aim to choose the hypothesis which best explains the sample, independent of the prior over the hypothesis space (i.e., the parameters that maximize the likelihood of the sample). In our case:
$$h_{ML} = \arg\max_{\mu,\sigma} \Pr[D \mid \mu, \sigma]$$
Example 2: Normal Distribution
Maximum Likelihood (ML):
We take a log to simplify the computation:
$$\ln \Pr[D \mid \mu, \sigma] = -\frac{m}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(z_i-\mu)^2$$
Now we find the maximum for $\mu$:
$$\frac{\partial}{\partial \mu}\ln \Pr[D \mid \mu, \sigma] = \frac{1}{\sigma^2}\sum_{i=1}^{m}(z_i-\mu) = 0 \;\;\Rightarrow\;\; \hat\mu = \frac{1}{m}\sum_{i=1}^{m} z_i$$
It is easy to see that the second derivative is negative, thus this is a maximum.
Example 2: Normal Distribution
Maximum Likelihood (ML): Note that this value of $\hat\mu$ is independent of the value of $\sigma$ and is
simply the average of the observations.
Now we compute the maximum for $\sigma$ given that $\mu$ is $\hat\mu$:
$$\frac{\partial}{\partial \sigma}\ln \Pr[D \mid \hat\mu, \sigma] = -\frac{m}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{m}(z_i-\hat\mu)^2 = 0 \;\;\Rightarrow\;\; \hat\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(z_i-\hat\mu)^2$$
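A minimal numpy sketch (an illustration, not from the notes; the "true" parameters of the simulated data are assumed) that checks these closed-form ML estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.5, size=1000)    # m i.i.d. samples from an assumed N(2.0, 1.5^2)

mu_ml = z.mean()                                 # hat(mu): the average of the observations
sigma_ml = np.sqrt(np.mean((z - mu_ml) ** 2))    # hat(sigma)^2 = (1/m) * sum (z_i - hat(mu))^2

print(mu_ml, sigma_ml)                           # should be close to 2.0 and 1.5
```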
Example 2: Normal Distribution
Maximum A Posteriori (MAP):
MAP adds a prior over the hypotheses. In this example, the prior distributions of $\mu$ and $\sigma$ are $N(0,1)$ and are now taken into account:
$$h_{MAP} = \arg\max_{\mu,\sigma} \Pr[\mu,\sigma \mid D] = \arg\max_{\mu,\sigma} \frac{\Pr[D \mid \mu,\sigma]\,\Pr[\mu]\,\Pr[\sigma]}{\Pr[D]}$$
Since $\Pr[D]$ is constant for all hypotheses we can omit it, and have the following:
$$h_{MAP} = \arg\max_{\mu,\sigma} \Pr[D \mid \mu,\sigma]\,\Pr[\mu]\,\Pr[\sigma]$$
Example 2: Normal Distribution
Maximum A Posteriori (MAP): How will the result we got in the ML approach change for MAP? We have
added the knowledge that σ and μ are small and around zero, since the prior is σ, μ ∼ N(0,1).
Therefore, the result (the hypothesis regarding σ and μ) should be
closer to 0 than the one we got in ML.
Example 2: Normal Distribution
Maximum A Posteriori (MAP):
Now we should maximize the likelihood and the prior terms simultaneously:
$$h_{MAP} = \arg\max_{\mu,\sigma}\left[\ln \Pr[D \mid \mu,\sigma] - \frac{\mu^2}{2} - \frac{\sigma^2}{2}\right]$$
It can be easily seen that μ and σ will be closer to zero than in the ML approach, since the prior terms penalize values far from zero.
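A hedged numerical sketch (my own illustration; it assumes scipy is available and uses made-up data) that maximizes this log posterior and compares it to the ML estimate; with the N(0,1) priors both parameters shrink toward zero:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.5, size=20)       # a small sample, so the prior has a visible effect

def neg_log_posterior(params):
    mu, sigma = params
    log_lik = -len(z) * np.log(sigma) - np.sum((z - mu) ** 2) / (2 * sigma ** 2)
    log_prior = -mu ** 2 / 2 - sigma ** 2 / 2      # mu, sigma ~ N(0,1), up to additive constants
    return -(log_lik + log_prior)

map_est = minimize(neg_log_posterior, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])
print("ML :", z.mean(), z.std())                   # maximum likelihood estimates
print("MAP:", map_est.x)                           # both values pulled toward zero
```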
Example 2: Normal Distribution
Posterior (Bayes):
Assume $\mu \sim N(\eta, 1)$ and $Z \sim N(\mu, 1)$, i.e., $\sigma = 1$. We see only one sample $z$ of Z. What is the new posterior distribution of $\mu$? $\Pr[Z]$ is a normalizing factor, so we can drop it in the calculation:
$$\Pr[\mu \mid z] \;\propto\; \Pr[z \mid \mu]\,\Pr[\mu] = \frac{1}{\sqrt{2\pi}}e^{-\frac{(z-\mu)^2}{2}} \cdot \frac{1}{\sqrt{2\pi}}e^{-\frac{(\mu-\eta)^2}{2}}$$
Example 2: Normal Distribution
Posterior (Bayes):
Completing the square,
$$\Pr[\mu \mid z] \;\propto\; \exp\!\left(-\frac{(z-\mu)^2 + (\mu-\eta)^2}{2}\right) = \exp\!\left(-\left(\mu - \frac{z+\eta}{2}\right)^2\right)\cdot \exp\!\left(-\frac{(z-\eta)^2}{4}\right)$$
where the last factor is a normalization factor that does not depend on $\mu$.
Example 2: Normal Distribution
Posterior (Bayes):
The new posterior distribution is
$$\mu \mid z \;\sim\; N\!\left(\frac{\eta + z}{2},\; \frac{1}{2}\right)$$
After taking into account the sample $z$, the mean of $\mu$ moves towards $z$ and the variance is reduced (from 1 to 1/2).
Example 2: Normal Distribution
Posterior (Bayes):
In general, for a prior $\mu \sim N(\eta, S^2)$ and $Z \sim N(\mu, \sigma^2)$, given $m$ samples $z_1, \ldots, z_m$ we have:
$$\mu \mid z_1,\ldots,z_m \;\sim\; N\!\left(\frac{\frac{\eta}{S^2} + \frac{1}{\sigma^2}\sum_{i=1}^m z_i}{\frac{1}{S^2} + \frac{m}{\sigma^2}},\;\; \frac{1}{\frac{1}{S^2} + \frac{m}{\sigma^2}}\right)$$
Example 2: Normal Distribution
Posterior (Bayes):
And if we assume $S = \sigma$, we get:
$$\mu \mid z_1,\ldots,z_m \;\sim\; N\!\left(\frac{\eta + \sum_{i=1}^m z_i}{m+1},\;\; \frac{\sigma^2}{m+1}\right)$$
which is like starting with an additional sample of value $\eta$, i.e., the prior mean is treated as one extra observation.
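A small sketch (illustrative, not from the notes; the true mean of the simulated data is assumed) of this conjugate update for the case $S = \sigma = 1$, showing the posterior mean pulled toward the sample mean and the variance shrinking like 1/(m+1):

```python
import numpy as np

def posterior_mu(samples, eta=0.0):
    """Posterior of mu for the conjugate case mu ~ N(eta, 1), Z ~ N(mu, 1)."""
    m = len(samples)
    mean = (eta + np.sum(samples)) / (m + 1)     # the prior mean eta acts like one extra sample
    var = 1.0 / (m + 1)
    return mean, var

rng = np.random.default_rng(1)
z = rng.normal(loc=3.0, scale=1.0, size=50)      # assumed true mu = 3.0
print(posterior_mu(z))                            # mean close to 3.0, variance 1/51
```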
Learning A Concept Family (1/2)
We are given a concept family H. Our information consists of examples $(x_i, f(x_i))$, where $f \in H$ is an
unknown target function that classifies all samples.
Assumptions:
(1) The functions in H are deterministic, i.e., $\Pr[h(x)=1] \in \{0,1\}$ for every $h \in H$.
(2) The process that generates the inputs is independent of the target function f.
For each $h \in H$ we will calculate $\Pr[S \mid h]$, where $S = \{(x_i, b_i)\}_{i=1}^{m}$ and $b_i = f(x_i)$.
Case 1: $\exists i:\; h(x_i) \neq b_i$. Then $\Pr[(x_i, b_i) \mid h] = 0$, and therefore $\Pr[S \mid h] = 0$.
Case 2: $\forall i:\; h(x_i) = b_i$. Then
$$\Pr[S \mid h] = \prod_{i=1}^{m} \Pr[(x_i, b_i) \mid h] = \prod_{i=1}^{m} \Pr[b_i \mid x_i, h]\,\Pr[x_i \mid h] = \prod_{i=1}^{m} \Pr[x_i] = \Pr[S]$$
Learning A Concept Family (2/2)
Definition: A consistent function classifies all the samples in
S correctly, i.e., $\forall (x_i, b_i) \in S:\; h(x_i) = b_i$.
Let $H' \subseteq H$ be the set of all functions that are consistent with S.
There are three methods to choose a predictor based on H':
- ML - choose any consistent function; each one has the same likelihood $\Pr[S \mid h]$.
- MAP - choose the consistent function with the highest prior probability.
- Bayes - a weighted combination of all consistent functions into one predictor:
$$B(y) = \sum_{h \in H'} \frac{\Pr[h]}{\Pr[H']}\, h(y)$$
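A toy sketch (my illustration; the threshold hypothesis class, the prior, and the data are all assumed) of the three rules over a finite H: ML picks any consistent hypothesis, MAP the most probable consistent one, and Bayes averages all consistent hypotheses weighted by their prior:

```python
import numpy as np

# Hypothesis class: threshold functions h_t(x) = 1 if x >= t, else 0.
thresholds = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])            # assumed prior over H

# Labeled sample S (generated by some unknown threshold).
xs = np.array([0.1, 0.9])
bs = np.array([0, 1])

consistent = np.array([np.all((xs >= t).astype(int) == bs) for t in thresholds])

ml_t = thresholds[consistent][0]                              # ML: any consistent hypothesis
map_t = thresholds[consistent][np.argmax(prior[consistent])]  # MAP: largest prior within H'

def bayes_predict(y):
    """Weighted vote of all consistent hypotheses; weights = prior restricted to H'."""
    weights = prior[consistent] / prior[consistent].sum()
    votes = (y >= thresholds[consistent]).astype(int)
    return float(weights @ votes)                      # in [0,1]; threshold at 1/2 to classify

print(ml_t, map_t, bayes_predict(0.4))
```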
Example (Biased Coins)
We toss a coin n times and the coin ends up heads k times. We want to estimate the probability p that the coin will come up heads in
the next toss.
The probability that k out of n coin tosses will come up heads is:
$$\Pr[(k,n) \mid p] = \binom{n}{k}\,p^k (1-p)^{n-k}$$
With the Maximum Likelihood approach, one would choose the p that
maximizes $\Pr[(k,n) \mid p]$, which is $p = \frac{k}{n}$.
Yet this result seems unreasonable when n is small.
(For example, if you toss the coin only once and get a tail, should you believe that it is impossible to get a head on the next toss?)
Laplace Rule (1/3)
Let us suppose a uniform prior distribution on p. That is, the prior
distribution over all the possible coins is uniform: $\Pr[p \in [x, x+dx]] = dx$.
We will calculate the probability of seeing k heads out of n tosses:
$$\Pr[(k,n)] = \int_0^1 \Pr[(k,n) \mid p]\,dp = \int_0^1 \binom{n}{k}\,p^k (1-p)^{n-k}\,dp$$
Integrating by parts,
$$\int_0^1 \binom{n}{k}\,p^k (1-p)^{n-k}\,dp = \int_0^1 \binom{n}{k+1}\,p^{k+1} (1-p)^{n-k-1}\,dp = \Pr[(k+1,n)]$$
Laplace Rule (2/3)
Comparing both ends of the above sequence of equalities we realize that
all these probabilities are equal, and therefore, for any k,
$$\Pr[(k,n)] = \frac{1}{n+1}$$
Intuitively, it means that for a random choice of the bias p, any possible number of heads in a sequence of n coin tosses is equally likely.
We want to calculate the posterior expectation $E[p \mid s(k,n)]$, where
$s(k,n)$ is a specific sequence with k heads out of n tosses.
We have,
$$\Pr[s(k,n) \mid p] = p^k (1-p)^{n-k}, \qquad \Pr[s(k,n)] = \int_0^1 p^k (1-p)^{n-k}\,dp = \frac{1}{(n+1)\binom{n}{k}}$$
Laplace Rule (3/3)
Hence,
$$E[p \mid s(k,n)] = \int_0^1 p\,\frac{\Pr[s(k,n) \mid p]\,\Pr[p]}{\Pr[s(k,n)]}\,dp = (n+1)\binom{n}{k}\int_0^1 p^{k+1}(1-p)^{n-k}\,dp = \frac{(n+1)\binom{n}{k}}{(n+2)\binom{n+1}{k+1}} = \frac{k+1}{n+2}$$
Intuitively, the Laplace correction is like adding two samples to the ML
estimator, one with value 0 and one with value 1.
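A quick numerical sketch (illustrative only) contrasting the ML estimate k/n with the Laplace estimate (k+1)/(n+2), which behaves like adding one fictitious head and one fictitious tail to the counts:

```python
def ml_estimate(k, n):
    return k / n

def laplace_estimate(k, n):
    return (k + 1) / (n + 2)      # add one fictitious head and one fictitious tail

# One toss that comes up tails: ML says heads are impossible, Laplace still allows them.
print(ml_estimate(0, 1), laplace_estimate(0, 1))    # 0.0 vs 0.333...
print(ml_estimate(7, 10), laplace_estimate(7, 10))  # 0.7 vs 0.666...
```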
Loss Function
In lecture #1 we defined a few loss functions, among them was the
logarithmic loss function. We will use it to compare our different approaches.
When considering a loss function we should note that there are two
sources of loss:
1. Bayes Risk - the loss we cannot avoid, since we are bound to incur it even if
we know the target concept. Example: the biased coin problem revisited. Even if we knew the bias p, we would always predict 0 (if $p \le \frac{1}{2}$), which, on average, results in $pn$ mistakes.
2. Regret - the loss due to incorrect estimation of the target concept
(having to learn an unknown model).
Using Log Loss Function (1/3)
A commonly used loss function is the LogLoss, which for the biased coin
problem states that if the learner guesses that the bias is p then the loss will be
$$\log\frac{1}{p} \;\text{ when the outcome is 1 (= heads)}, \qquad \log\frac{1}{1-p} \;\text{ when the outcome is 0 (= tails)}$$
If the true bias is $\lambda$ then the expected LogLoss is
$$\lambda\log\frac{1}{p} + (1-\lambda)\log\frac{1}{1-p}$$
which attains its minimum at $p = \lambda$.
Let us consider the loss at $p = \lambda$,
$$H(\lambda) = \lambda\log\frac{1}{\lambda} + (1-\lambda)\log\frac{1}{1-\lambda}$$
which is known in the Information Theory literature as the entropy of $\lambda$; this is essentially the Bayes Risk.
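A short numerical check (my own sketch; λ denotes the true bias as above, with an assumed value of 0.3) that the expected LogLoss is minimized at p = λ, where it equals the entropy H(λ):

```python
import numpy as np

lam = 0.3                                          # assumed true bias
p = np.linspace(0.01, 0.99, 999)                   # candidate guesses
expected_logloss = lam * np.log(1 / p) + (1 - lam) * np.log(1 / (1 - p))

best_p = p[np.argmin(expected_logloss)]
entropy = lam * np.log(1 / lam) + (1 - lam) * np.log(1 / (1 - lam))
print(best_p, expected_logloss.min(), entropy)     # best_p ~ 0.3, minimum ~ H(0.3)
```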
Using Log Loss Function (2/3)
Recall that we cannot do any better than the Bayes Risk, since it is the loss
we cannot avoid.
How far are we from the Bayes Risk when using the guess of p given
by the Laplace Rule?
Summing the expected LogLoss of the Laplace estimator over T coin flips, one can show that
$$E[\mathrm{LogLoss}] \;\le\; T\cdot H(\lambda) + c\log T \;=\; \text{Bayes Risk} + O(\log T)$$
for some constant c.
Using Log Loss Function (3/3)
In the above we used the fact that
$$\frac{1}{n}\sum_{i=1}^{n} H\!\left(\frac{i}{n}\right) \;\approx\; \int_0^1 H(x)\,dx,$$
where the difference between the corresponding upper and lower bounds is at most $\frac{1}{n}H\!\left(\frac{1}{2}\right)$.
Hence, we showed that by applying the Laplace Rule, we attain the
optimal loss (the Bayes Risk) with an additional regret which is only
logarithmic in the number of coin flips (T). Note that the Bayes Risk by itself grows linearly with T.
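To illustrate the claim, here is a simulation sketch (my own, not from the notes; the bias and horizon are assumed values): it runs the sequential Laplace predictor on random coin flips and compares its cumulative LogLoss to T·H(λ); the gap (the regret) stays roughly logarithmic in T:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, T = 0.3, 10000                                # assumed true bias and number of flips
flips = rng.random(T) < lam                        # True = heads

logloss, heads = 0.0, 0
for t, outcome in enumerate(flips):
    p_hat = (heads + 1) / (t + 2)                  # Laplace prediction from the first t flips
    logloss += -np.log(p_hat) if outcome else -np.log(1 - p_hat)
    heads += int(outcome)

bayes_risk = T * (lam * np.log(1 / lam) + (1 - lam) * np.log(1 / (1 - lam)))
print(logloss, bayes_risk, logloss - bayes_risk)   # the regret is tiny compared to T
```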
Naïve Bayes Classification: Binary Domain (1/2)
Consider two binary classes, +1 and -1, where each example is described by n attributes.
Each attribute $x_i$ is a binary variable with possible values {0,1}.
[Example dataset: a table whose columns are the attributes $x_1, x_2, \ldots, x_n$ and the class C, with one labeled example (class -1 or +1) per row.]
We are looking for a hypothesis h that maps $x_1, x_2, \ldots, x_n$ to {-1,+1} by choosing the class with the largest posterior probability, where by Bayes rule:
$$\Pr(C = +1 \mid x_1, x_2, \ldots, x_n) = \frac{\Pr(x_1, x_2, \ldots, x_n \mid C = +1)\,\Pr(C = +1)}{\Pr(x_1, x_2, \ldots, x_n)}$$
Naïve Bayes Classification: Binary Domain (2/2)
The term $\Pr(C = +1)$ can be easily estimated from the data; the joint term $\Pr(x_1, x_2, \ldots, x_n \mid C)$ can be estimated directly only if n is not too large.
Naive Bayes is based on an independence assumption: each attribute $x_i$ is independent of the other attributes once we know the
value of C. Therefore:
$$\Pr(x_1, x_2, \ldots, x_n \mid C) = \prod_{i=1}^{n}\Pr(x_i \mid C)$$
For each $1 \le i \le n$ we have two parameters:
$$\theta_{i \mid +1} = \Pr(x_i = 1 \mid C = +1), \qquad \theta_{i \mid -1} = \Pr(x_i = 1 \mid C = -1)$$
Assuming a simple Binomial estimation we can estimate these two
parameters: count the number of instances with $x_i = 1$ and with $x_i = 0$ among the instances
where C = +1 or C = -1, respectively. The concentration inequalities of Markov, Chebyshev and Chernoff can be used to bound the deviation of the
observed average from the mean (see the end of the lecture 1 notes for more
information).
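A compact sketch (illustrative; the dataset and helper names are mine) of estimating these parameters from a binary dataset and classifying with the Naive Bayes rule, using Laplace-corrected counts to avoid zero probabilities:

```python
import numpy as np

def train_nb(X, y):
    """X: (m, n) binary attribute matrix, y: labels in {-1, +1}. Returns prior and theta per class."""
    params = {}
    for c in (-1, +1):
        Xc = X[y == c]
        prior = len(Xc) / len(X)
        theta = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)   # Laplace rule for Pr(x_i = 1 | C = c)
        params[c] = (prior, theta)
    return params

def predict_nb(params, x):
    scores = {}
    for c, (prior, theta) in params.items():
        # log Pr(C=c) + sum_i log Pr(x_i | C=c), under the independence assumption
        scores[c] = np.log(prior) + np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))
    return max(scores, key=scores.get)

# Tiny assumed dataset: 3 binary attributes, 6 labeled examples.
X = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]])
y = np.array([+1, +1, +1, -1, -1, -1])
model = train_nb(X, y)
print(predict_nb(model, np.array([1, 1, 0])))          # expected: +1
```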
Naïve Bayes - Interpretation (1/2)
According to the Bayes and MAP approaches we need to compare two values,
$$\Pr(C = +1 \mid x_1,\ldots,x_n) \quad\text{and}\quad \Pr(C = -1 \mid x_1,\ldots,x_n),$$
and choose the class with the larger probability. This is equivalent to taking the log of
their ratio and comparing it to 0:
$$\log\frac{\Pr(C=+1 \mid x_1,\ldots,x_n)}{\Pr(C=-1 \mid x_1,\ldots,x_n)} = \log\frac{\Pr(x_1,\ldots,x_n \mid C=+1)\,\Pr(C=+1)}{\Pr(x_1,\ldots,x_n \mid C=-1)\,\Pr(C=-1)}$$
Using the independence assumption, we conclude that:
$$\log\frac{\Pr(C=+1 \mid x_1,\ldots,x_n)}{\Pr(C=-1 \mid x_1,\ldots,x_n)} = \log\frac{\Pr(C=+1)}{\Pr(C=-1)} + \sum_{i=1}^{n}\log\frac{\Pr(x_i \mid C=+1)}{\Pr(x_i \mid C=-1)}$$
Naïve Bayes - Interpretation (2/2)
Each $x_i$ “votes” about the prediction:
- If $\Pr(x_i \mid C=+1) = \Pr(x_i \mid C=-1)$ then $x_i$ has “no say” in the classification.
- If $\Pr(x_i \mid C=+1) = 0$ (or $\Pr(x_i \mid C=-1) = 0$) then $x_i$ overrides all other votes (a “veto”).
Let us denote:
$$w_i = \log\frac{\Pr(x_i=1 \mid C=+1)}{\Pr(x_i=1 \mid C=-1)} - \log\frac{\Pr(x_i=0 \mid C=+1)}{\Pr(x_i=0 \mid C=-1)}, \qquad b = \log\frac{\Pr(C=+1)}{\Pr(C=-1)} + \sum_{i=1}^{n}\log\frac{\Pr(x_i=0 \mid C=+1)}{\Pr(x_i=0 \mid C=-1)}$$
The classification rule becomes a separating hyperplane:
$$h(x) = \mathrm{sign}\!\left(b + \sum_{i=1}^{n} w_i x_i\right)$$
say class +1 if the argument is > 0, class -1 if it is < 0, and either class if it is = 0.
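The same weights can be computed directly from estimated Naive Bayes parameters; here is a self-contained sketch (the parameter values are assumed, in the same spirit as the earlier example) showing that the linear rule reproduces the posterior comparison:

```python
import numpy as np

# Assumed parameters: params[c] = (prior, theta) with theta_i = Pr(x_i = 1 | C = c).
params = {+1: (0.5, np.array([0.8, 0.6, 0.4])),
          -1: (0.5, np.array([0.2, 0.4, 0.6]))}

def nb_to_hyperplane(params):
    prior_pos, theta_pos = params[+1]
    prior_neg, theta_neg = params[-1]
    w = np.log(theta_pos / theta_neg) - np.log((1 - theta_pos) / (1 - theta_neg))
    b = np.log(prior_pos / prior_neg) + np.sum(np.log((1 - theta_pos) / (1 - theta_neg)))
    return w, b

w, b = nb_to_hyperplane(params)
x = np.array([1, 1, 0])
print(np.sign(b + w @ x))   # +1 here; the sign of b + w.x reproduces the argmax over the two classes
```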
Naïve Bayes – Practical Considerations
- Easy to estimate the parameters (each one has many samples)
- A relatively naive model
- Very simple to implement
- Reasonable performance (pretty often)
Naïve Bayes – Normal Distribution (1/3)
The normal distribution is also commonly called the Gaussian distribution. Recall that if $X \sim N(\mu, \sigma^2)$ then
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad \Pr(a \le x \le b) = \int_a^b p(x)\,dx, \qquad \mathrm{Var}(x) = E\!\left[(x - E[x])^2\right] = E[x^2] - (E[x])^2$$
We keep the independence assumption:
$$\Pr(x_1, x_2, \ldots, x_n \mid C) = \prod_{i=1}^{n}\Pr(x_i \mid C)$$
In addition, we make the following assumptions:
- The mean of $x_i$ depends on the class, i.e., $\mu_{i,C}$.
- The variance of $x_i$ does not depend on the class, i.e., $\sigma_i$.
That is, $\Pr(x_i \mid C) \sim N(\mu_{i,C}, \sigma_i^2)$.
Naïve Bayes – Normal Distribution (2/3)
As before,
$$\log\frac{\Pr(C=+1 \mid x_1,\ldots,x_n)}{\Pr(C=-1 \mid x_1,\ldots,x_n)} = \log\frac{\Pr(C=+1)}{\Pr(C=-1)} + \sum_{i=1}^{n}\log\frac{\Pr(x_i \mid C=+1)}{\Pr(x_i \mid C=-1)}$$
Using the Gaussian parameters,
$$\log\frac{\Pr(x_i \mid C=+1)}{\Pr(x_i \mid C=-1)} = \log\frac{e^{-\frac{(x_i-\mu_{i,+1})^2}{2\sigma_i^2}}}{e^{-\frac{(x_i-\mu_{i,-1})^2}{2\sigma_i^2}}} = \frac{(x_i-\mu_{i,-1})^2 - (x_i-\mu_{i,+1})^2}{2\sigma_i^2} = \frac{\mu_{i,+1}-\mu_{i,-1}}{\sigma_i^2}\left(x_i - \frac{\mu_{i,+1}+\mu_{i,-1}}{2}\right)$$
The first factor is the (scaled) distance between the class means; the second is the distance of $x_i$ from the midway point between them. The classification rule is therefore again linear in the $x_i$.
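A small sketch (all parameter values below are assumed, for illustration) of this per-attribute linear term for Gaussian attributes with a shared per-attribute variance:

```python
import numpy as np

# Assumed per-attribute Gaussian parameters (variance shared between the two classes).
mu_pos = np.array([2.0, -1.0])    # mu_{i,+1}
mu_neg = np.array([0.0, 1.0])     # mu_{i,-1}
sigma2 = np.array([1.0, 4.0])     # sigma_i^2
log_prior_ratio = 0.0             # log Pr(C=+1)/Pr(C=-1), assuming equal priors

def log_ratio(x):
    # Sum over attributes of (mu_+ - mu_-)/sigma^2 * (x_i - midpoint): linear in x.
    terms = (mu_pos - mu_neg) / sigma2 * (x - (mu_pos + mu_neg) / 2)
    return log_prior_ratio + terms.sum()

x = np.array([1.8, 0.2])
print(np.sign(log_ratio(x)))      # +1: in scaled distance, x is closer to the +1 class means
```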
Naïve Bayes – Normal Distribution (3/3)
If we allow different variances for the two classes, the classification rule is more complex: the term
$$\log\frac{\Pr(x_i \mid C=+1)}{\Pr(x_i \mid C=-1)}$$
is quadratic in $x_i$.