Machine Learning: Foundations Lecturer: Yishay Mansour Lecture 2


SLIDE 1

Machine Learning: Foundations

Lecturer: Yishay Mansour

Lecture 2 – Bayesian Inference

Kfir Bar, Yaniv Bar, Marcelo Bacher

Based on notes by Shahar Yifrah, Keren Yizhak, Hadas Zur (2012)

SLIDE 2

Bayesian Inference

 Bayesian inference is a method of statistical inference that uses a prior

probability over hypotheses to determine the likelihood that a hypothesis is true given observed evidence.

 Three methods:
 ML - Maximum Likelihood rule
 MAP - Maximum A Posteriori rule
 Bayes Posterior rule

SLIDE 3

Bayes Rule

 In Bayesian inference:

 data (D) - the known information
 h - a hypothesis/classification regarding the data distribution

We use Bayes Rule to compute the likelihood that our hypothesis is true. In general:
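In symbols, with D the observed data and h the hypothesis:

```latex
\Pr[h \mid D] = \frac{\Pr[D \mid h]\,\Pr[h]}{\Pr[D]}
```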

SLIDE 4

Example 1: Cancer Detection

 A hospital is examining a new cancer detection kit.

The known information (the prior) is as follows:

 A patient with cancer has a 98% chance of a positive result
 A healthy patient has a 97% chance of a negative result
 The probability of cancer in the general population is 1%

How reliable is this kit?

SLIDE 5

Example 1: Cancer Detection

 Let’s calculate Pr[cancer|+]. According to Bayes rule we get:

Pr[cancer|+] = Pr[+|cancer] Pr[cancer] / Pr[+] = (0.98 × 0.01) / (0.98 × 0.01 + 0.03 × 0.99) = 0.0098 / 0.0395 ≈ 0.248
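A minimal Python sketch of this Bayes-rule computation (the function name and structure are mine, not the lecture's):

```python
def posterior(prior, sensitivity, specificity):
    """Pr[cancer | +] via Bayes rule for a binary test."""
    p_pos_given_cancer = sensitivity           # 0.98 on the slide
    p_pos_given_healthy = 1.0 - specificity    # 1 - 0.97 = 0.03
    p_pos = p_pos_given_cancer * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_cancer * prior / p_pos

print(round(posterior(0.01, 0.98, 0.97), 3))  # 0.248
```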

SLIDE 6

Example 1: Cancer Detection

 Surprisingly, the test, although it seems very accurate, with high detection probabilities of 97-98%, is almost useless.

 3 out of 4 patients found sick by the test are actually healthy. For a low error rate, we could just tell everyone they do not have cancer, and be right in 99% of the cases.

 The poor performance comes from the low probability of cancer in the general population (1%).
SLIDE 7

Example 2: Normal Distribution

 A random variable Z is distributed normally with mean μ and variance σ².

Recall the form of the normal density:
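The density the slide recalls is the standard normal density:

```latex
p(z) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(z-\mu)^2}{2\sigma^2}\right)
```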

SLIDE 8

Example 2: Normal Distribution

 We have m i.i.d. samples z1, …, zm of a random variable Z. The likelihood of the sample under the hypothesis (μ, σ) is

Pr[z1, …, zm | μ, σ] = ∏_{i=1}^{m} (1/(√(2π)σ)) e^{−(zi − μ)²/(2σ²)}

where (√(2π)σ)^{−m} is a normalization factor.

SLIDE 9

Example 2: Normal Distribution

 Maximum Likelihood (ML):

We aim to choose the hypothesis which best explains the sample, independent of the prior over the hypothesis space (i.e., the parameters that maximize the likelihood of the sample). In our case:

(μ_ML, σ_ML) = argmax_{μ,σ} Pr[z1, …, zm | μ, σ]

SLIDE 10

Example 2: Normal Distribution

 Maximum Likelihood (ML):

We take a log to simplify the computation:

log Pr[z1, …, zm | μ, σ] = −m log(√(2π)σ) − Σ_{i=1}^{m} (zi − μ)²/(2σ²)

Now we find the maximum for μ:

∂/∂μ log Pr[·] = Σ_{i=1}^{m} (zi − μ)/σ² = 0 ⟹ μ = (1/m) Σ_{i=1}^{m} zi

It's easy to see that the second derivative is negative, thus it's a maximum.

SLIDE 11

Example 2: Normal Distribution

 Maximum Likelihood (ML):  Note that this value of is independent of the value of and it is

simply the average of the observations

 Now we compute the maximum for given that is :
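A short sketch of these ML estimates in plain Python (the function name is mine, no external libraries):

```python
import math

def ml_normal(samples):
    """Maximum-likelihood estimates (mu, sigma) for i.i.d. normal samples."""
    m = len(samples)
    mu = sum(samples) / m                            # sample average
    var = sum((z - mu) ** 2 for z in samples) / m    # ML variance (divides by m, not m-1)
    return mu, math.sqrt(var)

mu, sigma = ml_normal([2.0, 4.0, 6.0])
print(mu, sigma)  # 4.0 and sqrt(8/3) ≈ 1.633
```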

SLIDE 12

Example 2: Normal Distribution

 Maximum A Posteriori (MAP):

MAP adds a prior over the hypotheses. In this example, the prior distributions of μ and σ are N(0,1) and are now taken into account. Since Pr[D] is constant for all hypotheses we can omit it, and maximize Pr[D | μ, σ] Pr[μ] Pr[σ].

SLIDE 13

Example 2: Normal Distribution

 Maximum A Posteriori (MAP):  How will the result we got in the ML approach change for MAP? We

added the knowledge that σ and μ are small and around zero, since the prior is σ,μ∼N(0,1)

 Therefore, the result (the hypothesis regarding σ and μ) should be

closer to 0 than the one we got in ML

SLIDE 14

Example 2: Normal Distribution

 Maximum A Posteriori (MAP):

Now we should maximize the likelihood term and the prior term simultaneously:

argmax_{μ,σ} Pr[z1, …, zm | μ, σ] Pr[μ] Pr[σ]

It can be easily seen that μ and σ will be closer to zero than in the ML approach, since the N(0,1) priors penalize values far from zero.
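A sketch for the simpler case of estimating only μ with σ known, under the prior μ ∼ N(0,1) (this reduction and the function name are mine; the lecture maximizes over both parameters). Maximizing log Pr[D | μ] + log Pr[μ] gives μ_MAP = Σ zi / (m + σ²), which shrinks the ML average toward 0:

```python
def map_mu(samples, sigma=1.0):
    """MAP estimate of the mean under prior mu ~ N(0, 1), with sigma known.

    Maximizes -sum((z - mu)^2) / (2 sigma^2) - mu^2 / 2 over mu;
    setting the derivative to zero gives sum(z - mu)/sigma^2 - mu = 0.
    """
    m = len(samples)
    return sum(samples) / (m + sigma ** 2)

samples = [2.0, 4.0, 6.0]
print(sum(samples) / len(samples))  # ML estimate: 4.0
print(map_mu(samples))              # MAP estimate: 12 / 4 = 3.0, closer to zero
```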

SLIDE 15

Example 2: Normal Distribution

 Posterior (Bayes):

Assume μ ∼ N(η, 1), Z ∼ N(μ, 1) and σ = 1. We see only one sample z of Z. What is the new posterior distribution of μ? Pr[Z = z] is a normalizing factor, so we can drop it in the calculation:

Pr[μ | z] ∝ Pr[z | μ] Pr[μ]

SLIDE 16

Example 2: Normal Distribution

 Posterior (Bayes):

Pr[μ | z] ∝ Pr[z | μ] Pr[μ], where the dropped factor Pr[z] is a normalization factor that does not depend on μ.
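Completing the square (my reconstruction of the slide's algebra, consistent with the result quoted on the next slide):

```latex
\Pr[\mu \mid z] \;\propto\; e^{-\frac{(z-\mu)^2}{2}}\, e^{-\frac{(\mu-\eta)^2}{2}}
\;=\; e^{-\left(\mu^2 - \mu(z+\eta)\right) + \text{const}}
\;\propto\; \exp\!\left(-\frac{\left(\mu - \frac{z+\eta}{2}\right)^2}{2\cdot\frac{1}{2}}\right)
```

so the posterior of μ is N((z + η)/2, 1/2).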

SLIDE 17

Example 2: Normal Distribution

 Posterior (Bayes):

The new posterior distribution is N((z + η)/2, 1/2): after taking into account the sample z, the mean of μ moves towards z, and the variance is reduced (from 1 to 1/2).

SLIDE 18

Example 2: Normal Distribution

 Posterior (Bayes):

In general, for μ ∼ N(η, S²) and Z ∼ N(μ, σ²), given m samples z1, …, zm we have:

μ | z1, …, zm ∼ N( (η/S² + Σ zi/σ²) / (1/S² + m/σ²), 1 / (1/S² + m/σ²) )

SLIDE 19

Example 2: Normal Distribution

 Posterior (Bayes):

And if we assume S = σ, we get a posterior mean of (η + Σ zi)/(m + 1), which is like starting with one additional sample of value η, i.e., the prior mean acts as an extra observation.
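A quick numeric check of this update rule, assuming the case S = σ = 1 (the function name is mine):

```python
def posterior_mean(samples, eta):
    """Posterior mean of mu for prior mu ~ N(eta, 1) and Z ~ N(mu, 1):
    equivalent to averaging the samples plus one extra 'sample' of value eta."""
    return (eta + sum(samples)) / (len(samples) + 1)

print(posterior_mean([3.0], eta=1.0))        # one sample: (1 + 3) / 2 = 2.0
print(posterior_mean([3.0, 5.0], eta=1.0))   # (1 + 8) / 3 = 3.0
```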

SLIDE 20

Learning A Concept Family (1/2)

 We are given a Concept Family H.

 Our information consists of examples S = {(xi, bi) : 1 ≤ i ≤ m}, where bi = f(xi) and f ∈ H is an unknown target function that classifies all samples.

 Assumptions:

(1) The functions in H are deterministic (Pr[h(x) = 1] ∈ {0, 1} for every h ∈ H).
(2) The process that generates the inputs xi is independent of the target function f.

 For each h ∈ H we will calculate Pr[S | h].

Case 1: ∃i such that h(xi) ≠ bi. Then Pr[xi, bi | h] = 0, so Pr[S | h] = 0.

Case 2: h(xi) = bi for all i. Then Pr[xi, bi | h] = Pr[bi | xi, h] Pr[xi | h] = 1 · Pr[xi], so Pr[S | h] = ∏_{i=1}^{m} Pr[xi].

SLIDE 21

Learning A Concept Family (2/2)

 Definition: A consistent function h ∈ H classifies all the samples in S correctly (h(xi) = bi for all (xi, bi) ∈ S).

 Let H' = {h ∈ H : h is consistent with S} be all the functions that are consistent with S.

There are three methods to choose from H':

ML - choose any consistent function; each one has the same likelihood.
MAP - choose the consistent function with the highest prior probability.
Bayes - a weighted combination of all consistent functions into one predictor, B(y) = Σ_{h ∈ H'} (Pr[h] / Pr[H']) · h(y).

SLIDE 22

Example (Biased Coins)

 We toss a coin n times and the coin comes up heads k times.

 We want to estimate the probability p that the coin will come up heads in the next toss.

 The probability that k out of n coin tosses will come up heads is:

Pr[(k, n) | p] = C(n, k) p^k (1 − p)^(n−k)

 With the Maximum Likelihood approach, one would choose the p that maximizes Pr[(k, n) | p], which is p = k/n.

 Yet this result seems unreasonable when n is small. (For example, if you toss the coin only once and get a tail, should you believe that it is impossible to get a head on the next toss?)

SLIDE 23

Laplace Rule (1/3)

 Let us suppose a uniform prior distribution on p. That is, the prior distribution on all the possible coins is uniform: Pr[p ∈ [a, b]] = ∫_a^b dp = b − a.

 We will calculate the probability of seeing k heads out of n tosses:

Pr[(k, n)] = ∫₀¹ C(n, k) p^k (1 − p)^(n−k) dp

Integrating by parts, ∫₀¹ p^k (1 − p)^(n−k) dp = ((n − k)/(k + 1)) ∫₀¹ p^(k+1) (1 − p)^(n−k−1) dp, and since C(n, k)(n − k)/(k + 1) = C(n, k + 1), we get

Pr[(k, n)] = ∫₀¹ C(n, k + 1) p^(k+1) (1 − p)^(n−k−1) dp = Pr[(k + 1, n)]

SLIDE 24

Laplace Rule (2/3)

 Comparing both ends of the above sequence of equalities we realize that all these probabilities are equal, and since they sum to 1 over the n + 1 possible values of k, for any k:

Pr[(k, n)] = 1/(n + 1)

Intuitively, it means that for a random choice of the bias p, any possible number of heads in a sequence of n coin tosses is equally likely.

 We want to calculate the posterior expectation E[p | s(k, n)], where s(k, n) is a specific sequence with k heads out of n.

 We have:

Pr[s(k, n) | p] = p^k (1 − p)^(n−k)

Pr[s(k, n)] = ∫₀¹ p^k (1 − p)^(n−k) dp = 1 / ((n + 1) C(n, k))

SLIDE 25

Laplace Rule (3/3)

 Hence,  Intuitively, Laplace correction is like adding two samples to the ML

estimator, one with value 0 and one with value 1.

2 1 1 1 1 1 1 1 2 1 1 1 1 ) 1 ( )] , ( Pr[ ] Pr[ ] | ) , ( Pr[ )] , ( | [

1 1

                                          

n k k n n k n n k n n dp p p p dp n k s p p n k s p n k p E

k n k

2 1   n k
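A tiny comparison of the two estimators (a sketch; the function names are mine):

```python
def ml_estimate(k, n):
    """Maximum-likelihood bias estimate: fraction of heads."""
    return k / n

def laplace_estimate(k, n):
    """Laplace rule: posterior expectation of p under a uniform prior."""
    return (k + 1) / (n + 2)

# One toss, one tail: ML says heads are impossible, Laplace stays cautious.
print(ml_estimate(0, 1))        # 0.0
print(laplace_estimate(0, 1))   # 1/3
```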

SLIDE 26

Loss Function

 In lecture #1 we defined a few loss functions, among them the logarithmic loss function. We will use it to compare our different approaches.

 When considering a loss function we should note that there are two causes for the loss:

 1. Bayes Risk - the loss we cannot avoid, since we are bound to incur it even if we knew the target concept. Example: the biased-coin problem revisited. Even if we knew the bias p we would always predict 0 (if p < 1/2), which, on average, results in pn mistakes.

 2. Regret - the loss due to incorrect estimation of the target concept (having to learn an unknown model).

SLIDE 27

Using Log Loss Function (1/3)

 A commonly used loss function is the LogLoss, which states for the biased-coin problem that if the learner guesses that the bias is p, then the loss will be log(1/p) when the outcome is 1 (= head) and log(1/(1 − p)) when the outcome is 0 (= tail).

 If the true bias is α then the expected LogLoss is

α log(1/p) + (1 − α) log(1/(1 − p)),

which attains its minimum when p = α.

 Let’s consider the loss at p = α:

H(α) = α log(1/α) + (1 − α) log(1/(1 − α)),

which is known in the Information Theory literature as the entropy of α; this is essentially the Bayes Risk.
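A numeric sanity check that the expected LogLoss is minimized at p = α (a sketch; the function names are mine):

```python
import math

def expected_logloss(p, alpha):
    """Expected LogLoss when guessing bias p and the true bias is alpha."""
    return alpha * math.log(1 / p) + (1 - alpha) * math.log(1 / (1 - p))

alpha = 0.3
losses = {p: expected_logloss(p, alpha) for p in (0.1, 0.3, 0.5, 0.7)}
best = min(losses, key=losses.get)
print(best)  # 0.3 — guessing the true bias minimizes the loss
```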

SLIDE 28

Using Log Loss Function (2/3)

 Recall that we cannot do any better than the entropy of the true bias per flip, since the Bayes Risk is the loss we can't avoid.

 How far are we from the Bayes Risk when using the guess of p according to the Laplace Rule? Summing the expected per-flip LogLoss of the Laplace estimate over T coin flips (writing α for the true bias; derivation omitted) yields

E[LogLoss] ≤ T · H(α) + c · log(T) + O(1) = Bayes Risk + c · log(T) + O(1)

for some constant c.

SLIDE 29

Using Log Loss Function (3/3)

 In the above we used the fact that the sums of the form (1/n) Σ H(i/n) sandwich the corresponding integral of the entropy function H between a lower and an upper Riemann sum.

 The difference between the upper and the lower bound telescopes to (1/n)(H(1/2) − H(1)) = O(1/n) at each step, and summing 1/n over n = 1, …, T gives the logarithmic term.

 Hence, we showed that by applying the Laplace Rule, we attained the optimal loss (the Bayes Risk) with an additional regret which is only logarithmic in the number of coin flips (T). Note that the Bayes Risk by itself grows linearly with T.

SLIDE 30

Naïve Bayes Classification: Binary Domain (1/2)

Consider two binary classes, +1 and −1, where each example is described by n attributes.

 Each xi is a binary variable with possible values {0, 1}. Example of a dataset: a table whose rows are examples, each giving the values of x1, x2, …, xn and the class label C ∈ {+1, −1}.

We are looking for a hypothesis h that maps x1, x2, …, xn to {−1, +1} such that:

Pr(C = +1 | x1, x2, …, xn) = Pr(x1, x2, …, xn | C = +1) Pr(C = +1) / Pr(x1, x2, …, xn)

SLIDE 31

Naïve Bayes Classification: Binary Domain (2/2)

The term Pr(C = +1) can be easily calculated from the data; the full joint Pr(x1, x2, …, xn | C) cannot, when n is large.

 Naive Bayes is based on an independence assumption:

Pr(x1, x2, …, xn | C) = ∏_{i=1}^{n} Pr(xi | C)

 Each attribute xi is independent of the other attributes once we know the value of C.

 For each 1 ≤ i ≤ n we have two parameters:

p_{i|+1} = Pr(xi = 1 | C = +1) and p_{i|−1} = Pr(xi = 1 | C = −1)

 Assuming simple binomial estimation we can estimate these two parameters.

 Count the number of instances with xi = 1 and with xi = 0 among instances where C = +1 or C = −1, respectively. The convergence inequalities of Markov, Chebyshev and Chernoff can be used to bound deviations of the observed average from the mean (see the end of lecture 1 notes for more information).

SLIDE 32

Naïve Bayes - Interpretation (1/2)

 According to Bayes and MAP we need to compare two values:

Pr(C = +1 | x1, …, xn) and Pr(C = −1 | x1, …, xn)

 We choose the class with the maximum probability, by taking the log of the ratio and comparing to 0:

log [Pr(C = +1 | x1, …, xn) / Pr(C = −1 | x1, …, xn)] = log [Pr(x1, …, xn | C = +1) Pr(C = +1) / (Pr(x1, …, xn | C = −1) Pr(C = −1))]

 Using the independence assumption, we conclude that:

log [Pr(C = +1 | x1, …, xn) / Pr(C = −1 | x1, …, xn)] = log [Pr(C = +1) / Pr(C = −1)] + Σ_{i=1}^{n} log [Pr(xi | C = +1) / Pr(xi | C = −1)]

SLIDE 33

Naïve Bayes - Interpretation (2/2)

 Each xi “votes” about the prediction:

If Pr(xi | C = +1) = Pr(xi | C = −1), then xi has "no say" in the classification.
If Pr(xi | C = +1) = 0, then xi overrides all other votes (a "veto").

 Let us denote:

wi = log [Pr(xi = 1 | C = +1) / Pr(xi = 1 | C = −1)] − log [Pr(xi = 0 | C = +1) / Pr(xi = 0 | C = −1)]

b = log [Pr(C = +1) / Pr(C = −1)] + Σ_i log [Pr(xi = 0 | C = +1) / Pr(xi = 0 | C = −1)]

 The classification rule becomes a separating hyperplane:

h(x) = sign(Σ_i wi xi + b): say Class +1 if the sum is positive, Class −1 if negative (neither if it is 0).
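A compact sketch of the pipeline from the last few slides (counting estimates, then the log-ratio vote). The names are mine, and the zero-probability "veto" case is left unhandled, since it would produce infinite log terms:

```python
import math
from collections import defaultdict

def train_nb(examples):
    """Estimate Pr(C) and Pr(x_i = 1 | C) by simple counting (no smoothing)."""
    counts = {+1: 0, -1: 0}
    ones = {+1: defaultdict(int), -1: defaultdict(int)}
    for x, c in examples:
        counts[c] += 1
        for i, v in enumerate(x):
            ones[c][i] += v
    n = len(examples[0][0])
    prior = {c: counts[c] / len(examples) for c in (+1, -1)}
    p = {c: [ones[c][i] / counts[c] for i in range(n)] for c in (+1, -1)}
    return prior, p

def predict_nb(model, x):
    """Sign of the log ratio: prior term plus one vote per attribute."""
    prior, p = model
    score = math.log(prior[+1] / prior[-1])
    for i, v in enumerate(x):
        pr_pos = p[+1][i] if v else 1 - p[+1][i]
        pr_neg = p[-1][i] if v else 1 - p[-1][i]
        score += math.log(pr_pos / pr_neg)  # would blow up on a "veto"
    return +1 if score > 0 else -1

data = [((1, 1), +1), ((1, 0), +1), ((0, 1), +1),
        ((0, 0), -1), ((0, 1), -1), ((1, 0), -1)]
model = train_nb(data)
print(predict_nb(model, (1, 1)), predict_nb(model, (0, 0)))  # 1 -1
```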

SLIDE 34

Naïve Bayes – Practical Considerations

 Easy to estimate the parameters (each one has many samples)
 A relatively naive model
 Very simple to implement
 Reasonable performance (pretty often)

SLIDE 35

Naïve Bayes – Normal Distribution (1/3)

 Usually we also say Gaussian distribution. Recall:

X ∼ N(μ, σ²) if p(x) = (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)}

Pr[a ≤ x ≤ b] = ∫_a^b p(x) dx

E[x] = μ and Var[x] = E[(x − E[x])²] = E[x²] − (E[x])² = σ²

 We recall the independence assumption:

Pr(x1, x2, …, xn | C) = ∏_{i=1}^{n} Pr(xi | C)

 In addition, we make the following assumptions:

 The mean of xi depends on the class, i.e., μ_{i,C}: Pr(xi | C) ∼ N(μ_{i,C}, σi²).
 The variance of xi does not depend on the class, i.e., σi.

SLIDE 36

Naïve Bayes – Normal Distribution (2/3)

 As before,

log [Pr(C = +1 | x1, …, xn) / Pr(C = −1 | x1, …, xn)] = log [Pr(C = +1) / Pr(C = −1)] + Σ_i log [Pr(xi | C = +1) / Pr(xi | C = −1)]

 Using the Gaussian parameters,

log [Pr(xi | C = +1) / Pr(xi | C = −1)] = log [e^{−(xi − μ_{i,+1})²/(2σi²)} / e^{−(xi − μ_{i,−1})²/(2σi²)}] = ((xi − μ_{i,−1})² − (xi − μ_{i,+1})²) / (2σi²) = ((μ_{i,+1} − μ_{i,−1}) / σi²) · (xi − (μ_{i,+1} + μ_{i,−1})/2)

i.e., the product of the distance between the means and the distance of xi to the midway point between them.
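A sketch of this per-attribute score (the names are mine); it checks that the closed form above matches the log-density ratio computed directly:

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log of the normal density N(mu, sigma^2) at x."""
    return -math.log(math.sqrt(2 * math.pi) * sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

def attribute_score(x, mu_pos, mu_neg, sigma):
    """Closed form: (distance between means / variance) * (distance of x to the midpoint)."""
    return (mu_pos - mu_neg) / sigma ** 2 * (x - (mu_pos + mu_neg) / 2)

x, mu_pos, mu_neg, sigma = 1.5, 2.0, 0.0, 1.0
direct = gaussian_logpdf(x, mu_pos, sigma) - gaussian_logpdf(x, mu_neg, sigma)
print(abs(direct - attribute_score(x, mu_pos, mu_neg, sigma)) < 1e-12)  # True
```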

SLIDE 37

Naïve Bayes – Normal Distribution (3/3)

 If we allow different variances, the classification rule is more complex:

the term log [Pr(xi | C = +1) / Pr(xi | C = −1)] is quadratic in xi.