

  1. Machine Learning: Foundations. Lecturer: Yishay Mansour. Lecture 2 – Bayesian Inference. Kfir Bar, Yaniv Bar, Marcelo Bacher. Based on notes by Shahar Yifrah, Keren Yizhak, Hadas Zur (2012)

  2. Bayesian Inference • Bayesian inference is a method of statistical inference that uses a prior probability over hypotheses to determine how likely a hypothesis is to be true, given observed evidence. • Three methods: • ML - Maximum Likelihood rule • MAP - Maximum A Posteriori rule • Bayes - Posterior rule

  3. Bayes Rule • In general: Pr[A | B] = Pr[B | A] · Pr[A] / Pr[B] • In Bayesian inference: • data - the observed information • h - a hypothesis/classification regarding the data distribution • We use Bayes Rule to compute the likelihood that our hypothesis is true: Pr[h | data] = Pr[data | h] · Pr[h] / Pr[data]

  4. Example 1: Cancer Detection • A hospital is examining a new cancer detection kit. The known information (prior) is as follows: • A patient with cancer has a 98% chance of a positive result • A healthy patient has a 97% chance of a negative result • The probability of cancer in the general population is 1% • How reliable is this kit?

  5. Example 1: Cancer Detection • Let's calculate Pr[cancer | +]. According to Bayes rule we get: Pr[cancer | +] = Pr[+ | cancer] · Pr[cancer] / Pr[+] = (0.98 · 0.01) / (0.98 · 0.01 + 0.03 · 0.99) = 0.0098 / 0.0395 ≈ 0.248

  6. Example 1: Cancer Detection • Surprisingly, although the test seems very accurate, with detection probabilities of 97-98%, it is almost useless: about 3 out of 4 patients who test positive do not actually have cancer. • For a low error rate we could simply tell everyone they do not have cancer, which is correct in 99% of the cases. • The poor positive predictive value comes from the low probability of cancer in the general population (1%).
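A minimal numeric sketch of this calculation in Python, using only the probabilities stated on the previous slides:

```python
# Known information from slide 4
p_cancer = 0.01             # Pr[cancer] in the general population
p_pos_given_cancer = 0.98   # Pr[+ | cancer]
p_neg_given_healthy = 0.97  # Pr[- | healthy]

# Total probability of a positive result
p_pos = (p_pos_given_cancer * p_cancer
         + (1 - p_neg_given_healthy) * (1 - p_cancer))

# Bayes rule: Pr[cancer | +]
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"Pr[cancer | +] = {p_cancer_given_pos:.3f}")  # about 0.248
```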

  7. Example 2: Normal Distribution • A random variable Z is distributed normally with mean μ and variance σ², i.e. Z ∼ N(μ, σ²). • Recall the density: Pr[Z = z] = (1 / (√(2π) · σ)) · exp(-(z - μ)² / (2σ²))

  8. Example 2: Normal Distribution • We have m i.i.d. samples D = {z_1, ..., z_m} of a random variable Z ∼ N(μ, σ²). The likelihood of the sample is Pr[D | h] = ∏_{i=1}^m c · exp(-(z_i - μ)² / (2σ²)), where c = 1 / (√(2π) · σ) is a normalization factor

  9. Example 2: Normal Distribution • Maximum Likelihood (ML): We aim to choose the hypothesis which best explains the sample, independent of the prior over the hypothesis space (i.e., the parameters that maximize the likelihood of the sample). In our case: (μ_ML, σ_ML) = argmax_{μ,σ} Pr[D | μ, σ]

  10. Example 2: Normal Distribution • Maximum Likelihood (ML): We take a log to simplify the computation: ln Pr[D | μ, σ] = -m · ln(√(2π) · σ) - Σ_{i=1}^m (z_i - μ)² / (2σ²). Now we find the maximum for μ: ∂/∂μ ln Pr[D | μ, σ] = Σ_{i=1}^m (z_i - μ) / σ² = 0  ⇒  μ = (1/m) · Σ_{i=1}^m z_i. It is easy to see that the second derivative is negative, thus it is a maximum

  11. Example 2: Normal Distribution • Maximum Likelihood (ML): • Note that this value of μ is independent of the value of σ and it is simply the average of the observations. • Now we compute the maximum for σ given that μ is the sample average: ∂/∂σ ln Pr[D | μ, σ] = -m/σ + Σ_{i=1}^m (z_i - μ)² / σ³ = 0  ⇒  σ² = (1/m) · Σ_{i=1}^m (z_i - μ)²
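A short numpy sketch of the two ML estimates above (the true parameters and sample size are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.5, size=1000)  # m i.i.d. samples, made-up parameters

# ML estimates derived on slides 10-11
mu_ml = z.mean()                               # (1/m) * sum(z_i)
sigma_ml = np.sqrt(((z - mu_ml) ** 2).mean())  # sqrt((1/m) * sum((z_i - mu)^2))

print(mu_ml, sigma_ml)  # close to the true values 2.0 and 1.5
```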

  12. Example 2: Normal Distribution • Maximum A Posteriori (MAP): MAP adds a prior over the hypotheses. In this example, the prior distributions of μ and σ are N(0, 1) and are now taken into account: h_MAP = argmax_h Pr[h | D] = argmax_h Pr[D | h] · Pr[h] / Pr[D], and since Pr[D] is constant for all h we can omit it, and have the following: h_MAP = argmax_h Pr[D | h] · Pr[h]

  13. Example 2: Normal Distribution • Maximum A Posteriori (MAP): • How will the result we got in the ML approach change for MAP? We added the knowledge that σ and μ are small and around zero, since the prior is μ, σ ∼ N(0, 1). • Therefore, the result (the hypothesis regarding σ and μ) should be closer to 0 than the one we got in ML

  14. Example 2: Normal Distribution • Maximum A Posteriori (MAP): Now we maximize the likelihood and the prior together: argmax_{μ,σ} [ln Pr[D | μ, σ] + ln Pr[μ] + ln Pr[σ]] = argmax_{μ,σ} [-m · ln(√(2π) · σ) - Σ_{i=1}^m (z_i - μ)² / (2σ²) - μ²/2 - σ²/2 + const]. It can easily be seen that μ and σ will be closer to zero than in the ML approach, since the prior terms -μ²/2 and -σ²/2 penalize values far from zero
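A rough numerical sketch of this MAP estimate under the N(0, 1) priors, using a simple grid search (the sample data and grid ranges are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(loc=0.8, scale=0.9, size=20)  # small made-up sample

def log_posterior(mu, sigma):
    # log Pr[D | mu, sigma] + log prior for mu, sigma ~ N(0, 1), up to constants
    log_lik = -len(z) * np.log(sigma) - np.sum((z - mu) ** 2) / (2 * sigma ** 2)
    log_prior = -mu ** 2 / 2 - sigma ** 2 / 2
    return log_lik + log_prior

# Brute-force grid search over (mu, sigma)
mus = np.linspace(-3.0, 3.0, 121)
sigmas = np.linspace(0.05, 3.0, 120)
_, mu_map, sigma_map = max((log_posterior(m, s), m, s) for m in mus for s in sigmas)

print(mu_map, sigma_map)   # MAP estimates, pulled toward zero
print(z.mean(), z.std())   # ML estimates for comparison
```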

  15. Example 2: Normal Distribution • Posterior (Bayes): Assume μ ∼ N(η, 1) and Z ∼ N(μ, 1), i.e. σ = 1. We see only one sample of Z. What is the new posterior distribution of μ? Pr[μ | Z] = Pr[Z | μ] · Pr[μ] / Pr[Z]; Pr[Z] is a normalizing factor, so we can drop it for the calculations: Pr[μ | Z] ∝ exp(-(Z - μ)² / 2) · exp(-(μ - η)² / 2)

  16. Example 2: Normal Distribution • Posterior (Bayes): Completing the square in μ gives -(Z - μ)²/2 - (μ - η)²/2 = -(μ - (Z + η)/2)² + c, where c = (Z + η)²/4 - (Z² + η²)/2 is part of the normalization factor and does not depend on μ. Hence Pr[μ | Z] ∝ exp(-(μ - (Z + η)/2)²)

  17. Example 2: Normal Distribution • Posterior (Bayes): The new posterior distribution is μ | Z ∼ N((Z + η)/2, 1/2): after taking into account the sample Z, the mean of μ moves from η towards Z and the variance is reduced from 1 to 1/2
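A quick numerical check of this single-sample update (η and the observed Z below are made-up values), comparing a grid-normalized posterior with the closed-form mean (Z + η)/2 and variance 1/2:

```python
import numpy as np

eta, Z = 0.0, 2.0  # made-up prior mean and single observation

mu = np.linspace(-6, 6, 100001)
unnorm = np.exp(-(Z - mu) ** 2 / 2) * np.exp(-(mu - eta) ** 2 / 2)
post = unnorm / np.trapz(unnorm, mu)  # normalize numerically

post_mean = np.trapz(mu * post, mu)
post_var = np.trapz((mu - post_mean) ** 2 * post, mu)

print(post_mean, post_var)  # approximately (Z + eta)/2 = 1.0 and 0.5
```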

  18. Example 2: Normal Distribution • Posterior (Bayes): In general, for a prior μ ∼ N(μ₀, S²) and samples Z ∼ N(μ, σ²), given m samples z_1, ..., z_m we have: μ | D ∼ N( (σ² · μ₀ + S² · Σ_{i=1}^m z_i) / (σ² + m · S²), (σ² · S²) / (σ² + m · S²) )

  19. Example 2: Normal Distribution • Posterior (Bayes): And if we assume S = σ, we get: E[μ | D] = (μ₀ + Σ_{i=1}^m z_i) / (m + 1), which is like starting with an additional sample of value μ₀, i.e., the prior mean behaves as one extra observation
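A small numpy sketch of this conjugate update (the prior mean, variances, and sample below are made-up values):

```python
import numpy as np

rng = np.random.default_rng(2)

# Prior mu ~ N(mu0, S^2), samples Z ~ N(mu, sigma^2); all values made up
mu0, S, sigma = 0.0, 1.0, 1.0
z = rng.normal(loc=1.7, scale=sigma, size=10)  # m = 10 samples
m = len(z)

# Posterior of mu given the m samples (general update above)
post_mean = (sigma**2 * mu0 + S**2 * z.sum()) / (sigma**2 + m * S**2)
post_var = (sigma**2 * S**2) / (sigma**2 + m * S**2)

print(post_mean, post_var)
# With S = sigma this reduces to (mu0 + sum(z)) / (m + 1): the prior mean
# acts like one additional sample.
```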

  20. Learning A Concept Family (1/2) • We are given a concept family H. • Our information consists of examples ⟨x, f(x)⟩, where f ∈ H is an unknown target function that classifies all samples. • Assumptions: (1) The functions in H are deterministic (Pr[h(x) = 1] ∈ {0, 1}). (2) The process that generates the input is independent of the target function f. • For each h ∈ H we will calculate Pr[S | h], where S = {⟨x_i, b_i⟩ : 1 ≤ i ≤ m} and b_i = f(x_i). • Case 1: ∃i: h(x_i) ≠ b_i  ⇒  Pr[⟨x_i, b_i⟩ | h] = 0  ⇒  Pr[S | h] = 0. • Case 2: ∀i: h(x_i) = b_i  ⇒  Pr[⟨x_i, b_i⟩ | h] = Pr[x_i] · Pr[b_i | h, x_i] = Pr[x_i]  ⇒  Pr[S | h] = ∏_{i=1}^m Pr[x_i] = Pr[S]

  21. Learning A Concept Family (2/2) • Definition: A consistent function h ∈ H classifies all the samples in S correctly (∀⟨x_i, b_i⟩ ∈ S: h(x_i) = b_i). • Let H' ⊆ H be the set of all functions that are consistent with S. • There are three methods to choose from H': • ML - choose any consistent function; each one has the same likelihood. • MAP - choose the consistent function with the highest prior probability. • Bayes - a weighted combination of all consistent functions into one predictor: B(y) = (Σ_{h ∈ H'} h(y) · Pr[h]) / Pr[H']
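A toy sketch of the three rules over a small finite concept family (the hypotheses, prior, and labeled sample below are invented for illustration):

```python
import random

# Toy concept family over integer inputs; each hypothesis is a boolean function
H = {
    "h1": lambda x: x >= 1,
    "h2": lambda x: x >= 2,
    "h3": lambda x: x % 2 == 1,
}
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}  # made-up prior over H

S = [(1, True), (3, True), (0, False)]  # labeled sample <x_i, b_i>

# H': hypotheses consistent with S (all others have likelihood 0)
H_prime = [name for name, h in H.items() if all(h(x) == b for x, b in S)]

ml_choice = random.choice(H_prime)                       # ML: any consistent h
map_choice = max(H_prime, key=lambda name: prior[name])  # MAP: highest prior

def bayes_predict(y):
    # Bayes: weighted vote B(y) = sum_{h in H'} h(y) * Pr[h] / Pr[H']
    total = sum(prior[name] for name in H_prime)
    return sum(prior[name] * H[name](y) for name in H_prime) / total

print(H_prime, ml_choice, map_choice, bayes_predict(2))
```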

  22. Example (Biased Coins) • We toss a coin n times and the coin comes up heads k times. • We want to estimate the probability p that the coin will come up heads in the next toss. • The probability that k out of n coin tosses will come up heads is: Pr[(k, n) | p] = C(n, k) · p^k · (1 - p)^(n-k), where C(n, k) is the binomial coefficient. • With the Maximum Likelihood approach, one would choose the p that maximizes Pr[(k, n) | p], which is p = k/n. • Yet this result seems unreasonable when n is small. (For example, if you toss the coin only once and get a tail, should you believe that it is impossible to get a head on the next toss?)

  23. Laplace Rule (1/3) • Let us suppose a uniform prior distribution on p. That is, the prior density over all the possible coins is Pr[p] = 1 on [0, 1]. • We will calculate the probability of seeing k heads out of n tosses: Pr[(k, n)] = ∫₀¹ Pr[(k, n) | p] · Pr[p] dp = C(n, k) · ∫₀¹ x^k · (1 - x)^(n-k) dx. • Integration by parts gives: C(n, k) · [ x^(k+1)/(k+1) · (1 - x)^(n-k) ]₀¹ + C(n, k) · ∫₀¹ x^(k+1)/(k+1) · (n - k) · (1 - x)^(n-k-1) dx = C(n, k+1) · ∫₀¹ x^(k+1) · (1 - x)^(n-k-1) dx = ∫₀¹ Pr[(k+1, n) | p] · Pr[p] dp = Pr[(k+1, n)]

  24. Laplace Rule (2/3) • Comparing both ends of the above sequence of equalities we realize that all the probabilities are equal, and therefore, for any k: Pr[(k, n)] = 1/(n + 1). • Intuitively, it means that for a random choice of the bias p, any possible number of heads in a sequence of n coin tosses is equally likely. • We want to calculate the posterior expectation E[p | s(k, n)], where s(k, n) is a specific sequence with k heads out of n. • We have: Pr[s(k, n) | p] = p^k · (1 - p)^(n-k) and Pr[s(k, n)] = ∫₀¹ p^k · (1 - p)^(n-k) dp = 1 / ((n + 1) · C(n, k))

  25. Laplace Rule (3/3) • Hence, E[p | s(k, n)] = ∫₀¹ p · Pr[s(k, n) | p] · Pr[p] / Pr[s(k, n)] dp = (∫₀¹ p^(k+1) · (1 - p)^(n-k) dp) · (n + 1) · C(n, k) = (1 / ((n + 2) · C(n+1, k+1))) · (n + 1) · C(n, k) = (k + 1) / (n + 2). • Intuitively, the Laplace correction is like adding two samples to the ML estimator, one with value 0 and one with value 1
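A tiny sketch contrasting the ML estimate k/n with the Laplace estimate (k + 1)/(n + 2) (the coin bias and number of tosses below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
true_p, n = 0.7, 5           # made-up bias and a small number of tosses
k = rng.binomial(n, true_p)  # number of heads observed

p_ml = k / n                   # Maximum Likelihood estimate
p_laplace = (k + 1) / (n + 2)  # Laplace rule: add one head and one tail

print(k, p_ml, p_laplace)
# In the extreme case k = n, ML gives p = 1.0 (a head is "certain"),
# while the Laplace estimate gives (n + 1)/(n + 2) < 1.
```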
