  1. Machine Learning - MT 2017
     7. Bayesian Approach to Machine Learning
     Christoph Haase, University of Oxford, October 23, 2017

  3. Frequentist vs Bayesian Approaches
     Different views on probability:
     ◮ Frequentists: The probability of an event represents its long-run frequency over a large number of repetitions of an experiment
     ◮ Bayesians: The probability of an event represents a degree of belief about the event
     Different views on statistics:
     ◮ Frequentists: Parameters are fixed, data are a repeatable random sample, and the underlying parameters remain constant at every repetition
     ◮ Bayesians: Data are fixed, parameters are unknown and described probabilistically, and repetition adds knowledge about the parameters

  4. Frequentist vs Bayesian Approaches

  9. Bayes’ Theorem
     Recall the basic laws of probability:
     p(B | A) · p(A) = p(A ∩ B) = p(A | B) · p(B)
     Bayes’ Theorem:
     p(A | B) = p(B | A) · p(A) / p(B)
     Viewing A as a proposition and B as evidence:
     ◮ p(A) is the prior, representing the initial belief about A
     ◮ p(A | B) is the posterior, representing the belief about A after learning about B
     ◮ The posterior is proportional to the prior times the likelihood if we fix B:
       p(A | B) ∝ p(B | A) · p(A)

 12. Priors Matter
     Suppose we have a test for a disease:
     ◮ the test is 95% effective, i.e., p(T | D) = 0.95
     ◮ the rate of false positives is 1%, i.e., p(T | D̄) = 0.01
     ◮ the disease occurs in 0.5% of the population, i.e., p(D) = 0.005
     Suppose the test is positive; what is p(D | T)?
     p(D | T) = p(T | D) · p(D) / p(T)
              = p(T | D) · p(D) / (p(T | D) · p(D) + p(T | D̄) · p(D̄))
              = 0.95 · 0.005 / (0.95 · 0.005 + 0.01 · 0.995)
              ≈ 0.32
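The calculation on this slide is easy to check directly; a minimal sketch using the numbers above:

```python
# Posterior probability of disease given a positive test, via Bayes' theorem.
# Numbers are taken from the slide.
p_t_given_d = 0.95      # p(T | D): test sensitivity
p_t_given_not_d = 0.01  # p(T | ~D): false-positive rate
p_d = 0.005             # p(D): prevalence (the prior)

# Evidence p(T) by the law of total probability
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Bayes' theorem: p(D | T) = p(T | D) * p(D) / p(T)
p_d_given_t = p_t_given_d * p_d / p_t
print(round(p_d_given_t, 3))  # → 0.323
```

Despite a 95% effective test, the low prior p(D) drags the posterior down to about 32%.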

 14. Bayesian Machine Learning
     In the discriminative framework, we model the output y as a probability distribution given the input x and the parameters w, say p(y | w, x).
     In Bayesian machine learning, we assume a prior on the parameters w, say p(w). This prior represents a ‘‘belief’’ about the model; the uncertainty in our knowledge is expressed mathematically as a probability distribution.
     When observations D = ⟨(x_i, y_i)⟩_{i=1}^N are made, the belief about the parameters w is updated using Bayes’ rule. As before, the posterior distribution on w given the data D is:
     p(w | D) ∝ p(y | w, X) · p(w)

 17. Coin Toss Example
     Let us consider the Bernoulli model for a coin toss, for θ ∈ [0, 1]:
     p(H | θ) = θ
     Suppose after three independent coin tosses, you get T, T, T. What is the maximum likelihood estimate for θ?
     What is the posterior distribution over θ, assuming a uniform prior on θ? What if we instead assume a Beta(2, 2) prior on θ?
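The answers are not spelled out on the slide, but they follow from the standard conjugacy of the Beta prior with the Bernoulli likelihood: a Beta(a, b) prior updated with h heads and t tails gives a Beta(a + h, b + t) posterior, and the uniform prior is Beta(1, 1). A sketch:

```python
# Beta-Bernoulli conjugate update for the coin-toss example.
def posterior(a, b, heads, tails):
    """Beta(a, b) prior + observed counts -> Beta posterior parameters."""
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

tosses = ["T", "T", "T"]
h, t = tosses.count("H"), tosses.count("T")

print(h / (h + t))                        # MLE: 0.0 -- the coin "never" lands heads
print(posterior(1, 1, h, t))              # uniform prior  -> Beta(1, 4)
print(posterior(2, 2, h, t))              # Beta(2, 2) prior -> Beta(2, 5)
print(beta_mean(*posterior(2, 2, h, t)))  # posterior mean 2/7, kept away from 0 by the prior
```

Note how the MLE commits to θ = 0 after only three tosses, while either prior keeps some posterior mass on θ > 0.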

 20. Least Squares and MLE (Gaussian Noise)
     Least squares objective function:
     L(w) = Σ_{i=1}^N (y_i − w · x_i)²
     MLE likelihood (Gaussian noise):
     p(y | X, w) = 1 / (2πσ²)^{N/2} · exp( −Σ_{i=1}^N (y_i − w · x_i)² / (2σ²) )
     For estimating w, the negative log-likelihood under Gaussian noise has the same form as the least squares objective.
     Alternatively, we can model the data (only the y_i-s) as being generated from a distribution defined by exponentiating the negative of the objective function.
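This correspondence can be checked numerically. The sketch below (synthetic data, not from the slides) fits w by least squares and verifies that no nearby w achieves a lower Gaussian negative log-likelihood, since −log p(y | X, w) = (N/2) log(2πσ²) + L(w)/(2σ²):

```python
# Least-squares solution vs. Gaussian negative log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma = 100, 3, 0.5
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + sigma * rng.normal(size=N)  # linear model with Gaussian noise

def neg_log_likelihood(w):
    r = y - X @ w
    # (N/2) log(2 pi sigma^2) + L(w) / (2 sigma^2): constant plus scaled least squares
    return N / 2 * np.log(2 * np.pi * sigma**2) + r @ r / (2 * sigma**2)

w_ls = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares estimate

# Random perturbations of w_ls never decrease the negative log-likelihood,
# so the least-squares minimiser is also the maximum likelihood estimate.
for _ in range(100):
    w_pert = w_ls + 0.1 * rng.normal(size=D)
    assert neg_log_likelihood(w_pert) >= neg_log_likelihood(w_ls)
```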

 24. What Data Model Produces the Ridge Objective?
     We have the ridge regression objective; let D = ⟨(x_i, y_i)⟩_{i=1}^N denote the data:
     L_ridge(w; D) = (y − Xw)ᵀ(y − Xw) + λ wᵀw
     Let’s rewrite this objective slightly, scaling by 1/(2σ²) and setting λ = σ²/τ². To avoid ambiguity, we’ll denote this by L̃:
     L̃_ridge(w; D) = 1/(2σ²) · (y − Xw)ᵀ(y − Xw) + 1/(2τ²) · wᵀw
     Let Σ = σ² I_N and Λ = τ² I_D, where I_m denotes the m × m identity matrix:
     L̃_ridge(w; D) = ½ (y − Xw)ᵀ Σ⁻¹ (y − Xw) + ½ wᵀ Λ⁻¹ w
     Taking the negation of L̃_ridge(w; D) and exponentiating gives us a non-negative function of w and D which, after normalisation, gives a density function:
     f(w; D) = exp( −½ (y − Xw)ᵀ Σ⁻¹ (y − Xw) ) · exp( −½ wᵀ Λ⁻¹ w )
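A quick numerical sanity check of this rewriting (synthetic data, not from the slides): with λ = σ²/τ², the ridge solution w = (XᵀX + λI)⁻¹ Xᵀy is exactly a stationary point of L̃:

```python
# The ridge closed-form solution zeroes the gradient of the rescaled
# objective L~(w) = (1/(2 sigma^2)) ||y - Xw||^2 + (1/(2 tau^2)) ||w||^2.
import numpy as np

rng = np.random.default_rng(1)
N, D = 40, 4
sigma, tau = 0.5, 2.0
lam = sigma**2 / tau**2  # lambda = sigma^2 / tau^2, as on the slide

X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Ridge closed form: (X^T X + lambda I) w = X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Gradient of L~ at w_ridge: X^T (Xw - y) / sigma^2 + w / tau^2
grad = X.T @ (X @ w_ridge - y) / sigma**2 + w_ridge / tau**2
assert np.allclose(grad, 0)  # stationary point, as claimed
```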

 26. Bayesian Linear Regression (and connections to Ridge)
     Let’s start with the form of the density function we had on the previous slide and factor it:
     f(w; D) = exp( −½ (y − Xw)ᵀ Σ⁻¹ (y − Xw) ) · exp( −½ wᵀ Λ⁻¹ w )
     We’ll treat σ as fixed and not as a parameter. Up to a constant factor (which doesn’t matter when optimising w.r.t. w), we can rewrite this as
     p(w | X, y) ∝ N(y | Xw, Σ) · N(w | 0, Λ)
     (posterior ∝ likelihood · prior)
     where N(· | µ, Σ) denotes the density of the multivariate normal distribution with mean µ and covariance matrix Σ.
     ◮ What the ridge objective is actually finding is the maximum a posteriori (MAP) estimate, which is a mode of the posterior distribution
     ◮ The linear model is as described before, with Gaussian noise
     ◮ The prior distribution on w is assumed to be a spherical Gaussian
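The posterior here is itself Gaussian (a standard result, not derived on the slide): p(w | X, y) = N(w | m, S) with S = (XᵀX/σ² + I/τ²)⁻¹ and m = S Xᵀy/σ². Since a Gaussian's mode is its mean, the MAP estimate coincides with the ridge solution; a sketch on synthetic data:

```python
# MAP estimate of Bayesian linear regression equals the ridge solution
# when the noise is N(0, sigma^2 I) and the prior is N(0, tau^2 I).
import numpy as np

rng = np.random.default_rng(2)
N, D = 30, 3
sigma, tau = 1.0, 0.5
lam = sigma**2 / tau**2

X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Gaussian posterior N(w | m, S):
#   S = (X^T X / sigma^2 + I / tau^2)^{-1},  m = S X^T y / sigma^2
S = np.linalg.inv(X.T @ X / sigma**2 + np.eye(D) / tau**2)
m = S @ X.T @ y / sigma**2  # posterior mean = posterior mode (MAP)

# Ridge closed form with lambda = sigma^2 / tau^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
assert np.allclose(m, w_ridge)
```

A design note: the full posterior carries more information than the MAP point estimate, since S quantifies the remaining uncertainty in w.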
