Bayesian machine learning: a tutorial


  1. Bayesian machine learning: a tutorial
  Rémi Bardenet, CNRS & CRIStAL, Univ. Lille, France

  2. Outline
  The what: Typical statistical problems; Statistical decision theory; Posterior expected utility and Bayes rules
  The why: The philosophical why; The practical why
  The how: Conjugacy; Monte Carlo methods; Metropolis-Hastings; Variational approximations
  In depth with Gaussian processes in ML: From linear regression to GPs; Modeling and learning; More applications
  References and open issues


  3. Typical jobs for statisticians
  Estimation
  ◮ You have data x_1, ..., x_n that you assume drawn from p(x_1, ..., x_n | θ⋆), with θ⋆ ∈ R^d.
  ◮ You want an estimate θ̂(x_1, ..., x_n) of θ⋆ ∈ R^d.
  Confidence regions
  ◮ You have data x_1, ..., x_n that you assume drawn from p(x_1, ..., x_n | θ⋆), with θ⋆ ∈ R^d.
  ◮ You want a region A(x_1, ..., x_n) ⊂ R^d and to make a statement that θ⋆ ∈ A(x_1, ..., x_n) with some certainty.
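As a quick aside beyond the slides, here is a minimal numerical sketch of these two jobs, assuming i.i.d. Gaussian data with known variance (all values below are illustrative, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: n i.i.d. draws from N(theta_star, sigma^2), with sigma known.
theta_star, sigma, n = 2.0, 1.0, 100
x = rng.normal(theta_star, sigma, size=n)

# Point estimate: the sample mean.
theta_hat = x.mean()

# 95% confidence interval, using the fact that the sample mean has variance sigma^2 / n.
half_width = 1.96 * sigma / np.sqrt(n)
print(theta_hat, (theta_hat - half_width, theta_hat + half_width))
```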


  4. Statistical decision theory [1]
  Figure: Abraham Wald (1902–1950)
  [1] A. Wald. Statistical decision functions. Wiley, 1950.

  5. Statistical decision theory
  ◮ Let Θ be the “states of the world”, typically the space of parameters of interest.
  ◮ Decisions are functions d(x_1, ..., x_n) ∈ D.
  ◮ Let L(d, θ) denote the loss of making decision d when the state of the world is θ.
  ◮ Wald defines the risk of a decision as R(d, θ) = ∫ L(d(x_1:n), θ) p(x_1:n | θ) dx_1:n.
  ◮ Wald says d_1 is a better decision than d_2 if ∀ θ ∈ Θ, R(d_1, θ) ≤ R(d_2, θ). (1)
  ◮ d is called admissible if there is no better decision than d.
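The risk and the dominance criterion (1) are easy to probe numerically. Below is a small sketch (my addition, using an illustrative Gaussian mean problem with squared loss) that approximates R(d, θ) by Monte Carlo for two estimators and shows that neither dominates the other:

```python
import numpy as np

rng = np.random.default_rng(1)

def risk(decision, theta, sigma=1.0, n=20, n_rep=20_000):
    """Monte Carlo approximation of Wald's risk R(d, theta) under squared loss."""
    x = rng.normal(theta, sigma, size=(n_rep, n))  # n_rep datasets x_1, ..., x_n drawn given theta
    return np.mean((decision(x) - theta) ** 2)     # average loss of d(x_1, ..., x_n)

d1 = lambda x: x.mean(axis=1)        # the sample mean
d2 = lambda x: 0.5 * x.mean(axis=1)  # a shrunk estimator

for theta in [0.0, 1.0, 3.0]:
    print(theta, risk(d1, theta), risk(d2, theta))
# d2 wins near theta = 0 but loses badly at theta = 3: neither satisfies (1) uniformly,
# so neither is "better" than the other in Wald's sense.
```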


  6. Illustration with a simple estimation problem
  ◮ You have data x_1, ..., x_n that you assume drawn from p(x_1, ..., x_n | θ⋆) = ∏_{i=1}^n N(x_i | θ⋆, σ²), and you know σ².
  ◮ You choose a loss function, say L(θ̂, θ) = ‖θ̂ − θ‖².
  ◮ You restrict your decision space to unbiased estimators.
  ◮ The sample mean θ̃ := n⁻¹ ∑_{i=1}^n x_i is unbiased, and has minimum variance among unbiased estimators.
  ◮ Since R(θ̃, θ) = Var θ̃, θ̃ is the best decision you can make in Wald’s framework.
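As a sanity check beyond the slides, the identity R(θ̃, θ) = Var θ̃ = σ²/n can be verified by simulation (θ⋆, σ and n below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Numerical check: the sample mean's risk under squared loss equals its variance sigma^2 / n.
theta_star, sigma, n, n_rep = 1.5, 2.0, 50, 100_000
x = rng.normal(theta_star, sigma, size=(n_rep, n))
theta_tilde = x.mean(axis=1)

print(np.mean((theta_tilde - theta_star) ** 2), sigma**2 / n)  # both close to 0.08
```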

  7. Wald’s view of frequentist estimation
  Estimation
  ◮ You have data x_1, ..., x_n that you assume drawn from p(x_1, ..., x_n | θ⋆), with θ⋆ ∈ R^d.
  ◮ You want an estimate θ̂(x_1, ..., x_n) of θ⋆ ∈ R^d.
  A Waldian answer
  ◮ Our decisions are estimates d(x_1, ..., x_n) = θ̂(x_1, ..., x_n).
  ◮ We pick a loss, say L(d, θ) = L(θ̂, θ) = ‖θ̂ − θ‖².
  ◮ If you have an unbiased estimator with minimum variance, then this is the best decision among unbiased estimators.

  8. Wald’s view of frequentist estimation
  Estimation
  ◮ You have data x_1, ..., x_n that you assume drawn from p(x_1, ..., x_n | θ⋆), with θ⋆ ∈ R^d.
  ◮ You want an estimate θ̂(x_1, ..., x_n) of θ⋆ ∈ R^d.
  A Waldian answer
  ◮ Our decisions are estimates d(x_1, ..., x_n) = θ̂(x_1, ..., x_n).
  ◮ In general, the loss can be more complex and unbiased estimators unknown or irrelevant.
  ◮ In these cases, you may settle for a minimax estimator θ̂(x_1, ..., x_n) = arg min_d sup_θ R(d, θ).
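The minimax criterion can also be illustrated numerically; the sketch below (my addition, with an illustrative grid standing in for Θ) compares the worst-case risk of the sample mean with that of a shrunk estimator:

```python
import numpy as np

rng = np.random.default_rng(3)

def risk(decision, theta, sigma=1.0, n=20, n_rep=20_000):
    """Monte Carlo approximation of R(d, theta) under squared loss."""
    x = rng.normal(theta, sigma, size=(n_rep, n))
    return np.mean((decision(x) - theta) ** 2)

sample_mean = lambda x: x.mean(axis=1)
shrunk_mean = lambda x: 0.8 * x.mean(axis=1)

# Compare worst-case (sup over theta) risks on a finite grid standing in for Theta.
thetas = np.linspace(-3.0, 3.0, 13)
for name, d in [("sample mean", sample_mean), ("shrunk mean", shrunk_mean)]:
    print(name, max(risk(d, t) for t in thetas))
# The sample mean's risk is constant (sigma^2 / n), while the shrunk mean's risk grows
# with |theta|, so the sample mean has the smaller worst-case risk here.
```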

  9. Wald’s is only one view of frequentist statistics...
  ◮ On estimation, some would argue in favour of maximum likelihood [2].
  Figure: Ronald Fisher (1890–1962)
  [2] S. M. Stigler. “The epic story of maximum likelihood”. Statistical Science (2007), pp. 598–620.

  10. ... but bear with me, since it is predominant in machine learning
  For instance, supervised learning is usually formalized as
  g⋆ = arg min_g E L(y, g(x)),   (2)
  which you approximate by
  ĝ = arg min_g ∑_{i=1}^n L(y_i, g(x_i)) + penalty(g),
  while trying to control the excess risk
  E L(y, ĝ(x)) − E L(y, g⋆(x)).
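A minimal sketch of this penalized empirical risk minimization (my addition, not from the slides): squared loss over linear predictors g(x) = w·x with an L2 penalty, i.e. ridge regression on simulated data, followed by a rough plug-in estimate of the excess risk on fresh data.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated linear regression data; n, d, lam and the noise level are illustrative.
n, d, lam = 200, 5, 1.0
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)

# Closed-form minimizer of sum_i (y_i - w @ x_i)^2 + lam * ||w||^2.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Plug-in estimate of the excess risk E L(y, g_hat(x)) - E L(y, g_star(x)) on fresh data.
X_new = rng.normal(size=(10_000, d))
y_new = X_new @ w_star + 0.1 * rng.normal(size=10_000)
print(np.mean((y_new - X_new @ w_hat) ** 2) - np.mean((y_new - X_new @ w_star) ** 2))
```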

  11. Wald’s view of frequentist confidence regions
  Confidence regions
  ◮ You have data x_1, ..., x_n that you assume drawn from p(x_1, ..., x_n | θ⋆), with θ⋆ ∈ R^d.
  ◮ You want a region A(x_1, ..., x_n) ⊂ R^d and to make a statement that θ⋆ ∈ A(x_1, ..., x_n) with some certainty.
  A Waldian answer
  ◮ Our decisions are subsets of R^d: d(x_1:n) = A(x_1:n).
  ◮ A common loss is L(d, θ) = L(A, θ) = 1_{θ ∉ A} + γ|A|.
  ◮ So you want to find A(x_1:n) that minimizes R(A, θ⋆) = ∫ [1_{θ⋆ ∉ A(x_1:n)} + γ|A(x_1:n)|] p(x_1:n | θ⋆) dx_1:n.
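To see the trade-off this loss encodes, here is a small Monte Carlo sketch (my addition; θ⋆, σ, n and γ are illustrative) that evaluates the risk of interval regions A(x_1:n) = [x̄ − c, x̄ + c] for a few half-widths c:

```python
import numpy as np

rng = np.random.default_rng(5)

# Risk of A(x_1:n) = [mean - c, mean + c] under the loss 1_{theta_star not in A} + gamma * |A|.
theta_star, sigma, n, gamma, n_rep = 0.0, 1.0, 25, 0.5, 50_000
means = rng.normal(theta_star, sigma, size=(n_rep, n)).mean(axis=1)

for c in [0.1, 0.3, 0.5, 1.0]:
    miss = np.mean(np.abs(means - theta_star) > c)  # how often theta_star falls outside A
    print(c, miss, miss + gamma * 2 * c)            # non-coverage plus penalized volume |A| = 2c
# Small c misses theta_star often, large c pays in volume; the risk trades the two off.
```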
