
slide-1
SLIDE 1

Bayesian machine learning: a tutorial

Rémi Bardenet

CNRS & CRIStAL, Univ. Lille, France

Rémi Bardenet (CNRS & Univ. Lille) Bayesian ML 1

slide-2
SLIDE 2

Outline
The what: Typical statistical problems; Statistical decision theory; Posterior expected utility and Bayes rules
The why: The philosophical why; The practical why
The how: Conjugacy; Monte Carlo methods; Metropolis-Hastings; Variational approximations
In depth with Gaussian processes in ML: From linear regression to GPs; Modeling and learning; More applications
References and open issues


slide-4
SLIDE 4

Typical jobs for statisticians

Estimation
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆), with θ⋆ ∈ Rd.
◮ You want an estimate θ̂(x1, . . . , xn) of θ⋆.

Confidence regions
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆), with θ⋆ ∈ Rd.
◮ You want a region A(x1, . . . , xn) ⊂ Rd and a statement that θ⋆ ∈ A(x1, . . . , xn) with some certainty.


slide-6
SLIDE 6

Statistical decision theory1

Figure: Abraham Wald (1902–1950)

1 A. Wald. Statistical decision functions. Wiley, 1950.

slide-7
SLIDE 7

Statistical decision theory
◮ Let Θ be the “states of the world”, typically the space of parameters of interest.
◮ Decisions are functions d(x1, . . . , xn) ∈ D.
◮ Let L(d, θ) denote the loss of making decision d when the state of the world is θ.
◮ Wald defines the risk of a decision as

    R(d, θ) = ∫ L(d(x1:n), θ) p(x1:n|θ) dx1:n.

◮ Wald says d1 is a better decision than d2 if

    ∀θ ∈ Θ, R(d1, θ) ≤ R(d2, θ).   (1)

◮ d is called admissible if there is no better decision than d.


slide-13
SLIDE 13

Illustration with a simple estimation problem
◮ You have data x1, . . . , xn that you assume drawn from

    p(x1, . . . , xn|θ⋆) = ∏_{i=1}^n N(xi|θ⋆, σ2),

and you know σ2.
◮ You choose a loss function, say L(θ̂, θ) = ‖θ̂ − θ‖2.
◮ You restrict your decision space to unbiased estimators.
◮ The sample mean θ̃ := n−1 ∑_{i=1}^n xi is unbiased, and has minimum variance among unbiased estimators.
◮ Since R(θ̃, θ) = Var θ̃, θ̃ is the best decision you can make in Wald’s framework.

slide-14
SLIDE 14

Wald’s view of frequentist estimation

Estimation
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆), with θ⋆ ∈ Rd.
◮ You want an estimate θ̂(x1, . . . , xn) of θ⋆.

A Waldian answer
◮ Our decisions are estimates d(x1, . . . , xn) = θ̂(x1, . . . , xn).
◮ We pick a loss, say L(d, θ) = L(θ̂, θ) = ‖θ̂ − θ‖2.
◮ If you have an unbiased estimator with minimum variance, then this is the best decision among unbiased estimators.

slide-15
SLIDE 15

Wald’s view of frequentist estimation

Estimation
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆), with θ⋆ ∈ Rd.
◮ You want an estimate θ̂(x1, . . . , xn) of θ⋆.

A Waldian answer
◮ Our decisions are estimates d(x1, . . . , xn) = θ̂(x1, . . . , xn).
◮ In general, the loss can be more complex and unbiased estimators unknown/irrelevant.
◮ In these cases, you may settle for a minimax estimator

    θ̂(x1, . . . , xn) = arg min_d sup_θ R(d, θ).

slide-16
SLIDE 16

Wald’s is only one view of frequentist statistics...
◮ On estimation, some would argue in favour of maximum likelihood2.

Figure: Ronald Fisher (1890–1962)

2 S. M. Stigler. “The epic story of maximum likelihood”. In: Statistical Science (2007), pp. 598–620.

slide-17
SLIDE 17

... but bear with me, since it is predominant in machine learning

For instance, supervised learning is usually formalized as

    g⋆ = arg min_g E L(y, g(x)),   (2)

which you approximate by

    ĝ = arg min_g ∑_{i=1}^n L(yi, g(xi)) + penalty(g),

while trying to control the excess risk E L(y, ĝ(x)) − E L(y, g⋆(x)).

slide-18
SLIDE 18

Wald’s view of frequentist confidence regions

Confidence regions
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆), with θ⋆ ∈ Rd.
◮ You want a region A(x1, . . . , xn) ⊂ Rd and a statement that θ⋆ ∈ A(x1, . . . , xn) with some certainty.

A Waldian answer
◮ Our decisions are subsets of Rd: d(x1:n) = A(x1:n).
◮ A common loss is L(d, θ) = L(A, θ) = 1_{θ∉A} + γ|A|.
◮ So you want to find A(x1:n) that minimizes the risk

    R(A, θ⋆) = ∫ [1_{θ⋆∉A(x1:n)} + γ|A(x1:n)|] p(x1:n|θ⋆) dx1:n.

slide-19
SLIDE 19

Illustration with a simple confidence interval problem
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆) = ∏_{i=1}^n N(xi|θ⋆, σ2).
◮ You choose a loss function, say L(A, θ) = 1_{θ∉A} + γ|A|.
◮ You restrict your decisions to intervals centered around the sample mean θ̃.
◮ Since (θ̃ − θ)/(σ/√n) ∼ N(0, 1), we know (exercise) that for Ã := [θ̃ − kσ/√n, θ̃ + kσ/√n], it follows that

    R(Ã, θ) = P(|N(0, 1)| ≥ k) + 2γkσ/√n.

◮ All that is left to do is choose k.
◮ Textbook examples bypass the need for γ: they fix α > 0 and find the smallest k such that P(|N(0, 1)| ≥ k) ≤ α.
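The textbook recipe in the last bullet is easy to check numerically. A minimal sketch (standard library only; the function names are mine, not from the slides): since P(|N(0, 1)| ≥ k) = erfc(k/√2), the smallest valid k can be found by bisection.

```python
import math

def z_crit(alpha):
    """Smallest k with P(|N(0,1)| >= k) = erfc(k/sqrt(2)) <= alpha, by bisection."""
    lo, hi = 0.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.erfc(mid / math.sqrt(2)) <= alpha:
            hi = mid   # mid is already a valid k, shrink from above
        else:
            lo = mid
    return hi

def confidence_interval(sample_mean, sigma, n, alpha=0.05):
    """The interval [mean - k sigma/sqrt(n), mean + k sigma/sqrt(n)] from the slide."""
    k = z_crit(alpha)
    half = k * sigma / math.sqrt(n)
    return sample_mean - half, sample_mean + half
```

For α = 0.05 this recovers the familiar k ≈ 1.96.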

slide-20
SLIDE 20

Summary so far
◮ Waldian frequentists measure risks as expectations w.r.t. the data-generating process:

    R(d, θ) = ∫ L(d(x1:n), θ) p(x1:n|θ) dx1:n.

◮ One major difficulty is that the risk remains a function of θ.
◮ Without additional structure (unbiasedness, Gaussianity, etc.), it is difficult to go beyond minimax rules.

Idea
What if we introduced a distribution on Θ, and tried to minimize

    r(d) = ∫ R(d, θ) p(θ) dθ = ∫∫ L(d(x1:n), θ) p(x1:n|θ) dx1:n p(θ) dθ?

slide-21
SLIDE 21

From expected frequentist loss to posterior expected loss

    r(d) = ∫ R(d, θ) p(θ) dθ
         = ∫ [∫ L(d(x1:n), θ) p(x1:n|θ) dx1:n] p(θ) dθ
         = ∫ [∫ L(d(x1:n), θ) p(x1:n|θ) p(θ) dθ] dx1:n
         = ∫ [∫ L(d(x1:n), θ) (p(x1:n|θ) p(θ) / p(x1:n)) dθ] p(x1:n) dx1:n
         = ∫ [∫ L(d(x1:n), θ) p(θ|x1:n) dθ] p(x1:n) dx1:n.

slide-22
SLIDE 22

Bayesians minimize posterior expected utility

The posterior expected utility paradigm: Bayes rules
Pick d to solve

    arg min_d ∫ L(d(x1:n), θ) p(θ|x1:n) dθ.

Bayes rules have good frequentist properties3
◮ Under general conditions, Bayes decision rules are admissible, and all admissible rules are limits of Bayes rules.
◮ Bayes rules with “least favourable priors” are minimax.

3 G. Parmigiani and L. Inoue. Decision theory: principles and approaches. Vol. 812. John Wiley & Sons, 2009.

slide-23
SLIDE 23

Illustration with a simple estimation problem
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆) = ∏_{i=1}^n N(xi|θ⋆, σ2), and you know σ2.
◮ You choose a loss function, say L(θ̂, θ) = ‖θ̂ − θ‖2.
◮ You choose a prior p over θ.
◮ Your Bayes decision minimizes ∫ ‖θ̂ − θ‖2 p(θ|x1:n) dθ, so you pick the posterior mean

    θ̂ = ∫ θ p(θ|x1:n) dθ.

◮ Conceptually, it is simpler. In practice, you need to compute an integral.
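A quick numerical sanity check of the posterior-mean rule (a sketch with made-up numbers, not from the slides): for a N(0, τ2) prior on the mean of Gaussian data, the posterior is Gaussian with a closed-form mean, and a brute-force quadrature of ∫ θ p(θ|x1:n) dθ should agree with it.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, tau2, n = 1.0, 4.0, 50           # known noise variance, prior variance
theta_true = 0.7
x = rng.normal(theta_true, np.sqrt(sigma2), size=n)

# closed-form posterior for the prior theta ~ N(0, tau2): N(m_post, v_post)
v_post = 1.0 / (n / sigma2 + 1.0 / tau2)
m_post = v_post * n * x.mean() / sigma2  # Bayes estimate under L2 loss

# brute-force quadrature of the posterior mean, same model
theta_grid = np.linspace(-5.0, 5.0, 4001)
log_post = (-0.5 * ((x[:, None] - theta_grid[None, :]) ** 2).sum(axis=0) / sigma2
            - 0.5 * theta_grid ** 2 / tau2)
w = np.exp(log_post - log_post.max())    # unnormalized posterior on the grid
m_quad = (theta_grid * w).sum() / w.sum()
```

The two answers coincide up to quadrature error, which is what "compute an integral" amounts to in this conjugate case.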

slide-24
SLIDE 24

Illustration with a simple confidence interval problem
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆) = ∏_{i=1}^n N(xi|θ⋆, σ2).
◮ You choose a loss function, say L(A, θ) = 1_{θ∉A} + γ|A|.
◮ You choose a prior p over θ.
◮ Your Bayes decision minimizes

    ∫ 1_{θ∉A} p(θ|x1:n) dθ + γ|A|.

◮ Conceptually, it is simpler. In practice, you need to carefully pick your prior and/or restrict the decision space and/or compute many integrals.

slide-25
SLIDE 25

Summary so far ◮ Bayes rules fit into Wald’s framework. ◮ For a fixed prior, the Bayesian risk completely orders decision rules. ◮ The key idea is posterior expected utility. ◮ You can answer most basic statistical questions using this principle: [more examples].


slide-26
SLIDE 26

A recent motivating success

GW170814: A Three-Detector Observation of Gravitational Waves from a Binary Black Hole Coalescence

B. P. Abbott et al. (LIGO Scientific Collaboration and Virgo Collaboration)

(Received 23 September 2017; published 6 October 2017)

On August 14, 2017 at 10:30:43 UTC, the Advanced Virgo detector and the two Advanced LIGO detectors coherently observed a transient gravitational-wave signal produced by the coalescence of two stellar-mass black holes, with a false-alarm rate of ≲ 1 in 27 000 years. The signal was observed with a three-detector network matched-filter signal-to-noise ratio of 18. The inferred masses of the initial black holes are 30.5+5.7−3.0 M⊙ and 25.3+2.8−4.2 M⊙ (at the 90% credible level). The luminosity distance of the source is 540+130−210 Mpc, corresponding to a redshift of z = 0.11+0.03−0.04. A network of three detectors improves the sky localization of the source, reducing the area of the 90% credible region from 1160 deg2 using only the two LIGO detectors to 60 deg2 using all three detectors. For the first time, we can test the nature of gravitational-wave polarizations from the antenna response of the LIGO-Virgo network, thus enabling a new class of phenomenological tests of gravity.

DOI: 10.1103/PhysRevLett.119.141101 (PRL 119, 141101 (2017))


slide-28
SLIDE 28

The subjectivistic viewpoint
◮ Top requirement is internal coherence of decisions.
◮ Favours interpreting probability distributions as personal beliefs.

Figure: Bruno de Finetti (1906–1985) and L. Jimmie Savage (1917–1971)

slide-29
SLIDE 29

The logical justification
◮ Top requirement is to find a version of propositional logic that allows taking uncertainty into account.
◮ Also favours interpreting probability distributions as beliefs, but aims for objective priors.

Figure: Richard T. Cox (1898–1991), Edwin T. Jaynes (1922–1998), and Harold Jeffreys (1891–1989)

slide-30
SLIDE 30

The hybrid view4
◮ The starting point is posterior expected utility, loosely justified by Wald’s theory.
◮ It is simple, widely applicable, and has good frequentist properties.
◮ It satisfies the likelihood principle.
◮ It is easy to interpret: beliefs are
  ◮ represented by probabilities,
  ◮ updated using Bayes’ rule,
  ◮ integrated when making decisions.
◮ It is easy to communicate your uncertainty:
  ◮ simply give your posterior;
  ◮ when making a decision, make sure that the priors of everyone involved would yield the same decision.

4 C. P. Robert. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.

slide-31
SLIDE 31

Practical advantages of posterior expected utility
◮ Conceptually answers all ML problems.
◮ Suits all applications where quantifying uncertainty matters more than computational cost: all basic sciences, health, even one-shot commercial decisions.
◮ We never invoked any large-sample argument, so it suits datasets of all sizes.



slide-37
SLIDE 37

Conjugacy
◮ Say we have a linear regression problem: yi = f(xi) + εi, f(x) = θTx, εi i.i.d. Gaussians N(0, σ2).
◮ If we choose θ ∼ N(0, Σ), then (exercise)

    p(θ|(x, y)1:n) ∝ p((x, y)1:n|θ) p(θ) = N(σ−2A−1XTy, A−1),

where A = σ−2XTX + Σ−1.
◮ If the loss is not too complicated, then integrals are easy. For instance, prediction with L2 loss is simple:

    ŷ⋆ := arg min_{y⋆} ∫ (y⋆ − θTx⋆)2 p(θ|(x, y)1:n) dθ = σ−2x⋆TA−1XTy.
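The conjugate computation above can be checked in a few lines of NumPy. This is a sketch with simulated data; the variable names and the test point x⋆ are illustrative, and X is the n × d design matrix (one row per observation), so the posterior mean reads σ−2A−1XTy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma2 = 200, 3, 0.25
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))                        # design matrix, one row per x_i
y = X @ theta_star + rng.normal(scale=np.sqrt(sigma2), size=n)

Sigma = np.eye(d)                                  # prior: theta ~ N(0, Sigma)
A = X.T @ X / sigma2 + np.linalg.inv(Sigma)        # A = sigma^-2 X^T X + Sigma^-1
theta_post = np.linalg.solve(A, X.T @ y) / sigma2  # posterior mean sigma^-2 A^-1 X^T y
post_cov = np.linalg.inv(A)                        # posterior covariance A^-1

x_star = np.array([1.0, 1.0, 1.0])
y_hat = x_star @ theta_post                        # Bayes prediction under L2 loss
```

With enough data, the posterior mean concentrates near the parameter that generated the data, as expected.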

slide-38
SLIDE 38

Monte Carlo methods
◮ Sometimes, you’re less lucky. Say we’re doing logistic regression:

    yi ∼ Bernoulli[σ(f(xi))], with f(x) = θTx, σ(x) = 1/(1 + e−x).

◮ Even if we choose θ ∼ N(0, Σ), the posterior p(θ|(x, y)1:n) ∝ p((x, y)1:n|θ) p(θ) does not have a simple closed form.
◮ We need powerful numerical integration methods, that is, constructions of nodes (θi) and weights (wi) such that

    ∫ h dπ ≈ ∑_{i=1}^N wi h(θi).
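The simplest such construction is vanilla Monte Carlo: sample the nodes θi from π itself and take uniform weights wi = 1/N. A hedged sketch on a toy Beta-Bernoulli posterior, where the answer is known exactly (the counts are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# toy target: Beta(2, 2) prior on a coin's bias, 7 heads in 10 flips
a, b = 2 + 7, 2 + 3                   # posterior is Beta(9, 5)
h = lambda theta: theta               # integrand: posterior mean

# vanilla Monte Carlo: nodes theta_i ~ pi, uniform weights w_i = 1/N
theta = rng.beta(a, b, size=100_000)
estimate = h(theta).mean()

exact = a / (a + b)                   # Beta(9, 5) mean = 9/14
```

Here we can sample π directly; the point of MCMC below is to build such nodes when direct sampling from the posterior is impossible.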


slide-41
SLIDE 41

Metropolis-Hastings

MH(π(θ), q(θ′|θ), θ0, Niter)
1  for k ← 1 to Niter
2      θ ← θk−1
3      θ′ ∼ q(·|θ), u ∼ U(0, 1)
4      α ← [π(θ′) q(θ|θ′)] / [π(θ) q(θ′|θ)]
5      if u < α
6          θk ← θ′   ⊲ Accept
7      else θk ← θ   ⊲ Reject
8  return (θk)k=1,...,Niter
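A direct Python transcription of the pseudocode, working on the log scale for numerical stability. The N(0, 1) target and the Gaussian random-walk proposal are illustrative assumptions, not from the slides:

```python
import numpy as np

def metropolis_hastings(log_pi, sample_q, log_q, theta0, n_iter, rng):
    """Metropolis-Hastings with densities handled on the log scale."""
    theta = float(theta0)
    chain = np.empty(n_iter)
    for k in range(n_iter):
        theta_prop = sample_q(theta, rng)                 # theta' ~ q(.|theta)
        log_alpha = (log_pi(theta_prop) - log_pi(theta)   # pi(theta')/pi(theta)
                     + log_q(theta, theta_prop) - log_q(theta_prop, theta))
        if np.log(rng.uniform()) < log_alpha:             # u < alpha
            theta = theta_prop                            # accept
        chain[k] = theta                                  # on reject, keep theta
    return chain

# illustration: target pi = N(0, 1), symmetric Gaussian random-walk proposal
rng = np.random.default_rng(42)
log_pi = lambda t: -0.5 * t**2                 # unnormalized log target
sample_q = lambda t, rng: t + rng.normal(scale=1.0)
log_q = lambda a, b: 0.0                       # symmetric proposal: ratio cancels
chain = metropolis_hastings(log_pi, sample_q, log_q, 0.0, 20_000, rng)
```

Note that only the unnormalized density of π is needed, which is exactly the situation of the logistic-regression posterior above.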


slide-46
SLIDE 46

The MCMC magic
◮ Under assumptions5, a central limit theorem holds:

    √Niter ( (1/Niter) ∑_{k=1}^{Niter} h(θk) − ∫ h(θ) π(θ) dθ ) → N(0, σ2lim(h)).

◮ If you choose q carefully, you can hope for a polynomial increase of the mixing time and σ2lim(h) with d.
◮ Most MCMC algorithms are instances of Metropolis-Hastings with clever choices of proposal6, even the NUTS HMC of Stan and PyMC3.
◮ For nice illustrations, check out https://chi-feng.github.io/mcmc-demo/

5 R. Douc et al. Nonlinear time series. Chapman-Hall, 2014.
6 C. P. Robert and G. Casella. Monte Carlo Statistical Methods. New York: Springer-Verlag, 2004.

slide-47
SLIDE 47

Variational approximations
◮ When in a hurry, you can settle for a good approximation to your posterior π(θ) = p(θ|x) ∝ p(x|θ)p(θ): minimize over q

    KL(q, π) = Eq log q − Eq log p(θ|x)
             = −[−Eq log q + Eq log p(x, θ)] + log p(x).

◮ Equivalently, we can maximize the evidence lower bound (ELBO)7.
◮ Ideally, I would rather cast the choice of q into a Wald-like problem.

7 D. M. Blei et al. “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877.
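The identity KL(q, π) = −ELBO(q) + log p(x) can be verified in closed form on a toy conjugate model. The model below is my choice for illustration: prior θ ∼ N(0, 1) and one observation x|θ ∼ N(θ, 1), so the posterior is N(x/2, 1/2) and the ELBO is tight exactly there.

```python
import numpy as np

x = 2.0  # one observation; prior theta ~ N(0,1), likelihood x|theta ~ N(theta,1)

def elbo(m, s2):
    """ELBO(q) = E_q[log p(x, theta)] + H(q) for q = N(m, s2), in closed form."""
    e_log_joint = (-np.log(2 * np.pi)            # two Gaussian normalizers
                   - ((x - m) ** 2 + s2) / 2     # E_q (x - theta)^2
                   - (m ** 2 + s2) / 2)          # E_q theta^2
    entropy = 0.5 * np.log(2 * np.pi * s2) + 0.5
    return e_log_joint + entropy

# evidence: marginally x ~ N(0, 2), so log p(x) is computable
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4
```

Evaluating `elbo` away from (x/2, 1/2) gives strictly smaller values, which is the sense in which maximizing the ELBO recovers the posterior within the variational family.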



slide-54
SLIDE 54

Linear regression
◮ yi = f(xi) + εi, f(x) = θTx, εi i.i.d. Gaussians N(0, σ2).
◮ If we choose θ ∼ N(0, Σ), then

    p(θ|(x, y)1:n) ∝ p((x, y)1:n|θ) p(θ)
                  ∝ exp( −‖y − Xθ‖2/(2σ2) − θTΣ−1θ/2 )
                  ∝ exp( yTXθ/σ2 − θTXTXθ/(2σ2) − θTΣ−1θ/2 )
                  ∝ exp( −(1/2) (θ − σ−2A−1XTy)T A (θ − σ−2A−1XTy) )
                  = N(σ−2A−1XTy, A−1),

where A = σ−2XTX + Σ−1.
◮ Remember prediction with L2 loss is simple:

    arg min_{y⋆} ∫ (y⋆ − f(x⋆))2 p(θ|(x, y)1:n) dθ = σ−2x⋆TA−1XTy.

slide-55
SLIDE 55

Linear regression
◮ With the same posterior p(θ|(x, y)1:n) = N(σ−2A−1XTy, A−1), where A = σ−2XTX + Σ−1, we can even check that

    p(f(x⋆)|x⋆, (x, y)1:n) = N(σ−2x⋆TA−1XTy, x⋆TA−1x⋆).

slide-56
SLIDE 56

Linear regression with nonlinear features
◮ Replace each x by a vector of features ϕ(x) ∈ Rp:

    yi = f(xi) + εi, i = 1, . . . , n,
    f(x) = θTϕ(x), εi i.i.d. Gaussians N(0, σ2), θ ∼ N(0, Σ).

◮ Think ϕ(x) = (1, x1, x2, x1x2, . . . ).
◮ Recall

    p(f(x⋆)|x⋆, (x, y)1:n) = N(σ−2ϕ⋆TA−1ΦTy, ϕ⋆TA−1ϕ⋆),

where Φ is the n × p feature matrix, ϕ⋆ = ϕ(x⋆), and A = σ−2ΦTΦ + Σ−1.
◮ Requires a p × p inversion.
◮ But let K = ΦΣΦT; then we can rewrite (exercise)

    p(f(x⋆)|(x, y)1:n) = N(µ⋆, σ⋆2),

where

    µ⋆ = ϕ⋆TΣΦT(K + σ2I)−1y,
    σ⋆2 = ϕ⋆TΣϕ⋆ − ϕ⋆TΣΦT(K + σ2I)−1ΦΣϕ⋆.

slide-57
SLIDE 57

Gaussian processes
◮ A distribution over a space of functions f : Rd → R.

Gaussian processes
If ∀p ∈ N, ∀x1, . . . , xp ∈ Rd,

    [f(x1), . . . , f(xp)]T ∼ N(m, K),

where m = [µ(x1), . . . , µ(xp)]T and K = ((K(xi, xj))), then we say f ∼ GP(µ, K).

◮ Uniqueness is usually easy, existence is tricky.
◮ A necessary condition is that all matrices K are positive semi-definite.
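For a concrete feel, here is a minimal GP regression sketch with a squared-exponential kernel (the course notebook referenced on the following slide is the authoritative version; the data, kernel, and hyperparameters below are illustrative assumptions):

```python
import numpy as np

def rbf(a, b, ell=1.0, var=1.0):
    """Squared-exponential kernel k(x, x') = var * exp(-(x - x')^2 / (2 ell^2))."""
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 20)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)   # noisy observations, noise std 0.1
sigma2 = 0.01                                   # matching noise variance

x_star = np.linspace(0.0, 5.0, 50)              # test inputs
K = rbf(x, x)
K_s = rbf(x, x_star)                            # cross-covariances k(x_i, x_star)
K_ss = rbf(x_star, x_star)

# posterior predictive: N(K_s^T (K + sigma2 I)^-1 y, K_ss - K_s^T (K + sigma2 I)^-1 K_s)
B = np.linalg.solve(K + sigma2 * np.eye(x.size), np.c_[y[:, None], K_s])
mu_star = K_s.T @ B[:, 0]
cov_star = K_ss - K_s.T @ B[:, 1:]
```

The posterior mean tracks the underlying sine function, and the diagonal of `cov_star` quantifies pointwise uncertainty.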

slide-58
SLIDE 58

Sampling, conditioning and predicting See notebook 01 on https://github.com/rbardenet/bnp-course


slide-59
SLIDE 59

Commonly-used kernels



slide-61
SLIDE 61

Learning
◮ In regression,

    p(y|x, θ) = ∫ p(y|f) p(f|x, θ) df = N(y|0, Kθ + σ2In).

◮ So simply put a prior over η = (σ, θ) and integrate.
◮ Prediction becomes

    f⋆ ∼ ∫ p(f⋆|x, η) p(η|x, y) dη.

◮ Alternatively, lots of people maximize the marginal likelihood.

slide-62
SLIDE 62

Beyond regression: classification8
◮ (Exercise) Find a simple classification model with GPs.
◮ Take for instance yi|xi, f ∼ Bernoulli(σ(f(xi))), i.e. p(y = +1|x, f) = σ(f(x)), with f ∼ GP(0, K).
◮ Problem: prediction is not easy anymore:

    p(f⋆|X, y, x⋆) = ∫ p(f⋆|X, f, x⋆) p(f|X, y) df.

8 C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.


slide-65
SLIDE 65

Beyond regression: ranking9
◮ (Exercise) Find a simple ranking model with GPs: your data is (u, v)1:n where ∀i, ui ≺ vi. Your user wants to know whether a new u⋆ ≺ v⋆.
◮ Take for instance p(u ≺ v|u, v, f) = ϕ(f(v) − f(u)), with f ∼ GP(0, K) and ϕ increasing.
◮ Same difficulties with learning.

9 W. Chu and Z. Ghahramani. “Preference learning with Gaussian processes”. In: Proceedings of the 22nd International Conference on Machine Learning. 2005, pp. 137–144.


slide-68
SLIDE 68

Emulators of expensive models

RESEARCH ARTICLE

Bayesian Sensitivity Analysis of a Cardiac Cell Model Using a Gaussian Process Emulator

Eugene TY Chang1,2, Mark Strong3, Richard H Clayton1,2*

1 Insigneo Institute for in-silico Medicine, University of Sheffield, Sheffield, United Kingdom, 2 Department of Computer Science University of Sheffield, Sheffield, United Kingdom, 3 School of Health and Related Research, University of Sheffield, Sheffield, United Kingdom * r.h.clayton@sheffield.ac.uk

Abstract

Models of electrical activity in cardiac cells have become important research tools as they can provide a quantitative description of detailed and integrative physiology. However, car- diac cell models have many parameters, and how uncertainties in these parameters affect the model output is difficult to assess without undertaking large numbers of model runs. In this study we show that a surrogate statistical model of a cardiac cell model (the Luo-Rudy 1991 model) can be built using Gaussian process (GP) emulators. Using this approach we a11111

Citation: Chang ETY, Strong M, Clayton RH (2015) Bayesian Sensitivity Analysis of a Cardiac Cell Model


slide-69
SLIDE 69

Nonparametric fits

arXiv:1204.2272v2 [astro-ph.CO] 10 Jul 2012

Gaussian Process Cosmography

Arman Shafieloo1, Alex G. Kim2, Eric V. Linder1,2,3

1 Institute for the Early Universe WCU, Ewha Womans University, Seoul, Korea 2 Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA and 3 University of California, Berkeley, CA 94720, USA

(Dated: July 11, 2012)
Gaussian processes provide a method for extracting cosmological information from observations without assuming a cosmological model. We carry out cosmography – mapping the time evolution of the cosmic expansion – in a model-independent manner using kinematic variables and a geometric probe of cosmology. Using the state of the art supernova distance data from the Union2.1 compilation, we constrain, without any assumptions about dark energy parametrization or matter density, the Hubble parameter and deceleration parameter as a function of redshift. Extraction of these relations is tested successfully against models with features on various coherence scales, subject to certain statistical cautions.

I. INTRODUCTION
Cosmic acceleration is a fundamental mystery of great interest and importance to understanding cosmology, gravitation, and high energy physics. The cosmic expansion rate is slowed down by gravitationally attractive matter and sped up by some other, unknown contribution to the dynamical equations. While great effort is being put into identifying the source of this extra dark energy contribution, the overall expansion behavior also holds important clues to origin, evolution, and present […] fitting procedures have been suggested, e.g. [6], but tend to induce bias in the function reconstruction due to parametric restriction of the behavior or to have poor error control. Using a general orthonormal basis or principal component analysis is another approach, to describe the distance-redshift relation (e.g. [7]) or the deceleration parameter [8], or using a correlated prior for smoothness on the dark energy equation of state [9], but in practice a finite (and small) number of modes is significant beyond the prior, essentially reducing to a parametric approach. Gaussian processes [10] offer an interesting possibility for improving this situation.


slide-70
SLIDE 70

Natural language processing

Using Gaussian Processes for Rumour Stance Classification in Social Media

MICHAL LUKASIK, University of Sheffield KALINA BONTCHEVA, University of Sheffield TREVOR COHN, University of Melbourne ARKAITZ ZUBIAGA, University of Warwick MARIA LIAKATA, University of Warwick ROB PROCTER, University of Warwick

Social media tend to be rife with rumours while new reports are released piecemeal during breaking news. Interestingly, one can mine multiple reactions expressed by social media users in those situations, exploring their stance towards rumours, ultimately enabling the flagging of highly disputed rumours as being potentially false. In this work, we set out to develop an automated, supervised classifier that uses multi-task learning to classify the stance expressed in each individual tweet in a rumourous conversation as either supporting, denying or questioning the rumour. Using a classifier based on Gaussian Processes, and exploring its effectiveness on two datasets with very different characteristics and varying distributions of stances, we show that our approach consistently outperforms competitive baseline classifiers. Our classifier is especially effective in estimating the distribution of different types of stance associated with a given rumour, which we set forth as a desired characteristic for a rumour-tracking system that will warn both ordinary users of Twitter and professional news practitioners when a rumour is being rebutted.

1. INTRODUCTION

There is an increasing need to interpret and act upon rumours spreading quickly through social media during breaking news, where new reports are released piecemeal and often have an unverified status at the time of posting. Previous research has posited the damage that the diffusion of false

arXiv:1609.01962v1 [cs.CL] 7 Sep 2016


slide-71
SLIDE 71

Bayesian optimization for hyperparameter tuning

Algorithms for Hyper-Parameter Optimization

James Bergstra, The Rowland Institute, Harvard University, bergstra@rowland.harvard.edu
Rémi Bardenet, Laboratoire de Recherche en Informatique, Université Paris-Sud, bardenet@lri.fr
Yoshua Bengio, Dépt. d'Informatique et de Recherche Opérationnelle, Université de Montréal, yoshua.bengio@umontreal.ca
Balázs Kégl, Linear Accelerator Laboratory, Université Paris-Sud, CNRS, balazs.kegl@gmail.com

Abstract

Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few


slide-72
SLIDE 72

Bayesian optimization
◮ Goal is to minimize a noisy f with N iterations, N small.
◮ Key application: find the hyperparameters of your ML algorithm that minimize the validation error.
◮ Idea is to sequentially
◮ update your model of f,
◮ optimize an acquisition criterion.

◮ Check out notebook 03 on https://github.com/rbardenet/bnp-course.
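The sequential loop above can be sketched end to end in a few lines. To keep the snippet self-contained, the GP surrogate is replaced by a crude kernel-weighted stand-in whose predictive uncertainty shrinks near observed points; treat this as an illustration of the loop structure, not of GP-based BO itself, and note that `surrogate` and `bayes_opt` are made-up names:

```python
import math
import random

def surrogate(x, X, Y, length=1.0):
    """Crude stand-in for a GP posterior: kernel-weighted mean prediction,
    with an uncertainty that vanishes near the observed points."""
    w = [math.exp(-((x - xi) / length) ** 2) for xi in X]
    mean = sum(wi * yi for wi, yi in zip(w, Y)) / sum(w)
    sigma = math.sqrt(max(1e-12, 1.0 - max(w)))  # ~0 at observed points
    return mean, sigma

def expected_improvement(mean, sigma, best):
    # Closed-form EI for a Gaussian predictive N(mean, sigma^2), minimization.
    u = (best - mean) / sigma
    Phi = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sigma * (u * Phi + phi)

def bayes_opt(f, lo, hi, n_init=3, n_iter=10, seed=0):
    """Sequential loop: update the model on the data, then maximize EI."""
    rng = random.Random(seed)
    X = [lo + (hi - lo) * rng.random() for _ in range(n_init)]
    Y = [f(x) for x in X]
    grid = [lo + (hi - lo) * i / 200 for i in range(201)]
    for _ in range(n_iter):
        best = min(Y)
        x_next = max(grid,
                     key=lambda x: expected_improvement(*surrogate(x, X, Y), best))
        X.append(x_next)
        Y.append(f(x_next))
    i = min(range(len(Y)), key=Y.__getitem__)
    return X[i], Y[i]
```

With a real GP surrogate (e.g. notebook 03 above), `surrogate` would return the GP posterior mean and standard deviation instead.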


slide-73
SLIDE 73

An example in 1D

[Figure: one step of Bayesian optimization in 1D — top panel, the target function; bottom panel, the EI acquisition criterion over the same interval.]


slide-83
SLIDE 83

Popular acquisition criteria
Expected improvement10

EI(z) = E[ max(mN − f(z), 0) | (x, y)1:N ],

where mN = min1≤i≤N f(xi). An easy computation yields

EI(z) = σ(z) [ uΦ(u) + ϕ(u) ],  (3)

where u = (mN − m(z)) / σ(z), and Φ and ϕ denote the cdf and pdf of the N(0, 1) distribution.

10D. R. Jones. “A Taxonomy of Global Optimization Methods Based on Response Surfaces”. In: Journal of Global Optimization 21 (2001), pp. 345–383.
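The "easy computation" behind (3) can be verified numerically: under the GP posterior, f(z) is Gaussian with mean m(z) and standard deviation σ(z), so EI is a one-dimensional Gaussian expectation. A stdlib-only check (the function names are ours):

```python
import math
import random

def ei_closed_form(m, sigma, m_best):
    # EI(z) = sigma [ u Phi(u) + phi(u) ],  u = (m_best - m) / sigma
    u = (m_best - m) / sigma
    Phi = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sigma * (u * Phi + phi)

def ei_monte_carlo(m, sigma, m_best, n=200_000, seed=1):
    # E[ max(m_best - f(z), 0) ] with f(z) ~ N(m, sigma^2) under the posterior
    rng = random.Random(seed)
    return sum(max(m_best - rng.gauss(m, sigma), 0.0) for _ in range(n)) / n
```

The two estimates agree up to Monte Carlo error, which is the kind of sanity check worth running before trusting any hand-derived acquisition formula.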


slide-84
SLIDE 84

GP-UCB (Srinivas et al., 2010)

GP-UCB(z) = m(z) + β σ(z).

◮ If β is properly tuned, bandit results apply.
◮ The first criterion to come with theoretical results.
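A toy illustration of the role of β (the numbers are made up): with small β the criterion exploits the candidate with the best posterior mean, while a larger β favours the more uncertain one. Note that the bandit guarantees of Srinivas et al. rely on a specific schedule βt growing slowly with the iteration count; a constant β is used here only for illustration.

```python
def gp_ucb(mean, sigma, beta):
    # Upper confidence bound (maximization convention): reward candidates
    # with a high posterior mean (exploit) or a high posterior std (explore).
    return mean + beta * sigma

# Two hypothetical candidates as (posterior mean, posterior std):
# A looks good and is well explored; B looks worse but is very uncertain.
A = (1.0, 0.1)
B = (0.8, 1.0)

def pick(beta):
    return max([A, B], key=lambda ms: gp_ucb(ms[0], ms[1], beta))
```

Here `pick(0.1)` returns A while `pick(2.0)` returns B, making the explore/exploit trade-off controlled by β explicit.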


slide-85
SLIDE 85

Bayesian optimization for hyperparameter tuning

Algorithms for Hyper-Parameter Optimization

James Bergstra (The Rowland Institute, Harvard University), Rémi Bardenet (LRI, Université Paris-Sud), Yoshua Bengio (Université de Montréal), Balázs Kégl (Université Paris-Sud, CNRS)

◮ Check out hyperopt and spearmint.


slide-86
SLIDE 86

Going further: Hyperopt across datasets10

10R. Bardenet et al. “Collaborative hyperparameter tuning”. In: International Conference on Machine Learning (ICML). Atlanta, Georgia, 2013.


slide-87
SLIDE 87

Some useful hyperlinks
◮ Textbook: C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
◮ great for understanding, methods, pointers to ML and stats.
◮ Videolecture by C. Rasmussen.
◮ Lecture notes: P. Orbanz. Lecture notes on Bayesian nonparametrics. 2014.
◮ mathematically clean, without losing the focus on ML.


slide-88
SLIDE 88

Some open issues
◮ Fully Bayesian scalable approaches!
◮ Natural approaches to constrained GPs.
◮ Links with other models based on Gaussians and geometry.
Back to the roots
◮ Formulate hyperparameter tuning across datasets and algorithms as a posterior expected loss problem, including computational constraints.
◮ Solve the resulting dynamic programming problem.


slide-89
SLIDE 89

References I

Bardenet, R. et al. “Collaborative hyperparameter tuning”. In: International Conference on Machine Learning (ICML). Atlanta, Georgia, 2013.

Blei, D. M., A. Kucukelbir, and J. D. McAuliffe. “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877.

Chu, W. and Z. Ghahramani. “Preference learning with Gaussian processes”. In: Proceedings of the 22nd International Conference on Machine Learning. 2005, pp. 137–144.

Douc, R., É. Moulines, and D. Stoffer. Nonlinear time series. Chapman-Hall, 2014.

Jones, D. R. “A Taxonomy of Global Optimization Methods Based on Response Surfaces”. In: Journal of Global Optimization 21 (2001), pp. 345–383.

Orbanz, P. Lecture notes on Bayesian nonparametrics. 2014.


slide-90
SLIDE 90

References II

Parmigiani, G. and L. Inoue. Decision theory: principles and approaches. Vol. 812. John Wiley & Sons, 2009.

Rasmussen, C. E. and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Robert, C. P. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.

Robert, C. P. and G. Casella. Monte Carlo Statistical Methods. New York: Springer-Verlag, 2004.

Stigler, S. M. “The epic story of maximum likelihood”. In: Statistical Science (2007), pp. 598–620.

Wald, A. Statistical decision functions. Wiley, 1950.
