
slide-1
SLIDE 1

Bayesian machine learning: a tutorial

Rémi Bardenet

CNRS & CRIStAL, Univ. Lille, France

Rémi Bardenet (CNRS & Univ. Lille) Bayesian ML 1

slide-2
SLIDE 2

Outline
The what: Typical statistical problems; Statistical decision theory; Posterior expected utility and Bayes rules
The why: The philosophical why; The practical why
The how: Conjugacy; Monte Carlo methods; Metropolis-Hastings; Variational approximations
In depth with Gaussian processes in ML: From linear regression to GPs; Modeling and learning; More applications
References and open issues


slide-4
SLIDE 4

Typical jobs for statisticians

Estimation
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆), with θ⋆ ∈ Rd.
◮ You want an estimate θ̂(x1, . . . , xn) of θ⋆.

Confidence regions
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆), with θ⋆ ∈ Rd.
◮ You want a region A(x1, . . . , xn) ⊂ Rd and a statement that θ⋆ ∈ A(x1, . . . , xn) with some certainty.


slide-6
SLIDE 6

Statistical decision theory1

Figure: Abraham Wald (1902–1950)

1 A. Wald. Statistical decision functions. Wiley, 1950.

slide-7
SLIDE 7

Statistical decision theory
◮ Let Θ be the “states of the world”, typically the space of parameters of interest.
◮ Decisions are functions d(x1, . . . , xn) ∈ D.
◮ Let L(d, θ) denote the loss of making decision d when the state of the world is θ.
◮ Wald defines the risk of a decision as

    R(d, θ) = ∫ L(d(x1:n), θ) p(x1:n|θ) dx1:n.

◮ Wald says d1 is a better decision than d2 if

    ∀θ ∈ Θ, R(d1, θ) ≤ R(d2, θ).   (1)

◮ d is called admissible if there is no better decision than d.


slide-13
SLIDE 13

Illustration with a simple estimation problem
◮ You have data x1, . . . , xn that you assume drawn from

    p(x1, . . . , xn|θ⋆) = ∏_{i=1}^n N(xi|θ⋆, σ2),

and you know σ2.
◮ You choose a loss function, say L(θ̂, θ) = ‖θ̂ − θ‖2.
◮ You restrict your decision space to unbiased estimators.
◮ The sample mean θ̃ := n−1 ∑_{i=1}^n xi is unbiased, and has minimum variance among unbiased estimators.
◮ Since R(θ̃, θ) = Var θ̃, θ̃ is the best decision you can make in Wald’s framework.

slide-14
SLIDE 14

Wald’s view of frequentist estimation

Estimation
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆), with θ⋆ ∈ Rd.
◮ You want an estimate θ̂(x1, . . . , xn) of θ⋆.

A Waldian answer
◮ Our decisions are estimates d(x1, . . . , xn) = θ̂(x1, . . . , xn).
◮ We pick a loss, say L(d, θ) = L(θ̂, θ) = ‖θ̂ − θ‖2.
◮ If you have an unbiased estimator with minimum variance, then this is the best decision among unbiased estimators.

slide-15
SLIDE 15

Wald’s view of frequentist estimation

Estimation
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆), with θ⋆ ∈ Rd.
◮ You want an estimate θ̂(x1, . . . , xn) of θ⋆.

A Waldian answer
◮ Our decisions are estimates d(x1, . . . , xn) = θ̂(x1, . . . , xn).
◮ In general, the loss can be more complex and unbiased estimators unknown/irrelevant.
◮ In these cases, you may settle for a minimax estimator

    θ̂(x1, . . . , xn) = arg min_d sup_θ R(d, θ).

slide-16
SLIDE 16

Wald’s is only one view of frequentist statistics...
◮ On estimation, some would argue in favour of maximum likelihood2.

Figure: Ronald Fisher (1890–1962)

2 S. M. Stigler. “The epic story of maximum likelihood”. In: Statistical Science (2007), pp. 598–620.

slide-17
SLIDE 17

... but bear with me, since it is predominant in machine learning

For instance, supervised learning is usually formalized as

    g⋆ = arg min_g E L(y, g(x)),   (2)

which you approximate by

    ĝ = arg min_g ∑_{i=1}^n L(yi, g(xi)) + penalty(g),

while trying to control the excess risk E L(y, ĝ(x)) − E L(y, g⋆(x)).

slide-18
SLIDE 18

Wald’s view of frequentist confidence regions

Confidence regions
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆), with θ⋆ ∈ Rd.
◮ You want a region A(x1, . . . , xn) ⊂ Rd and a statement that θ⋆ ∈ A(x1, . . . , xn) with some certainty.

A Waldian answer
◮ Our decisions are subsets of Rd: d(x1:n) = A(x1:n).
◮ A common loss is L(d, θ) = L(A, θ) = 1_{θ∉A} + γ|A|.
◮ So you want to find A(x1:n) that minimizes the risk

    R(A, θ⋆) = ∫ [1_{θ⋆∉A(x1:n)} + γ|A(x1:n)|] p(x1:n|θ⋆) dx1:n.

slide-19
SLIDE 19

Illustration with a simple confidence interval problem
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆) = ∏_{i=1}^n N(xi|θ⋆, σ2).
◮ You choose a loss function, say L(A, θ) = 1_{θ∉A} + γ|A|.
◮ You restrict your decisions to intervals centered around the sample mean θ̃.
◮ Since (θ̃ − θ)/(σ/√n) ∼ N(0, 1), we know (exercise) that for Ã := [θ̃ − kσ/√n, θ̃ + kσ/√n], it follows that

    R(Ã, θ) = P(|N(0, 1)| ≥ k) + 2γkσ/√n.

◮ All that is left to do is choose k.
◮ Textbook examples bypass the need for γ: they fix α > 0 and find the smallest k such that P(|N(0, 1)| ≥ k) ≤ α.
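The textbook recipe in the last bullet is easy to check numerically. A minimal sketch (standard library only; the function names are mine, not from the slides): since P(|N(0, 1)| ≥ k) = erfc(k/√2), the smallest valid k can be found by bisection.

```python
import math

def z_crit(alpha):
    """Smallest k with P(|N(0,1)| >= k) = erfc(k/sqrt(2)) <= alpha, by bisection."""
    lo, hi = 0.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.erfc(mid / math.sqrt(2)) <= alpha:
            hi = mid   # mid is already a valid k, shrink from above
        else:
            lo = mid
    return hi

def confidence_interval(sample_mean, sigma, n, alpha=0.05):
    """The interval [mean - k sigma/sqrt(n), mean + k sigma/sqrt(n)] from the slide."""
    k = z_crit(alpha)
    half = k * sigma / math.sqrt(n)
    return sample_mean - half, sample_mean + half
```

For α = 0.05 this recovers the familiar k ≈ 1.96.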

slide-20
SLIDE 20

Summary so far
◮ Waldian frequentists measure risks as expectations w.r.t. the data-generating process:

    R(d, θ) = ∫ L(d(x1:n), θ) p(x1:n|θ) dx1:n.

◮ One major difficulty is that the risk remains a function of θ.
◮ Without additional structure (unbiasedness, Gaussianity, etc.), it is difficult to go beyond minimax rules.

Idea
What if we introduced a distribution on Θ, and tried to minimize

    r(d) = ∫ R(d, θ) p(θ) dθ = ∫∫ L(d(x1:n), θ) p(x1:n|θ) dx1:n p(θ) dθ?

slide-21
SLIDE 21

From expected frequentist loss to posterior expected loss

    r(d) = ∫ R(d, θ) p(θ) dθ
         = ∫ [∫ L(d(x1:n), θ) p(x1:n|θ) dx1:n] p(θ) dθ
         = ∫ [∫ L(d(x1:n), θ) p(x1:n|θ) p(θ) dθ] dx1:n
         = ∫ [∫ L(d(x1:n), θ) (p(x1:n|θ) p(θ) / p(x1:n)) dθ] p(x1:n) dx1:n
         = ∫ [∫ L(d(x1:n), θ) p(θ|x1:n) dθ] p(x1:n) dx1:n.

slide-22
SLIDE 22

Bayesians minimize posterior expected utility

The posterior expected utility paradigm: Bayes rules
Pick d to solve

    arg min_d ∫ L(d(x1:n), θ) p(θ|x1:n) dθ.

Bayes rules have good frequentist properties3
◮ Under general conditions, Bayes decision rules are admissible, and all admissible rules are limits of Bayes rules.
◮ Bayes rules with “least favourable priors” are minimax.

3 G. Parmigiani and L. Inoue. Decision theory: principles and approaches. Vol. 812. John Wiley & Sons, 2009.

slide-23
SLIDE 23

Illustration with a simple estimation problem
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆) = ∏_{i=1}^n N(xi|θ⋆, σ2), and you know σ2.
◮ You choose a loss function, say L(θ̂, θ) = ‖θ̂ − θ‖2.
◮ You choose a prior p over θ.
◮ Your Bayes decision minimizes ∫ ‖θ̂ − θ‖2 p(θ|x1:n) dθ, so you pick the posterior mean

    θ̂ = ∫ θ p(θ|x1:n) dθ.

◮ Conceptually, it is simpler. In practice, you need to compute an integral.
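A quick numerical sanity check of the posterior-mean rule (a sketch with made-up numbers, not from the slides): for a N(0, τ2) prior on the mean of Gaussian data, the posterior is Gaussian with a closed-form mean, and a brute-force quadrature of ∫ θ p(θ|x1:n) dθ should agree with it.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, tau2, n = 1.0, 4.0, 50           # known noise variance, prior variance
theta_true = 0.7
x = rng.normal(theta_true, np.sqrt(sigma2), size=n)

# closed-form posterior for the prior theta ~ N(0, tau2): N(m_post, v_post)
v_post = 1.0 / (n / sigma2 + 1.0 / tau2)
m_post = v_post * n * x.mean() / sigma2  # Bayes estimate under L2 loss

# brute-force quadrature of the posterior mean, same model
theta_grid = np.linspace(-5.0, 5.0, 4001)
log_post = (-0.5 * ((x[:, None] - theta_grid[None, :]) ** 2).sum(axis=0) / sigma2
            - 0.5 * theta_grid ** 2 / tau2)
w = np.exp(log_post - log_post.max())    # unnormalized posterior on the grid
m_quad = (theta_grid * w).sum() / w.sum()
```

The two answers coincide up to quadrature error, which is what "compute an integral" amounts to in this conjugate case.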

slide-24
SLIDE 24

Illustration with a simple confidence interval problem
◮ You have data x1, . . . , xn that you assume drawn from p(x1, . . . , xn|θ⋆) = ∏_{i=1}^n N(xi|θ⋆, σ2).
◮ You choose a loss function, say L(A, θ) = 1_{θ∉A} + γ|A|.
◮ You choose a prior p over θ.
◮ Your Bayes decision minimizes

    ∫ 1_{θ∉A} p(θ|x1:n) dθ + γ|A|.

◮ Conceptually, it is simpler. In practice, you need to carefully pick your prior and/or restrict the decision space and/or compute many integrals.

slide-25
SLIDE 25

Summary so far ◮ Bayes rules fit into Wald’s framework. ◮ For a fixed prior, the Bayesian risk completely orders decision rules. ◮ The key idea is posterior expected utility. ◮ You can answer most basic statistical questions using this principle: [more examples].


slide-26
SLIDE 26

A recent motivating success

GW170814: A Three-Detector Observation of Gravitational Waves from a Binary Black Hole Coalescence

B. P. Abbott et al. (LIGO Scientific Collaboration and Virgo Collaboration)

(Received 23 September 2017; published 6 October 2017)

On August 14, 2017 at 10:30:43 UTC, the Advanced Virgo detector and the two Advanced LIGO detectors coherently observed a transient gravitational-wave signal produced by the coalescence of two stellar-mass black holes, with a false-alarm rate of ≲ 1 in 27 000 years. The signal was observed with a three-detector network matched-filter signal-to-noise ratio of 18. The inferred masses of the initial black holes are 30.5+5.7−3.0 M⊙ and 25.3+2.8−4.2 M⊙ (at the 90% credible level). The luminosity distance of the source is 540+130−210 Mpc, corresponding to a redshift of z = 0.11+0.03−0.04. A network of three detectors improves the sky localization of the source, reducing the area of the 90% credible region from 1160 deg2 using only the two LIGO detectors to 60 deg2 using all three detectors. For the first time, we can test the nature of gravitational-wave polarizations from the antenna response of the LIGO-Virgo network, thus enabling a new class of phenomenological tests of gravity.

DOI: 10.1103/PhysRevLett.119.141101 (PRL 119, 141101 (2017))


slide-28
SLIDE 28

The subjectivistic viewpoint
◮ Top requirement is internal coherence of decisions.
◮ Favours interpreting probability distributions as personal beliefs.

Figure: Bruno de Finetti (1906–1985) and L. Jimmie Savage (1917–1971)

slide-29
SLIDE 29

The logical justification
◮ Top requirement is to find a version of propositional logic that allows taking uncertainty into account.
◮ Also favours interpreting probability distributions as beliefs, but aims for objective priors.

Figure: Richard T. Cox (1898–1991), Edwin T. Jaynes (1922–1998), and Harold Jeffreys (1891–1989)

slide-30
SLIDE 30

The hybrid view4
◮ The starting point is posterior expected utility, loosely justified by Wald’s theory.
◮ It is simple, widely applicable, and has good frequentist properties.
◮ It satisfies the likelihood principle.
◮ It is easy to interpret: beliefs are
  ◮ represented by probabilities,
  ◮ updated using Bayes’ rule,
  ◮ integrated when making decisions.
◮ It is easy to communicate your uncertainty:
  ◮ simply give your posterior;
  ◮ when making a decision, make sure that the priors of everyone involved would yield the same decision.

4 C. P. Robert. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.

slide-31
SLIDE 31

Practical advantages of posterior expected utility
◮ Conceptually answers all ML problems.
◮ Suits all applications where quantifying uncertainty matters more than computational cost: all basic sciences, health, even one-shot commercial decisions.
◮ We never invoked any large-sample argument, so it suits datasets of all sizes.



slide-37
SLIDE 37

Conjugacy
◮ Say we have a linear regression problem: yi = f(xi) + εi, f(x) = θTx, εi i.i.d. Gaussians N(0, σ2).
◮ If we choose θ ∼ N(0, Σ), then (exercise)

    p(θ|(x, y)1:n) ∝ p((x, y)1:n|θ) p(θ) = N(σ−2A−1XTy, A−1),

where A = σ−2XTX + Σ−1.
◮ If the loss is not too complicated, then integrals are easy. For instance, prediction with L2 loss is simple:

    ŷ⋆ := arg min_{y⋆} ∫ (y⋆ − θTx⋆)2 p(θ|(x, y)1:n) dθ = σ−2x⋆TA−1XTy.
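The conjugate computation above can be checked in a few lines of NumPy. This is a sketch with simulated data; the variable names and the test point x⋆ are illustrative, and X is the n × d design matrix (one row per observation), so the posterior mean reads σ−2A−1XTy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma2 = 200, 3, 0.25
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))                        # design matrix, one row per x_i
y = X @ theta_star + rng.normal(scale=np.sqrt(sigma2), size=n)

Sigma = np.eye(d)                                  # prior: theta ~ N(0, Sigma)
A = X.T @ X / sigma2 + np.linalg.inv(Sigma)        # A = sigma^-2 X^T X + Sigma^-1
theta_post = np.linalg.solve(A, X.T @ y) / sigma2  # posterior mean sigma^-2 A^-1 X^T y
post_cov = np.linalg.inv(A)                        # posterior covariance A^-1

x_star = np.array([1.0, 1.0, 1.0])
y_hat = x_star @ theta_post                        # Bayes prediction under L2 loss
```

With enough data, the posterior mean concentrates near the parameter that generated the data, as expected.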

slide-38
SLIDE 38

Monte Carlo methods
◮ Sometimes, you’re less lucky. Say we’re doing logistic regression:

    yi ∼ Bernoulli[σ(f(xi))], with f(x) = θTx, σ(x) = 1/(1 + e−x).

◮ Even if we choose θ ∼ N(0, Σ), the posterior p(θ|(x, y)1:n) ∝ p((x, y)1:n|θ) p(θ) does not have a simple closed form.
◮ We need powerful numerical integration methods, that is, constructions of nodes (θi) and weights (wi) such that

    ∫ h dπ ≈ ∑_{i=1}^N wi h(θi).
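The simplest such construction is vanilla Monte Carlo: sample the nodes θi from π itself and take uniform weights wi = 1/N. A hedged sketch on a toy Beta-Bernoulli posterior, where the answer is known exactly (the counts are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# toy target: Beta(2, 2) prior on a coin's bias, 7 heads in 10 flips
a, b = 2 + 7, 2 + 3                   # posterior is Beta(9, 5)
h = lambda theta: theta               # integrand: posterior mean

# vanilla Monte Carlo: nodes theta_i ~ pi, uniform weights w_i = 1/N
theta = rng.beta(a, b, size=100_000)
estimate = h(theta).mean()

exact = a / (a + b)                   # Beta(9, 5) mean = 9/14
```

Here we can sample π directly; the point of MCMC below is to build such nodes when direct sampling from the posterior is impossible.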


slide-41
SLIDE 41

Metropolis-Hastings

MH(π(θ), q(θ′|θ), θ0, Niter)
1  for k ← 1 to Niter
2      θ ← θk−1
3      θ′ ∼ q(·|θ), u ∼ U(0, 1)
4      α ← [π(θ′) q(θ|θ′)] / [π(θ) q(θ′|θ)]
5      if u < α
6          θk ← θ′   ⊲ Accept
7      else θk ← θ   ⊲ Reject
8  return (θk)k=1,...,Niter
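A direct Python transcription of the pseudocode, working on the log scale for numerical stability. The N(0, 1) target and the Gaussian random-walk proposal are illustrative assumptions, not from the slides:

```python
import numpy as np

def metropolis_hastings(log_pi, sample_q, log_q, theta0, n_iter, rng):
    """Metropolis-Hastings with densities handled on the log scale."""
    theta = float(theta0)
    chain = np.empty(n_iter)
    for k in range(n_iter):
        theta_prop = sample_q(theta, rng)                 # theta' ~ q(.|theta)
        log_alpha = (log_pi(theta_prop) - log_pi(theta)   # pi(theta')/pi(theta)
                     + log_q(theta, theta_prop) - log_q(theta_prop, theta))
        if np.log(rng.uniform()) < log_alpha:             # u < alpha
            theta = theta_prop                            # accept
        chain[k] = theta                                  # on reject, keep theta
    return chain

# illustration: target pi = N(0, 1), symmetric Gaussian random-walk proposal
rng = np.random.default_rng(42)
log_pi = lambda t: -0.5 * t**2                 # unnormalized log target
sample_q = lambda t, rng: t + rng.normal(scale=1.0)
log_q = lambda a, b: 0.0                       # symmetric proposal: ratio cancels
chain = metropolis_hastings(log_pi, sample_q, log_q, 0.0, 20_000, rng)
```

Note that only the unnormalized density of π is needed, which is exactly the situation of the logistic-regression posterior above.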


slide-46
SLIDE 46

The MCMC magic
◮ Under assumptions5, a central limit theorem holds:

    √Niter ( (1/Niter) ∑_{k=1}^{Niter} h(θk) − ∫ h(θ) π(θ) dθ ) → N(0, σ2lim(h)).

◮ If you choose q carefully, you can hope for a polynomial increase of the mixing time and σ2lim(h) with d.
◮ Most MCMC algorithms are instances of Metropolis-Hastings with clever choices of proposal6, even the NUTS HMC of Stan and PyMC3.
◮ For nice illustrations, check out https://chi-feng.github.io/mcmc-demo/

5 R. Douc et al. Nonlinear time series. Chapman-Hall, 2014.
6 C. P. Robert and G. Casella. Monte Carlo Statistical Methods. New York: Springer-Verlag, 2004.

slide-47
SLIDE 47

Variational approximations
◮ When in a hurry, you can settle for a good approximation to your posterior π(θ) = p(θ|x) ∝ p(x|θ)p(θ): minimize over q

    KL(q, π) = Eq log q − Eq log p(θ|x)
             = −[−Eq log q + Eq log p(x, θ)] + log p(x).

◮ Equivalently, we can maximize the evidence lower bound (ELBO)7.
◮ Ideally, I would rather cast the choice of q into a Wald-like problem.

7 D. M. Blei et al. “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877.
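The identity KL(q, π) = −ELBO(q) + log p(x) can be verified in closed form on a toy conjugate model. The model below is my choice for illustration: prior θ ∼ N(0, 1) and one observation x|θ ∼ N(θ, 1), so the posterior is N(x/2, 1/2) and the ELBO is tight exactly there.

```python
import numpy as np

x = 2.0  # one observation; prior theta ~ N(0,1), likelihood x|theta ~ N(theta,1)

def elbo(m, s2):
    """ELBO(q) = E_q[log p(x, theta)] + H(q) for q = N(m, s2), in closed form."""
    e_log_joint = (-np.log(2 * np.pi)            # two Gaussian normalizers
                   - ((x - m) ** 2 + s2) / 2     # E_q (x - theta)^2
                   - (m ** 2 + s2) / 2)          # E_q theta^2
    entropy = 0.5 * np.log(2 * np.pi * s2) + 0.5
    return e_log_joint + entropy

# evidence: marginally x ~ N(0, 2), so log p(x) is computable
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4
```

Evaluating `elbo` away from (x/2, 1/2) gives strictly smaller values, which is the sense in which maximizing the ELBO recovers the posterior within the variational family.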



slide-54
SLIDE 54

Linear regression
◮ yi = f(xi) + εi, f(x) = θTx, εi i.i.d. Gaussians N(0, σ2).
◮ If we choose θ ∼ N(0, Σ), then

    p(θ|(x, y)1:n) ∝ p((x, y)1:n|θ) p(θ)
                  ∝ exp( −‖y − Xθ‖2/(2σ2) − θTΣ−1θ/2 )
                  ∝ exp( yTXθ/σ2 − θTXTXθ/(2σ2) − θTΣ−1θ/2 )
                  ∝ exp( −(1/2) (θ − σ−2A−1XTy)T A (θ − σ−2A−1XTy) )
                  = N(σ−2A−1XTy, A−1),

where A = σ−2XTX + Σ−1.
◮ Remember prediction with L2 loss is simple:

    arg min_{y⋆} ∫ (y⋆ − f(x⋆))2 p(θ|(x, y)1:n) dθ = σ−2x⋆TA−1XTy.

slide-55
SLIDE 55

Linear regression
◮ With the same posterior p(θ|(x, y)1:n) = N(σ−2A−1XTy, A−1), where A = σ−2XTX + Σ−1, we can even check that

    p(f(x⋆)|x⋆, (x, y)1:n) = N(σ−2x⋆TA−1XTy, x⋆TA−1x⋆).

slide-56
SLIDE 56

Linear regression with nonlinear features
◮ Replace each x by a vector of features ϕ(x) ∈ Rp:

    yi = f(xi) + εi, i = 1, . . . , n,
    f(x) = θTϕ(x), εi i.i.d. Gaussians N(0, σ2), θ ∼ N(0, Σ).

◮ Think ϕ(x) = (1, x1, x2, x1x2, . . . ).
◮ Recall

    p(f(x⋆)|x⋆, (x, y)1:n) = N(σ−2ϕ⋆TA−1ΦTy, ϕ⋆TA−1ϕ⋆),

where Φ is the n × p feature matrix, ϕ⋆ = ϕ(x⋆), and A = σ−2ΦTΦ + Σ−1.
◮ Requires a p × p inversion.
◮ But let K = ΦΣΦT; then we can rewrite (exercise)

    p(f(x⋆)|(x, y)1:n) = N(µ⋆, σ⋆2),

where

    µ⋆ = ϕ⋆TΣΦT(K + σ2I)−1y,
    σ⋆2 = ϕ⋆TΣϕ⋆ − ϕ⋆TΣΦT(K + σ2I)−1ΦΣϕ⋆.

slide-57
SLIDE 57

Gaussian processes
◮ A distribution over a space of functions f : Rd → R.

Gaussian processes
If ∀p ∈ N, ∀x1, . . . , xp ∈ Rd,

    [f(x1), . . . , f(xp)]T ∼ N(m, K),

where m = [µ(x1), . . . , µ(xp)]T and K = ((K(xi, xj))), then we say f ∼ GP(µ, K).

◮ Uniqueness is usually easy, existence is tricky.
◮ A necessary condition is that all matrices K are positive semi-definite.
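For a concrete feel, here is a minimal GP regression sketch with a squared-exponential kernel (the course notebook referenced on the following slide is the authoritative version; the data, kernel, and hyperparameters below are illustrative assumptions):

```python
import numpy as np

def rbf(a, b, ell=1.0, var=1.0):
    """Squared-exponential kernel k(x, x') = var * exp(-(x - x')^2 / (2 ell^2))."""
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 20)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)   # noisy observations, noise std 0.1
sigma2 = 0.01                                   # matching noise variance

x_star = np.linspace(0.0, 5.0, 50)              # test inputs
K = rbf(x, x)
K_s = rbf(x, x_star)                            # cross-covariances k(x_i, x_star)
K_ss = rbf(x_star, x_star)

# posterior predictive: N(K_s^T (K + sigma2 I)^-1 y, K_ss - K_s^T (K + sigma2 I)^-1 K_s)
B = np.linalg.solve(K + sigma2 * np.eye(x.size), np.c_[y[:, None], K_s])
mu_star = K_s.T @ B[:, 0]
cov_star = K_ss - K_s.T @ B[:, 1:]
```

The posterior mean tracks the underlying sine function, and the diagonal of `cov_star` quantifies pointwise uncertainty.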

slide-58
SLIDE 58

Sampling, conditioning and predicting See notebook 01 on https://github.com/rbardenet/bnp-course


slide-59
SLIDE 59

Commonly-used kernels



slide-61
SLIDE 61

Learning
◮ In regression,

    p(y|x, θ) = ∫ p(y|f) p(f|x, θ) df = N(y|0, Kθ + σ2In).

◮ So simply put a prior over η = (σ, θ) and integrate.
◮ Prediction becomes

    f⋆ ∼ ∫ p(f⋆|x, η) p(η|x, y) dη.

◮ Alternatively, lots of people maximize the marginal likelihood.

slide-62
SLIDE 62

Beyond regression: classification8
◮ (Exercise) Find a simple classification model with GPs.
◮ Take for instance yi|xi, f ∼ Bernoulli(σ(f(xi))), i.e. p(y = +1|x, f) = σ(f(x)), with f ∼ GP(0, K).
◮ Problem: prediction is not easy anymore:

    p(f⋆|X, y, x⋆) = ∫ p(f⋆|X, f, x⋆) p(f|X, y) df.

8 C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.


slide-65
SLIDE 65

Beyond regression: ranking9
◮ (Exercise) Find a simple ranking model with GPs: your data is (u, v)1:n where ∀i, ui ≺ vi. Your user wants to know whether a new u⋆ ≺ v⋆.
◮ Take for instance p(u ≺ v|u, v, f) = ϕ(f(v) − f(u)), with f ∼ GP(0, K) and ϕ increasing.
◮ Same difficulties with learning.

9 W. Chu and Z. Ghahramani. “Preference learning with Gaussian processes”. In: Proceedings of the 22nd International Conference on Machine Learning. 2005, pp. 137–144.


slide-68
SLIDE 68

Emulators of expensive models

RESEARCH ARTICLE

Bayesian Sensitivity Analysis of a Cardiac Cell Model Using a Gaussian Process Emulator

Eugene TY Chang1,2, Mark Strong3, Richard H Clayton1,2*

1 Insigneo Institute for in-silico Medicine, University of Sheffield, Sheffield, United Kingdom, 2 Department of Computer Science University of Sheffield, Sheffield, United Kingdom, 3 School of Health and Related Research, University of Sheffield, Sheffield, United Kingdom * r.h.clayton@sheffield.ac.uk

Abstract

Models of electrical activity in cardiac cells have become important research tools as they can provide a quantitative description of detailed and integrative physiology. However, car- diac cell models have many parameters, and how uncertainties in these parameters affect the model output is difficult to assess without undertaking large numbers of model runs. In this study we show that a surrogate statistical model of a cardiac cell model (the Luo-Rudy 1991 model) can be built using Gaussian process (GP) emulators. Using this approach we a11111

Citation: Chang ETY, Strong M, Clayton RH (2015) Bayesian Sensitivity Analysis of a Cardiac Cell Model


slide-69
SLIDE 69

Nonparametric fits

arXiv:1204.2272v2 [astro-ph.CO] 10 Jul 2012

Gaussian Process Cosmography

Arman Shafieloo1, Alex G. Kim2, Eric V. Linder1,2,3

1 Institute for the Early Universe WCU, Ewha Womans University, Seoul, Korea 2 Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA and 3 University of California, Berkeley, CA 94720, USA

(Dated: July 11, 2012)
Gaussian processes provide a method for extracting cosmological information from observations without assuming a cosmological model. We carry out cosmography – mapping the time evolution of the cosmic expansion – in a model-independent manner using kinematic variables and a geometric probe of cosmology. Using the state of the art supernova distance data from the Union2.1 compilation, we constrain, without any assumptions about dark energy parametrization or matter density, the Hubble parameter and deceleration parameter as a function of redshift. Extraction of these relations is tested successfully against models with features on various coherence scales, subject to certain statistical cautions.

I. INTRODUCTION
Cosmic acceleration is a fundamental mystery of great interest and importance to understanding cosmology, gravitation, and high energy physics. The cosmic expansion rate is slowed down by gravitationally attractive matter and sped up by some other, unknown contribution to the dynamical equations. While great effort is being put into identifying the source of this extra dark energy contribution, the overall expansion behavior also holds important clues to origin, evolution, and present […] fitting procedures have been suggested, e.g. [6], but tend to induce bias in the function reconstruction due to parametric restriction of the behavior or to have poor error control. Using a general orthonormal basis or principal component analysis is another approach, to describe the distance-redshift relation (e.g. [7]) or the deceleration parameter [8], or using a correlated prior for smoothness on the dark energy equation of state [9], but in practice a finite (and small) number of modes is significant beyond the prior, essentially reducing to a parametric approach. Gaussian processes [10] offer an interesting possibility for improving this situation.


slide-70
SLIDE 70

Natural language processing

Using Gaussian Processes for Rumour Stance Classification in Social Media

MICHAL LUKASIK, University of Sheffield KALINA BONTCHEVA, University of Sheffield TREVOR COHN, University of Melbourne ARKAITZ ZUBIAGA, University of Warwick MARIA LIAKATA, University of Warwick ROB PROCTER, University of Warwick

Social media tend to be rife with rumours while new reports are released piecemeal during breaking news. Interestingly, one can mine multiple reactions expressed by social media users in those situations, exploring their stance towards rumours, ultimately enabling the flagging of highly disputed rumours as being potentially false. In this work, we set out to develop an automated, supervised classifier that uses multi-task learning to classify the stance expressed in each individual tweet in a rumourous conversation as either supporting, denying or questioning the rumour. Using a classifier based on Gaussian Processes, and exploring its effectiveness on two datasets with very different characteristics and varying distributions of stances, we show that our approach consistently outperforms competitive baseline classifiers. Our classifier is especially effective in estimating the distribution of different types of stance associated with a given rumour, which we set forth as a desired characteristic for a rumour-tracking system that will warn both ordinary users of Twitter and professional news practitioners when a rumour is being rebutted.

1. INTRODUCTION

There is an increasing need to interpret and act upon rumours spreading quickly through social media during breaking news, where new reports are released piecemeal and often have an unverified status at the time of posting. Previous research has posited the damage that the diffusion of false

arXiv:1609.01962v1 [cs.CL] 7 Sep 2016


slide-71
SLIDE 71

Bayesian optimization for hyperparameter tuning

Algorithms for Hyper-Parameter Optimization

James Bergstra, The Rowland Institute, Harvard University, bergstra@rowland.harvard.edu
Rémi Bardenet, Laboratoire de Recherche en Informatique, Université Paris-Sud, bardenet@lri.fr
Yoshua Bengio, Dépt. d'Informatique et de Recherche Opérationnelle, Université de Montréal, yoshua.bengio@umontreal.ca
Balázs Kégl, Linear Accelerator Laboratory, Université Paris-Sud, CNRS, balazs.kegl@gmail.com

Abstract

Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few


slide-72
SLIDE 72

Bayesian optimization
◮ Goal is to minimize a noisy f with N iterations, N small.
◮ Key application: find the hyperparameters of your ML algorithm that minimize the validation error.
◮ Idea is to sequentially
◮ update your model of f,
◮ optimize an acquisition criterion.

◮ Check out notebook 03 on https://github.com/rbardenet/bnp-course.
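The sequential loop above can be sketched end to end in a few lines. To keep the snippet self-contained, the GP surrogate is replaced by a crude kernel-weighted stand-in whose predictive uncertainty shrinks near observed points; treat this as an illustration of the loop structure, not of GP-based BO itself, and note that `surrogate` and `bayes_opt` are made-up names:

```python
import math
import random

def surrogate(x, X, Y, length=1.0):
    """Crude stand-in for a GP posterior: kernel-weighted mean prediction,
    with an uncertainty that vanishes near the observed points."""
    w = [math.exp(-((x - xi) / length) ** 2) for xi in X]
    mean = sum(wi * yi for wi, yi in zip(w, Y)) / sum(w)
    sigma = math.sqrt(max(1e-12, 1.0 - max(w)))  # ~0 at observed points
    return mean, sigma

def expected_improvement(mean, sigma, best):
    # Closed-form EI for a Gaussian predictive N(mean, sigma^2), minimization.
    u = (best - mean) / sigma
    Phi = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sigma * (u * Phi + phi)

def bayes_opt(f, lo, hi, n_init=3, n_iter=10, seed=0):
    """Sequential loop: update the model on the data, then maximize EI."""
    rng = random.Random(seed)
    X = [lo + (hi - lo) * rng.random() for _ in range(n_init)]
    Y = [f(x) for x in X]
    grid = [lo + (hi - lo) * i / 200 for i in range(201)]
    for _ in range(n_iter):
        best = min(Y)
        x_next = max(grid,
                     key=lambda x: expected_improvement(*surrogate(x, X, Y), best))
        X.append(x_next)
        Y.append(f(x_next))
    i = min(range(len(Y)), key=Y.__getitem__)
    return X[i], Y[i]
```

With a real GP surrogate (e.g. notebook 03 above), `surrogate` would return the GP posterior mean and standard deviation instead.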


slide-73
SLIDE 73

An example in 1D

[Figure: one step of Bayesian optimization in 1D — top panel, the target function; bottom panel, the EI acquisition criterion over the same interval.]


slide-83
SLIDE 83

Popular acquisition criteria
Expected improvement10

EI(z) = E[ max(mN − f(z), 0) | (x, y)1:N ],

where mN = min1≤i≤N f(xi). An easy computation yields

EI(z) = σ(z) [ uΦ(u) + ϕ(u) ],  (3)

where u = (mN − m(z)) / σ(z), and Φ and ϕ denote the cdf and pdf of the N(0, 1) distribution.

10D. R. Jones. “A Taxonomy of Global Optimization Methods Based on Response Surfaces”. In: Journal of Global Optimization 21 (2001), pp. 345–383.
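The "easy computation" behind (3) can be verified numerically: under the GP posterior, f(z) is Gaussian with mean m(z) and standard deviation σ(z), so EI is a one-dimensional Gaussian expectation. A stdlib-only check (the function names are ours):

```python
import math
import random

def ei_closed_form(m, sigma, m_best):
    # EI(z) = sigma [ u Phi(u) + phi(u) ],  u = (m_best - m) / sigma
    u = (m_best - m) / sigma
    Phi = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sigma * (u * Phi + phi)

def ei_monte_carlo(m, sigma, m_best, n=200_000, seed=1):
    # E[ max(m_best - f(z), 0) ] with f(z) ~ N(m, sigma^2) under the posterior
    rng = random.Random(seed)
    return sum(max(m_best - rng.gauss(m, sigma), 0.0) for _ in range(n)) / n
```

The two estimates agree up to Monte Carlo error, which is the kind of sanity check worth running before trusting any hand-derived acquisition formula.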


slide-84
SLIDE 84

GP-UCB (Srinivas et al., 2010)

GP-UCB(z) = m(z) + β σ(z).

◮ If β is properly tuned, bandit results apply.
◮ The first criterion to come with theoretical results.
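A toy illustration of the role of β (the numbers are made up): with small β the criterion exploits the candidate with the best posterior mean, while a larger β favours the more uncertain one. Note that the bandit guarantees of Srinivas et al. rely on a specific schedule βt growing slowly with the iteration count; a constant β is used here only for illustration.

```python
def gp_ucb(mean, sigma, beta):
    # Upper confidence bound (maximization convention): reward candidates
    # with a high posterior mean (exploit) or a high posterior std (explore).
    return mean + beta * sigma

# Two hypothetical candidates as (posterior mean, posterior std):
# A looks good and is well explored; B looks worse but is very uncertain.
A = (1.0, 0.1)
B = (0.8, 1.0)

def pick(beta):
    return max([A, B], key=lambda ms: gp_ucb(ms[0], ms[1], beta))
```

Here `pick(0.1)` returns A while `pick(2.0)` returns B, making the explore/exploit trade-off controlled by β explicit.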


slide-85
SLIDE 85

Bayesian optimization for hyperparameter tuning

Algorithms for Hyper-Parameter Optimization

James Bergstra (The Rowland Institute, Harvard University), Rémi Bardenet (LRI, Université Paris-Sud), Yoshua Bengio (Université de Montréal), Balázs Kégl (Université Paris-Sud, CNRS)

◮ Check out hyperopt and spearmint.


slide-86
SLIDE 86

Going further: Hyperopt across datasets10

10R. Bardenet et al. “Collaborative hyperparameter tuning”. In: International Conference on Machine Learning (ICML). Atlanta, Georgia, 2013.


slide-87
SLIDE 87

Some useful hyperlinks
◮ Textbook: C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
◮ great for understanding, methods, pointers to ML and stats.
◮ Videolecture by C. Rasmussen.
◮ Lecture notes: P. Orbanz. Lecture notes on Bayesian nonparametrics. 2014.
◮ mathematically clean, without losing the focus on ML.


slide-88
SLIDE 88

Some open issues
◮ Fully Bayesian scalable approaches!
◮ Natural approaches to constrained GPs.
◮ Links with other models based on Gaussians and geometry.
Back to the roots
◮ Formulate hyperparameter tuning across datasets and algorithms as a posterior expected loss problem, including computational constraints.
◮ Solve the resulting dynamic programming problem.


slide-89
SLIDE 89

References I

Bardenet, R. et al. “Collaborative hyperparameter tuning”. In: International Conference on Machine Learning (ICML). Atlanta, Georgia, 2013.

Blei, D. M., A. Kucukelbir, and J. D. McAuliffe. “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877.

Chu, W. and Z. Ghahramani. “Preference learning with Gaussian processes”. In: Proceedings of the 22nd International Conference on Machine Learning. 2005, pp. 137–144.

Douc, R., É. Moulines, and D. Stoffer. Nonlinear time series. Chapman-Hall, 2014.

Jones, D. R. “A Taxonomy of Global Optimization Methods Based on Response Surfaces”. In: Journal of Global Optimization 21 (2001), pp. 345–383.

Orbanz, P. Lecture notes on Bayesian nonparametrics. 2014.


slide-90
SLIDE 90

References II

Parmigiani, G. and L. Inoue. Decision theory: principles and approaches. Vol. 812. John Wiley & Sons, 2009.

Rasmussen, C. E. and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Robert, C. P. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.

Robert, C. P. and G. Casella. Monte Carlo Statistical Methods. New York: Springer-Verlag, 2004.

Stigler, S. M. “The epic story of maximum likelihood”. In: Statistical Science (2007), pp. 598–620.

Wald, A. Statistical decision functions. Wiley, 1950.
