

SLIDE 1

Bayesian Optimization for Likelihood-Free Inference

Michael Gutmann

https://sites.google.com/site/michaelgutmann

University of Edinburgh

14th September 2016

SLIDE 2

Reference

For further information:

  • M.U. Gutmann and J. Corander. Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research, 17(125):1–47, 2016
  • J. Lintusaari, M.U. Gutmann, R. Dutta, S. Kaski, and J. Corander. Fundamentals and Recent Developments in Approximate Bayesian Computation. Systematic Biology, in press, 2016

SLIDE 3

Overall goal

◮ Inference: Given data yo, learn about properties of its source
◮ Enables decision making, predictions, ...

[Figure: a data source with unknown properties produces the observation yo in the data space; inference maps yo back to the unknown properties.]

SLIDE 4

Approach

◮ Set up a model with potential properties θ (hypotheses)
◮ See which θ are in line with the observed data yo

[Figure: as before, now with a model M(θ) standing in for the unknown properties of the data source.]

SLIDE 5

The likelihood function L(θ)

◮ Measures agreement between θ and the observed data yo
◮ Probability to generate data like yo if hypothesis θ holds

[Figure: the model M(θ) generates data y|θ (with randomness ε); the likelihood measures how probable data like yo are under θ.]

SLIDE 6

Performing statistical inference

◮ If L(θ) is known, inference is straightforward
◮ Maximum likelihood estimation:

θ̂ = argmax_θ L(θ)

◮ Bayesian inference:

p(θ|y) ∝ p(θ) × L(θ)
posterior ∝ prior × likelihood

Allows us to learn from data by updating probabilities.
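As a toy illustration of this update (my example, not from the slides): a grid approximation of the posterior for the success probability of a Bernoulli model, assuming 7 successes in 10 trials.

```python
import numpy as np

# Posterior ∝ prior × likelihood, evaluated on a grid of hypotheses θ.
theta = np.linspace(0.001, 0.999, 999)   # candidate parameter values
prior = np.ones_like(theta)              # flat prior p(θ)
lik = theta**7 * (1 - theta)**3          # L(θ): 7 successes, 3 failures
post = prior * lik
post /= post.sum()                       # normalize over the grid

print(theta[np.argmax(lik)])             # MLE, ≈ 0.7
print(theta[np.argmax(post)])            # posterior mode (= MLE, flat prior)
```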

SLIDE 7

Likelihood-free inference

Statistical inference for models where

  • 1. the likelihood function is too costly to compute
  • 2. sampling – simulating data – from the model is possible

SLIDE 8

Importance of likelihood-free inference

One reason: Such generative / simulator-based models occur widely

◮ Astrophysics: simulating the formation of galaxies, stars, or planets
◮ Evolutionary biology: simulating the evolution of life
◮ Neuroscience: simulating neural circuits
◮ Computer vision: simulating natural scenes
◮ Health science: simulating the spread of an infectious disease
◮ ...

Simulated neural activity in rat somatosensory cortex (Figure from https://bbp.epfl.ch/nmc-portal)

SLIDE 9

Flavors of likelihood-free inference

◮ There are several flavors of likelihood-free inference; in the Bayesian setting, e.g.:
  ◮ Approximate Bayesian computation (ABC)
  ◮ Synthetic likelihood (Wood, 2010)
◮ General idea: Identify the values of the parameters of interest θ for which simulated data resemble the observed data
◮ Simulated data resemble the observed data if some distance measure d ≥ 0 is small

Here: Focus on ABC; see the JMLR paper for synthetic likelihood.

SLIDE 10

Meta ABC algorithm

◮ Let yo be the observed data.
◮ Iterate many times:

  • 1. Sample θ from a proposal distribution q(θ)
  • 2. Sample y|θ according to the model
  • 3. Compute the distance d(y, yo) between simulated and observed data
  • 4. Retain θ if d(y, yo) ≤ ε

◮ Different choices for q(θ) give different algorithms
◮ Produces samples from the (approximate) posterior when ε is small
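A minimal Python sketch of the meta algorithm with q(θ) set to the prior (plain rejection ABC); the toy Gaussian simulator, distance, and prior in the usage example are illustrative assumptions, not from the slides.

```python
import numpy as np

def rejection_abc(y_obs, simulate, distance, sample_prior,
                  n_iter=10_000, eps=0.1):
    """Meta ABC algorithm with q(theta) = prior (rejection ABC)."""
    accepted = []
    for _ in range(n_iter):
        theta = sample_prior()           # 1. sample from the proposal
        y = simulate(theta)              # 2. simulate data from the model
        if distance(y, y_obs) <= eps:    # 3.-4. retain theta if d <= eps
            accepted.append(theta)
    return np.array(accepted)

# Toy usage: infer the mean of a Gaussian with known unit variance.
rng = np.random.default_rng(0)
y_obs = rng.normal(1.5, 1.0, size=50)
samples = rejection_abc(
    y_obs,
    simulate=lambda th: rng.normal(th, 1.0, size=50),
    distance=lambda y, yo: abs(y.mean() - yo.mean()),
    sample_prior=lambda: rng.normal(0.0, 5.0),
)
```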

SLIDE 11

Implicit likelihood approximation

Likelihood: Probability to generate data like yo if hypothesis θ holds

[Figure: the model M(θ) generates data sets y(1), ..., y(6); outcomes falling within an ε-ball around yo in the data space count as "like yo" (green).]

Likelihood L(θ) ≈ proportion of green (accepted) outcomes:

L(θ) ≈ (1/N) ∑_{i=1}^{N} 𝟙( d(y^(i)_θ, yo) ≤ ε )
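In code, this approximation is just the acceptance proportion among N simulations at a fixed θ; a sketch reusing the illustrative simulate and distance functions from the block above:

```python
def mc_likelihood(theta, y_obs, simulate, distance, eps, n_sim=300):
    """L(theta) ≈ fraction of N simulated data sets within eps of y_obs."""
    d = np.array([distance(simulate(theta), y_obs) for _ in range(n_sim)])
    return float(np.mean(d <= eps))
```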

SLIDE 12

Example: Bacterial infections in child care centers

◮ Likelihood intractable for cross-sectional data
◮ But generating data from the model is possible

[Figure: simulated colonization data, shown as individual × strain matrices at several points in time.]

Parameters of interest:

  • rate of infections within a center
  • rate of infections from outside
  • competition between the strains

(Numminen et al, 2013)

SLIDE 13

Example: Bacterial infections in child care centers

◮ Data: Streptococcus pneumoniae colonization for 29 centers
◮ Inference with Population Monte Carlo ABC
◮ Reveals strong competition between different bacterial strains

Expensive:

◮ 4.5 days on a cluster with 200 cores
◮ More than one million simulated data sets

[Figure: prior and posterior probability density functions of the competition parameter; the posterior concentrates where competition is strong.]

SLIDE 14

Why is the ABC algorithm so expensive?

  • 1. It rejects most samples when ε is small
  • 2. It does not make assumptions about the shape of L(θ)
  • 3. It does not use all information available
  • 4. It aims at equal accuracy for all parameters

L(θ) ≈ (1/N) ∑_{i=1}^{N} 𝟙( d(y^(i)_θ, yo) ≤ ε )

[Figure: approximate likelihood function (rescaled) for the competition parameter, N = 300, together with the threshold ε, the average distance, and the variability of the distances.]

SLIDE 15

Proposed solution

(Gutmann and Corander, 2016)

  • 1. It rejects most samples when ε is small
    ⇒ Don't reject samples – learn from them
  • 2. It does not make assumptions about the shape of L(θ)
    ⇒ Model the distances; assume the average distance is smooth
  • 3. It does not use all information available
    ⇒ Use Bayes' theorem to update the model
  • 4. It aims at equal accuracy for all parameters
    ⇒ Prioritize parameter regions with small distances

An equivalent strategy applies to inference with synthetic likelihood.

SLIDE 16

Modeling (points 1 & 2)

◮ Data are tuples (θ_i, d_i), where d_i = d(y^(i)_θi, yo)
◮ Model the conditional distribution of d given θ
◮ The estimated model yields an approximation L̂(θ) for any choice of ε:

L̂(θ) ∝ Pr(d ≤ ε | θ)

where Pr is the probability under the estimated model.

◮ Here: Use a (log) Gaussian process as the model (with squared exponential covariance function)
◮ The approach is not restricted to Gaussian processes.
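A minimal one-dimensional sketch of this modeling step, assuming scikit-learn's GP regressor as the (log) Gaussian process; the kernel settings and the log transform of the distances are illustrative choices, not the exact configuration of the paper.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_distance_model(thetas, dists):
    """GP regression of log distances d_i on parameters theta_i."""
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(np.atleast_2d(thetas).T, np.log(dists))
    return gp

def gp_likelihood(gp, theta_grid, eps):
    """L_hat(theta) ∝ Pr(d <= eps | theta) under the estimated GP model."""
    mu, sigma = gp.predict(np.atleast_2d(theta_grid).T, return_std=True)
    return norm.cdf((np.log(eps) - mu) / sigma)
```

Because the model interpolates across θ, every simulation contributes to L̂(θ) everywhere, and ε can be chosen after the simulations are done.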

SLIDE 17

Data acquisition (points 3 & 4)

◮ Samples of θ could be obtained by sampling from the prior or from some adaptively constructed proposal distribution
◮ Give priority to regions of the parameter space where the distance d tends to be small
◮ Use Bayesian optimization to find such regions
◮ Here: Use the lower confidence bound acquisition function (e.g. Cox and John, 1992; Srinivas et al, 2012)

A_t(θ) = μ_t(θ) − √( η_t² v_t(θ) )

where μ_t is the posterior mean, v_t the posterior variance, η_t² a weight, and t the number of samples acquired so far.

◮ The approach is not restricted to this acquisition function.
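A sketch of the acquisition step, continuing the GP example above (numpy already imported there); minimizing A_t over a grid stands in for a continuous optimizer, and the weight η_t² = 4 is an arbitrary illustrative value.

```python
def acquire_next(gp, theta_grid, eta2=4.0):
    """Lower confidence bound A_t(theta) = mu_t(theta) - sqrt(eta2 * v_t(theta)).
    Small predicted distance (exploitation) competes with high posterior
    variance (exploration); we simulate next where A_t is smallest."""
    mu, sigma = gp.predict(np.atleast_2d(theta_grid).T, return_std=True)
    acq = mu - np.sqrt(eta2) * sigma
    return theta_grid[int(np.argmin(acq))]
```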

SLIDE 18

Bayesian optimization for likelihood-free inference

[Figure: GP model of the distance as a function of the competition parameter after 2, 3, and 4 acquired data points, shown with 5%–95% quantile bands and the 50% (mean) line; the acquisition function (exploration vs exploitation) selects the next parameter to try, and Bayes' theorem updates the model with each new data point.]

SLIDE 19

Example: Bacterial infections in child care centers

◮ Comparison of the proposed approach with a standard population Monte Carlo ABC approach
◮ Roughly equal results using 1000 times fewer simulations: 4.5 days with 200 cores → 90 minutes with seven cores

Posterior means: solid lines, credibility intervals: shaded areas or dashed lines.

[Figure: estimate of the competition parameter as a function of computational cost (log10), developed fast method vs standard method.]

(Gutmann and Corander, 2016)

SLIDE 20

Example: Bacterial infections in child care centers

◮ Comparison of the proposed approach with a standard population Monte Carlo ABC approach
◮ Roughly equal results using 1000 times fewer simulations

[Figure: estimates of the internal and external infection parameters as a function of computational cost (log10), developed fast method vs standard method.]

Posterior means are shown as solid lines, credibility intervals as shaded areas or dashed lines.

SLIDE 21

Further benefits

◮ The proposed method makes the inference more efficient
◮ Allowed us to perform a far more comprehensive data analysis than with the standard approach (Numminen et al, 2016)
◮ Enables inference for models which were out of reach until now
  ◮ a model of evolution where simulating a single data set took us 12–24 hours (Marttinen et al, 2015)
◮ Enables easier assessment of parameter identifiability for complex models
  ◮ a model of the transmission dynamics of tuberculosis (Lintusaari et al, 2016)

SLIDE 22

Open questions

◮ Model: How to best model the distance between simulated and observed data?
◮ Acquisition function: Can we find strategies which are optimal for parameter inference?
◮ Efficient high-dimensional inference: Can we use the approach to infer the joint distribution of 1000 variables?

See the JMLR paper for a discussion.

SLIDE 23

Summary

◮ Topic: Inference for models where the likelihood is intractable but sampling is possible
◮ Inference principle: Find parameter values for which the distance between simulated and observed data is small
◮ Problem considered: Computational cost
◮ Proposed approach: Combine statistical modeling of the distance with decision making under uncertainty (Bayesian optimization)
◮ Outcome: The approach increases the efficiency of the inference by several orders of magnitude

SLIDE 24

References

◮ M.U. Gutmann and J. Corander. Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research, 17(125):1–47, 2016
◮ J. Lintusaari, M.U. Gutmann, R. Dutta, S. Kaski, and J. Corander. Fundamentals and recent developments in approximate Bayesian computation. Systematic Biology, in press, 2016
◮ E. Numminen, M.U. Gutmann, M. Shubin, et al. The impact of host metapopulation structure on the population genetics of colonizing bacteria. Journal of Theoretical Biology, 396:53–62, 2016
◮ J. Lintusaari, M.U. Gutmann, S. Kaski, and J. Corander. On the identifiability of transmission dynamic models for infectious diseases. Genetics, 202(3):911–918, 2016
◮ P. Marttinen, N.J. Croucher, M.U. Gutmann, J. Corander, and W.P. Hanage. Recombination produces coherent bacterial species clusters in both core and accessory genomes. Microbial Genomics, 1(5), 2015
◮ E. Numminen et al. Estimating the transmission dynamics of Streptococcus pneumoniae from strain prevalence data. Biometrics, 69, 2013
◮ N. Srinivas, A. Krause, S.M. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012
◮ S.N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466:1102–1104, 2010
◮ D. Cox and S. John. A statistical method for global optimization. Proc. IEEE Conference on Systems, Man and Cybernetics, 2:1241–1246, 1992