

slide-1
SLIDE 1

Kernel Recursive ABC: Point Estimation with Intractable Likelihood

Motonobu Kanagawa

EURECOM, Sophia Antipolis, France (Previously U. Tübingen)

ISM-UUlm Workshop, October 2019

1 / 44

slide-2
SLIDE 2

Contents of This Talk

  • 1. Kernel Recursive ABC: Point Estimation with Intractable Likelihood (ICML 2018)
  • T. Kajihara, M. Kanagawa, K. Yamazaki and K. Fukumizu.

2 / 44

slide-3
SLIDE 3

Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions

3 / 44

slide-4
SLIDE 4

Machine Learning for Computer Simulation

Computer simulation has been used to study time-evolving complex phenomena in various scientific fields.

◮ Climate science, social science, economics, ecology, epidemiology, etc.

4 / 44

slide-5
SLIDE 5

Machine Learning for Computer Simulation

Computer simulation has been used to study time-evolving complex phenomena in various scientific fields.

◮ Climate science, social science, economics, ecology, epidemiology, etc.

The power of computer simulation is extrapolation, which enables

◮ predictions of quantities/phenomena in the future.
◮ gaining understanding of the phenomena of interest.

4 / 44

slide-6
SLIDE 6

Machine Learning for Computer Simulation

[Diagram: a computer simulator maps a model description and parameters θ1, …, θd to simulation outputs, subject to stochastic errors, numerical errors, and computational costs; the workflow consists of 1. Simulation, 2. Calibration against observed data, and 3. Interpretation.]

5 / 44

slide-7
SLIDE 7

Example: Tsunami Simulation [Saito, 2019, p.211]

6 / 44

slide-8
SLIDE 8

Example: Pedestrian Flow Simulation [Yamashita et al., 2010]

Multi-agent systems for pedestrians walking in Ginza.

Figure 1: Points representing individual pedestrians. (red = slow)

7 / 44

slide-9
SLIDE 9

Calibration: Parameter Estimation and Model Selection

[Diagram: a computer simulator maps a model description and parameters θ1, …, θd to simulation outputs, subject to stochastic errors, numerical errors, and computational costs; the workflow consists of 1. Simulation, 2. Calibration against observed data, and 3. Interpretation.]

8 / 44

slide-10
SLIDE 10

Calibration: Parameter Estimation and Model Selection

To obtain a “good” simulator, the following two tasks regarding calibration to observed data must be addressed.

9 / 44

slide-11
SLIDE 11

Calibration: Parameter Estimation and Model Selection

To obtain a “good” simulator, the following two tasks regarding calibration to observed data must be addressed.

  • 1. Parameter estimation

◮ Estimate parameters θ of a simulation model p(y ∗|θ).

(y ∗ denotes observed data.)

9 / 44

slide-12
SLIDE 12

Calibration: Parameter Estimation and Model Selection

To obtain a “good” simulator, the following two tasks regarding calibration to observed data must be addressed.

  • 1. Parameter estimation

◮ Estimate parameters θ of a simulation model p(y ∗|θ).

(y ∗ denotes observed data.)

  • 2. Model selection

◮ Select one model from multiple (K ≥ 2) candidate models:

p1(y ∗|θ1), p2(y ∗|θ2), . . . , pK(y ∗|θK)

9 / 44

slide-13
SLIDE 13

Calibration: Parameter Estimation and Model Selection

To obtain a “good” simulator, the following two tasks regarding calibration to observed data must be addressed.

  • 1. Parameter estimation

◮ Estimate parameters θ of a simulation model p(y ∗|θ).

(y ∗ denotes observed data.)

  • 2. Model selection

◮ Select one model from multiple (K ≥ 2) candidate models:

p1(y∗|θ1), p2(y∗|θ2), . . . , pK(y∗|θK)

In the language of statistics, computer simulation can be defined as sampling from a probabilistic model p(y|θ).

9 / 44

slide-14
SLIDE 14

Calibration: Parameter Estimation and Model Selection

These tasks are harder than standard statistical problems, since the likelihood function ℓ(θ) := p(y ∗|θ) is not available.

10 / 44

slide-15
SLIDE 15

Calibration: Parameter Estimation and Model Selection

These tasks are harder than standard statistical problems, since the likelihood function ℓ(θ) := p(y ∗|θ) is not available. This is because

◮ The mapping θ → y is usually very complicated (e.g., it involves solving differential equations).

10 / 44

slide-16
SLIDE 16

Calibration: Parameter Estimation and Model Selection

These tasks are harder than standard statistical problems, since the likelihood function ℓ(θ) := p(y ∗|θ) is not available. This is because

◮ The mapping θ → y is usually very complicated (e.g., it involves solving differential equations).

Thus one needs to solve these tasks by likelihood-free inference, making use of sampling/forward simulations.

10 / 44

slide-17
SLIDE 17

Calibration: Parameter Estimation and Model Selection

These tasks are harder than standard statistical problems, since the likelihood function ℓ(θ) := p(y ∗|θ) is not available. This is because

◮ The mapping θ → y is usually very complicated (e.g., it involves solving differential equations).

Thus one needs to solve these tasks by likelihood-free inference, making use of sampling/forward simulations. Approaches to likelihood-free inference include

◮ Approximate Bayesian Computation (ABC)

[Sisson et al., 2018].

◮ Bayesian optimization [Gutmann and Corander, 2016].

10 / 44

slide-18
SLIDE 18

Approximate Bayesian Computation (ABC)

– Set ε > 0 and J := {}.

11 / 44

slide-19
SLIDE 19

Approximate Bayesian Computation (ABC)

– Set ε > 0 and J := {}.
– Generate parameter-data pairs from the model: (θ1, y1), . . . , (θn, yn) ∼ p(y|θ)π(θ), i.i.d., where p(y|θ) is the simulator and π(θ) the prior.

11 / 44

slide-20
SLIDE 20

Approximate Bayesian Computation (ABC)

– Set ε > 0 and J := {}.
– Generate parameter-data pairs from the model: (θ1, y1), . . . , (θn, yn) ∼ p(y|θ)π(θ), i.i.d., where p(y|θ) is the simulator and π(θ) the prior.
– For j = 1, . . . , n, set J ← J ∪ {j} if dist(yj, y∗) ≤ ε, where y∗ is observed data.

11 / 44

slide-21
SLIDE 21

Approximate Bayesian Computation (ABC)

– Set ε > 0 and J := {}.
– Generate parameter-data pairs from the model: (θ1, y1), . . . , (θn, yn) ∼ p(y|θ)π(θ), i.i.d., where p(y|θ) is the simulator and π(θ) the prior.
– For j = 1, . . . , n, set J ← J ∪ {j} if dist(yj, y∗) ≤ ε, where y∗ is observed data.
– Monte Carlo approximation of the posterior: p(θ|y∗) ≈ p̂ε(θ|y∗) := (1/|J|) Σ_{j∈J} δθj, where δθj denotes the Dirac measure at θj. (A minimal code sketch of this procedure is given below.)

11 / 44
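A minimal code sketch of the rejection-ABC loop above, for a toy 1-D Gaussian location model; the simulator, prior, distance, and tolerance are illustrative assumptions, not the models used later in the talk.

```python
# Minimal rejection ABC sketch for a toy 1-D Gaussian location model.
# The simulator, prior, summary distance, and tolerance are illustrative assumptions.
import numpy as np

def rejection_abc(y_star, simulate, sample_prior, dist, eps, n):
    """Return accepted parameters approximating p(theta | y_star)."""
    accepted = []
    for _ in range(n):
        theta = sample_prior()          # theta_i ~ pi(theta)
        y = simulate(theta)             # y_i ~ p(y | theta_i)
        if dist(y, y_star) <= eps:      # keep theta_i if simulated data is close to y_star
            accepted.append(theta)
    return np.array(accepted)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_star = rng.normal(0.0, 1.0, size=100)               # observed data (true theta = 0)
    simulate = lambda th: rng.normal(th, 1.0, size=100)   # simulator p(y | theta)
    sample_prior = lambda: rng.uniform(-10.0, 10.0)       # prior pi(theta)
    dist = lambda y, ys: abs(y.mean() - ys.mean())        # summary-statistic distance
    post = rejection_abc(y_star, simulate, sample_prior, dist, eps=0.1, n=5000)
    print(len(post), post.mean())       # number of accepted samples and posterior-mean estimate
```

Smaller ε gives a better posterior approximation but a lower acceptance rate, which is one reason likelihood-free inference can require many simulations.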

slide-22
SLIDE 22

Contributions

We propose a kernel-based method for point estimation of simulation-based statistical models.

12 / 44

slide-23
SLIDE 23

Contributions

We propose a kernel-based method for point estimation of simulation-based statistical models. The proposed approach (termed kernel recursive ABC)

12 / 44

slide-24
SLIDE 24

Contributions

We propose a kernel-based method for point estimation of simulation-based statistical models. The proposed approach (termed kernel recursive ABC)

◮ is based on kernel mean embeddings,

12 / 44

slide-25
SLIDE 25

Contributions

We propose a kernel-based method for point estimation of simulation-based statistical models. The proposed approach (termed kernel recursive ABC)

◮ is based on kernel mean embeddings,
◮ is a combination of kernel ABC and kernel herding, and

12 / 44

slide-26
SLIDE 26

Contributions

We propose a kernel-based method for point estimation of simulation-based statistical models. The proposed approach (termed kernel recursive ABC)

◮ is based on kernel mean embeddings,
◮ is a combination of kernel ABC and kernel herding, and
◮ recursively applies Bayes' rule to the same observed data.

12 / 44

slide-27
SLIDE 27

Contributions

We propose a kernel-based method for point estimation of simulation-based statistical models. The proposed approach (termed kernel recursive ABC)

◮ is based on kernel mean embeddings,
◮ is a combination of kernel ABC and kernel herding, and
◮ recursively applies Bayes' rule to the same observed data.

It should be useful when point estimation is more desirable than the fully Bayesian approach.

12 / 44

slide-28
SLIDE 28

Contributions

We propose a kernel-based method for point estimation of simulation-based statistical models. The proposed approach (termed kernel recursive ABC)

◮ is based on kernel mean embeddings,
◮ is a combination of kernel ABC and kernel herding, and
◮ recursively applies Bayes' rule to the same observed data.

It should be useful when point estimation is more desirable than the fully Bayesian approach. For instance:

◮ when your prior distribution π(θ) is not fully reliable,

12 / 44

slide-29
SLIDE 29

Contributions

We propose a kernel-based method for point estimation of simulation-based statistical models. The proposed approach (termed kernel recursive ABC)

◮ is based on kernel mean embeddings,
◮ is a combination of kernel ABC and kernel herding, and
◮ recursively applies Bayes' rule to the same observed data.

It should be useful when point estimation is more desirable than the fully Bayesian approach. For instance:

◮ when your prior distribution π(θ) is not fully reliable,
◮ when one simulation is computationally very expensive, and

12 / 44

slide-30
SLIDE 30

Contributions

We propose a kernel-based method for point estimation of simulation-based statistical models. The proposed approach (termed kernel recursive ABC)

◮ is based on kernel mean embeddings,
◮ is a combination of kernel ABC and kernel herding, and
◮ recursively applies Bayes' rule to the same observed data.

It should be useful when point estimation is more desirable than the fully Bayesian approach. For instance:

◮ when your prior distribution π(θ) is not fully reliable,
◮ when one simulation is computationally very expensive, and
◮ when your main purpose is prediction based on simulations.

12 / 44

slide-31
SLIDE 31

Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions

13 / 44

slide-32
SLIDE 32

Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

Let k : X × X → R be a symmetric function on a set X.

14 / 44

slide-33
SLIDE 33

Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

Let k : X × X → R be a symmetric function on a set X. The function k(x, x′) is called a positive definite kernel if

Σ_{i=1}^n Σ_{j=1}^n c_i c_j k(x_i, x_j) ≥ 0

holds for all n ∈ N, c1, . . . , cn ∈ R, x1, . . . , xn ∈ X.

14 / 44

slide-34
SLIDE 34

Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

Let k : X × X → R be a symmetric function on a set X. The function k(x, x′) is called a positive definite kernel if

Σ_{i=1}^n Σ_{j=1}^n c_i c_j k(x_i, x_j) ≥ 0

holds for all n ∈ N, c1, . . . , cn ∈ R, x1, . . . , xn ∈ X.

Examples of positive definite kernels on X = R^d:
Gaussian: k(x, x′) = exp(−‖x − x′‖²/γ²).
Laplace (Matérn): k(x, x′) = exp(−‖x − x′‖/γ).
Linear: k(x, x′) = ⟨x, x′⟩.
Polynomial: k(x, x′) = (⟨x, x′⟩ + c)^m.

14 / 44

slide-35
SLIDE 35

Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

Let k : X × X → R be a symmetric function on a set X. The function k(x, x′) is called a positive definite kernel if

Σ_{i=1}^n Σ_{j=1}^n c_i c_j k(x_i, x_j) ≥ 0

holds for all n ∈ N, c1, . . . , cn ∈ R, x1, . . . , xn ∈ X.

Examples of positive definite kernels on X = R^d:
Gaussian: k(x, x′) = exp(−‖x − x′‖²/γ²).
Laplace (Matérn): k(x, x′) = exp(−‖x − x′‖/γ).
Linear: k(x, x′) = ⟨x, x′⟩.
Polynomial: k(x, x′) = (⟨x, x′⟩ + c)^m.

In this talk, I will simply call k a kernel.

14 / 44
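As a quick numerical illustration of positive definiteness (not from the slides), the sketch below builds a Gaussian Gram matrix and checks that its eigenvalues are non-negative; the data and bandwidth are arbitrary choices.

```python
# Sketch: build a Gaussian (RBF) Gram matrix and check positive semi-definiteness
# numerically via its eigenvalues. The data and bandwidth gamma are illustrative.
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / gamma^2) for rows x_i of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / gamma ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))             # 50 points in R^3
K = gaussian_kernel_matrix(X, gamma=2.0)
eigvals = np.linalg.eigvalsh(K)          # all eigenvalues should be >= 0 (up to round-off)
print(eigvals.min())
```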

slide-36
SLIDE 36

Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

For any kernel k, there is a uniquely associated Hilbert space H consisting of functions on X such that

15 / 44

slide-37
SLIDE 37

Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

For any kernel k, there is a uniquely associated Hilbert space H consisting of functions on X such that (i) k(·, x) ∈ H for all x ∈ X

15 / 44

slide-38
SLIDE 38

Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

For any kernel k, there is a uniquely associated Hilbert space H consisting of functions on X such that (i) k(·, x) ∈ H for all x ∈ X where k(·, x) is the function of the first argument with x fixed: x′ ∈ X → k(x′, x).

15 / 44

slide-39
SLIDE 39

Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

For any kernel k, there is a uniquely associated Hilbert space H consisting of functions on X such that
(i) k(·, x) ∈ H for all x ∈ X, where k(·, x) is the function of the first argument with x fixed: x′ ∈ X ↦ k(x′, x);
(ii) f(x) = ⟨f, k(·, x)⟩_H for all f ∈ H and x ∈ X, which is called the reproducing property.

15 / 44

slide-40
SLIDE 40

Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

For any kernel k, there is a uniquely associated Hilbert space H consisting of functions on X such that
(i) k(·, x) ∈ H for all x ∈ X, where k(·, x) is the function of the first argument with x fixed: x′ ∈ X ↦ k(x′, x);
(ii) f(x) = ⟨f, k(·, x)⟩_H for all f ∈ H and x ∈ X, which is called the reproducing property.
– H is called the RKHS of k.

15 / 44

slide-41
SLIDE 41

Kernels and Reproducing Kernel Hilbert Spaces (RKHS)

For any kernel k, there is a uniquely associated Hilbert space H consisting of functions on X such that
(i) k(·, x) ∈ H for all x ∈ X, where k(·, x) is the function of the first argument with x fixed: x′ ∈ X ↦ k(x′, x);
(ii) f(x) = ⟨f, k(·, x)⟩_H for all f ∈ H and x ∈ X, which is called the reproducing property.
– H is called the RKHS of k.
– H can be written as H = span{k(·, x) | x ∈ X} (the closure of the linear span).

15 / 44

slide-42
SLIDE 42

Kernel Mean Embeddings [Smola et al., 2007]

A framework for representing distributions in an RKHS.

16 / 44

slide-43
SLIDE 43

Kernel Mean Embeddings [Smola et al., 2007]

A framework for representing distributions in an RKHS. – Let P be the set of all probability distributions on X.

16 / 44

slide-44
SLIDE 44

Kernel Mean Embeddings [Smola et al., 2007]

A framework for representing distributions in an RKHS. – Let P be the set of all probability distributions on X. – Let k be a kernel on X, and H be its RKHS.

16 / 44

slide-45
SLIDE 45

Kernel Mean Embeddings [Smola et al., 2007]

A framework for representing distributions in an RKHS.
– Let P be the set of all probability distributions on X.
– Let k be a kernel on X, and H be its RKHS.

For each distribution P ∈ P, define the kernel mean:

µP := ∫ k(·, x) dP(x) ∈ H,

which is a representation of P in H.

16 / 44

slide-46
SLIDE 46

Kernel Mean Embeddings [Smola et al., 2007]

A framework for representing distributions in an RKHS.
– Let P be the set of all probability distributions on X.
– Let k be a kernel on X, and H be its RKHS.

For each distribution P ∈ P, define the kernel mean:

µP := ∫ k(·, x) dP(x) ∈ H,

which is a representation of P in H.

A key concept: Characteristic kernels [Fukumizu et al., 2008].
– The kernel k is called characteristic if, for any P, Q ∈ P, µP = µQ if and only if P = Q.

16 / 44

slide-47
SLIDE 47

Kernel Mean Embeddings [Smola et al., 2007]

A framework for representing distributions in an RKHS.
– Let P be the set of all probability distributions on X.
– Let k be a kernel on X, and H be its RKHS.

For each distribution P ∈ P, define the kernel mean:

µP := ∫ k(·, x) dP(x) ∈ H,

which is a representation of P in H.

A key concept: Characteristic kernels [Fukumizu et al., 2008].
– The kernel k is called characteristic if, for any P, Q ∈ P, µP = µQ if and only if P = Q.
– In other words, k is characteristic if the mapping P ∈ P ↦ µP ∈ H is injective.

16 / 44
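In practice the kernel mean is estimated by an empirical average over samples; the following sketch computes two empirical kernel means with a Gaussian (characteristic) kernel and their RKHS distance (the MMD). Sample sizes and bandwidth are chosen only for illustration.

```python
# Sketch: empirical kernel means and the resulting RKHS distance (MMD) between two
# samples, using a Gaussian kernel. Sample sizes and bandwidth are illustrative.
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    return np.exp(-np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1) / gamma ** 2)

def mmd_squared(X, Y, gamma=1.0):
    """||mu_P_hat - mu_Q_hat||_H^2 for empirical kernel means of samples X ~ P, Y ~ Q."""
    kxx = gaussian_kernel(X, X, gamma).mean()
    kyy = gaussian_kernel(Y, Y, gamma).mean()
    kxy = gaussian_kernel(X, Y, gamma).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))   # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 1))   # sample from a shifted Q
print(mmd_squared(X, Y, gamma=1.0))       # > 0 since P != Q and the kernel is characteristic
```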

slide-48
SLIDE 48

Kernel Mean Embeddings [Smola et al., 2007]

Intuitively, k being characteristic implies that H is large enough.

Figure 2: Injective embedding [Muandet et al., 2017, Figure 2.3]

17 / 44

slide-49
SLIDE 49

Kernel Mean Embeddings [Smola et al., 2007]

Intuitively, k being characteristic implies that H is large enough.

Figure 2: Injective embedding [Muandet et al., 2017, Figure 2.3]

Examples of characteristic kernels on X = Rd: – Gaussian and Matérn kernels [Sriperumbudur et al., 2010].

17 / 44

slide-50
SLIDE 50

Kernel Mean Embeddings [Smola et al., 2007]

Intuitively, k being characteristic implies that H is large enough.

Figure 2: Injective embedding [Muandet et al., 2017, Figure 2.3]

Examples of characteristic kernels on X = R^d:
– Gaussian and Matérn kernels [Sriperumbudur et al., 2010].

Examples of non-characteristic kernels on X = R^d:
– Linear and polynomial kernels.

17 / 44

slide-51
SLIDE 51

Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions

18 / 44

slide-52
SLIDE 52

Recursive Bayes Updates and Power Posteriors

Given observed data y∗, Bayes' rule yields a posterior distribution:

p(θ|y∗) ∝ p(y∗|θ) π(θ)   (posterior ∝ likelihood × prior).

19 / 44

slide-53
SLIDE 53

Recursive Bayes Updates and Power Posteriors

Given observed data y∗, Bayes' rule yields a posterior distribution:

p(θ|y∗) ∝ p(y∗|θ) π(θ)   (posterior ∝ likelihood × prior).

Recursive Bayes updates: Apply Bayes' rule recursively to the same observed data y∗.

1st recursion: π1(θ) := p1(θ|y∗) ∝ p(y∗|θ) π(θ).

19 / 44

slide-54
SLIDE 54

Recursive Bayes Updates and Power Posteriors

Given observed data y∗, Bayes' rule yields a posterior distribution:

p(θ|y∗) ∝ p(y∗|θ) π(θ)   (posterior ∝ likelihood × prior).

Recursive Bayes updates: Apply Bayes' rule recursively to the same observed data y∗.

1st recursion: π1(θ) := p1(θ|y∗) ∝ p(y∗|θ) π(θ).
2nd recursion: π2(θ) := p2(θ|y∗) ∝ p(y∗|θ) π1(θ)

19 / 44

slide-55
SLIDE 55

Recursive Bayes Updates and Power Posteriors

Given observed data y∗, Bayes' rule yields a posterior distribution:

p(θ|y∗) ∝ p(y∗|θ) π(θ)   (posterior ∝ likelihood × prior).

Recursive Bayes updates: Apply Bayes' rule recursively to the same observed data y∗.

1st recursion: π1(θ) := p1(θ|y∗) ∝ p(y∗|θ) π(θ).
2nd recursion: π2(θ) := p2(θ|y∗) ∝ p(y∗|θ) π1(θ) = p(y∗|θ)^2 π(θ).

19 / 44

slide-56
SLIDE 56

Recursive Bayes Updates and Power Posteriors

Given observed data y∗, Bayes' rule yields a posterior distribution:

p(θ|y∗) ∝ p(y∗|θ) π(θ)   (posterior ∝ likelihood × prior).

Recursive Bayes updates: Apply Bayes' rule recursively to the same observed data y∗.

1st recursion: π1(θ) := p1(θ|y∗) ∝ p(y∗|θ) π(θ).
2nd recursion: π2(θ) := p2(θ|y∗) ∝ p(y∗|θ) π1(θ) = p(y∗|θ)^2 π(θ).
3rd recursion: π3(θ) := p3(θ|y∗) ∝ p(y∗|θ) π2(θ)

19 / 44

slide-57
SLIDE 57

Recursive Bayes Updates and Power Posteriors

Given observed data y∗, Bayes' rule yields a posterior distribution:

p(θ|y∗) ∝ p(y∗|θ) π(θ)   (posterior ∝ likelihood × prior).

Recursive Bayes updates: Apply Bayes' rule recursively to the same observed data y∗.

1st recursion: π1(θ) := p1(θ|y∗) ∝ p(y∗|θ) π(θ).
2nd recursion: π2(θ) := p2(θ|y∗) ∝ p(y∗|θ) π1(θ) = p(y∗|θ)^2 π(θ).
3rd recursion: π3(θ) := p3(θ|y∗) ∝ p(y∗|θ) π2(θ) = p(y∗|θ)^3 π(θ).

19 / 44

slide-58
SLIDE 58

Recursive Bayes Updates and Power Posteriors

Given observed data y∗, Bayes' rule yields a posterior distribution:

p(θ|y∗) ∝ p(y∗|θ) π(θ)   (posterior ∝ likelihood × prior).

Recursive Bayes updates: Apply Bayes' rule recursively to the same observed data y∗.

1st recursion: π1(θ) := p1(θ|y∗) ∝ p(y∗|θ) π(θ).
2nd recursion: π2(θ) := p2(θ|y∗) ∝ p(y∗|θ) π1(θ) = p(y∗|θ)^2 π(θ).
3rd recursion: π3(θ) := p3(θ|y∗) ∝ p(y∗|θ) π2(θ) = p(y∗|θ)^3 π(θ).
· · ·

19 / 44

slide-59
SLIDE 59

Recursive Bayes Updates and Power Posteriors

Given observed data y∗, Bayes' rule yields a posterior distribution:

p(θ|y∗) ∝ p(y∗|θ) π(θ)   (posterior ∝ likelihood × prior).

Recursive Bayes updates: Apply Bayes' rule recursively to the same observed data y∗.

1st recursion: π1(θ) := p1(θ|y∗) ∝ p(y∗|θ) π(θ).
2nd recursion: π2(θ) := p2(θ|y∗) ∝ p(y∗|θ) π1(θ) = p(y∗|θ)^2 π(θ).
3rd recursion: π3(θ) := p3(θ|y∗) ∝ p(y∗|θ) π2(θ) = p(y∗|θ)^3 π(θ).
· · ·
N-th recursion: πN(θ) := pN(θ|y∗) ∝ p(y∗|θ) πN−1(θ)

19 / 44

slide-60
SLIDE 60

Recursive Bayes Updates and Power Posteriors

Given observed data y∗, Bayes' rule yields a posterior distribution:

p(θ|y∗) ∝ p(y∗|θ) π(θ)   (posterior ∝ likelihood × prior).

Recursive Bayes updates: Apply Bayes' rule recursively to the same observed data y∗.

1st recursion: π1(θ) := p1(θ|y∗) ∝ p(y∗|θ) π(θ).
2nd recursion: π2(θ) := p2(θ|y∗) ∝ p(y∗|θ) π1(θ) = p(y∗|θ)^2 π(θ).
3rd recursion: π3(θ) := p3(θ|y∗) ∝ p(y∗|θ) π2(θ) = p(y∗|θ)^3 π(θ).
· · ·
N-th recursion: πN(θ) := pN(θ|y∗) ∝ p(y∗|θ) πN−1(θ) = p(y∗|θ)^N π(θ).

19 / 44

slide-61
SLIDE 61

Power Posteriors and Maximum Likelihood Estimation

N recursive Bayes updates yield the power posterior

pN(θ|y∗) ∝ p(y∗|θ)^N π(θ).

20 / 44

slide-62
SLIDE 62

Power Posteriors and Maximum Likelihood Estimation

N recursive Bayes updates yield the power posterior

pN(θ|y∗) ∝ p(y∗|θ)^N π(θ).

Theorem [Lele et al., 2010]. Assume that p(y∗|θ) has a unique global maximizer:

θ∗ := arg max_{θ∈Θ} p(y∗|θ).

20 / 44

slide-63
SLIDE 63

Power Posteriors and Maximum Likelihood Estimation

N recursive Bayes updates yield the power posterior

pN(θ|y∗) ∝ p(y∗|θ)^N π(θ).

Theorem [Lele et al., 2010]. Assume that p(y∗|θ) has a unique global maximizer:

θ∗ := arg max_{θ∈Θ} p(y∗|θ).

Then, if π(θ∗) > 0, under mild conditions on π(θ) and p(y|θ),

pN(θ|y∗) → δθ∗ (the Dirac measure at θ∗) as N → ∞ (weak convergence).

20 / 44

slide-64
SLIDE 64

Power Posteriors and Maximum Likelihood Estimation

N recursive Bayes updates yield the power posterior

pN(θ|y∗) ∝ p(y∗|θ)^N π(θ).

Theorem [Lele et al., 2010]. Assume that p(y∗|θ) has a unique global maximizer:

θ∗ := arg max_{θ∈Θ} p(y∗|θ).

Then, if π(θ∗) > 0, under mild conditions on π(θ) and p(y|θ),

pN(θ|y∗) → δθ∗ (the Dirac measure at θ∗) as N → ∞ (weak convergence).

This implies that recursive Bayes updates provide a route to maximum likelihood estimation.

20 / 44
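To see the theorem at work in a fully tractable case, the sketch below computes the exact power posterior for a 1-D Gaussian mean with a conjugate Gaussian prior; its mean approaches the MLE (the sample mean) and its spread shrinks as N grows. This is only an illustration of the Lele et al. result, not of the likelihood-free setting of the talk.

```python
# Sketch: the power posterior p_N(theta | y*) ∝ p(y*|theta)^N pi(theta) concentrating
# on the MLE as N grows, for a 1-D Gaussian mean with known unit variance. Here the
# likelihood is tractable, so the exact power posterior is Gaussian (conjugacy).
import numpy as np

rng = np.random.default_rng(0)
y_star = rng.normal(2.0, 1.0, size=30)          # observed data, true mean 2.0
ybar, n = y_star.mean(), len(y_star)

mu0, tau0 = -5.0, 3.0                           # Gaussian prior pi(theta) = N(mu0, tau0^2)
for N in [1, 2, 5, 20, 100]:
    # p(y*|theta)^N pi(theta) is the posterior from N "clones" of the data (sigma^2 = 1)
    prec = 1.0 / tau0**2 + N * n                # posterior precision
    mean = (mu0 / tau0**2 + N * n * ybar) / prec
    print(f"N={N:4d}  mean={mean:.4f}  sd={prec**-0.5:.4f}")
# mean -> ybar (the MLE) and sd -> 0 as N increases
```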

slide-65
SLIDE 65

Proposed Method: Kernel Recursive ABC (Sketch)

– Recursive Applications of 1.Bayes’ rule and 2.sampling.

21 / 44

slide-66
SLIDE 66

Proposed Method: Kernel Recursive ABC (Sketch)

– Recursive Applications of 1.Bayes’ rule and 2.sampling. For N = 1, 2, . . . , Niter, iterate the following procedures:

21 / 44

slide-67
SLIDE 67

Proposed Method: Kernel Recursive ABC (Sketch)

– Recursive Applications of 1.Bayes’ rule and 2.sampling. For N = 1, 2, . . . , Niter, iterate the following procedures:

  • 1. Kernel ABC: If N = 1: generate θ1, . . . , θn ∼ π(θ), i.i.d.

21 / 44

slide-68
SLIDE 68

Proposed Method: Kernel Recursive ABC (Sketch)

– Recursive Applications of 1.Bayes’ rule and 2.sampling. For N = 1, 2, . . . , Niter, iterate the following procedures:

  • 1. Kernel ABC: If N = 1: generate θ1, . . . , θn ∼ π(θ), i.i.d.

– Simulate pseudo-data for each θi: yi ∼ p(y|θi) (i = 1, . . . , n).

21 / 44

slide-69
SLIDE 69

Proposed Method: Kernel Recursive ABC (Sketch)

– Recursive Applications of 1.Bayes’ rule and 2.sampling. For N = 1, 2, . . . , Niter, iterate the following procedures:

  • 1. Kernel ABC: If N = 1: generate θ1, . . . , θn ∼ π(θ), i.i.d.

– Simulate pseudo-data for each θi: yi ∼ p(y|θi) (i = 1, . . . , n).
– Estimate the kernel mean of the power posterior using {(θi, yi)}_{i=1}^n:

µPN := ∫ kΘ(·, θ) pN(θ|y∗) dθ,   (1)

where kΘ is a kernel on Θ and pN(θ|y∗) ∝ p(y∗|θ)^N π(θ).

21 / 44

slide-70
SLIDE 70

Proposed Method: Kernel Recursive ABC (Sketch)

– Recursive Applications of 1.Bayes’ rule and 2.sampling. For N = 1, 2, . . . , Niter, iterate the following procedures:

  • 1. Kernel ABC: If N = 1: generate θ1, . . . , θn ∼ π(θ), i.i.d.

– Simulate pseudo-data for each θi: yi ∼ p(y|θi) (i = 1, . . . , n).
– Estimate the kernel mean of the power posterior using {(θi, yi)}_{i=1}^n:

µPN := ∫ kΘ(·, θ) pN(θ|y∗) dθ,   (1)

where kΘ is a kernel on Θ and pN(θ|y∗) ∝ p(y∗|θ)^N π(θ).

  • 2. Kernel Herding: Sample θ′1, . . . , θ′n from the estimate of (1).

21 / 44

slide-71
SLIDE 71

Proposed Method: Kernel Recursive ABC (Sketch)

– Recursive Applications of 1.Bayes’ rule and 2.sampling. For N = 1, 2, . . . , Niter, iterate the following procedures:

  • 1. Kernel ABC: If N = 1: generate θ1, . . . , θn ∼ π(θ), i.i.d.

– Simulate pseudo-data for each θi: yi ∼ p(y|θi) (i = 1, . . . , n).
– Estimate the kernel mean of the power posterior using {(θi, yi)}_{i=1}^n:

µPN := ∫ kΘ(·, θ) pN(θ|y∗) dθ,   (1)

where kΘ is a kernel on Θ and pN(θ|y∗) ∝ p(y∗|θ)^N π(θ).

  • 2. Kernel Herding: Sample θ′1, . . . , θ′n from the estimate of (1).

Set N ← N + 1 and (θ1, . . . , θn) ← (θ′1, . . . , θ′n).

21 / 44

slide-72
SLIDE 72

Kernel ABC [Nakagome et al., 2013]

– Define
◮ a kernel kY(y, y′) on the data space Y,
◮ a kernel kΘ(θ, θ′) on the parameter space Θ, and
◮ a regularisation constant λ > 0.

22 / 44

slide-73
SLIDE 73

Kernel ABC [Nakagome et al., 2013]

– Define
◮ a kernel kY(y, y′) on the data space Y,
◮ a kernel kΘ(θ, θ′) on the parameter space Θ, and
◮ a regularisation constant λ > 0.

  • 1. Sampling: Generate parameter-data pairs from the model:

(θ1, y1), . . . , (θn, yn) ∼ p(y|θ)π(θ), i.i.d.

22 / 44

slide-74
SLIDE 74

Kernel ABC [Nakagome et al., 2013]

– Define
◮ a kernel kY(y, y′) on the data space Y,
◮ a kernel kΘ(θ, θ′) on the parameter space Θ, and
◮ a regularisation constant λ > 0.

  • 1. Sampling: Generate parameter-data pairs from the model:

(θ1, y1), . . . , (θn, yn) ∼ p(y|θ)π(θ), i.i.d.

  • 2. Weight computation: Given observed data y ∗, compute

kY(y∗) := (kY(y∗, y1), . . . , kY(y∗, yn))⊤ ∈ R^n,
(w1(y∗), . . . , wn(y∗))⊤ := (GY + nλIn)⁻¹ kY(y∗) ∈ R^n,
where GY := (kY(yi, yj)) ∈ R^{n×n} is the kernel matrix.

22 / 44

slide-75
SLIDE 75

Kernel ABC [Nakagome et al., 2013]

– Define
◮ a kernel kY(y, y′) on the data space Y,
◮ a kernel kΘ(θ, θ′) on the parameter space Θ, and
◮ a regularisation constant λ > 0.

  • 1. Sampling: Generate parameter-data pairs from the model:

(θ1, y1), . . . , (θn, yn) ∼ p(y|θ)π(θ), i.i.d.

  • 2. Weight computation: Given observed data y ∗, compute

kY(y∗) := (kY(y∗, y1), . . . , kY(y∗, yn))⊤ ∈ R^n,
(w1(y∗), . . . , wn(y∗))⊤ := (GY + nλIn)⁻¹ kY(y∗) ∈ R^n,
where GY := (kY(yi, yj)) ∈ R^{n×n} is the kernel matrix.

Output: An estimate of the posterior kernel mean:

∫ kΘ(·, θ) p(θ|y∗) dθ ≈ Σ_{i=1}^n wi(y∗) kΘ(·, θi),   where p(θ|y∗) ∝ p(y∗|θ)π(θ).

(A minimal code sketch of this procedure is given below.)

22 / 44
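A minimal sketch of the kernel ABC weight computation above, on a toy 1-D model; the Gaussian kernels, bandwidth, regularisation constant, and simulator are illustrative assumptions.

```python
# Sketch of the kernel ABC weight computation. The kernels, bandwidth, and the
# toy simulator are illustrative choices, not the paper's configuration.
import numpy as np

def gauss_kernel(a, b, gamma):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / gamma ** 2)

def kernel_abc_weights(Y_sim, y_star, gamma=1.0, lam=0.01):
    """Weights w(y*) = (G_Y + n*lam*I)^{-1} k_Y(y*) for simulated data Y_sim (n x d)."""
    n = Y_sim.shape[0]
    G = gauss_kernel(Y_sim, Y_sim, gamma)                        # kernel matrix G_Y
    k_star = gauss_kernel(Y_sim, y_star[None, :], gamma)[:, 0]   # vector k_Y(y*)
    return np.linalg.solve(G + n * lam * np.eye(n), k_star)

rng = np.random.default_rng(0)
thetas = rng.uniform(-5, 5, size=(500, 1))                  # theta_i ~ pi(theta)
Y_sim = thetas + rng.normal(0, 1, size=(500, 1))            # y_i ~ p(y | theta_i) (toy simulator)
y_star = np.array([1.5])                                    # observed data
w = kernel_abc_weights(Y_sim, y_star)
# Raw kernel-ABC estimate of the posterior mean is sum_i w_i * theta_i;
# the normalized version below is a common practical variant.
print((w * thetas[:, 0]).sum() / w.sum())
```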

slide-76
SLIDE 76

Kernel ABC: The Sampling Step

  • 1. Sampling: Generate parameter-data pairs from the model:

(θ1, y1), . . . , (θn, yn) ∼ p(y|θ)π(θ), i.i.d.

[Figure: parameters θ1, . . . , θ4 are sampled from the prior π(θ) in the parameter space Θ, and each is mapped by the simulator to pseudo-data y1, . . . , y4 in the data space Y, shown alongside the observed data y∗ and the true parameter θ∗.]

23 / 44

slide-77
SLIDE 77

Kernel ABC: The Weight Computation Step

  • 2. Weight computation: Given observed data y ∗, compute
  • 1. Similarities:

kY (y ∗) = (kY(y ∗, y1), . . . , kY(y ∗, yn))⊤,

  • 2. Weights:

(w1(y ∗), . . . , wn(y ∗))⊤ = (GY + nλIn)−1kY (y ∗).

[Figure: similarities between the observed data y∗ and each pseudo-data point yi are computed in the data space (1. similarity computation) and converted into weights on the corresponding parameters θi in the parameter space (2. weight computation).]

∫ kΘ(·, θ) p(θ|y∗) dθ ≈ Σ_{i=1}^n wi(y∗) kΘ(·, θi).

24 / 44

slide-78
SLIDE 78

Kernel Herding [Chen et al., 2010]

Let
– P be a known probability distribution on Θ, and
– µP = ∫ kΘ(·, θ) dP(θ) be its kernel mean.

25 / 44

slide-79
SLIDE 79

Kernel Herding [Chen et al., 2010]

Let
– P be a known probability distribution on Θ, and
– µP = ∫ kΘ(·, θ) dP(θ) be its kernel mean.

Kernel herding is a deterministic sampling method that

25 / 44

slide-80
SLIDE 80

Kernel Herding [Chen et al., 2010]

Let
– P be a known probability distribution on Θ, and
– µP = ∫ kΘ(·, θ) dP(θ) be its kernel mean.

Kernel herding is a deterministic sampling method that
– sequentially generates sample points θ′1, . . . , θ′n from P as

θ′1 := arg max_{θ∈Θ} µP(θ),

25 / 44

slide-81
SLIDE 81

Kernel Herding [Chen et al., 2010]

Let
– P be a known probability distribution on Θ, and
– µP = ∫ kΘ(·, θ) dP(θ) be its kernel mean.

Kernel herding is a deterministic sampling method that
– sequentially generates sample points θ′1, . . . , θ′n from P as

θ′1 := arg max_{θ∈Θ} µP(θ),

θ′T := arg max_{θ∈Θ} [ µP(θ) − (1/T) Σ_{ℓ=1}^{T−1} kΘ(θ, θ′ℓ) ]   (T = 2, . . . , n),

where the first term is mode seeking and the second term acts as a repulsive force.

25 / 44

slide-82
SLIDE 82

Kernel Herding [Chen et al., 2010]

Let – P be a known probability distribution on Θ; and – µP =

  • kΘ(·, θ)dP(θ) be its kernel mean.

Kernel herding is a deterministic sampling method that – sequentially generates sample points θ′

1, . . . , θ′ n from P as

θ′

1

:= arg max

θ∈Θ µP(θ),

θ′

T

:= arg max

θ∈Θ

µP(θ)

mode seeking

− 1 T

T−1

  • ℓ=1

kΘ(θ, θ′

ℓ)

  • repulsive force

(T = 2, . . . , n). – is equivalent to greedily approximating the kernel mean µP: θ′

T = arg min θ∈Θ

  • µP − 1

T

  • kΘ(·, θ) +

T−1

  • i=1

kΘ(·, θ′

i)

, if kΘ is shift-invariant. (HΘ is the RKHS of kΘ.)

25 / 44
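A minimal sketch of kernel herding from a weighted kernel-mean estimate; for simplicity the arg max is taken over a finite candidate grid, which is an assumption of this sketch rather than part of the method as presented.

```python
# Sketch of kernel herding from a weighted kernel-mean estimate
# mu(theta) = sum_i w_i k(theta, theta_i), with the arg max restricted to a finite
# candidate grid (a simplifying assumption of this sketch).
import numpy as np

def k_gauss(a, b, gamma=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / gamma ** 2)

def kernel_herding(theta_samples, weights, candidates, n_points, gamma=1.0):
    """Greedily select n_points herding samples from `candidates`."""
    mu = k_gauss(candidates, theta_samples, gamma) @ weights   # kernel mean estimate on the grid
    selected = []
    for T in range(1, n_points + 1):
        if selected:
            repulsion = k_gauss(candidates, np.array(selected), gamma).sum(axis=1) / T
        else:
            repulsion = 0.0
        obj = mu - repulsion                                    # mode seeking minus repulsive force
        selected.append(candidates[np.argmax(obj)])
    return np.array(selected)

rng = np.random.default_rng(0)
theta_samples = rng.normal(0.0, 1.0, size=200)   # support points of the kernel mean estimate
weights = np.full(200, 1.0 / 200)                # uniform weights (e.g. from i.i.d. samples)
grid = np.linspace(-4, 4, 801)                   # candidate set for the arg max
print(kernel_herding(theta_samples, weights, grid, n_points=5))
```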

slide-83
SLIDE 83

Kernel Herding [Chen et al., 2010]

Red squares: sample points generated by kernel herding. Purple circles: randomly generated i.i.d. sample points.

Figure 3: [Chen et al., 2010, Fig 1]

26 / 44

slide-84
SLIDE 84

Proposed Method: Kernel Recursive ABC (Algorithm)

For N = 1, 2, . . . , Niter, iterate the following procedure:

27 / 44

slide-85
SLIDE 85

Proposed Method: Kernel Recursive ABC (Algorithm)

For N = 1, 2, . . . , Niter, iterate the following procedure:

  • 1. Kernel ABC: If N = 1: generate θ1, . . . , θn ∼ π(θ), i.i.d.

27 / 44

slide-86
SLIDE 86

Proposed Method: Kernel Recursive ABC (Algorithm)

For N = 1, 2, . . . , Niter, iterate the following procedure:

  • 1. Kernel ABC: If N = 1: generate θ1, . . . , θn ∼ π(θ), i.i.d.

– Generate pseudo-data from each θi: yi ∼ p(y|θi) (i = 1, . . . , n),

27 / 44

slide-87
SLIDE 87

Proposed Method: Kernel Recursive ABC (Algorithm)

For N = 1, 2, . . . , Niter, iterate the following procedure:

  • 1. Kernel ABC: If N = 1: generate θ1, . . . , θn ∼ π(θ), i.i.d.

– Generate pseudo-data from each θi: yi ∼ p(y|θi) (i = 1, . . . , n).
– Compute weights for θ1, . . . , θn:
kY(y∗) = (kY(y∗, y1), . . . , kY(y∗, yn))⊤,
(w1(y∗), . . . , wn(y∗))⊤ = (GY + nλIn)⁻¹ kY(y∗).

27 / 44

slide-88
SLIDE 88

Proposed Method: Kernel Recursive ABC (Algorithm)

For N = 1, 2, . . . , Niter, iterate the following procedure:

  • 1. Kernel ABC: If N = 1: generate θ1, . . . , θn ∼ π(θ), i.i.d.

– Generate pseudo-data from each θi: yi ∼ p(y|θi) (i = 1, . . . , n).
– Compute weights for θ1, . . . , θn:
kY(y∗) = (kY(y∗, y1), . . . , kY(y∗, yn))⊤,
(w1(y∗), . . . , wn(y∗))⊤ = (GY + nλIn)⁻¹ kY(y∗).

  • 2. Kernel Herding: Sample from µ̂PN := Σ_{i=1}^n wi(y∗) kΘ(·, θi):

θ′T := arg max_{θ∈Θ} [ µ̂PN(θ) − (1/T) Σ_{ℓ=1}^{T−1} k(θ, θ′ℓ) ]   (T = 1, . . . , n).

27 / 44

slide-89
SLIDE 89

Proposed Method: Kernel Recursive ABC (Algorithm)

For N = 1, 2, . . . , Niter, iterate the following procedure:

  • 1. Kernel ABC: If N = 1: generate θ1, . . . , θn ∼ π(θ), i.i.d.

– Generate pseudo-data from each θi: yi ∼ p(y|θi) (i = 1, . . . , n).
– Compute weights for θ1, . . . , θn:
kY(y∗) = (kY(y∗, y1), . . . , kY(y∗, yn))⊤,
(w1(y∗), . . . , wn(y∗))⊤ = (GY + nλIn)⁻¹ kY(y∗).

  • 2. Kernel Herding: Sample from µ̂PN := Σ_{i=1}^n wi(y∗) kΘ(·, θi):

θ′T := arg max_{θ∈Θ} [ µ̂PN(θ) − (1/T) Σ_{ℓ=1}^{T−1} k(θ, θ′ℓ) ]   (T = 1, . . . , n).

Set N ← N + 1 and (θ1, . . . , θn) ← (θ′1, . . . , θ′n). (A minimal end-to-end code sketch is given below.)

27 / 44
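Putting the two steps together, here is an end-to-end sketch of kernel recursive ABC on the 1-D Gaussian example with a misspecified prior used later in the talk. The Gaussian kernels with median-heuristic bandwidths, the candidate grid, and the mean summary statistic are illustrative assumptions (the paper uses, among other choices, an energy-distance-based data kernel).

```python
# End-to-end sketch of kernel recursive ABC on a 1-D Gaussian mean model with a
# severely misspecified prior. All bandwidths, the grid, and the summary statistic
# are illustrative assumptions, not the exact configuration of the paper.
import numpy as np

rng = np.random.default_rng(0)

def k_gauss(a, b, gamma):
    return np.exp(-(np.asarray(a, float)[:, None] - np.asarray(b, float)[None, :]) ** 2 / gamma ** 2)

def median_heuristic(x):
    d = np.abs(np.asarray(x, float)[:, None] - np.asarray(x, float)[None, :])
    d = d[d > 0]
    return np.median(d) if d.size else 1.0

def simulate(theta, m=100):
    """Toy simulator p(y | theta): the sample mean of m draws from N(theta, 1)."""
    return rng.normal(theta, 1.0, size=m).mean()

def kernel_recursive_abc(y_star, prior_sample, n=200, n_iter=8, lam=0.01):
    grid = np.linspace(-100.0, 3100.0, 6401)               # candidate set for the arg max
    thetas = prior_sample(n)                                # initial particles theta_i ~ pi(theta)
    for _ in range(n_iter):
        ys = np.array([simulate(t) for t in thetas])        # pseudo-data for each particle
        gam_y = median_heuristic(np.append(ys, y_star))     # data-kernel bandwidth
        gam_t = median_heuristic(thetas)                    # parameter-kernel bandwidth
        G = k_gauss(ys, ys, gam_y)                          # kernel ABC weights
        k_star = k_gauss(ys, [y_star], gam_y)[:, 0]
        w = np.linalg.solve(G + n * lam * np.eye(n), k_star)
        mu = k_gauss(grid, thetas, gam_t) @ w               # kernel mean estimate on the grid
        rep_sum = np.zeros_like(grid)                       # kernel herding step
        new_thetas = []
        for T in range(1, n + 1):
            theta_new = grid[np.argmax(mu - rep_sum / T)]
            new_thetas.append(theta_new)
            rep_sum += np.exp(-(grid - theta_new) ** 2 / gam_t ** 2)
        thetas = np.array(new_thetas)
    return thetas[0]                                        # first herding point of the last iteration

y_star = simulate(0.0)                                      # observed summary (true theta* = 0)
prior = lambda size: rng.uniform(2000.0, 3000.0, size=size) # misspecified prior: support excludes 0
print(kernel_recursive_abc(y_star, prior))                  # should drift from [2000, 3000] toward 0
```

When the prior is badly misspecified, the first pass yields near-zero weights, so the herding step spreads the particles over the grid (the auto-correction mechanism discussed next); later passes then concentrate around the true parameter.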

slide-90
SLIDE 90

Why Kernels?

– The combination of Kernel ABC and Kernel Herding leads to robustness against misspecification of the prior π(θ).

28 / 44

slide-91
SLIDE 91

Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions

29 / 44

slide-92
SLIDE 92

Prior Misspecification

Assume that

◮ there is a “true” parameter θ∗ such that

y ∗ ∼ p(y|θ∗)

◮ but you don’t know much about θ∗.

30 / 44

slide-93
SLIDE 93

Prior Misspecification

Assume that

◮ there is a “true” parameter θ∗ such that

y ∗ ∼ p(y|θ∗)

◮ but you don’t know much about θ∗.

In such a case, you may misspecify the prior π(θ).

30 / 44

slide-94
SLIDE 94

Prior Misspecification

Assume that

◮ there is a “true” parameter θ∗ such that

y ∗ ∼ p(y|θ∗)

◮ but you don’t know much about θ∗.

In such a case, you may misspecify the prior π(θ).

◮ e.g., the support of π(θ) may not contain θ∗.

30 / 44

slide-95
SLIDE 95

Prior Misspecification

Assume that

◮ there is a “true” parameter θ∗ such that

y ∗ ∼ p(y|θ∗)

◮ but you don’t know much about θ∗.

In such a case, you may misspecify the prior π(θ).

◮ e.g., the support of π(θ) may not contain θ∗.

As a result, simulated data yi ∼ p(y|θi), θi ∼ π(θ) (i = 1, . . . , n), may become far apart from the observed data y∗.

30 / 44

slide-96
SLIDE 96

Prior Misspecification

[Figure: the prior π(θ) is misspecified, so its support in the parameter space Θ excludes the true parameter θ∗; the simulated data y1, . . . , y4 are therefore dissimilar to the observed data y∗ in the data space.]

31 / 44

slide-97
SLIDE 97

Auto-Correction Mechanism: The Kernel ABC Step

[Figure: simulated data y1, . . . , y4 are all dissimilar to the observed data y∗ in the data space Y.]

– Recall kY(y∗, yi) quantifies the similarity between y∗ and yi.

32 / 44

slide-98
SLIDE 98

Auto-Correction Mechanism: The Kernel ABC Step

[Figure: simulated data y1, . . . , y4 are all dissimilar to the observed data y∗ in the data space Y.]

– Recall kY(y∗, yi) quantifies the similarity between y∗ and yi,
e.g. a Gaussian kernel: kY(y∗, yi) = exp(−dist²(y∗, yi)/γ²).

32 / 44

slide-99
SLIDE 99

Auto-Correction Mechanism: The Kernel ABC Step

[Figure: simulated data y1, . . . , y4 are all dissimilar to the observed data y∗ in the data space Y.]

– Recall kY(y∗, yi) quantifies the similarity between y∗ and yi,
e.g. a Gaussian kernel: kY(y∗, yi) = exp(−dist²(y∗, yi)/γ²).
– Therefore, if y∗ and each yi are dissimilar, we have
kY(y∗) = (kY(y∗, y1), . . . , kY(y∗, yn))⊤ ≈ 0.

32 / 44

slide-100
SLIDE 100

Auto-Correction Mechanism: The Kernel ABC Step

[Figure: simulated data y1, . . . , y4 are all dissimilar to the observed data y∗ in the data space Y.]

– Recall kY(y∗, yi) quantifies the similarity between y∗ and yi,
e.g. a Gaussian kernel: kY(y∗, yi) = exp(−dist²(y∗, yi)/γ²).
– Therefore, if y∗ and each yi are dissimilar, we have
kY(y∗) = (kY(y∗, y1), . . . , kY(y∗, yn))⊤ ≈ 0.
– As a result, the kernel ABC weights become
(w1(y∗), . . . , wn(y∗))⊤ = (GY + nλIn)⁻¹ kY(y∗) ≈ 0.

32 / 44

slide-101
SLIDE 101

Auto-Correction Mechanism: The Kernel Herding Step

Deterministically sample θ′1, . . . , θ′n:

θ′1 := arg max_{θ∈Θ} Σ_{i=1}^n wi(y∗) k(θ, θi).

33 / 44

slide-102
SLIDE 102

Auto-Correction Mechanism: The Kernel Herding Step

Deterministically sample θ′1, . . . , θ′n:

θ′1 := arg max_{θ∈Θ} Σ_{i=1}^n wi(y∗) k(θ, θi).

For T = 2, . . . , n,

θ′T := arg max_{θ∈Θ} [ Σ_{i=1}^n wi(y∗) k(θ, θi) − (1/T) Σ_{ℓ=1}^{T−1} k(θ, θ′ℓ) ],

where wi(y∗) ≈ 0.

33 / 44

slide-103
SLIDE 103

Auto-Correction Mechanism: The Kernel Herding Step

Deterministically sample θ′1, . . . , θ′n:

θ′1 := arg max_{θ∈Θ} Σ_{i=1}^n wi(y∗) k(θ, θi).

For T = 2, . . . , n,

θ′T := arg max_{θ∈Θ} [ Σ_{i=1}^n wi(y∗) k(θ, θi) − (1/T) Σ_{ℓ=1}^{T−1} k(θ, θ′ℓ) ]   (where wi(y∗) ≈ 0)

   ≈ arg min_{θ∈Θ} Σ_{ℓ=1}^{T−1} k(θ, θ′ℓ).

33 / 44

slide-104
SLIDE 104

Auto-Correction Mechanism: The Kernel Herding Step

Deterministically sample θ′1, . . . , θ′n:

θ′1 := arg max_{θ∈Θ} Σ_{i=1}^n wi(y∗) k(θ, θi).

For T = 2, . . . , n,

θ′T := arg max_{θ∈Θ} [ Σ_{i=1}^n wi(y∗) k(θ, θi) − (1/T) Σ_{ℓ=1}^{T−1} k(θ, θ′ℓ) ]   (where wi(y∗) ≈ 0)

   ≈ arg min_{θ∈Θ} Σ_{ℓ=1}^{T−1} k(θ, θ′ℓ).

Therefore, θ′T is chosen to be distant from θ′1, . . . , θ′T−1.

33 / 44

slide-105
SLIDE 105

Auto-Correction Mechanism: A Numerical Illustration

– Parameter space: Θ = R.

34 / 44

slide-106
SLIDE 106

Auto-Correction Mechanism: A Numerical Illustration

– Parameter space: Θ = R.
– Observed data y∗ = {y1, . . . , y100} ⊂ R, where y1, . . . , y100 ∼ N(θ∗, 40) (normal distribution), i.i.d., with θ∗ = 0 unknown and to be estimated.

34 / 44

slide-107
SLIDE 107

Auto-Correction Mechanism: A Numerical Illustration

– Parameter space: Θ = R.
– Observed data y∗ = {y1, . . . , y100} ⊂ R, where y1, . . . , y100 ∼ N(θ∗, 40) (normal distribution), i.i.d., with θ∗ = 0 unknown and to be estimated.
– Assume your prior about θ∗ is severely misspecified: let π(θ) = uniform[2000, 3000]. (The support of π(θ) does not contain θ∗.)

34 / 44

slide-108
SLIDE 108

Auto-Correction Mechanism: A Numerical Illustration

– Parameter space: Θ = R.
– Observed data y∗ = {y1, . . . , y100} ⊂ R, where y1, . . . , y100 ∼ N(θ∗, 40) (normal distribution), i.i.d., with θ∗ = 0 unknown and to be estimated.
– Assume your prior about θ∗ is severely misspecified: let π(θ) = uniform[2000, 3000]. (The support of π(θ) does not contain θ∗.)

We applied the proposed method to estimate θ∗, with

34 / 44

slide-109
SLIDE 109

Auto-Correction Mechanism: A Numerical Illustration

– Parameter space: Θ = R.
– Observed data y∗ = {y1, . . . , y100} ⊂ R, where y1, . . . , y100 ∼ N(θ∗, 40) (normal distribution), i.i.d., with θ∗ = 0 unknown and to be estimated.
– Assume your prior about θ∗ is severely misspecified: let π(θ) = uniform[2000, 3000]. (The support of π(θ) does not contain θ∗.)

We applied the proposed method to estimate θ∗, with
– kΘ and kY being Gaussian, the latter based on the energy distance [Székely and Rizzo, 2013].

34 / 44

slide-110
SLIDE 110

Auto-Correction Mechanism: A Numerical Illustration

– Parameter space: Θ = R.
– Observed data y∗ = {y1, . . . , y100} ⊂ R, where y1, . . . , y100 ∼ N(θ∗, 40) (normal distribution), i.i.d., with θ∗ = 0 unknown and to be estimated.
– Assume your prior about θ∗ is severely misspecified: let π(θ) = uniform[2000, 3000]. (The support of π(θ) does not contain θ∗.)

We applied the proposed method to estimate θ∗, with
– kΘ and kY being Gaussian, the latter based on the energy distance [Székely and Rizzo, 2013].

In each iteration, 300 (θi, yi) pairs are simulated.

34 / 44

slide-111
SLIDE 111

Auto-Correction Mechanism: A Numerical Illustration

Initial sampling from the prior π(θ) = uniform[2000, 3000]: (Recall θ∗ = 0.)

35 / 44

slide-112
SLIDE 112

Auto-Correction Mechanism: A Numerical Illustration

Initial sampling from the prior π(θ) = uniform[2000, 3000]. (Recall θ∗ = 0.)

After 1 recursion of Kernel ABC + Kernel Herding:

35 / 44

slide-113
SLIDE 113

Auto-Correction Mechanism: A Numerical Illustration

After 2 recursions of Kernel ABC + Kernel Herding:

36 / 44

slide-114
SLIDE 114

Auto-Correction Mechanism: A Numerical Illustration

After 2 recursions of Kernel ABC + Kernel Herding:

After 3 recursions of Kernel ABC + Kernel Herding:

36 / 44

slide-115
SLIDE 115

Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions

37 / 44

slide-116
SLIDE 116

Summary

The proposed method outperformed competitors in most cases ... Please look at the paper for details!!

38 / 44

slide-117
SLIDE 117

Point Estimation with a Misspecified Prior Distribution

The task: Estimate the mean vector of a 20-dim Gaussian distribution using a severely misspecified prior.

39 / 44

slide-118
SLIDE 118

Point Estimation with a Misspecified Prior Distribution

The task: Estimate the mean vector of a 20-dim Gaussian distribution using a severely misspecified prior. – The true mean vector: µ ∈ [0, 2000]20 ⊂ R20.

39 / 44

slide-119
SLIDE 119

Point Estimation with a Misspecified Prior Distribution

The task: Estimate the mean vector of a 20-dim Gaussian distribution using a severely misspecified prior. – The true mean vector: µ ∈ [0, 2000]20 ⊂ R20. – The prior: the uniform distribution on [9 × 106, 107]20 ⊂ R20.

39 / 44

slide-120
SLIDE 120

Point Estimation with a Misspecified Prior Distribution

The task: Estimate the mean vector of a 20-dim Gaussian distribution using a severely misspecified prior. – The true mean vector: µ ∈ [0, 2000]20 ⊂ R20. – The prior: the uniform distribution on [9 × 106, 107]20 ⊂ R20.

Algorithm              | parameter error | data error     | cputime
KR-ABC                 | 0.70 (0.29)     | 0.008 (0.004)  | 866.02 (26.12)
KR-ABC (less samples)  | 7.22 (3.28)     | 0.02 (0.24)    | 353.498 (23.05)
K2-ABC                 | >1e+6 (>1e+3)   | >1e+5 (>1e+3)  | 209.51 (11.49)
K-ABC                  | >1e+6 (>1e+3)   | >1e+5 (>1e+3)  | 403.93 (24.97)
SMC-ABC (mean)         | >1e+6 (>1e+3)   | >1e+5 (>1e+3)  | 590.41 (29.54)
SMC-ABC (MAP)          | >1e+6 (>1e+3)   | >1e+5 (>1e+3)  | 590.41 (29.54)
ABC-DC                 | >1e+6 (>1e+3)   | >1e+5 (>1e+3)  | 313.99 (16.85)
BO                     | >1e+5 (>1e+4)   | >1e+5 (>1e+4)  | 25940.86 (936.40)
MSM                    | >1e+5 (>1e+4)   | >1e+5 (>1e+4)  | 307.42 (67.94)

39 / 44

slide-121
SLIDE 121

Parameter Estimation of a Pedestrian Flow Simulator [Yamashita et al., 2010]

Estimate certain parameters characterising groups of pedestrians.

Figure 4: Points representing individual pedestrians. (red = slow)

40 / 44

slide-122
SLIDE 122

Parameter Estimation of a Pedestrian Flow Simulator

Algorithm              | θ(N) error       | θ(T) error       | data error     | cputime
KR-ABC                 | 61.58 (74.42)    | 70.93 (102.08)   | 0.008 (0.009)  | 2233.45 (97.54)
KR-ABC (less samples)  | 82.46 (75.05)    | 134.00 (161.85)  | 0.014 (0.014)  | 1875.32 (147.16)
K2-ABC                 | 298.94 (120.71)  | 308.95 (109.43)  | 0.10 (0.10)    | 1547.32 (56.31)
K-ABC                  | 354.72 (145.76)  | 389.52 (140.91)  | 0.12 (0.09)    | 1773.74 (84.91)
SMC-ABC (mean)         | 271.51 (104.64)  | 363.12 (91.28)   | 0.09 (0.07)    | 2017.89 (110.02)
SMC-ABC (MAP)          | 255.15 (139.33)  | 348.43 (104.74)  | 0.09 (0.1)     | 2017.89 (110.02)
ABC-DC                 | 273.93 (136.14)  | 327.48 (98.12)   | 0.09 (0.14)    | 1984.43 (59.12)
BO                     | 194.57 (65.83)   | 291.73 (105.33)  | 0.04 (0.06)    | 37541.23 (3047.46)
MSM                    | 453.58 (89.43)   | 510.04 (55.10)   | 0.24 (0.17)    | 1869.83 (49.51)

41 / 44

slide-123
SLIDE 123

Outline

Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions

42 / 44

slide-124
SLIDE 124

Conclusions

We proposed Kernel Recursive ABC, a method for point estimation of simulator-based statistical models that is robust to misspecification of a prior distribution.

43 / 44

slide-125
SLIDE 125

Conclusions

We proposed Kernel Recursive ABC, a method for point estimation of simulator-based statistical models that is robust to misspecification of a prior distribution.

Extension to model selection: Model Selection for Simulator-based Statistical Models: A Kernel Approach (arXiv, 2019)

  • T. Kajihara, M. Kanagawa, Y. Nakaguchi, K. Khandelwal, and K. Fukumizu.

43 / 44

slide-126
SLIDE 126

Conclusions

We proposed Kernel Recursive ABC, a method for point estimation of simulator-based statistical models that is robust to misspecification of a prior distribution.

Extension to model selection: Model Selection for Simulator-based Statistical Models: A Kernel Approach (arXiv, 2019)

  • T. Kajihara, M. Kanagawa, Y. Nakaguchi, K. Khandelwal, and K. Fukumizu.

– Perform model selection by mixture modelling, using Kernel Recursive ABC.

43 / 44

slide-127
SLIDE 127

Conclusions

We proposed Kernel Recursive ABC, a method for point estimation of simulator-based statistical models that is robust to misspecification of a prior distribution.

Extension to model selection: Model Selection for Simulator-based Statistical Models: A Kernel Approach (arXiv, 2019)

  • T. Kajihara, M. Kanagawa, Y. Nakaguchi, K. Khandelwal, and K. Fukumizu.

– Perform model selection by mixture modelling, using Kernel Recursive ABC.

Future Work:
– Statistical convergence analysis.
– Scalability to large-scale problems.

43 / 44

slide-128
SLIDE 128

Collaborators

◮ Takafumi Kajihara (NEC/AIST/RIKEN)
◮ Keisuke Yamazaki (AIST)
◮ Kenji Fukumizu (ISM)
◮ Yuuki Nakaguchi (NEC)
◮ Kanishk Khandelwal (NEC)

44 / 44

slide-129
SLIDE 129

Chen, Y., Welling, M., and Smola, A. (2010). Super-samples from kernel herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), pages 109–116.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489–496.

Gutmann, M. U. and Corander, J. (2016). Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research, 17(125):1–47.

Lele, S. R., Nadeem, K., and Schmuland, B. (2010). Estimability and likelihood inference for generalized linear mixed models using data cloning. Journal of the American Statistical Association, 105(492):1617–1625.

44 / 44

slide-130
SLIDE 130

Muandet, K., Fukumizu, K., Sriperumbudur, B. K., and Schölkopf, B. (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1–2):1–141.

Nakagome, S., Fukumizu, K., and Mano, S. (2013). Kernel approximate Bayesian computation in population genetic inferences. Statistical Applications in Genetics and Molecular Biology, 12(6):667–678.

Saito, T. (2019). Tsunami Generation and Propagation. Springer.

Sisson, S. A., Fan, Y., and Beaumont, M. (2018). Handbook of Approximate Bayesian Computation. Chapman and Hall/CRC.

Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007). A Hilbert space embedding for distributions.

44 / 44

slide-131
SLIDE 131

In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561.

Székely, G. J. and Rizzo, M. L. (2013). Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143:1249–1272.

Yamashita, T., Soeda, S., and Noda, I. (2010). Assistance of evacuation planning with high-speed network model-based pedestrian simulator. In Proceedings of the Fifth International Conference on Pedestrian and Evacuation Dynamics (PED 2010), page 58.

44 / 44