
Kernel Recursive ABC: Point Estimation with Intractable Likelihood
Motonobu Kanagawa, EURECOM, Sophia Antipolis, France (previously U. Tübingen)
ISM-UUlm Workshop, October 2019. 1 / 44


1. Contributions
We propose a kernel-based method for point estimation of simulation-based statistical models. The proposed approach (termed kernel recursive ABC)
◮ is based on kernel mean embeddings,
◮ is a combination of kernel ABC and kernel herding, and
◮ recursively applies Bayes' rule to the same observed data.
It should be useful when point estimation is more desirable than a fully Bayesian approach, for instance:
◮ when your prior distribution π(θ) is not fully reliable,
◮ when a single simulation is computationally very expensive, and
◮ when your goal is prediction based on simulations.
12 / 44

2. Outline
Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions
13 / 44


6. Kernels and Reproducing Kernel Hilbert Spaces (RKHS)
Let k : X × X → R be a symmetric function on a set X.
The function k(x, x′) is called a positive definite kernel if
∑_{i=1}^n ∑_{j=1}^n c_i c_j k(x_i, x_j) ≥ 0
holds for all n ∈ N, c_1, ..., c_n ∈ R, and x_1, ..., x_n ∈ X.
Examples of positive definite kernels on X = R^d:
Gaussian: k(x, x′) = exp(−‖x − x′‖² / γ²).
Laplace (Matérn): k(x, x′) = exp(−‖x − x′‖ / γ).
Linear: k(x, x′) = ⟨x, x′⟩.
Polynomial: k(x, x′) = (⟨x, x′⟩ + c)^m.
In this talk, I will simply call k a kernel.
14 / 44
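As a quick numerical illustration of the positive definiteness condition (not from the slides; a minimal sketch with an assumed bandwidth γ = 1), the Gram matrix of a Gaussian kernel evaluated on an arbitrary point set should have no negative eigenvalues:

```python
import numpy as np

def gaussian_kernel(X, Xp, gamma=1.0):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / gamma^2) between two point sets (rows)."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Xp**2, 1)[None, :] - 2 * X @ Xp.T
    return np.exp(-sq_dists / gamma**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))        # 50 arbitrary points in R^2
G = gaussian_kernel(X, X)           # Gram (kernel) matrix

# Positive definiteness means every quadratic form c^T G c is >= 0,
# i.e. G has no negative eigenvalues (up to floating-point rounding).
eigvals = np.linalg.eigvalsh(G)
print("smallest eigenvalue:", eigvals.min())
```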


12. Kernels and Reproducing Kernel Hilbert Spaces (RKHS)
For any kernel k, there is a uniquely associated Hilbert space H consisting of functions on X such that
(i) k(·, x) ∈ H for all x ∈ X, where k(·, x) is the function of the first argument with x fixed: x′ ∈ X ↦ k(x′, x);
(ii) f(x) = ⟨f, k(·, x)⟩_H for all f ∈ H and x ∈ X, which is called the reproducing property.
– H is called the RKHS of k.
– H can be written as the closure of span{ k(·, x) | x ∈ X }.
15 / 44


18. Kernel Mean Embeddings [Smola et al., 2007]
A framework for representing distributions in an RKHS.
– Let P be the set of all probability distributions on X.
– Let k be a kernel on X, and H be its RKHS.
For each distribution P ∈ P, define the kernel mean
μ_P := ∫ k(·, x) dP(x) ∈ H,
which is a representation of P in H.
A key concept: characteristic kernels [Fukumizu et al., 2008].
– The kernel k is called characteristic if, for any P, Q ∈ P, μ_P = μ_Q if and only if P = Q.
– In other words, k is characteristic if the mapping P ∈ P ↦ μ_P ∈ H is injective.
16 / 44
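In practice the kernel mean is approximated from a sample x_1, ..., x_n ∼ P by the empirical average μ̂_P(·) = (1/n) ∑_i k(·, x_i). The sketch below (my own illustration, with an assumed Gaussian kernel and bandwidth) evaluates this empirical kernel mean at a few query points:

```python
import numpy as np

def gaussian_kernel(t, x, gamma=1.0):
    """1-D Gaussian kernel matrix k(t_a, x_b) = exp(-(t_a - x_b)^2 / gamma^2)."""
    return np.exp(-(np.asarray(t, float)[:, None] - np.asarray(x, float)[None, :])**2 / gamma**2)

rng = np.random.default_rng(1)
sample = rng.normal(loc=1.0, scale=0.5, size=2000)   # x_i ~ P = N(1, 0.25)

def mu_P_hat(t):
    """Empirical kernel mean (1/n) * sum_i k(t, x_i), approximating mu_P(t)."""
    return gaussian_kernel(t, sample).mean(axis=1)

# The embedding is a function on X; evaluating it shows it peaks where P has most mass.
print(mu_P_hat([0.0, 1.0, 2.0]))
```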


21. Kernel Mean Embeddings [Smola et al., 2007]
Intuitively, k being characteristic implies that H is large enough.
[Figure 2: injective embedding of distributions into the RKHS; Muandet et al., 2017, Figure 2.3.]
Examples of characteristic kernels on X = R^d: Gaussian and Matérn kernels [Sriperumbudur et al., 2010].
Examples of non-characteristic kernels on X = R^d: linear and polynomial kernels.
17 / 44
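To see why the linear kernel is not characteristic, note that its kernel mean μ_P(·) = ⟨·, E_P[x]⟩ depends on P only through its mean, so two distributions with equal means share the same embedding. The sketch below (my own illustration, not from the slides) compares the sample-based estimate of ‖μ_P − μ_Q‖² under a linear and a Gaussian kernel, for two zero-mean Gaussians with different variances:

```python
import numpy as np

def mmd2(X, Y, k):
    """Estimate of ||mu_P - mu_Q||^2 in the RKHS of kernel k, from samples X ~ P and Y ~ Q."""
    return k(X, X).mean() - 2 * k(X, Y).mean() + k(Y, Y).mean()

linear = lambda X, Y: X @ Y.T
def gauss(X, Y, gamma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / gamma**2)

rng = np.random.default_rng(2)
P = rng.normal(0.0, 1.0, size=(3000, 1))   # N(0, 1)
Q = rng.normal(0.0, 2.0, size=(3000, 1))   # N(0, 4): same mean, different distribution

print("linear kernel:  ", mmd2(P, Q, linear))   # ~0: the two embeddings coincide
print("Gaussian kernel:", mmd2(P, Q, gauss))    # clearly > 0: P and Q are distinguished
```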

22. Outline
Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions
18 / 44


31. Recursive Bayes Updates and Power Posteriors
Given observed data y*, Bayes' rule yields a posterior distribution:
p(θ | y*) ∝ p(y* | θ) π(θ)   (Posterior ∝ Likelihood × Prior).
Recursive Bayes updates: apply Bayes' rule recursively to the same observed data y*.
1st recursion: π_1(θ) := p_1(θ | y*) ∝ p(y* | θ) π(θ).
2nd recursion: π_2(θ) := p_2(θ | y*) ∝ p(y* | θ) π_1(θ) ∝ p(y* | θ)² π(θ).
3rd recursion: π_3(θ) := p_3(θ | y*) ∝ p(y* | θ) π_2(θ) ∝ p(y* | θ)³ π(θ).
· · ·
N-th recursion: π_N(θ) := p_N(θ | y*) ∝ p(y* | θ) π_{N−1}(θ) ∝ p(y* | θ)^N π(θ).
19 / 44


35. Power Posteriors and Maximum Likelihood Estimation
N recursive Bayes updates yield the power posterior
p_N(θ | y*) ∝ p(y* | θ)^N π(θ).
Theorem [Lele et al., 2010]. Assume that p(y* | θ) has a unique global maximizer θ* := arg max_{θ ∈ Θ} p(y* | θ). Then, if π(θ*) > 0, under mild conditions on π(θ) and p(y | θ),
p_N(θ | y*) → δ_{θ*} (the Dirac distribution at θ*) as N → ∞ (weak convergence).
This implies that recursive Bayes updates provide a way of performing maximum likelihood estimation.
20 / 44
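The concentration can be checked directly on a toy model with a tractable likelihood (a sketch of mine, not from the talk): for i.i.d. Gaussian observations, the power posterior p_N(θ | y*) ∝ p(y* | θ)^N π(θ) evaluated on a grid collapses onto the sample mean, i.e. the MLE, as N grows.

```python
import numpy as np

# Toy model: y_j ~ N(theta, 1) i.i.d., prior theta ~ N(0, 5^2); the MLE is the sample mean.
y_star = np.array([2.1, 1.8, 2.4, 2.0, 1.9])             # observed data (sample mean = 2.04)
theta_grid = np.linspace(-2.0, 4.0, 4001)

log_lik = -0.5 * ((y_star[:, None] - theta_grid[None, :])**2).sum(axis=0)
log_prior = -0.5 * theta_grid**2 / 5.0**2

for N in (1, 5, 50):
    log_post = N * log_lik + log_prior                    # power posterior, up to a constant
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                                    # normalise on the grid
    mean = (theta_grid * post).sum()
    sd = np.sqrt(((theta_grid - mean)**2 * post).sum())
    print(f"N = {N:2d}: mean = {mean:.3f}, sd = {sd:.3f}")   # mean -> 2.04, sd -> 0
```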


42. Proposed Method: Kernel Recursive ABC (Sketch)
– Recursive application of 1. Bayes' rule and 2. sampling.
For N = 1, 2, ..., N_iter, iterate the following procedures:
1. Kernel ABC: If N = 1, generate θ_1, ..., θ_n ∼ π(θ), i.i.d.
– Simulate pseudo-data for each θ_i: y_i ∼ p(y | θ_i) (i = 1, ..., n).
– Estimate the kernel mean of the power posterior using (θ_i, y_i)_{i=1}^n:
μ_{P_N} := ∫ k_Θ(·, θ) p_N(θ | y*) dθ,   (1)
where k_Θ is a kernel on Θ and p_N(θ | y*) ∝ p(y* | θ)^N π(θ).
2. Kernel Herding: Sample θ′_1, ..., θ′_n from the estimate of (1).
Set N ← N + 1 and (θ_1, ..., θ_n) ← (θ′_1, ..., θ′_n).
21 / 44


46. Kernel ABC [Nakagome et al., 2013]
– Define ◮ a kernel k_Y(y, y′) on the data space Y, ◮ a kernel k_Θ(θ, θ′) on the parameter space Θ, and ◮ a regularisation constant λ > 0.
1. Sampling: Generate parameter-data pairs from the model: (θ_1, y_1), ..., (θ_n, y_n) ∼ p(y | θ) π(θ), i.i.d.
2. Weight computation: Given observed data y*, compute
k_Y(y*) := (k_Y(y*, y_1), ..., k_Y(y*, y_n))^⊤ ∈ R^n,
(w_1(y*), ..., w_n(y*))^⊤ := (G_Y + nλ I_n)^{−1} k_Y(y*) ∈ R^n,
where G_Y := (k_Y(y_i, y_j)) ∈ R^{n×n} is the kernel matrix.
Output: An estimate of the posterior kernel mean:
∫ k_Θ(·, θ) p(θ | y*) dθ ≈ ∑_{i=1}^n w_i(y*) k_Θ(·, θ_i),   where p(θ | y*) ∝ p(y* | θ) π(θ).
22 / 44
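The two steps above translate almost line-by-line into code. The following is a minimal sketch (my own, with a hypothetical toy simulator, Gaussian kernels, and arbitrary choices of the bandwidth γ and regularisation λ), which also uses the weights to estimate the posterior mean of θ as ∑_i w_i(y*) θ_i:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulator(theta):
    """Hypothetical toy simulator: the summary statistic is the mean of 20 draws from N(theta, 1)."""
    return rng.normal(theta, 1.0, size=20).mean()

def gaussian_kernel(a, b, gamma):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.exp(-(a[:, None] - b[None, :])**2 / gamma**2)

# Step 1 (sampling): draw parameters from the prior and simulate pseudo-data for each.
n = 300
theta = rng.uniform(-5.0, 5.0, size=n)              # theta_i ~ pi(theta) = Uniform(-5, 5)
y = np.array([simulator(t) for t in theta])         # y_i ~ p(y | theta_i)

# Step 2 (weight computation) for observed data y*, generated here at theta* = 2.
y_star = simulator(2.0)
gamma_Y, lam = 0.5, 1e-3
G_Y = gaussian_kernel(y, y, gamma_Y)                          # kernel matrix on the data
k_y_star = gaussian_kernel(y, [y_star], gamma_Y).ravel()      # (k_Y(y*, y_1), ..., k_Y(y*, y_n))
w = np.linalg.solve(G_Y + n * lam * np.eye(n), k_y_star)      # (G_Y + n*lam*I_n)^{-1} k_Y(y*)

# The output represents the posterior kernel mean as sum_i w_i k_Theta(., theta_i);
# in particular, sum_i w_i * theta_i estimates the posterior mean of theta,
# which should land near theta* = 2 here.
print("estimated posterior mean:", float(w @ theta))
```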

47. Kernel ABC: The Sampling Step
1. Sampling: Generate parameter-data pairs from the model: (θ_1, y_1), ..., (θ_n, y_n) ∼ p(y | θ) π(θ), i.i.d.
[Figure: parameters θ_1, ..., θ_n sampled from the prior π(θ) in the parameter space Θ, each pushed through the simulator to give data y_1, ..., y_n in the data space Y, shown alongside the observed data y* and the true parameter θ*.]
23 / 44

48. Kernel ABC: The Weight Computation Step
2. Weight computation: Given observed data y*, compute
1. Similarities: k_Y(y*) = (k_Y(y*, y_1), ..., k_Y(y*, y_n))^⊤,
2. Weights: (w_1(y*), ..., w_n(y*))^⊤ = (G_Y + nλ I_n)^{−1} k_Y(y*).
[Figure: similarities between y* and each simulated y_i in the data space are converted into weights on the corresponding θ_i in the parameter space.]
∫ k_Θ(·, θ) p(θ | y*) dθ ≈ ∑_{i=1}^n w_i(y*) k_Θ(·, θ_i).
24 / 44


53. Kernel Herding [Chen et al., 2010]
Let – P be a known probability distribution on Θ, and – μ_P = ∫ k_Θ(·, θ) dP(θ) be its kernel mean.
Kernel herding is a deterministic sampling method that
– sequentially generates sample points θ′_1, ..., θ′_n from P as
θ′_1 := arg max_{θ ∈ Θ} μ_P(θ),
θ′_T := arg max_{θ ∈ Θ} [ μ_P(θ) − (1/T) ∑_{ℓ=1}^{T−1} k_Θ(θ, θ′_ℓ) ]   (T = 2, ..., n),
where the first term is mode seeking and the second is a repulsive force;
– is equivalent to greedily approximating the kernel mean μ_P:
θ′_T = arg min_{θ ∈ Θ} ‖ μ_P − (1/T) ( k_Θ(·, θ) + ∑_{i=1}^{T−1} k_Θ(·, θ′_i) ) ‖_{H_Θ},
if k_Θ is shift-invariant. (H_Θ is the RKHS of k_Θ.)
25 / 44
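Below is a minimal sketch of the herding updates (my own illustration, not from the slides), where the target kernel mean is the empirical embedding of a standard normal and the arg max is taken over a finite grid, an assumption made purely for simplicity:

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.exp(-(a[:, None] - b[None, :])**2 / gamma**2)

# Target: the (empirical) kernel mean of P = N(0, 1), evaluated on a candidate grid.
rng = np.random.default_rng(4)
p_sample = rng.normal(0.0, 1.0, size=5000)
grid = np.linspace(-4.0, 4.0, 801)                    # candidates for the arg max
mu_P = gaussian_kernel(grid, p_sample).mean(axis=1)   # mu_P(theta) on the grid

# Greedy updates: mode seeking (mu_P) minus a repulsive force from earlier points.
herd = []
for T in range(1, 21):
    repulsion = (gaussian_kernel(grid, herd).sum(axis=1) / T) if herd else 0.0
    herd.append(grid[np.argmax(mu_P - repulsion)])

print(np.round(np.sort(herd), 2))   # well-spread points, denser near the mode of P
```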

54. Kernel Herding [Chen et al., 2010]
[Figure 3, from Chen et al., 2010, Fig. 1: red squares are sample points generated by kernel herding; purple circles are randomly generated i.i.d. sample points.]
26 / 44


60. Proposed Method: Kernel Recursive ABC (Algorithm)
For N = 1, 2, ..., N_iter, iterate the following procedure:
1. Kernel ABC: If N = 1, generate θ_1, ..., θ_n ∼ π(θ), i.i.d.
– Generate pseudo-data from each θ_i: y_i ∼ p(y | θ_i) (i = 1, ..., n).
– Compute weights for θ_1, ..., θ_n:
k_Y(y*) = (k_Y(y*, y_1), ..., k_Y(y*, y_n))^⊤,
(w_1(y*), ..., w_n(y*))^⊤ = (G_Y + nλ I_n)^{−1} k_Y(y*).
2. Kernel Herding: Sample from μ̂_{P_N} := ∑_{i=1}^n w_i(y*) k_Θ(·, θ_i):
θ′_T := arg max_{θ ∈ Θ} [ μ̂_{P_N}(θ) − (1/T) ∑_{ℓ=1}^{T−1} k_Θ(θ, θ′_ℓ) ]   (T = 1, ..., n).
Set N ← N + 1 and (θ_1, ..., θ_n) ← (θ′_1, ..., θ′_n).
27 / 44
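Putting the two steps together, here is a condensed end-to-end sketch of the loop on a hypothetical 1-D toy problem (my own code, not the authors'; the simulator, Gaussian kernels, bandwidths, grid-based herding, and the choice of point estimate are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

def simulator(theta):
    """Hypothetical toy simulator: summary statistic = mean of 20 draws from N(theta, 1)."""
    return rng.normal(theta, 1.0, size=20).mean()

def gaussian_kernel(a, b, gamma):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.exp(-(a[:, None] - b[None, :])**2 / gamma**2)

# Observed data generated at the "true" parameter theta* = 2.
y_star = simulator(2.0)
n, n_iter = 200, 10
gamma_Y, gamma_T, lam = 0.5, 0.5, 1e-3
grid = np.linspace(-10.0, 10.0, 2001)            # candidate points for herding's arg max

theta = rng.uniform(-10.0, 10.0, size=n)         # N = 1: theta_1, ..., theta_n ~ pi(theta)
for N in range(1, n_iter + 1):
    # 1. Kernel ABC: simulate pseudo-data and compute the weights.
    y = np.array([simulator(t) for t in theta])
    G_Y = gaussian_kernel(y, y, gamma_Y)
    k_y_star = gaussian_kernel(y, [y_star], gamma_Y).ravel()
    w = np.linalg.solve(G_Y + n * lam * np.eye(n), k_y_star)

    # 2. Kernel herding: draw n new points from mu_hat(.) = sum_i w_i k_Theta(., theta_i).
    mu_hat = gaussian_kernel(grid, theta, gamma_T) @ w
    new_theta = []
    for T in range(1, n + 1):
        repulsion = (gaussian_kernel(grid, new_theta, gamma_T).sum(axis=1) / T
                     if new_theta else 0.0)
        new_theta.append(grid[np.argmax(mu_hat - repulsion)])
    theta = np.array(new_theta)

# The herded points concentrate around theta* as the recursion proceeds; taking the
# first herding point of the final iteration as the point estimate is one natural choice.
print("point estimate:", theta[0], " (true value: 2.0)")
```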

61. Why Kernels?
– The combination of kernel ABC and kernel herding leads to robustness against misspecification of the prior π(θ).
28 / 44

62. Outline
Background: Machine Learning for Computer Simulation
Preliminaries on Kernel Mean Embeddings
Proposed Approach: Kernel Recursive ABC
Prior Misspecification and the Auto-Correction Mechanism
Empirical Comparisons with Competing Methods
Conclusions
29 / 44


66. Prior Misspecification
Assume that ◮ there is a "true" parameter θ* such that y* ∼ p(y | θ*), ◮ but you don't know much about θ*.
In such a case, you may misspecify the prior π(θ); e.g., the support of π(θ) may not contain θ*.
As a result, the simulated data y_i ∼ p(y | θ_i), θ_i ∼ π(θ) (i = 1, ..., n) may end up far from the observed data y*.
30 / 44

67. Prior Misspecification
[Figure: the misspecified prior support in the parameter space Θ does not contain the true parameter θ*, so all simulated data y_1, ..., y_n are dissimilar to the observed data y*.]
31 / 44


71. Auto-Correction Mechanism: The Kernel ABC Step
[Figure: the observed data y* is dissimilar to all simulated data y_1, ..., y_n in the data space.]
– Recall that k_Y(y*, y_i) quantifies the similarity between y* and y_i, e.g. for a Gaussian kernel k_Y(y*, y_i) = exp(−dist²(y*, y_i) / γ²).
– Therefore, if y* and each y_i are dissimilar, we have k_Y(y*) = (k_Y(y*, y_1), ..., k_Y(y*, y_n))^⊤ ≈ 0.
– As a result, the weights from kernel ABC become (w_1(y*), ..., w_n(y*))^⊤ = (G_Y + nλ I_n)^{−1} k_Y(y*) ≈ 0.
32 / 44
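The weight collapse is easy to reproduce numerically. The sketch below (my own illustration, continuing the toy setup used earlier, with assumed bandwidth and regularisation values) uses a prior whose support excludes the parameter region compatible with y*, and shows that all kernel ABC weights become vanishingly small, which is exactly what hands control to the repulsive term in the subsequent herding step:

```python
import numpy as np

def gaussian_kernel(a, b, gamma):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.exp(-(a[:, None] - b[None, :])**2 / gamma**2)

rng = np.random.default_rng(6)

# Misspecified prior: support [-10, -5] excludes every parameter that could reproduce
# the observed data y* = 2 under this simple location model (y ~ N(theta, 0.1)).
n, gamma_Y, lam, y_star = 100, 0.5, 1e-3, 2.0
theta = rng.uniform(-10.0, -5.0, size=n)
y = theta + rng.normal(0.0, 0.1, size=n)                    # all simulated data far from y*

k_y_star = gaussian_kernel(y, [y_star], gamma_Y).ravel()    # k_Y(y*, y_i) ~ 0 for every i
w = np.linalg.solve(gaussian_kernel(y, y, gamma_Y) + n * lam * np.eye(n), k_y_star)
print("max |w_i|:", np.abs(w).max())                        # vanishingly small

# With w ~ 0 the estimated kernel mean mu_hat is ~ 0 everywhere, so the herding
# objective mu_hat(theta) - (1/T) * sum_l k_Theta(theta, theta'_l) is dominated by the
# repulsive term; the next batch of parameters therefore spreads out over Theta instead
# of staying inside the misspecified prior support. This is the auto-correction mechanism.
```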
