

SLIDE 1

Likelihood-free gravitational-wave parameter estimation with neural networks

Stephen R. Green, Albert Einstein Institute Potsdam
Based on arXiv:2002.07656, with C. Simpson and J. Gair
Gravity Seminar, University of Southampton, February 27, 2020

1

SLIDE 2

Outline

  • 1. Introduction to Bayesian inference for compact binaries
  • 2. Likelihood-free inference with neural networks
      (a) Basic approach
      (b) Normalizing flows
      (c) Variational autoencoders
  • 3. Results

2

SLIDE 3

Introduction to parameter estimation

  • Bayesian inference for compact binaries:



 Sample the posterior distribution p(θ|s) for the system parameters θ (masses, spins, sky position, etc.) given detector strain data s.

  • Once the likelihood and prior are defined, the right-hand side can be evaluated (up to normalization).

3

p(θ|s) = p(s|θ) p(θ) / p(s)

(p(s|θ): likelihood, p(θ): prior, p(s): evidence, i.e., the normalizing factor)

SLIDE 4

Introduction to parameter estimation

  • Likelihood based on the assumption that if the gravitational-wave signal were subtracted from the strain data s, then what remains must be noise n.

  • Noise is assumed to follow a stationary Gaussian distribution, i.e.,

    n ∼ p(n) ∝ exp(−½ (n|n)),

    where the noise-weighted inner product is

    (a|b) = 2 ∫ df [â(f) b̂(f)* + â(f)* b̂(f)] / S_n(f),

    with S_n(f) the detector noise power spectral density (PSD).

  • Summed over detectors I, this gives the likelihood

    p(s|θ) ∝ exp(−½ Σ_I (s_I − h_I(θ) | s_I − h_I(θ))).

4
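To make these formulas concrete, here is a minimal numpy sketch (an editor's illustration, not code from the talk) of the noise-weighted inner product and the resulting single-detector log-likelihood. It assumes one-sided frequency-domain arrays a_f, b_f, a PSD array psd, and frequency resolution delta_f, all hypothetical names:

```python
import numpy as np

def inner_product(a_f, b_f, psd, delta_f):
    """Noise-weighted inner product (a|b) for one-sided frequency-domain arrays."""
    # 2 * integral df [a b* + a* b] / S_n  =  4 * Re integral df a b* / S_n
    return 4.0 * delta_f * np.sum((a_f * np.conj(b_f)).real / psd)

def log_likelihood(strain_f, waveform_f, psd, delta_f):
    """Gaussian log-likelihood (up to an additive constant) for a single detector."""
    residual = strain_f - waveform_f
    return -0.5 * inner_product(residual, residual, psd, delta_f)
```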

SLIDE 5

Introduction to parameter estimation

  • Prior p(θ) based on beliefs about the system before looking at the data, e.g., uniform in the component masses m1, m2 over some range, uniform in spatial volume, etc.

  • With prior and likelihood defined, the posterior can be evaluated up to normalization.

  • A method such as Markov chain Monte Carlo (MCMC) is used to obtain posterior samples: move around parameter space, and compare the strain data s against the waveform model h(θ).

5

Image: Abbott et al (2016)

SLIDE 6

Need for new methods

  • Standard method expensive:
  • Many likelihood evaluations required for each independent sample
  • Likelihood evaluation slow, requires a waveform to be generated
  • Various waveform models (EOBNR, Phenom, …) created as faster alternatives to numerical relativity; reduced-order surrogate models for even faster evaluation.
  • Days to months for parameter estimation of a single event, depending on type of event and waveform model.

6

Goal of this work:

Develop deep learning methods to do parameter estimation much faster: model the posterior distribution p(θ|s) with a neural network.

SLIDE 7

Main result: very fast posterior sampling

Rest of this talk:
 How did we do this?

7

[Corner plot: one- and two-dimensional marginal posteriors over m1, m2, φ0, tc, dL, χ1z, χ2z, θJN for an example event.]

SLIDE 8

Two key ideas

  • 1. A conditional probability distribution can be described by a neural network.

  • 2. The network can be trained to model a gravitational-wave posterior distribution without ever evaluating a likelihood. Instead, it only requires samples (θ, s) from the data generating process.

8

SLIDE 9

Introduction to neural networks

  • Nonlinear functions constructed as composition of mappings:

Input layer: x ∈ ℝ^N

First hidden layer: h1 = σ1(W1 x + b1), with h1 ∈ ℝ^N1

Each layer consists of:

  • 1. A linear transformation, W1 x + b1
  • 2. A simple element-wise nonlinear mapping, e.g., the ReLU

    σ1(x) = x for x ≥ 0, and 0 for x < 0

9

SLIDE 10

Introduction to neural networks

  • Training/test data consist of (x, y) pairs.
  • Train the network by tuning the weights W and biases b to minimize a loss function L(y, y_out).
  • Stochastic gradient descent combined with the chain rule ("backpropagation") is used to adjust the weights and biases.

Full network: input layer x ∈ ℝ^N → first hidden layer h1 = σ1(W1 x + b1) → second hidden layer h2 = σ2(W2 h1 + b2) → … → final hidden layer h_p → output layer y = σ_out(W_out h_p + b_out), with y ∈ ℝ^N_out.

10
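For illustration, a minimal PyTorch sketch of such a feed-forward network together with one training step; the layer sizes, Adam optimizer, and mean-squared-error loss are arbitrary choices for this example, not the configuration used in the talk:

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),    # first hidden layer: sigma1(W1 x + b1)
    nn.Linear(64, 64), nn.ReLU(),   # second hidden layer
    nn.Linear(64, 2),               # output layer
)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()              # loss function L(y, y_out)

x = torch.randn(32, 8)              # a minibatch of inputs
y = torch.randn(32, 2)              # corresponding targets

loss = loss_fn(net(x), y)
optimizer.zero_grad()
loss.backward()                     # backpropagation (chain rule)
optimizer.step()                    # stochastic gradient step
```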

SLIDE 11

Neural networks as probability distributions

  • Since conditional probability distributions can be parametrized by functions, and neural networks are functions, conditional probability distributions can be described by neural networks.

    E.g., a multivariate normal distribution

    p(x|y) = 𝒩(μ(y), Σ(y))(x) = 1/√((2π)^n |det Σ(y)|) · exp(−½ Σ_{i,j=1}^n (x_i − μ_i(y)) Σ⁻¹_ij(y) (x_j − μ_j(y))),

    where μ(y), Σ(y) = NN(y).

  • For this example, it is trivial to draw samples and evaluate the density.
  • More complex distributions may also be described by neural networks (later in talk).

11
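A small PyTorch sketch of a conditional Gaussian whose mean and (log-)variance are network outputs. For simplicity it uses a diagonal covariance rather than the full Σ(y) on the slide; sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class ConditionalGaussian(nn.Module):
    """Diagonal-covariance Gaussian p(x|y) with mean and log-variance from a network."""
    def __init__(self, y_dim, x_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(y_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * x_dim))

    def forward(self, y):
        mu, log_var = self.net(y).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, torch.exp(0.5 * log_var))

model = ConditionalGaussian(y_dim=4, x_dim=2)
y = torch.randn(10, 4)
dist = model(y)
x = dist.sample()                  # draw samples from p(x|y)
log_p = dist.log_prob(x).sum(-1)   # evaluate the density
```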
SLIDE 12

Likelihood-free inference with neural networks

[First applied to GW by Chua and Vallisneri (2020), Gabbard et al (2019)]

  • Goal is to train the network to model the true posterior, as given by the prior and likelihood that we specify, i.e.,

    p(θ|s) → p_true(θ|s)

  • Minimize the expectation value (over s ∼ p_true(s)) of the cross-entropy between the distributions:

    L = − ∫ ds p_true(s) ∫ dθ p_true(θ|s) log p(θ|s)

    Intractable without knowing the posterior for each s!

  • Bayes' theorem: p_true(s) p_true(θ|s) = p_true(θ) p_true(s|θ), therefore

    L = − ∫ dθ p_true(θ) ∫ ds p_true(s|θ) log p(θ|s)

    Only requires samples from the likelihood, not the posterior!

12

SLIDE 13

Likelihood-free inference with neural networks

  • Loss function:

    L = − ∫ dθ p_true(θ) ∫ ds p_true(s|θ) log p(θ|s)
      ≈ − (1/N) Σ_{i=1}^N log p(θ^(i)|s^(i)),  where θ^(i) ∼ p_true(θ), s^(i) ∼ p_true(s|θ^(i))

    (Estimate on a minibatch of size N; log p(θ|s) is easy to evaluate from the neural network; parameters θ^(i) are sampled from the prior, and strain data s^(i) from the generative process, i.e., the likelihood.)

  • Choose network parameters that minimize L: compute the gradient of L with respect to the network parameters (weights and biases) and use stochastic gradient descent.

  • Never evaluate a likelihood and no need for posterior samples!

13
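A schematic training step implementing this Monte Carlo estimate of the loss. The helpers sample_prior and simulate and the model's log_prob(theta, s) method are hypothetical interfaces standing in for the prior, the data-generating process s = h(θ) + n, and a conditional density network:

```python
import torch

def training_step(model, optimizer, sample_prior, simulate, batch_size=512):
    """One SGD step on L ≈ -(1/N) sum_i log p(theta_i | s_i).

    sample_prior(N) -> theta batch from p(theta); simulate(theta) -> strain s;
    model.log_prob(theta, s) -> log p(theta|s). All three are hypothetical interfaces.
    """
    theta = sample_prior(batch_size)           # theta^(i) ~ p_true(theta)
    s = simulate(theta)                        # s^(i) ~ p_true(s | theta^(i))
    loss = -model.log_prob(theta, s).mean()    # Monte Carlo estimate of the cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```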

SLIDE 14

Gravitational-wave parameter estimation

  • Chua and Vallisneri (2019) applied this method (with a Gaussian posterior model) to gravitational waves.

  • A Gaussian may be adequate for very high signal-to-noise, but more generally distributions can have higher moments and multimodality.

14

SLIDE 15

Normalizing flows

Rezende and Mohamed (2015)

  • Our approach to make the gravitational-wave posterior more flexible: use a normalizing flow.

  • Change of variables rule for probability distributions: if π(u) is a probability distribution, and f : u ↦ x is a mapping on the sample space, then in the new coordinates the distribution is

    p(x) = π(f⁻¹(x)) |det ∂(f⁻¹_1, …, f⁻¹_n)/∂(x_1, …, x_n)|

  • A normalizing flow is an invertible mapping f with a simple Jacobian determinant.

  • If π(u) can be easily sampled and its density evaluated, and f is a normalizing flow, then the same holds for p(x).

    Typically, take π(u) to be a simple base distribution, e.g., a multivariate standard normal.

15
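A toy illustration of the change-of-variables rule: an invertible affine map acting on a standard-normal base distribution, with the density evaluated through the inverse map and its log-Jacobian. Purely illustrative; an actual flow uses a far more expressive map:

```python
import torch

base = torch.distributions.Normal(torch.zeros(2), torch.ones(2))
a = torch.tensor([0.5, -0.3])   # log-scales, so the map f(u) = exp(a)*u + b is invertible
b = torch.tensor([1.0, 2.0])    # shifts

def sample(n):
    u = base.sample((n,))
    return torch.exp(a) * u + b                       # forward map f(u)

def log_prob(x):
    u = (x - b) * torch.exp(-a)                       # inverse map f^{-1}(x)
    log_det_inv = -a.sum()                            # log |det d f^{-1} / dx|
    return base.log_prob(u).sum(-1) + log_det_inv     # pi(f^{-1}(x)) |det J^{-1}|

x = sample(5)
print(log_prob(x))
```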
SLIDE 16
Normalizing flows for gravitational waves

  • To model a gravitational-wave posterior, take x → θ, and condition the flow f on the strain data s.

    Base samples: u ∼ 𝒩(0, 1)^n

    Flow: θ = f(u, s)

    Then (hopefully) θ ∼ p(θ|s) = 𝒩(0, 1)^n(f⁻¹(θ)) |det J_f⁻¹|

16

[Figure: example strain time series s = h + n, plotted against t/s.]

SLIDE 17

Masked autoregressive flow

Papamakarios et al (2017)

  • By the product rule, an arbitrary probability distribution p(x) may be decomposed as

    p(x) = ∏_{i=1}^n p(x_i | x_{1:i−1})

  • Define an autoregressive model by restricting the form of each factor,

    p(x_i | x_{1:i−1}) = 𝒩(μ_i(x_{1:i−1}), exp(2α_i(x_{1:i−1})))

    i.e., if u ∼ 𝒩(0, 1)^n, and we set x_i = μ_i(x_{1:i−1}) + u_i exp α_i(x_{1:i−1}), then x ∼ p(x).

  • The mapping f : u ↦ x defines a normalizing flow.

17
SLIDE 18

Masked autoregressive flow

Papamakarios et al (2017)

  • f satisfies the properties of a normalizing flow:

    1. Forward map f : u ↦ x, with x_i = μ_i(x_{1:i−1}) + u_i exp α_i(x_{1:i−1})  (recursive)

    2. Inverse map f⁻¹ : x ↦ u, with u_i = [x_i − μ_i(x_{1:i−1})] exp(−α_i(x_{1:i−1}))  (nonrecursive)

    3. Simple Jacobian determinant:

    det ∂(f⁻¹_1, …, f⁻¹_n)/∂(x_1, …, x_n) = exp(− Σ_{i=1}^n α_i(x_{1:i−1}))

18
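A toy version of this transform with hand-picked (not learned) conditioners, just to show the recursive forward pass, the non-recursive inverse, and the log-Jacobian; a real MAF learns μ_i and α_i with a masked network, as on the following slides:

```python
import torch

def mu_alpha(x_prev):
    # Toy conditioners: mu_i and alpha_i as fixed functions of x_{1:i-1} (shape (batch, i)).
    s = x_prev.sum(dim=-1)
    return 0.5 * s, 0.1 * s

def forward(u):
    """x = f(u): recursive, each x_i needs the already-computed x_{1:i-1}."""
    xs = []
    for i in range(u.shape[-1]):
        x_prev = torch.stack(xs, dim=-1) if xs else u.new_zeros((u.shape[0], 0))
        mu, alpha = mu_alpha(x_prev)
        xs.append(mu + u[:, i] * torch.exp(alpha))
    return torch.stack(xs, dim=-1)

def inverse(x):
    """u = f^{-1}(x): nonrecursive in u, plus log |det J_{f^{-1}}| = -sum_i alpha_i."""
    us, log_det = [], torch.zeros(x.shape[0])
    for i in range(x.shape[-1]):
        mu, alpha = mu_alpha(x[:, :i])
        us.append((x[:, i] - mu) * torch.exp(-alpha))
        log_det = log_det - alpha
    return torch.stack(us, dim=-1), log_det

u = torch.randn(4, 3)
x = forward(u)
u_rec, log_det = inverse(x)
print(torch.allclose(u, u_rec, atol=1e-6))  # True: the map is invertible
```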

SLIDE 19

Masked autoregressive flow

Papamakarios et al (2017)

  • Can be implemented with a neural network by masking connections that violate the autoregressive property [MADE network, Germain et al (2015)].

  • Forward flow requires n passes.

19

[Diagram: MADE network with inputs x1, x2, x3 and outputs parametrizing p(x1), p(x2|x1), p(x3|x1, x2).]

SLIDE 20

Masked autoregressive flow

Papamakarios et al (2017)

  • To achieve further generality, stack several MADE blocks, permuting components in between.

20

[Diagram: the flow f maps u → θ through a stack of alternating permute and MADE blocks, each conditioned on s.]
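A structural sketch of such a stack, assuming each MADE-style block exposes a hypothetical inverse(x, s) method returning (u, log_det). Permutations contribute nothing to the log-determinant, so only the blocks' terms accumulate in the conditional density log p(θ|s):

```python
import torch
import torch.nn as nn

class StackedFlow(nn.Module):
    """Stack of autoregressive blocks with fixed permutations in between (sketch)."""
    def __init__(self, blocks, dim):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.perms = [torch.randperm(dim) for _ in blocks]  # fixed permutations

    def log_prob(self, theta, s):
        x, total = theta, 0.0
        for perm, block in zip(self.perms, self.blocks):
            x = x[:, perm]                     # permute components
            x, log_det = block.inverse(x, s)   # map toward the base space
            total = total + log_det
        base = torch.distributions.Normal(0.0, 1.0)
        return base.log_prob(x).sum(-1) + total
```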

SLIDE 21

Training

[same approach as Gabbard et al (2019)]

  • Train on (θ, s) pairs:

  • θ ∼ p(θ): 10^6 parameter samples drawn from the prior, with 35 M⊙ ≤ m1, m2 ≤ 80 M⊙, 1000 Mpc ≤ dL ≤ 3000 Mpc, 0.65 s ≤ tc ≤ 0.85 s, 0 ≤ φ0 ≤ 2π

  • s ∼ p(s|θ): strain realizations s = h(θ) + n

  • h(θ): 1-second-long whitened (fixed PSD) inspiral-merger-ringdown waveforms sampled at 1024 Hz, stored in the training set

  • n: stationary Gaussian noise sampled at train time

  • Training time ~ 6 hours

21

[Figure: example whitened waveform h and strain y = h + n, plotted against t/s.]
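A sketch of how a single (θ, s) training pair might be generated, assuming hypothetical helpers prior_sampler() for θ and whitened_waveform(theta) returning a 1 s series at 1024 Hz; the whitened stationary Gaussian noise is taken here to be unit-variance white noise for simplicity:

```python
import numpy as np

def sample_training_pair(prior_sampler, whitened_waveform, fs=1024, duration=1.0):
    theta = prior_sampler()                          # theta ~ p(theta)
    h = whitened_waveform(theta)                     # whitened h(theta), length fs * duration
    n = np.random.normal(size=int(fs * duration))    # whitened stationary Gaussian noise
    return theta, h + n                              # s = h(theta) + n
```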

SLIDE 22

Sample posterior: MAF

  • Time to draw 10,000 independent samples < 1 second.

  • Posterior pretty good, but does not properly model φ0.

22

[Corner plot: MAF posterior over m1, m2, φ0, tc, dL; inset: strain y = h + n vs. t/s.]

SLIDE 23

Variational autoencoder

Kingma and Welling (2013)

  • To increase flexibility further, introduce latent variables z. These must be marginalized over to obtain the posterior:

    p(θ|s) = ∫ p(θ|z, s) p(z|s) dz

    Both p(θ|z, s) and p(z|s) are described by neural networks.

  • This mixture of distributions is more general. To sample: (i) draw a latent variable z ∼ p(z|s) from the variational prior; (ii) draw parameters θ ∼ p(θ|z, s).

23

SLIDE 24

Variational autoencoder

Kingma and Welling (2013)

  • To train, we would like to evaluate the posterior, but the integral p(θ|s) = ∫ p(θ|z, s) p(z|s) dz is intractable.

  • The variational autoencoder introduces a third model, the recognition model q(z|θ, s), which is an approximation to the variational posterior p(z|θ, s).

  • Training maximizes the variational lower bound on log p(θ|s), namely

    L = E_{q(z|θ,s)}[log p(θ|z, s)] − D_KL(q(z|θ, s) ‖ p(z|s))

    (first term: reconstruction loss; second term: KL loss).

  • Applied by Gabbard et al (2019) to gravitational waves: with all 3 networks Gaussian, obtained similar performance to MAF.

24

[Diagram: encoder q(z|θ, s) maps θ to the latent variable z; decoder p(θ|z, s) maps z back to θ.]
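A schematic evaluation of this lower bound using the reparametrization trick, assuming three hypothetical modules recognition(theta, s), var_prior(s), and decoder(z, s), each returning a torch.distributions.Normal with independent components:

```python
import torch.distributions as dist

def elbo(theta, s, recognition, var_prior, decoder):
    q_z = recognition(theta, s)
    z = q_z.rsample()                                    # reparametrized sample, keeps gradients
    recon = decoder(z, s).log_prob(theta).sum(-1)        # reconstruction term
    kl = dist.kl_divergence(q_z, var_prior(s)).sum(-1)   # KL term (closed form for Gaussians)
    return recon - kl                                    # maximize this (minimize its negative)
```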

SLIDE 25

Variational autoencoder with normalizing flows

  • p(θ|z, s), p(z|s), and q(z|θ, s) all taken to be MAFs.

  • Training time ~ 15 hours

  • Posterior comparable to MCMC.

25

[Corner plot: neural network vs. MCMC posteriors over m1, m2, φ0, tc, dL.]

SLIDE 26

P—P plot

  • For each one-dimensional marginalized posterior, study the distribution of percentile values of the true parameters.

  • 1000 different waveforms + noise realizations.

26

[P-P plot: CDF(p) vs. p for m1 (0.55), m2 (0.59), φ0 (0.37), tc (0.46), dL (0.56).]
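A sketch of how such a P-P curve can be computed, assuming for each injection an array of one-dimensional marginal posterior samples and the corresponding true parameter value (array names are illustrative):

```python
import numpy as np

def pp_curve(posterior_samples, true_values, grid=np.linspace(0, 1, 101)):
    """posterior_samples: (n_events, n_samples); true_values: (n_events,)."""
    # percentile of the true value within each event's marginal posterior
    percentiles = np.mean(posterior_samples < true_values[:, None], axis=1)
    cdf = np.array([np.mean(percentiles <= g) for g in grid])
    return grid, cdf   # for a well-calibrated posterior, cdf ≈ grid (the diagonal)
```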

SLIDE 27

Adding aligned spins and inclination

  • Prior ranges:

    35 M⊙ ≤ m1, m2 ≤ 80 M⊙, 1000 Mpc ≤ dL ≤ 3000 Mpc, 0.65 s ≤ tc ≤ 0.85 s, 0 ≤ φ0 ≤ 2π, −1 ≤ χ1z, χ2z ≤ 1, 0 ≤ θJN ≤ π.

  • Slightly larger network

  • Sampling time now ~ 2 seconds for 10,000 samples.

27

[Corner plot: posterior over m1, m2, φ0, tc, dL, χ1z, χ2z, θJN.]

SLIDE 28

P—P plot

[P-P plot: CDF(p) vs. p for m1 (0.60), m2 (0.95), φ0 (0.80), tc (0.21), dL (0.68), χ1z (0.82), χ2z (0.60), θJN (0.55).]

28

~ 30 minutes to generate all samples

SLIDE 29

Next steps

  • Expand to full 15D parameter space: multiple detectors, sky position, non-aligned spins.

  • Allow the noise PSD to vary from event to event.

  • Waveform "compression" to allow lower-mass BBH and BNS events. These involve longer waveforms and higher sampling frequency.

  • Try to reduce the size of the training set.

29

SLIDE 30

Conclusions

  • For single-detector, aligned-spin binaries, neural networks are capable of modeling the multimodal posterior p(θ|s).
  • Training is likelihood-free, requiring only (θ, s) pairs from the data generative process.
  • After training, < 2 seconds to produce 10,000 independent samples. Compares to days for standard methods.
  • Model with CVAE and MAF has best performance:
  • Successfully models all parameters, including degeneracies.
  • Posterior comparable to MCMC.
  • Passes P-P plot statistical tests.
  • Ongoing work to develop into a complete parameter estimation tool.

30

THANK YOU