Semantics for Probabilistic Programming, by Chris Heunen


SLIDE 1

Semantics for Probabilistic Programming

Chris Heunen

1 / 27

SLIDE 2

Bayes’ law

P(A | B) = P(B | A) × P(A) / P(B)

2 / 27

SLIDE 3

Bayes’ law

P(A | B) = P(B | A) × P(A) / P(B)

Bayesian reasoning:

◮ predict future, based on model and prior evidence
◮ infer causes, based on model and posterior evidence
◮ learn better model, based on prior model and evidence

2 / 27
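A quick numeric check of Bayes' law, with made-up numbers (the prevalence, sensitivity, and false-positive rate below are hypothetical, not from the talk):

```python
# Bayes' law: P(A | B) = P(B | A) * P(A) / P(B), with P(B) expanded
# by the law of total probability. All numbers here are hypothetical.
def posterior(prior, likelihood, likelihood_given_not_a):
    evidence = likelihood * prior + likelihood_given_not_a * (1.0 - prior)
    return likelihood * prior / evidence

# A 95%-sensitive test with a 10% false-positive rate,
# for a condition with 1% prevalence:
p = posterior(prior=0.01, likelihood=0.95, likelihood_given_not_a=0.10)
# p is about 0.088: a positive test still leaves the condition unlikely.
```

Even a fairly accurate test yields a small posterior when the prior is small, which is exactly the prior-sensitivity that Bayesian reasoning makes explicit.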

SLIDE 4

Bayesian networks

3 / 27

SLIDE 5

Bayesian inference

4 / 27

SLIDE 6

Bayesian data modelling

1. Develop probabilistic (generative) model
2. Design inference algorithm for model
3. Use algorithm to fit model to data

Example: find effect of drug on patient, given data

5 / 27

SLIDE 7

Linear regression

Generative model

s ∼ normal(0, 2)
b ∼ normal(0, 6)
f(x) = s · x + b
yi ∼ normal(f(i), 0.5) for i = 0 . . . 6

Conditioning

y0 = 0.6, y1 = 0.7, y2 = 1.2, y3 = 3.2, y4 = 6.8, y5 = 8.2, y6 = 8.4

Predict f

6 / 27
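The generative model on this slide can be simulated directly. A minimal Python sketch (the talk's examples are in Anglican; this is only an illustration, and it takes the second parameter of normal as a standard deviation):

```python
import random

random.seed(0)

# Generative model from the slide: a prior over slope s and intercept b,
# then noisy observations y_i ~ normal(f(i), 0.5) for i = 0..6.
def generate():
    s = random.gauss(0.0, 2.0)   # s ~ normal(0, 2)
    b = random.gauss(0.0, 6.0)   # b ~ normal(0, 6)
    f = lambda x: s * x + b
    ys = [random.gauss(f(i), 0.5) for i in range(7)]
    return s, b, ys

s, b, ys = generate()
```

Conditioning then asks: given observed values y0, ..., y6, which draws of (s, b) are plausible?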

SLIDE 8

Linear regression

7 / 27

SLIDE 9

Probabilistic programming

1. Develop probabilistic (generative) model: write a program
2. Design inference algorithm for model: use a built-in algorithm to fit the model to data

8 / 27

SLIDE 10

Probabilistic programming

1. Develop probabilistic (generative) model: write a program
2. Design inference algorithm for model: use a built-in algorithm to fit the model to data

P(A | B) ∝ P(B | A) × P(A)
posterior ∝ likelihood × prior
functional programming + observe + sample

8 / 27


SLIDE 12

Linear regression

(defquery Bayesian-linear-regression
  (let [f (let [s (sample (normal 0.0 3.0))
                b (sample (normal 0.0 3.0))]
            (fn [x] (+ (* s x) b)))]
    (observe (normal (f 1.0) 0.5) 2.5)
    (observe (normal (f 2.0) 0.5) 3.8)
    (observe (normal (f 3.0) 0.5) 4.5)
    (observe (normal (f 4.0) 0.5) 6.2)
    (observe (normal (f 5.0) 0.5) 8.0)
    (predict :f f)))

9 / 27

SLIDE 13

Linear regression

10 / 27

SLIDE 14

Linear regression

11 / 27

SLIDE 15

Measure theory

It is impossible to sample exactly 0.5 from the standard normal distribution, but a sample lands in the interval (0, 1) with probability around 0.34.

12 / 27

SLIDE 16

Measure theory

It is impossible to sample exactly 0.5 from the standard normal distribution, but a sample lands in the interval (0, 1) with probability around 0.34.

A measurable space is a set X with a family ΣX of subsets that is closed under countable unions and complements.

A (probability) measure on X is a function p : ΣX → [0, ∞] that satisfies p(⋃n Un) = Σn p(Un) for disjoint Un (and has p(X) = 1).

12 / 27

SLIDE 17

Measure theory

It is impossible to sample exactly 0.5 from the standard normal distribution, but a sample lands in the interval (0, 1) with probability around 0.34.

A measurable space is a set X with a family ΣX of subsets that is closed under countable unions and complements.

A (probability) measure on X is a function p : ΣX → [0, ∞] that satisfies p(⋃n Un) = Σn p(Un) for disjoint Un (and has p(X) = 1).

A function f : X → Y is measurable if f⁻¹(U) ∈ ΣX for all U ∈ ΣY.

A random variable is a measurable function R → X.

12 / 27
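The "around 0.34" can be checked with the standard normal CDF, expressed via the error function (a small Python check, not part of the talk):

```python
import math

# Standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2.
def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# A single point such as 0.5 has probability zero under a continuous
# distribution, but the interval (0, 1) has probability Phi(1) - Phi(0).
p_interval = phi(1.0) - phi(0.0)
# p_interval is about 0.3413, the "around 0.34" on the slide.
```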

SLIDE 18

Function types

Currying: a morphism f : Z × X → Y corresponds to f̂ : Z → [X → Y], with ev ∘ (f̂ × idX) = f for the evaluation map ev : [X → Y] × X → Y.

13 / 27

SLIDE 19

Function types

Currying: a morphism f : Z × X → Y corresponds to f̂ : Z → [X → Y], with ev ∘ (f̂ × idX) = f for the evaluation map ev : [X → Y] × X → Y.

But [R → R] cannot be a measurable space!

13 / 27

SLIDE 20

Quasi-Borel spaces

A quasi-Borel space is a set X together with MX ⊆ [R → X] satisfying:

14 / 27

SLIDE 21

Quasi-Borel spaces

A quasi-Borel space is a set X together with MX ⊆ [R → X] satisfying:

◮ α ∈ MX if α: R → X is constant

14 / 27

SLIDE 22

Quasi-Borel spaces

A quasi-Borel space is a set X together with MX ⊆ [R → X] satisfying:

◮ α ∈ MX if α : R → X is constant
◮ α ◦ ϕ ∈ MX if α ∈ MX and ϕ : R → R is measurable

14 / 27

SLIDE 23

Quasi-Borel spaces

A quasi-Borel space is a set X together with MX ⊆ [R → X] satisfying:

◮ α ∈ MX if α : R → X is constant
◮ α ◦ ϕ ∈ MX if α ∈ MX and ϕ : R → R is measurable
◮ if R = ⋃n∈N Sn, with each set Sn Borel, and α1, α2, . . . ∈ MX, then β ∈ MX, where β(r) = αn(r) for r ∈ Sn

14 / 27

SLIDE 24

Quasi-Borel spaces

A quasi-Borel space is a set X together with MX ⊆ [R → X] satisfying:

◮ α ∈ MX if α : R → X is constant
◮ α ◦ ϕ ∈ MX if α ∈ MX and ϕ : R → R is measurable
◮ if R = ⋃n∈N Sn, with each set Sn Borel, and α1, α2, . . . ∈ MX, then β ∈ MX, where β(r) = αn(r) for r ∈ Sn

A morphism is a function f : X → Y with f ◦ α ∈ MY whenever α ∈ MX

14 / 27

SLIDE 25

Quasi-Borel spaces

A quasi-Borel space is a set X together with MX ⊆ [R → X] satisfying:

◮ α ∈ MX if α : R → X is constant
◮ α ◦ ϕ ∈ MX if α ∈ MX and ϕ : R → R is measurable
◮ if R = ⋃n∈N Sn, with each set Sn Borel, and α1, α2, . . . ∈ MX, then β ∈ MX, where β(r) = αn(r) for r ∈ Sn

A morphism is a function f : X → Y with f ◦ α ∈ MY whenever α ∈ MX

Qbs:
◮ has product types
◮ has sum types
◮ has function types!

M[X→Y] = {α : R → [X → Y] | α̂ : R × X → Y is a morphism}

14 / 27

SLIDE 26

Example quasi-Borel spaces

Set ⇄ Qbs (⊥):
X ↦ (X, {case Sn . xn | (Sn) a Borel partition of R, xn ∈ X})
(X, MX) ↦ X

15 / 27

SLIDE 27

Example quasi-Borel spaces

Set ⇄ Qbs (⊥):
X ↦ (X, {case Sn . xn | (Sn) a Borel partition of R, xn ∈ X})
(X, MX) ↦ X

Set ⇄ Qbs (⊤):
X ↦ (X, {all α : R → X})
(X, MX) ↦ X

15 / 27

SLIDE 28

Example quasi-Borel spaces

Set ⇄ Qbs (⊥):
X ↦ (X, {case Sn . xn | (Sn) a Borel partition of R, xn ∈ X})
(X, MX) ↦ X

Set ⇄ Qbs (⊤):
X ↦ (X, {all α : R → X})
(X, MX) ↦ X

Meas ⇄ Qbs (⊤):
(X, ΣX) ↦ (X, {α : R → X measurable})
(X, MX) ↦ (X, {U | ∀α ∈ MX : α⁻¹(U) measurable})

15 / 27

SLIDE 29

Distribution types

A measure on a quasi-Borel space (X, MX) consists of

◮ α ∈ MX, and
◮ a probability measure µ on R

Two measures are identified when they induce the same pushforward µ(α⁻¹(−))

16 / 27

SLIDE 30

Distribution types

A measure on a quasi-Borel space (X, MX) consists of

◮ α ∈ MX, and
◮ a probability measure µ on R

Two measures are identified when they induce the same pushforward µ(α⁻¹(−)).

This gives a monad for distribution types:

◮ P(X, MX) = {(α, µ) measure on (X, MX)} / ∼
◮ return x = [λr. x, µ]∼ for arbitrary µ
◮ bind uses integration: ∫ f d(α, µ) := ∫ (f ◦ α) dµ for morphisms f : (X, MX) → R

16 / 27
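A toy Python rendering of this representation (an illustration, not the paper's formal construction): a measure is a pair of a random element α and a base measure µ on R, here given by a sampler, and probabilities are pushforwards µ(α⁻¹(−)) estimated by Monte Carlo.

```python
import random

random.seed(1)

class Measure:
    """A measure on a quasi-Borel space, as a pair (alpha, mu):
    alpha is a random element R -> X, and mu is a probability measure
    on R, represented here by a sampler."""
    def __init__(self, alpha, sample_mu):
        self.alpha = alpha
        self.sample_mu = sample_mu

    def prob(self, pred, n=100_000):
        # Monte Carlo estimate of mu(alpha^{-1}(U)) for U = {x | pred(x)}
        return sum(pred(self.alpha(self.sample_mu())) for _ in range(n)) / n

def ret(x):
    # return x = [lambda r: x, mu] for an arbitrary mu: the choice of mu
    # does not matter, since the pushforward is a point mass either way.
    return Measure(lambda r: x, random.random)

# Pushforward of the standard normal along squaring:
# P(r^2 < 1) = P(|Z| < 1), which is about 0.68.
m = Measure(lambda r: r * r, lambda: random.gauss(0.0, 1.0))
p = m.prob(lambda x: x < 1.0)
```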

SLIDE 31

Example: facts about distributions

let x = sample(gauss(0.0, 1.0)) in
return (x < 0)

=

sample(bern(0.5))

17 / 27
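This program equality can be checked empirically; a Monte Carlo sketch in Python (both sides should produce `true` about half the time, by symmetry of the normal around 0):

```python
import random

random.seed(2)
N = 100_000

# Left-hand side: sample x from gauss(0, 1) and return (x < 0)
lhs = sum(random.gauss(0.0, 1.0) < 0.0 for _ in range(N)) / N

# Right-hand side: sample from bern(0.5)
rhs = sum(random.random() < 0.5 for _ in range(N)) / N
```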

SLIDE 32

Example: importance sampling

sample(exp(2))

=

let x = sample(gauss(0, 1)) in
observe(exp-pdf(2, x) / gauss-pdf(0, 1, x));
return x

18 / 27
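A self-normalised importance-sampling sketch of this equality in Python (illustrative only: gauss(0, 1) is a poor proposal for exp(2) far in the tails, but it serves for a demonstration):

```python
import math
import random

random.seed(3)

# Samples from gauss(0, 1), weighted by exp-pdf(2, x) / gauss-pdf(0, 1, x),
# behave like samples from exp(2), the exponential with rate 2 (mean 1/2).
def exp_pdf(rate, x):
    return rate * math.exp(-rate * x) if x >= 0.0 else 0.0

def gauss_pdf(mu, sigma, x):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

N = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
ws = [exp_pdf(2.0, x) / gauss_pdf(0.0, 1.0, x) for x in xs]

# Weighted estimate of the mean; for exp(2) the true mean is 0.5.
est_mean = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
```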
SLIDE 33

Example: conjugate priors

let x = sample(beta(1, 1)) in
observe(bern(x), true);
return x

=

observe(bern(0.5), true);
let x = sample(beta(2, 1)) in
return x

19 / 27
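This equality is the usual Beta-Bernoulli conjugacy, which can be checked analytically. A small sketch (the helper name is mine, not the talk's): observing `true` from bern(x) under a beta(a, b) prior on x gives posterior beta(a + 1, b), with marginal likelihood a / (a + b).

```python
# Conjugate update for a Bernoulli observation under a beta prior.
def update_beta_bernoulli(a, b, obs):
    evidence = a / (a + b) if obs else b / (a + b)
    posterior = (a + 1, b) if obs else (a, b + 1)
    return posterior, evidence

post, z = update_beta_bernoulli(1.0, 1.0, True)
# post == (2.0, 1.0) and z == 0.5: the beta(2, 1) and bern(0.5) on the slide.
```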
SLIDE 34

Linear regression

(defquery Bayesian-linear-regression
  ; Prior:
  (let [f (let [s (sample (normal 0.0 3.0))
                b (sample (normal 0.0 3.0))]
            (fn [x] (+ (* s x) b)))]
    ; Likelihood:
    (observe (normal (f 1.0) 0.5) 2.5)
    (observe (normal (f 2.0) 0.5) 3.8)
    (observe (normal (f 3.0) 0.5) 4.5)
    (observe (normal (f 4.0) 0.5) 6.2)
    (observe (normal (f 5.0) 0.5) 8.0)
    ; Posterior:
    (predict :f f)))

20 / 27

SLIDE 35

Linear regression: prior

Define a prior measure on [R → R]:

(let [f (let [s (sample (normal 0.0 3.0))
              b (sample (normal 0.0 3.0))]
          (fn [x] (+ (* s x) b)))]

=

[α, ν ⊗ ν]∼ ∈ P([R → R]), where ν is the normal distribution with mean 0 and standard deviation 3, and α : R × R → [R → R] is (s, b) ↦ λr. s·r + b

21 / 27

SLIDE 36

Linear regression: likelihood

Define the likelihood of the observations (with some noise):

(observe (normal (f 1.0) 0.5) 2.5)
(observe (normal (f 2.0) 0.5) 3.8)
(observe (normal (f 3.0) 0.5) 4.5)
(observe (normal (f 4.0) 0.5) 6.2)
(observe (normal (f 5.0) 0.5) 8.0)

=

d(f(1), 2.5) · d(f(2), 3.8) · d(f(3), 4.5) · d(f(4), 6.2) · d(f(5), 8.0)

where f is a free variable of type [R → R], and d : R² → [0, ∞) is the density of the normal distribution with standard deviation 0.5:

d(µ, x) = √(2/π) · exp(−2(x − µ)²)

22 / 27

SLIDE 37

Linear regression: Posterior

Normalise the combined prior and likelihood:

(predict :f f))) ∈ P([R → R])

23 / 27

SLIDE 38

Piecewise linear regression: Posterior

Normalise the combined prior and likelihood:

(predict :f f))) ∈ P([R → R])

24 / 27

SLIDE 39

Modular inference algorithms

An inference representation is a monad (T, return, >>=) with a meaning map T X → P X, sample : 1 → T [0, 1], and score : [0, ∞) → T 1.

◮ Discrete weighted sampler (e.g. coin flip)
◮ Continuous sampler

25 / 27

SLIDE 40

Modular inference algorithms

An inference representation is a monad (T, return, >>=) with a meaning map T X → P X, sample : 1 → T [0, 1], and score : [0, ∞) → T 1.

◮ Discrete weighted sampler (e.g. coin flip)
◮ Continuous sampler

An inference transformer respects meaning, sample, and score.

◮ List: T(−) → T(List(−))
◮ Continuous weighting: T(−) → T([0, ∞) ∗ (−))
◮ Population: T(−) → T(List([0, ∞) ∗ (−)))

25 / 27
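A toy weighted sampler in the spirit of the continuous-weighting representation (a Python sketch, with names of my choosing, not the paper's library): a computation is a thunk returning (value, weight); `sample` draws uniformly from [0, 1], `score` multiplies in a likelihood weight, and the meaning map to P normalises the weights.

```python
import random

random.seed(4)

def ret(x):
    return lambda: (x, 1.0)

def bind(m, k):
    # Run m, feed its value to k, and multiply the weights.
    def run():
        x, w1 = m()
        y, w2 = k(x)()
        return y, w1 * w2
    return run

def sample():
    return lambda: (random.random(), 1.0)

def score(w):
    return lambda: (None, w)

# Example: weight a uniform sample by 2u, conditioning towards larger u.
prog = bind(sample(), lambda u: bind(score(2.0 * u), lambda _: ret(u)))
draws = [prog() for _ in range(100_000)]

# "Meaning" of the weighted sampler: the self-normalised posterior mean.
# The target density on [0, 1] is proportional to u, which has mean 2/3.
post_mean = sum(u * w for u, w in draws) / sum(w for _, w in draws)
```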
SLIDE 41

Modular inference algorithms library

Sequential Monte Carlo: approximate a distribution by a population of weighted samples (particles / suspended computations), repeatedly applying a fixed random process (particle filter)

26 / 27

SLIDE 42

Want more?

◮ “Semantics for probabilistic programming: higher-order functions, continuous distributions, and soft constraints”, LICS 2016
◮ “A convenient category for higher-order probability theory”, LICS 2017
◮ “Denotational validation of higher-order Bayesian inference”, POPL 2018

27 / 27

SLIDE 43

De Finetti’s theorem

Every exchangeable sequence of random observations in R can be generated by:

◮ choose a single probability distribution on R
◮ sample from it independently, repeatedly

SLIDE 44

De Finetti’s theorem

Every exchangeable sequence of random observations in a quasi-Borel space X can be generated by:

◮ choose a single probability distribution on X
◮ sample from it independently, repeatedly

SLIDE 45

Trace Markov Chain Monte Carlo

Repeatedly use kernel to propose new value, decide whether to accept (Metropolis-Hastings update). Random walk in target space: program traces.

◮ Metropolis-Hastings-Green: the update preserves the distribution.
◮ Program traces form an inference representation.
◮ Trace MCMC is an inference transformation (parametrised by a proposal kernel).
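A minimal Metropolis-Hastings random walk in Python (a simplified stand-in: here the "trace" is a single real value rather than a full program trace, and the target is a standard normal):

```python
import math
import random

random.seed(5)

def target(x):
    # Unnormalised density of the standard normal.
    return math.exp(-0.5 * x * x)

def mh(n_steps, step=1.0):
    x, out = 0.0, []
    for _ in range(n_steps):
        prop = x + random.gauss(0.0, step)           # symmetric proposal kernel
        if random.random() < target(prop) / target(x):
            x = prop                                 # accept the proposal
        out.append(x)                                # on reject, keep the old value
    return out

chain = mh(50_000)
mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
# For the standard-normal target, mean is near 0 and variance near 1.
```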