

SLIDE 1

Semantic Foundations for Probabilistic Programming

Chris Heunen
with Ohad Kammar, Sam Staton, Frank Wood, Hongseok Yang

SLIDES 2-6

Semantic foundations

Programs are interpreted as mathematical objects: a program s1 gets a denotation ⟦s1⟧, and composite programs such as s1;s2 are interpreted in terms of the denotations of their parts.

◮ Operational semantics: remembers implementation details (efficiency)
◮ Denotational semantics: captures what a program does conceptually (correctness)

Motivation:

◮ Ground the programmer's unspoken intuitions
◮ Justify/refute/suggest program transformations
◮ Understand programming through mathematics
◮ Understand probability through program equations

SLIDES 7-10

Probabilistic programming

Bayes' rule: P(A | B) = P(B | A) × P(A) / P(B), i.e.

P(A | B) ∝ P(B | A) × P(A)
posterior ∝ likelihood × prior

idealized Anglican = functional programming + sample + observe + normalize

http://www.robots.ox.ac.uk/~fwood/anglican

SLIDES 11-12

Overview

◮ Interpret types as measurable spaces, e.g. ⟦real⟧ = R
◮ Interpret (open) terms as kernels
◮ Interpret closed terms as measures
◮ Inference normalizes measures: posterior ∝ likelihood × prior

But:

◮ Commutativity? Fubini's theorem is not true for all kernels
◮ Higher order functions? R → R is not a measurable space
◮ Extensionality?
◮ Recursion?

[Kozen, "Semantics of probabilistic programs", J Comp Syst Sci, 1981]
[Aumann, "Borel structures for function spaces", Ill J Math, 1961]

SLIDES 13-21

Example

1. Toss a fair coin to get outcome x
2. Set up exponential decay with rate r depending on x
3. Observe immediate decay
4. What is the outcome x?

let x = sample(bern(0.5)) in
let r = if x then 2.0 else 1.0 in
observe(0.0 from exp(r));
return x

There are two traces, each with prior probability 0.5:

◮ x = true: r = 2.0, the observation scores 2, and the program returns true
◮ x = false: r = 1.0, the observation scores 1, and the program returns false

posterior ∝ likelihood × prior:

◮ true: 2 × 0.5 = 1.0
◮ false: 1 × 0.5 = 0.5

Model evidence (total score): 1.5, so after normalizing P(true) = 66%, P(false) = 33%.

Programs may also sample continuous distributions, so we have to deal with an uncountable number of traces:

let y = sample(gauss(7,2))
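A minimal sketch of this calculation in plain Python (not Anglican): enumerate the two traces by hand, weight each by its score, and normalize. The helper exp_pdf is mine.

```python
import math

def exp_pdf(rate, x):
    # Density of the exponential distribution with the given rate.
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

weights = {}
for x in (True, False):            # prior: fair coin, so two traces
    prior = 0.5
    r = 2.0 if x else 1.0          # decay rate depends on x
    score = exp_pdf(r, 0.0)        # observe immediate decay: density at 0.0 is r
    weights[x] = prior * score     # unnormalized posterior weight

evidence = sum(weights.values())                      # model evidence: 1.5
posterior = {x: w / evidence for x, w in weights.items()}
print(evidence, posterior)                            # 1.5 {True: 0.66..., False: 0.33...}
```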

SLIDES 22-23

Measure theory

It is impossible to sample exactly 0.5 from the standard normal distribution, but a sample lands in the interval (0, 1) with probability around 0.34.

A measurable space is a set X with a family Σ_X of subsets that is closed under countable unions and complements.

A measure on X is a function p: Σ_X → [0, ∞] that satisfies p(⋃_n U_n) = Σ_n p(U_n) for pairwise disjoint U_n; it is a probability measure if moreover p(X) = 1.
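A quick sketch of where the 0.34 figure comes from, using only the Python standard library: the standard normal CDF is Φ(x) = (1 + erf(x/√2))/2.

```python
import math

def std_normal_cdf(x):
    # CDF of the standard normal distribution, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p = std_normal_cdf(1.0) - std_normal_cdf(0.0)
print(round(p, 4))   # 0.3413, the "around 0.34" above
```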

SLIDES 24-27

First order language

◮ Types:

A, B ::= R | P(A) | 1 | A × B | Σ_{i∈I} A_i

real numbers, distributions over A, finite products, countable sums;
bool := 1 + 1, nat := Σ_{i∈N} 1

◮ Deterministic terms may not sample:

◮ variables: x, y, z
◮ constructors for sums and products: case, in_i, if, false, true
◮ measurable functions: bern, exp, gauss, dirac
◮ inference: norm

⊢_d 42.0 : R
⊢_d gauss(2.0, 7.0) : P(R)
x: R, y: R ⊢_d x + y : R
x: R, y: R ⊢_d x < y : bool

◮ Probabilistic terms may sample:

◮ sequencing: return, let
◮ constraints: score
◮ priors: sample

Typing rules:

Γ ⊢_d t : A  ⟹  Γ ⊢_p return(t) : A
Γ ⊢_p t : A and Γ, x: A ⊢_p u : B  ⟹  Γ ⊢_p let x = t in u : B
Γ ⊢_d t : R  ⟹  Γ ⊢_p score(t) : 1
Γ ⊢_d t : P(A)  ⟹  Γ ⊢_p sample(t) : A

SLIDES 28-32

First order semantics

Interpret

◮ a type A as a measurable space ⟦A⟧
◮ a deterministic term Γ ⊢_d t : A as a measurable function ⟦t⟧ : ⟦Γ⟧ → ⟦A⟧
◮ a probabilistic term Γ ⊢_p t : A as a kernel ⟦t⟧ : ⟦Γ⟧ × Σ_⟦A⟧ → [0, ∞]
  (fixing the first argument gives a measure; fixing the second gives a measurable function)

⟦score(t)⟧(γ, ∗) = ⟦t⟧(γ)

⟦sample(t)⟧(γ, U) = (⟦t⟧(γ))(U)

⟦let x = t in u⟧(γ, U) = ∫_⟦A⟧ ⟦u⟧(γ, x, U) ⟦t⟧(γ, dx), abbreviated ∫_⟦A⟧ ⟦u⟧ d⟦t⟧

SLIDE 33

Example

let x = sample(bern(0.5)) in
let r = if x then 2.0 else 1.0 in
observe(0.0 from exp(r));
return x

The meaning of a program returning values in X is a measure on X:

∅ has measure 0.0
{true} has measure 1.0 = 0.5 × 2.0
{false} has measure 0.5 = 0.5 × 1.0
{true, false} has measure 1.5
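A small sketch of this measure semantics for programs over finite discrete spaces, where the integral in the let rule becomes a finite sum. A measure is represented as a dict from values to masses; the helper names (ret, sample_, score, let_) are mine, not the paper's.

```python
def ret(v):                        # return(t): Dirac measure at v
    return {v: 1.0}

def sample_(dist):                 # sample(t): the prior itself, as a measure
    return dict(dist)

def score(w):                      # score(t): mass w on the unit value ()
    return {(): w}

def let_(t, u):                    # let x = t in u: sum_x u(x)(U) * t({x})
    out = {}
    for x, mass in t.items():
        for y, m in u(x).items():
            out[y] = out.get(y, 0.0) + mass * m
    return out

# The example program; its meaning is the measure {True: 1.0, False: 0.5}.
prog = let_(sample_({True: 0.5, False: 0.5}),
            lambda x: let_(score(2.0 if x else 1.0),   # observe(0.0 from exp(r))
                           lambda _: ret(x)))
print(prog)   # {True: 1.0, False: 0.5}, total mass 1.5
```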

SLIDES 34-38

Normalization: posterior ∝ likelihood × prior

Γ ⊢_p t : A  ⟹  Γ ⊢_d norm(t) : R × P(A) + 1 + 1

where R × P(A) packages the model evidence with the normalized posterior, and the two extra summands are errors (for when the normalizing constant is 0 or ∞).

The interpretation of a probabilistic term is a kernel ⟦Γ⟧ × Σ_⟦A⟧ → [0, ∞], so fixing the first argument gives a measure ⟦t⟧(γ, −). Then ⟦t⟧(γ, −) / ⟦t⟧(γ, ⟦A⟧) is a normalized probability measure, and the normalizing constant ⟦t⟧(γ, ⟦A⟧) is the model evidence.

Example:

⟦ let x = sample(bern(0.5)) in
  let r = if x then 2.0 else 1.0 in
  observe(0.0 from exp(r));
  return x ⟧
= [ true ↦ 2.0 × 0.5, false ↦ 1.0 × 0.5 ]

norm( let x = sample(bern(0.5)) in
      let r = if x then 2.0 else 1.0 in
      observe(0.0 from exp(r));
      return x )
= in1(1.5, bern(0.66))
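A sketch of norm on a finite discrete measure, in the same dict representation as above: return the evidence and the normalized posterior, or an error tag when the normalizing constant is 0 or infinite. Illustrative only.

```python
import math

def norm(measure):
    # measure: dict {value: mass}, an unnormalized posterior.
    evidence = sum(measure.values())
    if evidence == 0.0:
        return "error: evidence is 0"
    if math.isinf(evidence):
        return "error: evidence is infinite"
    posterior = {v: m / evidence for v, m in measure.items()}
    return (evidence, posterior)

print(norm({True: 1.0, False: 0.5}))
# (1.5, {True: 0.66..., False: 0.33...})  -- i.e. in1(1.5, bern(0.66))
```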
SLIDE 39

Example: sequential Monte Carlo

norm( let x = t in u )
=
norm( let (e, d) = norm(t) in
      score(e); let x = sample(d) in u )
SLIDES 40-42

Example: importance sampling

sample(exp(2))
=
let x = sample(gauss(0,1)) in
score(exp-pdf(2,x) / gauss-pdf(0,1,x)); return x
=
let x = sample(gauss(0,1)) in
score(1 / gauss-pdf(0,1,x)); score(exp-pdf(2,x)); return x

Applying norm:

norm( sample(exp(2)) )
=
norm( let x = sample(gauss(0,1)) in
      score(exp-pdf(2,x) / gauss-pdf(0,1,x)); return x )

but not

norm( norm( let x = sample(gauss(0,1)) in
            score(1 / gauss-pdf(0,1,x)); return x );
      score(exp-pdf(2,x)); return x )

since the inner program denotes a measure of infinite total mass, so the inner normalization fails. Don't normalize as you go.
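A sketch of the first equation run operationally: draw proposals from gauss(0,1), weight each draw by exp-pdf(2,x)/gauss-pdf(0,1,x), and the weighted samples behave like draws from exp(2), e.g. the self-normalized estimate of the mean comes out near 1/2. Standard library only; the pdf helpers are mine.

```python
import math, random

def exp_pdf(rate, x):
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def gauss_pdf(mu, sigma, x):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(200_000)]          # proposal samples
ws = [exp_pdf(2.0, x) / gauss_pdf(0.0, 1.0, x) for x in xs]    # importance weights

est_mean = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
print(est_mean)   # close to 0.5, the mean of exp(2)
```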

SLIDES 43-47

Commutativity

Reordering lines is a very useful program transformation:

let x = t in let y = u in v
=
let y = u in let x = t in v

It amounts to Fubini's theorem:

∫_A ∫_B v du dt = ∫_B ∫_A v dt du

This is not true for arbitrary kernels, only for s-finite ones:

◮ a kernel k : Γ × Σ_A → [0, ∞] is bounded if ∃n ∀γ ∀U : k(γ, U) < n
◮ a kernel is s-finite when it is a countable sum of bounded kernels
◮ a kernel k is s-finite iff it can be built from sub-probability distributions, score, and binding,
  where k >>= l is (γ, V) ↦ ∫_A l(γ, x, V) k(γ, dx)
◮ measurable spaces and s-finite kernels form a distributive symmetric monoidal category

Interpret terms as s-finite kernels.
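A tiny sketch of the reordering equation in the finite discrete case, where the integrals become sums: enumerate the weighted traces of a two-sample program whose score depends on both variables, in both binding orders, and compare the resulting unnormalized measures. The example program and helper names are mine.

```python
def measure_xy():
    out = {}
    for x, px in ((True, 0.5), (False, 0.5)):        # let x = sample(bern(0.5)) in
        for y, py in ((True, 0.3), (False, 0.7)):    #   let y = sample(bern(0.3)) in
            w = px * py * (2.0 if x == y else 0.5)   #     score(...); return (x, y)
            out[(x, y)] = out.get((x, y), 0.0) + w
    return out

def measure_yx():                                    # the two lets reordered
    out = {}
    for y, py in ((True, 0.3), (False, 0.7)):
        for x, px in ((True, 0.5), (False, 0.5)):
            w = px * py * (2.0 if x == y else 0.5)
            out[(x, y)] = out.get((x, y), 0.0) + w
    return out

print(measure_xy() == measure_yx())   # True: the unnormalized measures agree
```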

SLIDE 48

Example: facts about distributions

let x = sample(gauss(0.0,1.0)) in return (x < 0)
=
sample(bern(0.5))
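A quick simulation sketch of this equation: sample a standard normal and return whether it is negative; by symmetry the frequency of true is close to 0.5, matching bern(0.5).

```python
import random

random.seed(1)
n = 100_000
freq_true = sum(random.gauss(0.0, 1.0) < 0 for _ in range(n)) / n
print(freq_true)   # close to 0.5, i.e. sample(bern(0.5))
```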

SLIDE 49

Example: conjugate priors

let x = sample(beta(1,1)) in
observe(bern(x), true); return x
=
observe(bern(0.5), true);
let x = sample(beta(2,1)) in return x
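A numerical sketch of this equation by grid integration: on the left, the beta(1,1) prior is uniform and observing true from bern(x) weights the density by x, so the evidence is ∫ x dx = 0.5 and the posterior is proportional to x, i.e. beta(2,1); on the right the same 0.5 appears as an explicit score. Illustrative only.

```python
n = 100_000
xs = [(i + 0.5) / n for i in range(n)]               # grid midpoints in (0, 1)

evidence = sum(xs) / n                                # ∫ x · beta(1,1)(x) dx ≈ 0.5
post_mean = sum(x * x for x in xs) / n / evidence     # E[x | observed true]
beta21_mean = 2 / (2 + 1)                             # mean of beta(2,1)

print(round(evidence, 3), round(post_mean, 3), round(beta21_mean, 3))
# 0.5 0.667 0.667
```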
SLIDES 50-53

Higher order functions

Allow probabilistic terms as input/output for other terms:

A, B ::= R | P(A) | 1 | A × B | Σ_{i∈I} A_i | A → B

But R → R is not a measurable space.
Easy to handle operationally. What to do denotationally?

[Roy et al, "A stochastic programming perspective on nonparametric Bayes", ICML 2008]
[Aumann, "Borel structures for function spaces", Ill J Math, 1961]
[Borgström et al, "Measure transformer semantics for Bayesian machine learning", ESOP 2011]

SLIDES 54-58

Higher order semantics

Use category theory to extend measure theory:

◮ measurable spaces: not enough function spaces
◮ sheaves on measurable spaces (presheaves on measurable spaces that preserve countable products): all function spaces, and the embedding of measurable spaces preserves all structure
◮ the Giry monad on measurable spaces extends along this embedding (by left Kan extension), interpreting the distribution types A ↦ P(A)

[Power, "Generic models for computational effects", Th Comp Sci 2006]

Consequences:

◮ ⟦1 → (R → R)⟧ consists of random functions: measurable Ω × R → R
◮ All definable functions R → R are measurable ("Church-Turing")
◮ Denotational and operational semantics match (soundness & adequacy)

SLIDES 59-63

Extensionality

The category is not extensional: there are maps f, g : A → B with f ∘ p = g ∘ p for every point p : 1 → A, yet f ≠ g.
Solution: restrict to a subcategory that is extensional.

A quasi-measurable space is a set X with M_X ⊆ [R → X] satisfying:

◮ if f : R → R is measurable and g ∈ M_X, then g ∘ f ∈ M_X
◮ if f : R → X is constant, then f ∈ M_X
◮ if f : R → N is measurable and g_n ∈ M_X, then [g_n] ∘ f ∈ M_X, i.e. t ↦ g_{f(t)}(t)

Morphisms are functions f : X → Y with g ∈ M_X ⇒ f ∘ g ∈ M_Y.

Example: X a measurable space, M_X the measurable functions R → X; a morphism X → Y is then a measurable function.

Theorem: this gives a cartesian closed category with countable sums.
Corollary: if a term t has first order type, then ⟦t⟧ is measurable, even if t involves higher order functions.

A measure on (X, M_X) is a measure µ on R together with a function f ∈ M_X.
Proposition: measures on [X → Y] are random functions: a measurable map R × X → Y modulo a measure on R.

SLIDE 64

Recursion

So far, no recursion / least fixed points. Idea: restrict to presheaves over domains.

An ω-complete partial order (ωcpo) has suprema of increasing sequences x1 ≤ x2 ≤ x3 ≤ ... ≤ sup x_n; morphisms preserve suprema of increasing sequences (and infima).

A quasi-measurable space is ordered when X is an ωcpo and M_X is closed under pointwise suprema of increasing sequences.

Example: any ωcpo, e.g. [0,1]; take M_X to be all measurable functions R → X, where X carries the Borel σ-algebra of the Lawson topology.

Theorem: this gives a cartesian closed category with countable sums.

SLIDE 65

Example: von Neumann's trick

let g = bern(0.66) in
letrec f() = (let x = sample(g) in
              let y = sample(g) in
              if x = y then f() else return x)
in f()
?=
sample(bern(0.5))
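A sketch of the trick run operationally: flip the biased coin twice, retry on a tie, otherwise return the first flip. The output frequency is close to 0.5 even though the input coin is bern(0.66).

```python
import random

def biased():                     # sample(bern(0.66))
    return random.random() < 0.66

def von_neumann():                # letrec f() = ... in f()
    while True:
        x, y = biased(), biased()
        if x != y:
            return x

random.seed(2)
n = 100_000
freq_true = sum(von_neumann() for _ in range(n)) / n
print(freq_true)                  # close to 0.5
```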
SLIDE 66

Conclusion

Foundational semantics for probabilistic programming:

◮ continuous distributions
◮ soft constraints
◮ commutativity
◮ higher order functions
◮ recursion

can verify/suggest program transformations. Approximations?