SLIDE 1

Tutorial on Probabilistic Programming in Machine Learning

Frank Wood

SLIDE 2

Play Along

  • 1. Download and install Leiningen: http://leiningen.org/
  • 2. Fork and clone the Anglican Examples repository: git@bitbucket.org:fwood/anglican-examples.git (or https://fwood@bitbucket.org/fwood/anglican-examples.git)
  • 3. Enter the repository directory and type “lein gorilla”
  • 4. Open the local URL in a browser
  • 5. _OR_ http://www.robots.ox.ac.uk/~fwood/anglican/examples/index.html
SLIDE 3

Motivation and Background

AI

SLIDE 4

Probabilistic Deterministic Infinite Automata (PDIA)

  • World states learned
  • Per-state emissions learned
  • Per-state deterministic transition functions learned
  • Infinite-state limit of the model to the right
  • Unsupervised PDFA structure learning biases towards compact (few states) world-models that are fast approximate predictors

Problem

  • ~4000 lines of Java code
  • New student re-implementation: 6-12 months

Pfau, Bartlett, and W. Probabilistic Deterministic Infinite Automata, NIPS, 2011
Doshi-Velez, Pfau, W., and Roy. Bayesian Nonparametric Methods for Partially-Observable Reinforcement Learning, TPAMI, 2013

A Prior over PDFA with a bounded number of states:

  µ ∼ Dir(α0/|Q|)
  φj ∼ Dir(αµ),                 j = 0, . . . , |Σ| − 1
  δ(qi, σj) = δij ∼ φj,         i = 0, . . . , |Q| − 1
  πqi ∼ Dir(β/|Σ|),             i = 0, . . . , |Q| − 1
  ξ0 = q0,  ξt = δ(ξt−1, xt−1)
  xt ∼ πξt

Notation:

  • finite set of states: Q
  • finite alphabet: Σ
  • transition function: δ : Q × Σ → Q
  • emission distribution: π : Q × Σ → [0, 1]
  • initial state: q0 ∈ Q
  • data at time t: xt ∈ Σ
  • state at time t: ξt ∈ Q
  • hyperparameters: α, α0, β

[Figure: example transition table δ and emission table π, indexed by states q0, q1, . . . and symbols σ0, σ1, σ2, each row an iid probability vector.]
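A minimal Anglican sketch of this bounded-|Q| PDFA prior (a hypothetical illustration, not the linked example code; it assumes the data are symbols encoded as integers 0, . . . , |Σ|−1 and uses Anglican's dirichlet, discrete, and mem):

(defquery pdfa-prior [data Q-size S-size alpha alpha0 beta]
  (let [mu    (sample (dirichlet (vec (repeat Q-size (/ alpha0 Q-size)))))
        ;; per-symbol distribution over destination states, phi_j ~ Dir(alpha * mu)
        phi   (mem (fn [j] (sample (dirichlet (vec (map #(* alpha %) mu))))))
        ;; transition function delta(q, sigma_j) ~ phi_j, instantiated lazily
        delta (mem (fn [q j] (sample (discrete (phi j)))))
        ;; per-state emission distribution pi_q ~ Dir(beta / |Sigma|)
        pi    (mem (fn [q] (sample (dirichlet (vec (repeat S-size (/ beta S-size)))))))]
    (loop [xs data, q 0]                         ; xi_0 = q_0 = 0
      (when (seq xs)
        (observe (discrete (pi q)) (first xs))   ; x_t ~ pi_{xi_t}
        (recur (rest xs) (delta q (first xs))))) ; xi_{t+1} = delta(xi_t, x_t)
    (predict :mu mu)))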

Probabilistic Deterministic Finite Automata

[Figures: example PDFA (Trigram as DFA without probability, Even Process, Reber Grammar, Foulkes Grammar), drawn as state diagrams with per-edge symbol/probability labels such as B/1.0, T/0.5, P/0.5.]

Unsupervised Automata Induction

SLIDE 5

Dependent Dirichlet Process Mixture of Objects

  • World state ≈ dependent infinite mixture of motile objects that may appear, disappear, and occlude
  • Per-state, per-object emission model learned
  • Per-state, per-object complex transition functions learned

Problem

  • ~5000 lines of Matlab code
  • Implementation: ~1 year
  • Generative model: ~1 page of LaTeX math
  • Inference derivation: ~3 pages of LaTeX math

Neiswanger, Wood, and Xing. The Dependent Dirichlet Process Mixture of Objects for Detection-free Tracking and Object Modeling, AISTATS, 2014
Caron, F., Neiswanger, W., Wood, F., Doucet, A., & Davy, M. Generalized Pólya Urn for Time-Varying Pitman-Yor Processes. JMLR. To appear 2015


Unsupervised Tracking

SLIDE 6

Outline

  • Supervised modeling and maximum likelihood aren’t the whole ML story
  • Exact inference is great when possible but too restrictive to be the sole focus
  • Continuous variables are essential to machine learning
  • Unsupervised modeling is a growing and important challenge
  • We can write (and actually perform inference) in some pretty advanced models using existing probabilistic programming systems now
  • Inference is key, and general purpose inference for probabilistic programming is hard, but there is room for improvement and there is a big role for the PL community to play

SLIDE 7

Questions

  • Is probabilistic programming really just about tool and library building for ML applications?
  • Are gradients so important that we should just use HMC or stochastic gradients and neural nets for everything (re: neural Turing machine) and give up on anything resembling combinatorial search? As a corollary, for anything that does reduce to combinatorial search, should we just compile to SAT solvers?
  • Is it more important for purposes of language aesthetic and programmer convenience to syntactically allow constraint/equality observations than to minimize the risk of programmers writing "impossible" inference problems by forcing them to specify a “noisy” observation likelihood?
  • Is it possible to develop program transformations that succeed and fail in sufficiently intuitive ways to give programmers the facility of equality constraints when appropriate while not allowing them to make measure-zero observations of random variables whose support is “dangerously” complex (uncountable, high-dimensional, etc.)?
  • How possible and important is it to identify program fragments in which exact inference can be performed? Is it possible that the most important probabilistic programming output will be static analysis tools that identify fragments of arbitrary programs/models in which exact inference can be performed and then also transform them into an amenable form for running exact inference?
  • Are first-class distributions a requirement? Should they be hidden from the user?
  • What is "query"? What should we name it? What should it mean? We've just called "query" "distribution" because rejection-query is, in Church, just an exact sampler from a conditional distribution, vs. mh-query which is different. Can't we unify them all and simply call them "query" if you don't have first-class distributions or "distribution" if you do and distributions support a sample interface?
  • Should we expose the "internals" of the inference mechanism to the end user? Is the ability to draw a single unweighted approximate sample enough? Is a set of converging weighted samples enough? Do we need exact samples? Do we want an evidence approximation? Or converging expectation computations? Are compositions of the latter meaningful in any theoretically interpretable way?
  • What is the "deliverable" from inference? How do we amortize inference across queries/tasks? Is probabilistic program compilation Rao-Blackwellization or transfer learning or both?
  • Is probabilistic programming just an efficient mechanism or means for searching for models in which approximate inference is sufficiently
SLIDE 8

Outline

  • Supervised modeling and maximum likelihood aren’t the whole ML story
  • Exact inference is great when possible but too restrictive to be the sole focus
  • Continuous variables are essential to machine learning
  • Unsupervised modeling is a growing and important challenge
  • We can write (and actually perform inference) in some pretty advanced models using existing probabilistic programming systems now
  • Inference is key, and general purpose inference for probabilistic programming is hard, but there is room for improvement and there is a big role for the PL community to play

SLIDE 9

The Maximum Likelihood Principle

L(X, y, θ) = ∏n p(yn | xn, θ)

argmaxθ L(X, y, θ) = argmaxθ Σn log p(yn | xn, θ) = θ*  such that  ∇θ Σn log p(yn | xn, θ*) = 0

(defn sgd [X y f' theta-init num-iters stepsize]
  ;; stochastic gradient descent: f' returns the gradient of the
  ;; negative log-likelihood for a single randomly chosen data point
  (loop [theta theta-init
         num-iters num-iters]
    (if (= num-iters 0)
      theta
      (let [n        (rand-int (count y))
            gradient (f' (get X n) (get y n) theta)
            theta    (map - theta (map #(* stepsize %) gradient))]
        (recur theta (dec num-iters))))))

SLIDE 10

Reasonable Analogy?

Automatic Differentiation : Supervised Learning  ::  Probabilistic Programming : Unsupervised Learning

SLIDE 11

Example: Logistic Regression

≈ Shallow feed forward neural net

p(yn = 1 | xn, w) = 1 / (1 + exp{−w0 − Σd wd xnd})

Logistic Regression Maximum Likelihood Code

[Table: Iris data, features x (sepal/petal measurements) and label y, e.g. 5 2 3.5 1 Iris-versicolor; 6 2.2 4 1 Iris-versicolor; 4.5 2.3 1.3 0.3 Iris-setosa; 6 2.2 5 1.5 Iris-virginica.]
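As a concrete illustration (a hypothetical sketch, not the linked workbook code), the per-example gradient of the negative log-likelihood for this model can be plugged straight into the sgd function from the earlier slide; here x is a feature vector, y is 0 or 1, and theta is [w0 w1 ... wd]:

(defn logistic [z]
  (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn logreg-grad [x y theta]
  ;; gradient of -log p(y | x, theta) for logistic regression
  (let [z   (+ (first theta) (reduce + (map * (rest theta) x)))
        p   (logistic z)
        err (- p y)]                    ; derivative w.r.t. the linear score
    (cons err (map #(* err %) x))))     ; [d/dw0, d/dw1, ..., d/dwd]

;; e.g. (sgd X y logreg-grad (repeat (inc (count (first X))) 0.0) 10000 0.01)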

SLIDE 12

Pros vs. Cons

  • Cons
  • Lack of convexity requires multiple restarts
  • Brittle
  • Lacking uncertainty quantification, point estimates aren’t naturally composable

  • Pros
  • Fast
  • Learned parameter value is “compiled deliverable”
SLIDE 13

Questions

  • Is probabilistic programming really just tool and library building for ML applications?
  • Are gradients so important that we should just use differentiable models with stochastic gradients for everything (re: neural Turing machine, OpenDR, Picture, Stan) and give up on anything resembling combinatorial search?

SLIDE 14

Outline

  • Supervised modeling and maximum likelihood aren’t the whole ML story
  • Exact inference is great when possible but too restrictive to be the sole focus
  • Continuous variables are essential to machine learning
  • Unsupervised modeling is a growing and important challenge
  • We can write (and actually perform inference) in some pretty advanced models using existing probabilistic programming systems now
  • Inference is key, and general purpose inference for probabilistic programming is hard, but there is room for improvement and there is a big role for the PL community to play

SLIDE 15

Exact inference; pros vs. cons

  • Pros
  • Exact probability computation; no worries about convergence or approximation
  • Uncertainty calculus allows principled “model” composition
  • Provides a "deliverable" in terms of a complete characterization of a (usually) low-dimensional conditional probability distribution

  • Cons
  • Restrictive: Only possible in a small subset of models
  • Costly: Exponential in tree-width
SLIDE 16

Outline

  • Supervised modeling and maximum likelihood aren’t the whole ML story
  • Exact inference is great when possible but too restrictive to be the sole focus
  • Continuous variables are essential to machine learning
  • Unsupervised modeling is a growing and important challenge
  • We can write (and actually perform inference) in some pretty advanced models using existing probabilistic programming systems now
  • Inference is key, and general purpose inference for probabilistic programming is hard, but there is room for improvement and there is a big role for the PL community to play

SLIDE 17

Support for continuous variables is essential for ML

(defm marsaglia-normal [mean var]
  ;; Marsaglia polar method: rejection-sample a point in the unit disc,
  ;; then transform one coordinate into a draw from Normal(mean, var)
  (let [d (uniform-continuous -1.0 1.0)
        x (sample d)
        y (sample d)
        s (+ (* x x) (* y y))]
    (if (< s 1)
      (+ mean (* (sqrt var) (* x (sqrt (* -2 (/ (log s) s))))))
      (marsaglia-normal mean var))))

Marsaglia Example Code
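A minimal usage sketch (hypothetical, not part of the original example): because marsaglia-normal is an ordinary defm, it can be called inside a query like any primitive, e.g. to put a prior on an unknown mean and condition on data with a normal likelihood:

(defquery marsaglia-mean [data sigma]
  (let [mu (marsaglia-normal 0.0 100.0)]        ; prior on the unknown mean
    (loop [xs data]
      (when (seq xs)
        (observe (normal mu sigma) (first xs))  ; likelihood for each observation
        (recur (rest xs))))
    (predict :mu mu)))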

SLIDE 18

LDS / Kalman Smoother

p(zn | zn−1) = N(zn | A zn−1, Γ)
p(xn | zn) = N(xn | C zn, Σ)

p(x1, . . . , xN, z1, . . . , zN) = p(z1) ∏_{n=2}^{N} p(zn | zn−1) ∏_{n=1}^{N} p(xn | zn)

[Figure: graphical model of the linear dynamical system, latent chain z1, z2, . . . , zn−1, zn, zn+1 with corresponding observations x1, x2, . . . , xn−1, xn, xn+1.]

Kalman Smoother Example Code
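A minimal Anglican sketch of such a model (a hypothetical one-dimensional simplification, not the linked workbook; a and c are scalar dynamics/observation coefficients, q and r the corresponding noise standard deviations):

(defquery lds-1d [observations a c q r]
  (let [zs (loop [xs observations, z 0.0, zs []]
             (if (empty? xs)
               zs
               (let [z-next (sample (normal (* a z) q))]      ; p(zn | zn-1)
                 (observe (normal (* c z-next) r) (first xs)) ; p(xn | zn)
                 (recur (rest xs) z-next (conj zs z-next)))))]
    (predict :states zs)))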

SLIDE 19

Questions

  • How possible and important is it to identify program fragments in which exact inference can be performed? Is it possible that the most important probabilistic programming output will be static analysis tools that identify fragments of programs/models in which exact inference can be performed and then also transform them into an amenable form for running exact inference, in effect Rao-Blackwellizing by algorithmic or symbolic integration and hiding of latent variables from inference algorithms?
  • Is it worth the effort to service (via such analyses, compilation, etc.) the few special cases in which exact inference can be performed?

SLIDE 20

Related Aside

  • While discrete variables can always be grounded exactly, i.e. constrained, continuous variables often cannot.

SLIDE 21

The Allure of Equality

(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (sample (flip 0.99))
                         (and (= sprinkler false) (= is-raining false)) (sample (flip 0.0))
                         (or (= sprinkler true) (= is-raining true))    (sample (flip 0.9)))]
    (observe (= wet-grass true))              ; the alluring equality-style observation
    (predict :s (hash-map :is-cloudy is-cloudy
                          :is-raining is-raining
                          :sprinkler sprinkler))))

SLIDE 22

Equality <=> Dirac Observe <=> Constraint

(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (sample (flip 0.99))
                         (and (= sprinkler false) (= is-raining false)) (sample (flip 0.0))
                         (or (= sprinkler true) (= is-raining true))    (sample (flip 0.9)))]
    (observe (dirac wet-grass) true)          ; equality as an observation of a Dirac distribution, i.e. a hard constraint
    (predict :s (hash-map :is-cloudy is-cloudy
                          :is-raining is-raining
                          :sprinkler sprinkler))))

p(x|y = o) ∝ δ(y − o)p(x, y) = p(x, y = o)

SLIDE 23

ABC Observe

(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (sample (flip 0.99))
                         (and (= sprinkler false) (= is-raining false)) (sample (flip 0.0))
                         (or (= sprinkler true) (= is-raining true))    (sample (flip 0.9)))]
    (observe (normal 0.0 tolerance) (d wet-grass true))  ; ABC: score the distance d between simulated and observed values under a noise kernel of width tolerance
    (predict :s (hash-map :is-cloudy is-cloudy
                          :is-raining is-raining
                          :sprinkler sprinkler))))

p(x|y = o) ∝ p(d(y, o))p(x, y)

SLIDE 24

Noisy Observe

(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (flip 0.99)
                         (and (= sprinkler false) (= is-raining false)) (flip 0.0)
                         (or (= sprinkler true) (= is-raining true))    (flip 0.9))]
    (observe wet-grass true)                  ; observe the datum under an explicit observation likelihood
    (predict :s (hash-map :is-cloudy is-cloudy
                          :is-raining is-raining
                          :sprinkler sprinkler))))

p(x|y = o) ∝ p(o|x)p(x)

SLIDE 25

Questions

  • Is it more important for purposes of language aesthetic and programmer convenience to syntactically allow constraint/equality observations than to minimize the risk of programmers writing "impossible" inference problems by forcing them to specify a “noisy” observation likelihood?
  • Is it possible to develop program transformations that succeed and fail in sufficiently intuitive ways to give programmers the facility of equality constraints when appropriate while not allowing them to make measure-zero observations of random variables whose support is “dangerously” complex (uncountable, high-dimensional, etc.)?

SLIDE 26

Outline

  • Supervised modeling and maximum likelihood aren’t the whole ML story
  • Exact inference is great when possible but too restrictive to be the sole focus
  • Continuous variables are essential to machine learning
  • Unsupervised modeling is a growing and important challenge
  • We can write (and actually perform inference) in some pretty advanced models using existing probabilistic programming systems now
  • Inference is key, and general purpose inference for probabilistic programming is hard, but there is room for improvement and there is a big role for the PL community to play

SLIDE 27

Unsupervised modeling and Approximate inference

  • Named by Yann LeCun @DALI as the most important area of research in ML
  • He probably meant deep-variational-autoencoders, but still…
  • It’s what humans do; it’s what an AI must do
  • Canonical pedagogical example: GMM
SLIDE 28

“Bayesian” GMM Review

[Graphical model: plate over n = 1, . . . , N containing zn → xn; plate over k = 1, . . . , K containing µk, Λk; mixing weights π; hyperparameters α, β, Λ0, ν.]

π ∼ Dirichlet(α)
Λk ∼ Wishart(Λ0, ν)
µk | Λk ∼ Normal(0, (βΛk)^−1)
zn | π ∼ Categorical(π)
xn | zn = k, µk, Λk ∼ Normal(µk, Λk^−1)

[Figure: fitted mixture density p(x) for increasing numbers of posterior samples, L = 1, 2, 5, 20, panels (a)-(f).]

Gaussian Mixture Model Example Code
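A minimal Anglican sketch in this spirit (a hypothetical simplification, not the linked workbook: one-dimensional data, a fixed number of components K, and a fixed observation standard deviation obs-sigma in place of Wishart-distributed precisions):

(defquery gmm-1d [data K alpha obs-sigma]
  (let [pi (sample (dirichlet (vec (repeat K alpha))))   ; mixing weights
        mu (mem (fn [k] (sample (normal 0.0 10.0))))]    ; component means, instantiated lazily
    (loop [xs data]
      (when (seq xs)
        (let [z (sample (discrete pi))]                  ; component assignment zn
          (observe (normal (mu z) obs-sigma) (first xs)) ; likelihood for xn
          (recur (rest xs)))))
    (predict :pi pi)))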

SLIDE 29

Outline

  • Supervised modeling and maximum likelihood aren’t the whole ML story
  • Exact inference is great when possible but too restrictive to be the sole focus
  • Continuous variables are essential to machine learning
  • Unsupervised modeling is a growing and important challenge
  • We can write (and actually perform inference) in some pretty advanced models using existing probabilistic programming systems now
  • Inference is key, and general purpose inference for probabilistic programming is hard, but there is room for improvement and there is a big role for the PL community to play

SLIDE 30

Advanced Models

  • Dirichlet Process Mixture Model Example Code
  • Unsupervised Hierarchical Clustering via the Hierarchical Dirichlet Process Example Code
  • Automata Learning via the Probabilistic Deterministic Infinite Automata Example Code
  • More: policy learning via inference, planning as MAP search, etc. (offline)

SLIDE 31

Dirichlet Process

A random probability measure with Dirichlet marginals: for base measure H, concentration α, and all finite partitions A1, . . . , Ak,

(θ(A1), . . . , θ(Ak)) ∼ Dirichlet(αH(A1), . . . , αH(Ak))

[Figure: a partition of the space into cells A1, . . . , A6.]

SLIDE 32

Constructive Definition

(defm pick-a-stick [stick v l k]
  ;; picks a stick given a stick generator
  ;; given a value v ~ uniform-continuous(0,1)
  ;; should be called with l = 0.0, k = 1
  (let [u (+ l (stick k))]
    (if (> u v)
      k
      (pick-a-stick stick v u (+ k 1)))))

(defm remaining [b k]
  (if (<= k 0)
    1
    (* (- 1 (b k)) (remaining b (- k 1)))))

(defm polya [stick]
  ;; given a stick generating function
  ;; polya returns a function that samples
  ;; stick indexes from the stick lengths
  (let [uc01 (uniform-continuous 0 1)]
    (fn []
      (let [v (sample uc01)]
        (pick-a-stick stick v 0.0 1)))))

(defm dirichlet-process-breaking-rule [alpha k]
  (sample (beta 1.0 alpha)))

(defm stick [breaking-rule]
  ;; given a breaking-rule function which
  ;; returns a value between 1 and 0 given a
  ;; stick index k, returns a function that
  ;; returns the stick length for index k
  (let [b (mem breaking-rule)]
    (fn [k]
      (if (< 0 k)
        (* (b k) (remaining b (- k 1)))
        0))))

[Figure: stick lengths (stick 1), (stick 2), (stick 3), . . . partitioning the unit interval at v ∼ uniform-continuous(0, 1).]
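For example (a hypothetical usage, not from the original slide), the pieces compose into a sampler over stick indexes whose stick lengths follow the Dirichlet process stick-breaking construction:

(defm dp-index-sampler [alpha]
  ;; returns a function of no arguments; each call samples a stick index
  (polya (stick (fn [k] (dirichlet-process-breaking-rule alpha k)))))

;; inside a query:
;; (let [next-index (dp-index-sampler 1.0)]
;;   (next-index))   ; => 1, 2, 3, ... with DP(alpha) stick-breaking weights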

SLIDE 33

Questions

  • Model prototyping <-> deployment. Where will we land?
SLIDE 34

Good News

  • Complexity reduction success
  • New probabilistic-programming-compatible approaches to inference are possible and under development

SLIDE 35

Increased Productivity

[Figure: lines of Matlab/Java code vs. lines of Anglican code for HPYP [Wood 2007], DDPMO [Neiswanger et al 2014], PDIA [Pfau 2010], Collapsed LDA, and a DP conjugate mixture; reference curve (fn [x] (logb 1.04 (+ 1 x))); axis labels: log, lin, p(⋅|data).]

Complexity Reduction Example Code

SLIDE 36

Message

  • Supervised modeling and maximum likelihood aren’t the whole ML story
  • Exact inference is great when possible but too restrictive to be the sole focus
  • Continuous variables are essential to machine learning
  • Unsupervised modeling is a growing and important challenge
  • We can write (and actually perform inference) in some pretty advanced models using existing probabilistic programming systems now
  • Inference is key, and general purpose inference for probabilistic programming is hard, but there is room for improvement and there is a big role for the PL community to play

SLIDE 37

Trace Probability

  • observe data points yn
  • internal random choices xn
  • simulate from f(xn | x1:n−1) by running the program forward
  • weight execution traces by g(yn | x1:n)

p(y1:N, x1:N) = ∏_{n=1}^{N} g(yn | x1:n) f(xn | x1:n−1)

[Figure: an execution trace unrolled as a graphical model, with latent choices x1, x2, x3, . . . (and sub-choices x11, x12, x13, x21, x22, . . .), observations y1, y2, y3, and parameters θ.]
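A toy sketch (a hypothetical model, not from the deck) making the correspondence concrete: inside a query, every sample contributes a prior factor f(xn | x1:n−1) to the trace probability and every observe contributes a likelihood factor g(yn | x1:n):

(defquery random-walk [ys]
  (loop [ys ys, x 0.0]
    (when (seq ys)
      (let [x-next (sample (normal x 1.0))]        ; f(xn | x1:n-1)
        (observe (normal x-next 0.5) (first ys))   ; g(yn | x1:n)
        (recur (rest ys) x-next)))))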

SLIDE 38

SMC

Iteratively,
  • simulate
  • weight
  • resample

[Figure: particles propagated and resampled at each observe, n = 1, n = 2, . . .; legend: Observe, Particle.]
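A minimal sketch of the weight/resample step (plain Clojure, hypothetical helper names; particles are maps holding a partial trace under :value and an importance weight under :weight):

(defn resample [particles]
  ;; multinomial resampling: draw (count particles) new particles with
  ;; probability proportional to their weights, then reset the weights
  (let [total (reduce + (map :weight particles))
        n     (count particles)
        pick  (fn []
                (loop [r (* (rand) total), ps particles]
                  (let [p (first ps)]
                    (if (or (<= r (:weight p)) (empty? (rest ps)))
                      (assoc p :weight (/ total n))
                      (recur (- r (:weight p)) (rest ps))))))]
    (repeatedly n pick)))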

SLIDE 39

SMC for Probabilistic Programming

Intuitively,
  • run
  • wait
  • fork

[Figure: program executions as threads, with continuations delimited at each observe.]

SLIDE 40

SMC Inner Loop

  • Sequential Monte Carlo is now a building block for other inference techniques
  • Particle MCMC
  • PIMH: “particle independent Metropolis-Hastings”
  • iCSMC: “iterated conditional SMC”

[Figure: repeated SMC sweeps s = 1, 2, 3 over observations n = 1, . . . , N.]

[Andrieu, Doucet, Holenstein 2010] [W., van de Meent, Mansinghka 2014]

SLIDE 41

SMC slowed down for clarity

SMC Parallelism Bottleneck

SLIDE 42

Particle Cascade

Asynchronously,
  • simulate
  • weight
  • branch

[Figure: particles simulated and branched asynchronously at n = 1, n = 2, . . .]

Paige, W., Doucet, Teh; NIPS 2014

SLIDE 43

Particle Cascade

SLIDE 44

Theoretical Properties

The particle cascade provides an unbiased estimator of the marginal likelihood, whose variance decreases proportionally to the number of initial particles K0:

p̂(y0:n) := (1/K0) Σ_{k=1}^{Kn} W^k_n

Theorem: For any K0 ≥ 1 and n ≥ 0, E[p̂(y0:n)] = p(y0:n).

Theorem: For any n ≥ 0, there exists a constant an such that V[p̂(y0:n)] < an / K0.

SLIDE 45

Results: Mean Squared Error

Suppose we wish to compute the posterior expectation of a function ψ(x0:n):

E[ψ(x0:n)] ≈ Σ_k w^k ψ(x^(k)_{0:n})

Under mild conditions, the mean squared error of this estimator is bounded, and also decreases as 1/K0.

Theorem: For any n ≥ 0, there exists a constant an < ∞ such that for any K0 ≥ 1 and bounded function ψ,

E[ ( Σ_{k=1}^{Kn} w^k ψ(x^(k)_{0:n}) − ∫ p(dx0:n | y0:n) ψ(x0:n) )^2 ] ≤ (an / K0) ‖ψ‖^2

SLIDE 46

Scalability: Multiple Cores

  • More cores == faster inference
  • Scales to multiple cores more efficiently than other particle-based methods

SLIDE 47

Opportunities

  • Parallelism: “Asynchronous Anytime Sequential Monte Carlo” [Paige, W., Doucet, Teh, NIPS 2014]
  • Backwards passing: “Particle Gibbs with Ancestor Sampling for Probabilistic Programs” [van de Meent, Yang, Mansinghka, W., AISTATS 2015]
  • Search: “Maximum a Posteriori Estimation by Search in Probabilistic Models” [Tolpin, W., SOCS, 2015]
  • Adaptation: “Output-Sensitive Adaptive Metropolis-Hastings for Probabilistic Programs” [Tolpin, van de Meent, Paige, W.; in submission]
  • Novel proposals: “Adaptive PMCMC” [Paige, W.; in submission]

SLIDE 48

Questions

  • What is "query"? What should we name it? What should it mean? Is

“query" just a "distribution" constructor? In Church rejection-query is an exact single sampler from a conditional distribution whereas mh-query is different. Can't we unify them all and simply call them "query" if you don't have first class distributions or "distribution" if you do and distributions support a sample interface? (Andreas said yes in the hall yesterday)

  • Should we expose the "internals" of a query to the end user? Is the

ability to draw a single unweighted approximate sample enough? Is a set of converging weighted samples better? Do we want an evidence approximation? Or converging expectation computations?

  • Are first-class distributions a requirement? Should they be hidden

from the user?

Bayes Net Example Code

SLIDE 49

Questions

  • What is the "deliverable" from inference?
  • How do we amortize inference across queries/tasks?
  • How do we define probabilistic program compilation:

automatic Rao-Blackwellization or transfer learning or somehow both?

SLIDE 50

Bubble Up

[Diagram: a probabilistic programming system layered as Inference → Probabilistic Programming Language → Models → Applications.]

SLIDE 51

Thank You

  • David Tolpin (lead architect Anglican)
  • Brooks Paige (workbooks)
  • Jan Willem van de Meent (workbooks)
  • Yura Perov (workbooks)
  • Chris Bishop (figures from PRML)
  • Yee Whye Teh (figures and text from BNP tutorials)
  • Funding: DARPA, Amazon, Microsoft
SLIDE 52

In loving memory

Mark Wood

1960-2015

SLIDE 53

[Timeline figure, 1990-2010: probabilistic programming systems by community.
PL: HANSEI, IBAL, Figaro.
ML/Stats: BUGS, WinBUGS, JAGS, STAN, LibBi, Venture, Anglican, Church, webChurch, Probabilistic-C, infer.NET, Blog, Factorie.
AI: Prism, Prolog, KMP, Problog, Simula.
Annotations: “Discrete RV’s Only”, “Bounded Recursion”.]