Tutorial on Probabilistic Programming in Machine Learning
Frank Wood
Play Along
1. Download and install Leiningen: http://leiningen.org/
2. Fork and clone the Anglican Examples repository: git@bitbucket.org:fwood/anglican-examples.git
Probabilistic Deterministic Infinite Automata (PDIA)
– learned biases towards compact (few states) world-models that are fast approximate predictors
– Problem: ~6-12 months
Pfau, Bartlett, and W. Probabilistic Deterministic Infinite Automata, NIPS, 2011
Doshi-Velez, Pfau, W., and Roy. Bayesian Nonparametric Methods for Partially-Observable Reinforcement Learning, TPAMI, 2013
A Prior over PDFA with a bounded number of states

Notation: Q (states), Σ (alphabet, symbols σj; ε denotes the empty string), δ : Q × Σ → Q (transition function), π : Q × Σ → [0, 1] (emission distributions), q0 ∈ Q (initial state), xt ∈ Σ (emitted symbols), ξt ∈ Q (state at time t), α, α0, β (concentration parameters).

µ ∼ Dir(α0/|Q|)
φj ∼ Dir(αµ),              j = 0, . . . , |Σ| − 1   (iid probability vector)
δ(qi, σj) = δij ∼ φj,      i = 0, . . . , |Q| − 1
πqi ∼ Dir(β/|Σ|),          i = 0, . . . , |Q| − 1   (iid probability vector)
ξ0 = q0,  ξt = δ(ξt−1, xt−1),  xt ∼ πξt

(Table residue: example δ table mapping states q0, q1, . . . and symbols σ0, σ1, σ2 to next states, and an example π table of per-state emission counts.)
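The generative model above can be sampled ancestrally. A minimal Python sketch with hypothetical hyperparameter values (not the paper's implementation, which places a nonparametric prior over an unbounded state set); the small additive jitter on the Dirichlet parameters is purely for numerical safety:

```python
# Ancestral sampling from a bounded-state PDFA prior.
import numpy as np

def sample_pdfa(num_states, num_symbols, alpha, alpha0, beta, T, rng):
    # mu ~ Dir(alpha0/|Q|): shared mean distribution over next states
    mu = rng.dirichlet(np.full(num_states, alpha0 / num_states))
    # phi_j ~ Dir(alpha * mu); delta(q_i, sigma_j) = delta_ij ~ phi_j
    delta = np.empty((num_states, num_symbols), dtype=int)
    for j in range(num_symbols):
        phi_j = rng.dirichlet(alpha * mu + 0.05)  # jitter keeps params > 0
        delta[:, j] = rng.choice(num_states, size=num_states, p=phi_j)
    # pi_{q_i} ~ Dir(beta/|Sigma|): per-state emission distributions
    pi = rng.dirichlet(np.full(num_symbols, beta / num_symbols), size=num_states)
    # xi_0 = q_0; x_t ~ pi_{xi_t}; xi_{t+1} = delta(xi_t, x_t)
    state, xs = 0, []
    for _ in range(T):
        x = rng.choice(num_symbols, p=pi[state])
        xs.append(int(x))
        state = delta[state, x]
    return delta, pi, xs

rng = np.random.default_rng(0)
delta, pi, xs = sample_pdfa(num_states=5, num_symbols=3,
                            alpha=5.0, alpha0=5.0, beta=1.0, T=20, rng=rng)
```

Sampling δ column-by-column from φj is what ties the transition structure together: symbols reuse the same popular destination states, which is what biases draws toward compact automata.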
Probabilistic Deterministic Finite Automata

(Figures: example PDFA with edges labeled symbol/probability, e.g. B/0.8125 — a trigram model as a DFA (without probability), the Even Process, the Reber Grammar, and the Foulkes Grammar.)
Dependent Dirichlet Process Mixture of Objects
– objects that appear, disappear, and occlude
– learned functions
Problem:
– ~1 year
– ~1 page of LaTeX math
– ~3 pages of LaTeX math
Neiswanger, Wood, and Xing. The Dependent Dirichlet Process Mixture of Objects for Detection-free Tracking and Object Modeling, AISTATS, 2014
Caron, F., Neiswanger, W., Wood, F., Doucet, A., & Davy, M. Generalized Pólya Urn for Time-Varying Pitman-Yor Processes. JMLR. To appear 2015
(Figure: DDPMO detection-free tracking results, panels (a)-(j).)

focus
– models using existing probabilistic programming systems now
– programming is hard, but there is room for improvement and there is a big role for the PL community to play
– Should we give up on anything resembling combinatorial search? As a corollary, for anything that does reduce to combinatorial search, should we just compile to SAT solvers?
– Is it better for programmer convenience to syntactically allow constraint/equality observations, or to minimize the risk of programmers writing "impossible" inference problems by forcing them to specify a “noisy” observation likelihood?
– Can inference engines succeed and fail in sufficiently intuitive ways to give programmers the facility of equality constraints when appropriate, while not allowing them to make measure-zero observations of random variables whose support is “dangerously” complex (uncountable, high-dimensional, etc.)?
– Is it possible that the most important probabilistic programming output will be static analysis tools that identify fragments of models in which exact inference can be performed and then also transform them into an amenable form for running exact inference?
– Isn't rejection-query just an exact sampler from a conditional distribution, vs. mh-query which is different? Can't we unify them all and simply call them "query" if you don't have first-class distributions, or "distribution" if you do and distributions support a sample interface?
– Is a single unweighted approximate sample enough? Is a set of converging weighted samples enough? Do we need exact samples? Do we want an evidence approximation? Or converging expectation computations? Are compositions of the latter meaningful in any theoretically interpretable way?
– Rao-Blackwellization or transfer learning or both?
(defn sgd [X y f' theta-init num-iters stepsize]
  (loop [theta theta-init
         num-iters num-iters]
    (if (= num-iters 0)
      theta
      (let [n (rand-int (count y))
            gradient (f' (get X n) (get y n) theta)
            theta (map - theta (map #(* stepsize %) gradient))]
        (recur theta (dec num-iters))))))
p(yn = 1 | xn, w) = 1 / (1 + exp{−w0 − Σd wd xnd})
sepal length  sepal width  petal length  petal width  class
5.0           2.0          3.5           1.0          Iris-versicolor
6.0           2.2          4.0           1.0          Iris-versicolor
6.2           2.2          4.5           1.5          Iris-versicolor
6.0           2.2          5.0           1.5          Iris-virginica
4.5           2.3          1.3           0.3          Iris-setosa
5.0           2.3          3.3           1.0          Iris-versicolor
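A Python sketch of the same SGD loop applied to the logistic regression model above, on a tiny made-up one-feature dataset (not the Iris data); all names and values here are illustrative:

```python
# SGD for logistic regression: mirrors the Clojure sgd loop above.
import math, random

def predict(w, x):
    # p(y = 1 | x, w) = 1 / (1 + exp(-w0 - sum_d w_d x_d))
    z = w[0] + sum(wd * xd for wd, xd in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def grad(x, y, w):
    # gradient of the negative log-likelihood for one example
    err = predict(w, x) - y
    return [err] + [err * xd for xd in x]

def sgd(X, y, grad_fn, w_init, num_iters, stepsize):
    w = list(w_init)
    rnd = random.Random(0)
    for _ in range(num_iters):
        n = rnd.randrange(len(y))              # pick a random example
        g = grad_fn(X[n], y[n], w)
        w = [wi - stepsize * gi for wi, gi in zip(w, g)]
    return w

# toy separable data: y = 1 when x > 2
X = [[0.0], [1.0], [3.0], [4.0]]
y = [0, 0, 1, 1]
w = sgd(X, y, grad, [0.0, 0.0], 5000, 0.1)
```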
naturally composable
approximation to a (usually) low-dimensional conditional probability distribution
(defm marsaglia-normal [mean var]
  (let [d (uniform-continuous -1.0 1.0)
        x (sample d)
        y (sample d)
        s (+ (* x x) (* y y))]
    (if (< s 1)
      (+ mean (* (sqrt var)
                 (* x (sqrt (* -2 (/ (log s) s))))))
      (marsaglia-normal mean var))))
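The same rejection scheme in Python, for comparison (a minimal sketch of the Marsaglia polar method, with a loop instead of the recursion):

```python
# Marsaglia polar method: rejection-sample (x, y) in the unit disc,
# then transform to a normal draw with the given mean and variance.
import math, random

def marsaglia_normal(mean, var, rnd=random):
    while True:
        x = rnd.uniform(-1.0, 1.0)
        y = rnd.uniform(-1.0, 1.0)
        s = x * x + y * y
        if 0.0 < s < 1.0:
            return mean + math.sqrt(var) * x * math.sqrt(-2.0 * math.log(s) / s)

rnd = random.Random(42)
samples = [marsaglia_normal(0.0, 1.0, rnd) for _ in range(100000)]
```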
p(zn | zn−1) = N(zn | A zn−1, Γ)
p(xn | zn) = N(xn | C zn, Σ)

(Graphical model: latent chain z1 → z2 → · · · with an observation xn attached to each zn.)

p(x1, . . . , xN, z1, . . . , zN) = p(z1) ∏n=2..N p(zn | zn−1) ∏n=1..N p(xn | zn)
Are there static analysis tools that can identify program fragments in which exact inference can be performed? Is it possible that the most important probabilistic programming output will be static analysis tools that identify programs/models in which exact inference can be performed and then also transform them into an amenable form for running exact inference, in effect Rao-Blackwellizing by algorithmic or symbolic integration and hiding of latent variables from inference algorithms? Should we handle specially (compilation, etc.) the few special cases in which exact inference can be performed?
(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (sample (flip 0.99))
                         (and (= sprinkler false) (= is-raining false)) (sample (flip 0.0))
                         (or (= sprinkler true) (= is-raining true))    (sample (flip 0.9)))]
    (observe (= wet-grass true))
    (predict :s (hash-map :is-cloudy is-cloudy
                          :is-raining is-raining
                          :sprinkler sprinkler))))
(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (sample (flip 0.99))
                         (and (= sprinkler false) (= is-raining false)) (sample (flip 0.0))
                         (or (= sprinkler true) (= is-raining true))    (sample (flip 0.9)))]
    (observe (dirac wet-grass) true)
    (predict :s (hash-map :is-cloudy is-cloudy
                          :is-raining is-raining
                          :sprinkler sprinkler))))
(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (sample (flip 0.99))
                         (and (= sprinkler false) (= is-raining false)) (sample (flip 0.0))
                         (or (= sprinkler true) (= is-raining true))    (sample (flip 0.9)))]
    (observe (normal 0.0 tolerance) (d wet-grass true))
    (predict :s (hash-map :is-cloudy is-cloudy
                          :is-raining is-raining
                          :sprinkler sprinkler))))
(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (flip 0.99)
                         (and (= sprinkler false) (= is-raining false)) (flip 0.0)
                         (or (= sprinkler true) (= is-raining true))    (flip 0.9))]
    (observe wet-grass true)
    (predict :s (hash-map :is-cloudy is-cloudy
                          :is-raining is-raining
                          :sprinkler sprinkler))))
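For a discrete network this small, the conditional distribution these queries target can be computed exactly by enumeration. A Python sketch (function name is mine, not part of any system above):

```python
# Exact posterior over (cloudy, raining, sprinkler) given wet-grass = true,
# by enumerating all 8 assignments and conditioning on the observation.
from itertools import product

def posterior_given_wet_grass():
    weights = {}
    for cloudy, raining, sprinkler in product([True, False], repeat=3):
        p = 0.5                                                    # p(cloudy)
        p *= (0.8 if cloudy else 0.2) if raining else (0.2 if cloudy else 0.8)
        p *= (0.1 if cloudy else 0.5) if sprinkler else (0.9 if cloudy else 0.5)
        # p(wet-grass = true | sprinkler, raining), as in the cond above
        if sprinkler and raining:
            wet = 0.99
        elif not sprinkler and not raining:
            wet = 0.0
        else:
            wet = 0.9
        weights[(cloudy, raining, sprinkler)] = p * wet
    z = sum(weights.values())
    return {k: v / z for k, v in weights.items()}

post = posterior_given_wet_grass()
p_rain = sum(p for (c, r, s), p in post.items() if r)
```

This is the reference answer any of the sampling-based queries above should converge to; here p(raining | wet-grass) works out to about 0.708.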
Is it better for programmer convenience to syntactically allow constraint/equality observations, or to minimize the risk of programmers writing "impossible" inference problems by forcing them to specify a “noisy” observation likelihood? Can inference engines succeed and fail in sufficiently intuitive ways to give programmers the facility of equality constraints when appropriate, while not allowing them to make measure-zero observations of random variables whose support is “dangerously” complex (uncountable, high-dimensional, etc.)?
(Plate diagram: zn, xn in a plate of size N; µk, Λk in a plate of size K; hyperparameters α, β, Λ0, ν.)

π ∼ Dirichlet(α)
Λk ∼ Wishart(Λ0, ν)
µk ∼ Normal(0, (βΛk)−1)
zn ∼ Discrete(π)
xn ∼ Normal(µzn, Λzn−1)
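The generative process above can be sketched by ancestral sampling. Hyperparameter values below are hypothetical, and the Wishart draw uses the sum-of-outer-products construction (valid for integer degrees of freedom):

```python
# Ancestral sampling from a Bayesian Gaussian mixture with a
# Wishart prior on precisions and a Normal prior on means.
import numpy as np

def sample_gmm(N, K, alpha, beta, Lambda0, nu, rng):
    D = Lambda0.shape[0]
    pi = rng.dirichlet(np.full(K, alpha))                # pi ~ Dirichlet(alpha)
    L = np.linalg.cholesky(Lambda0)
    Lam = np.empty((K, D, D))
    mu = np.empty((K, D))
    for k in range(K):
        V = L @ rng.standard_normal((D, nu))             # nu draws from N(0, Lambda0)
        Lam[k] = V @ V.T                                 # Lambda_k ~ Wishart(Lambda0, nu)
        cov_mu = np.linalg.inv(beta * Lam[k])            # mu_k ~ N(0, (beta Lambda_k)^-1)
        mu[k] = rng.multivariate_normal(np.zeros(D), cov_mu)
    z = rng.choice(K, size=N, p=pi)                      # z_n ~ Discrete(pi)
    X = np.stack([rng.multivariate_normal(mu[k], np.linalg.inv(Lam[k]))
                  for k in z])                           # x_n ~ N(mu_{z_n}, Lambda_{z_n}^-1)
    return X, z

rng = np.random.default_rng(1)
X, z = sample_gmm(N=200, K=3, alpha=1.0, beta=0.1,
                  Lambda0=np.eye(2), nu=4, rng=rng)
```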
(Figure: panels (a)-(f) on [−2, 2] × [−2, 2], showing p(x) and approximations for L = 1, 2, 5, 20.)
For any finite measurable partition (A1, . . . , Ak):
(θ(A1), . . . , θ(Ak)) ∼ Dirichlet(αH(A1), . . . , αH(Ak))
(defm dirichlet-process-breaking-rule [alpha k]
  (sample (beta 1.0 alpha)))

(defm remaining [b k]
  (if (<= k 0)
    1
    (* (- 1 (b k)) (remaining b (- k 1)))))

(defm stick [breaking-rule]
  ;; given a breaking-rule function which returns a value
  ;; between 1 and 0 given a stick index k, returns a
  ;; function that returns the stick length for index k
  (let [b (mem breaking-rule)]
    (fn [k]
      (if (< 0 k)
        (* (b k) (remaining b (- k 1)))
        0))))

(defm pick-a-stick [stick v l k]
  ;; picks a stick given a stick-length function and a
  ;; value v ~ uniform-continuous(0,1);
  ;; should be called with l = 0.0, k = 1
  (let [u (+ l (stick k))]
    (if (> u v)
      k
      (pick-a-stick stick v u (+ k 1)))))

(defm polya [stick]
  ;; given a stick-length function, polya returns a
  ;; function that samples stick indexes from the
  ;; stick lengths
  (let [uc01 (uniform-continuous 0 1)]
    (fn []
      (let [v (sample uc01)]
        (pick-a-stick stick v 0.0 1)))))
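The same stick-breaking construction in Python (a minimal sketch; function names are mine): beta(1, α) breaks are memoized lazily, and indices are drawn by walking the sticks until the cumulative length exceeds a uniform draw.

```python
# Lazy stick-breaking for a Dirichlet process, with a "polya"-style
# sampler over stick indices.
import random

def make_stick(alpha, rnd):
    breaks = {}                       # memoized b(k) ~ Beta(1, alpha)
    def b(k):
        if k not in breaks:
            breaks[k] = rnd.betavariate(1.0, alpha)
        return breaks[k]
    def stick(k):
        # length of stick k: b(k) * prod_{j<k} (1 - b(j))
        length = b(k)
        for j in range(1, k):
            length *= 1.0 - b(j)
        return length
    return stick

def pick_a_stick(stick, v):
    # walk the sticks until cumulative length exceeds v ~ U(0, 1)
    k, cum = 1, 0.0
    while True:
        cum += stick(k)
        if cum > v:
            return k
        k += 1

rnd = random.Random(7)
stick = make_stick(alpha=1.0, rnd=rnd)
draws = [pick_a_stick(stick, rnd.random()) for _ in range(1000)]
```

Memoizing the breaks is what lets the infinite-dimensional object be represented with finite work: only sticks actually visited are ever instantiated.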
(Chart: lines of Matlab/Java code vs. lines of Anglican code, log scale, for HPYP [Wood 2007], DDPMO [Neiswanger et al 2014], PDIA [Pfau 2010], collapsed LDA, and a DP conjugate mixture.)
by running the program forward
(Graphical model: parameters θ with latent states x1, x2, . . . and observations y1, y2, . . .)

p(y1:N, x1:N) = ∏n=1..N g(yn | x1:n) f(xn | x1:n−1)

(Figure: SMC particle system evolving over steps n = 1, 2, . . .)
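A bootstrap SMC sketch for this kind of state-space factorization, instantiated for a made-up 1-D Gaussian random-walk model, f(xn | xn−1) = N(xn−1, 1) and g(yn | xn) = N(xn, 0.5²); all parameter values are hypothetical, and constant factors of g are dropped from the weights:

```python
# Bootstrap particle filter: propose through f, weight by g, resample.
import math, random

def bootstrap_smc(ys, K, rnd):
    xs = [0.0] * K                                   # x_0 = 0
    log_Z = 0.0                                      # running log-marginal estimate
    for y in ys:
        xs = [x + rnd.gauss(0.0, 1.0) for x in xs]   # propose via f(x_n | x_{n-1})
        ws = [math.exp(-0.5 * ((y - x) / 0.5) ** 2)  # weight by g(y_n | x_n),
              for x in xs]                           # normalizing constant dropped
        log_Z += math.log(sum(ws) / K)
        xs = rnd.choices(xs, weights=ws, k=K)        # multinomial resampling
    return xs, log_Z

rnd = random.Random(3)
xs, log_Z = bootstrap_smc([1.0, 2.0, 3.0], K=2000, rnd=rnd)
```

After the final step, xs is an equally weighted particle approximation to p(x3 | y1:3); for this linear-Gaussian model the exact filtered mean (from a Kalman filter) is about 2.79.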
“particle independent Metropolis-Hastings” and “conditional SMC” [Andrieu, Doucet, Holenstein 2010] [W., van de Meent, Mansinghka 2014]

(Figure: conditional SMC sweeps s = 1, 2, 3 over steps n = 1, 2.)
Paige, W., Doucet, Teh; NIPS 2014
The SMC marginal-likelihood estimator

p̂(y0:n) := (1/K0) Σk=1..Kn Wnk

is unbiased, E[p̂(y0:n)] = p(y0:n), with variance bound V[p̂(y0:n)] < an/K0, and for test functions ψ

E[( Σk=1..Kn wnk ψ(x(k)0:n) − ∫ ψ(x0:n) p(dx0:n | y0:n) )²] ≤ (an/K0) ‖ψ‖²
“Asynchronous Anytime Sequential Monte Carlo” [Paige, W., Doucet, Teh NIPS 2014]
“Particle Gibbs with Ancestor Sampling for Probabilistic Programs” [van de Meent, Yang, Mansinghka, W. AISTATS 2015]
“Maximum a Posteriori Estimation by Search in Probabilistic Models” [Tolpin, W., SOCS, 2015]
“Output-Sensitive Adaptive Metropolis-Hastings for Probabilistic Programs” [Tolpin, van de Meent, Paige, W.; in submission]
“Adaptive PMCMC” [Paige, W.; in submission]
“query" just a "distribution" constructor? In Church rejection-query is an exact single sampler from a conditional distribution whereas mh-query is different. Can't we unify them all and simply call them "query" if you don't have first class distributions or "distribution" if you do and distributions support a sample interface? (Andreas said yes in the hall yesterday)
ability to draw a single unweighted approximate sample enough? Is a set of converging weighted samples better? Do we want an evidence approximation? Or converging expectation computations?
from the user?
In loving memory
1960-2015
(Timeline, 1990-2010, of probabilistic programming systems by community:
PL: HANSEI, IBAL, Figaro
ML/STATS: BUGS, WinBUGS, JAGS, STAN, LibBi, infer.NET, Factorie, Blog, Church, webChurch, Venture, Anglican, Probabilistic-C
AI: Simula, Prolog, Prism, KMP, Problog)