
slide-1
SLIDE 1

Probabilistic Programming for Bayesian Machine Learning

Luke Ong 翁之昊

University of Oxford

1

slide-2
SLIDE 2

What is Machine Learning?

How to make machines learn from data?

  • Computer Science: AI, computer vision, information retrieval
  • Statistics: learning theory, learning and inference from data
  • Cognitive Science / Psychology: perception, computational linguistics, mathematical psychology
  • Neuroscience: neural networks, neural information processing
  • Engineering: signal processing, adaptive and optimal control, information theory, robotics
  • Economics: decision theory, game theory, operations research

Many related terms: neural networks, pattern recognition, data mining, data science, statistical modelling, AI, machine learning, etc.

2

slide-3
SLIDE 3

Truly useful real-world applications of ML / AI

3

  • Autonomous vehicles / robotics / drones
  • Computer vision: facial recognition
  • Financial prediction / automated trading
  • Recommender systems
  • Language / speech technologies
  • Scientific modelling / data analysis

slide-4
SLIDE 4

Intense global interest (and hype) in AI

China’s “Next Generation AI Dev. Plan (2017)”

  • 1. Join first echelon by 2020 (big data, swarm AI, theory)
  • 2. Breakthroughs by 2025 (medicine, AI laws, security & control)
  • 3. World-leading by 2030 with CNY 1 trillion (≈ USD 150 billion) domestic AI industry (social governance, defence, industry)

4

* Total equity funding of AI start-ups.
Allen, G. C.: Understanding China's AI Strategy. Center for a New American Security, 2019.
Ding, J.: Deciphering China's AI Dream. Future of Humanity Institute, Univ. of Oxford, 2018.
Xue Lan: China AI Development Report. Tsinghua University, 2018.

slide-5
SLIDE 5

Intense global interest (and hype) in AI

China’s “Next Generation AI Dev. Plan (2017)”

  • 1. Join first echelon by 2020 (big data, swarm AI, theory)
  • 2. Breakthroughs by 2025 (medicine, AI laws, security & control)
  • 3. World-leading by 2030 with USD 150 billion domestic AI industry (social governance, defence, industry)

5

“Objective 1 (technologies & market applications) was already achieved in mid-2018. China is:

  • #1 in AI funding*: globally 48% from China; 38% from US
  • #1 in total and highly cited AI papers worldwide
  • #1 in AI patents” (Tsinghua U. Report, 2018)

“A ‘Sputnik moment’ was felt by the West.” (Stuart Russell, 2018)

Much of the hype concerns Deep Learning

* Total equity funding of AI start-ups.
Allen, G. C.: Understanding China's AI Strategy. Center for a New American Security, 2019.
Ding, J.: Deciphering China's AI Dream. Future of Humanity Institute, Univ. of Oxford, 2018.
Xue Lan: China AI Development Report. Tsinghua University, 2018.

slide-6
SLIDE 6

6

slide-7
SLIDE 7

How to situate Deep Learning in ML?

Discriminative ML

Directly learn to predict: given training data (input-output pairs), learn a parametrised (non-linear) function fθ from inputs to outputs.

  • Training uses data to estimate the optimal value θ* of the parameter θ.
  • Prediction: given an unseen input x′, return the output y′ := fθ*(x′).

Examples: neural nets, support vector machines, decision tree ensembles (e.g. random forests).

Generative (probabilistic) ML

Build a probabilistic model that explains observed data by generating them, i.e. a simulator. The model defines a joint probability p(X, Y) of inputs X (latent variables and parameters) and outputs Y (data).

7

The parameter θ is typically uninterpretable.

slide-8
SLIDE 8

Deep Learning

Limitations

  • 1. Very data hungry
  • 2. Compute-intensive to train and deploy; finicky to optimise
  • 3. Easily fooled by adversarial examples.
  • 4. Poor at giving uncertainty estimates, leading to over-confidence, so unsuitable for safety-critical systems
  • 5. Hard to use prior knowledge & symbolic representation
  • 6. Uninterpretable black boxes: parameters have no real-world meanings

8

ConvNet figure from Clarifai Technology

slide-9
SLIDE 9

Deep learning ad infinitum?

Give up probability, logic, symbolic representation? “Deep learning will plateau out: many things are needed to make further progress, such as reasoning, and programmable models.” Pace Bayesian deep learning / uncertainty in deep learning.

Neal, R. M.: Bayesian Learning for Neural Networks (Vol. 118). Springer, 1996.
Gal, Y.: Uncertainty in Deep Learning. Univ. of Cambridge PhD thesis, 2016.
Gal & Ghahramani: Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML 2016.

9

“Many more applications are completely out of reach for current deep learning techniques — even given vast amounts of human-annotated data. … The main directions in which I see promise are models closer to general-purpose programs.” Francois Chollet (deep learning expert, Keras inventor)

slide-10
SLIDE 10

In contrast to Deep Learning…

Probabilistic Machine Learning

Given a system with some data:

  • 1. Build a model capable of generating data observable from the system.
  • 2. Use probability to express belief / uncertainty (including noise) about the model.
  • 3. Apply Bayes' Rule (= Bayesian inversion) to learn from data:
  • a. infer unknown quantities
  • b. predict
  • c. explore and adapt models

10

Thomas Bayes (1701-1761)

slide-11
SLIDE 11

Bayes’ Rule:

Given observed data 𝒟 = {d1, ⋯, dN}:

  P(θ ∣ 𝒟) = P(𝒟 ∣ θ) P(θ) / P(𝒟),  i.e.  P(θ ∣ 𝒟) ∝ P(𝒟 ∣ θ) × P(θ)

  Posterior ∝ Likelihood × Prior

(θ: parameter, latent; 𝒟: data, observed)

  • Likelihood function: P(𝒟 ∣ θ), not a probability (w.r.t. θ).
  • Model evidence: P(𝒟) = ∫ P(𝒟 ∣ θ) p(θ) dθ, the normalising constant (the computational challenge in ML).
  • Significance of Bayes’ Rule: it prescribes how our prior belief about θ is changed after observing the data 𝒟.

11

Thomas Bayes (1701-1761)

Two axioms “from which everything follows”*:

  Sum Rule:  P(x) = ∑y P(x, y)
  Product Rule:  P(x, y) = P(x) P(y ∣ x)

* Cox 1946; Jaynes 1996; van Horn 2003
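To make these rules concrete, here is a minimal Python sketch (an illustrative example added here, not from the slides) that applies the sum rule, the product rule and Bayes' Rule to a small discrete joint distribution:

# A small, hypothetical joint distribution P(x, y) over weather and transport.
joint = {
    ("sunny", "bike"): 0.30, ("sunny", "bus"): 0.10,
    ("rainy", "bike"): 0.15, ("rainy", "bus"): 0.45,
}

# Sum rule: P(x) = sum_y P(x, y), and similarly for P(y)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Product rule: P(x, y) = P(x) P(y | x), hence P(y | x) = P(x, y) / P(x)
p_y_given_x = {(x, y): p / p_x[x] for (x, y), p in joint.items()}

# Bayes' Rule: P(x | y) = P(y | x) P(x) / P(y)  (posterior from likelihood and prior)
p_x_given_y = {(x, y): p_y_given_x[(x, y)] * p_x[x] / p_y[y] for (x, y) in joint}

print(p_x_given_y[("rainy", "bus")])   # ≈ 0.818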

slide-12
SLIDE 12

What is Probabilistic Programming?

Problem: Probabilistic model development, and the design & implementation of inference algorithms, are time-consuming and error-prone, requiring bespoke constructions.

Probabilistic programming is a general-purpose means of

  • 1. expressing probabilistic models as programs, &
  • 2. automatically performing Bayesian inference.

12

slide-13
SLIDE 13

What is Probabilistic Programming?

Problem: probabilistic model development, and the design & implementation of inference algorithms, are time-consuming, error-prone and (unnecessarily) bespoke.

Probabilistic programming is a general-purpose means of

  • 1. expressing probabilistic models as programs, &
  • 2. automatically performing Bayesian inference.

Separation of concerns. Probabilistic programming systems

  • enable data scientists / domain experts to focus on designing good models,
  • leaving the development of efficient inference engines to experts in Bayesian statistics, machine learning & prog. langs.

Key advantage: democratise access to machine learning.

13

* Wood, F.: Probabilistic Programming. NIPS 2015 tutorial.
* Tenenbaum & Mansinghka: Engineering and Reverse-Engineering Intelligence Using Probabilistic Programs, Program Induction, and Deep Learning. NIPS 2017 tutorial.

slide-14
SLIDE 14

Bayesian / Probabilistic Pipeline

14

[Pipeline diagram] Knowledge & Questions → (Make Assumptions) Prior Probability → (Discover Patterns) Joint Probability; Data → (Infer, Predict, Explore) Posterior Probability

The pipeline distinguishes the roles of

  • 1. knowledge and questions (domain experts),
  • 2. making assumptions (data scientists & ML experts),
  • 3. building models and computing inferences (data scientists & ML experts), and
  • 4. implementing applications (ML users and practitioners).
slide-15
SLIDE 15

Bayesian / Probabilistic Pipeline Loop

15

[Pipeline-loop diagram] Knowledge & Questions → (Make Assumptions) Prior Probability → (Discover Patterns) Joint Probability; Data → (Infer, Predict, Explore) Posterior Probability → Criticise Model → back to assumptions

Probabilistic programming provides the means to iterate the Bayesian pipeline — the posterior probability of the nth iterate becomes the prior of the (n + 1)th iterate. Loop Robustness: asymptotic consensus of Bayesian posterior inference.

slide-16
SLIDE 16

Asymptotic certainty of posterior inference

Theorem (Bernstein-von Mises). Assume the data set 𝒟n (comprising n data points) was generated from some true θ*. Under some regularity conditions, provided p(θ*) > 0,

  lim n→∞ p(θ ∣ 𝒟n) = δ(θ − θ*).

In the unrealisable case, where the data was generated from some p*(x) which cannot be modelled by any θ, the posterior will converge to

  lim n→∞ p(θ ∣ 𝒟n) = δ(θ − θ̂),

where θ̂ minimises KL(p*(x) ∣∣ p(x ∣ θ)).

16

Doob, 1949; Freedman, 1963

The posterior distribution for unknown quantities in any problem is effectively asymptotically independent of the prior distribution as the data sample grows large.

slide-17
SLIDE 17

Asymptotic consensus of posterior inference

  • Theorem. Take two Bayesians with different priors, p1(θ) and p2(θ), observing the same data 𝒟n. Assume p1 and p2 have the same support. Then, as n → ∞, the posteriors p1(θ ∣ 𝒟n) and p2(θ ∣ 𝒟n) converge in the uniform distance between distributions, ρ(P1, P2) := supE |P1(E) − P2(E)|.
17

Tanner: Tools for Statistical Inference. Springer, 1996. (Ch. 2)

  • B. J. K. Kleijn, A. W. van der Vaart, et al.: The Bernstein-von-Mises theorem under misspecification. Electronic Journal of Statistics, 6:354–381, 2012.
slide-18
SLIDE 18

What is a probabilistic programming language?

A probabilistic program describes “prior × likelihood”, which is not a probability measure. The Inference Problem: find the normalising constant so as to compute the posterior probability.

18

P(θ ∣ 𝒟) ∝ P(𝒟 ∣ θ) × P(θ)

Posterior ∝ Likelihood × Prior

Gordon et al.: Probabilistic Programming. FOSE 2014.
Staton, S.: Commutative Semantics for Probabilistic Programming. ESOP 2017.

A PPL is a (deterministic) programming lang. plus 3 probabilistic constructs, corresponding to prior, likelihood and posterior, i.e.

  • 1. sample - draws from a probability distribution
  • 2. observe - records the likelihood of an observed data point
  • 3. normalise - computes the normalising constant and (hence) the posterior probability

slide-19
SLIDE 19

Current Probabilistic Programming Languages (incomplete)

19

Picture based on Wood, Introduction to Probabilistic Programming, 2018

Anglican: an extension of Clojure, a variant of Lisp

slide-20
SLIDE 20

20

High school probability problem

  • 1. I’ve forgotten what day it is.
  • 2. I know: Sundays: on average 3 WeChats per hour. Other days: on average 10 WeChats per hour.
  • 3. I received 4 WeChats in a given hour.
  • 4. Is it Sunday?
slide-21
SLIDE 21

Bernoulli distribution, bernoulli(p), has two outcomes: 1 with probability p, 0 with probability 1 − p.

21

Poisson distribution, poisson(λ), gives the probability of a given number of events occurring in a fixed time interval, where λ is the average number of events occurring in the interval.

[Plots] The probability mass functions of poisson(3) and poisson(10).

High school probability problem

  • 1. I’ve forgotten what day it is.
  • 2. I know: Sundays: on average 3 WeChats per hour. Other days: on average 10 WeChats per hour.
  • 3. I received 4 WeChats in a given hour.
  • 4. Is it Sunday?
slide-22
SLIDE 22

Solution 0: Easy application of Bayes’ Rule

  • Prob. mass function of Poisson(λ): pλ(k) = λ^k e^(−λ) / k!, for k ∈ {0, 1, 2, ⋯}

P(day = Sun ∣ #WeChats = 4)
  = P(day = Sun) ⋅ P(#WeChats = 4 ∣ day = Sun) / ∑d P(#WeChats = 4 ∣ day = d) ⋅ P(day = d)
  = (1/7) ⋅ p3(4) / ((1/7) ⋅ p3(4) + (6/7) ⋅ p10(4))
  = p3(4) / (p3(4) + 6 ⋅ p10(4))
  = 0.168 / (0.168 + 0.114) ≈ 0.597
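The same calculation can be checked numerically; the following short Python snippet (a sketch added here, not part of the original slides) evaluates the formula above:

import math

def poisson_pmf(k, lam):
    # p_lambda(k) = lambda^k * exp(-lambda) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

k = 4                      # number of WeChats received in the hour
prior_sunday = 1 / 7       # prior P(day = Sun)
numerator = prior_sunday * poisson_pmf(k, 3)
evidence = numerator + (1 - prior_sunday) * poisson_pmf(k, 10)
print(numerator / evidence)   # posterior P(Sunday | 4 WeChats) ≈ 0.597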

22

High school probability problem

  • 1. I’ve forgotten what day it is.
  • 2. I know: Sundays: on average 3 WeChats per hour. Other days: on average 10 WeChats per hour.
  • 3. I received 4 WeChats in a given hour.
  • 4. Is it Sunday?
slide-23
SLIDE 23

Probabilistic Programming Solution:

  • 1. Formulate the problem as a program that describes the posterior probability.
  • 2. Sample from this distribution by running the program (multiple times), using a suitable Monte Carlo sampling algorithm.

23

High school probability problem

  • 1. I’ve forgotten what day it is.
  • 2. I know: Sundays: on average 3 messages per hour. Other days: on average 10 messages per hour.
  • 3. I received 4 messages in a given hour.
  • 4. Is it Sunday?

normalise(
  let sunday = sample(bernoulli(1/7)) in
  let rate = if sunday then 3 else 10 in
  observe 4 from poisson(rate);
  return(sunday))

Annotations:
  • bernoulli(p) generates 1 with prob. p; 0 with prob. 1 − p.
  • poisson(rate) gives the prob. of n events occurring in a fixed duration, where rate is the avg. no. of events occurring.
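To illustrate step 2, here is a plain-Python sketch (using simple likelihood weighting rather than Anglican's actual inference engines) that approximates the same posterior by running the generative model many times:

import math, random

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

def run_model():
    sunday = random.random() < 1 / 7          # sample from the prior bernoulli(1/7)
    rate = 3 if sunday else 10
    weight = poisson_pmf(4, rate)             # observe: likelihood of the 4 messages
    return sunday, weight

runs = [run_model() for _ in range(20000)]
total_weight = sum(w for _, w in runs)        # normalise
posterior = sum(w for sunday, w in runs if sunday) / total_weight
print(posterior)                              # ≈ 0.60 (ground truth 0.596…)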

slide-24
SLIDE 24

24

E.g. Sampling an Anglican program

[Figure] Histogram of 20,000 runs. Ground truth: 0.596… (exact solution using Bayes’ Rule).

slide-25
SLIDE 25

Outline

  • 0. Probabilistic Programming
  • 1. Probabilistic programs as (generative) models
  • 2. Generic inference algorithms
  • a. approximate methods
  • b. an exact method by disintegration

25

slide-26
SLIDE 26

26

Based on Wood and Paige: Probabilistic Programming Practicals, MLSS 2015

Example: 2D Physics

20 balls are falling from a chute. [Q] How to transfer them to the bin? [A] Position bumpers judiciously.

slide-27
SLIDE 27

27

Probabilistic Programming Solution: 3 steps

  • 1. Generative model as simulator: construct a program that, given a list of bumper coordinates, models the trajectories of the 20 bouncing balls, thus determining the number of balls landing in the bin.

[Diagram] simulator: bumper-coords → #balls-in-bin

slide-28
SLIDE 28

28

Probabilistic Programming Solution

  • 1. Generative model as simulator: construct a program that, given a list of bumper coordinates, models the trajectories of the 20 bouncing balls, thus determining the number of balls landing in the bin.
  • 2. Inference (compute posterior) = “inverting the simulator”: given the desired (probabilistically specified) output, what should the inputs have been in order to generate the output?
  • 3. Sample from the posterior distribution P(X ∣ Y = 𝒩(18, 2)).

[Diagram] simulator: bumper-coords (X) → #balls-in-bin (Y), defining the joint P(X, Y); inference: #balls-in-bin (Y) → bumper-coords (X)

slide-29
SLIDE 29

29

Demo: an Anglican implementation

Sampling from the posterior distribution

  • Generate 2000 samples of the posterior distribution using MCMC (:lmh, lightweight Metropolis-Hastings)
  • Discard the first 1000 samples (“burn in”), then take every 100th sample (“thinning”)
  • Select the best (= highest #balls-in-bin) from the 10 remaining samples

slide-30
SLIDE 30

30

Well-placed bumpers by probabilistic programming, sending all 20 balls into bin

slide-31
SLIDE 31

Example 2: breaking Captchas using prob. prog.

31

[Diagram] X (alphanumeric strings) → simulator, with parameters θ (rotations, noise, warpings, etc.) → Y (Captcha images, e.g. "pX U 4 ∖2 x"); inference runs backwards from Y to X. Joint p(X, θ, Y); posterior p(X ∣ θ, Y = "pX U 4 ∖2 x") is a distribution on alphanumeric strings.

Le et al. AISTATS 2017

Performance: inference on tests < 100 ms; recognition rate 81% on Wikipedia Captchas, 41% on Facebook Captchas. “If you can create instances of captchas, you can break it!”

slide-32
SLIDE 32

32

[Diagram] X (alphanumeric strings) → simulator, with parameters θ (rotations, noise, warpings, etc.) → Y (Captcha images, e.g. "pX U 4 ∖2 x"); inference runs backwards from Y to X. Joint p(X, θ, Y); posterior p(X ∣ θ, Y = "pX U 4 ∖2 x").

How to automate the inversion of simulators?

  • 1. A prob. prog. lang. for coding simulators, equipped with a compiler yielding representations of joint distributions.
  • 2. A general-purpose inference engine that can work on arbitrary programs.

Le et al. AISTATS 2017

slide-33
SLIDE 33

Outline

  • 0. Probabilistic Programming
  • 1. Generative models as probabilistic programs
  • 2. Generic inference algorithms
  • a. approximate methods
  • b. an exact method by disintegration

33

slide-34
SLIDE 34

Good King Markov Puzzle*

34

* Richard McElreath: Statistical Rethinking, 2015

  • King Markov rules over 10 islands.
  • Island i has population 100 × i.
  • King Markov loves his people: he vows to visit each island as often as its population size.
  • How should he visit his islands?

[Q] Design an algorithm, subject to (1) no scheduling; no bookkeeping, (2) can only move among adjacent islands.

slide-35
SLIDE 35

35

  • King Markov rules over 10 islands.
  • Island i has population 100 × i.
  • King Markov loves his people: he vows to visit each island as often as its population size.
  • How should he visit his islands?

[Q] Design an algorithm, subject to (1) no scheduling; no bookkeeping, (2) can only move among adjacent islands.

Attempt 1. (Violates (1))

[Figure] A precomputed schedule, e.g. 7 days, 8 days, 9 days, 10 days on the corresponding islands.

Good King Markov Puzzle*

slide-36
SLIDE 36

Good King Markov Puzzle*

36

Attempt 2. (Violates (2))

i ~ discrete [1, .., 10]   % draws i with prob. i/55
Visit i
Repeat

  • King Markov rules over 10 islands.
  • Island i has population 100 × i.
  • King Markov loves his people: he vows to visit each island as often as its population size.
  • How should he visit his islands?

[Q] Design an algorithm, subject to (1) no scheduling; no bookkeeping, (2) can only move among adjacent islands.

slide-37
SLIDE 37

Solution: Metropolis Algorithm

— How to stumble towards an answer by randomisation.

37

  • 0. Randomly pick some island to visit.

Repeat:

  • 1. Randomly set proposed_island to be one of the two adjacent islands.
  • 2. If proposed_island > present_island then visit proposed_island next; else toss a coin of bias proposed_island / present_island: if heads then visit proposed_island next, else stay put.

  • N. Metropolis et al.: “Equation of State Calculations by Fast Computing Machines.” J. Chemical Physics, Vol. 21, 1953.

  • N. Metropolis (1915-99)
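A minimal Python sketch of this random walk (an illustration added here, assuming for concreteness that the 10 islands form a ring): in a long run, the visit frequencies approach the population proportions i/55.

import random
from collections import Counter

population = {i: 100 * i for i in range(1, 11)}   # island i has population 100*i

def propose(island):
    # one of the two adjacent islands, assuming the islands form a ring
    return (island - 1 + random.choice([-1, 1])) % 10 + 1

current = random.randint(1, 10)                   # 0. randomly pick some island
visits = Counter()
for _ in range(200_000):
    visits[current] += 1
    proposal = propose(current)
    # 2. always move to a bigger island; otherwise move with prob. = population ratio
    if random.random() < min(1.0, population[proposal] / population[current]):
        current = proposal

total = sum(visits.values())
print({i: round(visits[i] / total, 3) for i in range(1, 11)})   # ≈ i / 55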
slide-38
SLIDE 38

In the limit, King Markov will visit each island as often as its population size.

38

An example run of Metropolis algorithm on King Markov puzzle

Figure from web.stanford.edu/class/stats305a/MCMC.html

slide-39
SLIDE 39

39

“We tried to assemble the 10 algorithms with the greatest influence on the development and practice of science and engineering in the 20th century:

  1. Metropolis Algorithm for Monte Carlo. Through the use of random processes, this algorithm offers an efficient way to stumble toward answers to problems that are too complicated to solve exactly.
  2. Simplex Method for Linear Programming
  3. Krylov Subspace Iteration Methods
  4. The Decompositional Approach to Matrix Computations
  5. The Fortran Optimizing Compiler
  6. QR Algorithm for Computing Eigenvalues
  7. Quicksort Algorithm for Sorting
  8. Fast Fourier Transform
  9. Integer Relation Detection
  10. Fast Multipole Method”

Francis Sullivan & Jack Dongarra, guest editors. IEEE Computing in Sc. & Eng., Vol. 2, 2000.

slide-40
SLIDE 40

Metropolis-Hastings (MH) Algorithms

Goal: Generate samples of a target pdf p(x)/Z, where Z = ∫ p(x) dx (the normalising constant) is hard to compute, whereas the value of p(x) is easy to compute. Use an adaptive proposal q(x′ ∣ x), with x the previous state and x′ the newly proposed state. Assume q is easy to compute and sample from.

40

1. Initialise x1 randomly. Set n := 1.
2. Repeat:
  a. x′ ∼ q(x′ ∣ xn)
  b. α := min(1, (p(x′) ⋅ q(xn ∣ x′)) / (p(xn) ⋅ q(x′ ∣ xn)))
  c. u ∼ Uniform[0, 1]
  d. xn+1 := (if u ≤ α then x′ else xn); n := n + 1
3. Output samples xN, xN+1, …   % discard x1, …, xN−1 from the “burn-in” period
slide-41
SLIDE 41
  • Example. Let q(x′ ∣ x) = 𝒩(x, σ²) — Gaussian with mean x. We aim to MH-sample a bimodal Gaussian mixture distribution p(x).

41

[Figure] Initialise x0. Draw x1 from q(x1 ∣ x0), accept.

slide-42
SLIDE 42

42

[Figure] Initialise x0. Draw x1, accept. Draw x2 from q(x2 ∣ x1), accept.

  • Example (continued). q(x′ ∣ x) = 𝒩(x, σ²); the target p(x) is a bimodal Gaussian mixture.

slide-43
SLIDE 43

43

[Figure] Initialise x0. Draw x1, accept. Draw x2, accept. Draw x′ from q(x3 ∣ x2) but reject; set x3 := x2.

(Rejected because p(x′)/p(x2) is small, hence α := min(1, (p(x′) ⋅ q(x2 ∣ x′)) / (p(x2) ⋅ q(x′ ∣ x2))) is close to 0.)

  • Example (continued). q(x′ ∣ x) = 𝒩(x, σ²); the target p(x) is a bimodal Gaussian mixture.

slide-44
SLIDE 44

44

[Figure] Initialise x0. Draw x1, accept. Draw x2, accept. Draw but reject (x3 := x2). Draw x4 from q(x4 ∣ x3), accept.

  • Example (continued). q(x′ ∣ x) = 𝒩(x, σ²); the target p(x) is a bimodal Gaussian mixture.

slide-45
SLIDE 45

45

[Figure] Initialise x0. Draw x1, accept. Draw x2, accept. Draw but reject (x3 := x2). Draw x4, accept. Draw x5 from q(x5 ∣ x4), accept.

  • Example (continued). q(x′ ∣ x) = 𝒩(x, σ²); the target p(x) is a bimodal Gaussian mixture.

slide-46
SLIDE 46

46

[Figure] Same state as the previous slide: x0, x1, x2, x3 := x2, x4, x5.

slide-47
SLIDE 47

Markov Chain Monte Carlo (MCMC) Methods

  • A class of sampling algorithms: Metropolis, Metropolis-Hastings (MH), Gibbs sampling, etc.
  • Application: numerical approximation of high-dimensional integrals.
  • The MH algorithm is used to construct a Markov chain with invariant distribution π. Detailed balance — a sufficient condition for p to be the invariant distribution:

  p(xi) q(xi−1 ∣ xi) = p(xi−1) q(xi ∣ xi−1)

47

  • J. Gill: A Primer on Markov Chain Monte Carlo. In Bayesian Methods: A Social and Behavioral Sciences Approach, 2015.
  • C. Andrieu et al.: An Introduction to MCMC for Machine Learning. Machine Learning, 50, 2003.

Theorem (Convergence). Let P be the transition matrix of an ergodic Markov chain (i.e. irreducible, aperiodic and positive recurrent) with invariant distribution π. Then π is the equilibrium distribution, i.e. supA∈Σ |Pn(s, A) − π(A)| → 0 as n → ∞.
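As a small numerical illustration (added here, not from the slides), detailed balance can be checked directly for the King Markov chain of slide 37, whose MH kernel on the ring of 10 islands has invariant distribution p(i) ∝ i:

# target: p(i) proportional to i on islands 1..10, so p(i) = i/55
p = {i: i / 55 for i in range(1, 11)}

def transition(i, j):
    # one-step MH transition probability from island i to island j (i != j):
    # propose each neighbour with prob. 1/2, accept with prob. min(1, p(j)/p(i))
    neighbours = {(i - 2) % 10 + 1, i % 10 + 1}
    return 0.5 * min(1.0, p[j] / p[i]) if j in neighbours else 0.0

for i in range(1, 11):
    for j in range(1, 11):
        if i != j:
            assert abs(p[i] * transition(i, j) - p[j] * transition(j, i)) < 1e-12
print("detailed balance holds")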

slide-48
SLIDE 48
  • 1. Monte Carlo Methods
  • Sampling algorithms: Metropolis, Metropolis-Hastings (MH), MCMC, Gibbs sampling, importance sampling, sequential MC, etc.
  • E.g. MH constructs a Markov chain with the desired distribution as the equilibrium distribution; thus the samples drawn match the desired distribution in the limit.
  • Accurate (asymptotically) but do not scale.
  • Application: numerical approximation of high-dimensional integrals.

48

  • 2. Variational Inference (VI)
  • Transforms inference into an optimisation problem: identify the distribution qλ in a parameterised family that best approximates the posterior p by minimising

  KL(qλ ∣∣ p) := ∫ log(qλ(z) / p(z)) qλ(dz)

  • Black-box VI (score estimator, reparameterised gradient), ADVI, etc. (a minimal sketch of the reparameterised-gradient idea follows below)
  • Scalable; while an approximate solution is easy to obtain, it is hard to evaluate / improve it.
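As a minimal sketch of the reparameterised-gradient idea (an illustrative example added here, not from the slides): fit qλ = 𝒩(μ, σ²) to a target whose unnormalised log-density is known, by stochastic gradient ascent on the ELBO, writing z = μ + σε with ε ∼ 𝒩(0, 1).

import math, random

def grad_log_p(z):
    # gradient of the unnormalised log target; here the target is N(2, 1)
    return -(z - 2.0)

mu, rho = 0.0, 0.0                        # variational parameters; sigma = exp(rho)
lr, batch = 0.01, 32
for _ in range(5000):
    sigma = math.exp(rho)
    g_mu = g_rho = 0.0
    for _ in range(batch):
        eps = random.gauss(0.0, 1.0)
        z = mu + sigma * eps              # reparameterisation: z = mu + sigma * eps
        g = grad_log_p(z)
        g_mu += g / batch                 # d/d mu of E[log p(z)]
        g_rho += g * sigma * eps / batch  # chain rule: dz/d rho = sigma * eps
    g_rho += 1.0                          # derivative of the entropy term (log sigma = rho)
    mu += lr * g_mu                       # stochastic gradient ascent on the ELBO
    rho += lr * g_rho

print(mu, math.exp(rho))                  # converges to ≈ 2.0 and ≈ 1.0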

slide-49
SLIDE 49

Complexity aspects of MCMC

Mixing time is the time it takes for the time-t distribution of an MCMC to be approximately the invariant distribution. An MCMC is rapidly mixing if its mixing time is polynomial in the log of the number of states.

A #P-hard problem

Theorem (Jerrum, Sinclair & Vigoda 2001). The permanent of a matrix (with nonnegative entries) can be computed approximately in probabilistic polynomial time, up to an error of ϵM, where M is the value of the permanent and ϵ > 0 is arbitrary (i.e. an FPRAS). Critical step in the computation: use of a rapidly mixing MCMC whose invariant distribution is an almost-uniform distribution over all perfect matchings in a given bipartite graph (i.e. a fully polynomial almost uniform sampler, FPAUS).

49

slide-50
SLIDE 50

Outline

  • 0. Probabilistic Programming
  • 1. Generative models as probabilistic programs
  • 2. Generic inference algorithms
  • a. approximate methods
  • b. an exact method by disintegration

50

slide-51
SLIDE 51

Exact inference algorithms

  • 1. Existence: does the posterior P(θ ∣ d) always exist?
  • 2. Analytic (closed-form) solutions? These exist only in limited cases (e.g. when prior and posterior are conjugate distributions).

51

Bayesian Inference Problem: Given the joint dist. P(θ, d) and prior probability P(θ), compute the posterior probability P(θ ∣ d).

slide-52
SLIDE 52

Exact inference algorithms

  • 1. Existence: does the posterior P(θ ∣ d) always exist?
  • 2. Analytic (closed-form) solutions? These exist only in limited cases (e.g. when prior and posterior are conjugate distributions).

  • Ans. to Q.1 is yes for σ-finite (incl. probability) measures.

Problem: Probabilistic programs denote (exactly all the) s-finite measures (i.e. possibly infinite measures that are countable sums of finite measures). Need to establish basic measure-theoretic results for s-finite measures.

52

Bayesian Inference Problem: Given the joint dist. P(θ, d) and prior probability P(θ), compute the posterior probability P(θ ∣ d).

Staton, S.: Commutative Semantics for Probabilistic Programming. ESOP 2017.

[Diagram] probability ⊆ finite ⊆ σ-finite ⊆ s-finite measures

slide-53
SLIDE 53

New existence results for s-finite measures

Disintegration generalises conditional probability to arbitrary measures over general spaces, formalising a non-trivial restriction of a measure to a measure-0 subset. We extend the Radon-Nikodym Theorem to s-finite measures, and prove a disintegration theorem. Roughly, the disintegration theorem for s-finite measures tells us that, subject to certain (modified absolute continuity) conditions, given a joint distribution and a prior probability, the posterior distribution is guaranteed to exist — all denotable as probabilistic programs.

53

Theorem (Disintegration). Given standard Borel spaces X and Y, and s-finite measures μ on X × Y and ν on Y satisfying some modified absolute continuity conditions, there exists an essentially unique s-finite kernel k : Y ⇝ X s.t.

  μ = ν ⊗ k := λW. ∫Y ν(dy) ∫X k(y, dx) 𝟙W(x, y)

Vákár & Ong: On S-Finite Measures and Kernels. https://arxiv.org/pdf/1810.01837, 2018.

slide-54
SLIDE 54

New existence results for s-finite measures

Disintegration generalises conditional probability to arbitrary measures over general spaces, formalising a non-trivial restriction of a measure to a measure-0 subset. We extend the Radon-Nikodym Theorem to s-finite measures, and prove a disintegration theorem (below).

Goal: Build disintegration / density calculators, using techniques in meta-programming (e.g. lazy partial evaluation), giving exact (analytic) solutions where possible. Useful for the general problem of computing approximate inference. Tools: PSI (Gehr et al., ETH, 2016); Hakaru (Shan et al., 2017).

54

Theorem (Disintegration). Given standard Borel spaces X and Y, and s-finite measures μ on X × Y and ν on Y satisfying some absolute continuity conditions, there exists an essentially unique s-finite kernel k : Y ⇝ X s.t.

  μ = ν ⊗ k := λW. ∫Y ν(dy) ∫X k(y, dx) 𝟙W(x, y)

Vákár & Ong: On S-Finite Measures and Kernels. https://arxiv.org/pdf/1810.01837, 2018.

slide-55
SLIDE 55

Summary

Probabilistic programming is a general-purpose method of

  • 1. constructing probabilistic models as computer programs,
  • 2. automatically performing Bayesian inference.

Separating model construction from inference procedures democratises access to machine learning, with potentially huge benefits to AI and scientific modelling. Probabilistic programming poses unique and interesting challenges in programming languages (semantics and pragmatics) and the statistical foundations of machine learning.

55

[Diagram] Probabilistic Programming sits at the meeting point of: ML & AI (algorithms & applications; DL), Statistics (inference & theory), and Prog. Lang. (evaluators & semantics).

slide-56
SLIDE 56

Further directions

  • 1. Variational inference:
  • a. theory (less explored than Monte Carlo methods)
  • b. can we find better local optima? accelerate convergence?
  • c. are there better divergences than KL?
  • d. VI under-approximates posterior variances. Can we do better?
  • 2. Many inference algorithms have side conditions (e.g. ergodicity for MCMC; commutativity of ∇ and ∫ for score-gradient VI). Find corresponding “checkable” sufficient conditions on probabilistic programs.
  • 3. Design and implementation of probabilistic prog. langs. for Bayesian nonparametric models; semantics and inference algorithms for these languages.
  • 4. Exploit DL as a backend engine (e.g. amortised compilation); deep probabilistic prog. langs. (Pyro and Edward).

56