Probabilistic Programming for Bayesian Machine Learning
Luke Ong 翁之昊
University of Oxford
What is Machine Learning? Many related terms: neural networks, pattern recognition, data mining, data science, statistical modelling, AI, machine …
Autonomous vehicles / robotics / drones
Computer vision: facial recognition
Financial prediction / automated trading
Recommender systems
Language / speech technologies
Scientific modelling / data analysis
* Total equity funding of AI start-ups
Allen, G. C.: Understanding China’s AI Strategy. Center for a New American Security, 2019.
Ding, J.: Deciphering China’s AI Dream. Future of Humanity Institute, Univ. of Oxford, 2018.
Xue Lan: China AI Development Report. Tsinghua University, 2018.
Much of the hype concerns Deep Learning
Discriminative approach: directly learn to predict. Given training data (input-output pairs), learn a parametrised (non-linear) function from inputs to outputs.
Examples: neural nets, support vector machines, decision-tree ensembles (e.g. random forests).
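As a minimal sketch of this discriminative view (the toy data and model below are invented for illustration, not taken from the slides), one can fit a parametrised function to input-output pairs by gradient descent:

```python
# Hedged sketch: fit y = w*x + b to toy input-output pairs by
# gradient descent on the mean squared error.
import random

# Toy training data from y = 2x + 1 plus Gaussian noise.
data = [(x, 2 * x + 1 + random.gauss(0, 0.1)) for x in [i / 10 for i in range(50)]]

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)  # dMSE/dw
    gb = sum(2 * (w * x + b - y) for x, y in data) / len(data)      # dMSE/db
    w, b = w - lr * gw, b - lr * gb

print(f"learned w={w:.2f}, b={b:.2f}")  # should be close to w=2, b=1
```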
Generative approach: build a probabilistic model that explains the observed data by generating them, i.e., a simulator. The model defines a joint probability distribution over inputs (latent variables and parameters) and outputs (data).
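A hedged sketch of the generative view: the toy simulator below (distributions invented for illustration) defines a joint probability p(θ, data) by first sampling a parameter from a prior, then sampling observations given it:

```python
# Toy simulator: sampling theta from a prior and data given theta
# defines the joint distribution p(theta, data).
import random

def simulate(n=10):
    theta = random.betavariate(1, 1)             # prior: theta ~ Beta(1, 1)
    data = [1 if random.random() < theta else 0  # likelihood: n Bernoulli(theta) draws
            for _ in range(n)]
    return theta, data

print(simulate())
```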
The learned function is typically uninterpretable.
ConvNet figure from Clarifai Technology
Neal, R. M.: Bayesian Learning for Neural Networks (Vol. 118). Springer, 1996.
Gal, Y.: Uncertainty in Deep Learning. Univ. of Cambridge PhD thesis, 2016.
Gal & Ghahramani: Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML 2016.
Thomas Bayes (1701-1761)
Given observed data D and a latent parameter θ, Bayes’ rule gives the posterior: p(θ ∣ D) = p(D ∣ θ) ⋅ p(θ) / p(D).
* Cox 1946; Jaynes 1996; Van Horn 2003
D: data (observed); θ: parameter (latent)
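As a worked instance of Bayes’ rule (an illustration; the prior and data are invented): with a Beta(1, 1) prior on θ and k heads observed in n Bernoulli(θ) coin flips, conjugacy gives the posterior in closed form:

```python
# Conjugate Bayesian update for a coin.
a, b = 1, 1     # prior Beta(1, 1), i.e. uniform on theta
k, n = 7, 10    # data D: 7 heads in 10 flips
a_post, b_post = a + k, b + n - k        # posterior p(theta | D) = Beta(8, 4)
post_mean = a_post / (a_post + b_post)   # posterior mean of theta
print(f"posterior Beta({a_post}, {b_post}), mean {post_mean:.3f}")  # 0.667
```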
* Wood, F.: Probabilistic Programming. NIPS 2015 tutorial.
* Tenenbaum & Mansinghka: Engineering and Reverse-Engineering Intelligence Using Probabilistic Programs, Program Induction, and Deep Learning. NIPS 2017 tutorial.
[Diagram: the Bayesian modelling loop. Knowledge & Questions → Make Assumptions (prior probability) → Discover Patterns from Data (joint probability) → Infer, Predict, Explore (posterior probability) → Criticise Model → revise assumptions and repeat.]
limₙ→∞ p(θ ∣ x₁, …, xₙ) = δ(θ − θ*): as the number n of i.i.d. observations grows, the posterior concentrates on the true parameter θ* (posterior consistency).
Doob, 1949; Freedman, 1963
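A small numerical check of this statement (assumed setup, for illustration: Bernoulli(θ*) data with a conjugate Beta prior): as n grows, the posterior mean approaches θ* and the posterior standard deviation shrinks towards 0.

```python
# Posterior concentration: Beta posterior under growing Bernoulli data.
import math, random

theta_star, a, b = 0.3, 1, 1   # true parameter; Beta(1, 1) prior
for n in [10, 100, 1000, 10000]:
    k = sum(random.random() < theta_star for _ in range(n))
    ap, bp = a + k, b + n - k
    mean = ap / (ap + bp)
    sd = math.sqrt(ap * bp / ((ap + bp) ** 2 * (ap + bp + 1)))
    print(f"n={n:6d}: posterior mean {mean:.3f}, sd {sd:.4f}")
# mean -> theta* = 0.3 and sd -> 0, i.e. the posterior tends to
# the point mass delta(theta - theta*).
```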
Tanner: Tools for Statistical Inference. Springer, 1996. (Ch. 2)
Posterior ∝ Likelihood × Prior
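The proportionality can be made concrete by grid approximation (a sketch with an assumed Bernoulli likelihood and uniform prior): evaluate likelihood × prior on a grid of θ values, then normalise so the values sum to 1.

```python
# Grid approximation of Posterior ∝ Likelihood × Prior:
# 7 heads in 10 Bernoulli(theta) flips, uniform prior on theta.
grid = [i / 100 for i in range(101)]              # candidate theta values
prior = [1.0] * len(grid)                         # uniform prior
like = [t**7 * (1 - t)**3 for t in grid]          # likelihood of the data
unnorm = [l * p for l, p in zip(like, prior)]     # likelihood × prior
z = sum(unnorm)                                   # normalising constant
post = [u / z for u in unnorm]                    # normalised posterior on the grid
print(max(zip(post, grid)))                       # posterior mode near theta = 0.7
```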
Gordon et al.: Probabilistic Programming. FOSE 2014.
Staton, S.: Commutative Semantics for Probabilistic Programming. ESOP 2017.
Picture based on Wood, Introduction to Probabilistic Programming, 2018
The language used in these examples, Anglican, is an extension of Clojure, a dialect of Lisp.
Bernoulli distribution, bernoulli(p), has two outcomes: 1 with probability p, 0 with probability 1 − p.
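A one-line inverse-transform sampler for bernoulli(p), as a sketch:

```python
# Sample bernoulli(p): 1 with probability p, 0 with probability 1 - p.
import random

def bernoulli(p):
    return 1 if random.random() < p else 0

draws = [bernoulli(0.3) for _ in range(100_000)]
print(sum(draws) / len(draws))  # empirical frequency of 1s, close to p = 0.3
```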
Poisson distribution, poisson(λ), gives the probability of a given number of events occurring in a fixed time interval, where λ is the average number of events.
[Plots: probability mass functions of poisson(3) and poisson(10).]
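To see how the rate parameter shifts the distribution (a sketch using the standard pmf formula; no plotting library assumed):

```python
# Poisson pmf: P(k events) = lambda^k * exp(-lambda) / k!
import math

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

for lam in (3, 10):  # the two rates shown in the plots
    pmf = [poisson_pmf(k, lam) for k in range(21)]
    mode = pmf.index(max(pmf))
    print(f"poisson({lam}): mode at k={mode}, P(mode)={max(pmf):.3f}")
# poisson(3) peaks near k = 3; poisson(10) peaks near k = 10 and is more spread out.
```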
Recap: bernoulli(p) generates 1 with probability p and 0 with probability 1 − p; poisson(λ) gives the probability of n events occurring in a fixed duration, where the rate λ is the average number of events.
Based on Wood and Paige: Probabilistic Programming Practicals, MLSS 2015
Well-placed bumpers, found by probabilistic programming, sending all 20 balls into the bin.
[Figure: a generative model for Captchas. Alphanumeric strings → distribution on Captcha images (via rendering, distortions, warpings, etc.); inference inverts the generator to read Captchas.]
Performance: inference on test Captchas in under 100 ms; recognition rate 81% on Wikipedia Captchas and 41% on Facebook Captchas. “If you can create instances of captchas, you can break it!”
Le et al., AISTATS 2017
* Richard McElreath: Statistical Rethinking, 2015
i ~ discrete [1, .., 10] % draws i with prob. i/55
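The `discrete` draw above can be reproduced with weighted sampling (a sketch; `random.choices` takes relative weights):

```python
# i ~ discrete([1, ..., 10]): value i is drawn with probability i/55,
# since the weights 1..10 sum to 55.
import random
from collections import Counter

values = list(range(1, 11))
draws = random.choices(values, weights=values, k=110_000)
counts = Counter(draws)
for i in values:
    print(i, round(counts[i] / len(draws), 3), "expected", round(i / 55, 3))
```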
Metropolis, Rosenbluth, Rosenbluth, Teller & Teller: Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, Vol. 21, 1953.
Figure from web.stanford.edu/class/stats305a/MCMC.html
“We tried to assemble the 10 algorithms with the greatest influence on the development and practice of science and engineering in the 20th century:
1. Metropolis Algorithm for Monte Carlo. Through the use of random processes, this algorithm offers an efficient way to stumble toward answers to problems that are too complicated to solve exactly.
2. Simplex Method for Linear Programming
3. Krylov Subspace Iteration Methods
4. The Decompositional Approach to Matrix Computations
5. The Fortran Optimizing Compiler
6. QR Algorithm for Computing Eigenvalues
7. Quicksort Algorithm for Sorting
8. Fast Fourier Transform
9. Integer Relation Detection
10. Fast Multipole Method”
Dongarra & Sullivan: The Top 10 Algorithms. IEEE Computing in Science & Engineering, Vol. 2, 2000.
Metropolis-Hastings (MH), for target density p and proposal kernel q:
1. Initialise x₀.
2. Repeat for n = 0, 1, 2, …:
   a. Draw a proposal x′ ~ q(⋅ ∣ xₙ).
   b. Compute α := min(1, (p(x′) ⋅ q(xₙ ∣ x′)) / (p(xₙ) ⋅ q(x′ ∣ xₙ))).
   c. Draw u ~ uniform(0, 1).
   d. If u < α then xₙ₊₁ := x′, else xₙ₊₁ := xₙ.
3. Output samples x₁, x₂, … % Discard samples from the “burn-in” period
Running example: the proposal q(⋅ ∣ xₙ) is a Gaussian with mean xₙ; we aim to MH-sample a bimodal Gaussian mixture distribution p.
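A compact sampler for this running example (a sketch: the mixture components and proposal scale below are invented, since the slides’ exact values are not recoverable; with a symmetric Gaussian proposal the q terms in α cancel):

```python
# Metropolis-Hastings with a Gaussian random-walk proposal, targeting
# an (assumed) bimodal mixture 0.5*N(-2, 1) + 0.5*N(3, 1).
import math, random

def p(x):  # unnormalised target density
    bump = lambda x, m: math.exp(-0.5 * (x - m) ** 2)
    return 0.5 * bump(x, -2) + 0.5 * bump(x, 3)

def mh(steps=50_000, scale=1.0):
    x, samples = 0.0, []
    for _ in range(steps):
        x_new = random.gauss(x, scale)     # proposal: Gaussian with mean x
        alpha = min(1.0, p(x_new) / p(x))  # acceptance prob.; q cancels (symmetric)
        if random.random() < alpha:        # accept the proposal ...
            x = x_new                      # ... otherwise keep the current x
        samples.append(x)
    return samples[5_000:]                 # discard the burn-in period

s = mh()
print(sum(s) / len(s))  # roughly the mixture mean 0.5*(-2) + 0.5*3 = 0.5
```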
[Figure sequence: MH sampling of the bimodal target, step by step. Initialise; draw, accept; draw, accept; draw but reject (p(x′) is small, hence α is close to 0); draw, accept; draw, accept.]
Sciences Approach, 2015.
Variational inference: choose λ to minimise KL(qλ ∥ p) = ∫ log( qλ(z) / p(z) ) qλ(dz).
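This integral can be estimated by Monte Carlo with samples from qλ (a sketch; both densities are taken to be Gaussian here so the estimate can be checked against the closed-form KL):

```python
# Monte Carlo estimate of KL(q || p) = E_{z~q}[log q(z) - log p(z)],
# with q = N(1, 1) and p = N(0, 4), for which KL has a closed form.
import math, random

def log_normal(z, mu, sigma):
    return -0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

mu_q, sd_q, mu_p, sd_p = 1.0, 1.0, 0.0, 2.0
zs = [random.gauss(mu_q, sd_q) for _ in range(100_000)]
kl_mc = sum(log_normal(z, mu_q, sd_q) - log_normal(z, mu_p, sd_p)
            for z in zs) / len(zs)

# Closed form for Gaussians:
kl_exact = math.log(sd_p / sd_q) + (sd_q**2 + (mu_q - mu_p)**2) / (2 * sd_p**2) - 0.5
print(kl_mc, kl_exact)  # agree up to Monte Carlo error (about 0.44)
```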
Staton, S.: Commutative semantics for probabilistic programming. ESOP 2017
Kernel classes: probability ⊆ finite ⊆ s-finite; probabilistic programs denote s-finite kernels.
Vákár & Ong: On S-Finite Measures and Kernels. https://arxiv.org/pdf/1810.01837, 2018.
[Diagram: Probabilistic Programming at the intersection of ML & AI (algorithms & applications, deep learning), Statistics (inference & theory), and programming languages (evaluators & semantics).]