
slide-1
SLIDE 1

Probabilistic Programming for Bayesian Machine Learning

Luke Ong 翁之昊

University of Oxford

1

slide-2
SLIDE 2

What is Machine Learning?

How to make machines learn from data?

  • Computer Science: AI, computer vision, information retrieval
  • Statistics: learning theory, learning and inference from data
  • Cognitive Science / Psychology: perception, computational linguistics, mathematical psychology
  • Neuroscience: neural networks, neural information processing
  • Engineering: signal processing, adaptive and optimal control, information theory, robotics
  • Economics: decision theory, game theory, operations research

Many related terms: neural networks, pattern recognition, data mining, data science, statistical modelling, AI, machine learning, etc.

2

slide-3
SLIDE 3

Truly useful real-world applications of ML / AI

3

  • Autonomous vehicles / robotics / drones
  • Computer vision: facial recognition
  • Financial prediction / automated trading
  • Recommender systems
  • Language / speech technologies
  • Scientific modelling / data analysis

slide-4
SLIDE 4

Intense global interest (and hype) in AI

China’s “Next Generation AI Dev. Plan (2017)”

  • 1. Join first echelon by 2020 (big data, swarm AI, theory)
  • 2. Breakthroughs by 2025 (medicine, AI laws, security & control)
  • 3. World-leading by 2030 with CNY 1 trillion (≈ USD 150 billion) domestic AI industry (social governance, defence, industry)

4

* Total equity funding of AI start-ups.
Allen, G. C.: Understanding China's AI Strategy. Center for a New American Security, 2019.
Ding, J.: Deciphering China's AI Dream. Future of Humanity Institute, Univ. of Oxford, 2018.
Xue Lan: China AI Development Report. Tsinghua University, 2018.

slide-5
SLIDE 5

Intense global interest (and hype) in AI

China’s “Next Generation AI Dev. Plan (2017)”

  • 1. Join first echelon by 2020 (big data, swarm AI, theory)
  • 2. Breakthroughs by 2025 (medicine, AI laws, security & control)
  • 3. World-leading by 2030 with USD 150 billion domestic AI industry (social governance, defence, industry)

5

“Objective 1 (technologies & market applications) was already achieved in mid-2018. China is:

  • #1 in AI funding*: globally 48% from China; 38% from US
  • #1 in total and highly cited AI papers worldwide
  • #1 in AI patents” (Tsinghua U. Report, 2018)

“A ‘Sputnik moment’ was felt by the West.” (Stuart Russell, 2018)

Much of the hype concerns Deep Learning

* Total equity funding of AI start-ups.
Allen, G. C.: Understanding China's AI Strategy. Center for a New American Security, 2019.
Ding, J.: Deciphering China's AI Dream. Future of Humanity Institute, Univ. of Oxford, 2018.
Xue Lan: China AI Development Report. Tsinghua University, 2018.

slide-6
SLIDE 6

6

slide-7
SLIDE 7

How to situate Deep Learning in ML?

Discriminative ML

Directly learn to predict: given training data (input-output pairs), learn a parametrised (non-linear) function fθ from inputs to outputs.

  • Training uses data to estimate the optimal value θ* of the parameter θ.
  • Prediction: given an unseen input x′, return the output y′ := fθ*(x′).

Examples: neural nets, support vector machines, decision tree ensembles (e.g. random forests).

Generative (probabilistic) ML

Build a probabilistic model that explains observed data by generating them, i.e. a simulator. The model defines a joint probability p(X, Y) of inputs X (latent variables and parameters) and outputs Y (data).

7

The parameter θ is typically uninterpretable.

slide-8
SLIDE 8

Deep Learning

Limitations

  • 1. Very data hungry
  • 2. Compute-intensive to train and deploy; finicky to optimise
  • 3. Easily fooled by adversarial examples.
  • 4. Poor at giving uncertainty estimates, leading to over-confidence, so unsuitable for safety-critical systems
  • 5. Hard to use prior knowledge & symbolic representation
  • 6. Uninterpretable black boxes: parameters have no real-world meanings

8

ConvNet figure from Clarifai Technology

slide-9
SLIDE 9

Deep learning ad infinitum?

Give up probability, logic, symbolic representation? “Deep learning will plateau out: many things are needed to make further progress, such as reasoning, and programmable models.” Pace Bayesian deep learning / uncertainty in deep learning.

Neal, R. M.: Bayesian Learning for Neural Networks (Vol. 118). Springer, 1996.
Gal, Y.: Uncertainty in Deep Learning. Univ. of Cambridge PhD thesis, 2016.
Gal & Ghahramani: Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML 2016.

9

“Many more applications are completely out of reach for current deep learning techniques — even given vast amounts of human-annotated data. … The main directions in which I see promise are models closer to general-purpose programs.” Francois Chollet (deep learning expert, Keras inventor)

slide-10
SLIDE 10

In contrast to Deep Learning…

Probabilistic Machine Learning

Given a system with some data:

  • 1. Build a model capable of generating data observable from the system.
  • 2. Use probability to express belief / uncertainty (including noise) about the model.
  • 3. Apply Bayes' Rule (= Bayesian inversion) to learn from data:
  • a. infer unknown quantities
  • b. predict
  • c. explore and adapt models

10

Thomas Bayes (1701-1761)

slide-11
SLIDE 11

Bayes’ Rule:

Given observed data 𝒟 = {d1, ⋯, dN}:

  P(θ ∣ 𝒟) = P(𝒟 ∣ θ) P(θ) / P(𝒟),  i.e.  P(θ ∣ 𝒟) ∝ P(𝒟 ∣ θ) × P(θ)

  Posterior ∝ Likelihood × Prior

(θ: parameter, latent; 𝒟: data, observed)

  • Likelihood function: P(𝒟 ∣ θ), not a probability (w.r.t. θ).
  • Model evidence: P(𝒟) = ∫ P(𝒟 ∣ θ) p(θ) dθ, the normalising constant (the computational challenge in ML).
  • Significance of Bayes’ Rule: it prescribes how our prior belief about θ is changed after observing the data 𝒟.

11

Thomas Bayes (1701-1761)

Two axioms “from which everything follows”*:

  Sum Rule:  P(x) = ∑y P(x, y)
  Product Rule:  P(x, y) = P(x) P(y ∣ x)

* Cox 1946; Jaynes 1996; van Horn 2003
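To make these rules concrete, here is a minimal Python sketch (an illustrative example added here, not from the slides) that applies the sum rule, the product rule and Bayes' Rule to a small discrete joint distribution:

# A small, hypothetical joint distribution P(x, y) over weather and transport.
joint = {
    ("sunny", "bike"): 0.30, ("sunny", "bus"): 0.10,
    ("rainy", "bike"): 0.15, ("rainy", "bus"): 0.45,
}

# Sum rule: P(x) = sum_y P(x, y), and similarly for P(y)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Product rule: P(x, y) = P(x) P(y | x), hence P(y | x) = P(x, y) / P(x)
p_y_given_x = {(x, y): p / p_x[x] for (x, y), p in joint.items()}

# Bayes' Rule: P(x | y) = P(y | x) P(x) / P(y)  (posterior from likelihood and prior)
p_x_given_y = {(x, y): p_y_given_x[(x, y)] * p_x[x] / p_y[y] for (x, y) in joint}

print(p_x_given_y[("rainy", "bus")])   # ≈ 0.818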

slide-12
SLIDE 12

What is Probabilistic Programming?

Problem: Probabilistic model development, and the design & implementation of inference algorithms, are time-consuming and error-prone, requiring bespoke constructions.

Probabilistic programming is a general-purpose means of

  • 1. expressing probabilistic models as programs, &
  • 2. automatically performing Bayesian inference.

12

slide-13
SLIDE 13

What is Probabilistic Programming?

Problem: probabilistic model development, and the design & implementation of inference algorithms, are time-consuming, error-prone and (unnecessarily) bespoke.

Probabilistic programming is a general-purpose means of

  • 1. expressing probabilistic models as programs, &
  • 2. automatically performing Bayesian inference.

Separation of concerns. Probabilistic programming systems

  • enable data scientists / domain experts to focus on designing good models,
  • leaving the development of efficient inference engines to experts in Bayesian statistics, machine learning & prog. langs.

Key advantage: democratise access to machine learning.

13

* Wood, F.: Probabilistic Programming. NIPS 2015 tutorial.
* Tenenbaum & Mansinghka: Engineering and Reverse-Engineering Intelligence Using Probabilistic Programs, Program Induction, and Deep Learning. NIPS 2017 tutorial.

slide-14
SLIDE 14

Bayesian / Probabilistic Pipeline

14

[Pipeline diagram] Knowledge & Questions → (Make Assumptions) Prior Probability → (Discover Patterns) Joint Probability; Data → (Infer, Predict, Explore) Posterior Probability

The pipeline distinguishes the roles of

  • 1. knowledge and questions (domain experts),
  • 2. making assumptions (data scientists & ML experts),
  • 3. building models and computing inferences (data scientists & ML experts), and
  • 4. implementing applications (ML users and practitioners).
slide-15
SLIDE 15

Bayesian / Probabilistic Pipeline Loop

15

[Pipeline-loop diagram] Knowledge & Questions → (Make Assumptions) Prior Probability → (Discover Patterns) Joint Probability; Data → (Infer, Predict, Explore) Posterior Probability → Criticise Model → back to assumptions

Probabilistic programming provides the means to iterate the Bayesian pipeline — the posterior probability of the nth iterate becomes the prior of the (n + 1)th iterate. Loop Robustness: asymptotic consensus of Bayesian posterior inference.

slide-16
SLIDE 16

Asymptotic certainty of posterior inference

Theorem (Bernstein-von Mises). Assume the data set 𝒟n (comprising n data points) was generated from some true θ*. Under some regularity conditions, provided p(θ*) > 0,

  lim n→∞ p(θ ∣ 𝒟n) = δ(θ − θ*).

In the unrealisable case, where the data was generated from some p*(x) which cannot be modelled by any θ, the posterior will converge to

  lim n→∞ p(θ ∣ 𝒟n) = δ(θ − θ̂),

where θ̂ minimises KL(p*(x) ∣∣ p(x ∣ θ)).

16

Doob, 1949; Freedman, 1963

The posterior distribution for unknown quantities in any problem is effectively asymptotically independent of the prior distribution as the data sample grows large.

slide-17
SLIDE 17

Asymptotic consensus of posterior inference

  • Theorem. Take two Bayesians with different priors, p1(θ) and p2(θ), observing the same data 𝒟n. Assume p1 and p2 have the same support. Then, as n → ∞, the posteriors p1(θ ∣ 𝒟n) and p2(θ ∣ 𝒟n) converge in the uniform distance between distributions, ρ(P1, P2) := supE |P1(E) − P2(E)|.
17

Tanner: Tools for Statistical Inference. Springer, 1996. (Ch. 2)

  • B. J. K. Kleijn, A. W. van der Vaart, et al.: The Bernstein-von-Mises theorem under misspecification. Electronic Journal of Statistics, 6:354–381, 2012.
slide-18
SLIDE 18

What is a probabilistic programming language?

A probabilistic program describes “prior × likelihood”, which is not a probability measure. The Inference Problem: find the normalising constant so as to compute the posterior probability.

18

P(θ ∣ 𝒟) ∝ P(𝒟 ∣ θ) × P(θ)

Posterior ∝ Likelihood × Prior

Gordon et al.: Probabilistic Programming. FOSE 2014.
Staton, S.: Commutative Semantics for Probabilistic Programming. ESOP 2017.

A PPL is a (deterministic) programming lang. plus 3 probabilistic constructs, corresponding to prior, likelihood and posterior, i.e.

  • 1. sample - draws from a probability distribution
  • 2. observe - records the likelihood of an observed data point
  • 3. normalise - computes the normalising constant and (hence) the posterior probability

slide-19
SLIDE 19

Current Probabilistic Programming Languages (incomplete)

19

Picture based on Wood, Introduction to Probabilistic Programming, 2018

Anglican: an extension of Clojure, a variant of Lisp

slide-20
SLIDE 20

20

High school probability problem

  • 1. I’ve forgotten what day it is.
  • 2. I know: Sundays: on average 3 WeChats per hour. Other days: on average 10 WeChats per hour.
  • 3. I received 4 WeChats in a given hour.
  • 4. Is it Sunday?
slide-21
SLIDE 21

Bernoulli distribution, bernoulli(p), has two outcomes: 1 with probability p, 0 with probability 1 − p.

21

Poisson distribution, poisson(λ), gives the probability of a given number of events occurring in a fixed time interval, where λ is the average number of events occurring in the interval.

[Plots] The probability mass functions of poisson(3) and poisson(10).

High school probability problem

  • 1. I’ve forgotten what day it is.
  • 2. I know: Sundays: on average 3 WeChats per hour. Other days: on average 10 WeChats per hour.
  • 3. I received 4 WeChats in a given hour.
  • 4. Is it Sunday?
slide-22
SLIDE 22

Solution 0: Easy application of Bayes’ Rule

  • Prob. mass function of Poisson(λ): pλ(k) = λ^k e^(−λ) / k!, for k ∈ {0, 1, 2, ⋯}

P(day = Sun ∣ #WeChats = 4)
  = P(day = Sun) ⋅ P(#WeChats = 4 ∣ day = Sun) / ∑d P(#WeChats = 4 ∣ day = d) ⋅ P(day = d)
  = (1/7) ⋅ p3(4) / ((1/7) ⋅ p3(4) + (6/7) ⋅ p10(4))
  = p3(4) / (p3(4) + 6 ⋅ p10(4))
  = 0.168 / (0.168 + 0.114) ≈ 0.597
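The same calculation can be checked numerically; the following short Python snippet (a sketch added here, not part of the original slides) evaluates the formula above:

import math

def poisson_pmf(k, lam):
    # p_lambda(k) = lambda^k * exp(-lambda) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

k = 4                      # number of WeChats received in the hour
prior_sunday = 1 / 7       # prior P(day = Sun)
numerator = prior_sunday * poisson_pmf(k, 3)
evidence = numerator + (1 - prior_sunday) * poisson_pmf(k, 10)
print(numerator / evidence)   # posterior P(Sunday | 4 WeChats) ≈ 0.597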

22

High school probability problem

  • 1. I’ve forgotten what day it is.
  • 2. I know: Sundays: on average 3 WeChats per hour. Other days: on average 10 WeChats per hour.
  • 3. I received 4 WeChats in a given hour.
  • 4. Is it Sunday?
slide-23
SLIDE 23

Probabilistic Programming Solution:

  • 1. Formulate the problem as a program that describes the posterior probability.
  • 2. Sample from this distribution by running the program (multiple times), using a suitable Monte Carlo sampling algorithm.

23

High school probability problem

  • 1. I’ve forgotten what day it is.
  • 2. I know: Sundays: on average 3 messages per hour. Other days: on average 10 messages per hour.
  • 3. I received 4 messages in a given hour.
  • 4. Is it Sunday?

normalise(
  let sunday = sample(bernoulli(1/7)) in
  let rate = if sunday then 3 else 10 in
  observe 4 from poisson(rate);
  return(sunday))

Annotations:
  • bernoulli(p) generates 1 with prob. p; 0 with prob. 1 − p.
  • poisson(rate) gives the prob. of n events occurring in a fixed duration, where rate is the avg. no. of events occurring.
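To illustrate step 2, here is a plain-Python sketch (using simple likelihood weighting rather than Anglican's actual inference engines) that approximates the same posterior by running the generative model many times:

import math, random

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

def run_model():
    sunday = random.random() < 1 / 7          # sample from the prior bernoulli(1/7)
    rate = 3 if sunday else 10
    weight = poisson_pmf(4, rate)             # observe: likelihood of the 4 messages
    return sunday, weight

runs = [run_model() for _ in range(20000)]
total_weight = sum(w for _, w in runs)        # normalise
posterior = sum(w for sunday, w in runs if sunday) / total_weight
print(posterior)                              # ≈ 0.60 (ground truth 0.596…)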

slide-24
SLIDE 24

24

E.g. Sampling an Anglican program

[Figure] Histogram of 20,000 runs. Ground truth: 0.596… (exact solution using Bayes’ Rule).

slide-25
SLIDE 25

Outline

  • 0. Probabilistic Programming
  • 1. Probabilistic programs as (generative) models
  • 2. Generic inference algorithms
  • a. approximate methods
  • b. an exact method by disintegration

25

slide-26
SLIDE 26

26

Based on Wood and Paige: Probabilistic Programming Practicals, MLSS 2015

Example: 2D Physics

20 balls are falling from a chute. [Q] How to transfer them to the bin? [A] Position bumpers judiciously.

slide-27
SLIDE 27

27

Probabilistic Programming Solution: 3 steps

  • 1. Generative model as simulator: construct a program that, given a list of bumper coordinates, models the trajectories of the 20 bouncing balls, thus determining the number of balls landing in the bin.

[Diagram] simulator: bumper-coords → #balls-in-bin

slide-28
SLIDE 28

28

Probabilistic Programming Solution

  • 1. Generative model as simulator: construct a program that, given a list of bumper coordinates, models the trajectories of the 20 bouncing balls, thus determining the number of balls landing in the bin.
  • 2. Inference (compute posterior) = “inverting the simulator”: given the desired (probabilistically specified) output, what should the inputs have been in order to generate the output?
  • 3. Sample from the posterior distribution P(X ∣ Y = 𝒩(18, 2)).

[Diagram] simulator: bumper-coords (X) → #balls-in-bin (Y), defining the joint P(X, Y); inference: #balls-in-bin (Y) → bumper-coords (X)

slide-29
SLIDE 29

29

Demo: an Anglican implementation

Sampling from the posterior distribution

  • Generate 2000 samples of the posterior distribution using MCMC (:lmh, lightweight Metropolis-Hastings)
  • Discard the first 1000 samples (“burn in”), then take every 100th sample (“thinning”)
  • Select the best (= highest #balls-in-bin) from the 10 remaining samples

slide-30
SLIDE 30

30

Well-placed bumpers by probabilistic programming, sending all 20 balls into bin

slide-31
SLIDE 31

Example 2: breaking Captchas using prob. prog.

31

[Diagram] X (alphanumeric strings) → simulator, with parameters θ (rotations, noise, warpings, etc.) → Y (Captcha images, e.g. "pX U 4 ∖2 x"); inference runs backwards from Y to X. Joint p(X, θ, Y); posterior p(X ∣ θ, Y = "pX U 4 ∖2 x") is a distribution on alphanumeric strings.

Le et al. AISTATS 2017

Performance: inference on tests < 100 ms; recognition rate 81% on Wikipedia Captchas, 41% on Facebook Captchas. “If you can create instances of captchas, you can break it!”

slide-32
SLIDE 32

32

[Diagram] X (alphanumeric strings) → simulator, with parameters θ (rotations, noise, warpings, etc.) → Y (Captcha images, e.g. "pX U 4 ∖2 x"); inference runs backwards from Y to X. Joint p(X, θ, Y); posterior p(X ∣ θ, Y = "pX U 4 ∖2 x").

How to automate the inversion of simulators?

  • 1. A prob. prog. lang. for coding simulators, equipped with a compiler yielding representations of joint distributions.
  • 2. A general-purpose inference engine that can work on arbitrary programs.

Le et al. AISTATS 2017

slide-33
SLIDE 33

Outline

  • 0. Probabilistic Programming
  • 1. Generative models as probabilistic programs
  • 2. Generic inference algorithms
  • a. approximate methods
  • b. an exact method by disintegration

33

slide-34
SLIDE 34

Good King Markov Puzzle*

34

* Richard McElreath: Statistical Rethinking, 2015

  • King Markov rules over 10 islands.
  • Island i has population 100 × i.
  • King Markov loves his people: he vows to visit each island as often as its population size.
  • How should he visit his islands?

[Q] Design an algorithm, subject to (1) no scheduling; no bookkeeping, (2) can only move among adjacent islands.

slide-35
SLIDE 35

35

  • King Markov rules over 10 islands.
  • Island i has population 100 × i.
  • King Markov loves his people: he vows to visit each island as often as its population size.
  • How should he visit his islands?

[Q] Design an algorithm, subject to (1) no scheduling; no bookkeeping, (2) can only move among adjacent islands.

Attempt 1. (Violates (1))

[Figure] A precomputed schedule, e.g. 7 days, 8 days, 9 days, 10 days on the corresponding islands.

Good King Markov Puzzle*

slide-36
SLIDE 36

Good King Markov Puzzle*

36

Attempt 2. (Violates (2))

i ~ discrete [1, .., 10]   % draws i with prob. i/55
Visit i
Repeat

  • King Markov rules over 10 islands.
  • Island i has population 100 × i.
  • King Markov loves his people: he vows to visit each island as often as its population size.
  • How should he visit his islands?

[Q] Design an algorithm, subject to (1) no scheduling; no bookkeeping, (2) can only move among adjacent islands.

slide-37
SLIDE 37

Solution: Metropolis Algorithm

— How to stumble towards an answer by randomisation.

37

  • 0. Randomly pick some island to visit.

Repeat:

  • 1. Randomly set proposed_island to be one of the two adjacent islands.
  • 2. If proposed_island > present_island then visit proposed_island next; else toss a coin of bias proposed_island / present_island: if heads then visit proposed_island next, else stay put.

  • N. Metropolis et al.: “Equation of State Calculations by Fast Computing Machines.” J. Chemical Physics, Vol. 21, 1953.

  • N. Metropolis (1915-99)
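A minimal Python sketch of this random walk (an illustration added here, assuming for concreteness that the 10 islands form a ring): in a long run, the visit frequencies approach the population proportions i/55.

import random
from collections import Counter

population = {i: 100 * i for i in range(1, 11)}   # island i has population 100*i

def propose(island):
    # one of the two adjacent islands, assuming the islands form a ring
    return (island - 1 + random.choice([-1, 1])) % 10 + 1

current = random.randint(1, 10)                   # 0. randomly pick some island
visits = Counter()
for _ in range(200_000):
    visits[current] += 1
    proposal = propose(current)
    # 2. always move to a bigger island; otherwise move with prob. = population ratio
    if random.random() < min(1.0, population[proposal] / population[current]):
        current = proposal

total = sum(visits.values())
print({i: round(visits[i] / total, 3) for i in range(1, 11)})   # ≈ i / 55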
slide-38
SLIDE 38

In the limit, King Markov will visit each island as often as its population size.

38

An example run of Metropolis algorithm on King Markov puzzle

Figure from web.stanford.edu/class/stats305a/MCMC.html

slide-39
SLIDE 39

39

“We tried to assemble the 10 algorithms with the greatest influence on the development and practice of science and engineering in the 20th century:

  1. Metropolis Algorithm for Monte Carlo. Through the use of random processes, this algorithm offers an efficient way to stumble toward answers to problems that are too complicated to solve exactly.
  2. Simplex Method for Linear Programming
  3. Krylov Subspace Iteration Methods
  4. The Decompositional Approach to Matrix Computations
  5. The Fortran Optimizing Compiler
  6. QR Algorithm for Computing Eigenvalues
  7. Quicksort Algorithm for Sorting
  8. Fast Fourier Transform
  9. Integer Relation Detection
  10. Fast Multipole Method”

Francis Sullivan & Jack Dongarra, guest editors. IEEE Computing in Sc. & Eng., Vol. 2, 2000.

slide-40
SLIDE 40

Metropolis-Hastings (MH) Algorithms

Goal: Generate samples of a target pdf p(x)/Z, where Z = ∫ p(x) dx (the normalising constant) is hard to compute, whereas the value of p(x) is easy to compute. Use an adaptive proposal q(x′ ∣ x), with x the previous state and x′ the newly proposed state. Assume q is easy to compute and sample from.

40

1. Initialise x1 randomly. Set n := 1.
2. Repeat:
  a. x′ ∼ q(x′ ∣ xn)
  b. α := min(1, (p(x′) ⋅ q(xn ∣ x′)) / (p(xn) ⋅ q(x′ ∣ xn)))
  c. u ∼ Uniform[0, 1]
  d. xn+1 := (if u ≤ α then x′ else xn); n := n + 1
3. Output samples xN, xN+1, …   % discard x1, …, xN−1 from the “burn-in” period
slide-41
SLIDE 41
  • Example. Let q(x′ ∣ x) = 𝒩(x, σ²) — Gaussian with mean x. We aim to MH-sample a bimodal Gaussian mixture distribution p(x).

41

[Figure] Initialise x0. Draw x1 from q(x1 ∣ x0), accept.

slide-42
SLIDE 42

42

[Figure] Initialise x0. Draw x1, accept. Draw x2 from q(x2 ∣ x1), accept.

  • Example (continued). q(x′ ∣ x) = 𝒩(x, σ²); the target p(x) is a bimodal Gaussian mixture.

slide-43
SLIDE 43

43

[Figure] Initialise x0. Draw x1, accept. Draw x2, accept. Draw x′ from q(x3 ∣ x2) but reject; set x3 := x2.

(Rejected because p(x′)/p(x2) is small, hence α := min(1, (p(x′) ⋅ q(x2 ∣ x′)) / (p(x2) ⋅ q(x′ ∣ x2))) is close to 0.)

  • Example (continued). q(x′ ∣ x) = 𝒩(x, σ²); the target p(x) is a bimodal Gaussian mixture.

slide-44
SLIDE 44

44

[Figure] Initialise x0. Draw x1, accept. Draw x2, accept. Draw but reject (x3 := x2). Draw x4 from q(x4 ∣ x3), accept.

  • Example (continued). q(x′ ∣ x) = 𝒩(x, σ²); the target p(x) is a bimodal Gaussian mixture.

slide-45
SLIDE 45

45

[Figure] Initialise x0. Draw x1, accept. Draw x2, accept. Draw but reject (x3 := x2). Draw x4, accept. Draw x5 from q(x5 ∣ x4), accept.

  • Example (continued). q(x′ ∣ x) = 𝒩(x, σ²); the target p(x) is a bimodal Gaussian mixture.

slide-46
SLIDE 46

46

[Figure] Same state as the previous slide: x0, x1, x2, x3 := x2, x4, x5.

slide-47
SLIDE 47

Markov Chain Monte Carlo (MCMC) Methods

  • A class of sampling algorithms: Metropolis, Metropolis-Hastings (MH), Gibbs sampling, etc.
  • Application: numerical approximation of high-dimensional integrals.
  • The MH algorithm is used to construct a Markov chain with invariant distribution π. Detailed balance — a sufficient condition for p to be the invariant distribution:

  p(xi) q(xi−1 ∣ xi) = p(xi−1) q(xi ∣ xi−1)

47

  • J. Gill: A Primer on Markov Chain Monte Carlo. In Bayesian Methods: A Social and Behavioral Sciences Approach, 2015.
  • C. Andrieu et al.: An Introduction to MCMC for Machine Learning. Machine Learning, 50, 2003.

Theorem (Convergence). Let P be the transition matrix of an ergodic Markov chain (i.e. irreducible, aperiodic and positive recurrent) with invariant distribution π. Then π is the equilibrium distribution, i.e. supA∈Σ |Pn(s, A) − π(A)| → 0 as n → ∞.
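As a small numerical illustration (added here, not from the slides), detailed balance can be checked directly for the King Markov chain of slide 37, whose MH kernel on the ring of 10 islands has invariant distribution p(i) ∝ i:

# target: p(i) proportional to i on islands 1..10, so p(i) = i/55
p = {i: i / 55 for i in range(1, 11)}

def transition(i, j):
    # one-step MH transition probability from island i to island j (i != j):
    # propose each neighbour with prob. 1/2, accept with prob. min(1, p(j)/p(i))
    neighbours = {(i - 2) % 10 + 1, i % 10 + 1}
    return 0.5 * min(1.0, p[j] / p[i]) if j in neighbours else 0.0

for i in range(1, 11):
    for j in range(1, 11):
        if i != j:
            assert abs(p[i] * transition(i, j) - p[j] * transition(j, i)) < 1e-12
print("detailed balance holds")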

slide-48
SLIDE 48
  • 1. Monte Carlo Methods
  • Sampling algorithms: Metropolis, Metropolis-Hastings (MH), MCMC, Gibbs sampling, importance sampling, sequential MC, etc.
  • E.g. MH constructs a Markov chain with the desired distribution as the equilibrium distribution; thus the samples drawn match the desired distribution in the limit.
  • Accurate (asymptotically) but do not scale.
  • Application: numerical approximation of high-dimensional integrals.

48

  • 2. Variational Inference (VI)
  • Transforms inference into an optimisation problem: identify the distribution qλ in a parameterised family that best approximates the posterior p by minimising

  KL(qλ ∣∣ p) := ∫ log(qλ(z) / p(z)) qλ(dz)

  • Black-box VI (score estimator, reparameterised gradient), ADVI, etc. (a minimal sketch of the reparameterised-gradient idea follows below)
  • Scalable; while an approximate solution is easy to obtain, it is hard to evaluate / improve it.
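As a minimal sketch of the reparameterised-gradient idea (an illustrative example added here, not from the slides): fit qλ = 𝒩(μ, σ²) to a target whose unnormalised log-density is known, by stochastic gradient ascent on the ELBO, writing z = μ + σε with ε ∼ 𝒩(0, 1).

import math, random

def grad_log_p(z):
    # gradient of the unnormalised log target; here the target is N(2, 1)
    return -(z - 2.0)

mu, rho = 0.0, 0.0                        # variational parameters; sigma = exp(rho)
lr, batch = 0.01, 32
for _ in range(5000):
    sigma = math.exp(rho)
    g_mu = g_rho = 0.0
    for _ in range(batch):
        eps = random.gauss(0.0, 1.0)
        z = mu + sigma * eps              # reparameterisation: z = mu + sigma * eps
        g = grad_log_p(z)
        g_mu += g / batch                 # d/d mu of E[log p(z)]
        g_rho += g * sigma * eps / batch  # chain rule: dz/d rho = sigma * eps
    g_rho += 1.0                          # derivative of the entropy term (log sigma = rho)
    mu += lr * g_mu                       # stochastic gradient ascent on the ELBO
    rho += lr * g_rho

print(mu, math.exp(rho))                  # converges to ≈ 2.0 and ≈ 1.0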

slide-49
SLIDE 49

Complexity aspects of MCMC

Mixing time is the time it takes for the time-t distribution of an MCMC to be approximately the invariant distribution. An MCMC is rapidly mixing if its mixing time is polynomial in the log of the number of states.

A #P-hard problem

Theorem (Jerrum, Sinclair & Vigoda 2001). The permanent of a matrix (with nonnegative entries) can be computed approximately in probabilistic polynomial time, up to an error of ϵM, where M is the value of the permanent and ϵ > 0 is arbitrary (i.e. an FPRAS). Critical step in the computation: use of a rapidly mixing MCMC whose invariant distribution is an almost-uniform distribution over all perfect matchings in a given bipartite graph (i.e. a fully polynomial almost uniform sampler, FPAUS).

49

slide-50
SLIDE 50

Outline

  • 0. Probabilistic Programming
  • 1. Generative models as probabilistic programs
  • 2. Generic inference algorithms
  • a. approximate methods
  • b. an exact method by disintegration

50

slide-51
SLIDE 51

Exact inference algorithms

  • 1. Existence: does the posterior P(θ ∣ d) always exist?
  • 2. Analytic (closed-form) solutions? These exist only in limited cases (e.g. when prior and posterior are conjugate distributions).

51

Bayesian Inference Problem: Given the joint dist. P(θ, d) and prior probability P(θ), compute the posterior probability P(θ ∣ d).

slide-52
SLIDE 52

Exact inference algorithms

  • 1. Existence: does the posterior P(θ ∣ d) always exist?
  • 2. Analytic (closed-form) solutions? These exist only in limited cases (e.g. when prior and posterior are conjugate distributions).

  • Ans. to Q.1 is yes for σ-finite (incl. probability) measures.

Problem: Probabilistic programs denote (exactly all the) s-finite measures (i.e. possibly infinite measures that are countable sums of finite measures). Need to establish basic measure-theoretic results for s-finite measures.

52

Bayesian Inference Problem: Given the joint dist. P(θ, d) and prior probability P(θ), compute the posterior probability P(θ ∣ d).

Staton, S.: Commutative Semantics for Probabilistic Programming. ESOP 2017.

[Diagram] probability ⊆ finite ⊆ σ-finite ⊆ s-finite measures

slide-53
SLIDE 53

New existence results for s-finite measures

Disintegration generalises conditional probability to arbitrary measures over general spaces, formalising a non-trivial restriction of a measure to a measure-0 subset. We extend the Radon-Nikodym Theorem to s-finite measures, and prove a disintegration theorem. Roughly, the disintegration theorem for s-finite measures tells us that, subject to certain (modified absolute continuity) conditions, given a joint distribution and a prior probability, the posterior distribution is guaranteed to exist — all denotable as probabilistic programs.

53

Theorem (Disintegration). Given standard Borel spaces X and Y, and s-finite measures μ on X × Y and ν on Y satisfying some modified absolute continuity conditions, there exists an essentially unique s-finite kernel k : Y ⇝ X s.t.

  μ = ν ⊗ k := λW. ∫Y ν(dy) ∫X k(y, dx) 𝟙W(x, y)

Vákár & Ong: On S-Finite Measures and Kernels. https://arxiv.org/pdf/1810.01837, 2018.

slide-54
SLIDE 54

New existence results for s-finite measures

Disintegration generalises conditional probability to arbitrary measures over general spaces, formalising a non-trivial restriction of a measure to a measure-0 subset. We extend the Radon-Nikodym Theorem to s-finite measures, and prove a disintegration theorem (below).

Goal: Build disintegration / density calculators, using techniques in meta-programming (e.g. lazy partial evaluation), giving exact (analytic) solutions where possible. Useful for the general problem of computing approximate inference. Tools: PSI (Gehr et al., ETH, 2016); Hakaru (Shan et al., 2017).

54

Theorem (Disintegration). Given standard Borel spaces X and Y, and s-finite measures μ on X × Y and ν on Y satisfying some absolute continuity conditions, there exists an essentially unique s-finite kernel k : Y ⇝ X s.t.

  μ = ν ⊗ k := λW. ∫Y ν(dy) ∫X k(y, dx) 𝟙W(x, y)

Vákár & Ong: On S-Finite Measures and Kernels. https://arxiv.org/pdf/1810.01837, 2018.

slide-55
SLIDE 55

Summary

Probabilistic programming is a general-purpose method of

  • 1. constructing probabilistic models as computer programs,
  • 2. automatically performing Bayesian inference.

Separating model construction from inference procedures democratises access to machine learning, with potentially huge benefits to AI and scientific modelling. Probabilistic programming poses unique and interesting challenges in programming languages (semantics and pragmatics) and the statistical foundations of machine learning.

55

[Diagram] Probabilistic Programming sits at the meeting point of: ML & AI (algorithms & applications; DL), Statistics (inference & theory), and Prog. Lang. (evaluators & semantics).

slide-56
SLIDE 56

Further directions

  • 1. Variational inference:
  • a. theory (less explored than Monte Carlo methods)
  • b. can we find better local optima? accelerate convergence?
  • c. are there better divergences than KL?
  • d. VI under-approximates posterior variances. Can we do better?
  • 2. Many inference algorithms have side conditions (e.g. ergodicity for MCMC; commutativity of ∇ and ∫ for score-gradient VI). Find corresponding “checkable” sufficient conditions on probabilistic programs.
  • 3. Design and implementation of probabilistic prog. langs. for Bayesian nonparametric models; semantics and inference algorithms for these languages.
  • 4. Exploit DL as a backend engine (e.g. amortised compilation); deep probabilistic prog. langs. (Pyro and Edward).

56