

SLIDE 1

Learning Objectives

At the end of the class you should be able to:

  • derive Bayesian learning from first principles
  • explain how the Beta and Dirichlet distributions are used for Bayesian learning.

© D. Poole and A. Mackworth 2019, Artificial Intelligence, Lecture 10.4


SLIDE 4

Model Averaging (Bayesian Learning)

We want to predict the output Y of a new case that has input X = x given the training examples e:

P(Y | x ∧ e) = Σ_{m∈M} P(Y ∧ m | x ∧ e)
             = Σ_{m∈M} P(Y | m ∧ x ∧ e) P(m | x ∧ e)
             = Σ_{m∈M} P(Y | m ∧ x) P(m | e)

M is a set of mutually exclusive and covering models (hypotheses). What assumptions are made here?
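The sum over models can be sketched in a few lines of Python. The three candidate models, the uniform prior, and the counts are invented for illustration: each model m simply fixes P(y) to a value p, so P(Y=y | m ∧ x) = p, and P(m | e) comes from Bayes' rule.

```python
# A minimal sketch of model averaging over a mutually exclusive and
# covering set of models. Models, prior, and data are hypothetical.

def model_average(models, prior, likelihood, predict):
    """P(Y=y | x ∧ e) = sum over m of P(y | m ∧ x) * P(m | e)."""
    post = [likelihood(m) * pr for m, pr in zip(models, prior)]
    z = sum(post)                        # P(e), the normalizing constant
    post = [w / z for w in post]         # posteriors now sum to 1
    return sum(predict(m) * w for m, w in zip(models, post))

n1, n0 = 2, 1                            # observed: 2 instances of y, 1 of ¬y
models = [0.2, 0.5, 0.8]                 # candidate values of P(y)
prior = [1 / 3, 1 / 3, 1 / 3]            # uniform prior over the models
likelihood = lambda p: p**n1 * (1 - p)**n0   # P(e | m), i.i.d. examples
predict = lambda p: p                    # P(y | m ∧ x) under model p

p_y = model_average(models, prior, likelihood, predict)
```

Because the prediction is weighted by the posterior, models that explain the data better (here, p = 0.8) pull the answer toward themselves without the others being discarded.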


SLIDE 5

Learning Under Uncertainty

The posterior probability of a model m given examples e:

P(m | e) = P(e | m) × P(m) / P(e)

The likelihood, P(e | m), is the probability that model m would have produced examples e. The prior, P(m), encodes the learning bias. P(e) is a normalizing constant so that the probabilities of the models sum to 1.


SLIDE 6

Plate Notation

Examples e = [e1, . . . , ek] are independent and identically distributed (i.i.d.) given m if

P(e | m) = ∏_{i=1}^{k} P(ei | m)

[Figure: plate notation. Left: m with children e1, e2, . . . , ek. Right: the same network drawn as a plate over ei, indexed by i.]
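The i.i.d. factorization means the likelihood of a whole data set is just a product of per-example likelihoods. A small sketch, with an invented model m that says P(ei = true | m) = 0.7:

```python
from math import prod

# P(e | m) = product of P(ei | m) for Boolean examples, i.i.d. given m.
# The model parameter and the example sequence are hypothetical.

def iid_likelihood(examples, p_true):
    """P(e | m) for a sequence of Boolean examples."""
    return prod(p_true if ei else 1 - p_true for ei in examples)

e = [True, True, False, True]
like = iid_likelihood(e, 0.7)            # 0.7 * 0.7 * 0.3 * 0.7
```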



SLIDE 9

Bayesian Learning of Probabilities

Y has two outcomes y and ¬y. We want the probability of y given training examples e. We can treat the probability of y as a real-valued random variable on the interval [0, 1], called φ. Bayes' rule gives:

P(φ=p | e) = P(e | φ=p) × P(φ=p) / P(e)

Suppose e is a sequence of n1 instances of y and n0 instances of ¬y:

P(e | φ=p) = p^n1 × (1 − p)^n0

Uniform prior: P(φ=p) = 1 for all p ∈ [0, 1].
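This posterior can be approximated numerically: evaluate the unnormalized posterior p^n1 (1 − p)^n0 (likelihood times the uniform prior, which is 1) on a grid and normalize. A sketch, with an arbitrary grid size:

```python
# Numerical posterior over φ under the uniform prior. Grid size is arbitrary.

def posterior_on_grid(n1, n0, steps=10_000):
    h = 1.0 / steps
    grid = [(i + 0.5) * h for i in range(steps)]     # midpoints of [0, 1]
    unnorm = [p**n1 * (1 - p)**n0 for p in grid]     # P(e | φ=p) * P(φ=p)
    z = sum(w * h for w in unnorm)                   # ≈ P(e)
    return grid, [w / z for w in unnorm]

grid, dens = posterior_on_grid(n1=2, n0=1)
mean = sum(p * d for p, d in zip(grid, dens)) / len(grid)   # ≈ E[φ | e]
```

With n1 = 2 and n0 = 1 the posterior is proportional to p²(1 − p), whose mean is 3/5; the grid estimate agrees to several decimal places.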


SLIDE 10

Posterior Probabilities for Different Training Examples (beta distribution)

[Figure: posterior densities P(φ | e) over φ ∈ [0, 1] for n0=0, n1=0; n0=1, n1=2; n0=2, n1=4; and n0=4, n1=8. The curves sharpen around φ = 2/3 as the counts grow.]


SLIDE 11

MAP model

The maximum a posteriori probability (MAP) model is the model m that maximizes P(m | e). That is, it maximizes

P(e | m) × P(m)

and thus minimizes

(− log P(e | m)) + (− log P(m))

which is the number of bits to send the examples e given the model m, plus the number of bits to send the model m.
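The equivalence is easy to check numerically: the same model is picked whether we maximize the product or minimize the code length. The candidate models (nine values of P(y)) and the counts below are invented for illustration:

```python
from math import log2

# Maximizing P(e|m) * P(m) and minimizing -log2 P(e|m) - log2 P(m)
# select the same model. Models and data are hypothetical.

n1, n0 = 8, 4
models = [i / 10 for i in range(1, 10)]          # 0.1, 0.2, ..., 0.9
prior = 1 / len(models)                          # uniform prior P(m)
lik = lambda m: m**n1 * (1 - m)**n0              # P(e | m)

map_by_product = max(models, key=lambda m: lik(m) * prior)
map_by_bits = min(models, key=lambda m: -log2(lik(m)) - log2(prior))
```

Because − log is monotonically decreasing, the argmax of the product is always the argmin of the bit count; the uniform prior contributes the same number of bits to every model, so here the data term decides.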


SLIDE 12

Averaging Over Models

Idea: rather than choosing the most likely model, average over all models, weighted by their posterior probabilities given the examples. If you have observed a sequence of n1 instances of y and n0 instances of ¬y, with a uniform prior:

◮ the most likely value (MAP) is n1 / (n0 + n1)
◮ the expected value is (n1 + 1) / (n0 + n1 + 2)
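The two point estimates can be written directly as functions (the function names are mine; note the MAP formula needs n0 + n1 > 0, while the expected value is defined even with no data):

```python
# Point estimates of P(y) under a uniform prior, from the slide's formulas.

def map_estimate(n1, n0):
    """Most likely value of φ: the posterior mode."""
    return n1 / (n0 + n1)

def expected_value(n1, n0):
    """Posterior mean of φ: Laplace's rule of succession."""
    return (n1 + 1) / (n0 + n1 + 2)

mode = map_estimate(2, 1)        # 2/3
mean = expected_value(2, 1)      # 3/5
```

The expected value never reaches 0 or 1, which is why averaging is more robust than MAP for small counts: with no data at all it returns 1/2 rather than being undefined.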



SLIDE 14

Beta Distribution

Beta_{α0,α1}(p) = (1/K) × p^(α1−1) × (1 − p)^(α0−1)

where K is a normalizing constant and αi > 0. The uniform distribution on [0, 1] is Beta_{1,1}. The expected value is α1 / (α0 + α1).

If the prior probability of a Boolean variable is Beta_{α0,α1}, the posterior distribution after observing n1 true cases and n0 false cases is:

Beta_{α0+n0, α1+n1}
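The conjugate update is just adding the observed counts to the Beta pseudocounts; a sketch (function names are mine):

```python
# Beta-Bernoulli conjugate update, as stated on the slide.

def beta_update(alpha0, alpha1, n0, n1):
    """Posterior parameters after n1 true and n0 false observations."""
    return alpha0 + n0, alpha1 + n1

def beta_mean(alpha0, alpha1):
    """Expected value α1 / (α0 + α1)."""
    return alpha1 / (alpha0 + alpha1)

a0, a1 = beta_update(1, 1, n0=4, n1=8)   # uniform prior Beta(1,1) -> Beta(5,9)
```

The posterior mean 9/14 sits between the MAP value 8/12 and the prior mean 1/2, pulled toward the data as the counts grow.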


SLIDE 15

Dirichlet distribution

Suppose Y has k values. The Dirichlet distribution has two sorts of parameters:

◮ positive counts α1, . . . , αk, where αi is one more than the count of the ith outcome
◮ probability parameters p1, . . . , pk, where pi is the probability of the ith outcome

Dirichlet_{α1,...,αk}(p1, . . . , pk) = (1/K) × ∏_{j=1}^{k} pj^(αj−1)

where K is a normalizing constant. The expected value of the ith outcome is αi / Σ_j αj.
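The expected values follow directly from the parameters. A sketch with invented counts for a three-valued Y, using a uniform prior (all pseudocounts 1):

```python
# Expected outcome probabilities under a Dirichlet posterior.
# The observed counts are hypothetical.

def dirichlet_mean(alpha):
    """Expected value of the ith outcome: alpha_i / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]

counts = [2, 3, 4]                  # observed counts for k = 3 outcomes
alpha = [1 + c for c in counts]     # αi is one more than the ith count
probs = dirichlet_mean(alpha)       # [3/12, 4/12, 5/12]
```

This generalizes the Beta case: for k = 2 it reduces to α1 / (α0 + α1), and the resulting probabilities always sum to 1.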


SLIDE 16

Hierarchical Bayesian Model

Where do the priors come from? Example: S_{XH} is true when patient X is sick in hospital H. We want to learn the probability of Sick for each hospital. Where do the prior probabilities for the hospitals come from?

[Figure: (a) a hierarchical plate model with hyperparameters α1, α2, a per-hospital parameter φH, and observations S_{XH}, with plates over X and H; (b) the same model unrolled, with φ1, φ2, . . . , φk and observations S11, S12, . . . , S21, S22, . . . , S1k, . . . .]
