
Learning Objectives



  1. Learning Objectives
     At the end of the class you should be able to:
     - derive Bayesian learning from first principles
     - explain how the Beta and Dirichlet distributions are used for Bayesian learning.
     © D. Poole and A. Mackworth 2019, Artificial Intelligence, Lecture 10.4

  4. Model Averaging (Bayesian Learning)
     We want to predict the output $Y$ of a new case that has input $X = x$, given the training examples $e$:
     $$P(Y \mid x \wedge e) = \sum_{m \in M} P(Y \wedge m \mid x \wedge e)$$
     $$= \sum_{m \in M} P(Y \mid m \wedge x \wedge e)\, P(m \mid x \wedge e)$$
     $$= \sum_{m \in M} P(Y \mid m \wedge x)\, P(m \mid e)$$
     $M$ is a set of mutually exclusive and covering models (hypotheses).
     What assumptions are made here?
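
The last line of the derivation drops $e$ from the first factor and $x$ from the second. A minimal sketch of that final sum, assuming a finite set of models and a Boolean output; the model names, the example numbers, and the function `model_average` are hypothetical and not from the slides:

```python
def model_average(models, posterior, x):
    """Return P(Y = true | x, e) = sum over m of P(Y = true | m, x) * P(m | e).

    models:    dict mapping a model name to a function x -> P(Y = true | m, x)
    posterior: dict mapping the same names to P(m | e); values must sum to 1
    """
    assert abs(sum(posterior.values()) - 1.0) < 1e-9
    return sum(models[name](x) * posterior[name] for name in models)


# Hypothetical models: each returns P(Y = true | m, x) for a numeric input x.
models = {
    "always_half": lambda x: 0.5,
    "threshold": lambda x: 0.9 if x > 0 else 0.1,
}
# Hypothetical posterior P(m | e), assumed to have been computed already.
posterior = {"always_half": 0.25, "threshold": 0.75}

prediction = model_average(models, posterior, x=1.0)  # roughly 0.8
```

Every model contributes to the prediction in proportion to its posterior weight, so no single model has to be chosen.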

  5. Learning Under Uncertainty
     The posterior probability of a model $m$ given examples $e$:
     $$P(m \mid e) = \frac{P(e \mid m) \times P(m)}{P(e)}$$
     The likelihood, $P(e \mid m)$, is the probability that model $m$ would have produced examples $e$.
     The prior, $P(m)$, encodes the learning bias.
     $P(e)$ is a normalizing constant so the probabilities of the models sum to 1.
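
As an illustration of the normalizing role of $P(e)$, here is a small sketch for a finite set of models. The two models and their numbers are made up for the example:

```python
def posterior(prior, likelihood):
    """Return P(m | e) for each model m, given P(m) and P(e | m)."""
    unnormalized = {m: likelihood[m] * prior[m] for m in prior}
    p_e = sum(unnormalized.values())  # P(e), the normalizing constant
    return {m: value / p_e for m, value in unnormalized.items()}


# Hypothetical numbers for two models.
prior = {"m1": 0.5, "m2": 0.5}        # P(m)
likelihood = {"m1": 0.2, "m2": 0.6}   # P(e | m)
post = posterior(prior, likelihood)   # m1 gets about 0.25, m2 about 0.75
```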

  6. Plate Notation
     Examples $e = [e_1, \ldots, e_k]$ are independent and identically distributed (i.i.d.) given $m$ if
     $$P(e \mid m) = \prod_{i=1}^{k} P(e_i \mid m)$$
     [Figure: the same dependency drawn two ways: in plate notation, with $m$ as the parent of $e_i$ inside a plate indexed by $i$, and unrolled, with $m$ as the parent of $e_1, e_2, \ldots, e_k$.]
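
The product form can be sketched directly. The log version below is an addition that is not on the slide; it is the usual way to avoid floating-point underflow when $k$ is large. Function names are illustrative:

```python
import math


def iid_likelihood(example_probs):
    """P(e | m) as the product of P(e_i | m) over all examples."""
    return math.prod(example_probs)


def iid_log_likelihood(example_probs):
    """log P(e | m) as a sum of logs; numerically safer for many examples."""
    return sum(math.log(p) for p in example_probs)


# Hypothetical per-example probabilities P(e_i | m) under one model m.
probs = [0.5, 0.25, 0.5]
likelihood = iid_likelihood(probs)          # 0.0625
log_likelihood = iid_log_likelihood(probs)  # equals log(0.0625) up to rounding
```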

  9. Bayesian Learning of Probabilities
     $Y$ has two outcomes, $y$ and $\neg y$. We want the probability of $y$ given training examples $e$.
     We can treat the probability of $y$ as a real-valued random variable on the interval $[0, 1]$, called $\phi$. Bayes' rule gives:
     $$P(\phi = p \mid e) = \frac{P(e \mid \phi = p) \times P(\phi = p)}{P(e)}$$
     Suppose $e$ is a sequence of $n_1$ instances of $y$ and $n_0$ instances of $\neg y$:
     $$P(e \mid \phi = p) = p^{n_1} \times (1 - p)^{n_0}$$
     Uniform prior: $P(\phi = p) = 1$ for all $p \in [0, 1]$.
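
One way to see what this posterior looks like is to evaluate it on a grid. The sketch below is a numerical approximation (midpoint rule) under the uniform prior, not a method taken from the slides; `grid_posterior` and `steps` are names chosen for the example:

```python
def grid_posterior(n1, n0, steps=1000):
    """Approximate the posterior density of phi on a grid, uniform prior."""
    width = 1.0 / steps
    grid = [(i + 0.5) * width for i in range(steps)]   # interval midpoints
    # Likelihood p^n1 * (1 - p)^n0 times the uniform prior density, which is 1.
    unnormalized = [p ** n1 * (1 - p) ** n0 for p in grid]
    p_e = sum(unnormalized) * width                    # approximates P(e)
    return grid, [u / p_e for u in unnormalized]


grid, density = grid_posterior(n1=2, n0=1)
mode = grid[density.index(max(density))]               # close to 2/3
mean = sum(p * d for p, d in zip(grid, density)) / len(grid)  # close to 3/5
```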

  10. Posterior Probabilities for Different Training Examples (beta distribution)
     [Figure: posterior density $P(\phi \mid e)$ plotted against $\phi \in [0, 1]$ for $(n_0, n_1) = (0, 0), (1, 2), (2, 4), (4, 8)$.]

  11. MAP model
     The maximum a posteriori probability (MAP) model is the model $m$ that maximizes $P(m \mid e)$. That is, it maximizes:
     $$P(e \mid m) \times P(m)$$
     Thus it minimizes:
     $$(-\log P(e \mid m)) + (-\log P(m))$$
     which is the number of bits to send the examples, $e$, given the model $m$, plus the number of bits to send the model $m$.
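
A small sketch of the description-length view, assuming base-2 logarithms so that the quantities are in bits. The two models and their probabilities are hypothetical:

```python
import math


def description_length_bits(prior_m, likelihood_m):
    """Bits to send the examples given the model, plus bits to send the model."""
    return -math.log2(likelihood_m) - math.log2(prior_m)


def map_model(prior, likelihood):
    """Return the model with the smallest total description length."""
    return min(
        prior,
        key=lambda m: description_length_bits(prior[m], likelihood[m]),
    )


# Hypothetical models: the complex one fits the data better but is less likely a priori.
prior = {"simple": 0.75, "complex": 0.25}        # P(m)
likelihood = {"simple": 0.125, "complex": 0.25}  # P(e | m)
best = map_model(prior, likelihood)              # "simple"
```

Here the simple model wins because its higher prior outweighs its lower likelihood: 0.125 × 0.75 exceeds 0.25 × 0.25.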

  12. Averaging Over Models
     Idea: Rather than choosing the most likely model, average over all models, weighted by their posterior probabilities given the examples.
     If you have observed a sequence of $n_1$ instances of $y$ and $n_0$ instances of $\neg y$, with uniform prior:
     - the most likely value (MAP) is $\frac{n_1}{n_0 + n_1}$
     - the expected value is $\frac{n_1 + 1}{n_0 + n_1 + 2}$
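
The two estimates differ most when there is little data. A minimal sketch using exact fractions; the function names are chosen for the example:

```python
from fractions import Fraction


def map_estimate(n1, n0):
    """Most likely value of phi under a uniform prior (needs n0 + n1 > 0)."""
    return Fraction(n1, n0 + n1)


def expected_value(n1, n0):
    """Posterior expected value of phi under a uniform prior."""
    return Fraction(n1 + 1, n0 + n1 + 2)


# Three positive examples and no negative ones.
print(map_estimate(3, 0))     # 1
print(expected_value(3, 0))   # 4/5
# No data at all: the expected value falls back to 1/2.
print(expected_value(0, 0))   # 1/2
```

With three positives and no negatives, the MAP estimate says $y$ is certain, while the expected value still leaves room for $\neg y$.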

  14. Beta Distribution
     $$\text{Beta}^{\alpha_0, \alpha_1}(p) = \frac{1}{K}\, p^{\alpha_1 - 1} \times (1 - p)^{\alpha_0 - 1}$$
     where $K$ is a normalizing constant and $\alpha_i > 0$.
     The uniform distribution on $[0, 1]$ is $\text{Beta}^{1,1}$.
     The expected value is $\alpha_1 / (\alpha_0 + \alpha_1)$.
     If the prior probability of a Boolean variable is $\text{Beta}^{\alpha_0, \alpha_1}$, the posterior distribution after observing $n_1$ true cases and $n_0$ false cases is:
     $$\text{Beta}^{\alpha_0 + n_0,\ \alpha_1 + n_1}$$
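
The update rule amounts to adding counts to the parameters. A minimal sketch that keeps the slide's parameter order ($\alpha_0$ for false, $\alpha_1$ for true); the function names are illustrative:

```python
def beta_update(alpha0, alpha1, n0, n1):
    """Posterior Beta parameters after n0 false and n1 true observations."""
    return alpha0 + n0, alpha1 + n1


def beta_mean(alpha0, alpha1):
    """Expected value alpha1 / (alpha0 + alpha1)."""
    return alpha1 / (alpha0 + alpha1)


# Start from the uniform prior Beta(1, 1); observe 1 false and 2 true cases.
alpha0, alpha1 = beta_update(1, 1, n0=1, n1=2)   # (2, 3)
mean = beta_mean(alpha0, alpha1)                 # 3 / 5 = 0.6

# Updating one batch at a time gives the same parameters as a single update.
step = beta_update(*beta_update(1, 1, n0=1, n1=0), n0=0, n1=2)
```

Starting from the uniform prior, the mean matches the expected value $(n_1 + 1) / (n_0 + n_1 + 2)$.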

  15. Dirichlet distribution
     Suppose $Y$ has $k$ values. The Dirichlet distribution has two sorts of parameters:
     - positive counts $\alpha_1, \ldots, \alpha_k$, where $\alpha_i$ is one more than the count of the $i$th outcome
     - probability parameters $p_1, \ldots, p_k$, where $p_i$ is the probability of the $i$th outcome
     $$\text{Dirichlet}^{\alpha_1, \ldots, \alpha_k}(p_1, \ldots, p_k) = \frac{1}{K} \prod_{j=1}^{k} p_j^{\alpha_j - 1}$$
     where $K$ is a normalizing constant.
     The expected value of the $i$th outcome is
     $$\frac{\alpha_i}{\sum_j \alpha_j}$$
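
The same counting idea extends to $k$ outcomes. The sketch below assumes the update adds observed counts to the parameters, by analogy with the Beta case; the slide itself only states that $\alpha_i$ is one more than the count. Function names are illustrative:

```python
def dirichlet_update(alphas, counts):
    """Add the observed count of each outcome to its parameter."""
    return [a + c for a, c in zip(alphas, counts)]


def dirichlet_mean(alphas):
    """Expected probability of each outcome: alpha_i / sum_j alpha_j."""
    total = sum(alphas)
    return [a / total for a in alphas]


# Three outcomes, starting with all alpha_i = 1 (no observations yet).
alphas = dirichlet_update([1, 1, 1], counts=[2, 0, 1])   # [3, 1, 2]
means = dirichlet_mean(alphas)                           # [1/2, 1/6, 1/3]
```

The outcome that was never observed still gets a nonzero expected probability.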

  16. Hierarchical Bayesian Model
     Where do the priors come from?
     Example: $S_{XH}$ is true when patient $X$ is sick in hospital $H$.
     We want to learn the probability of Sick for each hospital.
     Where do the prior probabilities for the hospitals come from?
     [Figure: (a) plate model with shared parameters $\alpha_1, \alpha_2$, a parameter $\phi_H$ for each hospital $H$, and $S_{XH}$ for each patient $X$; (b) the unrolled network with $\phi_1, \phi_2, \ldots, \phi_k$ and $S_{11}, S_{12}, S_{21}, S_{22}, \ldots, S_{1k}$.]
