

SLIDE 1

Learning Objectives

At the end of the class you should be able to:

  • derive Bayesian learning from first principles
  • explain how the Beta and Dirichlet distributions are used for Bayesian learning.

© D. Poole and A. Mackworth 2019, Artificial Intelligence, Lecture 10.4


SLIDE 4

Model Averaging (Bayesian Learning)

We want to predict the output Y of a new case that has input X = x given the training examples e:

P(Y | x ∧ e) = Σ_{m∈M} P(Y ∧ m | x ∧ e)
             = Σ_{m∈M} P(Y | m ∧ x ∧ e) P(m | x ∧ e)
             = Σ_{m∈M} P(Y | m ∧ x) P(m | e)

M is a set of mutually exclusive and covering models (hypotheses). What assumptions are made here?
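The sum over models can be sketched in a few lines of Python. The three candidate models, the uniform prior, and the counts are invented for illustration: each model m simply fixes P(y) to a value p, so P(Y=y | m ∧ x) = p, and P(m | e) comes from Bayes' rule.

```python
# A minimal sketch of model averaging over a mutually exclusive and
# covering set of models. Models, prior, and data are hypothetical.

def model_average(models, prior, likelihood, predict):
    """P(Y=y | x ∧ e) = sum over m of P(y | m ∧ x) * P(m | e)."""
    post = [likelihood(m) * pr for m, pr in zip(models, prior)]
    z = sum(post)                        # P(e), the normalizing constant
    post = [w / z for w in post]         # posteriors now sum to 1
    return sum(predict(m) * w for m, w in zip(models, post))

n1, n0 = 2, 1                            # observed: 2 instances of y, 1 of ¬y
models = [0.2, 0.5, 0.8]                 # candidate values of P(y)
prior = [1 / 3, 1 / 3, 1 / 3]            # uniform prior over the models
likelihood = lambda p: p**n1 * (1 - p)**n0   # P(e | m), i.i.d. examples
predict = lambda p: p                    # P(y | m ∧ x) under model p

p_y = model_average(models, prior, likelihood, predict)
```

Because the prediction is weighted by the posterior, models that explain the data better (here, p = 0.8) pull the answer toward themselves without the others being discarded.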


SLIDE 5

Learning Under Uncertainty

The posterior probability of a model m given examples e:

P(m | e) = P(e | m) × P(m) / P(e)

The likelihood, P(e | m), is the probability that model m would have produced examples e. The prior, P(m), encodes the learning bias. P(e) is a normalizing constant so that the probabilities of the models sum to 1.


SLIDE 6

Plate Notation

Examples e = [e1, . . . , ek] are independent and identically distributed (i.i.d.) given m if

P(e | m) = ∏_{i=1}^{k} P(ei | m)

[Figure: plate notation. Left: m with children e1, e2, . . . , ek. Right: the same network drawn as a plate over ei, indexed by i.]
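The i.i.d. factorization means the likelihood of a whole data set is just a product of per-example likelihoods. A small sketch, with an invented model m that says P(ei = true | m) = 0.7:

```python
from math import prod

# P(e | m) = product of P(ei | m) for Boolean examples, i.i.d. given m.
# The model parameter and the example sequence are hypothetical.

def iid_likelihood(examples, p_true):
    """P(e | m) for a sequence of Boolean examples."""
    return prod(p_true if ei else 1 - p_true for ei in examples)

e = [True, True, False, True]
like = iid_likelihood(e, 0.7)            # 0.7 * 0.7 * 0.3 * 0.7
```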



SLIDE 9

Bayesian Learning of Probabilities

Y has two outcomes y and ¬y. We want the probability of y given training examples e. We can treat the probability of y as a real-valued random variable on the interval [0, 1], called φ. Bayes' rule gives:

P(φ=p | e) = P(e | φ=p) × P(φ=p) / P(e)

Suppose e is a sequence of n1 instances of y and n0 instances of ¬y:

P(e | φ=p) = p^n1 × (1 − p)^n0

Uniform prior: P(φ=p) = 1 for all p ∈ [0, 1].
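This posterior can be approximated numerically: evaluate the unnormalized posterior p^n1 (1 − p)^n0 (likelihood times the uniform prior, which is 1) on a grid and normalize. A sketch, with an arbitrary grid size:

```python
# Numerical posterior over φ under the uniform prior. Grid size is arbitrary.

def posterior_on_grid(n1, n0, steps=10_000):
    h = 1.0 / steps
    grid = [(i + 0.5) * h for i in range(steps)]     # midpoints of [0, 1]
    unnorm = [p**n1 * (1 - p)**n0 for p in grid]     # P(e | φ=p) * P(φ=p)
    z = sum(w * h for w in unnorm)                   # ≈ P(e)
    return grid, [w / z for w in unnorm]

grid, dens = posterior_on_grid(n1=2, n0=1)
mean = sum(p * d for p, d in zip(grid, dens)) / len(grid)   # ≈ E[φ | e]
```

With n1 = 2 and n0 = 1 the posterior is proportional to p²(1 − p), whose mean is 3/5; the grid estimate agrees to several decimal places.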


SLIDE 10

Posterior Probabilities for Different Training Examples (beta distribution)

[Figure: posterior densities P(φ | e) over φ ∈ [0, 1] for n0=0, n1=0; n0=1, n1=2; n0=2, n1=4; and n0=4, n1=8. The curves sharpen around φ = 2/3 as the counts grow.]


SLIDE 11

MAP model

The maximum a posteriori probability (MAP) model is the model m that maximizes P(m | e). That is, it maximizes

P(e | m) × P(m)

and thus minimizes

(− log P(e | m)) + (− log P(m))

which is the number of bits to send the examples e given the model m, plus the number of bits to send the model m.
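The equivalence is easy to check numerically: the same model is picked whether we maximize the product or minimize the code length. The candidate models (nine values of P(y)) and the counts below are invented for illustration:

```python
from math import log2

# Maximizing P(e|m) * P(m) and minimizing -log2 P(e|m) - log2 P(m)
# select the same model. Models and data are hypothetical.

n1, n0 = 8, 4
models = [i / 10 for i in range(1, 10)]          # 0.1, 0.2, ..., 0.9
prior = 1 / len(models)                          # uniform prior P(m)
lik = lambda m: m**n1 * (1 - m)**n0              # P(e | m)

map_by_product = max(models, key=lambda m: lik(m) * prior)
map_by_bits = min(models, key=lambda m: -log2(lik(m)) - log2(prior))
```

Because − log is monotonically decreasing, the argmax of the product is always the argmin of the bit count; the uniform prior contributes the same number of bits to every model, so here the data term decides.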


SLIDE 12

Averaging Over Models

Idea: rather than choosing the most likely model, average over all models, weighted by their posterior probabilities given the examples. If you have observed a sequence of n1 instances of y and n0 instances of ¬y, with a uniform prior:

◮ the most likely value (MAP) is n1 / (n0 + n1)
◮ the expected value is (n1 + 1) / (n0 + n1 + 2)
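The two point estimates can be written directly as functions (the function names are mine; note the MAP formula needs n0 + n1 > 0, while the expected value is defined even with no data):

```python
# Point estimates of P(y) under a uniform prior, from the slide's formulas.

def map_estimate(n1, n0):
    """Most likely value of φ: the posterior mode."""
    return n1 / (n0 + n1)

def expected_value(n1, n0):
    """Posterior mean of φ: Laplace's rule of succession."""
    return (n1 + 1) / (n0 + n1 + 2)

mode = map_estimate(2, 1)        # 2/3
mean = expected_value(2, 1)      # 3/5
```

The expected value never reaches 0 or 1, which is why averaging is more robust than MAP for small counts: with no data at all it returns 1/2 rather than being undefined.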



SLIDE 14

Beta Distribution

Beta_{α0,α1}(p) = (1/K) × p^(α1−1) × (1 − p)^(α0−1)

where K is a normalizing constant and αi > 0. The uniform distribution on [0, 1] is Beta_{1,1}. The expected value is α1 / (α0 + α1).

If the prior probability of a Boolean variable is Beta_{α0,α1}, the posterior distribution after observing n1 true cases and n0 false cases is:

Beta_{α0+n0, α1+n1}
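The conjugate update is just adding the observed counts to the Beta pseudocounts; a sketch (function names are mine):

```python
# Beta-Bernoulli conjugate update, as stated on the slide.

def beta_update(alpha0, alpha1, n0, n1):
    """Posterior parameters after n1 true and n0 false observations."""
    return alpha0 + n0, alpha1 + n1

def beta_mean(alpha0, alpha1):
    """Expected value α1 / (α0 + α1)."""
    return alpha1 / (alpha0 + alpha1)

a0, a1 = beta_update(1, 1, n0=4, n1=8)   # uniform prior Beta(1,1) -> Beta(5,9)
```

The posterior mean 9/14 sits between the MAP value 8/12 and the prior mean 1/2, pulled toward the data as the counts grow.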


SLIDE 15

Dirichlet distribution

Suppose Y has k values. The Dirichlet distribution has two sorts of parameters:

◮ positive counts α1, . . . , αk, where αi is one more than the count of the ith outcome
◮ probability parameters p1, . . . , pk, where pi is the probability of the ith outcome

Dirichlet_{α1,...,αk}(p1, . . . , pk) = (1/K) × ∏_{j=1}^{k} pj^(αj−1)

where K is a normalizing constant. The expected value of the ith outcome is αi / Σ_j αj.
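The expected values follow directly from the parameters. A sketch with invented counts for a three-valued Y, using a uniform prior (all pseudocounts 1):

```python
# Expected outcome probabilities under a Dirichlet posterior.
# The observed counts are hypothetical.

def dirichlet_mean(alpha):
    """Expected value of the ith outcome: alpha_i / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]

counts = [2, 3, 4]                  # observed counts for k = 3 outcomes
alpha = [1 + c for c in counts]     # αi is one more than the ith count
probs = dirichlet_mean(alpha)       # [3/12, 4/12, 5/12]
```

This generalizes the Beta case: for k = 2 it reduces to α1 / (α0 + α1), and the resulting probabilities always sum to 1.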


SLIDE 16

Hierarchical Bayesian Model

Where do the priors come from? Example: S_{XH} is true when patient X is sick in hospital H. We want to learn the probability of Sick for each hospital. Where do the prior probabilities for the hospitals come from?

[Figure: (a) a hierarchical plate model with hyperparameters α1, α2, a per-hospital parameter φH, and observations S_{XH}, with plates over X and H; (b) the same model unrolled, with φ1, φ2, . . . , φk and observations S11, S12, . . . , S21, S22, . . . , S1k, . . . .]
