Bayesian networks (Chapter 14, Sections 1–4; based on AIMA slides)
SLIDE 1

Bayesian networks

Chapter 14, Sections 1–4

Artificial Intelligence, spring 2013, Peter Ljunglöf; based on AIMA Slides © Stuart Russell and Peter Norvig, 2004

SLIDE 2

Bayesian networks

A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions

Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ “directly influences”)
– a conditional distribution for each node given its parents: P(Xi|Parents(Xi))

In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values


SLIDE 3

Example

The topology of a network encodes conditional independence assertions:

[Network diagram: an isolated node Weather; a node Cavity with children Toothache and Catch]

Weather is independent of the other variables.
Toothache and Catch are conditionally independent given Cavity.


SLIDE 4

Example

I’m at work. My neighbor John calls to say my alarm is ringing, but my neighbor Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

The network topology reflects our “causal” knowledge:
– a burglar can trigger the alarm
– an earthquake can trigger the alarm
– the alarm can cause Mary to call
– the alarm can cause John to call


SLIDE 5

Example contd.

[Network diagram: Burglary → Alarm ← Earthquake; Alarm → JohnCalls and Alarm → MaryCalls]

P(B) = .001        P(E) = .002

B | E | P(A|B,E)
T | T | .95
T | F | .94
F | T | .29
F | F | .001

A | P(J|A)
T | .90
F | .05

A | P(M|A)
T | .70
F | .01
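These tables can be written down directly as data. A minimal sketch (hypothetical representation, not from the slides) stores each node’s parents together with a CPT mapping parent-value tuples to P(node = true):

```python
# Burglary network: node -> (parents, CPT). Each CPT maps a tuple of parent
# values to P(node = True); the numbers are taken from the tables above.
network = {
    "Burglary":   ((), {(): 0.001}),
    "Earthquake": ((), {(): 0.002}),
    "Alarm": (("Burglary", "Earthquake"),
              {(True, True): 0.95, (True, False): 0.94,
               (False, True): 0.29, (False, False): 0.001}),
    "JohnCalls": (("Alarm",), {(True,): 0.90, (False,): 0.05}),
    "MaryCalls": (("Alarm",), {(True,): 0.70, (False,): 0.01}),
}

def prob(node, value, assignment):
    """P(node = value), given its parents' values in `assignment`."""
    parents, cpt = network[node]
    p_true = cpt[tuple(assignment[p] for p in parents)]
    return p_true if value else 1.0 - p_true

print(prob("Alarm", True, {"Burglary": True, "Earthquake": False}))  # 0.94
```

The same dictionary shape is reused in the inference sketches on later slides.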


SLIDE 6

Compactness

A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values.

Each row requires one number p for Xi = true (the number for Xi = false is just 1 − p).

If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers, i.e., it grows linearly with n, vs. O(2^n) for the full joint distribution.

For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
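The parameter count can be checked mechanically. A quick sketch (parent structure hard-coded from the burglary net) sums 2^k numbers per node and compares with the full joint:

```python
# One number per CPT row: 2^k rows for a Boolean node with k Boolean parents.
parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}

network_numbers = sum(2 ** len(ps) for ps in parents.values())  # 1+1+4+2+2
full_joint_numbers = 2 ** len(parents) - 1                      # 2^5 - 1
print(network_numbers, full_joint_numbers)  # 10 31
```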


SLIDE 7

Global semantics

The global semantics defines the full joint distribution as the product of the local conditional distributions:

P(x1, . . . , xn) = Π_{i=1..n} P(xi | parents(Xi))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j|a) P(m|a) P(a|¬b, ¬e) P(¬b) P(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063
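The arithmetic in this example can be verified directly; a small sketch using the CPT values from the burglary net:

```python
# P(j, m, a, ¬b, ¬e) as a product of local conditionals (burglary-net CPTs).
p_j_a, p_m_a = 0.90, 0.70          # P(j|a), P(m|a)
p_a_nb_ne = 0.001                  # P(a|¬b, ¬e)
p_nb, p_ne = 1 - 0.001, 1 - 0.002  # P(¬b), P(¬e)

p = p_j_a * p_m_a * p_a_nb_ne * p_nb * p_ne
print(round(p, 5))  # 0.00063
```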


SLIDE 8

Markov blanket

Theorem: Each node is conditionally independent of all others given its Markov blanket: parents + children + children’s parents

[Figure: node X with parents U1 … Um, children Y1 … Yn, and the children’s other parents Z1j … Znj]
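The Markov blanket can be read off the graph mechanically. A sketch (hypothetical helper `markov_blanket`, graph hard-coded from the burglary net):

```python
# Markov blanket of x: its parents, its children, and the children's
# other parents ("spouses").
parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}

def markov_blanket(x):
    children = [n for n, ps in parents.items() if x in ps]
    blanket = set(parents[x]) | set(children)
    for c in children:              # add the children's other parents
        blanket |= set(parents[c])
    blanket.discard(x)
    return blanket

print(markov_blanket("Burglary"))   # {'Alarm', 'Earthquake'}
```

For Burglary, the blanket is its child Alarm plus Alarm’s other parent Earthquake; Alarm’s blanket is the whole rest of the network.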


SLIDE 9

Constructing Bayesian networks

We need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics

1. Choose an ordering of variables X1, . . . , Xn
2. For i = 1 to n:
   add Xi to the network
   select parents from X1, . . . , Xi−1 such that P(Xi | Parents(Xi)) = P(Xi | X1, . . . , Xi−1)

This choice of parents guarantees the global semantics:

P(X1, . . . , Xn) = Π_{i=1..n} P(Xi | X1, . . . , Xi−1)   (chain rule)
                  = Π_{i=1..n} P(Xi | Parents(Xi))        (by construction)


SLIDE 10

Example

Suppose we choose the ordering M, J, A, B, E

[Diagram: nodes so far: MaryCalls, JohnCalls]

P(J|M) = P(J)?


SLIDE 11

Example

Suppose we choose the ordering M, J, A, B, E

[Diagram: nodes so far: MaryCalls, JohnCalls, Alarm]

P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? P(A|J, M) = P(A)?


SLIDE 12

Example

Suppose we choose the ordering M, J, A, B, E

[Diagram: nodes so far: MaryCalls, JohnCalls, Alarm, Burglary]

P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No
P(B|A, J, M) = P(B|A)? P(B|A, J, M) = P(B)?


SLIDE 13

Example

Suppose we choose the ordering M, J, A, B, E

[Diagram: nodes so far: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake]

P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No
P(B|A, J, M) = P(B|A)? Yes
P(B|A, J, M) = P(B)? No
P(E|B, A, J, M) = P(E|A)? P(E|B, A, J, M) = P(E|A, B)?


SLIDE 14

Example

Suppose we choose the ordering M, J, A, B, E

[Diagram: nodes so far: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake]

P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No
P(B|A, J, M) = P(B|A)? Yes
P(B|A, J, M) = P(B)? No
P(E|B, A, J, M) = P(E|A)? No
P(E|B, A, J, M) = P(E|A, B)? Yes


SLIDE 15

Example contd.

[Diagram: final network for the ordering M, J, A, B, E]

Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed.
Compare with the original burglary net: 1 + 1 + 4 + 2 + 2 = 10 numbers.
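The two counts can be compared mechanically; a quick sketch with both parent structures hard-coded (the less compact one following the ordering M, J, A, B, E built on the previous slides):

```python
# CPT-entry counts for two topologies over the same five Boolean variables.
mjabe_parents = {"MaryCalls": [], "JohnCalls": ["MaryCalls"],
                 "Alarm": ["MaryCalls", "JohnCalls"],
                 "Burglary": ["Alarm"],
                 "Earthquake": ["Alarm", "Burglary"]}
causal_parents = {"Burglary": [], "Earthquake": [],
                  "Alarm": ["Burglary", "Earthquake"],
                  "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"]}

count = lambda graph: sum(2 ** len(ps) for ps in graph.values())
print(count(mjabe_parents), count(causal_parents))  # 13 10
```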


SLIDE 16

Example contd.

The chosen ordering of the variables can have a big impact on the size of the network! Network (b) has 2^5 − 1 = 31 numbers, exactly the same as the full joint distribution


SLIDE 17

Inference tasks

Simple queries: compute posterior marginal P(Xi|E = e)
e.g., P(Burglary|JohnCalls = true, MaryCalls = true), or shorter, P(B|j, m)

Conjunctive queries: P(Xi, Xj|E = e) = P(Xi|E = e) P(Xj|Xi, E = e)

Optimal decisions: decision networks include utility information; probabilistic inference required for P(outcome|action, evidence)

Value of information: which evidence to seek next?

Sensitivity analysis: which probability values are most critical?

Explanation: why do I need a new starter motor?


SLIDE 18

Inference by enumeration

Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation.

Simple query on the burglary network:

P(B|j, m) = P(B, j, m)/P(j, m)
= α P(B, j, m)
= α Σe Σa P(B, e, a, j, m)     (where e and a are the hidden variables)

Rewrite full joint entries using product of CPT entries:

P(B|j, m) = α Σe Σa P(B) P(e) P(a|B, e) P(j|a) P(m|a)
= α P(B) Σe P(e) Σa P(a|B, e) P(j|a) P(m|a)

Recursive depth-first enumeration: O(n) space, O(d^n) time
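The depth-first enumeration can be sketched compactly (hypothetical helper names `enumerate_all` and `query`; CPT values from the burglary net). It reproduces the well-known posterior P(B|j, m) ≈ 0.284:

```python
# Inference by enumeration on the burglary network.
net = {
    "Burglary":   ((), {(): 0.001}),
    "Earthquake": ((), {(): 0.002}),
    "Alarm": (("Burglary", "Earthquake"),
              {(True, True): 0.95, (True, False): 0.94,
               (False, True): 0.29, (False, False): 0.001}),
    "JohnCalls": (("Alarm",), {(True,): 0.90, (False,): 0.05}),
    "MaryCalls": (("Alarm",), {(True,): 0.70, (False,): 0.01}),
}
ORDER = ["Burglary", "Earthquake", "Alarm", "JohnCalls", "MaryCalls"]

def prob(node, value, ev):
    ps, cpt = net[node]
    p = cpt[tuple(ev[q] for q in ps)]
    return p if value else 1 - p

def enumerate_all(variables, ev):
    """Sum the joint over all hidden variables, depth-first."""
    if not variables:
        return 1.0
    y, rest = variables[0], variables[1:]
    if y in ev:                         # evidence: just multiply its CPT entry
        return prob(y, ev[y], ev) * enumerate_all(rest, ev)
    return sum(prob(y, v, ev) * enumerate_all(rest, {**ev, y: v})
               for v in (True, False))  # hidden: sum out both values

def query(x, ev):
    dist = {v: enumerate_all(ORDER, {**ev, x: v}) for v in (True, False)}
    alpha = sum(dist.values())          # normalize
    return {v: p / alpha for v, p in dist.items()}

posterior = query("Burglary", {"JohnCalls": True, "MaryCalls": True})
print(round(posterior[True], 3))  # 0.284
```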


SLIDE 19

Evaluation tree

[Evaluation tree for P(b|j, m): the root weight is P(b) = .001; branches on e (P(e) = .002, P(¬e) = .998), then on a (P(a|b,e) = .95, P(¬a|b,e) = .05, P(a|b,¬e) = .94, P(¬a|b,¬e) = .06), with leaves P(j|a) P(m|a) = .90 × .70 and P(j|¬a) P(m|¬a) = .05 × .01]

Enumeration is inefficient: repeated computation, e.g., it computes P(j|a) P(m|a) for each value of e


SLIDE 20

Inference by variable elimination

Variable elimination: carry out summations right-to-left, storing intermediate results (factors) to avoid recomputation.

P(B|j, m) = α P(B) Σe P(e) Σa P(a|B, e) P(j|a) P(m|a)
= α f1(B) Σe f2(E) Σa f3(A, B, E) f4(A) f5(A)

(where f1, f2, f4, f5 are 2-element vectors, and f3 is a 2 × 2 × 2 matrix)

Sum out A to get the 2 × 2 matrix f6, and then E to get the 2-vector f7:

f6(B, E) = Σa f3(A, B, E) × f4(A) × f5(A)
         = f3(a, B, E) × f4(a) × f5(a) + f3(¬a, B, E) × f4(¬a) × f5(¬a)

f7(B) = Σe f2(E) × f6(B, E) = f2(e) × f6(B, e) + f2(¬e) × f6(B, ¬e)

Finally: P(B|j, m) = α f1(B) × f7(B)
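These elimination steps can be carried out by hand with factors stored as dictionaries; a sketch with the burglary CPT values hard-coded (factor names f1–f7 follow the slide):

```python
# Factors as dicts from value tuples to numbers.
f1 = {True: 0.001, False: 0.999}        # f1(B) = P(B)
f2 = {True: 0.002, False: 0.998}        # f2(E) = P(E)
f3 = {(a, b, e): p                      # f3(A,B,E) = P(A | B, E)
      for (b, e), pt in {(True, True): 0.95, (True, False): 0.94,
                         (False, True): 0.29, (False, False): 0.001}.items()
      for a, p in ((True, pt), (False, 1 - pt))}
f4 = {True: 0.90, False: 0.05}          # f4(A) = P(j | A)
f5 = {True: 0.70, False: 0.01}          # f5(A) = P(m | A)

# f6(B,E) = sum_a f3(a,B,E) f4(a) f5(a)   -- sum out A
f6 = {(b, e): sum(f3[(a, b, e)] * f4[a] * f5[a] for a in (True, False))
      for b in (True, False) for e in (True, False)}
# f7(B) = sum_e f2(e) f6(B,e)             -- sum out E
f7 = {b: sum(f2[e] * f6[(b, e)] for e in (True, False)) for b in (True, False)}

# P(B | j, m) = alpha f1(B) f7(B)
unnorm = {b: f1[b] * f7[b] for b in (True, False)}
alpha = sum(unnorm.values())
posterior = {b: p / alpha for b, p in unnorm.items()}
print(round(posterior[True], 3))  # 0.284, same answer as enumeration
```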


SLIDE 21

Irrelevant variables

Consider the query P(JohnCalls|Burglary = true)

P(J|b) = α P(b) Σe P(e) Σa P(a|b, e) P(J|a) Σm P(m|a)

The sum over m is identically 1, so M is irrelevant to the query.

Theorem: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E)

Here, X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake}, so MaryCalls is irrelevant.
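The ancestor test can be automated with a simple graph walk. A sketch (hypothetical helper `ancestors`, graph hard-coded from the burglary net):

```python
# Y is relevant to query X with evidence E only if Y is in {X} ∪ E or in
# Ancestors({X} ∪ E).
parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}

def ancestors(nodes):
    """All ancestors of the given set of nodes."""
    seen, frontier = set(), list(nodes)
    while frontier:
        for p in parents[frontier.pop()]:
            if p not in seen:
                seen.add(p)
                frontier.append(p)
    return seen

relevant = {"JohnCalls", "Burglary"} | ancestors({"JohnCalls", "Burglary"})
print("MaryCalls" in relevant)  # False: MaryCalls is irrelevant
```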


SLIDE 22

Summary

Bayes nets provide a natural representation for (causally induced) conditional independence

Topology + CPTs = compact representation of joint distribution

Generally easy for (non)experts to construct

Probabilistic inference tasks can be computed exactly:
– variable elimination avoids recomputations
– irrelevant variables can be removed, which reduces complexity
