SLIDE 1

Bayesian networks: basics

Machine Intelligence
Thomas D. Nielsen
September 2008


SLIDE 2

Basics

Random/Chance Variables
A name and a state space:

  • Weather: {sunny, cloudy, rain}
  • Blood Pressure: {high, normal, low}
  • Grade: {−3, 00, 02, 4, 7, 10, 12}
  • Annual income: {1 DKK, 2 DKK, 3 DKK, 4 DKK, . . .}
  • Weight: x ∈ R

A probability distribution on the state space:

  • sunny: 0.3, cloudy: 0.5, rain: 0.2
  • Occurrence of k events within a time interval: P(k) = e^(−λ) λ^k / k! (Poisson distribution)

[Plot: Poisson probability mass function]

Continuous distribution: x ∼ N(µ, σ) (Gaussian distribution)

Notation: sp(A) denotes the state space of random variable A.

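As a concrete aside (not from the slides), a discrete random variable is just a state space plus a distribution on it, which is easy to mirror in code; the Poisson formula above is included for comparison. All names in this Python sketch are illustrative:

import math
import random

# A discrete random variable: a name, a state space, and a distribution on it.
rv_weather = {
    "name": "Weather",
    "states": ["sunny", "cloudy", "rain"],
    "probs": [0.3, 0.5, 0.2],
}

def sample(rv):
    """Draw one state according to the variable's distribution."""
    return random.choices(rv["states"], weights=rv["probs"], k=1)[0]

def poisson_pmf(k, lam):
    """P(k events in an interval) = e^(-lambda) * lambda^k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

print(sample(rv_weather))      # e.g. 'cloudy'
print(poisson_pmf(3, 2.0))     # ≈ 0.180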

SLIDE 3

Basics

Joint Distribution
Usually we are interested in the joint distribution of several variables, e.g. the probability that Weather is sunny and Grade is 10.
Notation: we also write sp(A, B) for the joint state space of two (or more) variables.
Example: sp(Weather, Grade) = {(sunny, −3), (sunny, 00), . . . , (rain, 12)}

Conditional Probabilities
A joint distribution defines conditional probabilities:
P(A = a | B = b) := P(A = a, B = b) / P(B = b)
This is also known as the fundamental rule (when read as a theorem, not a definition).

Bayes' Rule
From the definition of the conditional probability:
P(B = b | A = a) = P(A = a | B = b) · P(B = b) / P(A = a)
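Since both rules bottom out in arithmetic on a joint table, a short sketch can make them concrete. The following Python snippet is not from the slides; the joint distribution over Weather and a hypothetical Umbrella variable is made up for illustration. It computes a conditional via the fundamental rule and confirms Bayes' rule numerically:

# Made-up joint distribution P(Weather, Umbrella), stored as a dict.
joint = {
    ("sunny", "yes"): 0.03, ("sunny", "no"): 0.27,
    ("cloudy", "yes"): 0.20, ("cloudy", "no"): 0.30,
    ("rain", "yes"): 0.18, ("rain", "no"): 0.02,
}

def p_w(w):                        # marginal P(Weather = w)
    return sum(p for (wx, ux), p in joint.items() if wx == w)

def p_u(u):                        # marginal P(Umbrella = u)
    return sum(p for (wx, ux), p in joint.items() if ux == u)

def p_w_given_u(w, u):             # fundamental rule: P(W = w | U = u)
    return joint[(w, u)] / p_u(u)

# Bayes' rule: P(U = u | W = w) = P(W = w | U = u) P(U = u) / P(W = w)
direct = joint[("rain", "yes")] / p_w("rain")
bayes = p_w_given_u("rain", "yes") * p_u("yes") / p_w("rain")
print(direct, bayes)               # both ≈ 0.9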

SLIDE 4

Basics

Generalization
If an equality (like Bayes' rule) is true for all possible values a, b of the random variables A, B, one simply writes it in the form
P(B | A) = P(A | B) P(B) / P(A)

Conditioning on context
A probabilistic law remains valid when all probabilities are conditioned on a common "context" variable C. E.g. Bayes' rule:
P(B | A, C) = P(A | B, C) P(B | C) / P(A | C)

Chain rule
For any set of random variables V1, V2, . . . , Vn:
P(V1, . . . , Vn) = P(V1, . . . , Vn−1) P(Vn | V1, . . . , Vn−1)
                  = P(V1, . . . , Vn−2) P(Vn−1 | V1, . . . , Vn−2) P(Vn | V1, . . . , Vn−1)
                  . . .
                  = P(V1) P(V2 | V1) · · · P(Vi | V1, . . . , Vi−1) · · · P(Vn | V1, . . . , Vn−1)
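As a quick numeric sanity check (my own sketch, not part of the deck), the chain rule can be verified on any joint distribution; here on an arbitrary joint over three binary variables:

import random
from itertools import product

# Build an arbitrary normalized joint P(V1, V2, V3) over binary variables.
random.seed(0)
raw = {v: random.random() for v in product([0, 1], repeat=3)}
z = sum(raw.values())
joint = {v: p / z for v, p in raw.items()}

def marg(prefix):
    """P(V1..Vk = prefix): sum the joint over the remaining variables."""
    return sum(p for v, p in joint.items() if v[:len(prefix)] == prefix)

# Chain rule: P(V1,V2,V3) = P(V1) P(V2 | V1) P(V3 | V1,V2)
for v1, v2, v3 in product([0, 1], repeat=3):
    chain = (marg((v1,))
             * (marg((v1, v2)) / marg((v1,)))
             * (joint[(v1, v2, v3)] / marg((v1, v2))))
    assert abs(chain - joint[(v1, v2, v3)]) < 1e-12
print("chain rule holds for all 8 configurations")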

SLIDE 5

Basics

Conditional Independence
A is conditionally independent of B given C if one of the following equivalent conditions holds:
  • P(A, B | C) = P(A | C) P(B | C)
  • P(A | B, C) = P(A | C)
  • P(B | A, C) = P(B | C)
This extends to sets of random variables. E.g.: A1, A2, A3 is independent of B1, B2 given C1, C2, C3 if
P(A1, A2, A3, B1, B2 | C1, C2, C3) = P(A1, A2, A3 | C1, C2, C3) P(B1, B2 | C1, C2, C3)
A conditional independence relation does not necessarily remain true when conditioning on an additional context variable:
P(A, B | C) = P(A | C) P(B | C) does not imply P(A, B | C, D) = P(A | C, D) P(B | C, D)

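A direct way to see the first condition in action is an illustrative sketch with a joint constructed so that the independence holds by design (all numbers are made up):

from itertools import product

# Build a joint over binary A, B, C in which A and B are conditionally
# independent given C by construction.
pC = {0: 0.4, 1: 0.6}
pA_C = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}}   # pA_C[c][a] = P(A=a | C=c)
pB_C = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}   # pB_C[c][b] = P(B=b | C=c)
joint = {(a, b, c): pC[c] * pA_C[c][a] * pB_C[c][b]
         for a, b, c in product([0, 1], repeat=3)}

# Check P(A, B | C) = P(A | C) P(B | C) for every configuration.
for a, b, c in product([0, 1], repeat=3):
    p_c = sum(p for (x, y, z), p in joint.items() if z == c)
    lhs = joint[(a, b, c)] / p_c
    p_a_c = sum(p for (x, y, z), p in joint.items() if x == a and z == c) / p_c
    p_b_c = sum(p for (x, y, z), p in joint.items() if y == b and z == c) / p_c
    assert abs(lhs - p_a_c * p_b_c) < 1e-12
print("A and B are conditionally independent given C")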

SLIDE 6

Basics

Chain rule + Conditional Independence → Factorization
Chain rule again:
P(V1, . . . , Vn) = P(V1) P(V2 | V1) · · · P(Vi | V1, . . . , Vi−1) · · · P(Vn | V1, . . . , Vn−1)
Now suppose that for each i there is a set pa(Vi) ⊆ {V1, . . . , Vi−1} such that
P(Vi | V1, . . . , Vi−1) = P(Vi | pa(Vi))
(i.e. Vi is conditionally independent of {V1, . . . , Vi−1} \ pa(Vi) given pa(Vi)).
This gives the factorization of P(V1, . . . , Vn):
P(V1, . . . , Vn) = ∏_{i=1}^{n} P(Vi | pa(Vi))


SLIDE 7

Basics

Factorization → Bayesian Networks

[Figure: directed acyclic graph over the variables B, C, A, E, D]

P(B):   b1 0.3   b2 0.3   b3 0.4

P(C | B):        c1    c2
          b1    0.2   0.8
          b2    0.5   0.5
          b3    0.6   0.4

P(A | C, E):           a1    a2    a3
          c1, e1      0.1   0.6   0.3
          c1, e2      0.5   0.5   0.0
          c2, e1      0.4   0.2   0.4
          c2, e2      0.1   0.1   0.8
. . .

A Bayesian network for the (discrete) random variables V = V1, . . . , Vn is defined by
  • a directed acyclic graph (V, →)
  • for each Vi a conditional probability table P(Vi | pa(Vi)) specifying the conditional distribution of Vi given its parents in the graph.

The Bayesian network defines a joint distribution of V as:
P(V1, . . . , Vn) = ∏_{i=1}^{n} P(Vi | pa(Vi))

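To tie the definition to code, here is a minimal sketch (mine, not the deck's) of this network as "parents plus CPTs". The slide elides the tables for D and E, so the sketch restricts itself to B, C, E, A and assumes, purely for illustration, that E is a root with a made-up prior:

from itertools import product

parents = {"B": [], "E": [], "C": ["B"], "A": ["C", "E"]}
cpt = {
    "B": {(): {"b1": 0.3, "b2": 0.3, "b3": 0.4}},
    "E": {(): {"e1": 0.4, "e2": 0.6}},                    # assumed prior
    "C": {("b1",): {"c1": 0.2, "c2": 0.8},
          ("b2",): {"c1": 0.5, "c2": 0.5},
          ("b3",): {"c1": 0.6, "c2": 0.4}},
    "A": {("c1", "e1"): {"a1": 0.1, "a2": 0.6, "a3": 0.3},
          ("c1", "e2"): {"a1": 0.5, "a2": 0.5, "a3": 0.0},
          ("c2", "e1"): {"a1": 0.4, "a2": 0.2, "a3": 0.4},
          ("c2", "e2"): {"a1": 0.1, "a2": 0.1, "a3": 0.8}},
}

def joint_prob(assign):
    """P(V1..Vn) = prod_i P(Vi | pa(Vi)), evaluated at a full assignment."""
    p = 1.0
    for v in parents:
        pa_vals = tuple(assign[u] for u in parents[v])
        p *= cpt[v][pa_vals][assign[v]]
    return p

# Sanity check: the factorized probabilities sum to 1 over all assignments.
states = {"B": ["b1", "b2", "b3"], "E": ["e1", "e2"],
          "C": ["c1", "c2"], "A": ["a1", "a2", "a3"]}
total = sum(joint_prob(dict(zip(states, combo)))
            for combo in product(*states.values()))
print(round(total, 10))  # 1.0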

SLIDE 8

Basics

Elementary Conditional Independence Property
Vi: node in the Bayesian network
desc(Vi): descendants of Vi
rest(Vi): nondescendants of Vi, excluding pa(Vi) and Vi itself

[Figure: Vi with pa(Vi), desc(Vi), and rest(Vi)]

P(Vi | pa(Vi), rest(Vi)) = P(Vi | pa(Vi))
"Vi is independent of its nondescendants, given its parents"


SLIDE 9

Basics

The d-Separation Relation
(V, →) a directed acyclic graph, A, B, C ⊆ V disjoint subsets of nodes. C d-separates A from B if the following holds: every undirected path that connects a node A ∈ A with a node B ∈ B satisfies at least one of the following two conditions:
  1. the path contains a node C ∈ C, and the edges that connect C are serial (. . . → C → . . .) or diverging (. . . ← C → . . .);
  2. the path contains a node U such that the edges that connect U are converging (. . . → U ← . . .) and ({U} ∪ desc(U)) ∩ C = ∅.

[Figure: the three connection types: diverging, serial, converging]

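The two blocking conditions translate almost line-for-line into code. Below is a sketch of my own (`parents` maps each node to its parent set; all names are illustrative) that tests whether one given undirected path is blocked by C:

def descendants(node, parents):
    """All descendants of `node` (children, grandchildren, ...)."""
    children = {v: {w for w in parents if v in parents[w]} for v in parents}
    out, stack = set(), [node]
    while stack:
        for ch in children[stack.pop()]:
            if ch not in out:
                out.add(ch)
                stack.append(ch)
    return out

def path_blocked(path, parents, C):
    """True if the undirected path satisfies condition 1 or 2 w.r.t. C."""
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        converging = prev in parents[node] and nxt in parents[node]
        if converging:
            if not (({node} | descendants(node, parents)) & C):
                return True          # condition 2: converging, no evidence below
        elif node in C:
            return True              # condition 1: serial or diverging node in C
    return False

# Example: A ← B → C is diverging at B.
parents = {"A": {"B"}, "C": {"B"}, "B": set()}
print(path_blocked(["A", "B", "C"], parents, {"B"}))  # True: B blocks the path
print(path_blocked(["A", "B", "C"], parents, set()))  # False: path is active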

SLIDE 10

Basics

pa(A) d-separates A from rest(A):

[Figure: example graph with nodes A, B, C, U illustrating the claim]


SLIDE 11

Basics

d-Separation Theorem
Let (V, →, {P(Vi | pa(Vi)) | i = 1, . . . , n}) be a Bayesian network that defines the joint distribution P. Then for all pairwise disjoint A, B, C ⊆ V:
If C d-separates A from B in (V, →), then P(A | B, C) = P(A | C).
[The Elementary Conditional Independence Property is a special case.]
A proof can be found in Verma & Pearl (1990).


SLIDE 12

Basics

Basic Inference Problems
Given a Bayesian network (V, →, {P(Vi | pa(Vi)) | i = 1, . . . , n}).

(a) Computation of a-posteriori distributions:
Given E1, . . . , Ek ∈ V, ei ∈ sp(Ei). Wanted: for all A ∈ V \ E, the conditional distribution of A given ("the evidence") E = e:
P(A | E1 = e1, . . . , Ek = ek)

(b) Computation of most likely configurations (most probable explanations, MPE):
Evidence E = e as in (a). A := V \ E. Wanted: amax ∈ sp(A) with
amax = arg max_{a ∈ sp(A)} P(A = a | E = e)

(c) Computation of maximum a posteriori (MAP) configurations, a generalization of (b):
Evidence E = e as in (b), B ⊂ V \ E. Wanted: bmax ∈ sp(B) with
bmax = arg max_{b ∈ sp(B)} P(B = b | E = e)

Note: in general (amax)_i ≠ arg max_{ai ∈ sp(Ai)} P(Ai = ai | E = e), i.e. the components of the most probable joint configuration need not maximize the individual posterior marginals.
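For tiny models all three problems can be solved by brute-force enumeration over the joint distribution, which makes the definitions concrete. A sketch under made-up variables, states, and weights (none of this is from the slides):

from itertools import product

vars_ = ["A", "B", "E"]
states = {"A": ["a1", "a2"], "B": ["b1", "b2"], "E": ["e1", "e2"]}
weights = [3, 1, 2, 2, 1, 4, 5, 2]                      # made-up joint weights
combos = list(product(*(states[v] for v in vars_)))
joint = {c: w / sum(weights) for c, w in zip(combos, weights)}
idx = {v: i for i, v in enumerate(vars_)}

def consistent(c, evidence):
    return all(c[idx[v]] == val for v, val in evidence.items())

def posterior(var, evidence):
    """(a) P(var | evidence)."""
    pe = sum(p for c, p in joint.items() if consistent(c, evidence))
    return {s: sum(p for c, p in joint.items()
                   if consistent(c, evidence) and c[idx[var]] == s) / pe
            for s in states[var]}

def mpe(evidence):
    """(b) Most probable full configuration given the evidence."""
    ok = {c: p for c, p in joint.items() if consistent(c, evidence)}
    return max(ok, key=ok.get)

def map_config(bvars, evidence):
    """(c) MAP: maximize the posterior of a subset, marginalizing the rest."""
    scores = {}
    for c, p in joint.items():
        if consistent(c, evidence):
            key = tuple(c[idx[v]] for v in bvars)
            scores[key] = scores.get(key, 0.0) + p
    return max(scores, key=scores.get)

print(posterior("A", {"E": "e1"}))    # {'a1': 5/11, 'a2': 6/11}
print(mpe({"E": "e1"}))               # ('a2', 'b2', 'e1')
print(map_config(["B"], {"E": "e1"})) # ('b2',)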

SLIDE 13

Basics

[Figure: example network with evidence entered; queries P(Alarm | C_W = true, C_G = true) and P(Quake | C_W = true, C_G = true)]


SLIDE 14

Inference

Variable Elimination
Direct approach to solve (a): denote U := V \ (E ∪ {A}), and let u range over sp(U). Then for a ∈ sp(A):
P(A = a, E = e) = Σ_u P(A = a, U = u, E = e)
                = Σ_u ∏_{i=1}^{n} P(Vi | pa(Vi))(a, u, e)
                = Σ_{u1 ∈ sp(U1)} · · · Σ_{um ∈ sp(Um)} ∏_i P(Vi | pa(Vi))(a, u, e)

"Algorithm": sum out the uj one by one; move factors P(Vi | pa(Vi))(a, u, e) that do not depend on the current uj (because Uj ∉ {Vi} ∪ pa(Vi)) out of the sum.

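The "move factors out of the sum" step is easiest to see as two operations on tables. A minimal sketch of my own (the factor representation and all names are illustrative): a factor is a pair (variables, table), and `doms` maps each variable to its state space.

from itertools import product

def multiply(f, g, doms):
    """Pointwise product over the union of the two factors' variables."""
    (fv, ft), (gv, gt) = f, g
    ov = list(dict.fromkeys(fv + gv))                 # union, order-preserving
    table = {}
    for combo in product(*(doms[v] for v in ov)):
        a = dict(zip(ov, combo))
        table[combo] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return (ov, table)

def sum_out(f, var):
    """Eliminate `var` from the factor by summing it out."""
    fv, ft = f
    ov = [v for v in fv if v != var]
    table = {}
    for combo, p in ft.items():
        a = dict(zip(fv, combo))
        key = tuple(a[v] for v in ov)
        table[key] = table.get(key, 0.0) + p
    return (ov, table)

# Usage: P(Y) = sum_x P(X = x) P(Y | X = x)
doms = {"X": ["t", "f"], "Y": ["t", "f"]}
fX = (["X"], {("t",): 0.6, ("f",): 0.4})
fYgX = (["Y", "X"], {("t", "t"): 0.9, ("f", "t"): 0.1,
                     ("t", "f"): 0.2, ("f", "f"): 0.8})
print(sum_out(multiply(fX, fYgX, doms), "X"))
# (['Y'], {('t',): 0.62, ('f',): 0.38})  up to float rounding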

SLIDE 15

Inference

Example

[Figure: DAG with edges B → A, B → C, A → D, C → D]

P(B):   t .5   f .5

P(A | B):         t     f
          B = t   .7    .3
          B = f   .1    .9

P(C | B):         t     f
          B = t   .7    .3
          B = f   .2    .8

P(D | A, C):             t     f
          A = t, C = t   .9    .1
          A = t, C = f   .7    .3
          A = f, C = t   .8    .2
          A = f, C = f   .4    .6

P(A, D = f) = Σ_{b,c ∈ {t,f}} P(B = b, A, C = c, D = f)
            = Σ_{b,c} P(B = b) P(A | B = b) P(C = c | B = b) P(D = f | A, C = c) =


SLIDE 16

Inference

. . . = Σ_b P(B = b) P(A | B = b) Σ_c P(C = c | B = b) P(D = f | A, C = c)
      = Σ_b P(B = b) P(A | B = b) F1(B = b, A)
      = F2(A)

where (from the tables P(C | B) and P(D | A, C)):

   b   a   F1(B, A)
   t   t   .7 · .1 + .3 · .3 = .16
   t   f   .7 · .2 + .3 · .6 = .32
   f   t   .2 · .1 + .8 · .3 = .26
   f   f   .2 · .2 + .8 · .6 = .52

and (from P(B), P(A | B), and F1):

   a   F2(A)
   t   .5 · .7 · .16 + .5 · .1 · .26 = .069
   f   .5 · .3 · .32 + .5 · .9 · .52 = .282

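To double-check the arithmetic, here is a small sketch (not from the slides) that recomputes F1 and F2 from the example's tables with plain loops:

pB = {"t": 0.5, "f": 0.5}
pA = {("t", "t"): 0.7, ("f", "t"): 0.3,             # pA[(a, b)] = P(A=a | B=b)
      ("t", "f"): 0.1, ("f", "f"): 0.9}
pC = {("t", "t"): 0.7, ("f", "t"): 0.3,             # pC[(c, b)] = P(C=c | B=b)
      ("t", "f"): 0.2, ("f", "f"): 0.8}
pD = {("t", "t", "t"): 0.9, ("f", "t", "t"): 0.1,   # pD[(d, a, c)] = P(D=d | A=a, C=c)
      ("t", "t", "f"): 0.7, ("f", "t", "f"): 0.3,
      ("t", "f", "t"): 0.8, ("f", "f", "t"): 0.2,
      ("t", "f", "f"): 0.4, ("f", "f", "f"): 0.6}

F1 = {(b, a): sum(pC[(c, b)] * pD[("f", a, c)] for c in "tf")
      for b in "tf" for a in "tf"}
F2 = {a: sum(pB[b] * pA[(a, b)] * F1[(b, a)] for b in "tf") for a in "tf"}

print(F1)  # ≈ {('t','t'): .16, ('t','f'): .32, ('f','t'): .26, ('f','f'): .52}
print(F2)  # ≈ {'t': .069, 'f': .282}, i.e. P(A, D = f)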

SLIDE 17

Inference

Complexity
Call the subsets U of V that occur as arguments of the factors P(. . . | . . .) resp. Fj(. . .) during the elimination process factor sets. The complexity of variable elimination is exponential in the size of the largest factor set. The size of the largest factor set can depend strongly on the order in which the variables are summed out!

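The order-dependence is visible at the level of the graph alone. The sketch below is my own: it tracks only which variables each intermediate factor mentions (not the numbers) and compares two elimination orders on a star-shaped network:

def max_factor_size(order, neighbors):
    """Simulate elimination on the interaction graph; return the size of the
    largest factor set encountered (a graph-only view of variable elimination)."""
    nb = {v: set(ns) for v, ns in neighbors.items()}
    worst = 0
    for v in order:
        worst = max(worst, len(nb[v]) + 1)   # factor over v and its neighbors
        for u in list(nb[v]):                # connect v's neighbors pairwise
            nb[u] = (nb[u] | nb[v]) - {u, v}
        del nb[v]
    return worst

# Star: hub H with leaves L1..L4 (as in a naive-Bayes-like network).
star = {"H": {"L1", "L2", "L3", "L4"},
        "L1": {"H"}, "L2": {"H"}, "L3": {"H"}, "L4": {"H"}}
print(max_factor_size(["H", "L1", "L2", "L3", "L4"], star))  # 5: hub first
print(max_factor_size(["L1", "L2", "L3", "L4", "H"], star))  # 2: leaves first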