

slide-1
SLIDE 1

Probabilistic Graphical Models: Bayesian Networks Li Xiong

Slide credits: Page (Wisconsin) CS760, Zhu (Wisconsin) KDD ’12 tutorial

slide-2
SLIDE 2

Outline

• Graphical models
• Bayesian networks – definition
• Bayesian networks – inference
• Bayesian networks – learning


slide-3
SLIDE 3

Overview

The envelope quiz

• There are two envelopes: one has a red ball (worth $100) and a black ball; the other has two black balls.
• You randomly picked an envelope and randomly took out a ball – it was black.
• At this point, you are given the option to switch envelopes. Should you?
slide-4
SLIDE 4

Overview

The envelope quiz

Random variables: E ∈ {1, 0}, B ∈ {r, b}
P(E = 1) = P(E = 0) = 1/2
P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0
We ask: P(E = 1 | B = b)

slide-5
SLIDE 5

Overview

The envelope quiz

Random variables: E ∈ {1, 0}, B ∈ {r, b}
P(E = 1) = P(E = 0) = 1/2
P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0
We ask: P(E = 1 | B = b)

P(E = 1 | B = b) = P(B = b | E = 1) P(E = 1) / P(B = b) = (1/2 × 1/2) / (3/4) = 1/3

The graphical model: E → B
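The same answer can be checked numerically. Below is a minimal sketch (not part of the slides; the variable names are illustrative) that enumerates the joint P(E, B) and conditions on B = b:

```python
from itertools import product

# CPDs of the envelope model: P(E) and P(B | E).
p_E = {1: 0.5, 0: 0.5}
p_B_given_E = {1: {"r": 0.5, "b": 0.5}, 0: {"r": 0.0, "b": 1.0}}

# Enumerate the joint P(E, B), then condition on B = b.
joint = {(e, b): p_E[e] * p_B_given_E[e][b] for e, b in product((1, 0), ("r", "b"))}
p_b = sum(p for (e, b), p in joint.items() if b == "b")   # P(B = b) = 3/4
print(joint[(1, "b")] / p_b)                              # P(E = 1 | B = b) = 1/3, so switch
```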

slide-6
SLIDE 6

Overview

Reasoning with uncertainty

  • A set of random variables x1, ..., xn; e.g. (x1, ..., xn−1) a feature vector, xn ≡ y the class label
  • Inference: given the joint distribution p(x1, ..., xn), compute p(XQ | XE) where XQ ∪ XE ⊆ {x1, ..., xn}
    e.g. Q = {n}, E = {1, ..., n−1}; by the definition of conditional probability,
    p(xn | x1, ..., xn−1) = p(x1, ..., xn−1, xn) / Σ_v p(x1, ..., xn−1, xn = v)
  • Learning: estimate p(x1, ..., xn) from training data X(1), ..., X(N), where X(i) = (x1(i), ..., xn(i))

slide-7
SLIDE 7

Overview

It is difficult to reason with uncertainty

  • joint distribution p(x1, ..., xn): naive storage is exponential (2^n entries for binary r.v.s), hard to interpret (conditional independence)
  • inference p(XQ | XE): often can't afford to do it by brute force
  • if p(x1, ..., xn) is not given, estimate it from data: often can't afford to do it by brute force
  • Graphical model: efficient representation, inference, and learning on p(x1, ..., xn), exactly or approximately
slide-8
SLIDE 8

Definitions

Graphical-Model-Nots

Graphical modeling is the study of probabilistic models.
Just because there are nodes and edges doesn't mean it's a graphical model.
These are not graphical models: neural network, decision tree, network flow, HMM template.

slide-9
SLIDE 9

Graphical Models

• Bayesian networks – directed
• Markov networks – undirected


slide-10
SLIDE 10

Outline

• Graphical models
• Bayesian networks – definition
• Bayesian networks – inference
• Bayesian networks – learning


slide-11
SLIDE 11

Bayesian Networks: Intuition

  • A graphical representation for a joint probability distribution

• Nodes are random variables
• Directed edges between nodes reflect dependence

• Some informal examples:

[Figures: two small example networks over the variables Understood Material, Assignment Grade, Exam Grade and Smoking At Sensor, Fire, Alarm]

slide-12
SLIDE 12

Bayesian networks

  • a BN consists of a Directed Acyclic Graph (DAG) and a set of conditional probability distributions
  • in the DAG
    – each node denotes a random variable
    – each edge from X to Y represents that X directly influences Y
    – formally: each variable X is independent of its non-descendants given its parents
  • each node X has a conditional probability distribution (CPD) representing P(X | Parents(X))

slide-13
SLIDE 13

Definitions Directed Graphical Models

Conditional Independence

Two r.v.s A, B are independent if P(A, B) = P(A)P(B), or equivalently P(A | B) = P(A).
Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C), or equivalently P(A | B, C) = P(A | C).

slide-14
SLIDE 14

Bayesian networks

  • a BN provides a compact representation of a joint probability distribution:

    P(X1, …, Xn) = ∏_{i=1..n} P(Xi | Parents(Xi))

  • using the chain rule, any joint probability distribution can be expressed as

    P(X1, …, Xn) = P(X1) ∏_{i=2..n} P(Xi | X1, …, X_{i−1})

slide-15
SLIDE 15

Bayesian network example

  • Consider the following 5 binary random variables:

B = a burglary occurs at your house
E = an earthquake occurs at your house
A = the alarm goes off
J = John calls to report the alarm
M = Mary calls to report the alarm

  • Suppose we want to answer queries like what is

P(B | M, J) ?

slide-16
SLIDE 16

Bayesian network example

[Network: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]

P(B):              P(E):
t       f          t       f
0.001   0.999      0.001   0.999

P(A | B, E):
B E    A=t     A=f
t t    0.95    0.05
t f    0.94    0.06
f t    0.29    0.71
f f    0.001   0.999

P(J | A):              P(M | A):
A    J=t    J=f        A    M=t    M=f
t    0.9    0.1        t    0.7    0.3
f    0.05   0.95       f    0.01   0.99

slide-17
SLIDE 17

Bayesian networks

P(B, E, A, J, M) = P(B) × P(E) × P(A | B, E) × P(J | A) × P(M | A)

  • a standard representation of the joint distribution for the Alarm example has 2^5 = 32 parameters
  • the BN representation of this distribution has 20 parameters
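As a quick concreteness check, here is a minimal sketch (not from the slides) that evaluates this factored joint for the alarm network using the CPTs above; True/False stand for t/f:

```python
# Probability of each variable being true (t), given its parents.
p_b_true = 0.001
p_e_true = 0.001
p_a_true = {(True, True): 0.95, (True, False): 0.94,
            (False, True): 0.29, (False, False): 0.001}   # P(A = t | B, E)
p_j_true = {True: 0.9, False: 0.05}                        # P(J = t | A)
p_m_true = {True: 0.7, False: 0.01}                        # P(M = t | A)

def prob(p_true, value):
    """P(variable = value), given the probability that it is true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)."""
    return (prob(p_b_true, b) * prob(p_e_true, e) * prob(p_a_true[(b, e)], a)
            * prob(p_j_true[a], j) * prob(p_m_true[a], m))

print(joint(True, False, True, True, True))   # P(b, ¬e, a, j, m)
```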

slide-18
SLIDE 18

Bayesian networks

  • consider a case with 10 binary random variables
  • How many parameters does a BN with the following graph structure have?

[Figure: a 10-node DAG whose per-node CPT sizes are 2, 4, 4, 4, 4, 4, 4, 4, 8, 4, for a total of 42 parameters]

  • How many parameters does the standard table representation of the joint distribution have? 2^10 = 1024

slide-19
SLIDE 19

Advantages of the Bayesian network representation

  • Captures independence and conditional independence where they exist
  • Encodes the relevant portion of the full joint among variables where dependencies exist
  • Uses a graphical representation which lends insight into the complexity of inference

slide-20
SLIDE 20


Bayesian Networks

• Graphical models
• Bayesian networks – definition
• Bayesian networks – inference
  • Exact inference
  • Approximate inference
• Bayesian networks – learning
  • Parameter learning
  • Network learning

slide-21
SLIDE 21

The inference task in Bayesian networks

Given: values for some variables in the network (evidence), and a set of query variables
Do: compute the posterior distribution over the query variables

  • variables that are neither evidence variables nor query variables are other variables
  • the BN representation is flexible enough that any set can be the evidence variables and any set can be the query variables

slide-22
SLIDE 22

Recall Naïve Bayesian Classifier

• Derive the maximum posterior
• Independence assumption
• Simplified network

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

P(X | Ci) = ∏_{k=1..n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)

slide-23
SLIDE 23

Inference Exact Inference

Exact Inference by Enumeration

Let X = (XQ, XE, XO) denote the query, evidence, and other variables. Infer P(XQ | XE). By definition,

P(XQ | XE) = P(XQ, XE) / P(XE)

where P(XQ, XE) = Σ_{XO} P(XQ, XE, XO) and P(XE) = Σ_{XQ, XO} P(XQ, XE, XO)

slide-24
SLIDE 24

Inference by enumeration example

  • let a denote A=true, and ¬a denote A=false
  • suppose we’re given the query: P(b | j, m)

“probability the house is being burglarized given that John and Mary both called”

  • from the graph structure we can first compute:

P(b, j, m) = Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a)

sum over the possible values for the E and A variables (e, ¬e, a, ¬a)

slide-25
SLIDE 25

Inference by enumeration example

P(B) = 0.001    P(E) = 0.001

P(A | B, E):         P(J | A):         P(M | A):
B E    P(A)          A    P(J)         A    P(M)
t t    0.95          t    0.9          t    0.7
t f    0.94          f    0.05         f    0.01
f t    0.29
f f    0.001

P(b, j, m) = Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a)
           = P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) P(m | a)
           = 0.001 × ( 0.001 × 0.95 × 0.9 × 0.7         (e, a)
                     + 0.001 × 0.05 × 0.05 × 0.01       (e, ¬a)
                     + 0.999 × 0.94 × 0.9 × 0.7         (¬e, a)
                     + 0.999 × 0.06 × 0.05 × 0.01 )     (¬e, ¬a)

slide-26
SLIDE 26

Inference by enumeration example

  • now do equivalent calculation for P(¬b, j, m)
  • and determine P(b | j, m)

P(b | j, m) = P(b, j, m) / P(j, m) = P(b, j, m) / ( P(b, j, m) + P(¬b, j, m) )
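A minimal sketch (not from the slides) of this enumeration, reusing the joint() helper sketched earlier for the alarm network; it sums the joint over the hidden variables E and A and normalizes:

```python
from itertools import product

def prob_b_given_j_m():
    # Sum the joint over the hidden variables E and A, for B = true and B = false.
    totals = {}
    for b in (True, False):
        totals[b] = sum(joint(b, e, a, True, True)
                        for e, a in product((True, False), repeat=2))
    return totals[True] / (totals[True] + totals[False])   # normalize by P(j, m)

print(prob_b_given_j_m())
```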

slide-27
SLIDE 27

Inference Exact Inference

Exact Inference by Enumeration

Let X = (XQ, XE, XO) denote the query, evidence, and other variables. Infer P(XQ | XE). By definition,

P(XQ | XE) = P(XQ, XE) / P(XE)

where P(XQ, XE) = Σ_{XO} P(XQ, XE, XO) and P(XE) = Σ_{XQ, XO} P(XQ, XE, XO)

Computational issue: summing an exponential number of terms – with k variables in XO each taking r values, there are r^k terms

slide-28
SLIDE 28


Bayesian Networks

• Graphical models
• Bayesian networks – definition
• Bayesian networks – inference
  • Exact inference
  • Approximate inference
• Bayesian networks – learning
  • Parameter learning
  • Network learning

slide-29
SLIDE 29

Approximate (Monte Carlo) Inference in Bayes Nets

  • Basic idea: repeatedly generate data samples according to the distribution represented by the Bayes net
  • Estimate the probability P(XQ | XE) from those samples

slide-30
SLIDE 30

Inference Markov Chain Monte Carlo

Forward Sampling: Example

P(B) = 0.001    P(E) = 0.002
P(A | B, E) = 0.95,  P(A | B, ~E) = 0.94,  P(A | ~B, E) = 0.29,  P(A | ~B, ~E) = 0.001
P(J | A) = 0.9,  P(J | ~A) = 0.05
P(M | A) = 0.7,  P(M | ~A) = 0.01

To generate a sample X = (B, E, A, J, M):
  Sample B ∼ Ber(0.001): draw r ∼ U(0, 1); if r < 0.001 then B = 1 else B = 0
  Sample E ∼ Ber(0.002)
  If B = 1 and E = 1, sample A ∼ Ber(0.95), and so on for the other (B, E) combinations
  If A = 1 sample J ∼ Ber(0.9), else J ∼ Ber(0.05)
  If A = 1 sample M ∼ Ber(0.7), else M ∼ Ber(0.01)
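A minimal sketch (not from the slides) of this forward-sampling procedure; each variable is drawn given its already-sampled parents, in topological order:

```python
import random

def forward_sample():
    b = random.random() < 0.001
    e = random.random() < 0.002
    p_a = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}[(b, e)]
    a = random.random() < p_a
    j = random.random() < (0.9 if a else 0.05)
    m = random.random() < (0.7 if a else 0.01)
    return b, e, a, j, m
```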

slide-31
SLIDE 31

Inference Markov Chain Monte Carlo

Inference with Forward Sampling

Given the inference task P(B = 1 | E = 1, M = 1):
Throw away all samples except those with (E = 1, M = 1).

  P(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1..m} 1(B^(i) = 1), where m is the number of surviving samples

Issue: this can be highly inefficient (note that P(E = 1) is tiny); few samples agree with the evidence.
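A minimal sketch (not from the slides) of this rejection scheme, built on the forward_sample() function above:

```python
def estimate_b_given_e_m(n_samples=1_000_000):
    kept = hits = 0
    for _ in range(n_samples):
        b, e, a, j, m = forward_sample()
        if e and m:                 # keep only samples consistent with the evidence
            kept += 1
            hits += b
    # Few samples survive because P(E = 1) is tiny.
    return hits / kept if kept else float("nan")
```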

slide-32
SLIDE 32

Markov Chain Monte Carlo

  • Fix the evidence variables, sample the non-evidence variables
  • Generate the next setting probabilistically based on the current setting (a Markov chain)
  • Gibbs sampling for Bayes networks
slide-33
SLIDE 33

Inference Markov Chain Monte Carlo

Gibbs Sampler: Example, P(B = 1 | E = 1, M = 1)

Initialization: fix the evidence; randomly set the other variables,
e.g. X(0) = (B = 0, E = 1, A = 0, J = 0, M = 1)

[Figure: the alarm network with its CPTs (P(B) = 0.001, P(E) = 0.002, P(A | B, E) as before, P(J | A) = 0.9, P(J | ~A) = 0.05, P(M | A) = 0.7, P(M | ~A) = 0.01); the evidence nodes E = 1 and M = 1 are clamped]

slide-34
SLIDE 34

Inference Markov Chain Monte Carlo

Gibbs Update

  • For each non-evidence variable xi, fixing all other nodes X−i, resample xi ∼ P(xi | X−i), which is equivalent to xi ∼ P(xi | MarkovBlanket(xi))
  • MarkovBlanket(xi) includes xi's parents, spouses, and children:

    P(xi | MarkovBlanket(xi)) ∝ P(xi | Pa(xi)) ∏_{y ∈ C(xi)} P(y | Pa(y))

    where Pa(x) are the parents of x, and C(x) the children of x.

Example: B ∼ P(B | E = 1, A = 0, J = 0, M = 1) ∝ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1)

[Figure: the alarm network with its CPTs, evidence E = 1 and M = 1 clamped]

slide-35
SLIDE 35

Inference Markov Chain Monte Carlo

Gibbs Update

  • Say we sampled B = 1. Then X(1) = (B = 1, E = 1, A = 0, J = 0, M = 1)
  • Sample A ∼ P(A | B = 1, E = 1, J = 0, M = 1) to get X(2)
  • Move on to J, then repeat B, A, J, B, A, J, ...
  • Keep all later samples; P(B = 1 | E = 1, M = 1) is the fraction of samples with B = 1.

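A minimal sketch (not from the slides) of this Gibbs sampler for P(B = 1 | E = 1, M = 1): the evidence E and M stay clamped, and B, A, J are resampled in turn from their Markov blankets, using the CPTs from the forward-sampling slide:

```python
import random

def bern(p):
    return random.random() < p

def gibbs_estimate(n_iter=100_000, burn_in=1_000):
    e, m = True, True                      # evidence: E = 1, M = 1 (clamped)
    b, a, j = False, False, False          # arbitrary initialization of the others
    p_a_tab = {(True, True): 0.95, (True, False): 0.94,
               (False, True): 0.29, (False, False): 0.001}
    hits = kept = 0
    for t in range(n_iter):
        # Resample B from its Markov blanket {A, E}: P(B) * P(A = a | B, E)
        w1 = 0.001 * (p_a_tab[(True, e)] if a else 1 - p_a_tab[(True, e)])
        w0 = 0.999 * (p_a_tab[(False, e)] if a else 1 - p_a_tab[(False, e)])
        b = bern(w1 / (w1 + w0))
        # Resample A from its Markov blanket {B, E, J, M}: P(A | B, E) P(J = j | A) P(M = 1 | A)
        wa1 = p_a_tab[(b, e)] * (0.9 if j else 0.1) * 0.7
        wa0 = (1 - p_a_tab[(b, e)]) * (0.05 if j else 0.95) * 0.01
        a = bern(wa1 / (wa1 + wa0))
        # Resample J from its Markov blanket {A}: P(J | A)
        j = bern(0.9 if a else 0.05)
        if t >= burn_in:                   # discard the early samples
            kept += 1
            hits += b
    return hits / kept

print(gibbs_estimate())
```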

slide-36
SLIDE 36

Inference Markov Chain Monte Carlo

Gibbs Sampling as a Markov Chain

  • A Markov chain is defined by a transition matrix T(X′ | X)
  • Certain Markov chains have a stationary distribution π such that π = Tπ
  • The Gibbs sampler is such a Markov chain, with Ti((X−i, x′i) | (X−i, xi)) = p(x′i | X−i), and stationary distribution p(XQ | XE)
  • It takes time for the chain to reach the stationary distribution
  • In practice: discard the early samples
slide-37
SLIDE 37


Bayesian Networks

• Graphical models
• Bayesian networks – definition
• Bayesian networks – inference
  • Exact inference
  • Approximate inference
• Bayesian networks – learning
  • Parameter learning
  • Parameter learning and inference with partial data
  • Network learning

slide-38
SLIDE 38

The parameter learning task

  • Given: a set of training instances, the graph structure of a BN
  • Do: infer the parameters of the CPDs

B E A J M
f f f t f
f t f f f
f f t f t
…

[Figure: the alarm network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls)]

slide-39
SLIDE 39

The parameter and data learning task

  • Given: a set of training instances (with some missing or unobservable data), the graph structure of a BN
  • Do: infer the parameters of the CPDs; infer the missing data values

B E A J M
f f ? t f
f t ? f f
f f ? f t
…

[Figure: the alarm network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls)]

slide-40
SLIDE 40

The structure learning task

  • Given: a set of training instances
  • Do: infer the graph structure (and perhaps the parameters of the CPDs too)

B E A J M
f f f t f
f t f f f
f f t f t
…

slide-41
SLIDE 41


Bayesian Networks

• Graphical models
• Bayesian networks – definition
• Bayesian networks – inference
  • Exact inference
  • Approximate inference
• Bayesian networks – learning
  • Parameter learning
  • Parameter and data learning with partial data
  • Network learning

slide-42
SLIDE 42

The parameter learning task

  • Given: a set of training instances, the graph structure of a BN
  • Do: infer the parameters of the CPDs

B E A J M
f f f t f
f t f f f
f f t f t
…

[Figure: the alarm network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls)]

slide-43
SLIDE 43

Parameter learning and maximum likelihood estimation

  • maximum likelihood estimation (MLE)

  – given a model structure (e.g. a Bayes net graph) and a set of data D
  – set the model parameters θ to maximize P(D | θ)

  • i.e. make the data D look as likely as possible under the model P(D | θ)

slide-44
SLIDE 44

Maximum likelihood estimation

consider trying to estimate the parameter θ (probability of heads) of a biased coin from a sequence of flips

x = 1, 1, 1, 0, 1, 0, 0, 1, 0, 1

the likelihood function for θ is given by:

L(θ) = P(x | θ) = θ^{x1} (1 − θ)^{1 − x1} × … × θ^{xn} (1 − θ)^{1 − xn} = θ^{Σ xi} (1 − θ)^{n − Σ xi}

for h heads in n flips, the MLE is θ̂ = h / n
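A minimal sketch (not from the slides) of the MLE for this coin example:

```python
flips = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
theta_mle = sum(flips) / len(flips)   # h / n
print(theta_mle)                      # 0.6
```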

slide-45
SLIDE 45

Maximum likelihood estimation

consider estimating the CPD parameters for B and J in the alarm network given the following data set:

B E A J M
f f f t f
f t f f f
f f f t t
t f f f t
f f t t f
f f t f t
f f t t t
f f t t t

P(j | a) = 3/4 = 0.75      P(¬j | a) = 1/4 = 0.25
P(j | ¬a) = 2/4 = 0.5      P(¬j | ¬a) = 2/4 = 0.5
P(b) = 1/8 = 0.125         P(¬b) = 7/8 = 0.875

slide-46
SLIDE 46

Maximum likelihood estimation

suppose instead, our data set was this… do we really want to set P(b) to 0?

B E A J M
f f f t f
f t f f f
f f f t t
f f f f t
f f t t f
f f t f t
f f t t t
f f t t t

P(b) = 0/8 = 0      P(¬b) = 8/8 = 1

slide-47
SLIDE 47

Maximum a posteriori (MAP) estimation

  • instead of estimating parameters strictly from the data, we could start with some prior belief for each
  • for example, we could use Laplace estimates:

    P(X = x) = (nx + 1) / Σ_{v ∈ Values(X)} (nv + 1)

  • where nv represents the number of occurrences of value v, and the added 1s are pseudocounts

slide-48
SLIDE 48

Maximum a posteriori estimation

a more general form: m-estimates

P(X = x) = (nx + px m) / ( Σ_{v ∈ Values(X)} nv + m )

where px is the prior probability of value x and m is the number of "virtual" instances

slide-49
SLIDE 49

M-estimates example

B E A J M
f f f t f
f t f f f
f f f t t
f f f f t
f f t t f
f f t f t
f f t t t
f f t t t

now let's estimate parameters for B using m = 4 and pb = 0.25

P(b) = (0 + 0.25 × 4) / (8 + 4) = 1/12 ≈ 0.08
P(¬b) = (8 + 0.75 × 4) / (8 + 4) = 11/12 ≈ 0.92
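A minimal sketch (not from the slides) of the m-estimate used above:

```python
def m_estimate(n_x, n_total, p_x, m):
    """P(X = x) with m 'virtual' instances distributed according to the prior p_x."""
    return (n_x + p_x * m) / (n_total + m)

print(m_estimate(0, 8, 0.25, 4))   # P(b)  = 1/12 ≈ 0.08
print(m_estimate(8, 8, 0.75, 4))   # P(¬b) = 11/12 ≈ 0.92
```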

slide-50
SLIDE 50

Incorporating a Prior

  • The pseudo-counts are really parameters of a beta distribution
  • The beta distribution Beta(a, b) is the conjugate prior for the parameter p given this likelihood function
  • We can specify initial parameters a and b such that a/(a+b) = p, and specify confidence in this belief with high initial values for a and b
  • After h heads out of n trials, the posterior distribution gives P(heads) = (a + h)/(a + b + n)



Bayesian Networks

• Graphical models
• Bayesian networks – definition
• Bayesian networks – inference
  • Exact inference
  • Approximate inference
• Bayesian networks – learning
  • Parameter learning
  • Parameter learning + inference
  • Network learning

slide-53
SLIDE 53

The parameter learning task + inference

  • Given: a set of training instances (with some missing or unobservable data), the graph structure of a BN
  • Do: infer the parameters of the CPDs; infer the missing data values

B E A J M
f f ? t f
f t ? f f
f f ? f t
…

[Figure: the alarm network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls)]

slide-54
SLIDE 54

Inferring missing data and parameter learning with EM

Given:

  • data set with some missing values
  • model structure, initial model parameters

Repeat until convergence

  • Expectation (E) step: using the current model, compute the expectation over the missing values
  • Maximization (M) step: given the expectations, compute/update the maximum likelihood (MLE) or maximum a posteriori (MAP) parameters
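A minimal sketch (not from the slides) of one EM iteration for the alarm-network case that follows, where only A is unobserved: the E-step computes P(a | b, e, j, m) for every row, and the M-step re-estimates P(A | B, E) from the resulting expected counts. The dictionaries p_A, p_J, p_M map parent values to the probability that the child is true, in the style of the earlier sketches.

```python
def em_step(rows, p_A, p_J, p_M):
    """rows: list of (b, e, j, m) observations with A missing."""
    # E-step: expected value of the hidden A for every row.
    def posterior_a(b, e, j, m):
        w = {}
        for a_val in (True, False):
            pa = p_A[(b, e)] if a_val else 1 - p_A[(b, e)]
            pj = p_J[a_val] if j else 1 - p_J[a_val]
            pm = p_M[a_val] if m else 1 - p_M[a_val]
            w[a_val] = pa * pj * pm        # P(B) and P(E) cancel in the normalization
        return w[True] / (w[True] + w[False])
    expected_a = [posterior_a(*row) for row in rows]

    # M-step: re-estimate P(A = t | B, E) from expected counts.
    new_p_A = {}
    for cond in [(True, True), (True, False), (False, True), (False, False)]:
        num = sum(q for row, q in zip(rows, expected_a) if (row[0], row[1]) == cond)
        den = sum(1 for row in rows if (row[0], row[1]) == cond)
        new_p_A[cond] = num / den if den else p_A[cond]
    return new_p_A, expected_a
```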

slide-55
SLIDE 55

example: EM for parameter learning

suppose we're given the following initial BN and training set (A is unobserved):

B E A J M
f f ? f f
f f ? t f
t f ? t t
f f ? f t
f t ? t f
f f ? f t
t t ? t t
f f ? f f
f f ? t f
f f ? f t

Initial CPTs:
P(B) = 0.1      P(E) = 0.2

P(A | B, E):         P(J | A):        P(M | A):
B E    P(A)          A    P(J)        A    P(M)
t t    0.9           t    0.9         t    0.8
t f    0.6           f    0.2         f    0.1
f t    0.3
f f    0.2

slide-56
SLIDE 56

example: E-step

using the initial CPTs (previous slide), compute P(a | b, e, j, m) for each training instance:

B E    A (expected)               J M
f f    t: 0.0069   f: 0.9931      f f
f f    t: 0.2      f: 0.8         t f
t f    t: 0.98     f: 0.02        t t
f f    t: 0.2      f: 0.8         f t
f t    t: 0.3      f: 0.7         t f
f f    t: 0.2      f: 0.8         f t
t t    t: 0.997    f: 0.003       t t
f f    t: 0.0069   f: 0.9931      f f
f f    t: 0.2      f: 0.8         t f
f f    t: 0.2      f: 0.8         f t

slide-57
SLIDE 57

example: E-step

P(b,e,a,j,m) P(a | b,e,j,m)  P(b,e,a,j,m)  P(b,e,a,j,m) 0.9  0.8  0.2  0.1 0.2  0.9  0.8  0.2  0.1 0.2  0.9  0.8  0.8  0.8  0.9  0.00288 .4176  0.0069 P(b,e,a, j,m) P(a | b,e, j,m)  P(b,e,a, j,m)  P(b,e,a, j,m) 0.9  0.8  0.2  0.9  0.2  0.9  0.8  0.2  0.9  0.2  0.9  0.8  0.8  0.2  0.9  0.02592 .1296  0.2 P(a |b,e, j,m) P(b,e,a, j,m)  P(b,e,a, j,m)  P(b,e,a, j,m) 0.1 0.8  0.6  0.9  0.8  0.1 0.8  0.6  0.9  0.8  0.1 0.8  0.4  0.2  0.1  0.03456 .0352  0.98

slide-58
SLIDE 58

example: M-step

re-estimate the probabilities using expected counts:

P(a | b, e) = E#(a ∧ b ∧ e) / E#(b ∧ e)

Expected counts from the E-step (B E, expected A, J M):

B E    A (expected)               J M
f f    t: 0.0069   f: 0.9931      f f
f f    t: 0.2      f: 0.8         t f
t f    t: 0.98     f: 0.02        t t
f f    t: 0.2      f: 0.8         f t
f t    t: 0.3      f: 0.7         t f
f f    t: 0.2      f: 0.8         f t
t t    t: 0.997    f: 0.003       t t
f f    t: 0.0069   f: 0.9931      f f
f f    t: 0.2      f: 0.8         t f
f f    t: 0.2      f: 0.8         f t

P(a | b, e)   = 0.997 / 1 = 0.997
P(a | b, ¬e)  = 0.98 / 1 = 0.98
P(a | ¬b, e)  = 0.3 / 1 = 0.3
P(a | ¬b, ¬e) = (0.0069 + 0.2 + 0.2 + 0.2 + 0.0069 + 0.2 + 0.2) / 7 ≈ 0.145

Updated CPT:
B E    P(A)
t t    0.997
t f    0.98
f t    0.3
f f    0.145

re-estimate the probabilities for P(J | A) and P(M | A) in the same way

slide-59
SLIDE 59

Convergence of EM

  • E and M steps are iterated until probabilities converge
  • EM will converge to a maximum in the data likelihood (MLE or MAP)
  • the maximum may be a local optimum, however
  • the optimum found depends on starting conditions (initial estimated probability parameters)

slide-60
SLIDE 60


Bayesian Networks

• Graphical models
• Bayesian networks – definition
• Bayesian networks – inference
  • Exact inference
  • Approximate inference
• Bayesian networks – learning
  • Parameter learning
  • Network learning

slide-61
SLIDE 61

Learning structure + parameters

  • the number of structures is super-exponential in the number of variables
  • finding the optimal structure is an NP-complete problem
  • two common options:
    – search a very restricted space of possible structures (e.g. networks with tree DAGs)
    – use heuristic search (e.g. sparse candidate)
slide-62
SLIDE 62

The Chow-Liu algorithm

(Chow & Liu 1968)

  • learns a BN with a tree structure that maximizes the likelihood of the training data
  • algorithm:
    1. compute the weight I(Xi, Xj) of each possible edge (Xi, Xj)
    2. find a maximum weight spanning tree (MST)
    3. assign edge directions in the MST
slide-63
SLIDE 63

The Chow-Liu algorithm

1. use mutual information to calculate edge weights:

  I(X, Y) = Σ_{x ∈ values(X)} Σ_{y ∈ values(Y)} P(x, y) log2 [ P(x, y) / (P(x) P(y)) ]

slide-64
SLIDE 64
slide-65
SLIDE 65

Joint

slide-66
SLIDE 66
slide-67
SLIDE 67

November 5, 2017 Data Mining: Concepts and Techniques 72

slide-68
SLIDE 68
slide-69
SLIDE 69

The Chow-Liu algorithm

2. find a maximum weight spanning tree (MST): a maximal-weight tree that connects all vertices in a graph

[Figure: an example weighted graph over nodes A, B, C, D, E, F, G; the edge weights are not cleanly recoverable from the extraction]

slide-70
SLIDE 70

Prim’s algorithm for finding an MST

given: graph with vertices V and edges E

Vnew ← { v }, where v is an arbitrary vertex from V
Enew ← { }
repeat until Vnew = V {
  choose an edge (u, v) in E with max weight, where u is in Vnew and v is not
  add v to Vnew and (u, v) to Enew
}
return Vnew and Enew, which represent an MST
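A minimal sketch (not from the slides) of this procedure in Python, over an undirected graph given as a dictionary of edge weights {(u, v): w}:

```python
def prim_mst(vertices, weights):
    v_new = {next(iter(vertices))}          # start from an arbitrary vertex
    e_new = []
    while v_new != set(vertices):
        # choose the max-weight edge with exactly one endpoint in the tree so far
        u, v = max(((u, v) for (u, v) in weights
                    if (u in v_new) != (v in v_new)),
                   key=lambda edge: weights[edge])
        v_new.add(v if u in v_new else u)
        e_new.append((u, v))
    return e_new
```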

slide-71
SLIDE 71

Kruskal’s algorithm for finding an MST

given: graph with vertices V and edges E

Enew ← { }
for each (u, v) in E, ordered by weight (from high to low) {
  remove (u, v) from E
  if adding (u, v) to Enew does not create a cycle
    add (u, v) to Enew
}
return V and Enew, which represent an MST
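A minimal sketch (not from the slides) of Kruskal's procedure, using a simple union-find structure to detect cycles:

```python
def kruskal_mst(vertices, weights):
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    e_new = []
    for (u, v) in sorted(weights, key=weights.get, reverse=True):   # high to low weight
        ru, rv = find(u), find(v)
        if ru != rv:                      # adding (u, v) does not create a cycle
            parent[ru] = rv
            e_new.append((u, v))
    return e_new
```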

slide-72
SLIDE 72

Returning directed graph in Chow-Liu

3. pick a node for the root, and assign edge directions

[Figure: the undirected maximum weight spanning tree over nodes A, B, C, D, E, F, G, and the directed tree obtained after choosing a root]
slide-73
SLIDE 73

The Chow-Liu algorithm

  • How do we know that Chow-Liu will find a tree that maximizes the data likelihood?
  • Two key questions:
    – Why can we represent the data likelihood as a sum of I(X; Y) over edges?
    – Why can we pick any direction for the edges in the tree?

slide-74
SLIDE 74

Why Chow-Liu maximizes likelihood (for a tree)

data likelihood given the directed edges of G, with best-fit parameters θG:

log2 P(D | G, θG) = Σ_{d ∈ D} Σ_i log2 P(xi^(d) | Pa(Xi))

  = |D| Σ_i Σ_{values(Xi, Pa(Xi))} P(Xi, Pa(Xi)) log2 P(Xi | Pa(Xi))

  (since summing over all examples is equivalent to computing the average over all examples and then multiplying by the total number of examples |D|)

  = |D| Σ_i Σ_{values(Xi, Pa(Xi))} P(Xi, Pa(Xi)) log2 [ P(Xi, Pa(Xi)) / P(Pa(Xi)) ]

  = |D| Σ_i Σ_{values(Xi, Pa(Xi))} P(Xi, Pa(Xi)) log2 [ P(Xi, Pa(Xi)) / (P(Xi) P(Pa(Xi))) ]
    + |D| Σ_i Σ_{values(Xi, Pa(Xi))} P(Xi, Pa(Xi)) log2 P(Xi)

  = |D| Σ_i ( I(Xi, Pa(Xi)) − H(Xi) )
slide-75
SLIDE 75

Why Chow-Liu maximizes likelihood (for a tree)

data likelihood given the directed edges:

log2 P(D | G, θG) = Σ_{d ∈ D} Σ_i log2 P(xi^(d) | Parents(Xi)) = |D| Σ_i ( I(Xi, Parents(Xi)) − H(Xi) )

we're interested in finding the graph G that maximizes this:

argmax_G log P(D | G, θG) = argmax_G Σ_i I(Xi, Parents(Xi))

(the entropy terms H(Xi) do not depend on the graph structure)

if we assume a tree, one node has no parents and all others have exactly one, so

argmax_G log P(D | G, θG) = argmax_G Σ_{(Xi, Xj) ∈ edges} I(Xi, Xj)

edge directions don't matter for the likelihood, because mutual information is symmetric: I(Xi, Xj) = I(Xj, Xi)

slide-76
SLIDE 76

Learning structure + parameters

  • the number of structures is super-exponential in the number of variables
  • finding the optimal structure is an NP-complete problem
  • two common options:
    – search a very restricted space of possible structures (e.g. networks with tree DAGs)
    – use heuristic search (e.g. sparse candidate)

slide-77
SLIDE 77

Heuristic search for structure learning

  • each state in the search space represents a DAG Bayes net structure
  • to instantiate a search approach, we need to specify
    – state transition operators
    – a scoring function for states
    – a search algorithm (how to move through the state space)

slide-78
SLIDE 78

The typical structure search operators

given the current network at some stage of the search, we can…
  • add an edge
  • delete an edge
  • reverse an edge

[Figures: the network over A, B, C, D before and after each operator]

slide-79
SLIDE 79

Scoring function decomposability

  • If the score is likelihood, and all instances in D are complete, then the score can be decomposed as follows (and so can some other scores we'll see later):

    score(G, D) = Σ_i score(Xi, Parents(Xi) : D)

  • thus we can
    – score a network by summing terms over the nodes in the network
    – efficiently score changes in a local search procedure

slide-80
SLIDE 80

Bayesian network search: hill-climbing

given: data set D, initial network B0

i = 0
Bbest ← B0
while stopping criteria not met {
  for each possible operator application a {
    Bnew ← apply(a, Bi)
    if score(Bnew) > score(Bbest)
      Bbest ← Bnew
  }
  ++i
  Bi ← Bbest
}
return Bi

slide-81
SLIDE 81

Bayesian network search: the Sparse Candidate algorithm

[Friedman et al., UAI 1999]

given: data set D, initial network B0, parameter k

i = 0
repeat {
  ++i
  // restrict step
  select for each variable Xj a set Cj^i (|Cj^i| ≤ k) of candidate parents
  // maximize step
  find the network Bi maximizing the score among networks where ∀Xj, Parents(Xj) ⊆ Cj^i
} until convergence
return Bi

slide-82
SLIDE 82
The restrict step in Sparse Candidate

  • to identify candidate parents in the first iteration, we can compute the mutual information between pairs of variables and select the top k:

    I(X, Y) = Σ_{x,y} P(x, y) log [ P(x, y) / (P(x) P(y)) ]

  • in subsequent iterations, condition on the current parents with conditional mutual information:

    I(X, Y | Z) = Σ_{x,y,z} P(x, y, z) log [ P(x, y | z) / (P(x | z) P(y | z)) ]

slide-83
SLIDE 83

The maximize step in Sparse Candidate

  • hill-climbing search with add-edge, delete-edge, and reverse-edge operators
  • test to ensure that cycles aren't introduced into the graph
slide-84
SLIDE 84

Scoring functions for structure learning

  • Can we find a good structure just by trying to maximize the likelihood of the data?

    argmax_{G, θG} log P(D | G, θG)

slide-85
SLIDE 85

Scoring functions for structure learning

  • Can we find a good structure just by trying to maximize the likelihood of the data?

    argmax_{G, θG} log P(D | G, θG)

  • If we have a strong restriction on the structures allowed (e.g. a tree), then maybe.
  • Otherwise, no! Adding an edge will never decrease the likelihood, so overfitting is likely.
slide-86
SLIDE 86

Scoring functions for structure learning

  • one general approach (where n is the number of data points):

    argmin_{G, θG}  |θG| f(n) − log P(D | G, θG)

    where the |θG| f(n) term is the complexity penalty

    Akaike Information Criterion (AIC):     f(n) = 1
    Bayesian Information Criterion (BIC):   f(n) = (1/2) log(n)
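A minimal sketch (not from the slides) of this penalized score; log_lik (the fitted log-likelihood) and num_params (|θG|) are assumed to be computed elsewhere:

```python
import math

def penalized_score(log_lik, num_params, n, criterion="BIC"):
    """Complexity-penalized score; lower is better (argmin form from the slide)."""
    f = 1.0 if criterion == "AIC" else 0.5 * math.log(n)
    return num_params * f - log_lik
```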

slide-87
SLIDE 87


Bayesian Networks

• Graphical models
• Bayesian networks – definition
• Bayesian networks – inference
  • Exact inference
  • Approximate inference
• Bayesian networks – learning
  • Parameter learning
  • Parameter learning + inference
  • Network learning