SLIDE 1
LEARNING FROM DATA: DETECTING CONDITIONAL INDEPENDENCIES AND SCORE+SEARCH METHODS

Pedro Larrañaga

Computational Intelligence Group, Artificial Intelligence Department, Universidad Politécnica de Madrid

Bayesian Networks: From Theory to Practice
International Black Sea University Autumn School on Machine Learning
3-11 October 2019, Tbilisi, Georgia

SLIDE 2

Outline

1 Introduction
2 Learning Parameters
3 Learning Structures
4 Summary


SLIDE 4

From data to Bayesian networks

Learning structure and parameters

SLIDE 5

From data to Bayesian networks

Learning structure and parameters

From raw data to other representations of the information (Bayesian networks):

more condensed: showing the essentials of the data, and reducing the number of parameters necessary to specify the joint probability distribution
more abstract: a model describing the joint probability distribution that generates the data
more useful: a model able to perform different types of reasoning

SLIDE 6

Discovering associations

The task of learning Bayesian networks from data

Given a data set of cases $D = \{x^{(1)}, \ldots, x^{(N)}\}$ drawn at random from a joint probability distribution $p_0(x_1, \ldots, x_n)$ over $X_1, \ldots, X_n$, and possibly some domain expert background knowledge, the task consists of identifying (learning) a DAG (directed acyclic graph) structure $S$ and a set of corresponding parameters $\Theta$

SLIDE 7

Discovering associations

The task of learning Bayesian networks from data

When discovering associations all the variables receive the same treatment:
there is no target variable, as in supervised classification
there is no hidden variable, as in clustering

SLIDE 8

Outline

1 Introduction
2 Learning Parameters
3 Learning Structures
4 Summary

SLIDE 9

Expert vs estimating from a data base

Assuming we have the structure of the Bayesian network:
Direct estimation given by an expert
Estimation from a data base of cases

Example: if each variable $X_i$ has $r_i$ possible values, and variable $Y$ has $r_Y$ values, the number of parameters to be learnt to specify $P(Y = y \mid X_1 = x_1, \ldots, X_n = x_n)$ is $(r_Y - 1)\prod_{i=1}^{n} r_i$

SLIDE 10

Maximum likelihood estimation

Parameter space

Consider a variable $X$ with $r$ possible values $\{1, 2, \ldots, r\}$. We have $N$ observations (cases) of $X$: $D = \{x^1, \ldots, x^N\}$, a sample of size $N$ drawn from $X$

Example: $X$ measures the result of rolling a die five times; $D = \{1, 6, 4, 3, 1\}$, $r = 6$, $N = 5$

We are interested in estimating $P(X = k)$. The parameter space is

$\Theta = \{\theta = (\theta_1, \ldots, \theta_r) \mid \theta_i \in [0, 1],\ \theta_r = 1 - \sum_{i=1}^{r-1} \theta_i\}$

and $P(X = k \mid \theta_1, \ldots, \theta_r) = \theta_k$

SLIDE 11

Maximum likelihood estimation

Likelihood function

$L(D : \theta) = P(D \mid \theta) = P(X = x^1, \ldots, X = x^N \mid \theta)$

The likelihood function measures how probable it is to obtain the data base of cases for a concrete value of the parameter $\theta$. Assuming that the cases are independent:

$P(D \mid \theta) = \prod_{i=1}^{N} P(X = x^i \mid \theta) = \prod_{k=1}^{r} \theta_k^{N_k}$

where $N_k$ denotes the number of cases in the data base for which $X = k$

SLIDE 12

Likelihood function

Example

Data base of $N = 10$ cases of a binary variable $X$: cases 1-5 have $X = 0$ and cases 6-10 have $X = 1$ (five zeros and five ones)

For $\theta = P(X = 1) = \frac{1}{4}$:

$L(D : \tfrac{1}{4}) = P(D \mid \tfrac{1}{4}) = P(X = 0, \ldots, X = 1 \mid \tfrac{1}{4}) = \left(\tfrac{3}{4}\right)^5 \left(\tfrac{1}{4}\right)^5 \approx 2.32 \times 10^{-4}$

For $\theta = P(X = 1) = \frac{1}{2}$:

$L(D : \tfrac{1}{2}) = P(D \mid \tfrac{1}{2}) = P(X = 0, \ldots, X = 1 \mid \tfrac{1}{2}) = \left(\tfrac{1}{2}\right)^{10} \approx 9.77 \times 10^{-4} > \left(\tfrac{3}{4}\right)^5 \left(\tfrac{1}{4}\right)^5$

SLIDE 13

Maximum likelihood estimation

Multinomial distribution: relative frequencies

$\theta^* = (\theta_1^*, \theta_2^*, \ldots, \theta_{r-1}^*) = \arg\max_{(\theta_1, \theta_2, \ldots, \theta_{r-1})} P(D \mid \theta)$

In a multinomial distribution, the maximum likelihood estimator of $P(X = k)$ is $\theta_k^* = \frac{N_k}{N}$, the relative frequency

In the previous example, the maximum likelihood estimator of $P(X = 1)$ is $\theta^* = \frac{5}{10}$

SLIDE 14

Bayesian estimation

Prior, posterior and predictive distributions

Prior knowledge is expressed by means of a prior joint distribution over the parameters: $\rho(\theta_1, \theta_2, \ldots, \theta_{r-1})$

The posterior distribution of the parameters given $D$ is denoted by $\rho(\theta_1, \theta_2, \ldots, \theta_{r-1} \mid D)$

The predictive distribution $P(X = k \mid D)$ is the average of the marginal over $\theta_k$ of the posterior distribution of the parameters:

$P(X = k \mid D) = \int_{\Theta} \theta_k \, \rho(\theta_1, \theta_2, \ldots, \theta_{r-1} \mid D) \, d\theta_1 d\theta_2 \cdots d\theta_{r-1}$

Using the Bayes formula:

$P(X = k \mid D) = \frac{1}{P(D)} \int_{\Theta} \theta_k \prod_{j=1}^{r} \theta_j^{N_j} \, \rho(\theta_1, \theta_2, \ldots, \theta_{r-1}) \, d\theta_1 d\theta_2 \cdots d\theta_{r-1}$

SLIDE 15

Bayesian estimation

Dirichlet distribution

The calculation of the previous integral depends on the prior distribution. For a family of prior distributions, the Dirichlet distribution:

$\rho(\theta_1, \ldots, \theta_{r-1}) \equiv \text{Dir}(\theta_1, \ldots, \theta_{r-1}; a_1, \ldots, a_r) = \frac{\Gamma(\sum_{i=1}^{r} a_i)}{\prod_{i=1}^{r} \Gamma(a_i)} \, \theta_1^{a_1 - 1} \cdots \theta_r^{a_r - 1}$

with $a_i > 0$, $0 \le \theta_i \le 1$, $\sum_{i=1}^{r} \theta_i = 1$, and $\Gamma(u) = \int_0^{\infty} t^{u-1} e^{-t} \, dt$ (if $u \in \mathbb{N}$ then $\Gamma(u) = (u-1)!$),

the integral has an analytic solution. The solution is obtained using a conjugacy property of the Dirichlet distribution:

If the prior is $\text{Dir}(\theta_1, \ldots, \theta_{r-1}; a_1, \ldots, a_r)$, then the posterior is $\text{Dir}(\theta_1, \ldots, \theta_{r-1}; a_1 + N_1, \ldots, a_r + N_r)$

SLIDE 16

Bayesian estimation

Dirichlet distribution

The Bayesian estimation is:

$P(X = k \mid D) = \frac{N_k + a_k}{N + \sum_{i=1}^{r} a_i}$

The value $\sum_{i=1}^{r} a_i$ is called the equivalent sample size

Interpretation of the Dirichlet as prior distribution: before obtaining the data base $D = \{x^1, \ldots, x^N\}$ we had virtually observed a sample of size $\sum_{i=1}^{r} a_i$, in which $X$ took the value $k$ exactly $a_k$ times
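As a quick numerical check of the conjugacy property, the following sketch (assuming NumPy is available) updates a Dir(1, 1, 1) prior with observed counts and evaluates the predictive distribution; the counts are illustrative, not from the slides:

import numpy as np

a = np.array([1.0, 1.0, 1.0])             # prior Dir(1, 1, 1) for r = 3
counts = np.array([5.0, 2.0, 3.0])        # observed counts N_k, so N = 10
posterior = a + counts                    # conjugacy: posterior is Dir(a_k + N_k)
predictive = posterior / posterior.sum()  # P(X = k | D) = (N_k + a_k) / (N + sum a_i)
print(predictive)                         # [6/13, 3/13, 4/13]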

SLIDE 17

Bayesian estimation

Lidstone rule for estimation

Assuming a specific Dirichlet prior with $a_i = \lambda$ for all $i = 1, \ldots, r$, that is $\text{Dir}(\theta_1, \ldots, \theta_{r-1}; \lambda, \ldots, \lambda)$, we obtain the Lidstone rule for estimation:

$P(X = k \mid D) = \frac{N_k + \lambda}{N + r\lambda}$

Laplace rule ($\lambda = 1$): $P(X = k \mid D) = \frac{N_k + 1}{N + r}$

Jeffreys-Perks rule ($\lambda = 0.5$): $P(X = k \mid D) = \frac{N_k + 0.5}{N + \frac{r}{2}}$

Schurmann-Grassberger rule ($\lambda = \frac{1}{r}$): $P(X = k \mid D) = \frac{N_k + \frac{1}{r}}{N + 1}$
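A minimal sketch of these estimators in Python (NumPy assumed; the function name lidstone is ours, not a library routine):

import numpy as np

def lidstone(counts, lam):
    """P(X = k | D) = (N_k + lambda) / (N + r * lambda); lam = 0 gives the MLE."""
    counts = np.asarray(counts, dtype=float)
    return (counts + lam) / (counts.sum() + lam * len(counts))

# Die-rolling example from the slides: D = {1, 6, 4, 3, 1}, r = 6, N = 5
counts = [2, 0, 1, 1, 0, 1]
print(lidstone(counts, 0.0))    # maximum likelihood (relative frequencies)
print(lidstone(counts, 1.0))    # Laplace rule
print(lidstone(counts, 0.5))    # Jeffreys-Perks rule
print(lidstone(counts, 1 / 6))  # Schurmann-Grassberger rule (lambda = 1/r)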

SLIDE 18

Estimation of parameters

Parameters $\theta_{ijk}$

Bayesian network structure $S = (X, A)$, with $X = (X_1, \ldots, X_n)$ and $A$ denoting the set of arcs

Variable $X_i$ has $r_i$ possible values: $x_i^1, \ldots, x_i^{r_i}$

Local probability distribution $P(x_i \mid pa_i^{j,S}, \theta_i)$:

$P(x_i^k \mid pa_i^{j,S}, \theta_i) = \theta_{x_i^k \mid pa_i^j} \equiv \theta_{ijk}$

The parameter $\theta_{ijk}$ represents the conditional probability that variable $X_i$ takes its $k$-th value, given that the set of its parent variables takes its $j$-th value

$pa_i^{1,S}, \ldots, pa_i^{q_i,S}$ denote the values of $Pa_i^S$, the set of parents of the variable $X_i$ in the structure $S$

The term $q_i$ denotes the number of possible different instantiations of the parent variables of $X_i$. Thus, $q_i = \prod_{X_g \in Pa_i} r_g$

The local parameters for variable $X_i$ are $\theta_i = ((\theta_{ijk})_{k=1}^{r_i})_{j=1}^{q_i}$

Global parameters: $\theta = (\theta_1, \ldots, \theta_n)$

SLIDE 19

Maximum likelihood estimation of parameters

Parameters $\theta_{ijk}$: example

Figure: Structure, local probabilities and resulting factorization for a Bayesian network with four variables ($X_1$, $X_3$ and $X_4$ with two possible values, and $X_2$ with three possible values)

Local probabilities:

$\theta_1 = (\theta_{1-1}, \theta_{1-2})$: $P(x_1^1), P(x_1^2)$
$\theta_2 = (\theta_{2-1}, \theta_{2-2}, \theta_{2-3})$: $P(x_2^1), P(x_2^2), P(x_2^3)$
$\theta_3 = (\theta_{311}, \ldots, \theta_{362})$: $P(x_3^k \mid x_1^{j_1}, x_2^{j_2})$ for $k = 1, 2$ and the six parent configurations $(x_1^{j_1}, x_2^{j_2})$
$\theta_4 = (\theta_{411}, \theta_{421}, \theta_{412}, \theta_{422})$: $P(x_4^1 \mid x_3^1), P(x_4^1 \mid x_3^2), P(x_4^2 \mid x_3^1), P(x_4^2 \mid x_3^2)$

Factorization of the joint probability mass:

$P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2) P(x_3 \mid x_1, x_2) P(x_4 \mid x_3)$

Table: variables ($X_i$), number of possible values ($r_i$), parent set ($Pa_i$), and number of possible instantiations of the parents ($q_i$):

Xi | ri | Pai      | qi
X1 | 2  | ∅        | 1
X2 | 3  | ∅        | 1
X3 | 2  | {X1, X2} | 6
X4 | 2  | {X3}     | 2
SLIDE 20

Maximum likelihood estimation of parameters

Global independence of the parameters

For a given structure $S$ and a data base $D = \{x^{(1)}, \ldots, x^{(N)}\}$:

$L(D : \theta) = P(D \mid \theta) = P(x^{(1)}, \ldots, x^{(N)} \mid \theta) = \prod_{h=1}^{N} P(x^{(h)} \mid \theta) = \prod_{h=1}^{N} P(x_1^{(h)}, \ldots, x_n^{(h)} \mid \theta)$

$= \prod_{h=1}^{N} \prod_{i=1}^{n} P(x_i^{(h)} \mid pa_i^{(h),S}, \theta) = \prod_{i=1}^{n} \prod_{h=1}^{N} P(x_i^{(h)} \mid pa_i^{(h),S}, \theta)$

Assuming global independence of the parameters:

$L(D : \theta) = \prod_{i=1}^{n} \prod_{h=1}^{N} P(x_i^{(h)} \mid pa_i^{(h),S}, \theta_i) = \prod_{i=1}^{n} L(D_i : \theta_i)$

It is possible to estimate the parameters for each variable $X_i$ independently of the rest of the variables

SLIDE 21

Maximum likelihood estimation of parameters

Global independence: $L(D : \theta) = \prod_{i=1}^{n} L(D_i : \theta_i)$

Figure: Data base D for four variables

SLIDE 22

Maximum likelihood estimation of parameters

Global independence: $L(D : \theta) = \prod_{i=1}^{n} L(D_i : \theta_i)$

Figure: Data base D2 for estimating the parameters of variable X2

SLIDE 23

Maximum likelihood estimation of parameters

Global independence: $L(D : \theta) = \prod_{i=1}^{n} L(D_i : \theta_i)$

Figure: Data base D1 for estimating the parameters of variable X1

SLIDE 24

Maximum likelihood estimation of parameters

Global independence: $L(D : \theta) = \prod_{i=1}^{n} L(D_i : \theta_i)$

Figure: Data base D4 for estimating the parameters of variable X4

SLIDE 25

Maximum likelihood estimation of parameters

Global independence: $L(D : \theta) = \prod_{i=1}^{n} L(D_i : \theta_i)$

Figure: Data base D3 for estimating the parameters of variable X3

SLIDE 26

Maximum likelihood estimation of parameters

Local independence of the parameters

$L(D : \theta) = \prod_{i=1}^{n} L(D_i : \theta_i) = \prod_{i=1}^{n} \left( \prod_{h=1}^{N} P(x_i^{(h)} \mid pa_i^{(h),S}, \theta_i) \right) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{h_j=1}^{N_{ij}} P(x_i^{(h_j)} \mid pa_i^{j,S}, \theta_i)$

where $N_{ij}$ is the number of cases in $D$ in which the configuration $pa_i^{j,S}$ has been observed (if $Pa_i^S = \emptyset$ then $N_{ij} = N$)

Assuming local independence of the parameters:

$L(D : \theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{h_j=1}^{N_{ij}} P(x_i^{(h_j)} \mid pa_i^{j,S}, \theta_{ij}) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} L(D_{ij} : \theta_{ij})$

SLIDE 27

Maximum likelihood estimation of parameters

Local independence: $L(D : \theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} L(D_{ij} : \theta_{ij})$

Figure: Data base D21 for estimating the parameters of variable X2 when X1 = 0

SLIDE 28

Maximum likelihood estimation of parameters

Local independence: $L(D : \theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} L(D_{ij} : \theta_{ij})$

Figure: Data base D22 for estimating the parameters of variable X2 when X1 = 1

SLIDE 29

Maximum likelihood estimation of parameters

Local independence: $L(D : \theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} L(D_{ij} : \theta_{ij})$

Figure: Data base D41 for estimating the parameters of variable X4 when X1 = 0, X3 = 0

SLIDE 30

Maximum likelihood estimation of parameters

Local independence: $L(D : \theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} L(D_{ij} : \theta_{ij})$

Figure: Data base D42 for estimating the parameters of variable X4 when X1 = 0, X3 = 1

SLIDE 31

Maximum likelihood estimation of parameters

$L(D : \theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$

where $P(X_i = x_i^k \mid pa_i^j, \theta_{ij}) = \theta_{ijk}$, with $i = 1, \ldots, n$; $j = 1, \ldots, q_i$; $k = 1, \ldots, r_i$

$N_{ij}$: number of cases in $D$ where the configuration $pa_i^j$ has been observed
$N_{ijk}$: number of cases in $D$ where simultaneously $X_i = x_i^k$ and $Pa_i = pa_i^j$ have been observed ($N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$)

$L(D_{ij} : \theta_{ij}) = \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}} \qquad L(D : \theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$

SLIDE 32

Maximum likelihood estimation of parameters

Relative frequency

For each variable $X_i$ and each configuration $pa_i^j$ of $Pa_i$:

$\hat{\theta}_{ijk} = P(X_i = x_i^k \mid Pa_i = pa_i^j) = \frac{N_{ijk}}{N_{ij}}$

Drawbacks with sparse data sets:
$N_{ij}$ can be zero (e.g. $P(X_4 = 0 \mid X_1 = 1, X_3 = 1)$)
Difficulty generalising (e.g. $P(X_2 = 1 \mid X_1 = 0) = 1$ based on only three cases)
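A sketch of the relative-frequency estimator with optional Lidstone smoothing, assuming the cases sit in a pandas DataFrame of discrete values (estimate_cpt is a hypothetical helper, not a library function):

import pandas as pd

def estimate_cpt(df, child, parents, lam=0.0):
    """theta_ijk = (N_ijk + lam) / (N_ij + lam * r_i); lam = 0 is the MLE.

    Parent configurations that never occur in df (N_ij = 0) simply do not
    appear in the result, which is exactly the sparse-data drawback above.
    """
    r_i = df[child].nunique()
    if not parents:
        counts = df[child].value_counts().sort_index()               # N_k
        return (counts + lam) / (len(df) + lam * r_i)
    n_ijk = df.groupby(list(parents))[child].value_counts()          # N_ijk
    n_ij = n_ijk.groupby(level=list(parents)).transform("sum")       # N_ij, aligned
    return (n_ijk + lam) / (n_ij + lam * r_i)

# e.g. a smoothed P(X4 | X3) from a data base with columns X1, ..., X4:
# estimate_cpt(df, "X4", ["X3"], lam=1.0)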

SLIDE 33

Maximum likelihood estimation of parameters

Incomplete data

We have assumed that the data base D was complete: for every case the value of all variables has been recorded. In real-world problems data bases can be incomplete

Figure: Incomplete data base

SLIDE 34

Maximum likelihood estimation of parameters

Incomplete data Ignore the cases containing incomplete data

SLIDE 35

Maximum likelihood estimation of parameters

Incomplete data Ignore the cases containing incomplete data only in the estimation of the corresponding variables

SLIDE 36

Maximum likelihood estimation of parameters

Incomplete data

In the previous two approaches some data are not used (we do not use all the information). If the missing data are produced at random, we can fill them in (imputation of missing data). Standard approaches to imputation:

Impute with the mode of the variable
Impute with the mode of the variable conditioned on the observed information in the rest of the variables

SLIDE 37

Maximum likelihood estimation of parameters

Incomplete data. EM algorithm (Dempster et al., 1977)

Iterative algorithm based on two steps:
Expectation step: estimate the missing data by their expected values, obtained from the current estimates of the parameters
Maximization step: treating the completions of the missing data as complete data, obtain the maximum likelihood estimates of the parameters

SLIDE 38

Maximum likelihood estimation of parameters

Incomplete data. EM algorithm (Dempster et al., 1977)

E step (Expectation). Using an inference algorithm (in iteration 0 the parameters can be initialized at random):

$P_0(X_3 \mid X_1 = 1, X_2 = 0, X_4 = 0) \equiv (0.60, 0.40)$
$P_0(X_1, X_3 \mid X_2 = 1, X_4 = 1) \equiv (0.08, 0.02, 0.72, 0.18)$
$P_0(X_2 \mid X_1 = 0, X_3 = 0, X_4 = 0) \equiv (0.30, 0.70)$

M step (Maximization). From the imputed data set we obtain the weighted frequencies and the maximum likelihood estimates of the parameters. Iterate E step and M step until convergence.

Weighted counts after the E step:

X1 X2 | weight
0  0  | 0.30
0  1  | 1.80
1  0  | 2.00
1  1  | 0.90

X1 | weight
0  | 2.10
1  | 2.90

X3 | weight
0  | 2.40
1  | 2.60

X1 X4 X3 | weight
0  0  0  | 1.00
0  0  1  | 1.00
0  1  0  | 0.08
0  1  1  | 0.02
1  0  0  | 0.00
1  0  1  | 1.40
1  1  0  | 0.72
1  1  1  | 0.18

SLIDE 39

Outline

1 Introduction
2 Learning Parameters
3 Learning Structures
4 Summary

SLIDE 40

Introduction

Learning structure (DAG) and parameters (conditional tables)

SLIDE 41

Introduction

Three types of methods

Based on detecting conditional independencies:
First: carry out a study of the dependence and independence relationships between the variables by means of statistical tests
Second: try to find the structure (or structures) that represents most (or all) of these relationships

Based on score + search: they try to find the structure that best fits the data. They need:
A score (metric or evaluation function) to measure the fitness of each candidate structure
A search method (heuristic) to explore the space of possible solutions in an intelligent manner
Several types of search spaces can be considered

Hybrid methods: based on a search technique guided by a score together with the detection of conditional independencies

SLIDE 42

Testing conditional independencies

Different kinds of input

A list of true conditional independencies
A joint probability distribution where the (in)dependence relationships can be checked
A data base of cases where the (in)dependence relationships can be tested

Figure: Different inputs for structure learning algorithms based on detecting conditional independencies

SLIDE 43

Testing conditional independencies

Different methods

They differ in:
Required additional information: ordering between the variables, ...
Type of structure recovered: tree, polytree, multiply connected graph, ...
Efficiency: number of independencies to be checked, and order (number of variables in the conditioning part) of these independencies
Guarantee of the solution
Robustness against small sample sizes

SLIDE 44

Testing conditional independencies

Testing $I(X_i, X_j \mid \emptyset)$

Given a data base $D = \{x^{(1)}, \ldots, x^{(N)}\}$, how to check $I(X_i, X_j \mid \emptyset)$?

$P(X_i = x_i^k) \approx \frac{N_{ik}}{N} \qquad P(X_j = x_j^h) \approx \frac{N_{jh}}{N} \qquad P(X_i = x_i^k, X_j = x_j^h) \approx \frac{N_{ij}^{kh}}{N}$

where $N_{ij}^{kh}$ is the number of times we observe simultaneously $X_i = x_i^k$ and $X_j = x_j^h$ in $D$

$X_i$ and $X_j$ are independent iff:

$P(X_i = x_i^k, X_j = x_j^h) = P(X_i = x_i^k) P(X_j = x_j^h) \iff \frac{N_{ij}^{kh}}{N} \approx \frac{N_{ik}}{N} \frac{N_{jh}}{N} \iff \frac{N \, N_{ij}^{kh}}{N_{ik} N_{jh}} \approx 1$

Mutual information between $X_i$ and $X_j$:

$MI(X_i, X_j) = \sum_k \sum_h P(x_i^k, x_j^h) \ln \frac{P(x_i^k, x_j^h)}{P(x_i^k) P(x_j^h)} \approx \sum_k \sum_h \frac{N_{ij}^{kh}}{N} \ln \frac{N \, N_{ij}^{kh}}{N_{ik} N_{jh}}$

If $X_i$ and $X_j$ are independent, $2N \cdot MI(X_i, X_j) \approx 2 \sum_k \sum_h N_{ij}^{kh} \ln \frac{N \, N_{ij}^{kh}}{N_{ik} N_{jh}}$ will be close to zero

It can be tested statistically using the fact that, if $X_i$ and $X_j$ are independent:

$2N \cdot MI(X_i, X_j) \rightarrow \chi^2_{(r_i - 1)(r_j - 1)}$
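A sketch of this marginal independence test in Python (NumPy and SciPy assumed); the statistic $2N \cdot MI$ is the G statistic, compared against the $\chi^2$ distribution:

import numpy as np
from scipy.stats import chi2

def mi_test(x, y):
    """Return (2*N*MI(X, Y), p-value) under H0: X and Y are independent."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    n_kh = np.zeros((len(xs), len(ys)))
    np.add.at(n_kh, (xi, yi), 1)                       # joint counts N_ij^{kh}
    n = n_kh.sum()
    expected = np.outer(n_kh.sum(1), n_kh.sum(0)) / n  # N_ik * N_jh / N
    nz = n_kh > 0                                      # 0 * ln(0) taken as 0
    stat = 2.0 * np.sum(n_kh[nz] * np.log(n_kh[nz] / expected[nz]))
    dof = (len(xs) - 1) * (len(ys) - 1)
    return stat, chi2.sf(stat, dof)

# independence is rejected at level 0.05 when the p-value falls below 0.05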

SLIDE 45

Testing conditional independencies

Testing $I(X_i, X_j \mid Z)$

Given a data base $D = \{x^{(1)}, \ldots, x^{(N)}\}$, $I(X_i, X_j \mid Z)$ is checked analogously to $I(X_i, X_j \mid \emptyset)$. $Z$ can denote a single variable or a set of variables. With $N_z$ the number of cases where $Z = z$:

$P(X_i = x_i^k \mid Z = z) \approx \frac{N_{iZ}^{kz}}{N_z} \qquad P(X_j = x_j^h \mid Z = z) \approx \frac{N_{jZ}^{hz}}{N_z} \qquad P(X_i = x_i^k, X_j = x_j^h \mid Z = z) \approx \frac{N_{ijZ}^{khz}}{N_z}$

$X_i$ and $X_j$ are conditionally independent given $Z$ iff, for every configuration $z$:

$\frac{N_{ijZ}^{khz}}{N_z} \approx \frac{N_{iZ}^{kz}}{N_z} \frac{N_{jZ}^{hz}}{N_z}$

Mutual information between $X_i$ and $X_j$ given $Z$:

$2N \cdot MI(X_i, X_j \mid Z) \approx 2 \sum_k \sum_h \sum_z N_{ijZ}^{khz} \ln \frac{N_z \, N_{ijZ}^{khz}}{N_{iZ}^{kz} N_{jZ}^{hz}}$

If $X_i$ and $X_j$ are conditionally independent given $Z$ then $2N \cdot MI(X_i, X_j \mid Z)$ will be close to zero. It can be tested statistically using the fact that, under conditional independence:

$2N \cdot MI(X_i, X_j \mid Z) \rightarrow \chi^2_{(r_i - 1)(r_j - 1) \, r_Z}$

where $r_Z$ is the number of configurations of $Z$

SLIDE 46

Testing conditional independencies

Some considerations

The order of a conditional independence test $I(X_i, X_j \mid Z)$ is the number of variables in $Z$

To test $I(X_i, X_j \mid Z)$ it is necessary to calculate $N_{ijZ}^{khz}$ and to compute $2N \cdot MI(X_i, X_j \mid Z)$

The complexity of the test:
Grows exponentially with the order
Grows linearly with the number of cases

The reliability of the test:
Increases with the number of cases (it is an asymptotic test)
Decreases dramatically with the order of the test

We would therefore like to have large data sets and to carry out low-order tests

SLIDE 47

Testing conditional independencies

Equivalent DAGs

Two DAGs $S_1$ and $S_2$ are equivalent (independence or Markov equivalent) if for all $W, Y, Z \subseteq X$: $I_{S_1}(W, Y \mid Z) \iff I_{S_2}(W, Y \mid Z)$

A head-to-head pattern in a DAG $S$ is an ordered triplet of variables $(X, Z, Y)$ such that $X$ and $Y$ are not adjacent and $S$ contains the arcs $X \rightarrow Z$ and $Y \rightarrow Z$

Figure: Head to head pattern

SLIDE 48

Testing conditional independencies

Equivalent DAGs

Two DAGs $S_1$ and $S_2$ are equivalent iff they have the same edges (arcs without direction) and the same head-to-head patterns

Figure: Equivalent DAGs

SLIDE 49

Testing conditional independencies

Partially directed acyclic graph (PDAG)

Using only conditional independence tests it may not be possible to obtain a unique DAG. Usually a partially directed acyclic graph (PDAG) is obtained. Each PDAG represents an equivalence class of DAGs

Figure: PDAGs

The arcs in a PDAG appear in every DAG associated with its equivalence class
The edges in a PDAG can be oriented in different ways in each of the DAGs associated with its class (without creating new head-to-head patterns)

SLIDE 50

Testing conditional independencies

PC algorithm (Spirtes et al., 1993)

The general idea is to derive a skeleton through statistical tests detecting conditional independencies:
Start from the complete undirected graph
Delete edges through recursive conditional independence tests
The output is a PDAG whose edges should then be transformed into arcs

It is a constraint-based structure learning algorithm

SLIDE 51

Testing conditional independencies

PC algorithm (Spirtes et al. 1993)

Form the complete undirected graph S
t = -1
repeat
    t = t + 1
    repeat
        select an ordered pair of adjacent nodes A, B in S
        select a neighborhood C of A of size t (if possible)
        delete edge A - B in S if A and B are cond. independent given C
    until all ordered pairs have been tested
until all neighborhoods are of size smaller than t
Transform edges into arcs by applying a couple of simple rules

Figure: Pseudocode of the PC algorithm (Spirtes et al., 1993)
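A sketch of the skeleton phase in Python; cond_ind stands for any conditional independence oracle (for instance, the chi-square test on $2N \cdot MI$ above), and the final edge-orientation rules are omitted:

from itertools import combinations

def pc_skeleton(variables, cond_ind, max_order=None):
    """Skeleton phase of the PC algorithm (orientation rules not included).

    cond_ind(a, b, cond) should return True when a and b are judged
    conditionally independent given the tuple of variables cond.
    """
    adj = {v: set(variables) - {v} for v in variables}  # complete graph
    t = 0
    while any(len(adj[a] - {b}) >= t for a in variables for b in adj[a]):
        if max_order is not None and t > max_order:
            break
        for a in variables:
            for b in sorted(adj[a]):
                if b not in adj[a]:        # edge already deleted this sweep
                    continue
                for cond in combinations(sorted(adj[a] - {b}), t):
                    if cond_ind(a, b, cond):
                        adj[a].discard(b)  # delete edge a - b
                        adj[b].discard(a)
                        break
        t += 1
    return adj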

SLIDE 52

Testing conditional independencies

PC algorithm (Spirtes et al. 1993). Example with t = 2

Figure: Example of the PC algorithm with t = 2

SLIDE 53

Score+search approaches

Introduction

They try to find the structure that best fits the data. They are characterized by:

A score (metric or evaluation function) to measure the fitness of each candidate structure:
Penalized log-likelihood
Bayesian metrics

A space of structures where the search is carried out:
Directed acyclic graphs
Equivalence classes
Orderings of the variables

A search method (heuristic) to explore the space of possible solutions in an intelligent manner:
Local search
Heuristics

SLIDE 54

Score+search approaches

Score metrics. Penalized log-likelihood

Likelihood of the data: $L(D : S, \theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$

Log-likelihood of the data: $\log P(D : S, \theta) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \theta_{ijk}$

$N_{ijk}$ denotes the number of cases in $D$ where variable $X_i$ is equal to $x_i^k$ and $Pa_i$ is in its $j$-th value

Maximum likelihood estimate: $\hat{\theta}_{ijk} = \frac{N_{ijk}}{N_{ij}}$ with $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$

$\log P(D : S, \hat{\theta}) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}}$

SLIDE 55

Score+search approaches

Score metrics. Penalized log-likelihood

Figure: The likelihood of the data increases monotonically with the complexity of the model

SLIDE 56

Score+search approaches

Score metrics. Penalized log-likelihood

Overfitting is avoided by penalizing the complexity of the Bayesian network in the log-likelihood:

$\sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} - \dim(S) \, pen(N)$

$\dim(S) = \sum_{i=1}^{n} q_i (r_i - 1)$ is the model dimension; $pen(N)$ is a non-negative penalization function

$pen(N) = 1$: Akaike's information criterion (AIC) (Akaike, 1974)
$pen(N) = \frac{1}{2} \log N$: Bayesian information criterion (BIC) (Schwarz, 1978), equivalent to the minimum description length (MDL) criterion (Lam and Bacchus, 1994)
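A sketch of this penalized log-likelihood score for a candidate structure, under the same pandas-DataFrame assumption as before (here $q_i$ counts only the parent configurations actually observed in D, a simplification):

import numpy as np

def penalized_loglik(df, parent_sets, pen):
    """sum_ijk N_ijk log(N_ijk / N_ij) - dim(S) * pen(N).

    parent_sets maps each variable to the list of its parents; pen is a
    function of N, e.g. lambda N: 0.5 * np.log(N) for BIC, lambda N: 1.0 for AIC.
    """
    n_cases = len(df)
    loglik, dim = 0.0, 0
    for child, parents in parent_sets.items():
        r_i = df[child].nunique()
        if parents:
            n_ijk = df.groupby(list(parents))[child].value_counts()
            n_ij = n_ijk.groupby(level=list(parents)).transform("sum")
            q_i = df.groupby(list(parents)).ngroups
        else:
            n_ijk = df[child].value_counts()
            n_ij = float(n_cases)
            q_i = 1
        loglik += float((n_ijk * np.log(n_ijk / n_ij)).sum())
        dim += q_i * (r_i - 1)
    return loglik - dim * pen(n_cases)

# BIC score of the example structure X1 -> X3 <- X2, X3 -> X4:
# penalized_loglik(df, {"X1": [], "X2": [], "X3": ["X1", "X2"], "X4": ["X3"]},
#                  pen=lambda N: 0.5 * np.log(N))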

SLIDE 57

Score+search approaches

Score metrics. Bayesian model selection

Try to obtain the structure with maximum a posteriori probability given the data, that is, $\arg\max_S P(S \mid D)$

Using the Bayes formula: $P(S \mid D) = \frac{P(D \mid S) P(S)}{P(D)} \propto P(D \mid S) P(S)$

$P(D \mid S)$ is the marginal likelihood of the data; $P(S)$ denotes the prior distribution over structures

If $P(S)$ is uniform ($\max P(S \mid D) \equiv \max P(D \mid S)$) we try to obtain the structure with maximum marginal likelihood

SLIDE 58

Score+search approaches

Score metrics. Bayesian model selection. K2 metric

Accounts for uncertainty also in the parameters:

$P(D \mid S) = \int P(D \mid S, \theta) \, p(\theta \mid S) \, d\theta$

$P(D \mid S)$: marginal likelihood of the data given the structure
$P(D \mid S, \theta)$: likelihood of the data given the Bayesian network (structure + parameters)
$p(\theta \mid S)$: prior distribution over the parameters

SLIDE 59

Score+search approaches

Score metrics. Bayesian model selection. K2 metric

Assuming that $p(\theta \mid S)$ is uniform, a closed formula for $P(D \mid S)$ can be obtained (Cooper and Herskovits, 1992):

$P(D \mid S) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!$

$n$: number of variables
$r_i$: number of states $X_i$ can take
$q_i$: number of possible state combinations of $Pa_i$
$N_{ijk}$: number of cases in $D$ where $X_i$ takes its $k$-th value and the parent set of $X_i$ is in its $j$-th combination of values
$N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$

SLIDE 60

Score+search approaches

Score metrics. Bayesian model selection. K2 algorithm

An ordering of the nodes is assumed, and an upper bound is set on the number of parents of any node
For every node $X_i$, K2 searches for the set of parent nodes that maximizes:

$g(X_i, Pa_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!$

K2 assumes initially that a node does not have parents
At each step K2 incrementally adds the parent whose addition gives the best value of $g(X_i, Pa_i)$
K2 stops when adding a single parent to any node cannot increase $g(X_i, Pa_i)$
K2 is a greedy algorithm (a Python sketch follows below)
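A sketch of K2 under the same pandas-DataFrame assumption; g is computed in log space with scipy.special.gammaln, since the factorials overflow quickly (using $(r_i - 1)! = \Gamma(r_i)$ and $N! = \Gamma(N + 1)$):

import numpy as np
from scipy.special import gammaln

def log_g(df, child, parents):
    """log of g(X_i, Pa_i) from Cooper and Herskovits (1992)."""
    r_i = df[child].nunique()
    if parents:
        n_ijk = df.groupby(list(parents))[child].value_counts()
        n_ij = n_ijk.groupby(level=list(parents)).sum()
    else:
        n_ijk = df[child].value_counts()
        n_ij = n_ijk.sum()
    return float(np.sum(gammaln(r_i) - gammaln(np.asarray(n_ij) + r_i))
                 + np.sum(gammaln(np.asarray(n_ijk) + 1)))

def k2(df, order, max_parents):
    """Greedy K2: for each node, add the predecessor that most improves g."""
    parent_sets = {}
    for pos, child in enumerate(order):
        parents = []
        best = log_g(df, child, parents)
        candidates = list(order[:pos])   # only nodes earlier in the ordering
        while candidates and len(parents) < max_parents:
            scores = [(log_g(df, child, parents + [c]), c) for c in candidates]
            top, c = max(scores)
            if top <= best:
                break
            best, parents = top, parents + [c]
            candidates.remove(c)
        parent_sets[child] = parents
    return parent_sets

# parent_sets = k2(df, order=["X1", "X2", "X3", "X4"], max_parents=2)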

SLIDE 61

Score+search approaches

Score metrics. Bayesian model selection. BDe metric

Assuming that $p(\theta \mid S)$ follows a Dirichlet distribution, a closed formula for $P(D \mid S)$ can be obtained (Heckerman et al., 1995):

$P(D \mid S) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$

$\alpha_{ijk}$ denotes the parameters of the Dirichlet distribution, and $\alpha_{ij} = \sum_{k=1}^{r_i} \alpha_{ijk}$

This score is called the Bayesian Dirichlet equivalent metric because it verifies the score-equivalence property (two Markov equivalent graphs receive the same score)
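The corresponding log BDe term for a single family (i, j), again in log space with gammaln; n_ijk and alpha_ijk are arrays of length $r_i$:

import numpy as np
from scipy.special import gammaln

def log_bde_family(n_ijk, alpha_ijk):
    """log of Gamma(a_ij)/Gamma(a_ij + N_ij) * prod_k Gamma(a_ijk + N_ijk)/Gamma(a_ijk)."""
    n_ijk, alpha_ijk = np.asarray(n_ijk, float), np.asarray(alpha_ijk, float)
    a_ij, n_ij = alpha_ijk.sum(), n_ijk.sum()
    return (gammaln(a_ij) - gammaln(a_ij + n_ij)
            + np.sum(gammaln(alpha_ijk + n_ijk) - gammaln(alpha_ijk)))

# the full log BDe score is the sum of log_bde_family over all i and j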

SLIDE 62

Score+search approaches

Score metrics. Bayesian approaches

Bayesian model selection: search for the model maximizing the posterior probability of the structure given the data
Bayesian model averaging (full Bayesian approach): average over all possible structures and all possible assignments of values to the parameters
Selective model averaging: average over some selected structures, integrating over the parameters

Bayesian metrics are consistent: as the size of the data base grows, the weight of the prior distribution is reduced

SLIDE 63

Score+search approaches

Different spaces for search

Space of directed acyclic graphs, whose number grows according to the recurrence (a small sketch for computing it follows below):

$d(n) = \sum_{i=1}^{n} (-1)^{i+1} \binom{n}{i} 2^{i(n-i)} d(n-i); \quad d(0) = 1, \ d(1) = 1$

Space of equivalence classes (each class reflects the same set of conditional independencies)
Scores: score equivalent (Chickering, 1996)

Space of orderings of the variables (Larrañaga et al., 1996, Friedman and Koller, 2002): cardinality of the search space $n!$
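A sketch of the recurrence (usually attributed to Robinson), showing how fast the DAG space grows:

from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of DAGs with n labeled nodes: d(n) = sum_i (-1)^(i+1) C(n,i) 2^(i(n-i)) d(n-i)."""
    if n <= 1:
        return 1
    return sum((-1) ** (i + 1) * comb(n, i) * 2 ** (i * (n - i)) * num_dags(n - i)
               for i in range(1, n + 1))

print([num_dags(n) for n in range(1, 6)])   # [1, 3, 25, 543, 29281]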

SLIDE 64

Score+search approaches

Search algorithms. Local search. B algorithm (Buntine, 1991)

Local operators: insert, delete and invert an arc
Efficient search thanks to the decomposability of the most usual metrics (AIC, BIC, K2, BDe, ...)
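A generic greedy hill-climbing sketch with insert/delete operators (not Buntine's B algorithm itself; arc reversal is omitted); local_score(child, parents) can be any decomposable family score, such as the penalized log-likelihood or log_g above:

from itertools import permutations

def hill_climb(variables, local_score):
    """Greedy local search over DAGs; only the changed family is re-scored."""
    parents = {v: set() for v in variables}

    def creates_cycle(child, parent):
        # adding parent -> child closes a cycle iff child is an ancestor of parent
        stack, seen = [parent], set()
        while stack:
            v = stack.pop()
            if v == child:
                return True
            if v not in seen:
                seen.add(v)
                stack.extend(parents[v])
        return False

    improved = True
    while improved:
        improved = False
        for child, cand in permutations(variables, 2):
            pa = parents[child]
            new_pa = pa - {cand} if cand in pa else pa | {cand}
            if cand not in pa and creates_cycle(child, cand):
                continue
            if local_score(child, sorted(new_pa)) > local_score(child, sorted(pa)):
                parents[child] = new_pa
                improved = True
    return parents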

SLIDE 65

Score+search approaches

Search algorithms. Genetic algorithms (Larrañaga et al., 1996)

Each individual of the genetic algorithm represents a DAG structure (binary representation)
One-point crossover and bit mutation do not guarantee that the offspring satisfy the DAG condition
Repair operators are used to restore acyclicity

SLIDE 66

Hybrid methods

Combining detection of conditional independencies and score+search

Obtain different solutions by applying the PC algorithm, then run a local search (with a score to evaluate the goodness of the structure) starting from each of the solutions provided by PC
Search for the structure that best represents the conditional independencies found in the data set: the number of conditional independencies represented in the structure that have been tested in the data set is the score to maximize

SLIDE 67

Outline

1 Introduction
2 Learning Parameters
3 Learning Structures
4 Summary

SLIDE 68

Learning Bayesian networks

Structure + parameters

Learning parameters:
Maximum likelihood estimation
Bayesian estimation (Dirichlet distribution)
Incomplete data (EM algorithm)

Learning structures:
Detecting conditional independencies (PC algorithm)
Score + search: penalized log-likelihood (AIC, BIC, MDL), Bayesian metrics (K2, BDe); local search, genetic algorithms
Hybrid methods

SLIDE 69

LEARNING FROM DATA: DETECTING CONDITIONAL INDEPENDENCIES AND SCORE+SEARCH METHODS

Pedro Larrañaga

Computational Intelligence Group, Artificial Intelligence Department, Universidad Politécnica de Madrid

Bayesian Networks: From Theory to Practice
International Black Sea University Autumn School on Machine Learning
3-11 October 2019, Tbilisi, Georgia