SLIDE 1

Scoring Bayesian Networks of Mixed Variables

Bryan Andrews, MS; Joseph Ramsey, PhD; and Greg Cooper, MD, PhD
August 14, 2017


SLIDE 3

Learning Bayesian Networks (BNs)

  • BNs constitute a widely used graphical framework for representing probabilistic relationships
  • Many applications in Bayesian inference and causal discovery
  • Learning structure is crucial
    – Limited work has been done in the presence of both discrete and continuous variables

Goal: Provide scalable solutions for learning BNs in the presence of both discrete and continuous variables

SLIDE 4

Outline

  • Bayesian Information Criterion (BIC)
  • Mixed Variable Polynomial (MVP) score
  • Conditional Gaussian (CG) score
  • Adaptations
  • Simulations and empirical results

SLIDE 8

−2·log p(M∣D) ≈ −2·lik + dof·log n

The Bayesian Information Criterion

Let M be a model and D be a dataset. BIC is an approximation for −2·log p(M∣D), where lik is the log likelihood, dof is the degrees of freedom, and n is the number of samples. A BN is scored as the sum of the BIC terms for each node given its parents.
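As a concrete illustration of the per-node BIC term above, here is a minimal sketch (not the authors' code) for a continuous node with continuous parents under a linear Gaussian model; the function name and the linear model are my own choices:

```python
import numpy as np

def bic_linear_gaussian(y, X):
    """BIC-style score (-2*lik + dof*log n) for a continuous node y
    given a matrix of continuous parent values X (one column per parent)."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])           # intercept + parents
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # least squares = Gaussian MLE
    resid = y - Z @ beta
    sigma2 = resid @ resid / n                     # MLE of the noise variance
    lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # maximized log likelihood
    dof = Z.shape[1] + 1                           # coefficients + variance
    return -2 * lik + dof * np.log(n)
```

Lower scores are better: adding an informative parent raises the likelihood faster than the dof·log n penalty grows.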

SLIDE 9

Outline

  • Bayesian Information Criterion (BIC)
  • Mixed Variable Polynomial (MVP) score
  • Conditional Gaussian (CG) score
  • Adaptations
  • Simulations and empirical results

SLIDE 11

The Mixed Variable Polynomial (MVP) score

  • Uses higher-order polynomials to estimate relationships between variables
    – Allows for nonlinear relationships between continuous variables
    – Allows for complicated PMFs for discrete variables
    – Approximates logistic regression
  • Calculates a log likelihood and degrees of freedom for BIC
SLIDE 15

Modeling a Continuous Child

  • Partition according to the discrete parents
    – Splits the data into subsets
  • Perform a regression on the continuous parents within each partition
    – Calculate a log likelihood and degrees of freedom for each subset
  • Aggregate the log likelihood and degrees of freedom terms from the subsets
  • Score the continuous child using BIC
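The steps above can be sketched as follows, assuming one continuous parent Y and one discrete parent A; the fixed polynomial degree and the function name are illustrative (the published MVP score chooses degrees differently):

```python
import numpy as np

def mvp_score_continuous(x, y, a, degree=2):
    """Sketch of the MVP idea for a continuous child x with continuous
    parent y and discrete parent a: partition by a, fit a polynomial
    regression of x on y in each partition, sum lik/dof, return BIC."""
    n = len(x)
    total_lik, total_dof = 0.0, 0
    for cat in np.unique(a):
        idx = a == cat
        xs, ys = x[idx], y[idx]
        Z = np.vander(ys, degree + 1)              # columns: y^degree, ..., y, 1
        beta, *_ = np.linalg.lstsq(Z, xs, rcond=None)
        resid = xs - Z @ beta
        sigma2 = resid @ resid / len(xs)           # per-partition noise variance
        total_lik += -0.5 * len(xs) * (np.log(2 * np.pi * sigma2) + 1)
        total_dof += degree + 2                    # coefficients + variance
    return -2 * total_lik + total_dof * np.log(n)
```

On data with a quadratic parent-child relationship, the degree-2 fit scores better than the degree-1 fit despite the larger dof penalty.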

SLIDE 16

Modeling a Continuous Child

  • Let X, Y be continuous
  • Let A be discrete (|A| = 3)
  • Want: likX | Y, A and dofX | Y, A

slide-21
SLIDE 21

21

dof2 lik2 dof3 lik3 dof1 lik1 likX | Y, A = lik1 + lik2 + lik3 dofX | Y, A = dof1 + dof2 + dof3

  • 2likX | Y, A + dofX | Y, A log n
SLIDE 27

Modeling a Discrete Child

  • Binarize the child A into d (0, 1) indicator variables, where d = |A|
  • Partition according to the discrete parents
    – Splits the data into subsets
  • Perform a regression on the continuous parents within each partition
    – Treat the regression lines as components of PMFs for A
  • Calculate a log likelihood and degrees of freedom for each subset
  • Aggregate the log likelihood and degrees of freedom terms from the subsets
  • Score the discrete child using BIC
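A rough sketch of the steps above for a single continuous parent; the clip-and-renormalize step is a crude stand-in for the shrinking procedure discussed on a later slide, and all names are illustrative:

```python
import numpy as np

def mvp_score_discrete(a, x, degree=1):
    """Sketch of the MVP idea for a discrete child a with one continuous
    parent x: regress each 0/1 indicator of a on polynomials of x, force
    the fitted values into a valid PMF, then compute a log likelihood."""
    n = len(a)
    cats = np.unique(a)
    Z = np.vander(x, degree + 1)                  # polynomial design matrix
    probs = np.empty((n, len(cats)))
    for j, cat in enumerate(cats):
        ind = (a == cat).astype(float)            # binarized child
        beta, *_ = np.linalg.lstsq(Z, ind, rcond=None)
        probs[:, j] = Z @ beta                    # regression line as PMF component
    probs = np.clip(probs, 1e-6, None)            # crude stand-in for "shrinking"
    probs /= probs.sum(axis=1, keepdims=True)     # enforce sum-to-one
    idx_of = {c: j for j, c in enumerate(cats)}
    lik = float(sum(np.log(probs[i, idx_of[ai]]) for i, ai in enumerate(a)))
    dof = len(cats) * (degree + 1)
    return -2 * lik + dof * np.log(n)
```

A child that genuinely depends on x scores better (lower) than the same child scored against a shuffled, uninformative parent.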

SLIDE 28

Modeling a Discrete Child

  • Let X be continuous
  • Let A be discrete (|A| = 3)
  • Want: likA | X and dofA | X

SLIDE 33

For a ∈ {0, 1, 2}, the fitted PMFs must satisfy

∑a p(A=a∣X=x) = 1 ∀x  – true for the proposed method

p(A=a∣X=x) ≥ 0 ∀a, x  – true in the sample limit, given some assumptions

Define a procedure to shrink illegal distributions back into the domain of probabilities.



SLIDE 37

  • Aggregate over partitions to obtain likA | X and dofA | X
  • Score: −2·likA | X + dofA | X · log n
SLIDE 38

Outline

  • Bayesian Information Criterion (BIC)
  • Mixed Variable Polynomial (MVP) score
  • Conditional Gaussian (CG) score
  • Adaptations
  • Simulations and empirical results

SLIDE 40

The Conditional Gaussian (CG) score

  • Move all the continuous variables to the left and all the discrete variables to the right of the conditioning bar
    – Calculate the desired probability using partitioned Gaussian and multinomial distributions
  • Calculate a log likelihood and degrees of freedom for BIC

SLIDE 44

Modeling a Continuous Child

Let X, Y be continuous and A be discrete; assume Y and A are the parents of X. Then

p(X∣Y, A) = p(X, Y, A) / p(Y, A) = [p(X, Y∣A) · p(A)] / [p(Y∣A) · p(A)] = p(X, Y∣A) / p(Y∣A)

where the numerator and denominator are partitioned Gaussians.

SLIDE 45

Modeling a Continuous Child

  • Want: likX, Y | A, dofX, Y | A and likY | A, dofY | A for p(X, Y∣A) / p(Y∣A)


SLIDE 49

  • Per-partition terms: (lik1, dof1), (lik2, dof2), (lik3, dof3)
  • likX, Y | A = lik1 + lik2 + lik3
  • dofX, Y | A = dof1 + dof2 + dof3


SLIDE 53

  • Per-partition terms: (lik1, dof1), (lik2, dof2), (lik3, dof3)
  • likY | A = lik1 + lik2 + lik3
  • dofY | A = dof1 + dof2 + dof3


SLIDE 56

Modeling a Continuous Child

Have: likX, Y | A, dofX, Y | A and likY | A, dofY | A for p(X, Y∣A) / p(Y∣A)

  • likX | Y, A = likX, Y | A − likY | A
  • dofX | Y, A = dofX, Y | A − dofY | A
  • Score: −2·likX | Y, A + dofX | Y, A · log n
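The subtraction identity above can be sketched directly, assuming one continuous parent Y and one discrete parent A; maximum-likelihood Gaussian fits are used per partition, and the function names and dof bookkeeping are illustrative:

```python
import numpy as np

def gaussian_lik(data):
    """Maximized log likelihood of a multivariate Gaussian fit to data (n x d)."""
    n, d = data.shape
    cov = np.cov(data, rowvar=False, bias=True).reshape(d, d)  # MLE covariance
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def cg_score_continuous(x, y, a):
    """Sketch of the CG score for continuous child x with continuous parent y
    and discrete parent a: lik_{X|Y,A} = lik_{X,Y|A} - lik_{Y|A}, each term a
    partitioned Gaussian likelihood, then BIC = -2*lik + dof*log n."""
    n = len(x)
    lik_joint = lik_marg = 0.0
    dof_joint = dof_marg = 0
    for cat in np.unique(a):
        idx = a == cat
        lik_joint += gaussian_lik(np.column_stack([x[idx], y[idx]]))  # p(X,Y|A=cat)
        lik_marg += gaussian_lik(y[idx].reshape(-1, 1))               # p(Y|A=cat)
        dof_joint += 5          # 2 means + 3 covariance parameters
        dof_marg += 2           # 1 mean + 1 variance
    lik = lik_joint - lik_marg
    dof = dof_joint - dof_marg
    return -2 * lik + dof * np.log(n)
```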


SLIDE 60

Modeling a Discrete Child

Let X, Y be continuous and A be discrete; assume X and Y are the parents of A. Then

p(A∣X, Y) = p(X, Y, A) / p(X, Y) = [p(X, Y∣A) · p(A)] / p(X, Y)

where the Gaussian terms are partitioned Gaussians and p(A) is multinomial.

SLIDE 61

Modeling a Discrete Child

  • Want: likX, Y | A, dofX, Y | A; likA, dofA; and likX, Y, dofX, Y for p(X, Y∣A) · p(A) / p(X, Y)


SLIDE 65

  • Per-partition terms: (lik1, dof1), (lik2, dof2), (lik3, dof3)
  • likX, Y | A = lik1 + lik2 + lik3
  • dofX, Y | A = dof1 + dof2 + dof3


SLIDE 67

  • Compute likA, dofA from the multinomial over A
  • Compute likX, Y, dofX, Y from a Gaussian over X, Y


SLIDE 70

Modeling a Discrete Child

Have: likX, Y | A, dofX, Y | A; likA, dofA; and likX, Y, dofX, Y for p(X, Y∣A) · p(A) / p(X, Y)

  • likA | X, Y = likX, Y | A + likA − likX, Y
  • dofA | X, Y = dofX, Y | A + dofA − dofX, Y
  • Score: −2·likA | X, Y + dofA | X, Y · log n
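The three-term identity above can likewise be sketched, with the continuous parents passed as columns of a matrix; the dof bookkeeping and names are illustrative, not the authors' implementation:

```python
import numpy as np

def gaussian_lik(data):
    """Maximized log likelihood of a multivariate Gaussian fit to data (n x d)."""
    n, d = data.shape
    cov = np.cov(data, rowvar=False, bias=True).reshape(d, d)
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def cg_score_discrete(a, xy):
    """Sketch of the CG score for discrete child a with continuous parents
    xy: lik_{A|X,Y} = lik_{X,Y|A} + lik_A - lik_{X,Y}, then BIC."""
    n, d = xy.shape
    cats, counts = np.unique(a, return_counts=True)
    lik_a = float((counts * np.log(counts / n)).sum())      # multinomial term
    lik_cond = sum(gaussian_lik(xy[a == c]) for c in cats)  # partitioned Gaussians
    lik_joint = gaussian_lik(xy)                            # Gaussian over X, Y
    lik = lik_cond + lik_a - lik_joint
    g = d + d * (d + 1) // 2                 # parameters of one d-dim Gaussian
    dof = len(cats) * g + (len(cats) - 1) - g
    return -2 * lik + dof * np.log(n)
```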

SLIDE 71

Outline

  • Bayesian Information Criterion (BIC)
  • Mixed Variable Polynomial (MVP) score
  • Conditional Gaussian (CG) score
  • Adaptations
  • Simulations and empirical results
SLIDE 72

Adaptations

  • Binomial structure prior
    – Treat the addition of each parent as an independent random trial
    – Model the prior probability of each parent-child model using a binomial distribution
  • Discretization heuristic
    – Discretize continuous parents of discrete children in order to use multinomial scoring
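The binomial structure prior can be sketched in one line: if each of the max_parents candidate parents is included independently with probability p, a parent set of size k has the log prior below. The inclusion probability 0.1 is an illustrative choice, not the paper's setting:

```python
import numpy as np

def log_binomial_structure_prior(k, max_parents, p=0.1):
    """Log prior of a parent set of size k out of max_parents candidates,
    treating each inclusion as an independent Bernoulli(p) trial."""
    return k * np.log(p) + (max_parents - k) * np.log(1 - p)
```

With p < 0.5 this penalizes larger parent sets, which is added to the BIC-style score during search.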
SLIDE 73

Outline

  • Bayesian Information Criterion (BIC)
  • Mixed Variable Polynomial (MVP) score
  • Conditional Gaussian (CG) score
  • Adaptations
  • Simulations and empirical results

SLIDE 79

Conditional Gaussian Simulation

  • Randomly generate a set of variables and edges
  • Specify a causal ordering over the variables
  • In causal order, simulate one variable at a time
    – Use multinomial relationships with discretized continuous parents for discrete children
    – Use partitioned linear Gaussian relationships for continuous children

Note: In all simulations, variables are split 50-50 between discrete and continuous, and each discrete variable has a random number of categories between 2 and 5.
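A toy instance of this simulation scheme for a single continuous child (shapes and coefficients are illustrative): a discrete root A, a continuous root Y, and a continuous child X that is linear Gaussian with a separate intercept and slope per category of A, i.e. a partitioned linear Gaussian.

```python
import numpy as np

def simulate_cg(n, rng):
    """Simulate A (discrete, 3 categories), Y (continuous root), and
    X = intercept[A] + slope[A] * Y + Gaussian noise."""
    a = rng.integers(0, 3, size=n)           # discrete root
    y = rng.normal(size=n)                   # continuous root
    intercepts = np.array([-1.0, 0.0, 2.0])  # one regime per category of A
    slopes = np.array([0.5, 1.0, -1.5])
    x = intercepts[a] + slopes[a] * y + rng.normal(size=n)
    return a, y, x
```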

SLIDE 80

Non-linear Simulation

  • Randomly generate a set of variables and edges
  • Specify a causal ordering over the variables
  • In causal order, simulate one variable at a time
    – Use multinomial relationships with discretized continuous parents for discrete children
    – Use partitioned polynomial regression with Gaussian noise for continuous children

Note: In all simulations, variables are split 50-50 between discrete and continuous, and each discrete variable has a random number of categories between 2 and 5.

SLIDE 81

Algorithms

  • CG – Conditional Gaussian
  • CGd – Conditional Gaussian w/ discretization heuristic
  • MVP 1 – Mixed Variable Polynomial w/ linear basis
  • MVP log n – Mixed Variable Polynomial w/ polynomial basis
  • LR 1 – Logistic Regression w/ linear basis
  • LR log n – Logistic Regression w/ polynomial basis

SLIDE 82

Statistics

  • AP – Adjacency Precision: correctly predicted adjacencies / predicted adjacencies
  • AR – Adjacency Recall: correctly predicted adjacencies / true adjacencies
  • AHP – Arrowhead Precision: correctly predicted arrowheads / predicted arrowheads
  • AHR – Arrowhead Recall: correctly predicted arrowheads / true arrowheads
  • T (s) – Computation time (in seconds)

All statistics are averaged over 10 runs on networks with 1000 instances. fGES was used as the search algorithm (Ramsey 2017; Chickering 2002).
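The AP/AR definitions above amount to set operations on undirected skeletons; a small sketch (arrowhead statistics would compare directed endpoints analogously):

```python
def adjacency_stats(true_edges, predicted_edges):
    """Adjacency precision/recall: edges are compared ignoring orientation."""
    true_adj = {frozenset(e) for e in true_edges}
    pred_adj = {frozenset(e) for e in predicted_edges}
    correct = len(true_adj & pred_adj)
    ap = correct / len(pred_adj) if pred_adj else 1.0
    ar = correct / len(true_adj) if true_adj else 1.0
    return ap, ar
```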

SLIDE 83

MVP vs LR

Avg Deg 4 | 100 Measured | Linear Simulation


SLIDE 85

MVP vs LR

Avg Deg 4 | 100 Measured | Non-Linear Simulation


SLIDE 87

MVP vs CG

Avg Deg 4 | 100 Measured | Linear Simulation

SLIDE 88

MVP vs CG

Avg Deg 4 | 100 Measured | Non-Linear Simulation

SLIDE 89

Scalability

Avg Deg 2 | 500 Measured | Linear Simulation



SLIDE 92

Conclusions

  • We present two novel scoring methods for learning BNs in the presence of both continuous and discrete variables
    – Mixed Variable Polynomial (MVP): similar performance to LR but 10-20 times faster; allows for a more general class of relationships
    – Conditional Gaussian (CG): quick and effective
  • Both scores perform well on simulated data (linear and non-linear) and scale to networks of at least 500 variables

SLIDE 93

Thank You

All presented methods are available in Tetrad:
https://github.com/cmu-phil/tetrad