Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes (PowerPoint presentation)

SLIDE 1

Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes

François Petitjean, Wray Buntine, Geoff Webb and Nayyar Zaidi, Monash University, 2018-09-13

1 / 35

SLIDE 2

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

2 / 35

SLIDE 3

A Cultural Divide

Context: discussing how to teach Data Science with a well-known professor of Statistics.

She said: “when first teaching overfitting, I always give some examples where machine learning has trouble, like decision trees”

I said: “funny, I do the reverse, I always give examples where statistical models have trouble”

2 / 35

SLIDE 4

A Cultural Divide

Context: discussing how to teach Data Science with a well-known professor of Statistics.

She said: “when first teaching overfitting, I always give some examples where machine learning has trouble, like decision trees”

I said: “funny, I do the reverse, I always give examples where statistical models have trouble”

ASIDE: our hierarchical smoothing also gives state-of-the-art results for decision tree smoothing

2 / 35

SLIDE 5

State of the Art in Classification

Favoured techniques for standard classification are Random Forest and Gradient Boosting (of trees).

3 / 35

SLIDE 6

State of the Art in Classification

Favoured techniques for standard classification are Random Forest and Gradient Boosting (of trees).

  • NB: for sequences, images or graphs, deep neural networks (recurrent NN, convolutional NN, etc.) are better

3 / 35

SLIDE 7

Main Claim

Main Claim: Hierarchical smoothing applied to Bayesian network classifiers on categorical data beats Random Forest

¹ not well shown in the paper ...

4 / 35

SLIDE 8

Main Claim

Main Claim: Hierarchical smoothing applied to Bayesian network classifiers on categorical data beats Random Forest

◮ a single model beats a state-of-the-art ensemble

◮ is also comparable with XGBoost¹

◮ but only on categorical data

◮ though also for a lot of other data too¹

¹ not well shown in the paper ...

4 / 35

SLIDE 9

Unpacking the Main Claim

◮ Hierarchical smoothing

◮ using hierarchical Dirichlet models

◮ applied to Bayesian network classifiers

◮ the KDB and SKDB family

◮ on categorical datasets

◮ or pre-discretised attributes

◮ beats Random Forest

5 / 35

SLIDE 10

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

6 / 35

SLIDE 11

Reminder: Main Claim

◮ Hierarchical smoothing
◮ applied to Bayesian network classifiers

◮ the KDB and SKDB family

◮ on categorical datasets
◮ beats Random Forest

6 / 35

SLIDE 12

Learning Bayesian Networks

tutorial by Cussens, Malone and Yuan, IJCAI 2013

Bayesian Network learning = Structure learning + Conditional Probability Table estimation

7 / 35

SLIDE 13

Bayesian Network Classifiers

Friedman, Geiger, Goldszmidt, Machine Learning 1997

◮ Defined by parent relation π and Conditional Probability Tables (CPTs)

◮ π encodes conditional independence / structure
◮ πi is the set of parent variables for Xi
◮ CPTs encode conditional probabilities

◮ For classification, make class variable Y a parent of all Xi

8 / 35

SLIDE 14

Bayesian Network Classifiers

Friedman, Geiger, Goldszmidt, Machine Learning 1997

◮ Defined by parent relation π and Conditional Probability Tables (CPTs)

◮ π encodes conditional independence / structure
◮ πi is the set of parent variables for Xi
◮ CPTs encode conditional probabilities

◮ For classification, make class variable Y a parent of all Xi
◮ Classifies using P(y | x) ∝ P(y | πY) ∏i P(xi | πi)

8 / 35

SLIDE 15

Bayesian Network Classifiers

Friedman, Geiger, Goldszmidt, Machine Learning 1997

◮ Defined by parent relation π and Conditional Probability Tables (CPTs)

◮ π encodes conditional independence / structure
◮ πi is the set of parent variables for Xi
◮ CPTs encode conditional probabilities

◮ For classification, make class variable Y a parent of all Xi
◮ Classifies using P(y | x) ∝ P(y | πY) ∏i P(xi | πi)

Naïve Bayes classifier: πi = {Y}

[Figure: naïve Bayes structure, Y the sole parent of X1, X2, X3, X4, with attributes ordered by decreasing mutual information with Y]

8 / 35
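Not on the slides, but a minimal Java sketch of how this classification rule is evaluated once the CPTs are estimated, here for the naïve Bayes case πi = {Y}. The class and method names (BncScorer, classProbabilities, the logCpt layout) are hypothetical, not the released code.

import java.util.Arrays;

// Sketch only: evaluating P(y | x) ∝ P(y) ∏i P(xi | πi) in log space,
// for the naïve Bayes case πi = {Y}. Names and CPT layout are hypothetical.
public final class BncScorer {

    // logPriorY[y]    = log P(y)
    // logCpt[i][y][v] = log P(Xi = v | Y = y)   (naïve Bayes CPT layout)
    public static double[] classProbabilities(int[] x, double[] logPriorY,
                                              double[][][] logCpt) {
        int numClasses = logPriorY.length;
        double[] logPost = new double[numClasses];
        for (int y = 0; y < numClasses; y++) {
            double s = logPriorY[y];
            for (int i = 0; i < x.length; i++) {
                s += logCpt[i][y][x[i]];
            }
            logPost[y] = s;
        }
        // normalise in log space to recover P(y | x)
        double max = Arrays.stream(logPost).max().orElse(0.0);
        double z = 0.0;
        for (double v : logPost) z += Math.exp(v - max);
        double[] p = new double[numClasses];
        for (int y = 0; y < numClasses; y++) p[y] = Math.exp(logPost[y] - max) / z;
        return p;
    }
}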

SLIDE 16

k-Dependence Bayes (KDB)

Sahami, KDD 1996

KDB-1 classifier:

(attributes have 1 extra parent)

[Figure: KDB-1 structure over Y and X1...X4, attributes ordered by decreasing mutual information with Y]

KDB-2 classifier:

(attributes have 2 extra parents)

[Figure: KDB-2 structure over Y and X1...X4]

  • NB: other parents also selected by mutual information

9 / 35

SLIDE 17

Learning k-Dependence Bayes (KDB)

◮ Two pass learning ◮ 1st pass, learn structure π:

◮ Uses variable ordering heuristics based on mutual information, so efficient and scalable.

10 / 35
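As a reminder of what that first pass computes, here is a small self-contained sketch of mutual information from a joint count table, the quantity used to rank attributes against the class. It is illustrative only; the MutualInfo class and its indexing are hypothetical, not the authors' implementation.

// Sketch only: I(X;Y) from a joint count table n[x][y],
// the quantity used to order attributes in KDB's first pass.
public final class MutualInfo {
    public static double mi(int[][] n) {
        double total = 0;
        double[] nx = new double[n.length];
        double[] ny = new double[n[0].length];
        for (int x = 0; x < n.length; x++)
            for (int y = 0; y < n[0].length; y++) {
                total += n[x][y];
                nx[x] += n[x][y];
                ny[y] += n[x][y];
            }
        double mi = 0.0;
        for (int x = 0; x < n.length; x++)
            for (int y = 0; y < n[0].length; y++) {
                if (n[x][y] == 0) continue;
                double pxy = n[x][y] / total;
                mi += pxy * Math.log((n[x][y] * total) / (nx[x] * ny[y]));
            }
        return mi;   // in nats; divide by Math.log(2) for bits
    }
}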

SLIDE 18

Learning k-Dependence Bayes (KDB)

◮ Two pass learning ◮ 1st pass, learn structure π:

◮ Uses variable ordering heuristics based on mutual information, so efficient and scalable.

◮ 2nd pass, learn CPTs:

◮ Collect statistics according to the structure learned.
◮ Form CPTs using Laplace smoothers, or m-estimation.
◮ With simple CPTs this is an exponential family model, so inherently scalable.

10 / 35
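For reference, one standard form of the smoothers named above (the exact variant used in the paper may differ): with n_{x_i,\pi_i} the count of value x_i in parent context \pi_i and n_{\pi_i} the context total,

$$\hat{P}(x_i \mid \pi_i) = \frac{n_{x_i,\pi_i} + m\,p_0}{n_{\pi_i} + m}$$

for m-estimation with prior guess p_0; Laplace (add-one) smoothing is the special case m = |X_i| with p_0 = 1/|X_i|.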

SLIDE 19

Selective k-Dependence Bayes (SKDB)

Martínez, Webb, Chen and Zaidi, JMLR 2016

But, how do we pick k in KDB, and how do we select which attributes to use?

11 / 35

SLIDE 20

Selective k-Dependence Bayes (SKDB)

Martínez, Webb, Chen and Zaidi, JMLR 2016

But, how do we pick k in KDB, and how do we select which attributes to use?

◮ Use Leave-one-out cross validation (LOOCV) on MSE to select both k and which attributes to use.
◮ Requires a third pass through the data to compute LOOCV MSE estimates of probability and minimise.
◮ As efficient as previous passes.
◮ Called SKDB.

11 / 35
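One way such a third pass can be as cheap as the earlier ones (a sketch of the general idea only, not necessarily the exact procedure used here): because the model is just a table of counts, an instance can be left out by subtracting its contribution, scoring it, and adding it back. The CountModel interface below is hypothetical.

// Sketch only: leave-one-out by count decrement for a count-based classifier.
interface CountModel {
    void decrementCounts(int[] x, int label);
    void incrementCounts(int[] x, int label);
    double[] classProbabilities(int[] x);
}

final class LoocvMse {
    static double loocvMse(CountModel model, int[][] xs, int[] labels) {
        double sq = 0.0;
        for (int i = 0; i < xs.length; i++) {
            model.decrementCounts(xs[i], labels[i]);        // leave instance i out
            double[] p = model.classProbabilities(xs[i]);   // predict without it
            sq += (1.0 - p[labels[i]]) * (1.0 - p[labels[i]]);
            model.incrementCounts(xs[i], labels[i]);        // restore the counts
        }
        return sq / xs.length;                              // LOOCV MSE
    }
}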

SLIDE 21

Learning Curves: Typical Comparison

12 / 35

SLIDE 22

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

13 / 35

SLIDE 23

Reminder: Main Claim

◮ Hierarchical smoothing

◮ using hierarchical Dirichlet models

◮ applied to Bayesian network classifiers
◮ on categorical datasets
◮ beats Random Forest

13 / 35

SLIDE 24

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

14 / 35

SLIDE 25

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

p(disease|has-gene & male)?

14 / 35

SLIDE 26

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

pMLE = 0%

14 / 35

SLIDE 27

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

pLaplace = 33%

14 / 35

SLIDE 28

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

pm-estimate = 25%

14 / 35
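A possible reading of the three estimates above for the (has gene, male) cell, which holds 0 diseased and 1 healthy patient. The m-estimate value matches m = 1 with a uniform prior, which is an assumption on my part:

$$p_{\mathrm{MLE}} = \frac{0}{0+1} = 0\%, \qquad p_{\mathrm{Laplace}} = \frac{0+1}{1+2} \approx 33\%, \qquad p_{m\text{-estimate}} = \frac{0 + 1 \cdot \tfrac{1}{2}}{1 + 1} = 25\%.$$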

SLIDE 29

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

pm-estimate = 25%

None of them use the fact that 91% of the patients with that gene have the disease!

14 / 35

SLIDE 30

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

pm-estimate = 25%

None of them use the fact that 91% of the patients with that gene have the disease!

14 / 35

The idea of hierarchical smoothing/estimation is to make the estimate at each node a function of the data at the node and the estimate at the parent:

p(disease | has gene & male) ∼ p(disease | has gene)
p(disease | has gene) ∼ p(disease)

SLIDE 31

Hierarchical Smoothing

Hierarchical Smoothing: When smoothing parameters in the context of a tree, use parent or ancestor parameter estimates in the smoothing.

15 / 35

SLIDE 32

Hierarchical Smoothing

◮ You add prior parameters φ representing prior probability vectors for all ancestor nodes.

[Parameter tree:
 φdisease
   φdisease|has-gene
     θdisease|has-gene,female
     θdisease|has-gene,male
   φdisease|¬has-gene]

16 / 35

SLIDE 33

Hierarchical Smoothing

◮ You add prior parameters φ representing prior probability vectors for all ancestor nodes.

[Parameter tree:
 φdisease
   φdisease|has-gene
     θdisease|has-gene,female
     θdisease|has-gene,male
   φdisease|¬has-gene]

16 / 35

The leaf variables θ are model parameters for the leaf probabilities
◮ our task is to estimate these

SLIDE 34

Hierarchical Smoothing

◮ You add prior parameters φ representing prior probability vectors for all ancestor nodes.

[Parameter tree:
 φdisease
   φdisease|has-gene
     θdisease|has-gene,female
     θdisease|has-gene,male
   φdisease|¬has-gene]

16 / 35

The ancestor variables φ are prior parameters used in estimating the leaf probabilities
◮ these are beliefs, not frequencies
◮ they do not correspond to frequencies at the ancestor nodes

SLIDE 35

Hierarchical Smoothing Model

Use Dirichlet distributions hierarchically.

◮ use Dir(θ, α) to represent a Dirichlet with parameter αθ
◮ normalised probability vector θ
◮ concentration (inverse variance) α

17 / 35

SLIDE 36

Hierarchical Smoothing Model

Use Dirichlet distributions hierarchically.

◮ use Dir(θ, α) to represent a Dirichlet with parameter αθ
◮ normalised probability vector θ
◮ concentration (inverse variance) α

Use the pattern: θ(node) | φ(parent) ∼ Dir(φ(parent), α(node))

17 / 35
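In this mean/concentration parameterisation, Dir(φ(parent), α(node)) is the ordinary Dirichlet with parameter vector α(node)·φ(parent), so the child's expected probability vector is exactly the parent's, and α(node) controls how tightly it concentrates around it:

$$\theta(\mathrm{node}) \sim \mathrm{Dirichlet}\big(\alpha(\mathrm{node})\,\phi(\mathrm{parent})\big) \;\Rightarrow\; \mathbb{E}\left[\theta(\mathrm{node})\right] = \phi(\mathrm{parent}),$$

with variance shrinking as α(node) grows.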

SLIDE 37

Hierarchical Smoothing Model, cont.

Leaf probabilities:

$$\theta_{X_c \mid y,x_1,\cdots,x_n} \sim \mathrm{Dir}\!\left(\phi_{X_c \mid y,x_1,\cdots,x_{n-1}},\ \alpha_{y,x_1,\cdots,x_n}\right)$$

18 / 35
SLIDE 38

Hierarchical Smoothing Model, cont.

Leaf probabilities:

$$\theta_{X_c \mid y,x_1,\cdots,x_n} \sim \mathrm{Dir}\!\left(\phi_{X_c \mid y,x_1,\cdots,x_{n-1}},\ \alpha_{y,x_1,\cdots,x_n}\right)$$

Prior probabilities:

$$\phi_{X_c} \sim \mathrm{Dir}\!\left(\tfrac{1}{|X_c|}\mathbf{1},\ \alpha_0\right)$$

$$\phi_{X_c \mid y} \sim \mathrm{Dir}\!\left(\phi_{X_c},\ \alpha_y\right)$$

$$\vdots$$

$$\phi_{X_c \mid y,x_1,\cdots,x_{n-1}} \sim \mathrm{Dir}\!\left(\phi_{X_c \mid y,x_1,\cdots,x_{n-2}},\ \alpha_{y,x_1,\cdots,x_{n-1}}\right)$$

18 / 35
SLIDE 39

Smoothing Formula

Smoothed probability estimates work back down the tree from the root using the pattern:

p(node) ∝ count(node) + p(parent) × α(node)

19 / 35
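A minimal Java sketch of that top-down back-off on a tree of class counts. The SmoothedNode class is hypothetical, and the α's appear as fixed per-node constants, whereas the method in the paper estimates them (and the internal φ's) with the HDP machinery described on the following slides.

// Sketch only: top-down back-off smoothing on a tree of class counts.
final class SmoothedNode {
    int[] counts;          // n_k: count of class value k at this node
    double alpha;          // concentration α(node)
    SmoothedNode parent;   // null at the root

    double[] smoothedProbs(double[] rootPrior) {
        double[] parentP = (parent == null) ? rootPrior   // root backs off to a uniform prior
                                            : parent.smoothedProbs(rootPrior);
        int total = 0;
        for (int c : counts) total += c;
        double[] p = new double[counts.length];
        for (int k = 0; k < counts.length; k++)
            p[k] = (counts[k] + alpha * parentP[k]) / (total + alpha);   // count + α · p(parent)
        return p;
    }
}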

SLIDE 40

Smoothing Formula

Smoothed probability estimates work back down the tree from the root using the pattern:

p(node) ∝ count(node) + p(parent) × α(node)

Yielding:

$$\hat{\phi}_{x_c} = \frac{n_{x_c} + \frac{1}{|X_c|}\alpha_0}{n_{\cdot} + \alpha_0}$$

$$\hat{\phi}_{x_c \mid y,x_1,\cdots,x_i} = \frac{n_{x_c \mid y,x_1,\cdots,x_i} + \hat{\phi}_{x_c \mid y,x_1,\cdots,x_{i-1}}\,\alpha_{y,x_1,\cdots,x_i}}{n_{\cdot \mid y,x_1,\cdots,x_i} + \alpha_{y,x_1,\cdots,x_i}}$$

$$\hat{\theta}_{x_c \mid y,x_1,\cdots,x_n} = \frac{n_{x_c \mid y,x_1,\cdots,x_n} + \hat{\phi}_{x_c \mid y,x_1,\cdots,x_{n-1}}\,\alpha_{y,x_1,\cdots,x_n}}{n_{\cdot \mid y,x_1,\cdots,x_n} + \alpha_{y,x_1,\cdots,x_n}}$$

19 / 35

SLIDE 41

Smoothing Formula

Smoothed probability estimates work back down the tree from the root using the pattern:

p(node) ∝ count(node) + p(parent) × α(node)

Yielding:

$$\hat{\phi}_{x_c} = \frac{n_{x_c} + \frac{1}{|X_c|}\alpha_0}{n_{\cdot} + \alpha_0}$$

$$\hat{\phi}_{x_c \mid y,x_1,\cdots,x_i} = \frac{n_{x_c \mid y,x_1,\cdots,x_i} + \hat{\phi}_{x_c \mid y,x_1,\cdots,x_{i-1}}\,\alpha_{y,x_1,\cdots,x_i}}{n_{\cdot \mid y,x_1,\cdots,x_i} + \alpha_{y,x_1,\cdots,x_i}}$$

$$\hat{\theta}_{x_c \mid y,x_1,\cdots,x_n} = \frac{n_{x_c \mid y,x_1,\cdots,x_n} + \hat{\phi}_{x_c \mid y,x_1,\cdots,x_{n-1}}\,\alpha_{y,x_1,\cdots,x_n}}{n_{\cdot \mid y,x_1,\cdots,x_n} + \alpha_{y,x_1,\cdots,x_n}}$$

But how do we get the estimates $\hat{\phi}_{x_c \mid y,x_1,\cdots,x_i}$?

19 / 35

SLIDE 42

Hierarchical Dirichlet

The Dirichlet distribution corresponds to a Dirichlet process with a discrete base distribution.

20 / 35

SLIDE 43

Hierarchical Dirichlet

The Dirichlet distribution corresponds to a Dirichlet process with a discrete base distribution. We use a hierarchical Dirichlet process (HDP) to handle the hierarchical Dirichlet distributions.

20 / 35

SLIDE 44

Historical Context for HDP

1990s-2003: Pitman and Ishwaran and James in mathematical statistics develop theory.

2006: Teh, Jordan, Beal and Blei develop HDP, e.g. applied to LDA.

2006-2011: Chinese restaurant processes (CRPs) go wild!

◮ require dynamic memory in implementation, e.g. Chinese restaurant franchise, stick-breaking, etc.

But: very slow, require large amounts of dynamic memory.

21 / 35

SLIDE 45

Historical Context for HDP

1990s-2003: Pitman and Ishwaran and James in mathematical statistics develop theory.

2006: Teh, Jordan, Beal and Blei develop HDP, e.g. applied to LDA.

2006-2011: Chinese restaurant processes (CRPs) go wild!

◮ require dynamic memory in implementation, e.g. Chinese restaurant franchise, stick-breaking, etc.

But: very slow, require large amounts of dynamic memory.

popularity of HDPs has decreased!

21 / 35

SLIDE 46

Historical Context for HDP, cont.

2011: Chen, Du, Buntine show slow methods not needed by introducing collapsed samplers.

2011: Buntine (unpublished) develops high performance algorithm for HDP and n-grams.

2014: Buntine and Mishra develop high performance algorithm for HDP and topic models.

22 / 35

SLIDE 47

Historical Context for HDP, cont.

2011: Chen, Du, Buntine show slow methods not needed by introducing collapsed samplers.

2011: Buntine (unpublished) develops high performance algorithm for HDP and n-grams.

2014: Buntine and Mishra develop high performance algorithm for HDP and topic models.

◮ We use high performance techniques for the hierarchical Dirichlet process (HDP) to do inference.

◮ outperforms Stochastic Variational Inference on some tasks

◮ This uses a (fairly) efficient Gibbs sampler.

◮ no dynamic memory
◮ with variable augmentation and caching

◮ Details in the paper.

22 / 35

SLIDE 48

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

23 / 35

SLIDE 49

Main Claim

◮ Hierarchical smoothing
◮ applied to Bayesian network classifiers
◮ on categorical datasets

◮ or pre-discretised attributes

◮ beats Random Forest

23 / 35

SLIDE 50

UCI Datasets

and lots more datasets ... (not shown in the figure)

24 / 35

SLIDE 51

UCI Datasets Preprocessing

◮ Convert into ARFF format and process on WEKA.
◮ Apply the MDL discretization method of Fayyad and Irani.
◮ Also did one experiment with the very large Splice dataset.

25 / 35

SLIDE 52

Experimental Setup

◮ Use 5 runs of 2-fold cross validation.

◮ known to be more stable than 10-fold cross validation

26 / 35

SLIDE 53

Experimental Setup

◮ Use 5 runs of 2-fold cross validation.

◮ known to be more stable than 10-fold cross validation

◮ Evaluate with RMSE and 0-1 loss.

◮ in classification context, MSE is related to the Brier score and is a proper scoring function, so evaluates the probabilities

26 / 35
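For concreteness, the RMSE here is taken over the predicted class probabilities. One common formulation (the paper's exact normalisation may differ) is the square root of a Brier-style score,

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\sum_{c}\left(\hat{P}(c \mid \mathbf{x}_j) - \mathbb{1}[y_j = c]\right)^2},$$

which is minimised in expectation by the true conditional probabilities, hence "proper".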

SLIDE 54

Experimental Setup

◮ Use 5 runs of 2-fold cross validation.

◮ known to be more stable than 10-fold cross validation

◮ Evaluate with RMSE and 0-1 loss.

◮ in classification context, MSE is related to the Brier score and is a proper scoring function, so evaluates the probabilities

◮ Test the KDB versions and SKDB (max k=5) for HDP versus m-estimation with a back-off (for zero counts).

◮ with m-estimation, we estimate m from {0, 0.05, 0.2, 1, 5, 20} using cross validation on non-test subset

26 / 35
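A sketch of that selection loop in Java. The evaluate hook (train on the fold's training part with a given m, return validation RMSE) stands in for the real training and scoring code, so everything here is illustrative rather than the actual implementation.

import java.util.function.BiFunction;

// Sketch only: choosing m for m-estimation by cross validation on the non-test data.
final class SelectM {
    static double selectM(double[] candidates, int folds,
                          BiFunction<Integer, Double, Double> evaluate) {
        double bestM = candidates[0];
        double bestRmse = Double.POSITIVE_INFINITY;
        for (double m : candidates) {                 // e.g. {0, 0.05, 0.2, 1, 5, 20}
            double rmse = 0.0;
            for (int f = 0; f < folds; f++) rmse += evaluate.apply(f, m);
            rmse /= folds;
            if (rmse < bestRmse) { bestRmse = rmse; bestM = m; }
        }
        return bestM;
    }
}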

SLIDE 55

Experimental Setup

◮ Use 5 runs of 2-fold cross validation.

◮ known to be more stable than 10-fold cross validation

◮ Evaluate with RMSE and 0-1 loss.

◮ in classification context, MSE is related to the Brier score and is a proper scoring function, so evaluates the probabilities

◮ Test the KDB versions and SKDB (max k=5) for HDP versus m-estimation with a back-off (for zero counts).

◮ with m-estimation, we estimate m from {0, 0.05, 0.2, 1, 5, 20} using cross validation on non-test subset

◮ Also test against Random Forest with 100 trees.

26 / 35

SLIDE 56

Experimental Setup

◮ Use 5 runs of 2-fold cross validation.

◮ known to be more stable than 10-fold cross validation

◮ Evaluate with RMSE and 0-1 loss.

◮ in classification context, MSE is related to the Brier score and is a proper scoring function, so evaluates the probabilities

◮ Test the KDB versions and SKDB (max k=5) for HDP versus m-estimation with a back-off (for zero counts).

◮ with m-estimation, we estimate m from {0, 0.05, 0.2, 1, 5, 20} using cross validation on non-test subset

◮ Also test against Random Forest with 100 trees.
◮ Did one experiment with the very large Splice dataset.

26 / 35

SLIDE 57

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

27 / 35

SLIDE 58

Main Claim

◮ Hierarchical smoothing
◮ applied to Bayesian network classifiers
◮ on categorical datasets
◮ beats Random Forest

27 / 35

SLIDE 59

KDBs for HDP versus m-estimation

∗ bold W-D-L values are significant at 5% by two-tailed binomial sign test

28 / 35

SLIDE 60

RMSE for KDB-5 for HDP versus m-estimation

29 / 35

SLIDE 61

Comparison of TAN, SKDB and RF100

∗ bold W-D-L values are significant at 5% by two-tailed binomial sign test

30 / 35

SLIDE 62

0-1 Loss for SKDB-HDP versus RF100

31 / 35

SLIDE 63

SKDB versus Gradient Boosting

◮ Splice data: 50 million plus training examples
◮ imbalanced: 1% positive class
◮ RF could not run with WEKA (out of memory)
◮ using XGBoost v0.6, 1 hour computation
◮ SKDB-HDP, 4 hour computation

32 / 35

SLIDE 64

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

33 / 35

SLIDE 65

Software for HDP Hierarchical Smoothing

Download, compile and run

git clone https://github.com/fpetitjean/HDP    # download
cd HDP
ant                                            # compile
java -jar jar/HDP.jar                          # run example

Example with your data

String[][] data = {            // (stroke, weight, height)
    {"yes", "heavy", "tall"},
    ...
    {"yes", "heavy", "med"}
};
ProbabilityTree hdp = new ProbabilityTree();   // init.
hdp.addDataset(data);          // learn HDP tree - p(stroke|weight,height)
hdp.query("heavy", "short");   // returns [61%, 39%]
hdp.query("heavy", "tall");    // returns [31%, 69%]
hdp.query("light", "tall");    // returns [9%, 91%]

33 / 35

SLIDE 66

Conclusion

1. Hierarchical smoothing using HDP theory and algorithm presented.

◮ HDP smoothing code on Github in Java

34 / 35

SLIDE 67

Conclusion

1. Hierarchical smoothing using HDP theory and algorithm presented.

◮ HDP smoothing code on Github in Java

2. Combined HDP smoother with SKDB learner for BNCs to produce fast(-ish), scalable classification algorithm beating RFs.

34 / 35

SLIDE 68

Conclusion

1. Hierarchical smoothing using HDP theory and algorithm presented.

◮ HDP smoothing code on Github in Java

2. Combined HDP smoother with SKDB learner for BNCs to produce fast(-ish), scalable classification algorithm beating RFs.

3. He ‘Penny’ Zhang (Monash PhD student) has significant improvements to the method.

◮ sped up algorithm and beating Gradient Boosting of trees

34 / 35

SLIDE 69

Questions?

35 / 35