Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes
François Petitjean, Wray Buntine, Geoff Webb and Nayyar Zaidi, Monash University, 2018-09-13
1 / 35
Outline: Motivation, Bayesian ...
2 / 35
NN, convolutional NN, etc.) are better
3 / 35
◮ is also comparable with XGBoost¹
◮ though also for a lot of other data too¹
¹ not well shown in the paper ...
4 / 35
◮ using hierarchical Dirichlet models
◮ the KDB and SKDB family
◮ or pre-discretised attributes
5 / 35
◮ the KDB and SKDB family
6 / 35
tutorial by Cussens, Malone and Yuan, IJCAI 2013
7 / 35
Friedman, Geiger, Goldszmidt, Machine Learning 1997
◮ π encodes conditional independence / structure
◮ π_i is the set of parent variables of X_i
◮ CPTs encode the conditional probabilities
[Figure: network over the class Y and attributes X1-X4, attributes ordered by decreasing mutual information with Y]
8 / 35
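As a worked equation (my addition, using the slide's notation, in which the parent set π_i of each attribute includes the class Y), the joint distribution a Bayesian network classifier encodes factorises over the CPTs:

% Joint distribution of a Bayesian network classifier:
% each attribute X_i is conditioned on its parent set \pi_i (which includes Y).
P(Y, X_1, \ldots, X_n) \;=\; P(Y) \prod_{i=1}^{n} P(X_i \mid \pi_i)
% Naive Bayes is the special case \pi_i = \{Y\} for every attribute.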
Sahami, KDD 1996
(attributes have 1 extra parent) [Figure: Y and X1-X4, each attribute with one extra attribute-parent; attributes in decreasing mutual information with Y]
(attributes have 2 extra parents) [Figure: the same network with two extra attribute-parents per attribute]
9 / 35
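To make the k-dependence structures concrete, here is the factorisation for KDB with k = 1 (my notation, not from the slides): each attribute takes the class plus one earlier attribute, chosen by conditional mutual information, as parents.

% KDB with k = 1: attribute X_i has parents {Y, X_{j(i)}}, where X_{j(i)} is
% an earlier attribute chosen by conditional mutual information I(X_i; X_j | Y).
P(Y, X_1, \ldots, X_n) \;=\; P(Y)\, P(X_1 \mid Y) \prod_{i=2}^{n} P(X_i \mid Y, X_{j(i)})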
◮ Uses variable-ordering heuristics based on mutual information, so efficient and scalable.
◮ Collect statistics according to the structure learned.
◮ Form CPTs using Laplace smoothing or m-estimation.
◮ With simple CPTs this is an exponential-family model, so inherently scalable.
10 / 35
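As a sketch of what those smoothers compute (my own illustration, not code from the paper), the m-estimate shrinks the empirical frequency of a CPT cell towards a prior p0 with strength m:

// Minimal sketch of m-estimation for one CPT cell (illustration only).
public class MEstimate {
    // n(x, parents)/n(parents) shrunk towards prior p0 with strength m;
    // Laplace smoothing is the special case m = |X| and p0 = 1/|X|.
    static double mEstimate(int nXAndParents, int nParents, double p0, double m) {
        return (nXAndParents + m * p0) / (nParents + m);
    }
    public static void main(String[] args) {
        // e.g. 8 of 10 instances with this parent configuration have value x:
        System.out.println(mEstimate(8, 10, 0.5, 1.0)); // prints 0.7727...
    }
}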
Martínez, Webb, Chen and Zaidi, JMLR 2016
11 / 35
12 / 35
◮ using hierarchical Dirichlet models
13 / 35
[Figure: counts of patients with / without the disease, split by has gene / doesn't have gene, then by female / male]
None of them use the fact that 91% of the patients with that gene have the disease!
14 / 35
The idea of hierarchical smoothing/estimation is to make the estimate at each node a function of the data at that node and the estimate at its parent:
p(disease | has gene & male) ∼ p(disease | has gene)
p(disease | has gene) ∼ p(disease)
15 / 35
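A minimal sketch of that recursion (my illustration only; the paper places a hierarchical Dirichlet process over the tree and learns the concentrations, rather than fixing a single alpha as done here):

// Each node's estimate is its own counts, shrunk towards the
// parent's estimate with strength alpha; applied root-to-leaf.
public class HierarchicalSmoothing {
    static double estimate(int[] counts, double parentEstimateK, int k, double alpha) {
        int total = 0;
        for (int c : counts) total += c;
        return (counts[k] + alpha * parentEstimateK) / (total + alpha);
    }
    public static void main(String[] args) {
        double alpha = 5.0;  // fixed here; learned in the actual model
        double root  = estimate(new int[]{50, 50}, 0.5, 0, alpha);  // p(disease)
        double gene  = estimate(new int[]{10, 1}, root, 0, alpha);  // p(disease | has gene)
        double geneM = estimate(new int[]{2, 0}, gene, 0, alpha);   // p(disease | has gene & male)
        System.out.printf("%.2f %.2f %.2f%n", root, gene, geneM);   // 0.50 0.78 0.84
    }
}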
◮ the leaf variables θ are the model parameters for the leaf probabilities
◮ our task is to estimate these
[Figure: the same tree — has gene / doesn't have gene, then female / male — with θ at the leaves]
16 / 35
◮ the ancestor variables φ are prior parameters used in estimating the leaf probabilities
◮ these are beliefs, not frequencies
◮ they do not correspond to frequencies at the ancestor nodes
17 / 35
◮ at the root of the hierarchy the prior is uniform: a Dirichlet with concentration α0 and mean 1/|Xc| on each value of the child variable Xc
19 / 35
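Written out (my reconstruction from the fragments above, following the standard hierarchical Dirichlet construction; the paper also learns the concentrations rather than fixing a single α):

% Hierarchical Dirichlet model down the tree of parent values:
\phi_{\mathrm{root}} \sim \mathrm{Dirichlet}\big(\alpha_0 \cdot \tfrac{1}{|X_c|}\,\mathbf{1}\big)
\phi_{u} \sim \mathrm{Dirichlet}\big(\alpha\, \phi_{\mathrm{parent}(u)}\big) \quad \text{for internal nodes } u
\theta_{\ell} \sim \mathrm{Dirichlet}\big(\alpha\, \phi_{\mathrm{parent}(\ell)}\big) \quad \text{for leaves } \ell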
20 / 35
21 / 35
◮ outperforms Stochastic Variational Inference on some tasks
◮ no dynamic memory
◮ with variable augmentation and caching
22 / 35
◮ or pre-discretised attributes
23 / 35
24 / 35
25 / 35
◮ known to be more stable than 10-fold cross-validation
◮ in the classification context, MSE is related to the Brier score and is a proper scoring rule, so it evaluates the probabilities
◮ with m-estimation, we choose m from {0, 0.05, 0.2, 1, 5, 20} by cross-validation on the non-test subset
26 / 35
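For reference (my addition; this is the standard definition consistent with the bullet above), the multiclass Brier score over N test cases and C classes is:

% Mean squared error between predicted class probabilities and the
% one-hot encoding of the true class y_n (lower is better).
\mathrm{Brier} = \frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C}\big(\hat{p}(c \mid x_n) - \mathbb{1}[y_n = c]\big)^2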
27 / 35
∗ bold win-draw-loss (W-D-L) values are significant at the 5% level by a two-tailed binomial sign test
28 / 35
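Concretely (my addition), the sign test treats each dataset comparison as a fair coin flip and excludes draws:

% Two-tailed binomial sign test on W wins and L losses (draws excluded);
% the p-value is capped at 1.
p = 2 \sum_{i=\max(W,L)}^{W+L} \binom{W+L}{i} \Big(\tfrac{1}{2}\Big)^{W+L}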
29 / 35
∗ bold W-D-L values are significant at the 5% level by a two-tailed binomial sign test
30 / 35
31 / 35
32 / 35
33 / 35
git clone https://github.com/fpetitjean/HDP   # download
cd HDP
ant                                           # compile
java -jar jar/HDP.jar                         # run example
String[][] data = {            // (stroke, weight, height)
    {"yes", "heavy", "tall"},
    ...
    {"yes", "heavy", "med"}
};
ProbabilityTree hdp = new ProbabilityTree();   // init.
hdp.addDataset(data);          // learn HDP tree - p(stroke|weight,height)
hdp.query("heavy", "short");   // returns [61%, 39%]
hdp.query("heavy", "tall");    // returns [31%, 69%]
hdp.query("light", "tall");    // returns [9%, 91%]
33 / 35
◮ HDP smoothing code on GitHub, in Java
◮ sped-up algorithm, beating gradient boosting of trees
34 / 35
35 / 35