Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes (PowerPoint presentation)

SLIDE 1

Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes

François Petitjean, Wray Buntine, Geoff Webb and Nayyar Zaidi, Monash University, 2018-09-13

1 / 35

SLIDE 2

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

2 / 35

SLIDE 3

A Cultural Divide

Context: discussing how to teach Data Science with a well-known professor of Statistics.

She said: “when first teaching overfitting, I always give some examples where machine learning has trouble, like decision trees”

I said: “funny, I do the reverse, I always give examples where statistical models have trouble”

2 / 35

SLIDE 4

A Cultural Divide

Context: discussing how to teach Data Science with a well-known professor of Statistics.

She said: “when first teaching overfitting, I always give some examples where machine learning has trouble, like decision trees”

I said: “funny, I do the reverse, I always give examples where statistical models have trouble”

ASIDE: our hierarchical smoothing also gives state-of-the-art results for decision tree smoothing

2 / 35

SLIDE 5

State of the Art in Classification

Favoured techniques for standard classification are Random Forest and Gradient Boosting (of trees).

3 / 35

SLIDE 6

State of the Art in Classification

Favoured techniques for standard classification are Random Forest and Gradient Boosting (of trees).

  • NB: for sequences, images or graphs, deep neural networks (recurrent NN, convolutional NN, etc.) are better

3 / 35

SLIDE 7

Main Claim

Main Claim: Hierarchical smoothing applied to Bayesian network classifiers on categorical data beats Random Forest

¹ not well shown in the paper ...

4 / 35

SLIDE 8

Main Claim

Main Claim: Hierarchical smoothing applied to Bayesian network classifiers on categorical data beats Random Forest

◮ a single model beats a state-of-the-art ensemble

◮ is also comparable with XGBoost¹

◮ but only on categorical data

◮ though also for a lot of other data too¹

¹ not well shown in the paper ...

4 / 35

SLIDE 9

Unpacking the Main Claim

◮ Hierarchical smoothing

◮ using hierarchical Dirichlet models

◮ applied to Bayesian network classifiers

◮ the KDB and SKDB family

◮ on categorical datasets

◮ or pre-discretised attributes

◮ beats Random Forest

5 / 35

SLIDE 10

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

6 / 35

SLIDE 11

Reminder: Main Claim

◮ Hierarchical smoothing
◮ applied to Bayesian network classifiers

◮ the KDB and SKDB family

◮ on categorical datasets
◮ beats Random Forest

6 / 35

SLIDE 12

Learning Bayesian Networks

tutorial by Cussens, Malone and Yuan, IJCAI 2013

Bayesian Network learning = Structure learning + Conditional Probability Table estimation

7 / 35

SLIDE 13

Bayesian Network Classifiers

Friedman, Geiger, Goldszmidt, Machine Learning 1997

◮ Defined by parent relation π and Conditional Probability Tables (CPTs)

◮ π encodes conditional independence / structure
◮ πi is the set of parent variables for Xi
◮ CPTs encode conditional probabilities

◮ For classification, make class variable Y a parent of all Xi

8 / 35

SLIDE 14

Bayesian Network Classifiers

Friedman, Geiger, Goldszmidt, Machine Learning 1997

◮ Defined by parent relation π and Conditional Probability Tables (CPTs)

◮ π encodes conditional independence / structure
◮ πi is the set of parent variables for Xi
◮ CPTs encode conditional probabilities

◮ For classification, make class variable Y a parent of all Xi
◮ Classifies using P(y | x) ∝ P(y | πY) ∏i P(xi | πi)

8 / 35

SLIDE 15

Bayesian Network Classifiers

Friedman, Geiger, Goldszmidt, Machine Learning 1997

◮ Defined by parent relation π and Conditional Probability Tables (CPTs)

◮ π encodes conditional independence / structure
◮ πi is the set of parent variables for Xi
◮ CPTs encode conditional probabilities

◮ For classification, make class variable Y a parent of all Xi
◮ Classifies using P(y | x) ∝ P(y | πY) ∏i P(xi | πi)

Naïve Bayes classifier: πi = {Y}

[Figure: naïve Bayes structure, Y the sole parent of X1, X2, X3, X4, with attributes ordered by decreasing mutual information with Y]

8 / 35
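Not on the slides, but a minimal Java sketch of how this classification rule is evaluated once the CPTs are estimated, here for the naïve Bayes case πi = {Y}. The class and method names (BncScorer, classProbabilities, the logCpt layout) are hypothetical, not the released code.

import java.util.Arrays;

// Sketch only: evaluating P(y | x) ∝ P(y) ∏i P(xi | πi) in log space,
// for the naïve Bayes case πi = {Y}. Names and CPT layout are hypothetical.
public final class BncScorer {

    // logPriorY[y]    = log P(y)
    // logCpt[i][y][v] = log P(Xi = v | Y = y)   (naïve Bayes CPT layout)
    public static double[] classProbabilities(int[] x, double[] logPriorY,
                                              double[][][] logCpt) {
        int numClasses = logPriorY.length;
        double[] logPost = new double[numClasses];
        for (int y = 0; y < numClasses; y++) {
            double s = logPriorY[y];
            for (int i = 0; i < x.length; i++) {
                s += logCpt[i][y][x[i]];
            }
            logPost[y] = s;
        }
        // normalise in log space to recover P(y | x)
        double max = Arrays.stream(logPost).max().orElse(0.0);
        double z = 0.0;
        for (double v : logPost) z += Math.exp(v - max);
        double[] p = new double[numClasses];
        for (int y = 0; y < numClasses; y++) p[y] = Math.exp(logPost[y] - max) / z;
        return p;
    }
}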

SLIDE 16

k-Dependence Bayes (KDB)

Sahami, KDD 1996

KDB-1 classifier:

(attributes have 1 extra parent)

[Figure: KDB-1 structure over Y and X1...X4, attributes ordered by decreasing mutual information with Y]

KDB-2 classifier:

(attributes have 2 extra parents)

[Figure: KDB-2 structure over Y and X1...X4]

  • NB: other parents also selected by mutual information

9 / 35

SLIDE 17

Learning k-Dependence Bayes (KDB)

◮ Two pass learning ◮ 1st pass, learn structure π:

◮ Uses variable ordering heuristics based on mutual information, so efficient and scalable.

10 / 35
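As a reminder of what that first pass computes, here is a small self-contained sketch of mutual information from a joint count table, the quantity used to rank attributes against the class. It is illustrative only; the MutualInfo class and its indexing are hypothetical, not the authors' implementation.

// Sketch only: I(X;Y) from a joint count table n[x][y],
// the quantity used to order attributes in KDB's first pass.
public final class MutualInfo {
    public static double mi(int[][] n) {
        double total = 0;
        double[] nx = new double[n.length];
        double[] ny = new double[n[0].length];
        for (int x = 0; x < n.length; x++)
            for (int y = 0; y < n[0].length; y++) {
                total += n[x][y];
                nx[x] += n[x][y];
                ny[y] += n[x][y];
            }
        double mi = 0.0;
        for (int x = 0; x < n.length; x++)
            for (int y = 0; y < n[0].length; y++) {
                if (n[x][y] == 0) continue;
                double pxy = n[x][y] / total;
                mi += pxy * Math.log((n[x][y] * total) / (nx[x] * ny[y]));
            }
        return mi;   // in nats; divide by Math.log(2) for bits
    }
}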

SLIDE 18

Learning k-Dependence Bayes (KDB)

◮ Two pass learning ◮ 1st pass, learn structure π:

◮ Uses variable ordering heuristics based on mutual information, so efficient and scalable.

◮ 2nd pass, learn CPTs:

◮ Collect statistics according to the structure learned.
◮ Form CPTs using Laplace smoothers, or m-estimation.
◮ With simple CPTs this is an exponential family model, so inherently scalable.

10 / 35
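For reference, one standard form of the smoothers named above (the exact variant used in the paper may differ): with n_{x_i,\pi_i} the count of value x_i in parent context \pi_i and n_{\pi_i} the context total,

$$\hat{P}(x_i \mid \pi_i) = \frac{n_{x_i,\pi_i} + m\,p_0}{n_{\pi_i} + m}$$

for m-estimation with prior guess p_0; Laplace (add-one) smoothing is the special case m = |X_i| with p_0 = 1/|X_i|.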

SLIDE 19

Selective k-Dependence Bayes (SKDB)

Martínez, Webb, Chen and Zaidi, JMLR 2016

But, how do we pick k in KDB, and how do we select which attributes to use?

11 / 35

SLIDE 20

Selective k-Dependence Bayes (SKDB)

Martínez, Webb, Chen and Zaidi, JMLR 2016

But, how do we pick k in KDB, and how do we select which attributes to use?

◮ Use Leave-one-out cross validation (LOOCV) on MSE to select both k and which attributes to use.
◮ Requires a third pass through the data to compute LOOCV MSE estimates of probability and minimise.
◮ As efficient as previous passes.
◮ Called SKDB.

11 / 35
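One way such a third pass can be as cheap as the earlier ones (a sketch of the general idea only, not necessarily the exact procedure used here): because the model is just a table of counts, an instance can be left out by subtracting its contribution, scoring it, and adding it back. The CountModel interface below is hypothetical.

// Sketch only: leave-one-out by count decrement for a count-based classifier.
interface CountModel {
    void decrementCounts(int[] x, int label);
    void incrementCounts(int[] x, int label);
    double[] classProbabilities(int[] x);
}

final class LoocvMse {
    static double loocvMse(CountModel model, int[][] xs, int[] labels) {
        double sq = 0.0;
        for (int i = 0; i < xs.length; i++) {
            model.decrementCounts(xs[i], labels[i]);        // leave instance i out
            double[] p = model.classProbabilities(xs[i]);   // predict without it
            sq += (1.0 - p[labels[i]]) * (1.0 - p[labels[i]]);
            model.incrementCounts(xs[i], labels[i]);        // restore the counts
        }
        return sq / xs.length;                              // LOOCV MSE
    }
}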

SLIDE 21

Learning Curves: Typical Comparison

12 / 35

SLIDE 22

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

13 / 35

SLIDE 23

Reminder: Main Claim

◮ Hierarchical smoothing

◮ using hierarchical Dirichlet models

◮ applied to Bayesian network classifiers
◮ on categorical datasets
◮ beats Random Forest

13 / 35

SLIDE 24

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

14 / 35

SLIDE 25

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

p(disease|has-gene & male)?

14 / 35

SLIDE 26

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

pMLE = 0%

14 / 35

SLIDE 27

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

pLaplace = 33%

14 / 35

SLIDE 28

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

pm-estimate = 25%

14 / 35
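A possible reading of the three estimates above for the (has gene, male) cell, which holds 0 diseased and 1 healthy patient. The m-estimate value matches m = 1 with a uniform prior, which is an assumption on my part:

$$p_{\mathrm{MLE}} = \frac{0}{0+1} = 0\%, \qquad p_{\mathrm{Laplace}} = \frac{0+1}{1+2} \approx 33\%, \qquad p_{m\text{-estimate}} = \frac{0 + 1 \cdot \tfrac{1}{2}}{1 + 1} = 25\%.$$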

SLIDE 29

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

pm-estimate = 25%

None of them use the fact that 91% of the patients with that gene have the disease!

14 / 35

SLIDE 30

Why do Hierarchical Smoothing?

◮ You want to predict disease as a function of some rare gene G and sex, knowing that this disease is more prevalent for females

[Tree of counts, shown as "#patients with disease – #patients without disease":
 all patients 100–901
   has gene 10–1
     female 10–0
     male 0–1
   doesn't have gene 90–900]

pm-estimate = 25%

None of them use the fact that 91% of the patients with that gene have the disease!

14 / 35

The idea of hierarchical smoothing/estimation is to make the estimate at each node a function of the data at the node and the estimate at the parent:

p(disease | has gene & male) ∼ p(disease | has gene)
p(disease | has gene) ∼ p(disease)

SLIDE 31

Hierarchical Smoothing

Hierarchical Smoothing: When smoothing parameters in the context of a tree, use parent or ancestor parameter estimates in the smoothing.

15 / 35

SLIDE 32

Hierarchical Smoothing

◮ You add prior parameters φ representing prior probability vectors for all ancestor nodes.

[Parameter tree:
 φdisease
   φdisease|has-gene
     θdisease|has-gene,female
     θdisease|has-gene,male
   φdisease|¬has-gene]

16 / 35

SLIDE 33

Hierarchical Smoothing

◮ You add prior parameters φ representing prior probability vectors for all ancestor nodes.

[Parameter tree:
 φdisease
   φdisease|has-gene
     θdisease|has-gene,female
     θdisease|has-gene,male
   φdisease|¬has-gene]

16 / 35

The leaf variables θ are model parameters for the leaf probabilities
◮ our task is to estimate these

SLIDE 34

Hierarchical Smoothing

◮ You add prior parameters φ representing prior probability vectors for all ancestor nodes.

[Parameter tree:
 φdisease
   φdisease|has-gene
     θdisease|has-gene,female
     θdisease|has-gene,male
   φdisease|¬has-gene]

16 / 35

The ancestor variables φ are prior parameters used in estimating the leaf probabilities
◮ these are beliefs, not frequencies
◮ they do not correspond to frequencies at the ancestor nodes

SLIDE 35

Hierarchical Smoothing Model

Use Dirichlet distributions hierarchically.

◮ use Dir(θ, α) to represent a Dirichlet with parameter αθ
◮ normalised probability vector θ
◮ concentration (inverse variance) α

17 / 35

SLIDE 36

Hierarchical Smoothing Model

Use Dirichlet distributions hierarchically.

◮ use Dir(θ, α) to represent a Dirichlet with parameter αθ
◮ normalised probability vector θ
◮ concentration (inverse variance) α

Use the pattern: θ(node) | φ(parent) ∼ Dir(φ(parent), α(node))

17 / 35
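In this mean/concentration parameterisation, Dir(φ(parent), α(node)) is the ordinary Dirichlet with parameter vector α(node)·φ(parent), so the child's expected probability vector is exactly the parent's, and α(node) controls how tightly it concentrates around it:

$$\theta(\mathrm{node}) \sim \mathrm{Dirichlet}\big(\alpha(\mathrm{node})\,\phi(\mathrm{parent})\big) \;\Rightarrow\; \mathbb{E}\left[\theta(\mathrm{node})\right] = \phi(\mathrm{parent}),$$

with variance shrinking as α(node) grows.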

SLIDE 37

Hierarchical Smoothing Model, cont.

Leaf probabilities:

$$\theta_{X_c \mid y,x_1,\cdots,x_n} \sim \mathrm{Dir}\!\left(\phi_{X_c \mid y,x_1,\cdots,x_{n-1}},\ \alpha_{y,x_1,\cdots,x_n}\right)$$

18 / 35
SLIDE 38

Hierarchical Smoothing Model, cont.

Leaf probabilities:

$$\theta_{X_c \mid y,x_1,\cdots,x_n} \sim \mathrm{Dir}\!\left(\phi_{X_c \mid y,x_1,\cdots,x_{n-1}},\ \alpha_{y,x_1,\cdots,x_n}\right)$$

Prior probabilities:

$$\phi_{X_c} \sim \mathrm{Dir}\!\left(\tfrac{1}{|X_c|}\mathbf{1},\ \alpha_0\right)$$

$$\phi_{X_c \mid y} \sim \mathrm{Dir}\!\left(\phi_{X_c},\ \alpha_y\right)$$

$$\vdots$$

$$\phi_{X_c \mid y,x_1,\cdots,x_{n-1}} \sim \mathrm{Dir}\!\left(\phi_{X_c \mid y,x_1,\cdots,x_{n-2}},\ \alpha_{y,x_1,\cdots,x_{n-1}}\right)$$

18 / 35
SLIDE 39

Smoothing Formula

Smoothed probability estimates work back down the tree from the root using the pattern:

p(node) ∝ count(node) + p(parent) × α(node)

19 / 35
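A minimal Java sketch of that top-down back-off on a tree of class counts. The SmoothedNode class is hypothetical, and the α's appear as fixed per-node constants, whereas the method in the paper estimates them (and the internal φ's) with the HDP machinery described on the following slides.

// Sketch only: top-down back-off smoothing on a tree of class counts.
final class SmoothedNode {
    int[] counts;          // n_k: count of class value k at this node
    double alpha;          // concentration α(node)
    SmoothedNode parent;   // null at the root

    double[] smoothedProbs(double[] rootPrior) {
        double[] parentP = (parent == null) ? rootPrior   // root backs off to a uniform prior
                                            : parent.smoothedProbs(rootPrior);
        int total = 0;
        for (int c : counts) total += c;
        double[] p = new double[counts.length];
        for (int k = 0; k < counts.length; k++)
            p[k] = (counts[k] + alpha * parentP[k]) / (total + alpha);   // count + α · p(parent)
        return p;
    }
}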

SLIDE 40

Smoothing Formula

Smoothed probability estimates work back down the tree from the root using the pattern:

p(node) ∝ count(node) + p(parent) × α(node)

Yielding:

$$\hat{\phi}_{x_c} = \frac{n_{x_c} + \frac{1}{|X_c|}\alpha_0}{n_{\cdot} + \alpha_0}$$

$$\hat{\phi}_{x_c \mid y,x_1,\cdots,x_i} = \frac{n_{x_c \mid y,x_1,\cdots,x_i} + \hat{\phi}_{x_c \mid y,x_1,\cdots,x_{i-1}}\,\alpha_{y,x_1,\cdots,x_i}}{n_{\cdot \mid y,x_1,\cdots,x_i} + \alpha_{y,x_1,\cdots,x_i}}$$

$$\hat{\theta}_{x_c \mid y,x_1,\cdots,x_n} = \frac{n_{x_c \mid y,x_1,\cdots,x_n} + \hat{\phi}_{x_c \mid y,x_1,\cdots,x_{n-1}}\,\alpha_{y,x_1,\cdots,x_n}}{n_{\cdot \mid y,x_1,\cdots,x_n} + \alpha_{y,x_1,\cdots,x_n}}$$

19 / 35

SLIDE 41

Smoothing Formula

Smoothed probability estimates work back down the tree from the root using the pattern:

p(node) ∝ count(node) + p(parent) × α(node)

Yielding:

$$\hat{\phi}_{x_c} = \frac{n_{x_c} + \frac{1}{|X_c|}\alpha_0}{n_{\cdot} + \alpha_0}$$

$$\hat{\phi}_{x_c \mid y,x_1,\cdots,x_i} = \frac{n_{x_c \mid y,x_1,\cdots,x_i} + \hat{\phi}_{x_c \mid y,x_1,\cdots,x_{i-1}}\,\alpha_{y,x_1,\cdots,x_i}}{n_{\cdot \mid y,x_1,\cdots,x_i} + \alpha_{y,x_1,\cdots,x_i}}$$

$$\hat{\theta}_{x_c \mid y,x_1,\cdots,x_n} = \frac{n_{x_c \mid y,x_1,\cdots,x_n} + \hat{\phi}_{x_c \mid y,x_1,\cdots,x_{n-1}}\,\alpha_{y,x_1,\cdots,x_n}}{n_{\cdot \mid y,x_1,\cdots,x_n} + \alpha_{y,x_1,\cdots,x_n}}$$

But how do we get the estimates $\hat{\phi}_{x_c \mid y,x_1,\cdots,x_i}$?

19 / 35

SLIDE 42

Hierarchical Dirichlet

The Dirichlet distribution corresponds to a Dirichlet process with a discrete base distribution.

20 / 35

SLIDE 43

Hierarchical Dirichlet

The Dirichlet distribution corresponds to a Dirichlet process with a discrete base distribution. We use a hierarchical Dirichlet process (HDP) to handle the hierarchical Dirichlet distributions.

20 / 35

SLIDE 44

Historical Context for HDP

1990s-2003: Pitman and Ishwaran and James in mathematical statistics develop theory.

2006: Teh, Jordan, Beal and Blei develop HDP, e.g. applied to LDA.

2006-2011: Chinese restaurant processes (CRPs) go wild!

◮ require dynamic memory in implementation, e.g. Chinese restaurant franchise, stick-breaking, etc.

But: very slow, require large amounts of dynamic memory.

21 / 35

SLIDE 45

Historical Context for HDP

1990s-2003: Pitman and Ishwaran and James in mathematical statistics develop theory.

2006: Teh, Jordan, Beal and Blei develop HDP, e.g. applied to LDA.

2006-2011: Chinese restaurant processes (CRPs) go wild!

◮ require dynamic memory in implementation, e.g. Chinese restaurant franchise, stick-breaking, etc.

But: very slow, require large amounts of dynamic memory.

popularity of HDPs has decreased!

21 / 35

SLIDE 46

Historical Context for HDP, cont.

2011: Chen, Du, Buntine show slow methods not needed by introducing collapsed samplers.

2011: Buntine (unpublished) develops high performance algorithm for HDP and n-grams.

2014: Buntine and Mishra develop high performance algorithm for HDP and topic models.

22 / 35

SLIDE 47

Historical Context for HDP, cont.

2011: Chen, Du, Buntine show slow methods not needed by introducing collapsed samplers.

2011: Buntine (unpublished) develops high performance algorithm for HDP and n-grams.

2014: Buntine and Mishra develop high performance algorithm for HDP and topic models.

◮ We use high performance techniques for the hierarchical Dirichlet process (HDP) to do inference.

◮ outperforms Stochastic Variational Inference on some tasks

◮ This uses a (fairly) efficient Gibbs sampler.

◮ no dynamic memory
◮ with variable augmentation and caching

◮ Details in the paper.

22 / 35

SLIDE 48

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

23 / 35

SLIDE 49

Main Claim

◮ Hierarchical smoothing
◮ applied to Bayesian network classifiers
◮ on categorical datasets

◮ or pre-discretised attributes

◮ beats Random Forest

23 / 35

SLIDE 50

UCI Datasets

and lots more datasets ... (not shown in the figure)

24 / 35

SLIDE 51

UCI Datasets Preprocessing

◮ Convert into ARFF format and process on WEKA.
◮ Apply the MDL discretization method of Fayyad and Irani.
◮ Also did one experiment with the very large Splice dataset.

25 / 35

SLIDE 52

Experimental Setup

◮ Use 5 runs of 2-fold cross validation.

◮ known to be more stable than 10-fold cross validation

26 / 35

SLIDE 53

Experimental Setup

◮ Use 5 runs of 2-fold cross validation.

◮ known to be more stable than 10-fold cross validation

◮ Evaluate with RMSE and 0-1 loss.

◮ in classification context, MSE is related to the Brier score and is a proper scoring function, so evaluates the probabilities

26 / 35
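For concreteness, the RMSE here is taken over the predicted class probabilities. One common formulation (the paper's exact normalisation may differ) is the square root of a Brier-style score,

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\sum_{c}\left(\hat{P}(c \mid \mathbf{x}_j) - \mathbb{1}[y_j = c]\right)^2},$$

which is minimised in expectation by the true conditional probabilities, hence "proper".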

SLIDE 54

Experimental Setup

◮ Use 5 runs of 2-fold cross validation.

◮ known to be more stable than 10-fold cross validation

◮ Evaluate with RMSE and 0-1 loss.

◮ in classification context, MSE is related to the Brier score and is a proper scoring function, so evaluates the probabilities

◮ Test the KDB versions and SKDB (max k=5) for HDP versus m-estimation with a back-off (for zero counts).

◮ with m-estimation, we estimate m from {0, 0.05, 0.2, 1, 5, 20} using cross validation on non-test subset

26 / 35
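A sketch of that selection loop in Java. The evaluate hook (train on the fold's training part with a given m, return validation RMSE) stands in for the real training and scoring code, so everything here is illustrative rather than the actual implementation.

import java.util.function.BiFunction;

// Sketch only: choosing m for m-estimation by cross validation on the non-test data.
final class SelectM {
    static double selectM(double[] candidates, int folds,
                          BiFunction<Integer, Double, Double> evaluate) {
        double bestM = candidates[0];
        double bestRmse = Double.POSITIVE_INFINITY;
        for (double m : candidates) {                 // e.g. {0, 0.05, 0.2, 1, 5, 20}
            double rmse = 0.0;
            for (int f = 0; f < folds; f++) rmse += evaluate.apply(f, m);
            rmse /= folds;
            if (rmse < bestRmse) { bestRmse = rmse; bestM = m; }
        }
        return bestM;
    }
}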

SLIDE 55

Experimental Setup

◮ Use 5 runs of 2-fold cross validation.

◮ known to be more stable than 10-fold cross validation

◮ Evaluate with RMSE and 0-1 loss.

◮ in classification context, MSE is related to the Brier score and is a proper scoring function, so evaluates the probabilities

◮ Test the KDB versions and SKDB (max k=5) for HDP versus m-estimation with a back-off (for zero counts).

◮ with m-estimation, we estimate m from {0, 0.05, 0.2, 1, 5, 20} using cross validation on non-test subset

◮ Also test against Random Forest with 100 trees.

26 / 35

SLIDE 56

Experimental Setup

◮ Use 5 runs of 2-fold cross validation.

◮ known to be more stable than 10-fold cross validation

◮ Evaluate with RMSE and 0-1 loss.

◮ in classification context, MSE is related to the Brier score and is a proper scoring function, so evaluates the probabilities

◮ Test the KDB versions and SKDB (max k=5) for HDP versus m-estimation with a back-off (for zero counts).

◮ with m-estimation, we estimate m from {0, 0.05, 0.2, 1, 5, 20} using cross validation on non-test subset

◮ Also test against Random Forest with 100 trees.
◮ Did one experiment with the very large Splice dataset.

26 / 35

SLIDE 57

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

27 / 35

SLIDE 58

Main Claim

◮ Hierarchical smoothing
◮ applied to Bayesian network classifiers
◮ on categorical datasets
◮ beats Random Forest

27 / 35

SLIDE 59

KDBs for HDP versus m-estimation

∗ bold W-D-L values are significant at 5% by two-tailed binomial sign test

28 / 35

SLIDE 60

RMSE for KDB-5 for HDP versus m-estimation

29 / 35

SLIDE 61

Comparison of TAN, SKDB and RF100

∗ bold W-D-L values are significant at 5% by two-tailed binomial sign test

30 / 35

SLIDE 62

0-1 Loss for SKDB-HDP versus RF100

31 / 35

SLIDE 63

SKDB versus Gradient Boosting

◮ Splice data: 50 million plus training examples
◮ imbalanced: 1% positive class
◮ RF could not run with WEKA (out of memory)
◮ using XGBoost v0.6, 1 hour computation
◮ SKDB-HDP, 4 hour computation

32 / 35

SLIDE 64

Outline

Motivation
Bayesian Network Classifiers
Hierarchical Smoothing
Experimental Setup
Results
Conclusion

33 / 35

SLIDE 65

Software for HDP Hierarchical Smoothing

Download, compile and run

git clone https://github.com/fpetitjean/HDP    # download
cd HDP
ant                                            # compile
java -jar jar/HDP.jar                          # run example

Example with your data

String[][] data = {            // (stroke, weight, height)
    {"yes", "heavy", "tall"},
    ...
    {"yes", "heavy", "med"}
};
ProbabilityTree hdp = new ProbabilityTree();   // init.
hdp.addDataset(data);          // learn HDP tree - p(stroke|weight,height)
hdp.query("heavy", "short");   // returns [61%, 39%]
hdp.query("heavy", "tall");    // returns [31%, 69%]
hdp.query("light", "tall");    // returns [9%, 91%]

33 / 35

SLIDE 66

Conclusion

1. Hierarchical smoothing using HDP theory and algorithm presented.

◮ HDP smoothing code on Github in Java

34 / 35

SLIDE 67

Conclusion

1. Hierarchical smoothing using HDP theory and algorithm presented.

◮ HDP smoothing code on Github in Java

2. Combined HDP smoother with SKDB learner for BNCs to produce fast(-ish), scalable classification algorithm beating RFs.

34 / 35

SLIDE 68

Conclusion

1. Hierarchical smoothing using HDP theory and algorithm presented.

◮ HDP smoothing code on Github in Java

2. Combined HDP smoother with SKDB learner for BNCs to produce fast(-ish), scalable classification algorithm beating RFs.

3. He ‘Penny’ Zhang (Monash PhD student) has significant improvements to the method.

◮ sped up algorithm and beating Gradient Boosting of trees

34 / 35

SLIDE 69

Questions?

35 / 35