New Prediction Methods for Tree Ensembles with Applications in Record Linkage
Samuel L. Ventura
Rebecca Nugent
Department of Statistics, Carnegie Mellon University
June 11, 2015, 45th Symposium on the Interface: Computing Science and Statistics
◮ Reducing the comparison space more efficiently
◮ Preserving false negative error rates
◮ Quantify the similarity of record-pairs
◮ Classifier ensembles when training data is prohibitively large
◮ Improving predictions for classifier ensembles with
◮ Convert estimated probabilities to dissimilarities
◮ Hierarchical clustering to link groups of records
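The last two outline items can be illustrated with a minimal sketch. The slides do not say which dissimilarity or linkage is used, so the choices here are assumptions: dissimilarity is taken as 1 minus the estimated match probability, and single linkage is used because cutting a single-linkage dendrogram at a given height is the same as joining every pair at or below that height. The function name `link_records`, the `cut_height` parameter, and the toy probabilities are all hypothetical.

```python
from itertools import combinations

def link_records(n_records, match_prob, cut_height):
    """Single-linkage clustering of records 0..n_records-1, cut at cut_height.

    match_prob maps a pair (i, j) with i < j to an estimated match
    probability; missing pairs are treated as probability 0 (an assumption).
    """
    parent = list(range(n_records))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n_records), 2):
        # Assumed conversion: dissimilarity = 1 - estimated match probability.
        dissimilarity = 1.0 - match_prob.get((i, j), 0.0)
        # Joining all pairs with dissimilarity <= h reproduces the groups
        # obtained by cutting a single-linkage dendrogram at height h.
        if dissimilarity <= cut_height:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n_records):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy estimated match probabilities for five records (made-up values).
probs = {(0, 1): 0.95, (1, 2): 0.90, (0, 2): 0.30, (3, 4): 0.85, (2, 3): 0.05}
clusters = link_records(5, probs, cut_height=0.2)
print(clusters)
```

Records 0 and 2 end up in the same group even though their own pairwise probability is low, because each is strongly linked to record 1. Grouping records this way, rather than classifying each pair in isolation, resolves pairwise decisions into consistent groups.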
[Figure: Misclassification Rate vs. Size of Training Dataset, Random Forest with 200 Trees, 10 Variables. x-axis: Number of Training Data Observations; y-axis: Misclassification Rate (%).]
[Figure: Distribution of Tree Probabilities, two panels (Truth: Non-Match; Truth: Match). y-axis: Density.]
[Figure: Distribution of Tree Probabilities, two panels (Truth: Match; Truth: Match). y-axis: Density.]
[Figure: Distribution of Tree Probabilities, two panels (Truth: Match; Truth: Non-Match). y-axis: Density.]
r=1, on X1
r=1 to X2
◮ Covariates: Features of the distribution of predictions for X2
◮ Response: The actual 0-1 response in X2
r=1 to the test data
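The steps above survive only in fragments, so the following is a rough sketch of one reading, not the authors' implementation: fit an ensemble on X1, apply it to X2, fit a second-stage model whose covariates summarize each record's distribution of per-tree predictions and whose response is the actual 0-1 label in X2, then apply both stages to the test data. Everything specific is an assumption: bagged depth-1 trees stand in for the forest, the summary features (mean, standard deviation, share of trees above 0.5) and the logistic second stage are illustrative choices, and the data are simulated.

```python
import math
import random
import statistics

def fit_stump(rows, labels):
    """Fit a depth-1 tree: the (feature, threshold) split with fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            p_left = sum(left) / len(left)
            p_right = sum(right) / len(right) if right else p_left
            errors = (min(p_left, 1 - p_left) * len(left)
                      + min(p_right, 1 - p_right) * len(right))
            if best is None or errors < best[0]:
                best = (errors, f, t, p_left, p_right)
    return best[1:]  # (feature, threshold, P(1 | left), P(1 | right))

def fit_ensemble(rows, labels, n_trees, rng):
    """Bagging: each tree is fit to a bootstrap resample of the rows."""
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        ensemble.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return ensemble

def tree_probabilities(ensemble, row):
    """One predicted probability per tree for a single record."""
    return [pl if row[f] <= t else pr for f, t, pl, pr in ensemble]

def summarize(probs):
    """Hypothetical features of the distribution of tree predictions."""
    return [statistics.mean(probs), statistics.pstdev(probs),
            sum(p > 0.5 for p in probs) / len(probs)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def fit_logistic(features, labels, steps=300, rate=0.5):
    """Logistic regression by batch gradient descent; last weight is the intercept."""
    weights = [0.0] * (len(features[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(features, labels):
            xs = x + [1.0]
            p = sigmoid(sum(w * v for w, v in zip(weights, xs)))
            for k, v in enumerate(xs):
                grad[k] += (p - y) * v
        weights = [w - rate * g / len(features) for w, g in zip(weights, grad)]
    return weights

def predict_logistic(weights, x):
    return sigmoid(sum(w * v for w, v in zip(weights, x + [1.0])))

rng = random.Random(0)

def simulate(n):
    """Simulated two-feature data with a noisy 0-1 response (illustration only)."""
    rows = [[rng.random(), rng.random()] for _ in range(n)]
    labels = [int(a + b + rng.gauss(0, 0.2) > 1.0) for a, b in rows]
    return rows, labels

x1_rows, x1_labels = simulate(150)   # X1: used to fit the ensemble
x2_rows, x2_labels = simulate(150)   # X2: used to fit the second stage
test_rows, _ = simulate(100)

ensemble = fit_ensemble(x1_rows, x1_labels, n_trees=25, rng=rng)
x2_features = [summarize(tree_probabilities(ensemble, r)) for r in x2_rows]
second_stage = fit_logistic(x2_features, x2_labels)
test_pred = [predict_logistic(second_stage, summarize(tree_probabilities(ensemble, r)))
             for r in test_rows]
```

Holding out X2 means the second stage is fit to predictions the ensemble makes on records it has not seen, which is the usual reason for splitting the training data in a stacked setup.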
[Figure: Misclassification Error Rate (%) by Number of Trees. Legend: PREDS.]
[Figure: Random Forest Out-of-Sample Misclassification Error Rate By Number of Trees, Syria Dataset. x-axis: Number of Trees; y-axis: Misclassification Error Rate (%). Legend: PREDS.]