SLIDE 1

New Prediction Methods for Tree Ensembles with Applications in Record Linkage

Samuel L. Ventura

Rebecca Nugent

Department of Statistics, Carnegie Mellon University

June 11, 2015. 45th Symposium on the Interface: Computing Science and Statistics

SLIDE 2

Why Do We Need Record Linkage?

What happens if we search “Michael Jordan Statistics” on Google?

SLIDE 3

Google: “Michael Jordan Statistics”

SLIDE 4

What is Record Linkage?

Record Linkage: Match records corresponding to unique entities within and across data sources.

Fellegi & Sunter (1969) introduced several important early concepts:

◮ Similarity scores to quantify the similarity of names, addresses, etc.
◮ Theoretical framework for estimating probabilities of matching
◮ Examining the effect that reducing the comparison space has on error rates

Many extensions and alternatives to the Fellegi-Sunter methodology:

◮ Larsen & Rubin (2001): Mixture models for automated RL
◮ Sadinle & Fienberg (2013): Extend Fellegi-Sunter to 3+ files
◮ Steorts et al (2015): Bayesian approach to graphical RL
◮ Ventura et al (2014, 2015): Supervised learning approaches for RL

SLIDE 5

Example: United States Patent & Trademark Office

Inventors often have similar identifying information:

Last      First    Mid   City            St/Co  Assignee
Zarian    James    R.    Corona          CA     Lumentye
Zarian    James    N/A   Corona          CA     Lumentye Corp.
Zarian    Jamshid  C.    Woodland Hills  CA     Lumentye
De Groot  Peter    J.    Middletown      CT     Zygo
de Groot  P.       N/A   Middleton       CT     Boeing
de Groot  Paul     N/A   Grenoble        FR     Thomson CSF

Six records from USPTO database (Akinsanmi et al, 2014; Ventura et al, 2015)

How do we know which records refer to the same person? How do we compare strings? Allow for typos?
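To see how fields might be scored, here is a minimal sketch in Python. The standard library's difflib.SequenceMatcher stands in for the string metric (the slides do not name one; Jaro-Winkler is a common choice in record linkage), and the handling of N/A fields is an illustrative assumption.

```python
# Hedged sketch: difflib stands in for an unspecified string metric;
# field names and N/A handling are illustrative assumptions.
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Similarity score in [0, 1]; 1.0 means the strings agree exactly."""
    if "N/A" in (a, b):
        return 0.0  # assumed handling of missing fields
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two of the six USPTO-style records from the table above
r1 = {"last": "Zarian", "first": "James",   "city": "Corona"}
r2 = {"last": "Zarian", "first": "Jamshid", "city": "Woodland Hills"}

profile = {field: round(sim(r1[field], r2[field]), 2) for field in r1}
print(profile)  # last: 1.0; first: 0.67 (typo-tolerant); city: low
```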

SLIDE 6

Picture: Our Record Linkage Framework

[Figure: "Hierarchical Clustering Dendrogram, Example USPTO Inventors." Leaves are the six example records (Jamshid Zarian, James Zarian, James Zarian, Paul de Groot, Peter De Groot, P. de Groot); vertical axis is Dissimilarity = 1 − P(Match).]

Within (across) blocks, records are similar (dissimilar)
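A minimal sketch of this picture, assuming SciPy's hierarchical clustering and made-up match probabilities for three of the Zarian records:

```python
# Sketch of the clustering step: dissimilarity = 1 - P(Match), then
# average-linkage hierarchical clustering. Probabilities are made up.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

labels = ["Zarian, James R.", "Zarian, James", "Zarian, Jamshid C."]
p_match = np.array([[1.00, 0.95, 0.40],
                    [0.95, 1.00, 0.35],
                    [0.40, 0.35, 1.00]])   # assumed pairwise P(Match)

d = 1.0 - p_match                          # dissimilarity = 1 - P(Match)
np.fill_diagonal(d, 0.0)
Z = linkage(squareform(d), method="average")       # linkage choice assumed
groups = fcluster(Z, t=0.5, criterion="distance")  # cut height assumed
print(dict(zip(labels, groups)))  # the two James records link; Jamshid splits off
```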

SLIDE 7

Outline: Our Record Linkage Framework

  • 1. Partition the data into groups of similar records, called “blocks”

◮ Reducing the comparison space more efficiently
◮ Preserving false negative error rates

  • 2. Within blocks: Estimate probability that each record-pair matches

◮ Quantify the similarity of record-pairs
◮ Classifier ensembles when training data is prohibitively large
◮ Improving predictions for classifier ensembles with distributions of estimated probabilities

  • 3. Within blocks: Identify unique entities

◮ Convert estimated probabilities to dissimilarities
◮ Hierarchical clustering to link groups of records

SLIDE 8

Quantify the Similarity of each Record-Pair: $\gamma_{ij}$

Let $\gamma_{ij} = (\gamma_{ij1}, \ldots, \gamma_{ijM})$ be the similarity profile for records $x_i, x_j$. Calculate a similarity score $\gamma_{ijm}$ for each field $m = 1, \ldots, M$.

i  j  Last  First  Mid   City  St  Assignee
1  4  0.93  1.00   0.75  1.00  1   0.50
1  7  0.93  1.00   0.00  0.42      0.50

Need to calculate $\binom{n}{2}$ pairwise comparisons for $n$ records

◮ 1 million records ≈ 500 billion comparisons
◮ Computational tools in place to reduce complexity
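As a quick check on that arithmetic, a sketch using only Python's standard library:

```python
# Counting the comparison space: C(n, 2) pairs for n records.
from itertools import combinations
from math import comb

records = ["Zarian, James", "Zarian, Jamshid", "de Groot, Peter", "de Groot, Paul"]
pairs = list(combinations(range(len(records)), 2))  # every (i, j) with i < j
assert len(pairs) == comb(len(records), 2) == 6

print(f"{comb(1_000_000, 2):,}")  # 499,999,500,000 -- about 500 billion
```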

Does $\gamma_{ij}$ help us separate matching and non-matching pairs?

SLIDE 9

Distributions of Similarity Scores given Match/Non-Match

[Figure: "Distribution of Similarity Scores for state, Conditional on Match vs. Non-Match." state is a binary exact-match score: 1 if the fields match exactly, 0 otherwise.]

SLIDE 10

Distributions of Similarity Scores given Match/Non-Match

[Figure: "Distribution of Similarity Scores for suffix, Conditional on Match vs. Non-Match." suffix is a binary exact-match score: 1 if the fields match exactly, 0 otherwise.]

SLIDE 11

Distributions of Similarity Scores given Match/Non-Match

[Figure: "Distribution of Similarity Scores for city, Conditional on Match vs. Non-Match." Densities of the continuous city similarity scores for matching and non-matching pairs.]

SLIDE 12

Distributions of Similarity Scores given Match/Non-Match

[Figure: "Distribution of Similarity Scores for first, Conditional on Match vs. Non-Match." Densities of the first-name similarity scores for matching and non-matching pairs.]

SLIDE 13

Find the Probability that Record-Pairs Match: $\hat{p}_{ij}$

Our approach: supervised learning to estimate $P(x_i = x_j)$

◮ Train a classifier on comparisons of labeled records
◮ Use the classifier to predict whether record-pairs match
◮ Result: $\hat{p}_{ij}$ for any record-pair

i  j  Last  First  Mid   City  St  Assignee  Match?
1  4  0.93  1.00   0.75  1.00  1   0.50      Yes
1  7  0.93  1.00   0.00  0.42      0.50      No

Given a classifier $m$, find $P(x_i = x_j) = \hat{p}_{ij} = m(\gamma_{ij})$
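A minimal sketch of this step, assuming scikit-learn's RandomForestClassifier as the classifier $m$. The two labeled pairs are from the table above; the lost St score for pair (1, 7) is filled with 0 purely for illustration:

```python
# Sketch: train a classifier m on labeled similarity profiles, then
# estimate p_hat_ij = m(gamma_ij) for new record-pairs. Toy-sized data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Columns: Last, First, Mid, City, St, Assignee (St for pair (1,7) assumed 0)
gamma = np.array([[0.93, 1.00, 0.75, 1.00, 1.0, 0.50],   # pair (1, 4): match
                  [0.93, 1.00, 0.00, 0.42, 0.0, 0.50]])  # pair (1, 7): non-match
y = np.array([1, 0])  # Match? Yes / No

m = RandomForestClassifier(n_estimators=200, random_state=0).fit(gamma, y)

gamma_new = np.array([[0.90, 1.00, 0.50, 0.80, 1.0, 0.40]])  # a new pair
p_hat = m.predict_proba(gamma_new)[:, 1]  # estimated P(match)
```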

SLIDE 14

Outline: Our Record Linkage Framework

  • 1. Partition the data into groups of similar records, called “blocks”

◮ Reducing the comparison space more efficiently
◮ Preserving false negative error rates

  • 2. Within blocks: Estimate probability that each record-pair matches

◮ Quantify the similarity of record-pairs
◮ Classifier ensembles when training data is prohibitively large
◮ Improving predictions for classifier ensembles with distributions of estimated probabilities

  • 3. Within blocks: Identify unique entities

◮ Convert estimated probabilities to dissimilarities
◮ Hierarchical clustering to link groups of records

SLIDE 15

Ensembles: Why do we have multiple estimates of $\hat{p}_{ij}$?

Often computationally infeasible to train a single classifier

USPTO RL application: over 20 million training data observations

[Figure: "Misclassification Rate vs. Size of Training Dataset, Random Forest with 200 Trees, 10 Variables." Horizontal axis: number of training data observations (10,000 to 50,000); vertical axis: misclassification rate (%).]

Error rates stabilize as the number of training data observations increases

SLIDE 16

Ensembles: Why do we have multiple estimates of $\hat{p}_{ij}$?

Some classifiers are, by definition, ensembles (e.g., random forests)

Majority Vote of trees (Breiman, 2001)

◮ Predicted probability = Proportion of the ensemble's votes for each class
◮ Predicted class = Majority vote of classifiers in the ensemble

Mean Probability (Bauer & Kohavi, 1999)

◮ Predicted probability = Mean of all tree probabilities
◮ Predicted class = 1 if predicted probability ≥ 0.5; 0 otherwise
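A minimal sketch contrasting the two rules on one observation's per-tree probabilities (values made up for illustration):

```python
# Majority vote vs. mean probability for one observation's tree outputs.
import numpy as np

tree_probs = np.array([0.9, 0.8, 0.3, 0.6, 0.7])  # made-up per-tree P(class 1)

# Majority vote (Breiman, 2001): each tree votes for its predicted class
votes = (tree_probs >= 0.5).astype(int)
p_vote = votes.mean()            # proportion of votes for class 1 -> 0.8
class_vote = int(p_vote >= 0.5)  # -> 1

# Mean probability (Bauer & Kohavi, 1999): average the probabilities
p_mean = tree_probs.mean()       # -> 0.66
class_mean = int(p_mean >= 0.5)  # -> 1
```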

SLIDE 17

Distribution of R Predicted Probabilities

The good:

[Figure: "Distribution of Tree Probabilities" for two record-pairs. Left panel truth: non-match; right panel truth: match.]

(random forest with 500 underlying classification trees)

SLIDE 18

Distribution of R Predicted Probabilities

The bad:

[Figure: "Distribution of Tree Probabilities" for two record-pairs; the truth is match in both panels.]

(random forest with 500 underlying classification trees)

SLIDE 19

Distribution of R Predicted Probabilities

The ugly:

[Figure: "Distribution of Tree Probabilities" for two record-pairs. Left panel truth: match; right panel truth: non-match.]

(random forest with 500 underlying classification trees)

SLIDE 20

Idea: Set Aside Some Training Data

Remember: Our training datasets are large!

◮ E.g., USPTO RL dataset has over 20 million training observations
◮ We don't need all of the training data to build an ensemble
◮ Set aside some training data and use it later...

Use an approach similar to stacked generalization/stacking (Wolpert, 1992):

◮ Split the training data into two pieces
◮ On the first piece, build the classifier ensemble
◮ On the second piece, treat each model/predictor as a covariate
◮ Build a logistic regression model to weight predictions from the ensemble

Use stacking for random forests?

◮ Issue for stacking with RF: Logistic regression with 500+ covariates?
◮ Alternative: Use distribution summary statistics as covariates
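A minimal sketch of classic stacking, with synthetic data; the ensemble size, the models, and the split are illustrative assumptions, not the authors' choices:

```python
# Classic stacking (Wolpert, 1992): ensemble on piece 1, logistic
# regression over the members' predictions on piece 2. Synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 6))                  # stand-in similarity profiles
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)  # synthetic match labels

X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

# Piece 1: a small bagged-tree ensemble (5 members, for brevity)
ensemble = []
for r in range(5):
    idx = rng.integers(0, len(X1), len(X1))  # bootstrap sample
    ensemble.append(DecisionTreeClassifier(random_state=r).fit(X1[idx], y1[idx]))

# Piece 2: each member's predicted probability becomes a covariate
Z2 = np.column_stack([f.predict_proba(X2)[:, 1] for f in ensemble])
stacker = LogisticRegression().fit(Z2, y2)  # weights the members' predictions
```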

SLIDE 21

PRediction with Ensembles using Distribution Summaries

PREDS: Use both pieces of training data $(X_1, X_2)$ for better prediction

  • 1. Build a classifier ensemble, $\{F_{1,r}\}_{r=1}^{R}$, on $X_1$

  • 2. Apply $\{F_{1,r}\}_{r=1}^{R}$ to $X_2$; this yields a distribution of predictions for each observation in $X_2$

  • 3. Train a new model, $F_2$, using:

◮ Covariates: Features of the distribution of predictions for $X_2$
◮ Response: The actual 0-1 response in $X_2$

PREDS: When estimating the probability for a new observation

  • 1. Apply $\{F_{1,r}\}_{r=1}^{R}$ to the test data
  • 2. Apply $F_2$ to the resulting distribution of predictions
  • 3. Use $F_2$'s resulting estimated probability as the final estimate
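A minimal sketch of the PREDS recipe, assuming a scikit-learn random forest as the ensemble and logistic regression as $F_2$; the synthetic data, the split, and the particular summary statistics (a subset of the next slide's list) are all illustrative assumptions:

```python
# PREDS sketch: ensemble on X1, summarize each observation's distribution
# of per-tree probabilities on X2, then train F2 on those summaries.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((2000, 6))                  # stand-in similarity profiles
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)  # synthetic match labels
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=1)

def tree_probs(rf, X):
    """R x n matrix: each tree's P(class 1) for every observation."""
    return np.stack([t.predict_proba(X)[:, 1] for t in rf.estimators_])

def summaries(P):
    """Distribution summaries used as F2's covariates (illustrative set)."""
    return np.column_stack([
        P.mean(axis=0),           # mean of the distribution
        (P >= 0.5).mean(axis=0),  # "majority vote" proportion
        (P > 0.9).mean(axis=0),   # mass above a threshold (0.9 assumed)
        (P < 0.1).mean(axis=0),   # mass below a threshold (0.1 assumed)
    ])

# Steps 1-3: build the ensemble on X1, apply to X2, train F2 on summaries
F1 = RandomForestClassifier(n_estimators=500, random_state=0).fit(X1, y1)
F2 = LogisticRegression().fit(summaries(tree_probs(F1, X2)), y2)

# Prediction: apply the ensemble, summarize, then apply F2
X_test = rng.random((5, 6))
p_final = F2.predict_proba(summaries(tree_probs(F1, X_test)))[:, 1]
```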

SLIDE 22

Distribution Summary Statistics

Flexible method: Can use any approach for summarizing the distribution

◮ Mean of the distribution
◮ "Majority vote" of the distribution
◮ Location of the largest mode(s)
◮ Skew of the distribution
◮ Mass of the distribution above/below a threshold
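As one example, a small helper for the largest-mode summary, estimated from a histogram of the $R$ tree probabilities (the bin count is an assumption):

```python
# Location of the largest mode of the R tree probabilities, via histogram.
import numpy as np

def largest_mode(probs, bins=20):
    counts, edges = np.histogram(probs, bins=bins, range=(0.0, 1.0))
    k = counts.argmax()
    return 0.5 * (edges[k] + edges[k + 1])  # midpoint of the fullest bin
```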

SLIDE 23

PREDS on USPTO Dataset, Varying Number of Classifiers

[Figure: "Random Forest Out-of-Sample Misclassification Error Rate by Number of Trees, USPTO Dataset." Horizontal axis: number of trees (200 to 1,000); vertical axis: misclassification error rate (%), comparing mean tree probability with PREDS.]

PREDS lowers error rates vs. mean tree probability and majority vote

SLIDE 24

Linking Death Records from the Syrian Civil War

The Human Rights Data Analysis Group (HRDAG) compiles datasets of death records from the Syrian Civil War

◮ 6+ organizations creating death records on the ground in Syria
◮ Over 200,000 total records
◮ Some lists have large overlap
◮ Fields: name, date of death, gender, governorate
◮ Field information can be missing or subject to error
◮ Some record-pairs labeled as match/non-match

Goal: Identify duplicate records

SLIDE 25

PREDS on Syria Dataset, Varying Number of Classifiers

[Figure: "Random Forest Out-of-Sample Misclassification Error Rate by Number of Trees, Syria Dataset." Horizontal axis: number of trees (200 to 1,000); vertical axis: misclassification error rate (%), comparing mean tree probability with PREDS.]

For all values of R, PREDS yields lower misclassification rates than majority vote and mean of tree probabilities

SLIDE 26

Conclusions & Future Work

In large-scale training data scenarios, the improvement in out-of-sample misclassification rates from extra training data is minimal.

How can we use the "extra" training data more efficiently?

◮ Partition training data into two pieces
◮ Train classifier ensemble on one piece, predict on the other
◮ Summarize the distribution of predicted probabilities
◮ Build a new model: match vs. non-match | distribution summaries

Future work:

◮ Size of training datasets: depends on application, # of covariates?
◮ Compare to other stacking approaches for ensembles
◮ Study error properties associated with stacking for record linkage
