SLIDE 1

New Prediction Methods for Tree Ensembles with Applications in Record Linkage

Samuel L. Ventura

Rebecca Nugent

Department of Statistics, Carnegie Mellon University

June 11, 2015. 45th Symposium on the Interface: Computing Science and Statistics

SLIDE 2

Why Do We Need Record Linkage?

What happens if we search “Michael Jordan Statistics” on Google?

SLIDE 3

Google: “Michael Jordan Statistics”

SLIDE 4

What is Record Linkage?

Record Linkage: Match records corresponding to unique entities within and across data sources.

Fellegi & Sunter (1969) introduced several important early concepts:

◮ Similarity scores to quantify the similarity of names, addresses, etc.
◮ Theoretical framework for estimating probabilities of matching
◮ Examining the effect that reducing the comparison space has on error rates

Many extensions and alternatives to the Fellegi-Sunter methodology:

◮ Larsen & Rubin (2001): Mixture models for automated RL
◮ Sadinle & Fienberg (2013): Extend Fellegi-Sunter to 3+ files
◮ Steorts et al (2015): Bayesian approach to graphical RL
◮ Ventura et al (2014, 2015): Supervised learning approaches for RL

SLIDE 5

Example: United States Patent & Trademark Office

Inventors often have similar identifying information:

Last      First    Mid   City            St/Co  Assignee
Zarian    James    R.    Corona          CA     Lumentye
Zarian    James    N/A   Corona          CA     Lumentye Corp.
Zarian    Jamshid  C.    Woodland Hills  CA     Lumentye
De Groot  Peter    J.    Middletown      CT     Zygo
de Groot  P.       N/A   Middleton       CT     Boeing
de Groot  Paul     N/A   Grenoble        FR     Thomson CSF

Six records from USPTO database (Akinsanmi et al, 2014; Ventura et al, 2015)

How do we know which records refer to the same person? How do we compare strings? Allow for typos?
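To see how fields might be scored, here is a minimal sketch in Python. The standard library's difflib.SequenceMatcher stands in for the string metric (the slides do not name one; Jaro-Winkler is a common choice in record linkage), and the handling of N/A fields is an illustrative assumption.

```python
# Hedged sketch: difflib stands in for an unspecified string metric;
# field names and N/A handling are illustrative assumptions.
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Similarity score in [0, 1]; 1.0 means the strings agree exactly."""
    if "N/A" in (a, b):
        return 0.0  # assumed handling of missing fields
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two of the six USPTO-style records from the table above
r1 = {"last": "Zarian", "first": "James",   "city": "Corona"}
r2 = {"last": "Zarian", "first": "Jamshid", "city": "Woodland Hills"}

profile = {field: round(sim(r1[field], r2[field]), 2) for field in r1}
print(profile)  # last: 1.0; first: 0.67 (typo-tolerant); city: low
```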

SLIDE 6

Picture: Our Record Linkage Framework

[Figure: "Hierarchical Clustering Dendrogram, Example USPTO Inventors." Leaves are the six example records (Jamshid Zarian, James Zarian, James Zarian, Paul de Groot, Peter De Groot, P. de Groot); vertical axis is Dissimilarity = 1 − P(Match).]

Within (across) blocks, records are similar (dissimilar)
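A minimal sketch of this picture, assuming SciPy's hierarchical clustering and made-up match probabilities for three of the Zarian records:

```python
# Sketch of the clustering step: dissimilarity = 1 - P(Match), then
# average-linkage hierarchical clustering. Probabilities are made up.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

labels = ["Zarian, James R.", "Zarian, James", "Zarian, Jamshid C."]
p_match = np.array([[1.00, 0.95, 0.40],
                    [0.95, 1.00, 0.35],
                    [0.40, 0.35, 1.00]])   # assumed pairwise P(Match)

d = 1.0 - p_match                          # dissimilarity = 1 - P(Match)
np.fill_diagonal(d, 0.0)
Z = linkage(squareform(d), method="average")       # linkage choice assumed
groups = fcluster(Z, t=0.5, criterion="distance")  # cut height assumed
print(dict(zip(labels, groups)))  # the two James records link; Jamshid splits off
```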

SLIDE 7

Outline: Our Record Linkage Framework

  • 1. Partition the data into groups of similar records, called “blocks”

◮ Reducing the comparison space more efficiently
◮ Preserving false negative error rates

  • 2. Within blocks: Estimate probability that each record-pair matches

◮ Quantify the similarity of record-pairs
◮ Classifier ensembles when training data is prohibitively large
◮ Improving predictions for classifier ensembles with distributions of estimated probabilities

  • 3. Within blocks: Identify unique entities

◮ Convert estimated probabilities to dissimilarities
◮ Hierarchical clustering to link groups of records

SLIDE 8

Quantify the Similarity of each Record-Pair: $\gamma_{ij}$

Let $\gamma_{ij} = (\gamma_{ij1}, \ldots, \gamma_{ijM})$ be the similarity profile for records $x_i, x_j$. Calculate a similarity score $\gamma_{ijm}$ for each field $m = 1, \ldots, M$.

i  j  Last  First  Mid   City  St  Assignee
1  4  0.93  1.00   0.75  1.00  1   0.50
1  7  0.93  1.00   0.00  0.42      0.50

Need to calculate $\binom{n}{2}$ pairwise comparisons for $n$ records

◮ 1 million records ≈ 500 billion comparisons
◮ Computational tools in place to reduce complexity
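As a quick check on that arithmetic, a sketch using only Python's standard library:

```python
# Counting the comparison space: C(n, 2) pairs for n records.
from itertools import combinations
from math import comb

records = ["Zarian, James", "Zarian, Jamshid", "de Groot, Peter", "de Groot, Paul"]
pairs = list(combinations(range(len(records)), 2))  # every (i, j) with i < j
assert len(pairs) == comb(len(records), 2) == 6

print(f"{comb(1_000_000, 2):,}")  # 499,999,500,000 -- about 500 billion
```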

Does $\gamma_{ij}$ help us separate matching and non-matching pairs?

SLIDE 9

Distributions of Similarity Scores given Match/Non-Match

[Figure: "Distribution of Similarity Scores for state, Conditional on Match vs. Non-Match." state is a binary exact-match score: 1 if the fields match exactly, 0 otherwise.]

SLIDE 10

Distributions of Similarity Scores given Match/Non-Match

[Figure: "Distribution of Similarity Scores for suffix, Conditional on Match vs. Non-Match." suffix is a binary exact-match score: 1 if the fields match exactly, 0 otherwise.]

SLIDE 11

Distributions of Similarity Scores given Match/Non-Match

[Figure: "Distribution of Similarity Scores for city, Conditional on Match vs. Non-Match." Densities of the continuous city similarity scores for matching and non-matching pairs.]

SLIDE 12

Distributions of Similarity Scores given Match/Non-Match

[Figure: "Distribution of Similarity Scores for first, Conditional on Match vs. Non-Match." Densities of the first-name similarity scores for matching and non-matching pairs.]

SLIDE 13

Find the Probability that Record-Pairs Match: $\hat{p}_{ij}$

Our approach: supervised learning to estimate $P(x_i = x_j)$

◮ Train a classifier on comparisons of labeled records
◮ Use the classifier to predict whether record-pairs match
◮ Result: $\hat{p}_{ij}$ for any record-pair

i  j  Last  First  Mid   City  St  Assignee  Match?
1  4  0.93  1.00   0.75  1.00  1   0.50      Yes
1  7  0.93  1.00   0.00  0.42      0.50      No

Given a classifier $m$, find $P(x_i = x_j) = \hat{p}_{ij} = m(\gamma_{ij})$
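A minimal sketch of this step, assuming scikit-learn's RandomForestClassifier as the classifier $m$. The two labeled pairs are from the table above; the lost St score for pair (1, 7) is filled with 0 purely for illustration:

```python
# Sketch: train a classifier m on labeled similarity profiles, then
# estimate p_hat_ij = m(gamma_ij) for new record-pairs. Toy-sized data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Columns: Last, First, Mid, City, St, Assignee (St for pair (1,7) assumed 0)
gamma = np.array([[0.93, 1.00, 0.75, 1.00, 1.0, 0.50],   # pair (1, 4): match
                  [0.93, 1.00, 0.00, 0.42, 0.0, 0.50]])  # pair (1, 7): non-match
y = np.array([1, 0])  # Match? Yes / No

m = RandomForestClassifier(n_estimators=200, random_state=0).fit(gamma, y)

gamma_new = np.array([[0.90, 1.00, 0.50, 0.80, 1.0, 0.40]])  # a new pair
p_hat = m.predict_proba(gamma_new)[:, 1]  # estimated P(match)
```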

SLIDE 14

Outline: Our Record Linkage Framework

  • 1. Partition the data into groups of similar records, called “blocks”

◮ Reducing the comparison space more efficiently
◮ Preserving false negative error rates

  • 2. Within blocks: Estimate probability that each record-pair matches

◮ Quantify the similarity of record-pairs
◮ Classifier ensembles when training data is prohibitively large
◮ Improving predictions for classifier ensembles with distributions of estimated probabilities

  • 3. Within blocks: Identify unique entities

◮ Convert estimated probabilities to dissimilarities
◮ Hierarchical clustering to link groups of records

SLIDE 15

Ensembles: Why do we have multiple estimates of $\hat{p}_{ij}$?

Often computationally infeasible to train a single classifier

USPTO RL application: over 20 million training data observations

[Figure: "Misclassification Rate vs. Size of Training Dataset, Random Forest with 200 Trees, 10 Variables." Horizontal axis: number of training data observations (10,000 to 50,000); vertical axis: misclassification rate (%).]

Error rates stabilize as the number of training data observations increases

SLIDE 16

Ensembles: Why do we have multiple estimates of $\hat{p}_{ij}$?

Some classifiers are, by definition, ensembles (e.g., random forests)

Majority Vote of trees (Breiman, 2001)

◮ Predicted probability = Proportion of the ensemble's votes for each class
◮ Predicted class = Majority vote of classifiers in the ensemble

Mean Probability (Bauer & Kohavi, 1999)

◮ Predicted probability = Mean of all tree probabilities
◮ Predicted class = 1 if predicted probability ≥ 0.5; 0 otherwise
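A minimal sketch contrasting the two rules on one observation's per-tree probabilities (values made up for illustration):

```python
# Majority vote vs. mean probability for one observation's tree outputs.
import numpy as np

tree_probs = np.array([0.9, 0.8, 0.3, 0.6, 0.7])  # made-up per-tree P(class 1)

# Majority vote (Breiman, 2001): each tree votes for its predicted class
votes = (tree_probs >= 0.5).astype(int)
p_vote = votes.mean()            # proportion of votes for class 1 -> 0.8
class_vote = int(p_vote >= 0.5)  # -> 1

# Mean probability (Bauer & Kohavi, 1999): average the probabilities
p_mean = tree_probs.mean()       # -> 0.66
class_mean = int(p_mean >= 0.5)  # -> 1
```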

SLIDE 17

Distribution of R Predicted Probabilities

The good:

[Figure: "Distribution of Tree Probabilities" for two record-pairs. Left panel truth: non-match; right panel truth: match.]

(random forest with 500 underlying classification trees)

SLIDE 18

Distribution of R Predicted Probabilities

The bad:

[Figure: "Distribution of Tree Probabilities" for two record-pairs; the truth is match in both panels.]

(random forest with 500 underlying classification trees)

SLIDE 19

Distribution of R Predicted Probabilities

The ugly:

[Figure: "Distribution of Tree Probabilities" for two record-pairs. Left panel truth: match; right panel truth: non-match.]

(random forest with 500 underlying classification trees)

SLIDE 20

Idea: Set Aside Some Training Data

Remember: Our training datasets are large!

◮ E.g., USPTO RL dataset has over 20 million training observations
◮ We don't need all of the training data to build an ensemble
◮ Set aside some training data and use it later...

Use an approach similar to stacked generalization/stacking (Wolpert, 1992):

◮ Split the training data into two pieces
◮ On the first piece, build the classifier ensemble
◮ On the second piece, treat each model/predictor as a covariate
◮ Build a logistic regression model to weight predictions from the ensemble

Use stacking for random forests?

◮ Issue for stacking with RF: Logistic regression with 500+ covariates?
◮ Alternative: Use distribution summary statistics as covariates
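A minimal sketch of classic stacking, with synthetic data; the ensemble size, the models, and the split are illustrative assumptions, not the authors' choices:

```python
# Classic stacking (Wolpert, 1992): ensemble on piece 1, logistic
# regression over the members' predictions on piece 2. Synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 6))                  # stand-in similarity profiles
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)  # synthetic match labels

X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

# Piece 1: a small bagged-tree ensemble (5 members, for brevity)
ensemble = []
for r in range(5):
    idx = rng.integers(0, len(X1), len(X1))  # bootstrap sample
    ensemble.append(DecisionTreeClassifier(random_state=r).fit(X1[idx], y1[idx]))

# Piece 2: each member's predicted probability becomes a covariate
Z2 = np.column_stack([f.predict_proba(X2)[:, 1] for f in ensemble])
stacker = LogisticRegression().fit(Z2, y2)  # weights the members' predictions
```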

SLIDE 21

PRediction with Ensembles using Distribution Summaries

PREDS: Use both pieces of training data $(X_1, X_2)$ for better prediction

  • 1. Build a classifier ensemble, $\{F_{1,r}\}_{r=1}^{R}$, on $X_1$

  • 2. Apply $\{F_{1,r}\}_{r=1}^{R}$ to $X_2$; this yields a distribution of predictions for each observation in $X_2$

  • 3. Train a new model, $F_2$, using:

◮ Covariates: Features of the distribution of predictions for $X_2$
◮ Response: The actual 0-1 response in $X_2$

PREDS: When estimating the probability for a new observation

  • 1. Apply $\{F_{1,r}\}_{r=1}^{R}$ to the test data
  • 2. Apply $F_2$ to the resulting distribution of predictions
  • 3. Use $F_2$'s resulting estimated probability as the final estimate
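A minimal sketch of the PREDS recipe, assuming a scikit-learn random forest as the ensemble and logistic regression as $F_2$; the synthetic data, the split, and the particular summary statistics (a subset of the next slide's list) are all illustrative assumptions:

```python
# PREDS sketch: ensemble on X1, summarize each observation's distribution
# of per-tree probabilities on X2, then train F2 on those summaries.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((2000, 6))                  # stand-in similarity profiles
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)  # synthetic match labels
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=1)

def tree_probs(rf, X):
    """R x n matrix: each tree's P(class 1) for every observation."""
    return np.stack([t.predict_proba(X)[:, 1] for t in rf.estimators_])

def summaries(P):
    """Distribution summaries used as F2's covariates (illustrative set)."""
    return np.column_stack([
        P.mean(axis=0),           # mean of the distribution
        (P >= 0.5).mean(axis=0),  # "majority vote" proportion
        (P > 0.9).mean(axis=0),   # mass above a threshold (0.9 assumed)
        (P < 0.1).mean(axis=0),   # mass below a threshold (0.1 assumed)
    ])

# Steps 1-3: build the ensemble on X1, apply to X2, train F2 on summaries
F1 = RandomForestClassifier(n_estimators=500, random_state=0).fit(X1, y1)
F2 = LogisticRegression().fit(summaries(tree_probs(F1, X2)), y2)

# Prediction: apply the ensemble, summarize, then apply F2
X_test = rng.random((5, 6))
p_final = F2.predict_proba(summaries(tree_probs(F1, X_test)))[:, 1]
```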

SLIDE 22

Distribution Summary Statistics

Flexible method: Can use any approach for summarizing the distribution

◮ Mean of the distribution
◮ "Majority vote" of the distribution
◮ Location of the largest mode(s)
◮ Skew of the distribution
◮ Mass of the distribution above/below a threshold
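As one example, a small helper for the largest-mode summary, estimated from a histogram of the $R$ tree probabilities (the bin count is an assumption):

```python
# Location of the largest mode of the R tree probabilities, via histogram.
import numpy as np

def largest_mode(probs, bins=20):
    counts, edges = np.histogram(probs, bins=bins, range=(0.0, 1.0))
    k = counts.argmax()
    return 0.5 * (edges[k] + edges[k + 1])  # midpoint of the fullest bin
```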

SLIDE 23

PREDS on USPTO Dataset, Varying Number of Classifiers

[Figure: "Random Forest Out-of-Sample Misclassification Error Rate by Number of Trees, USPTO Dataset." Horizontal axis: number of trees (200 to 1,000); vertical axis: misclassification error rate (%), comparing mean tree probability with PREDS.]

PREDS lowers error rates vs. mean tree probability and majority vote

SLIDE 24

Linking Death Records from the Syrian Civil War

The Human Rights Data Analysis Group (HRDAG) compiles datasets of death records from the Syrian Civil War

◮ 6+ organizations creating death records on the ground in Syria
◮ Over 200,000 total records
◮ Some lists have large overlap
◮ Fields: name, date of death, gender, governorate
◮ Field information can be missing or subject to error
◮ Some record-pairs labeled as match/non-match

Goal: Identify duplicate records

SLIDE 25

PREDS on Syria Dataset, Varying Number of Classifiers

[Figure: "Random Forest Out-of-Sample Misclassification Error Rate by Number of Trees, Syria Dataset." Horizontal axis: number of trees (200 to 1,000); vertical axis: misclassification error rate (%), comparing mean tree probability with PREDS.]

For all values of R, PREDS yields lower misclassification rates than majority vote and mean of tree probabilities

SLIDE 26

Conclusions & Future Work

In large-scale training data scenarios, the improvement in out-of-sample misclassification rates from extra training data is minimal.

How can we use the "extra" training data more efficiently?

◮ Partition training data into two pieces
◮ Train classifier ensemble on one piece, predict on the other
◮ Summarize the distribution of predicted probabilities
◮ Build a new model: match vs. non-match | distribution summaries

Future work:

◮ Size of training datasets: depends on application, # of covariates?
◮ Compare to other stacking approaches for ensembles
◮ Study error properties associated with stacking for record linkage
