Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records
Ted Enamorado, Benjamin Fifield, Kosuke Imai
Princeton University
Talk at Seoul National University, Fifth Asian Political Methodology Meeting
January 11, 2018
Enamorado, Fifield, and Imai (Princeton) Merging Large Data Sets Seoul (January 11, 2018) 1 / 18

Motivation
- In any given project, social scientists often rely on multiple data sets
- Cutting-edge empirical research often merges large-scale administrative records with other types of data
- We can easily merge data sets if there is a common unique identifier → e.g., use the merge function in R or Stata
- How should we merge data sets if no unique identifier exists? → must use variables: names, birthdays, addresses, etc.
- Variables often have measurement error and missing values → cannot use exact matching
- What if we have millions of records? → cannot merge "by hand"
- Merging data sets is an uncertain process → quantify uncertainty and error rates
- Solution: Probabilistic Model

Data Merging Can be Consequential
- Turnout validation for the American National Election Studies (ANES)
- 2012 Election: self-reported turnout (78%) ≫ actual turnout (59%)
- Ansolabehere and Hersh (2012, Political Analysis): "electronic validation of survey responses with commercial records provides a far more accurate picture of the American electorate than survey responses alone."
- Berent, Krosnick, and Lupia (2016, Public Opinion Quarterly): "Matching errors ... drive down 'validated' turnout estimates. As a result, ... the apparent accuracy [of validated turnout estimates] is likely an illusion."
- Challenge: find 2,500 survey respondents among 160 million registered voters (less than 0.002%) → finding needles in a haystack
- Problem: match ≠ registered voter, non-match ≠ non-voter

Probabilistic Model of Record Linkage
Many social scientists use deterministic methods:
- match "similar" observations (e.g., Ansolabehere and Hersh, 2016; Berent, Krosnick, and Lupia, 2016)
- proprietary methods (e.g., Catalist)
- Problems:
  1. not robust to measurement error and missing data
  2. no principled way of deciding how similar is similar enough
  3. lack of transparency
Probabilistic model of record linkage:
- originally proposed by Fellegi and Sunter (1969, JASA)
- enables the control of error rates
- Problems:
  1. current implementations do not scale
  2. missing data treated in ad-hoc ways
  3. does not incorporate auxiliary information

The Fellegi-Sunter Model
- Two data sets: A and B with $N_A$ and $N_B$ observations
- $K$ variables in common
- We need to compare all $N_A \times N_B$ pairs
- Agreement vector for a pair $(i, j)$: $\gamma(i, j)$

  $$\gamma_k(i, j) = \begin{cases} 0 & \text{different} \\ 1, \ldots, L_k - 2 & \text{similar} \\ L_k - 1 & \text{identical} \end{cases}$$

- Latent variable:

  $$M_{ij} = \begin{cases} 0 & \text{non-match} \\ 1 & \text{match} \end{cases}$$

- Missingness indicator: $\delta_k(i, j) = 1$ if $\gamma_k(i, j)$ is missing

How to Construct Agreement Patterns
Jaro-Winkler distance with default thresholds for string variables

                    Name                        Address
               First    Middle  Last        House  Street
Data set A
  1            James    V       Smith       780    Devereux St.
  2            John     NA      Martin      780    Devereux St.
Data set B
  1            Michael  F       Martinez    4      16th St.
  2            James    NA      Smith       780    Dvereuux St.
Agreement patterns
  A.1 − B.1    0        0       0           0      0
  A.1 − B.2    2        NA      2           2      1
  A.2 − B.1    0        NA      1           0      0
  A.2 − B.2    0        NA      0           2      1
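A minimal Python sketch of how a three-level agreement value could be derived from a Jaro-Winkler similarity. This is an illustration, not the fastLink code: the function names and the cutoffs 0.94 and 0.88 are assumptions made for the example, so the output need not reproduce every cell of the table above. Missing values (None) are passed through as NA.

```python
from typing import Optional


def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                break
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    m = len(s_seq)
    if m == 0:
        return 0.0
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3


def jaro_winkler(s: str, t: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def agreement_level(a: Optional[str], b: Optional[str],
                    cut_identical: float = 0.94,  # assumed cutoff
                    cut_similar: float = 0.88     # assumed cutoff
                    ) -> Optional[int]:
    """Return 2 (identical), 1 (similar), 0 (different), or None (missing)."""
    if a is None or b is None:
        return None
    score = jaro_winkler(a.lower(), b.lower())
    if score >= cut_identical:
        return 2
    if score >= cut_similar:
        return 1
    return 0


# Name fields of records A.1 and B.2 from the table above.
record_a = ["James", "V", "Smith"]
record_b = ["James", None, "Smith"]
pattern = [agreement_level(x, y) for x, y in zip(record_a, record_b)]
print(pattern)  # [2, None, 2]
```

Numeric fields such as house numbers would normally be compared exactly or by absolute difference rather than with a string similarity.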

Independence assumptions for computational efficiency:
1. Independence across pairs
2. Independence across variables: $\gamma_k(i, j) \perp\!\!\!\perp \gamma_{k'}(i, j) \mid M_{ij}$
3. Missing at random: $\delta_k(i, j) \perp\!\!\!\perp \gamma_k(i, j) \mid M_{ij}$

Nonparametric mixture model:

$$\prod_{i=1}^{N_A} \prod_{j=1}^{N_B} \left\{ \sum_{m=0}^{1} \lambda^m (1 - \lambda)^{1-m} \prod_{k=1}^{K} \left( \prod_{\ell=0}^{L_k - 1} \pi_{km\ell}^{\mathbf{1}\{\gamma_k(i,j) = \ell\}} \right)^{1 - \delta_k(i,j)} \right\}$$

where $\lambda = P(M_{ij} = 1)$ is the proportion of true matches and $\pi_{km\ell} = \Pr(\gamma_k(i, j) = \ell \mid M_{ij} = m)$

- Fast implementation of the EM algorithm (R package fastLink)
- EM algorithm produces the posterior matching probability $\xi_{ij}$
- Deduping to enforce one-to-one matching:
  1. Choose the pairs with $\xi_{ij} > c$ for a threshold $c$
  2. Use Jaro's linear sum assignment algorithm to choose the best matches
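The mixture model and its EM fit can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the fastLink implementation: the names `posterior` and `fit_em`, the starting values, the fixed iteration count, and the small smoothing constant are all choices made for this example. Missing fields (None) are skipped, which corresponds to the exponent $1 - \delta_k(i, j)$ being zero.

```python
from typing import List, Optional, Sequence, Tuple

Pattern = Sequence[Optional[int]]


def posterior(g: Pattern, lam: float, pi: List[List[List[float]]]) -> float:
    """Posterior match probability xi for one agreement pattern."""
    like = [1.0 - lam, lam]  # index 0: non-match, index 1: match
    for m in (0, 1):
        for k, level in enumerate(g):
            if level is not None:  # missing fields contribute nothing
                like[m] *= pi[k][m][level]
    return like[1] / (like[0] + like[1])


def fit_em(patterns: Sequence[Pattern], levels: Sequence[int],
           n_iter: int = 50, lam: float = 0.1, smooth: float = 1e-6
           ) -> Tuple[float, List[List[List[float]]], List[float]]:
    """Fit lambda and pi[k][m][level] by EM; returns (lambda, pi, xi)."""
    # Assumed starting values: non-matches lean toward disagreement,
    # matches lean toward agreement.
    pi = []
    for n_levels in levels:
        low = [float(n_levels - lev) for lev in range(n_levels)]
        high = [float(lev + 1) for lev in range(n_levels)]
        pi.append([[w / sum(low) for w in low],
                   [w / sum(high) for w in high]])
    for _ in range(n_iter):
        # E-step: posterior match probability for every pair.
        xi = [posterior(g, lam, pi) for g in patterns]
        # M-step: update the match share and the agreement probabilities.
        lam = sum(xi) / len(xi)
        for k, n_levels in enumerate(levels):
            for m in (0, 1):
                counts = [smooth] * n_levels  # smoothing avoids zero totals
                for g, x in zip(patterns, xi):
                    if g[k] is not None:
                        counts[g[k]] += x if m == 1 else 1.0 - x
                total = sum(counts)
                pi[k][m] = [c / total for c in counts]
    xi = [posterior(g, lam, pi) for g in patterns]
    return lam, pi, xi


# Toy data: 90 clear non-matches, 8 clear matches, 2 matches with a missing field.
patterns = [[0, 0, 0]] * 90 + [[2, 2, 2]] * 8 + [[2, None, 2]] * 2
lam, pi, xi = fit_em(patterns, levels=[3, 3, 3])
print(round(lam, 3), round(xi[0], 3), round(xi[-1], 3))
```

With millions of pairs one would loop over the unique agreement patterns and their counts rather than over individual pairs, since many pairs share the same pattern.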

Controlling Error Rates
1. False negative rate (FNR):

   $$\frac{\#\ \text{true matches not found}}{\#\ \text{true matches in the data}} = \frac{P(M_{ij} = 1 \mid \text{unmatched})\, P(\text{unmatched})}{P(M_{ij} = 1)}$$

2. False discovery rate (FDR):

   $$\frac{\#\ \text{false matches found}}{\#\ \text{matches found}} = P(M_{ij} = 0 \mid \text{matched})$$

We can compute FDR and FNR for any given posterior matching probability threshold $c$
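A short sketch of how the two error rates could be estimated from the posterior matching probabilities at a threshold $c$. It uses plug-in estimates (expected counts of false and missed matches) and ignores the one-to-one deduping step; the function name `error_rates` and the toy probabilities are assumptions for illustration.

```python
from typing import Sequence, Tuple


def error_rates(xi: Sequence[float], c: float) -> Tuple[float, float]:
    """Return (FDR, FNR) when pairs with xi > c are declared matches."""
    matched = [x for x in xi if x > c]
    unmatched = [x for x in xi if x <= c]
    # FDR: expected share of declared matches that are not true matches.
    fdr = sum(1.0 - x for x in matched) / len(matched) if matched else 0.0
    # FNR: expected true matches left unmatched over expected true matches.
    total = sum(xi)
    fnr = sum(unmatched) / total if total > 0 else 0.0
    return fdr, fnr


xi = [0.99, 0.90, 0.60, 0.05, 0.01]  # hypothetical posterior probabilities
fdr, fnr = error_rates(xi, c=0.85)
print(f"FDR = {fdr:.3f}, FNR = {fnr:.3f}")
```

Raising the threshold lowers the FDR and raises the FNR, which is the trade-off visible when results are reported at several thresholds.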

Simulation Studies
- 2006 voter files from California (female only; 8 million records)
- Validation data: records with no missing data (340k records)
- Linkage fields: first name, middle name, last name, date of birth, address (house number and street name), and zip code
- 2 scenarios:
  1. Unequal size: 1:100, 10:100, and 50:100, larger data 100k records
  2. Equal size (100k records each): 20%, 50%, and 80% matched
- 3 missing data mechanisms:
  1. Missing completely at random (MCAR)
  2. Missing at random (MAR)
  3. Missing not at random (MNAR)
- 3 levels of missingness: 5%, 10%, 15%
- Noise is added to first name, last name, and address
- Results below are with 10% missingness and no noise

Error Rates and Estimation Error for Turnout
[Figure: panels for 80%, 50%, and 20% overlap. Top row: false negative rate. Bottom row: absolute estimation error (percentage point). Each panel compares fastLink, partial match (ADGN), and exact match under MCAR, MAR, and MNAR.]

Runtime Comparisons
[Figure: two panels, "Equal size" (x-axis: dataset size, thousands) and "Unequal size" (x-axis: largest dataset size, thousands); y-axis: time elapsed (minutes). Methods compared: Record Linkage (Python), RecordLinkage (R), and fastLink (R).]
No blocking, single core (parallelization possible with fastLink)

Application: Merging Survey with Administrative Record
- Hill and Huber (2017, Political Behavior) study differences between donors and non-donors among CCES (2012) respondents
- CCES respondents are matched with DIME donors (2010, 2012)
- Use of a proprietary method, treating non-matches as non-donors
- Donation amount coarsened and small noise added
- 4,432 (8.1%) matched out of 54,535 CCES respondents
- We asked YouGov to apply fastLink for merging the two data sets
- We signed the NDA form → no coarsening, no noise

Merging Process
- DIME: 5 million unique contributors
- CCES: 51,184 respondents (YouGov panel only)
- Exact matching: 0.33% match rate
- Blocking: 102 blocks using state and gender
- Linkage fields: first name, middle name, last name, address (house number, street name), zip code
- Took 1 hour using a dual-core laptop
- Examples from the output of one block:

           Name                         Address
  First    Middle  Last     Street    House     Zip     Posterior
  agree    agree   agree    agree     agree     agree   1.00
  similar  NA      agree    similar   agree     agree   0.93
  agree    NA      agree    disagree  disagree  NA      0.01
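Blocking restricts comparisons to pairs that share the blocking variables, which is what keeps the number of pairs manageable. A minimal sketch, assuming records are dictionaries with hypothetical `state` and `gender` keys; the function name `blocked_pairs` is also made up for this example:

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple

Record = Dict[str, object]


def blocked_pairs(records_a: Sequence[Record], records_b: Sequence[Record],
                  keys: Tuple[str, ...] = ("state", "gender")
                  ) -> Iterator[Tuple[Record, Record]]:
    """Yield only the cross-file pairs that fall in the same block."""
    blocks: Dict[tuple, Tuple[List[Record], List[Record]]] = defaultdict(
        lambda: ([], []))
    for rec in records_a:
        blocks[tuple(rec[k] for k in keys)][0].append(rec)
    for rec in records_b:
        blocks[tuple(rec[k] for k in keys)][1].append(rec)
    for side_a, side_b in blocks.values():
        for rec_a in side_a:
            for rec_b in side_b:
                yield rec_a, rec_b


# Hypothetical toy records.
file_a = [{"id": 1, "state": "NJ", "gender": "F"},
          {"id": 2, "state": "NJ", "gender": "M"}]
file_b = [{"id": 10, "state": "NJ", "gender": "F"},
          {"id": 11, "state": "CA", "gender": "F"},
          {"id": 12, "state": "NJ", "gender": "F"}]

pairs = [(a["id"], b["id"]) for a, b in blocked_pairs(file_a, file_b)]
print(pairs)  # [(1, 10), (1, 12)]
```

Here 2 of the 6 possible pairs are compared. The cost of blocking is that true matches with an error in a blocking variable can never be found.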

Merge Results

                                           Threshold
                                   0.75     0.85     0.95     Proprietary
Number of matches         All      4945     4794     4573     4534
                          Female   2198     2156     2067     2210
                          Male     2747     2638     2506     2324
Overlap fastLink and      All      3958     3935     3880
proprietary method        Female   1878     1867     1845
                          Male     2080     2068     2035
False discovery rate      All      1.24     0.65     0.21
(FDR; %)                  Female   0.91     0.52     0.14
                          Male     1.49     0.75     0.27
False negative rate       All      15.25    17.35    20.81
(FNR; %)                  Female   5.34     6.79     10.29
                          Male     21.84    24.37    27.81
