Development of pseudonymised matching methods for linking administrative datasets
Pete Jones
Office for National Statistics
The Beyond 2011 Programme
- Office for National Statistics (ONS) conducted a review (Beyond 2011
Programme) for the future approach to the census and population statistics in England and Wales
- National Statistician made a recommendation to Government in March
2014 that there should be a predominantly online census in 2021
- This will be supplemented with increased use of administrative data
and surveys to enhance census outputs and annual statistics
- Part of our research leading up to the recommendation was to explore
an administrative data option for producing population statistics
- Involved the development of algorithms to link pseudonymised
administrative datasets and surveys
- Work continues within the 2021 Census Transformation Programme
(Beyond 2021 Research & Development Team)
The Beyond 2011 Programme
- There are additional challenges associated with the use of admin data
- Data quality – particularly lags in data being up to date
- Efficiency – need to match datasets with 60 million + records
- Public acceptability – ONS unique in holding multiple admin sources in one place
- Made the decision that names, dates of birth and addresses will be
anonymised with a hashing algorithm (SHA-256)
- Converts original identifiers into meaningless hashed values
(e.g. John hashes to XY143257461)
- Consistently maps same entities to the same hashed value
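The hashing step described above can be sketched with Python's standard `hashlib`. This is a minimal illustration, not the ONS implementation: the function name `pseudonymise` and the normalisation (trim and upper-case) are assumptions, and the slides do not say whether a salt or secret key is used. A real SHA-256 digest is 64 hexadecimal characters, so the `XY143257461` value above is best read as a stand-in.

```python
import hashlib


def pseudonymise(value: str) -> str:
    """Return the SHA-256 hex digest of a normalised identifier."""
    # Assumed normalisation (not taken from the slides): trim and upper-case.
    normalised = value.strip().upper()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


h_john = pseudonymise("John")
h_john_again = pseudonymise(" john ")
h_jon = pseudonymise("Jon")

print(h_john == h_john_again)  # the same entity maps to the same hashed value
print(h_john == h_jon)         # a one-letter difference gives an unrelated value
```

Because similar inputs produce unrelated digests, string comparison on the hashed values carries no information, which is why the later slides move the comparison into pre-processing.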
Pseudonymised Linkage Model
- Hashing data makes many of the traditional methods for
resolving inconsistencies redundant
- Cannot run direct string comparison algorithms
- Cannot use clerical resolution
- Developed alternative ways of tackling data capture inconsistencies
(1) The development of match-keys that can be derived in pre-processing and hashed before linking two datasets
Methodological Developments
Pseudonymised Match-Keys
| Key | Type | Unique records on EPR (%) |
|-----|------|---------------------------|
| 1 | Forename, Surname, DoB, Sex, Postcode | 100.00% |
| 2 | Forename initial, Surname initial, DoB, Sex, Postcode District | 99.55% |
| 3 | Forename bi-gram, Surname bi-gram, DoB, Sex, Postcode Area | 99.44% |
| 4 | Forename initial, DoB, Sex, Postcode | 99.84% |
| 5 | Surname initial, DoB, Sex, Postcode | 99.44% |
| 6 | Forename, Surname, Age, Sex, Postcode Area | 99.46% |
| 7 | Forename, Surname, Sex, Postcode | 99.19% |
| 8 | Forename, Surname, DoB, Sex | 98.87% |
| 9 | Forename, Surname, DoB, Postcode | 99.52% |
| 10 | Surname, Forename, DoB, Sex, Postcode (matched on key 1) | 100.00% |
| 11 | Middle name, Surname, DoB, Sex, Postcode (matched on key 1) | 99.90% |
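As a rough sketch of how match-keys of this kind might be derived and hashed in pre-processing, the example below builds keys 1 to 3 for two records. Several details are assumptions made for illustration: the field names, the `|` separator, the reading of "bi-gram" as the first two characters, and postcodes written with a space between outward and inward parts.

```python
import hashlib
from itertools import takewhile


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def postcode_district(postcode: str) -> str:
    """Outward part of the postcode, e.g. 'PO15' from 'PO15 5RR'."""
    return postcode.strip().upper().split()[0]


def postcode_area(postcode: str) -> str:
    """Leading letters of the outward part, e.g. 'PO' from 'PO15 5RR'."""
    return "".join(takewhile(str.isalpha, postcode_district(postcode)))


def match_keys(rec: dict) -> dict:
    """Build clear-text keys 1-3, then hash each one before linkage."""
    fn = rec["forename"].strip().upper()
    sn = rec["surname"].strip().upper()
    dob, sex = rec["dob"], rec["sex"].upper()
    pc = rec["postcode"].strip().upper()
    clear = {
        1: [fn, sn, dob, sex, pc],
        2: [fn[:1], sn[:1], dob, sex, postcode_district(pc)],
        3: [fn[:2], sn[:2], dob, sex, postcode_area(pc)],  # assumed bi-gram rule
    }
    return {key: sha256_hex("|".join(parts)) for key, parts in clear.items()}


rec_a = {"forename": "Jon", "surname": "Smyth", "dob": "1965-02-13",
         "sex": "M", "postcode": "PO15 5RR"}
rec_b = {"forename": "John", "surname": "Smith", "dob": "1965-02-13",
         "sex": "M", "postcode": "PO15 5RR"}

keys_a, keys_b = match_keys(rec_a), match_keys(rec_b)
agreeing = [key for key in keys_a if keys_a[key] == keys_b[key]]
print(agreeing)  # keys on which the two hashed records agree
```

The spelling variants defeat the exact key 1 but the looser keys 2 and 3 still agree, which is the point of deriving several keys before hashing.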
(2) Similarity Tables
- Constructed during pre-processing to support score-based methods that involve string comparison
- Non-disclosive to match single variables in isolation prior to encryption
[Diagram: Reception Server (Data Import Area) and Data Storage Area]

Original Dataset (Source 2)

| Forename | Surname | DoB | PostCode |
|----------|---------|-----|----------|
| John | Davis | 02/04/1993 | B1 2TG |
| John | Thomas | 23/07/1986 | M2 1JH |
| John | Smith | 16/06/2003 | BH12 1LT |
| Jon | Reed | 19/09/1993 | DT8 4PB |
| Jon | Ellis | 16/06/2008 | KT1 1LL |
| Jonny | Johnson | 06/01/2002 | N7 4ER |
| Jonny | Daniels | 21/10/1949 | LN22 1AR |
| Jonny | Barker | 14/10/1974 | PO11 7TG |
| Jonny | King | 26/02/1998 | SO1 4KW |
| Jonathan | Khan | 03/06/1999 | E1 2BB |
| Jonathan | Wright | 11/10/2004 | CR21 2JJ |
| Jonathan | Walker | 10/07/2002 | W5 6AD |
| … | … | … | … |

Extract list of unique forenames

List of unique forenames: John, Jon, Jonny, Jonathan, ……
- Follow the same process for the 2nd dataset import
Similarity Tables
[Diagram: Reception Server (Data Import Area) and Data Storage Area]

Source 2 Dataset

| Forename | Surname | DoB | PostCode |
|----------|---------|-----|----------|
| John | Davis | 02/04/1993 | B1 2TG |
| John | Thomas | 23/07/1986 | M2 1JH |
| John | Smith | 16/06/2003 | BH12 1LT |
| Jon | Reed | 19/09/1993 | DT8 4PB |
| Jon | Ellis | 16/06/2008 | KT1 1LL |
| Jonny | Johnson | 06/01/2002 | N7 4ER |
| Jonny | Daniels | 21/10/1949 | LN22 1AR |
| Jonnie | Barker | 14/10/1974 | PO11 7TG |
| Jonny | King | 26/02/1998 | SO1 4KW |
| Jonathan | Khan | 03/06/1999 | E1 2BB |
| Jonathan | Wright | 11/10/2004 | CR21 2JJ |
| Jonathan | Walker | 10/07/2002 | W5 6AD |
| … | … | … | … |

Identify any additional names not on list

List of unique forenames: John, Jon, Jonny, Jonathan, Jonnie, ……
- Run string comparison algorithm between all names on the list
Similarity Tables
[Diagram: List of unique forenames compared with list of unique forenames by the string comparison algorithm]

List of unique forenames: John, Jon, Jonny, Jonathan, Jonnie, ……

| Forename | Matches | Score |
|----------|---------|-------|
| John | John | 1 |
| John | Jonny | 0.88 |
| John | Jon | 0.91 |
| John | Jonathan | 0.82 |
| Jonny | Jonny | 1 |
| Jonny | John | 0.88 |
| Jonny | Jon | 0.89 |
| Jonny | Jonathan | 0.79 |
| Jon | Jon | 1 |
| Jon | John | 0.91 |
| Jon | Jonny | 0.89 |
| Jon | Jonathan | 0.81 |
| Jonathan | Jonathan | 1 |
| Jonathan | John | 0.82 |
| Jonathan | Jon | 0.81 |
| Jonathan | Jonny | 0.79 |
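The process above (compare every pair of unique names in clear text, then release only hashed names with their scores) can be sketched as follows. The slides do not name the string comparison algorithm, so `difflib.SequenceMatcher` from the Python standard library is used purely as a stand-in and its scores will differ from those shown. The 0.75 threshold and the function names are also assumptions.

```python
import hashlib
from difflib import SequenceMatcher
from itertools import product


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    # Stand-in comparator; the algorithm used for the slides is not stated.
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def build_similarity_table(names, threshold=0.75):
    """Score every ordered pair in clear text, keep pairs at or above the
    threshold, and store only the hashed names alongside the score."""
    rows = []
    for a, b in product(sorted(set(names)), repeat=2):
        score = round(similarity(a, b), 2)
        if score >= threshold:
            rows.append((sha256_hex(a.upper()), sha256_hex(b.upper()), score))
    return rows


forenames = ["John", "Jon", "Jonny", "Jonathan", "Jonnie"]
table = build_similarity_table(forenames)

# Only truncated hashes and scores are printed; no clear-text name leaves the function.
for hashed_a, hashed_b, score in table[:3]:
    print(hashed_a[:8], hashed_b[:8], score)
```

Joining two hashed datasets to this table on the hashed forename columns would then yield candidate pairs with a forename score, without the researcher seeing any name in clear text.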
Similarity Tables (example)
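Example record pair:

| Source | Forename | Surname | DoB | Sex | Pcode |
|--------|----------|---------|-----|-----|-------|
| PR | Jon | Smyth | 13/02/1965 | M | PO15 5RR |
| SC | John | Smith | 09/02/1965 | M | PO15 5RR |

Forename similarity table:

| PR_Forename | SC_Forename | Similarity Score |
|-------------|-------------|------------------|
| John | John | 1 |
| John | Jonny | 0.88 |
| John | Jon | 0.91 |
| John | Jonathan | 0.82 |
| Jonny | Jonny | 1 |
| Jonny | John | 0.88 |
| Jonny | Jon | 0.89 |
| Jonny | Jonathan | 0.79 |
| Jon | Jon | 1 |
| Jon | John | 0.91 |
| Jon | Jonny | 0.89 |
| Jon | Jonathan | 0.81 |
| Jonathan | Jonathan | 1 |
| Jonathan | John | 0.82 |
| Jonathan | Jon | 0.81 |
| Jonathan | Jonny | 0.79 |
| … | … | … |

Surname similarity table:

| PR_Surname | SC_Surname | Similarity Score |
|------------|------------|------------------|
| Smith | Smyth | 0.93 |
| Smith | Smithers | 0.87 |
| Smith | Smithson | 0.85 |
| Smith | Smith | 1 |
| Smyth | Smith | 0.93 |
| Smyth | Smithers | 0.9 |
| Smyth | Smithson | 0.83 |
| Smyth | Smyth | 1 |
| Smithers | Smith | 0.87 |
| Smithers | Smyth | 0.9 |
| Smithers | Smithson | 0.92 |
| Smithers | Smithers | 1 |
| Smithson | Smith | 0.85 |
| Smithson | Smyth | 0.83 |
| Smithson | Smithers | 0.92 |
| Smithson | Smithson | 1 |
| … | … | … |

DoB similarity table:

| PR_DoB | SC_DoB | Similarity Score |
|--------|--------|------------------|
| 13/02/1965 | 08/02/1965 | 0.67 |
| 13/02/1965 | 09/02/1965 | 0.67 |
| 13/02/1965 | 10/02/1965 | 0.67 |
| 13/02/1965 | 11/02/1965 | 0.67 |
| 13/02/1965 | 12/02/1965 | 0.67 |
| 13/02/1965 | 13/02/1965 | 1 |
| 13/02/1965 | 14/02/1965 | 0.67 |
| 13/02/1965 | 15/02/1965 | 0.67 |
| 13/02/1965 | 16/02/1965 | 0.67 |
| 13/02/1965 | 17/02/1965 | 0.67 |
| 13/02/1965 | 18/02/1965 | 0.67 |
| 13/02/1965 | 13/01/1965 | 0.67 |
| 13/02/1965 | 13/03/1995 | 0.67 |
| 13/02/1965 | 13/04/1995 | 0.67 |
| 13/02/1965 | 13/05/1995 | 0.67 |
| 13/02/1965 | 13/06/1995 | 0.67 |
| … | … | … |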
- The similarity tables identify all the candidate pairs that achieve a
specified similarity threshold on forename, surname and DoB
- The researcher will only ever see the hashed fields
- Hashed variables are now redundant (can delete them)
- The only usable information is the scores themselves
- But what do you do with the scores?
Candidate Matches
| Source 1 Forename | Source 2 Forename | Forename Score | Source 1 Surname | Source 2 Surname | Surname Score | Source 1 DoB | Source 2 DoB | DoB Score | Overall Score |
|---|---|---|---|---|---|---|---|---|---|
| EFIJ2465 | ZASG1635 | 0.78 | CTYG0289 | XHDK5456 | 0.93 | GXCX6714 | AFIQ8834 | 0.33 | 0.68 |
| EFIJ2465 | VRXM2613 | 0.91 | CTYG0289 | XHDK5456 | 0.93 | GXCX6714 | LRQP3671 | 0.67 | 0.84 |
| EFIJ2465 | HDNR3167 | 0.69 | CTYG0289 | CTYG0289 | 1 | GXCX6714 | EYGI9391 | 0.33 | 0.67 |
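The slides do not state how the Overall Score is calculated, but the three rows in the table are consistent with an unweighted mean of the forename, surname and DoB scores. The short check below treats that as an inference, not a documented formula; the dictionary keys are illustrative names.

```python
from statistics import mean

# Field scores copied from the three candidate rows in the table above.
candidates = [
    {"forename": 0.78, "surname": 0.93, "dob": 0.33},
    {"forename": 0.91, "surname": 0.93, "dob": 0.67},
    {"forename": 0.69, "surname": 1.00, "dob": 0.33},
]


def overall_score(scores: dict) -> float:
    """Unweighted mean of the field scores (an inferred rule), to 2 d.p."""
    return round(mean(scores.values()), 2)


overall = [overall_score(c) for c in candidates]
print(overall)  # [0.68, 0.84, 0.67], the Overall Score column of the table
```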
- Impractical to rely on clerical review when linking datasets at national
level
- Clerical review is redundant when records are hash encoded
- ONS have arrangements to re-identify hashed values for small samples of match candidates for clerical review (approx. 1000 records)
- Clerically matched samples can then be used as the basis of
supervised matching models
- Logistic regression has produced good results
- Dependent variable is the clerical decision (binary Y/N)
- Covariates are the similarity metrics (name / DoB), name commonality and geographic distances
- Logistic regression provides a single threshold for designating matches
/ non-matches (p = 0.5)
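A minimal sketch of the supervised step described above: fit a logistic regression to clerically labelled candidate pairs, then designate a match when the predicted probability is at least 0.5. The training rows are invented toy values, not ONS data, and only three similarity scores are used as covariates (name commonality and geographic distance are omitted for brevity). The model is fitted with plain batch gradient descent so that the example needs only the standard library; a statistics package would normally be used instead.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=1.0, epochs=20000):
    """Fit weights (intercept first) by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def match_probability(w, x) -> float:
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


# Toy training sample: (forename, surname, DoB) scores and a clerical decision.
X = [(1.00, 1.00, 1.00), (0.91, 0.93, 1.00), (0.88, 1.00, 0.67), (0.95, 0.90, 1.00),
     (0.55, 0.60, 0.33), (0.70, 0.65, 0.33), (0.60, 0.80, 0.33), (0.75, 0.70, 0.67)]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = clerical "Y", 0 = clerical "N"

weights = fit_logistic(X, y)

for pair in [(1.00, 1.00, 1.00), (0.55, 0.60, 0.33)]:
    p = match_probability(weights, pair)
    print(pair, "match" if p >= 0.5 else "non-match")
```

Because the fitted model returns a probability for any candidate pair, the single cut-off at p = 0.5 replaces separate thresholds on each similarity score.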
Training models with clerical review
Match Classification with Logistic Regression
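Classification Table (a)

| Observed Match | Predicted Match: No | Predicted Match: Yes | Percentage Correct |
|----------------|---------------------|----------------------|--------------------|
| No | 78 | 3 | 96.3 |
| Yes | 2 | 283 | 99.3 |
| Overall Percentage | | | 98.6 |

a. The cut value is .500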
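The percentages in the classification table can be reproduced from its four cell counts. The sketch below does this and also derives precision and recall for the "Yes" class; those two figures are not on the slide and are computed here only from the counts shown.

```python
# Cell counts from the classification table (observed x predicted).
tn, fp = 78, 3    # observed No:  predicted No, predicted Yes
fn, tp = 2, 283   # observed Yes: predicted No, predicted Yes

total = tn + fp + fn + tp
pct_correct_no = 100 * tn / (tn + fp)
pct_correct_yes = 100 * tp / (tp + fn)   # also the recall for matches
overall_pct = 100 * (tn + tp) / total
precision = 100 * tp / (tp + fp)         # derived here, not shown on the slide

print(round(pct_correct_no, 1), round(pct_correct_yes, 1), round(overall_pct, 1))
print(round(precision, 1))
```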
- Undertook record level comparison with links made by the 2011
Census QA Team (used exact / probabilistic / clerical matching)
- Linked a 1% sample of records from the NHS Patient Register to the
2011 Census
Testing the Algorithms
[Chart: Comparison of PR to Census/CCS match rates: Census QA and B2011. Series: Census QA match rate, B2011 match rate, B2011 Precision, B2011 Recall. Axis: 0% to 100%.]
Summary Tables
| Local Authority | Census QA match rate | Pseudonymised match rate | Pseudonymised false positive rate | Pseudonymised false negative rate |
|---|---|---|---|---|
| Birmingham | 82.0% | 81.0% | 0.4% | 1.7% |
| Westminster | 65.1% | 64.2% | 0.4% | 1.9% |
| Lambeth | 64.0% | 63.5% | 0.8% | 1.6% |
| Newham | 68.3% | 67.1% | 0.5% | 2.2% |
| Southwark | 66.3% | 65.0% | 0.4% | 2.3% |
| Powys | 94.3% | 93.4% | 0.2% | 1.2% |
| Aylesbury Vale | 89.9% | 89.6% | 0.3% | 0.6% |
| Mid Devon | 88.6% | 88.6% | 0.2% | 0.2% |
| Total | 72.7% | 71.7% | 0.5% | 1.9% |

| Local Authority Type | Census QA match rate | Pseudonymised match rate | Pseudonymised false positive rate | Pseudonymised false negative rate |
|---|---|---|---|---|
| City Local Authorities | 71.3% | 70.3% | 0.5% | 2.0% |
| Rural Local Authorities | 91.2% | 90.7% | 0.3% | 0.8% |
| Total | 72.7% | 71.7% | 0.5% | 1.9% |
- Causes of false negatives – developing new blocking strategies to
identify a higher number of match candidates
- Testing the use of unsupervised probabilistic matching
- Fellegi-Sunter framework
- EM algorithm for match probabilities
- Duplicate link method for threshold setting (Blakely & Salmond 2002)
- Exploring its potential application in 2021 Census to Coverage Survey
Matching
- Method to be used in the new Admin Data Research Centre (led by
Southampton University)
- Looking to collaborate with other NSIs and government departments
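For the Fellegi-Sunter framework mentioned above, the core calculation is a match weight summed over fields: log2(m/u) where a field agrees and log2((1-m)/(1-u)) where it disagrees, with m the probability of agreement among true matches and u the probability of agreement among non-matches. The sketch below uses hypothetical m and u values chosen only for illustration; in practice they would be estimated, for example with the EM algorithm listed above, and the field names are assumptions.

```python
import math

# Hypothetical (m, u) probabilities per field; not estimates from any ONS data.
FIELDS = {
    "forename": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "dob": (0.98, 0.0003),
}


def match_weight(agreement: dict) -> float:
    """Sum of Fellegi-Sunter log2 agreement / disagreement weights."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if agreement[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


all_agree = match_weight({"forename": True, "surname": True, "dob": True})
dob_differs = match_weight({"forename": True, "surname": True, "dob": False})
none_agree = match_weight({"forename": False, "surname": False, "dob": False})

print(round(all_agree, 2), round(dob_differs, 2), round(none_agree, 2))
```

Pairs would then be designated as matches or non-matches by comparing the total weight with a threshold, which is the role of the threshold-setting method cited above.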