Development of pseudonymised matching methods for linking administrative datasets


Apr 10, 2023


SLIDE 1

Development of pseudonymised matching methods for linking administrative datasets

Pete Jones Office for National Statistics

SLIDE 2

The Beyond 2011 Programme

The Office for National Statistics (ONS) conducted a review (the Beyond 2011 Programme) of the future approach to the census and population statistics in England and Wales
The National Statistician made a recommendation to Government in March 2014 that there should be a predominantly online census in 2021
This will be supplemented with increased use of administrative data and surveys to enhance census outputs and annual statistics
Part of our research leading up to the recommendation was to explore an administrative data option for producing population statistics
This involved the development of algorithms to link pseudonymised administrative datasets and surveys
Work continues within the 2021 Census Transformation Programme (Beyond 2021 Research & Development Team)

SLIDE 3

Pseudonymised Linkage Model

There are additional challenges associated with the use of admin data
Data quality – particularly lags in data being up to date
Efficiency – need to match datasets with 60 million+ records
Public acceptability – ONS is unique in holding multiple admin sources in one place
Made the decision that names, dates of birth and addresses will be anonymised with a hashing algorithm (SHA-256)
Converts original identifiers into meaningless hashed values (e.g. John hashes to XY143257461)
Consistently maps same entities to the same hashed value
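A minimal sketch of the hashing step, using Python's standard `hashlib`. The `pseudonymise` helper and its normalisation (trim and upper-case) are illustrative assumptions, as is the absence of any secret key or salt; the slides state only that SHA-256 is used. A real SHA-256 digest is a 64-character hex string, not the short illustrative value shown on the slide.

```python
import hashlib

def pseudonymise(value: str) -> str:
    """Return the SHA-256 hex digest of a normalised identifier.

    The normalisation (strip + upper-case) is an assumption for illustration.
    """
    normalised = value.strip().upper()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

# The same entity always maps to the same hashed value...
same = pseudonymise("John") == pseudonymise(" john ")
# ...while a near-identical spelling maps to an unrelated value,
# which is why string comparison on hashed data is not possible.
different = pseudonymise("John") != pseudonymise("Jon")
print(same, different)
```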

SLIDE 4

Methodological Developments

Hashing data makes many of the traditional methods for resolving inconsistencies redundant
Cannot run direct string comparison algorithms
Cannot use clerical resolution
Developed alternative ways of tackling data capture inconsistencies
(1) The development of match-keys that can be derived in pre-processing and hashed before linking two datasets

SLIDE 5

Pseudonymised Match-Keys

Key | Type | Unique records on EPR (%)
1 | Forename, Surname, DoB, Sex, Postcode | 100.00%
2 | Forename initial, Surname initial, DoB, Sex, Postcode District | 99.55%
3 | Forename bi-gram, Surname bi-gram, DoB, Sex, Postcode Area | 99.44%
4 | Forename initial, DoB, Sex, Postcode | 99.84%
5 | Surname initial, DoB, Sex, Postcode | 99.44%
6 | Forename, Surname, Age, Sex, Postcode Area | 99.46%
7 | Forename, Surname, Sex, Postcode | 99.19%
8 | Forename, Surname, DoB, Sex | 98.87%
9 | Forename, Surname, DoB, Postcode | 99.52%
10 | Surname, Forename, DoB, Sex, Postcode (matched on key 1) | 100.00%
11 | Middle name, Surname, DoB, Sex, Postcode (matched on key 1) | 99.90%
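The match-key idea can be sketched as follows: each key is a concatenation of (possibly truncated) fields, built in pre-processing and then hashed, so that two records agree on a key exactly when their hashes are equal. This is only an illustration under stated assumptions: the record layout, the `|` separator, the date format, deriving the postcode district by dropping the last three characters, and trying keys in numeric order are all hypothetical choices, and only keys 1, 2 and 4 are shown.

```python
import hashlib

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def match_keys(rec: dict) -> dict:
    """Derive hashed match-keys 1, 2 and 4 from a clear-text record."""
    fore, sur = rec["forename"].upper(), rec["surname"].upper()
    dob, sex = rec["dob"], rec["sex"]
    postcode = rec["postcode"].replace(" ", "").upper()
    # Postcode district = outward code; assumes a full, valid UK postcode
    district = postcode[:-3]
    raw = {
        1: (fore, sur, dob, sex, postcode),
        2: (fore[0], sur[0], dob, sex, district),
        4: (fore[0], dob, sex, postcode),
    }
    return {key: sha256_hex("|".join(parts)) for key, parts in raw.items()}

def first_agreeing_key(a: dict, b: dict):
    """Return the lowest-numbered match-key on which two records agree, or None."""
    keys_a, keys_b = match_keys(a), match_keys(b)
    for key in sorted(keys_a):
        if keys_a[key] == keys_b[key]:
            return key
    return None

# Hypothetical records that differ only in the spelling of the names
pr = {"forename": "Jon", "surname": "Smyth", "dob": "1965-02-13",
      "sex": "M", "postcode": "PO15 5RR"}
sc = {"forename": "John", "surname": "Smith", "dob": "1965-02-13",
      "sex": "M", "postcode": "PO15 5RR"}

# Key 1 fails because the spellings differ; key 2 (initials + district) agrees
print(first_agreeing_key(pr, sc))  # 2
```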

SLIDE 6

(2) Similarity Tables

Constructed during pre-processing to support score-based methods that involve string comparison
Non-disclosive to match single variables in isolation prior to encryption

[Diagram labels: Reception Server (Data Import Area); Original Dataset (Source 2); Extract list of unique forenames; List of unique forenames; Data Storage Area]

Forename | Surname | DoB | PostCode
John | Davis | 02/04/1993 | B1 2TG
John | Thomas | 23/07/1986 | M2 1JH
John | Smith | 16/06/2003 | BH12 1LT
Jon | Reed | 19/09/1993 | DT8 4PB
Jon | Ellis | 16/06/2008 | KT1 1LL
Jonny | Johnson | 06/01/2002 | N7 4ER
Jonny | Daniels | 21/10/1949 | LN22 1AR
Jonny | Barker | 14/10/1974 | PO11 7TG
Jonny | King | 26/02/1998 | SO1 4KW
Jonathan | Khan | 03/06/1999 | E1 2BB
Jonathan | Wright | 11/10/2004 | CR21 2JJ
Jonathan | Walker | 10/07/2002 | W5 6AD
… | … | … | …

List of unique forenames: John, Jon, Jonny, Jonathan, ……

SLIDE 7

Similarity Tables

Follow the same process for the 2nd dataset import

[Diagram labels: Reception Server (Data Import Area); Source 2 Dataset; Identify any additional names not on list; List of unique forenames; Data Storage Area]

Forename | Surname | DoB | PostCode
John | Davis | 02/04/1993 | B1 2TG
John | Thomas | 23/07/1986 | M2 1JH
John | Smith | 16/06/2003 | BH12 1LT
Jon | Reed | 19/09/1993 | DT8 4PB
Jon | Ellis | 16/06/2008 | KT1 1LL
Jonny | Johnson | 06/01/2002 | N7 4ER
Jonny | Daniels | 21/10/1949 | LN22 1AR
Jonnie | Barker | 14/10/1974 | PO11 7TG
Jonny | King | 26/02/1998 | SO1 4KW
Jonathan | Khan | 03/06/1999 | E1 2BB
Jonathan | Wright | 11/10/2004 | CR21 2JJ
Jonathan | Walker | 10/07/2002 | W5 6AD
… | … | … | …

List of unique forenames: John, Jon, Jonny, Jonathan, Jonnie, ……

SLIDE 8

Similarity Tables

Run string comparison algorithm between all names on the list

[Diagram labels: List of unique forenames (John, Jon, Jonny, Jonathan, Jonnie, ……), compared against itself; String comparison algorithm]

Forename | Matches | Score
John | John | 1
John | Jonny | 0.88
John | Jon | 0.91
John | Jonathan | 0.82
Jonny | Jonny | 1
Jonny | John | 0.88
Jonny | Jon | 0.89
Jonny | Jonathan | 0.79
Jon | Jon | 1
Jon | John | 0.91
Jon | Jonny | 0.89
Jon | Jonathan | 0.81
Jonathan | Jonathan | 1
Jonathan | John | 0.82
Jonathan | Jon | 0.81
Jonathan | Jonny | 0.79
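A rough sketch of how such a similarity table might be built: the unique forenames from both sources are pooled, every pair is scored in clear text, and only the hashed names and the score are kept. The slides do not name the string comparison algorithm, so `difflib.SequenceMatcher` from the standard library is used here purely as a stand-in and will not reproduce the scores shown above. The function names and the 0.7 threshold are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher
from itertools import product

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.upper().encode("utf-8")).hexdigest()

def similarity(a: str, b: str) -> float:
    # Stand-in comparator; the presentation does not say which algorithm ONS used
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio(), 2)

def build_similarity_table(names_1, names_2, threshold=0.7):
    """Score all pairs of unique names, keeping only hashed names and scores."""
    unique_names = sorted(set(names_1) | set(names_2))
    table = []
    for a, b in product(unique_names, repeat=2):
        score = similarity(a, b)
        if score >= threshold:
            table.append((sha256_hex(a), sha256_hex(b), score))
    return table

source_1 = ["John", "Jon", "Jonny", "Jonathan"]
source_2 = ["John", "Jon", "Jonny", "Jonathan", "Jonnie"]
table = build_similarity_table(source_1, source_2)

# Each row holds two hashed forenames and a score; no clear-text name survives
for hashed_a, hashed_b, score in table[:3]:
    print(hashed_a[:8], hashed_b[:8], score)
```

Because only single variables are compared in isolation before hashing, the table can be joined to the hashed datasets later without exposing any full record.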

SLIDE 9

Similarity Tables (example)

PR_Forename | PR_Surname | PR_DoB | PR_Sex | PR_Pcode
Jon | Smyth | 13/02/1965 | M | PO15 5RR

SC_Forename | SC_Surname | SC_DoB | SC_Sex | SC_Pcode
John | Smith | 09/02/1965 | M | PO15 5RR

PR_Forename | SC_Forename | Similarity Score
John | John | 1
John | Jonny | 0.88
John | Jon | 0.91
John | Jonathan | 0.82
Jonny | Jonny | 1
Jonny | John | 0.88
Jonny | Jon | 0.89
Jonny | Jonathan | 0.79
Jon | Jon | 1
Jon | John | 0.91
Jon | Jonny | 0.89
Jon | Jonathan | 0.81
Jonathan | Jonathan | 1
Jonathan | John | 0.82
Jonathan | Jon | 0.81
Jonathan | Jonny | 0.79
… | … | …

PR_Surname | SC_Surname | Similarity Score
Smith | Smyth | 0.93
Smith | Smithers | 0.87
Smith | Smithson | 0.85
Smith | Smith | 1
Smyth | Smith | 0.93
Smyth | Smithers | 0.9
Smyth | Smithson | 0.83
Smyth | Smyth | 1
Smithers | Smith | 0.87
Smithers | Smyth | 0.9
Smithers | Smithson | 0.92
Smithers | Smithers | 1
Smithson | Smith | 0.85
Smithson | Smyth | 0.83
Smithson | Smithers | 0.92
Smithson | Smithson | 1
… | … | …

PR_DoB | SC_DoB | Similarity Score
13/02/1965 | 08/02/1965 | 0.67
13/02/1965 | 09/02/1965 | 0.67
13/02/1965 | 10/02/1965 | 0.67
13/02/1965 | 11/02/1965 | 0.67
13/02/1965 | 12/02/1965 | 0.67
13/02/1965 | 13/02/1965 | 1
13/02/1965 | 14/02/1965 | 0.67
13/02/1965 | 15/02/1965 | 0.67
13/02/1965 | 16/02/1965 | 0.67
13/02/1965 | 17/02/1965 | 0.67
13/02/1965 | 18/02/1965 | 0.67
13/02/1965 | 13/01/1965 | 0.67
13/02/1965 | 13/03/1995 | 0.67
13/02/1965 | 13/04/1995 | 0.67
13/02/1965 | 13/05/1995 | 0.67
13/02/1965 | 13/06/1995 | 0.67
… | … | …

SLIDE 10

Candidate Matches

The similarity tables identify all the candidate pairs that achieve a specified similarity threshold on forename, surname and DoB
The researcher will only ever see the hashed fields
Hashed variables are now redundant (can delete them)
The only usable information is the scores themselves
But what do you do with the scores?

Source 1 Forename | Source 2 Forename | Forename Score | Source 1 Surname | Source 2 Surname | Surname Score | Source 1 DoB | Source 2 DoB | DoB Score | Overall Score
EFIJ2465 | ZASG1635 | 0.78 | CTYG0289 | XHDK5456 | 0.93 | GXCX6714 | AFIQ8834 | 0.33 | 0.68
EFIJ2465 | VRXM2613 | 0.91 | CTYG0289 | XHDK5456 | 0.93 | GXCX6714 | LRQP3671 | 0.67 | 0.84
EFIJ2465 | HDNR3167 | 0.69 | CTYG0289 | CTYG0289 | 1 | GXCX6714 | EYGI9391 | 0.33 | 0.67
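The overall scores in the table above are consistent with a simple mean of the three field scores. That the presentation computes them this way is an inference from the figures, not something the slides state. A short check, with the scores copied from the table and a hypothetical `overall_score` helper:

```python
from statistics import mean

# (forename score, surname score, DoB score) for each candidate pair in the table
candidate_scores = [
    (0.78, 0.93, 0.33),
    (0.91, 0.93, 0.67),
    (0.69, 1.0, 0.33),
]

def overall_score(scores) -> float:
    """Unweighted mean of the field-level similarity scores, to 2 d.p."""
    return round(mean(scores), 2)

overall = [overall_score(s) for s in candidate_scores]
print(overall)  # [0.68, 0.84, 0.67]
```

An unweighted mean treats every field as equally informative, which is one reason a trained model may be preferred for the final match decision.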

SLIDE 11

Training models with clerical review

Impractical to rely on clerical review when linking datasets at national level
Clerical review is redundant when records are hash encoded
ONS have arrangements to re-identify hashed values for small samples of match candidates for clerical review (approx. 1000 records)
Clerically matched samples can then be used as the basis of supervised matching models
Logistic regression has produced good results
Dependent variable is the clerical decision (binary Y/N)
Covariates are the similarity metrics (name / DoB), name commonality, geographic distances
Logistic regression provides a single threshold for designating matches / non-matches (p = 0.5)
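The supervised step can be sketched with a small logistic regression fitted by batch gradient descent in plain Python. Everything here is illustrative: the eight training rows are invented stand-ins for a clerically reviewed sample, only the three similarity scores are used as covariates (name commonality and geographic distance are omitted), and the learning rate and iteration count are arbitrary choices rather than anything taken from the presentation.

```python
import math

def sigmoid(z: float) -> float:
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def match_probability(w, x) -> float:
    """Predicted probability of a match; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=20000):
    """Fit weights by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = match_probability(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

# Hypothetical clerical sample: (forename, surname, DoB) similarity scores
X = [(1.0, 1.0, 1.0), (0.91, 0.93, 1.0), (0.88, 1.0, 0.67), (1.0, 0.93, 0.67),
     (0.78, 0.6, 0.33), (0.6, 0.7, 0.33), (0.69, 0.5, 0.0), (0.5, 0.55, 0.33)]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # clerical decision: 1 = match, 0 = non-match

w = fit_logistic(X, y)
# Designate a match when the predicted probability reaches the 0.5 cut value
predicted = [match_probability(w, x) >= 0.5 for x in X]
print(predicted)
```

With real data the model would be fitted on the clerically reviewed sample and then applied to the much larger set of hashed candidate pairs, where only the scores are available.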

SLIDE 12

Match Classification with Logistic Regression

Classification Table (a)

Observed Match | Predicted Match: No | Predicted Match: Yes | Percentage Correct
No | 78 | 3 | 96.3
Yes | 2 | 283 | 99.3
Overall Percentage | | | 98.6

a. The cut value is .500
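The percentages in the classification table can be reproduced from its four counts. The sketch below also derives precision and recall, which do not appear on this slide and are included only as a convenience; the variable names are illustrative.

```python
# Counts copied from the classification table (observed vs predicted)
true_neg, false_pos = 78, 3    # observed No:  predicted No / predicted Yes
false_neg, true_pos = 2, 283   # observed Yes: predicted No / predicted Yes

total = true_neg + false_pos + false_neg + true_pos
pct_correct_no = 100 * true_neg / (true_neg + false_pos)
pct_correct_yes = 100 * true_pos / (true_pos + false_neg)
pct_overall = 100 * (true_neg + true_pos) / total

# Derived measures, not shown on the slide
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print(round(pct_correct_no, 1), round(pct_correct_yes, 1), round(pct_overall, 1))
# 96.3 99.3 98.6
```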

SLIDE 13

Testing the Algorithms

Undertook record level comparison with links made by the 2011 Census QA Team (used exact / probabilistic / clerical matching)
Linked a 1% sample of records from the NHS Patient Register to the 2011 Census

[Chart: Comparison of PR to Census/CCS match rates: Census QA and B2011. Series: Census QA match rate, B2011 match rate, B2011 Precision, B2011 Recall. Axis: 0% to 100%]

SLIDE 14

Summary Tables

Local Authority | Census QA match rate | Pseudonymised match rate | Pseudonymised false positive rate | Pseudonymised false negative rate
Birmingham | 82.0% | 81.0% | 0.4% | 1.7%
Westminster | 65.1% | 64.2% | 0.4% | 1.9%
Lambeth | 64.0% | 63.5% | 0.8% | 1.6%
Newham | 68.3% | 67.1% | 0.5% | 2.2%
Southwark | 66.3% | 65.0% | 0.4% | 2.3%
Powys | 94.3% | 93.4% | 0.2% | 1.2%
Aylesbury Vale | 89.9% | 89.6% | 0.3% | 0.6%
Mid Devon | 88.6% | 88.6% | 0.2% | 0.2%
Total | 72.7% | 71.7% | 0.5% | 1.9%

Local Authority Type | Census QA match rate | Pseudonymised match rate | Pseudonymised false positive rate | Pseudonymised false negative rate
City Local Authorities | 71.3% | 70.3% | 0.5% | 2.0%
Rural Local Authorities | 91.2% | 90.7% | 0.3% | 0.8%
Total | 72.7% | 71.7% | 0.5% | 1.9%

SLIDE 15

Further research

Causes of false negatives – developing new blocking strategies to identify a higher number of match candidates
Testing the use of unsupervised probabilistic matching
Fellegi-Sunter framework
EM algorithm for match probabilities
Duplicate link method for threshold setting (Blakely & Salmond 2002)
Exploring its potential application in 2021 Census to Coverage Survey Matching
Southampton University)
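As a rough illustration of the Fellegi-Sunter framework: each field contributes an agreement weight log2(m/u) or a disagreement weight log2((1-m)/(1-u)), where m and u are the probabilities that the field agrees among true matches and among true non-matches. The m and u values below are hypothetical; in practice they would be estimated, for example with the EM algorithm. With hashed data, "agreement" would simply mean that the two hashed values are equal.

```python
import math

# Hypothetical (m, u) probabilities per field; real values would be estimated
M_U = {
    "forename": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "dob": (0.98, 0.001),
}

def match_weight(agreement: dict) -> float:
    """Sum the Fellegi-Sunter log2 weights for one candidate pair."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

full = match_weight({"forename": True, "surname": True, "dob": True})
partial = match_weight({"forename": False, "surname": True, "dob": True})
none = match_weight({"forename": False, "surname": False, "dob": False})
print(round(full, 2), round(partial, 2), round(none, 2))
```

Pairs whose total weight exceeds an upper threshold are designated matches and those below a lower threshold non-matches; choosing those thresholds is a separate step.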