Development of pseudonymised matching methods for linking - - PowerPoint PPT Presentation



SLIDE 1

Development of pseudonymised matching methods for linking administrative datasets

Pete Jones Office for National Statistics

SLIDE 2

The Beyond 2011 Programme

  • Office for National Statistics (ONS) conducted a review (Beyond 2011 Programme) for the future approach to the census and population statistics in England and Wales
  • National Statistician made a recommendation to Government in March 2014 that there should be a predominantly online census in 2021
  • This will be supplemented with increased use of administrative data and surveys to enhance census outputs and annual statistics
  • Part of our research leading up to the recommendation was to explore an administrative data option for producing population statistics
  • Involved the development of algorithms to link pseudonymised administrative datasets and surveys
  • Work continues within the 2021 Census Transformation Programme (Beyond 2021 Research & Development Team)

SLIDE 3

Pseudonymised Linkage Model

  • There are additional challenges associated with the use of admin data
  • Data quality – particularly lags in data being up to date
  • Efficiency – need to match datasets with 60 million + records
  • Public acceptability – ONS unique in holding multiple admin sources in one place
  • Made the decision that names, dates of birth and addresses will be anonymised with a hashing algorithm (SHA-256)
  • Converts original identifiers into meaningless hashed values (e.g. John hashes to XY143257461)
  • Consistently maps same entities to the same hashed value
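The hashing step can be sketched in a few lines. This is a minimal illustration only: the trimming and upper-casing, the optional salt, and the function name pseudonymise are assumptions, not details from the slides. A real SHA-256 digest is 64 hexadecimal characters, so the value "XY143257461" on the slide is best read as a stand-in.

```python
import hashlib


def pseudonymise(value: str, salt: str = "") -> str:
    """Return the SHA-256 hex digest of a normalised identifier.

    The normalisation (trim, upper-case) and the optional salt are
    illustrative assumptions, not details taken from the slides.
    """
    normalised = value.strip().upper()
    return hashlib.sha256((salt + normalised).encode("utf-8")).hexdigest()


# The same entity always maps to the same hashed value.
print(pseudonymise("John") == pseudonymise(" john "))
# Near-identical names give unrelated hashes, so string comparison
# on the hashed values is not possible.
print(pseudonymise("John") == pseudonymise("Jon"))
```

Because a one-character difference produces an unrelated digest, any tolerance for data capture errors has to be built in before hashing.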

SLIDE 4

Methodological Developments

  • Hashing data makes many of the traditional methods for resolving inconsistencies redundant
  • Cannot run direct string comparison algorithms
  • Cannot use clerical resolution
  • Developed alternative ways of tackling data capture inconsistencies

(1) The development of match-keys that can be derived in pre-processing and hashed before linking two datasets

SLIDE 5

Pseudonymised Match-Keys

Key  Type                                                              Unique records on EPR (%)
1    Forename, Surname, DoB, Sex, Postcode                             100.00%
2    Forename initial, Surname initial, DoB, Sex, Postcode District    99.55%
3    Forename bi-gram, Surname bi-gram, DoB, Sex, Postcode Area        99.44%
4    Forename initial, DoB, Sex, Postcode                              99.84%
5    Surname initial, DoB, Sex, Postcode                               99.44%
6    Forename, Surname, Age, Sex, Postcode Area                        99.46%
7    Forename, Surname, Sex, Postcode                                  99.19%
8    Forename, Surname, DoB, Sex                                       98.87%
9    Forename, Surname, DoB, Postcode                                  99.52%
10   Surname, Forename, DoB, Sex, Postcode (matched on key 1)          100.00%
11   Middle name, Surname, DoB, Sex, Postcode (matched on key 1)       99.90%
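A minimal sketch of how the first four match-keys might be derived and hashed during pre-processing. Several details are assumptions: the bi-gram is taken to be the first two letters of the name, the postcode district is the postcode without its last three characters, the postcode area is the leading letters, and the key number is included in the hashed string so different key types cannot collide. The record layout and the name match_keys are hypothetical.

```python
import hashlib
import re


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def match_keys(record: dict) -> dict:
    """Derive hashed match-keys 1 to 4 for one record (illustrative only)."""
    forename = record["forename"].strip().upper()
    surname = record["surname"].strip().upper()
    dob = record["dob"]
    sex = record["sex"].strip().upper()
    postcode = record["postcode"].replace(" ", "").upper()

    district = postcode[:-3]                 # assumed: "PO155RR" -> "PO15"
    area_match = re.match(r"[A-Z]+", postcode)
    area = area_match.group(0) if area_match else ""  # assumed: "PO"

    clear_keys = {
        1: (forename, surname, dob, sex, postcode),
        2: (forename[:1], surname[:1], dob, sex, district),
        3: (forename[:2], surname[:2], dob, sex, area),
        4: (forename[:1], dob, sex, postcode),
    }
    # Only the hashed keys would leave the pre-processing environment.
    return {
        number: sha256_hex(f"{number}|" + "|".join(parts))
        for number, parts in clear_keys.items()
    }


# Hypothetical records differing only in the spelling of the names.
a = match_keys({"forename": "Jon", "surname": "Smyth",
                "dob": "13/02/1965", "sex": "M", "postcode": "PO15 5RR"})
b = match_keys({"forename": "John", "surname": "Smith",
                "dob": "13/02/1965", "sex": "M", "postcode": "PO15 5RR"})
print([number for number in a if a[number] == b[number]])
```

The two records disagree on key 1 but agree on the looser keys, which is how a hierarchy of keys can recover links that an exact match on full names would miss.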

SLIDE 6

(2) Similarity Tables

  • Constructed during pre-processing to support score-based methods that involve string comparison
  • Non-disclosive to match single variables in isolation prior to encryption

Diagram labels: Data Storage Area; Reception Server (Data Import Area); Original Dataset (Source 2); Extract list of unique forenames

Original Dataset (Source 2)

Forename   Surname   DoB          PostCode
John       Davis     02/04/1993   B1 2TG
John       Thomas    23/07/1986   M2 1JH
John       Smith     16/06/2003   BH12 1LT
Jon        Reed      19/09/1993   DT8 4PB
Jon        Ellis     16/06/2008   KT1 1LL
Jonny      Johnson   06/01/2002   N7 4ER
Jonny      Daniels   21/10/1949   LN22 1AR
Jonny      Barker    14/10/1974   PO11 7TG
Jonny      King      26/02/1998   SO1 4KW
Jonathan   Khan      03/06/1999   E1 2BB
Jonathan   Wright    11/10/2004   CR21 2JJ
Jonathan   Walker    10/07/2002   W5 6AD
…          …         …            …

List of unique forenames

John
Jon
Jonny
Jonathan
……

SLIDE 7

Similarity Tables

  • Follow the same process for the 2nd dataset import

Diagram labels: Data Storage Area; Reception Server (Data Import Area); Source 2 Dataset; Identify any additional names not on list

Source 2 Dataset

Forename   Surname   DoB          PostCode
John       Davis     02/04/1993   B1 2TG
John       Thomas    23/07/1986   M2 1JH
John       Smith     16/06/2003   BH12 1LT
Jon        Reed      19/09/1993   DT8 4PB
Jon        Ellis     16/06/2008   KT1 1LL
Jonny      Johnson   06/01/2002   N7 4ER
Jonny      Daniels   21/10/1949   LN22 1AR
Jonnie     Barker    14/10/1974   PO11 7TG
Jonny      King      26/02/1998   SO1 4KW
Jonathan   Khan      03/06/1999   E1 2BB
Jonathan   Wright    11/10/2004   CR21 2JJ
Jonathan   Walker    10/07/2002   W5 6AD
…          …         …            …

List of unique forenames

John
Jon
Jonny
Jonathan
Jonnie
……

SLIDE 8

Similarity Tables

  • Run string comparison algorithm between all names on the list

Diagram labels: List of unique forenames (shown twice); String comparison algorithm

List of unique forenames

John
Jon
Jonny
Jonathan
Jonnie
……

Forename   Matches    Score
John       John       1
John       Jonny      0.88
John       Jon        0.91
John       Jonathan   0.82
Jonny      Jonny      1
Jonny      John       0.88
Jonny      Jon        0.89
Jonny      Jonathan   0.79
Jon        Jon        1
Jon        John       0.91
Jon        Jonny      0.89
Jon        Jonathan   0.81
Jonathan   Jonathan   1
Jonathan   John       0.82
Jonathan   Jon        0.81
Jonathan   Jonny      0.79
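The similarity-table construction described on the last three slides (extract unique forenames from each source, pool them, compare every pair) can be sketched as follows. The slides do not name the string comparison algorithm, so this sketch substitutes difflib.SequenceMatcher from the Python standard library as a stand-in; its scores will not reproduce the figures in the table above. The 0.75 threshold and the function name similarity_table are also assumptions.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity_table(names_source_1, names_source_2, threshold=0.75):
    """Score every ordered pair of unique names and keep pairs at or above the threshold."""
    # Pool the unique names from both sources into one list.
    names = sorted(set(names_source_1) | set(names_source_2))
    table = {}
    for left, right in product(names, repeat=2):
        # Stand-in comparator; the slides do not say which algorithm was used.
        score = SequenceMatcher(None, left.lower(), right.lower()).ratio()
        if score >= threshold:
            table[(left, right)] = round(score, 2)
    return table


source_1 = ["John", "John", "Jon", "Jonny", "Jonathan"]
source_2 = ["John", "Jon", "Jonnie", "Jonathan", "Wright"]
table = similarity_table(source_1, source_2)
for pair, score in sorted(table.items()):
    print(pair, score)
```

Because each variable is scored in isolation, the table holds names without any link to surnames, dates of birth or addresses, which is the sense in which the slide calls the step non-disclosive.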

SLIDE 9

Similarity Tables (example)

Record pair

PR_Forename  PR_Surname  PR_DoB      PR_Sex  PR_Pcode  SC_Forename  SC_Surname  SC_DoB      SC_Sex  SC_Pcode
Jon          Smyth       13/02/1965  M       PO15 5RR  John         Smith       09/02/1965  M       PO15 5RR

Forename similarity table

PR_Forename  SC_Forename  Similarity Score
John         John         1
John         Jonny        0.88
John         Jon          0.91
John         Jonathan     0.82
Jonny        Jonny        1
Jonny        John         0.88
Jonny        Jon          0.89
Jonny        Jonathan     0.79
Jon          Jon          1
Jon          John         0.91
Jon          Jonny        0.89
Jon          Jonathan     0.81
Jonathan     Jonathan     1
Jonathan     John         0.82
Jonathan     Jon          0.81
Jonathan     Jonny        0.79
…            …            …

Surname similarity table

PR_Surname  SC_Surname  Similarity Score
Smith       Smyth       0.93
Smith       Smithers    0.87
Smith       Smithson    0.85
Smith       Smith       1
Smyth       Smith       0.93
Smyth       Smithers    0.9
Smyth       Smithson    0.83
Smyth       Smyth       1
Smithers    Smith       0.87
Smithers    Smyth       0.9
Smithers    Smithson    0.92
Smithers    Smithers    1
Smithson    Smith       0.85
Smithson    Smyth       0.83
Smithson    Smithers    0.92
Smithson    Smithson    1
…           …           …

DoB similarity table

PR_DoB      SC_DoB      Similarity Score
13/02/1965  08/02/1965  0.67
13/02/1965  09/02/1965  0.67
13/02/1965  10/02/1965  0.67
13/02/1965  11/02/1965  0.67
13/02/1965  12/02/1965  0.67
13/02/1965  13/02/1965  1
13/02/1965  14/02/1965  0.67
13/02/1965  15/02/1965  0.67
13/02/1965  16/02/1965  0.67
13/02/1965  17/02/1965  0.67
13/02/1965  18/02/1965  0.67
13/02/1965  13/01/1965  0.67
13/02/1965  13/03/1995  0.67
13/02/1965  13/04/1995  0.67
13/02/1965  13/05/1995  0.67
13/02/1965  13/06/1995  0.67
…           …           …
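The DoB scores of 1 and 0.67 in this example are consistent with counting how many of the three date components (day, month, year) agree and dividing by three. That rule is an inference from the numbers, not something the slides state, and the function name dob_similarity is hypothetical. A minimal sketch under that assumption:

```python
def dob_similarity(dob_1: str, dob_2: str) -> float:
    """Share of day/month/year components that agree (assumed scoring rule).

    Dates are expected as DD/MM/YYYY strings, as in the example tables.
    """
    parts_1 = dob_1.split("/")
    parts_2 = dob_2.split("/")
    agreeing = sum(x == y for x, y in zip(parts_1, parts_2))
    return round(agreeing / 3, 2)


print(dob_similarity("13/02/1965", "13/02/1965"))  # 1.0
print(dob_similarity("13/02/1965", "09/02/1965"))  # 0.67 (day differs)
print(dob_similarity("13/02/1965", "09/03/1965"))  # 0.33 (day and month differ)
```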

SLIDE 10

Candidate Matches

  • The similarity tables identify all the candidate pairs that achieve a specified similarity threshold on forename, surname and DoB
  • The researcher will only ever see the hashed fields
  • Hashed variables are now redundant (can delete them)
  • The only usable information is the scores themselves
  • But what do you do with the scores?

Source 1 Forename  Source 2 Forename  Forename Score  Source 1 Surname  Source 2 Surname  Surname Score  Source 1 DoB  Source 2 DoB  DoB Score  Overall Score
EFIJ2465           ZASG1635           0.78            CTYG0289          XHDK5456          0.93           GXCX6714      AFIQ8834      0.33       0.68
EFIJ2465           VRXM2613           0.91            CTYG0289          XHDK5456          0.93           GXCX6714      LRQP3671      0.67       0.84
EFIJ2465           HDNR3167           0.69            CTYG0289          CTYG0289          1              GXCX6714      EYGI9391      0.33       0.67
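In all three rows of the candidate table the overall score equals the simple mean of the forename, surname and DoB scores, for example (0.78 + 0.93 + 0.33) / 3 = 0.68. The slides do not state the formula, so the mean is an inference from these rows. A minimal sketch that looks up scores by hashed value and averages them, using the hashed codes from the table (the dictionary layout and the name overall_score are hypothetical):

```python
from statistics import mean

# Hashed similarity tables: (source 1 hash, source 2 hash) -> score.
# Values are copied from the candidate table on the slide.
FORENAME_SCORES = {
    ("EFIJ2465", "ZASG1635"): 0.78,
    ("EFIJ2465", "VRXM2613"): 0.91,
    ("EFIJ2465", "HDNR3167"): 0.69,
}
SURNAME_SCORES = {
    ("CTYG0289", "XHDK5456"): 0.93,
    ("CTYG0289", "CTYG0289"): 1.0,
}
DOB_SCORES = {
    ("GXCX6714", "AFIQ8834"): 0.33,
    ("GXCX6714", "LRQP3671"): 0.67,
    ("GXCX6714", "EYGI9391"): 0.33,
}


def overall_score(record_1, record_2):
    """Mean of the three field scores, or None if any pair is not a candidate.

    Each record is a (forename hash, surname hash, DoB hash) tuple.
    """
    tables = (FORENAME_SCORES, SURNAME_SCORES, DOB_SCORES)
    scores = [t.get((h1, h2)) for t, h1, h2 in zip(tables, record_1, record_2)]
    if any(score is None for score in scores):
        return None  # below the similarity threshold on at least one field
    return round(mean(scores), 2)


source_1 = ("EFIJ2465", "CTYG0289", "GXCX6714")
print(overall_score(source_1, ("ZASG1635", "XHDK5456", "AFIQ8834")))  # 0.68
print(overall_score(source_1, ("VRXM2613", "XHDK5456", "LRQP3671")))  # 0.84
print(overall_score(source_1, ("HDNR3167", "CTYG0289", "EYGI9391")))  # 0.67
```

The lookup never needs the original names or dates: only the hashed values and their precomputed scores are used.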

SLIDE 11

Training models with clerical review

  • Impractical to rely on clerical review when linking datasets at national level
  • Clerical review is redundant when records are hash encoded
  • ONS have arrangements to re-identify hashed values for small samples of match candidates for clerical review (approx. 1000 records)
  • Clerically matched samples can then be used as the basis of supervised matching models
  • Logistic regression has produced good results
  • Dependent variable is the clerical decision (binary Y/N)
  • Covariates are the similarity metrics (name / DoB), name commonality, geographic distances
  • Logistic regression provides a single threshold for designating matches / non-matches (p = 0.5)
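A minimal sketch of the supervised step: fit a logistic regression on clerically labelled candidate pairs, then designate a pair a match when the predicted probability is at least 0.5. The training rows below are toy values invented for illustration, not ONS data, and only the three similarity scores are used as covariates (name commonality and geographic distance are omitted). The plain gradient-descent fit and all function names are assumptions; the slides do not say how the model was estimated.

```python
import math

# Toy training data, invented for illustration only (not ONS data).
# Each row: (forename score, surname score, DoB score), clerical decision (1 = match).
TRAINING = [
    ((0.91, 0.93, 0.67), 1),
    ((1.00, 1.00, 1.00), 1),
    ((0.88, 1.00, 1.00), 1),
    ((0.91, 0.87, 1.00), 1),
    ((0.78, 0.93, 0.33), 0),
    ((0.69, 1.00, 0.33), 0),
    ((0.79, 0.83, 0.33), 0),
    ((0.70, 0.80, 0.33), 0),
]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit(rows, epochs=2000, learning_rate=1.0):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in rows:
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        bias -= learning_rate * grad_b / len(rows)
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias


def match_probability(x, weights, bias) -> float:
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def is_match(x, weights, bias, threshold=0.5) -> bool:
    # Single cut-off for designating matches / non-matches, as on the slide.
    return match_probability(x, weights, bias) >= threshold


WEIGHTS, BIAS = fit(TRAINING)
print(is_match((1.0, 1.0, 1.0), WEIGHTS, BIAS))
print(is_match((0.5, 0.5, 0.0), WEIGHTS, BIAS))
```

In practice a statistics package would be used for the fit; the hand-rolled loop only keeps the sketch free of external dependencies.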

SLIDE 12

Match Classification with Logistic Regression

Classification Table (a)

                      Predicted Match
Observed Match        No     Yes    Percentage Correct
No                    78     3      96.3
Yes                   2      283    99.3
Overall Percentage                  98.6

  • a. The cut value is .500
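The percentages in the classification table follow directly from the four counts: 78 of 81 observed non-matches and 283 of 285 observed matches were predicted correctly, and 361 of all 366 pairs overall. A short check of that arithmetic (the variable and function names are illustrative):

```python
# Counts from the classification table: (observed, predicted) -> number of pairs.
COUNTS = {
    ("No", "No"): 78,
    ("No", "Yes"): 3,
    ("Yes", "No"): 2,
    ("Yes", "Yes"): 283,
}


def percentage_correct(observed: str) -> float:
    """Share of pairs with this observed outcome that were predicted correctly."""
    total = sum(n for (obs, _), n in COUNTS.items() if obs == observed)
    return round(100 * COUNTS[(observed, observed)] / total, 1)


def overall_percentage() -> float:
    correct = COUNTS[("No", "No")] + COUNTS[("Yes", "Yes")]
    return round(100 * correct / sum(COUNTS.values()), 1)


print(percentage_correct("No"))   # 96.3
print(percentage_correct("Yes"))  # 99.3
print(overall_percentage())       # 98.6
```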

SLIDE 13

Testing the Algorithms

  • Undertook record level comparison with links made by the 2011 Census QA Team (used exact / probabilistic / clerical matching)
  • Linked a 1% sample of records from the NHS Patient Register to the 2011 Census

[Chart: Comparison of PR to Census/CCS match rates: Census QA and B2011. Series: Census QA match rate, B2011 match rate, B2011 Precision, B2011 Recall. Axis: 0% to 100%.]

SLIDE 14

Summary Tables

Local Authority   Census QA match rate   Pseudonymised match rate   Pseudonymised false positive rate   Pseudonymised false negative rate
Birmingham        82.0%                  81.0%                      0.4%                                1.7%
Westminster       65.1%                  64.2%                      0.4%                                1.9%
Lambeth           64.0%                  63.5%                      0.8%                                1.6%
Newham            68.3%                  67.1%                      0.5%                                2.2%
Southwark         66.3%                  65.0%                      0.4%                                2.3%
Powys             94.3%                  93.4%                      0.2%                                1.2%
Aylesbury Vale    89.9%                  89.6%                      0.3%                                0.6%
Mid Devon         88.6%                  88.6%                      0.2%                                0.2%
Total             72.7%                  71.7%                      0.5%                                1.9%

Local Authority Type      Census QA match rate   Pseudonymised match rate   Pseudonymised false positive rate   Pseudonymised false negative rate
City Local Authorities    71.3%                  70.3%                      0.5%                                2.0%
Rural Local Authorities   91.2%                  90.7%                      0.3%                                0.8%
Total                     72.7%                  71.7%                      0.5%                                1.9%

SLIDE 15

Further research

  • Causes of false negatives – developing new blocking strategies to identify a higher number of match candidates
  • Testing the use of unsupervised probabilistic matching
  • Fellegi-Sunter framework
  • EM algorithm for match probabilities
  • Duplicate link method for threshold setting (Blakely & Salmond 2002)
  • Exploring its potential application in 2021 Census to Coverage Survey Matching
  • Method to be used in the new Admin Data Research Centre (led by Southampton University)
  • Looking to collaborate with other NSIs and government departments
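For orientation, the core of the Fellegi-Sunter framework can be sketched briefly. Each field contributes a weight of log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the probability of agreement among true matches and u the probability of agreement among non-matches; the weights are summed and compared with a threshold. The m and u values below are hypothetical placeholders. In the approach the slide describes they would be estimated with the EM algorithm, which this sketch does not implement.

```python
import math

# Hypothetical (m, u) probabilities per field, for illustration only.
# m = P(field agrees | true match), u = P(field agrees | non-match).
M_U = {
    "forename": (0.90, 0.05),
    "surname": (0.95, 0.02),
    "dob": (0.97, 0.01),
}


def field_weight(field: str, agrees: bool) -> float:
    """Fellegi-Sunter agreement or disagreement weight for one field."""
    m, u = M_U[field]
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(agreement: dict) -> float:
    """Total weight for a candidate pair given its field agreement pattern."""
    return sum(field_weight(field, agrees) for field, agrees in agreement.items())


print(match_weight({"forename": True, "surname": True, "dob": True}))
print(match_weight({"forename": False, "surname": True, "dob": True}))
print(match_weight({"forename": False, "surname": False, "dob": False}))
```

Pairs with a total weight above an upper threshold would be designated matches and those below a lower threshold non-matches; the slide cites the duplicate link method as one way of setting that threshold.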