Privacy Preserving Probabilistic Record Linkage (Duncan Smith)

Privacy Preserving Probabilistic Record Linkage
Duncan Smith (Duncan.G.Smith@Manchester.ac.uk)
Natalie Shlomo (Natalie.Shlomo@Manchester.ac.uk)
Social Statistics, School of Social Sciences, University of Manchester
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 262608 (DwB - Data without Boundaries).

Topics Covered
• Introduction
• Probabilistic Record Linkage
• String Anonymisation
• Putting the probabilities back into Privacy-Preserving Record Linkage
• Experiment
• Discussion

Introduction
• Probabilistic record linkage was developed by Fellegi and Sunter (1969)
• Administrative sources are being used to improve the quality of surveys or to replace traditional censuses
• Traditionally, all datasets are held in one location (NSI) and matching variables (first name, last name, address) are used to link data without the need for anonymisation
• Data on individuals may be in distinct databases and may be owned by different custodians: Alice (A) and Bob (B)
• Privacy restrictions prevent the release of certain variables, or information that uniquely identifies an individual is suppressed/coarsened

Introduction
• The CS literature provides techniques for anonymising identifying variables
• A third party (Carole) only sees matching variables and returns pairs of unique record IDs (assigned by Alice and Bob)
• Two possible scenarios (there are more…):
  • Trusted Carole: sees the true values of a single matching variable
  • Non-trusted Carole: sees anonymised values of a single matching variable
• Privacy preserving record linkage (PPRL) allows exact matching and can allow linkage based on similarity scores generated from anonymised values
• F&S probabilistic record linkage is typically not used in PPRL

Introduction
• Alice and Bob clean, harmonize and standardize data and anonymise matching variables (using the same method and seed)
• In our new approach, we apply probabilistic record linkage to anonymised values to obtain a probability of a correct match (PPPRL)
• Motivation:
  • Data can be held within an archive, and users can carry out PPPRL within a 'black box' for dynamic database integration
  • Three-party Alice, Bob, Carole scenario as set out in the UK Beyond 2011 project, where Carole has access to original values and can calculate string comparators
• In PPRL, there is no possibility of clerical review and links are classified into 2 classes: true matches and false matches

Probabilistic Record Linkage
• F&S probabilistic record linkage uses a Binomial EM algorithm based on an agree/disagree indicator $\gamma = (\gamma_i,\ i \in \{1, \dots, p\})$ to estimate the likelihood ratio
• Matching score based on the sum of the log of the likelihood ratio $m(\gamma) / u(\gamma)$, where $m(\gamma)$ is the probability of agreement given that the pair is a match and $u(\gamma)$ is the probability of agreement given that it is not a match
• String comparators, e.g. Jaro-Winkler, are used to adjust the matching score based on partial agreements, e.g. typing errors
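The matching score on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the matching variables are conditionally independent (so per-variable log likelihood ratios can be summed), and the function name `match_weight` and the values in `m` and `u` are hypothetical.

```python
import math

def match_weight(agreements, m, u):
    """Sum of log likelihood ratios for one record pair.

    agreements[i] is True if matching variable i agrees; m[i] and u[i] are
    the probabilities of agreement given a match and given a non-match.
    Assumes conditional independence between matching variables.
    """
    total = 0.0
    for agree, m_i, u_i in zip(agreements, m, u):
        if agree:
            total += math.log(m_i / u_i)
        else:
            total += math.log((1.0 - m_i) / (1.0 - u_i))
    return total

# Hypothetical m- and u-probabilities for three matching variables.
m = [0.9, 0.95, 0.8]
u = [0.1, 0.05, 0.01]
score_all_agree = match_weight([True, True, True], m, u)
score_all_disagree = match_weight([False, False, False], m, u)
```

Agreement on a variable adds a positive weight whenever m exceeds u, and disagreement adds a negative one, so pairs that agree on more variables receive higher scores.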

String Anonymisation
• String anonymisation can use hash functions on bigrams:
  'john' → {'jo', 'oh', 'hn'} → {21299418, 21496024, 20971735}
  'jon' → {'jo', 'on'} → {21299418, 21889246}
• Minwise hashing (Broder 1997) generates a random permutation of a set of elements and returns the hash for the first ordered element
• The probability of a hash collision on the first ordered element is the Jaccard similarity score: $J_{A,B} = \frac{|A \cap B|}{|A \cup B|}$
• Estimate of the Jaccard similarity score is based on many hash values, where the number of collisions $n$ is distributed $n \sim \mathrm{Bin}(m, J_{A,B})$ ($m$ is the number of hash functions)
• And estimated by $\hat{J}_{A,B} = \frac{n}{m}$
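The minwise estimate can be sketched as below. Two assumptions are made for illustration: each random permutation is approximated by a seeded SHA-256 hash, and the helper names are hypothetical. The hash values produced will not match the integers shown on the slide.

```python
import hashlib

def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def seeded_hash(token, seed):
    """64-bit hash of a token; each seed stands in for one random permutation."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash_signature(tokens, m):
    """For each of m hash functions, keep the smallest hash over the token set."""
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(m)]

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions on which the two signatures collide (n / m)."""
    n = sum(x == y for x, y in zip(sig_a, sig_b))
    return n / len(sig_a)

a, b = bigrams("john"), bigrams("jon")
true_j = jaccard(a, b)  # 1 shared bigram out of 4 distinct bigrams
est_j = estimate_jaccard(minhash_signature(a, 1000), minhash_signature(b, 1000))
```

Both custodians would need to use the same hash functions (here, the same seeds) for the signatures to be comparable.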

String Anonymisation
• Proposed method: concatenated 1-bit minwise hashing
• Estimation of the Jaccard similarity score is: $\hat{J}_{A,B} = 2\frac{n}{m} - 1$

Example: Minwise hashes and 1-bit minwise hashes under a binary representation for S1 = {'jo', 'oh', 'hn'} and S2 = {'jo', 'on'}

Minwise hashes:
      H1          H2          H3          H4          H5          …  Hm
S1    451153726   1123790273  2501120381  2030682762  965995804
S2    797504823   1123790273  262296169   1744666338  965995804
…
Sn

1-bit minwise hashes:
      H1  H2  H3  H4  H5  …  Hm
S1    0   1   1   0   0
S2    1   1   1   0   0
…
Sn

• With 5 hash functions, the estimate of the Jaccard similarity score is 2/5 for minwise hashes and 3/5 for 1-bit hashes; the true value is 1/4
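The arithmetic in this example can be checked with a short sketch using the five hash values shown for S1 and S2. Taking the least significant bit of each hash is an assumption; it happens to reproduce the 1-bit rows on the slide. The correction 2n/m - 1 follows if non-colliding hashes agree on their bit with probability 1/2, so that the probability of bit agreement is (1 + J)/2.

```python
# Minwise hashes for S1 and S2 under H1..H5, copied from the slide.
s1 = [451153726, 1123790273, 2501120381, 2030682762, 965995804]
s2 = [797504823, 1123790273, 262296169, 1744666338, 965995804]

def one_bit(hashes):
    """Keep only the least significant bit of each hash (an assumption)."""
    return [h & 1 for h in hashes]

def estimate_minwise(h_a, h_b):
    """n / m, where n counts full hash collisions."""
    return sum(x == y for x, y in zip(h_a, h_b)) / len(h_a)

def estimate_one_bit(bits_a, bits_b):
    """2 * n / m - 1, where n counts agreeing bits."""
    n = sum(x == y for x, y in zip(bits_a, bits_b))
    return 2 * n / len(bits_a) - 1

b1, b2 = one_bit(s1), one_bit(s2)
j_minwise = estimate_minwise(s1, s2)  # collisions on H2 and H5
j_one_bit = estimate_one_bit(b1, b2)  # bits agree on H2, H3, H4 and H5
```

With only 5 hash functions both estimates are far from the true value of 1/4, which is why the number of hash functions m governs precision.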

String Anonymisation
• Simulation study: File A has 300 names; File B is obtained by perturbing File A to simulate typographical errors
• Tokenized bigrams with leading and trailing underscores
• True Jaccard scores compared with estimated scores on all pairs in A x B
• Bias in Bloom filter approaches
• Smaller variance in minwise hash compared to concatenated 1-bit hash, but requires more storage
• Concatenated hash has approximately the same MSE as Bloom filter
• Precision can be controlled by the choice of m, the number of hash functions
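The padded tokenization can be sketched as below, assuming a single underscore is added at each end of the name (the slide does not state how many). The function names are hypothetical.

```python
def padded_bigrams(name):
    """Bigrams of a name padded with one leading and one trailing underscore."""
    padded = f"_{name}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)

tokens_john = padded_bigrams("john")
tokens_jon = padded_bigrams("jon")
true_score = jaccard(tokens_john, tokens_jon)
```

Padding gives extra weight to the first and last characters: 'john' and 'jon' share 3 of 6 distinct padded bigrams, compared with 1 of 4 without padding.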

Privacy Preserving Probabilistic Record Linkage
• Extend the Binomial EM algorithm to K categories, k = 1, …, K, where each category is a grouping of similarity scores (Jaro for original values; Jaccard for anonymised values), e.g. 8 categories with (inclusive) upper bounds: [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1]
• Element in agreement vector for variable q of pair j: $\gamma^{q}_{j,k} = 1$ if the similarity score is in category k, otherwise 0
• Multinomial EM algorithm to estimate matching parameters: $\hat{p}$, $\hat{m}_{q,k}$ and $\hat{u}_{q,k}$
• Blocking: in the PPRL literature, methods include canopy clustering (McCallum et al., 2000), which divides the pairs into overlapping subsets before classification; multibit tree structures to identify similar comparison vectors under the Bloom filter framework (Bachteler et al., 2013); and more...
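The categorisation and the multinomial EM step can be sketched as below. This is an illustration under stated assumptions, not the authors' code: matching variables are treated as conditionally independent given match status, the starting values and the smoothing constant `eps` are arbitrary choices, and the toy similarity scores are invented.

```python
from bisect import bisect_left

# Inclusive upper bounds of the 8 similarity-score categories from the slide.
BOUNDS = [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1.0]

def category(score, bounds=BOUNDS):
    """Index of the first category whose inclusive upper bound is >= score."""
    return bisect_left(bounds, score)

def multinomial_em(cats, n_cat, p=0.1, n_iter=50, eps=1e-6):
    """Estimate p, m[q][k] and u[q][k] from category indices cats[j][q]."""
    n_var = len(cats[0])
    # Arbitrary start: matches favour high categories, non-matches low ones.
    ramp = [k + 1.0 for k in range(n_cat)]
    total = sum(ramp)
    m = [[w / total for w in ramp] for _ in range(n_var)]
    u = [[w / total for w in reversed(ramp)] for _ in range(n_var)]
    g = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for row in cats:
            pm, pu = p, 1.0 - p
            for q, k in enumerate(row):
                pm *= m[q][k]
                pu *= u[q][k]
            g.append(pm / (pm + pu))
        # M-step: re-estimate the match proportion and category probabilities.
        p = sum(g) / len(g)
        for q in range(n_var):
            m_cnt = [eps] * n_cat  # eps keeps every probability above zero
            u_cnt = [eps] * n_cat
            for row, g_j in zip(cats, g):
                m_cnt[row[q]] += g_j
                u_cnt[row[q]] += 1.0 - g_j
            m[q] = [c / sum(m_cnt) for c in m_cnt]
            u[q] = [c / sum(u_cnt) for c in u_cnt]
    return p, m, u, g

# Invented similarity scores for 6 pairs on 2 matching variables.
scores = [[1.0, 0.97], [0.96, 1.0], [0.1, 0.3],
          [0.15, 0.05], [0.3, 0.2], [0.25, 0.1]]
cats = [[category(s) for s in row] for row in scores]
p_hat, m_hat, u_hat, g = multinomial_em(cats, n_cat=len(BOUNDS))
```

Each row of `cats` is the compact form of the one-hot agreement vector: storing the index k is equivalent to storing a vector with a 1 in position k and 0 elsewhere.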

Experiment
• 1000 records from a Census database with attached English names (File A)
• File B generated by perturbing File A under a probabilistic approach, including swapping, deleting and transposing characters, on variables: Gender, Year of Birth, Month of Birth and First Name
• 4 perturbed datasets generated at different levels of perturbation
• A random sample of 700 records from File A and a random sample of 400 records from perturbed files used for matching
• No blocking was carried out

Experiment
PPPRL:
• Binary EM: standard EM approach based on exact matching of strings. No similarity score used
• LR weighted: takes the outputs of Binary EM and downweights likelihood ratios
• Log LR weighted: takes the outputs of Binary EM and downweights log likelihood ratios
• EM (8): multinomial EM approach with 8 bins having upper bounds [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1]. Jaccard similarity score (with padded underscores on bigrams)
• EM (15): multinomial EM approach with 15 bins having upper bounds [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.925, 0.95, 0.975, 0.999, 1]. Jaccard similarity score as above
PRL:
• Jaro: multinomial EM approach with 8 bins using Jaro string comparator
• Jaro-Winkler: as above but with Jaro-Winkler string comparator

Experiment
• Correct links identified and used to construct precision-recall plots
• Plots show, for any given threshold, the precision and recall based on false positives, true positives, false negatives and true negatives, and can be used to compare approaches
• Good approaches will produce curves in the upper right of the plot

$\text{Precision} = \frac{tp}{tp + fp}$
$\text{Recall} = \frac{tp}{tp + fn}$
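One point on a precision-recall curve can be computed as in the sketch below. The scores and labels are invented, and returning a precision of 1.0 when no pair is declared a link is a convention chosen here, not something stated on the slide.

```python
def precision_recall(scores, is_match, threshold):
    """Precision and recall when pairs scoring >= threshold are declared links."""
    tp = sum(1 for s, y in zip(scores, is_match) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, is_match) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, is_match) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention: no links declared
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented matching scores and true match status for four pairs.
scores = [0.9, 0.8, 0.4, 0.2]
is_match = [True, False, True, False]
point_high = precision_recall(scores, is_match, 0.5)
point_low = precision_recall(scores, is_match, 0.3)
```

Sweeping the threshold over the observed scores and plotting the resulting (recall, precision) points traces out the curve for one approach.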

Experiment
[Figure: precision-recall plots, with one panel for low perturbation and one for high perturbation]
• All approaches perform better with a low level of perturbation
• Binary EM without similarity scores performs the worst
• Down weighting log likelihood ratios outperforms down weighting of likelihood ratios
• Multinomial EM outperforms Binary EM, with no clear difference between the 8 category and 15 category Jaccard score schemes
• Jaro schemes provide the best performance, although these are not privacy preserving

Discussion
• PPPRL does not allow clerical review, and one threshold is determined based on the posterior probability of a correct link
• PPPRL can be tailored to different types of variables via the choice/design of the tokenization scheme
• So far dealt with 1 to 1 matching
• Multinomial EM offers improved classification over the unweighted and weighted binary EM schemes
• Under trusted Carole, the Jaro and Jaro-Winkler schemes outperformed the padded bigram tokenization scheme under PPPRL
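The posterior probability of a correct link mentioned in the first point follows from Bayes' rule: with match proportion p and likelihood ratio LR = m/u for a pair, the posterior is p·LR / (p·LR + 1 - p). A minimal sketch, with a hypothetical function name and invented input values:

```python
def posterior_match_probability(likelihood_ratio, prior):
    """Posterior probability of a match from the likelihood ratio and prior p."""
    weighted = prior * likelihood_ratio
    return weighted / (weighted + (1.0 - prior))

# Invented values: likelihood ratio of 9 with an even prior.
posterior = posterior_match_probability(9.0, 0.5)
```

A single threshold on this posterior then splits pairs into links and non-links, since there is no clerical review region.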

Thank you for your attention
