
Privacy Preserving Probabilistic Record Linkage - Duncan Smith


  1. Privacy Preserving Probabilistic Record Linkage
     Duncan Smith (Duncan.G.Smith@Manchester.ac.uk)
     Natalie Shlomo (Natalie.Shlomo@Manchester.ac.uk)
     Social Statistics, School of Social Sciences, University of Manchester
     The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 262608 (DwB - Data without Boundaries).

  2. Topics Covered
     • Introduction
     • Probabilistic Record Linkage
     • String Anonymisation
     • Putting the probabilities back into Privacy-Preserving Record Linkage
     • Experiment
     • Discussion

  3. Introduction
     • Probabilistic record linkage developed by Fellegi and Sunter, 1969
     • Administrative sources are being used to improve the quality of surveys or to replace traditional censuses
     • Traditionally, all datasets in one location (NSI) and matching variables (first name, last name, address) used to link data without the need for anonymisation
     • Data on individuals may be in distinct databases and may be owned by different custodians: Alice (A) and Bob (B)
     • Privacy restrictions prevent the release of certain variables, or information that uniquely identifies an individual is suppressed/coarsened

  4. Introduction
     • CS literature: techniques for anonymising identifying variables
     • Third party (Carole) only sees matching variables and returns pairs of unique record IDs (assigned by Alice and Bob)
     • Two possible scenarios (there are more…):
       - Trusted Carole: sees the true values of single matching variable
       - Non-trusted Carole: sees anonymised values of single matching variable
     • Privacy preserving record linkage (PPRL) allows exact matching and can allow linkage based on similarity scores generated from anonymised values
     • F&S probabilistic record linkage typically not used in PPRL

  5. Introduction
     • Alice and Bob clean, harmonize and standardize data and anonymise matching variables (using the same method and seed)
     • In our new approach, we apply probabilistic record linkage to anonymised values to obtain a probability of a correct match (PPPRL)
     • Motivation:
       - Data can be held within an archive; users can carry out PPPRL within a 'black box' for dynamic database integration
       - Three party Alice, Bob, Carole scenario as set out in UK Beyond 2011 project where Carole has access to original values and can calculate string comparators
     • In PPRL, no possibility of clerical review and links classified into 2 classes: true matches and false matches

  6. Probabilistic Record Linkage
     • F&S probabilistic record linkage uses a Binomial EM algorithm based on an agree/disagree indicator $\gamma_i$ ($i \in \{1, \dots, p\}$) to estimate the likelihood ratio $m(\gamma_i)/u(\gamma_i)$
     • Matching score based on the sum of the log of the likelihood ratio, $\sum_{i=1}^{p} \log\left( m(\gamma_i)/u(\gamma_i) \right)$, where $m(\gamma)$ is the probability of agree given it's a match and $u(\gamma)$ is the probability of agree given it's not a match
     • String comparators, e.g. Jaro-Winkler, are used to adjust the matching score based on partial agreements, e.g. typing errors, etc.
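
The matching score on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide only defines m and u for agreement, so the disagreement term (1 - m)/(1 - u) is the usual Fellegi-Sunter complement and is an assumption here, as are the function name `match_weight` and the parameter values.

```python
import math

def match_weight(agreements, m, u):
    """Sum of log likelihood ratios over the matching variables.

    agreements[i]: True if variable i agrees for this record pair
    m[i]: P(agree on variable i | pair is a match)
    u[i]: P(agree on variable i | pair is a non-match)
    """
    total = 0.0
    for agrees, m_i, u_i in zip(agreements, m, u):
        if agrees:
            total += math.log(m_i / u_i)
        else:
            # Assumed complement term for a disagreement.
            total += math.log((1.0 - m_i) / (1.0 - u_i))
    return total

# Hypothetical parameter values, for illustration only.
m = [0.95, 0.90, 0.85]
u = [0.50, 0.08, 0.01]
score_all_agree = match_weight([True, True, True], m, u)
score_none_agree = match_weight([False, False, False], m, u)
```

With these values a pair agreeing on every variable gets a positive score and a pair disagreeing on every variable gets a negative one, which is what a threshold on the score relies on.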

  7. String Anonymisation
     • String anonymisation can use hash functions on bigrams:
       'john' → {'jo', 'oh', 'hn'} → {21299418, 21496024, 20971735}
       'jon' → {'jo', 'on'} → {21299418, 21889246}
     • Minwise hashing (Broder 1997) generates a random permutation of a set of elements and returns the hash for the first ordered element
     • The probability of a hash collision on the first ordered element is the Jaccard similarity score: $J_{A,B} = \frac{|A \cap B|}{|A \cup B|}$
     • Estimate of Jaccard similarity score based on many hash values, where the number of collisions is distributed $n \sim \mathrm{Bin}(m, J_{A,B})$ (m = number of hash functions)
     • And estimated by $\hat{J}_{A,B} = \frac{n}{m}$
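
A minimal sketch of the minwise estimate, under stated assumptions: seeded BLAKE2b hashes stand in for the random permutations (the slide does not say which hash family was used), and the helper names are hypothetical.

```python
import hashlib

def bigrams(s):
    """Set of adjacent character pairs, e.g. 'john' -> {'jo', 'oh', 'hn'}."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def seeded_hash(token, seed):
    """64-bit hash of a token; each seed plays the role of one permutation."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(tokens, m):
    """One minimum hash value per hash function (m values in total)."""
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(m)]

def estimate_jaccard(sig_a, sig_b):
    """n / m, where n is the number of positions at which the signatures collide."""
    n = sum(a == b for a, b in zip(sig_a, sig_b))
    return n / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

a, b = bigrams("john"), bigrams("jon")
true_j = jaccard(a, b)  # |{'jo'}| / |{'jo', 'oh', 'hn', 'on'}| = 1/4
est_j = estimate_jaccard(minhash_signature(a, 1000), minhash_signature(b, 1000))
```

Because n is binomial with parameters m and J, the standard error of the estimate shrinks as m grows, at the cost of storing m hash values per string.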

  8. String Anonymisation
     • Proposed method: concatenated 1-bit minwise hashing
     • Estimation of the Jaccard similarity score is: $\hat{J}_{A,B} = 2\frac{n}{m} - 1$
     Example: Minwise hashes and 1-bit minwise hashes under a binary representation for S1 = {'jo', 'oh', 'hn'} and S2 = {'jo', 'on'}

     Minwise hashes:
          H1          H2          H3          H4          H5          …  Hm
     S1   451153726   1123790273  2501120381  2030682762  965995804
     S2   797504823   1123790273  262296169   1744666338  965995804
     …
     Sn

     1-bit minwise hashes:
          H1  H2  H3  H4  H5  …  Hm
     S1   0   1   1   0   0
     S2   1   1   1   0   0
     …
     Sn

     • With 5 hash functions, estimate of the Jaccard similarity score is 2/5 for minwise hashes and 3/5 for 1-bit hashes; true value is 1/4
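
The slide's worked example can be reproduced with a short sketch. One assumption is made explicit: the 1-bit value is taken to be the lowest bit of each minwise hash, which is consistent with the table on the slide. The correction 2n/m - 1 follows because two 1-bit values agree with probability J + (1 - J)/2 = (1 + J)/2: always when the full hashes collide, and about half the time otherwise.

```python
# Minwise hash values for S1 and S2, copied from the slide (H1..H5).
s1 = [451153726, 1123790273, 2501120381, 2030682762, 965995804]
s2 = [797504823, 1123790273, 262296169, 1744666338, 965995804]

def one_bit(signature):
    """Keep only the lowest bit of each minwise hash (assumed bit choice)."""
    return [h & 1 for h in signature]

def estimate_full(sig_a, sig_b):
    """n / m for full minwise hashes."""
    n = sum(x == y for x, y in zip(sig_a, sig_b))
    return n / len(sig_a)

def estimate_one_bit(bits_a, bits_b):
    """2 * n / m - 1 for 1-bit minwise hashes."""
    n = sum(x == y for x, y in zip(bits_a, bits_b))
    return 2 * n / len(bits_a) - 1

b1, b2 = one_bit(s1), one_bit(s2)
full_estimate = estimate_full(s1, s2)   # collisions at H2 and H5: 2/5
bit_estimate = estimate_one_bit(b1, b2)  # agreements at H2..H5: 2 * 4/5 - 1 = 3/5
```

With only 5 hash functions both estimates are far from the true value of 1/4, which is why m is normally much larger in practice.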

  9. String Anonymisation
     • Simulation Study: File A 300 names, File B obtained by perturbing File A to simulate typographical errors
     • Tokenized bigrams with leading and trailing underscores
     • True Jaccard scores compared with estimated scores on all pairs in A x B
     • Bias in Bloom filter approaches
     • Smaller variance in minwise hash compared to concatenated 1-bit hash but requires more storage
     • Concatenated hash approximately same MSE as Bloom filter
     • Precision can be controlled by choice of m, the number of hash functions
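
The tokenization and all-pairs comparison described on this slide might look like the sketch below. The two tiny name files are hypothetical stand-ins for File A and File B, and the helper names are illustrative.

```python
from itertools import product

def padded_bigrams(name, pad="_"):
    """Bigrams of the name with one leading and one trailing underscore."""
    padded = f"{pad}{name}{pad}"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

# Hypothetical miniature files; the study used 300 names.
file_a = ["john", "mary"]
file_b = ["jon", "mary"]

# True Jaccard score for every pair in A x B.
true_scores = {
    (a, b): jaccard(padded_bigrams(a), padded_bigrams(b))
    for a, b in product(file_a, file_b)
}
```

Padding adds tokens for the first and last characters, so 'john' and 'jon' share '_j', 'jo' and 'n_' and score 3/6 = 0.5, compared with 1/4 for unpadded bigrams.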

  10. Privacy Preserving Probabilistic Record Linkage
     • Extend Binomial EM Algorithm to K categories, k = 1,…,K, where each category is a grouping of similarity scores (Jaro for original values; Jaccard for anonymised values), e.g. 8 categories with (inclusive) upper bounds: [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1]
     • Element in agreement vector for variable q of pair j: $\gamma^{q}_{j,k} = 1$ with similarity score in category k, otherwise 0
     • Multinomial EM algorithm to estimate matching parameters: $\hat{p}$, $\hat{m}_{q,k}$ and $\hat{u}_{q,k}$
     • Blocking: In PPRL literature methods include: canopy clustering (McCallum et al., 2000), which divides the pairs into overlapping subsets before classification; multibit tree structures to identify similar comparison vectors under the Bloom filter framework (Bachteler et al., 2013), and more...
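
A rough sketch of the binning step and a multinomial EM fit, assuming conditional independence between matching variables given match status. The starting values, the smoothing constant `eps`, the function names and the toy data are all assumptions for illustration; the slides do not specify these details.

```python
from bisect import bisect_left

UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1.0]

def category(score, bounds=UPPER_BOUNDS):
    """Index k of the first bin whose inclusive upper bound is >= score."""
    return bisect_left(bounds, score)

def multinomial_em(pairs, K, iterations=50, p=0.1, eps=1e-6):
    """pairs[j][q] is the category index of variable q for pair j.

    Returns (p, m, u, g): match proportion, m[q][k], u[q][k], and the
    posterior match probability g[j] of each pair.
    """
    n, n_vars = len(pairs), len(pairs[0])
    norm = K * (K + 1) / 2
    # Assumed starting values: m favours high categories, u favours low ones.
    m = [[(k + 1) / norm for k in range(K)] for _ in range(n_vars)]
    u = [[(K - k) / norm for k in range(K)] for _ in range(n_vars)]
    g = []
    for _ in range(iterations):
        # E-step: posterior probability that each pair is a match.
        g = []
        for pair in pairs:
            pm, pu = p, 1.0 - p
            for q, k in enumerate(pair):
                pm *= m[q][k]
                pu *= u[q][k]
            g.append(pm / (pm + pu))
        # M-step: re-estimate p, m and u from the posterior weights.
        total = sum(g)
        p = total / n
        for q in range(n_vars):
            for k in range(K):
                gm = sum(gj for gj, pair in zip(g, pairs) if pair[q] == k)
                count = sum(1 for pair in pairs if pair[q] == k)
                # eps keeps every probability strictly positive.
                m[q][k] = (gm + eps) / (total + K * eps)
                u[q][k] = (count - gm + eps) / (n - total + K * eps)
    return p, m, u, g

# Hypothetical toy data: similarity scores on two variables per pair.
scores = [(1.0, 0.97)] * 10 + [(0.1, 0.15)] * 90
pairs = [tuple(category(s) for s in pair) for pair in scores]
p_hat, m_hat, u_hat, g = multinomial_em(pairs, K=len(UPPER_BOUNDS))
```

The category index returned by `category` identifies which element of the agreement vector is 1 for that variable, and the posterior `g[j]` is the quantity a single classification threshold would be applied to.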

  11. Experiment
     • 1000 records from a Census database with attached English names (File A)
     • File B generated by perturbing File A under a probabilistic approach including swapping, deleting and transposing characters on variables: Gender, Year of Birth, Month of Birth and First Name
     • 4 perturbed datasets generated at different levels of perturbation
     • A random sample of 700 records from File A and a random sample of 400 records from the perturbed files used for matching
     • No blocking was carried out
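
One way such a perturbation could be simulated is sketched below. The slide does not describe the actual procedure, so everything here is an assumption: "swapping" is interpreted as substituting a character, at most one edit is applied per value, and the function names and `rate` parameter are hypothetical.

```python
import random

def delete_char(s, rng):
    """Remove one randomly chosen character."""
    i = rng.randrange(len(s))
    return s[:i] + s[i + 1:]

def transpose_chars(s, rng):
    """Exchange two adjacent characters."""
    if len(s) < 2:
        return s
    i = rng.randrange(len(s) - 1)
    return s[:i] + s[i + 1] + s[i] + s[i + 2:]

def substitute_char(s, rng, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Replace one character with a random letter (assumed meaning of 'swapping')."""
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(alphabet) + s[i + 1:]

def perturb(s, rate, rng):
    """With probability `rate`, apply one randomly chosen edit; else return s."""
    if not s or rng.random() >= rate:
        return s
    op = rng.choice([delete_char, transpose_chars, substitute_char])
    return op(s, rng)

rng = random.Random(0)  # fixed seed so the perturbed file is reproducible
file_a = ["john", "mary", "peter"]  # hypothetical first names
file_b = [perturb(name, 0.5, rng) for name in file_a]
```

Raising `rate` gives a more heavily perturbed file, which is one way to produce datasets at several perturbation levels.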

  12. Experiment
     PPPRL:
     • Binary EM: standard EM approach based on exact matching of strings. No similarity score used
     • LR weighted: uses outputs of Binary EM and down-weights the likelihood ratios
     • Log LR weighted: uses outputs of Binary EM and down-weights the log likelihood ratios
     • EM (8): multinomial EM approach with 8 bins having upper bounds [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1]. Jaccard similarity score (with padded underscores on bigrams)
     • EM (15): multinomial EM approach with 15 bins having upper bounds [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.925, 0.95, 0.975, 0.999, 1]. As above
     PRL:
     • Jaro: multinomial EM approach with 8 bins using Jaro string comparator
     • Jaro-Winkler: As above but with Jaro-Winkler string comparator

  13. Experiment
     • Correct links identified and used to construct precision-recall plots
     • Plots show for any given threshold the precision and recall based on false positives, true positives, false negatives, true negatives, and can be used to compare approaches
     • Good approaches will produce curves in the upper right of the plot
     $\mathrm{Precision} = \frac{tp}{tp + fp}$
     $\mathrm{Recall} = \frac{tp}{tp + fn}$
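
A small sketch of how one point on a precision-recall plot could be computed for a given threshold. The scores and true-match flags are made up for illustration, and returning a precision of 1.0 when nothing is classified as a link is a convention chosen here, not something stated in the slides.

```python
def precision_recall(scores, is_match, threshold):
    """Precision and recall when pairs with score >= threshold are linked."""
    tp = sum(1 for s, t in zip(scores, is_match) if s >= threshold and t)
    fp = sum(1 for s, t in zip(scores, is_match) if s >= threshold and not t)
    fn = sum(1 for s, t in zip(scores, is_match) if s < threshold and t)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention: no links made
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical match scores and ground truth for four record pairs.
scores = [0.9, 0.8, 0.4, 0.3]
is_match = [True, False, True, False]

# Sweeping the threshold traces out the precision-recall curve.
curve = [precision_recall(scores, is_match, t) for t in (0.85, 0.5, 0.35)]
```

Lowering the threshold links more pairs, so recall can only rise while precision may fall; that trade-off is what the curves compare across approaches.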

  14. Experiment
     [Figure: precision-recall plots; panels: low perturbation, high perturbation]
     • All approaches perform better with low level of perturbation
     • Binary EM without similarity scores performs the worst
     • Down-weighting log likelihood ratios outperforms down-weighting of likelihood ratios
     • Multinomial EM outperforms Binary EM with no clear difference between 8 category and 15 category Jaccard score schemes
     • Jaro schemes provide the best performance, although these are not privacy preserving

  15. Discussion
     • PPPRL does not allow clerical review and one threshold is determined based on posterior probability of a correct link
     • PPPRL can be tailored to different types of variables via the choice/design of the tokenization scheme
     • So far dealt with 1 to 1 matching
     • Multinomial EM offers improved classification over the unweighted and weighted binary EM schemes
     • Under trusted Carole, the Jaro and Jaro-Winkler schemes outperformed the padded bigram tokenization scheme under PPPRL

  16. Thank you for your attention
