Privacy Preserving Probabilistic Record Linkage Duncan Smith - - PowerPoint PPT Presentation

SLIDE 1

Privacy Preserving Probabilistic Record Linkage

Duncan Smith (Duncan.G.Smith@Manchester.ac.uk)
Natalie Shlomo (Natalie.Shlomo@Manchester.ac.uk)
Social Statistics, School of Social Sciences
University of Manchester

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 262608 (DwB - Data without Boundaries).

SLIDE 2

Topics Covered

  • Introduction
  • Probabilistic Record Linkage
  • String Anonymisation
  • Putting the probabilities back into Privacy-Preserving Record Linkage
  • Experiment
  • Discussion

SLIDE 3

Introduction

  • Probabilistic record linkage was developed by Fellegi and Sunter (1969)
  • Administrative sources are being used to improve the quality of surveys or to replace traditional censuses
  • Traditionally, all datasets are held in one location (NSI) and matching variables (first name, last name, address) are used to link data without the need for anonymisation
  • Data on individuals may be held in distinct databases and may be owned by different custodians: Alice (A) and Bob (B)
  • Privacy restrictions prevent the release of certain variables, or information that uniquely identifies an individual is suppressed or coarsened

SLIDE 4

Introduction

  • The CS literature offers techniques for anonymising identifying variables
  • A third party (Carole) sees only the matching variables and returns pairs of unique record IDs (assigned by Alice and Bob)
  • Two possible scenarios (there are more…):
      • Trusted Carole: sees the true values of a single matching variable
      • Non-trusted Carole: sees anonymised values of a single matching variable
  • Privacy preserving record linkage (PPRL) allows exact matching and can allow linkage based on similarity scores generated from anonymised values
  • F&S probabilistic record linkage is typically not used in PPRL

SLIDE 5

Introduction

  • Alice and Bob clean, harmonize and standardize data and anonymise matching variables (using the same method and seed)
  • In our new approach, we apply probabilistic record linkage to anonymised values to obtain a probability of a correct match (PPPRL)
  • Motivation:
      • Data can be held within an archive; users can carry out PPPRL within a ‘black box’ for dynamic database integration
      • Three-party Alice, Bob, Carole scenario as set out in the UK Beyond 2011 project, where Carole has access to original values and can calculate string comparators
  • In PPRL there is no possibility of clerical review, and links are classified into 2 classes: true matches and false matches

SLIDE 6

Probabilistic Record Linkage

  • F&S probabilistic record linkage uses a Binomial EM algorithm based on an agree/disagree indicator to estimate the likelihood ratio
  • Matching score based on the sum of the log of the likelihood ratio $m(\gamma)/u(\gamma)$, $\gamma = (\gamma_i,\ i \in \{1,\dots,p\})$, where $m(\gamma)$ is the probability of agreement given that the pair is a match and $u(\gamma)$ is the probability of agreement given that it is not a match
  • String comparators, e.g. Jaro-Winkler, are used to adjust the matching score based on partial agreements, e.g. typing errors

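A minimal Python sketch of the matching score for binary agree/disagree indicators, assuming conditional independence across the p matching variables. The function name `fs_score` and the m and u values are hypothetical and chosen only for illustration; in practice they would be estimated, for example by the EM algorithm.

```python
import math

def fs_score(agreements, m, u):
    """Sum of log likelihood ratios over the matching variables.

    agreements[i] is True if variable i agrees for the pair,
    m[i] = P(agree | match), u[i] = P(agree | non-match).
    """
    score = 0.0
    for agree, m_i, u_i in zip(agreements, m, u):
        if agree:
            score += math.log(m_i / u_i)
        else:
            # Disagreement contributes the complementary ratio.
            score += math.log((1.0 - m_i) / (1.0 - u_i))
    return score

# Hypothetical parameters for three matching variables.
m = [0.95, 0.90, 0.85]
u = [0.50, 0.05, 0.01]

all_agree = fs_score([True, True, True], m, u)
none_agree = fs_score([False, False, False], m, u)
```

Pairs that agree on every variable get a positive score and pairs that disagree on every variable get a negative one, so a threshold on the score separates likely matches from likely non-matches.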
SLIDE 7

String Anonymisation

  • String anonymisation can use hash functions on bigrams:

    'john' → {'jo', 'oh', 'hn'} → {21299418, 21496024, 20971735}
    'jon' → {'jo', 'on'} → {21299418, 21889246}

  • Minwise hashing (Broder 1997) generates a random permutation of a set of elements and returns the hash for the first ordered element
  • The probability of a hash collision on the first ordered element is the Jaccard similarity score:

    $$J_{A,B} = \frac{|A \cap B|}{|A \cup B|}$$

  • Estimate of Jaccard similarity score based on many hash values, where the number of collisions $n$ is distributed ($m$ number of hash functions):

    $$n \sim \mathrm{Bin}(m, J_{A,B})$$

  • And estimated by

    $$\hat{J}_{A,B} = \frac{n}{m}$$

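The minwise hashing idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the salted SHA-256 hash `h`, the use of the integer seed as the shared secret, and the helper names are all hypothetical. Each seed plays the role of one random permutation, and the minimum hash under that seed is the "first ordered element".

```python
import hashlib

def bigrams(s):
    """Set of adjacent character pairs, e.g. 'john' -> {'jo', 'oh', 'hn'}."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def h(token, seed):
    """Hypothetical seeded hash: 32-bit integer from a salted SHA-256."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def minhash_signature(tokens, m):
    """One minimum hash per hash function (seeds 0..m-1)."""
    return [min(h(t, seed) for t in tokens) for seed in range(m)]

def estimate_jaccard(sig_a, sig_b):
    """n / m, where n is the number of positions with equal minimum hashes."""
    n = sum(x == y for x, y in zip(sig_a, sig_b))
    return n / len(sig_a)

a, b = bigrams("john"), bigrams("jon")
true_j = len(a & b) / len(a | b)  # 1 shared bigram out of 4 distinct ones
est_j = estimate_jaccard(minhash_signature(a, 200), minhash_signature(b, 200))
```

Both custodians must use the same hash functions and seeds, otherwise equal bigrams would not produce equal minimum hashes.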
SLIDE 8

String Anonymisation

  • Proposed method: concatenated 1-bit minwise hashing

Example: Minwise hashes and 1-bit minwise hashes under a binary representation for S1 = {'jo', 'oh', 'hn'} and S2 = {'jo', 'on'}

Minwise hashes:

      H1          H2          H3          H4          H5          …  Hm
S1    451153726   1123790273  2501120381  2030682762  965995804
S2    797504823   1123790273  262296169   1744666338  965995804
…
Sn

1-bit minwise hashes:

      H1  H2  H3  H4  H5  …  Hm
S1    0   1   1   0   0
S2    1   1   1   0   0
…
Sn

  • Estimation of the Jaccard similarity score is:

    $$\hat{J}_{A,B} = \frac{2n}{m} - 1$$

  • With 5 hash functions, the estimate of the Jaccard similarity score is 2/5 for minwise hashes and 3/5 for 1-bit hashes; the true value is 1/4

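The worked example on this slide can be checked with a short Python sketch. It assumes the retained bit is the lowest-order bit of each minwise hash, which is an assumption that happens to reproduce the slide's numbers; the function names are hypothetical. The 1-bit estimator is 2n/m - 1 because two different minimum hashes still agree on a single bit with probability 1/2, so the bit agreement probability is (1 + J)/2.

```python
from fractions import Fraction

# Minwise hashes H1..H5 for S1 = {'jo','oh','hn'} and S2 = {'jo','on'},
# copied from the slide.
S1 = [451153726, 1123790273, 2501120381, 2030682762, 965995804]
S2 = [797504823, 1123790273, 262296169, 1744666338, 965995804]

def collisions(a, b):
    """Number of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(a, b))

def one_bit(signature):
    """Keep only the lowest-order bit of each hash (assumed convention)."""
    return [value & 1 for value in signature]

def minwise_estimate(a, b):
    """n / m for full minwise hashes."""
    return Fraction(collisions(a, b), len(a))

def one_bit_estimate(bits_a, bits_b):
    """2n / m - 1 for 1-bit minwise hashes."""
    return Fraction(2 * collisions(bits_a, bits_b), len(bits_a)) - 1

full = minwise_estimate(S1, S2)                     # 2 of 5 hashes collide
reduced = one_bit_estimate(one_bit(S1), one_bit(S2))  # 4 of 5 bits collide
```

With only 5 hash functions both estimates are far from the true value of 1/4; the precision improves as m grows.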
SLIDE 9

String Anonymisation

  • Bias in Bloom filter approaches
  • Smaller variance in minwise hash compared to concatenated 1-bit hash, but requires more storage
  • Concatenated hash has approximately the same MSE as Bloom filter
  • Precision can be controlled by choice of m, the number of hash functions
  • Simulation study: File A contains 300 names; File B is obtained by perturbing File A to simulate typographical errors
  • Tokenized bigrams with leading and trailing underscores
  • True Jaccard scores compared with estimated scores on all pairs in A × B

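A small Python sketch of the padded bigram tokenization and the true Jaccard score over all pairs in A × B. The two short name lists are hypothetical stand-ins for File A and File B, and the helper names are illustrative only.

```python
from itertools import product

def padded_bigrams(name):
    """Bigrams of the name with a leading and trailing underscore."""
    padded = f"_{name}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    """True Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)

# Hypothetical miniature files; File B mimics typographical errors.
file_a = ["john", "mary"]
file_b = ["jon", "mray"]

scores = {
    (x, y): jaccard(padded_bigrams(x), padded_bigrams(y))
    for x, y in product(file_a, file_b)
}
```

Padding adds the tokens '_j' and 'n_' to 'john', so agreement on the first and last characters contributes to the score; here 'john' and 'jon' share 3 of 6 distinct padded bigrams.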
SLIDE 10

Privacy Preserving Probabilistic Record Linkage

  • Extend the Binomial EM algorithm to K categories, k = 1,…,K, where each category is a grouping of similarity scores (Jaro for original values; Jaccard for anonymised values), e.g. 8 categories with (inclusive) upper bounds: [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1]
  • Element in agreement vector for variable q of pair j with similarity score in category k: $\gamma^{j}_{q,k} = 1$, otherwise 0
  • Multinomial EM algorithm to estimate matching parameters: $\hat{m}_{q,k}$, $\hat{u}_{q,k}$ and $\hat{p}$
  • Blocking: methods in the PPRL literature include canopy clustering (McCallum et al., 2000), which divides the pairs into overlapping subsets before classification; multibit tree structures to identify similar comparison vectors under the Bloom filter framework (Bachteler et al., 2013); and more...

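The binning and the multinomial EM step might look like the Python sketch below. It is a simplified illustration, not the authors' code: it assumes conditional independence between matching variables, and the starting values, the smoothing constant `eps`, the iteration count and the toy data are all assumptions made for the example.

```python
from bisect import bisect_left

BOUNDS = [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1.0]

def category(score, bounds=BOUNDS):
    """Index of the first bin whose inclusive upper bound is >= score."""
    return bisect_left(bounds, score)

def multinomial_em(pairs, n_cats, p=0.1, iters=50, eps=1e-9):
    """pairs[j][q] is the category of variable q for record pair j."""
    n_vars = len(pairs[0])
    # Assumed starting values: matches favour high categories,
    # non-matches favour low ones.
    weights = [k + 1 for k in range(n_cats)]
    total = sum(weights)
    m = [[w / total for w in weights] for _ in range(n_vars)]
    u = [[w / total for w in reversed(weights)] for _ in range(n_vars)]
    for _ in range(iters):
        # E-step: posterior probability that each pair is a match.
        g = []
        for cats in pairs:
            pm, pu = p, 1.0 - p
            for q, k in enumerate(cats):
                pm *= m[q][k]
                pu *= u[q][k]
            g.append(pm / (pm + pu))
        # M-step: re-estimate p, m and u from the posteriors.
        sg = sum(g)
        su = len(g) - sg
        p = sg / len(g)
        for q in range(n_vars):
            for k in range(n_cats):
                gm = sum(gj for gj, c in zip(g, pairs) if c[q] == k)
                gu = sum(1.0 - gj for gj, c in zip(g, pairs) if c[q] == k)
                # eps keeps every category probability strictly positive.
                m[q][k] = (gm + eps) / (sg + n_cats * eps)
                u[q][k] = (gu + eps) / (su + n_cats * eps)
    return p, m, u

# Toy data: 20 pairs with perfect scores, 80 pairs with very low scores.
pairs = [[category(1.0), category(1.0)]] * 20
pairs += [[category(0.1), category(0.15)]] * 80
p_hat, m_hat, u_hat = multinomial_em(pairs, n_cats=len(BOUNDS))
```

On this toy input the estimated match proportion settles near 20/100 and the match distribution concentrates on the top category, which is the behaviour one would expect from well separated data.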
SLIDE 11

Experiment

  • 1000 records from a Census database with attached English names (File A)
  • File B generated by perturbing File A under a probabilistic approach, including swapping, deleting and transposing characters, on variables: Gender, Year of Birth, Month of Birth and First Name
  • 4 perturbed datasets generated at different levels of perturbation
  • A random sample of 700 records from File A and a random sample of 400 records from the perturbed files used for matching
  • No blocking was carried out

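One way such a perturbation could be simulated is sketched below in Python. The slides do not give the procedure, so everything here is an assumption: "swapping" is interpreted as replacing a character with a random lowercase letter, at most one edit is applied per value, and `rate` is a hypothetical per-value perturbation probability.

```python
import random
import string

def perturb(value, rate, rng):
    """With probability `rate`, apply one random character-level edit."""
    if len(value) < 2 or rng.random() >= rate:
        return value
    i = rng.randrange(len(value) - 1)
    op = rng.choice(["substitute", "delete", "transpose"])
    if op == "substitute":
        # Assumed reading of "swapping": replace one character.
        return value[:i] + rng.choice(string.ascii_lowercase) + value[i + 1:]
    if op == "delete":
        return value[:i] + value[i + 1:]
    # Transpose the characters at positions i and i + 1.
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]

rng = random.Random(2013)  # fixed seed so the run is repeatable
file_a = ["john", "mary", "peter"]  # hypothetical first names
file_b = [perturb(name, 0.5, rng) for name in file_a]
```

Raising `rate` gives a more heavily perturbed file, which is one simple way to produce datasets at different perturbation levels.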
SLIDE 12

Experiment

PPPRL:

  • Binary EM: standard EM approach based on exact matching of strings. No similarity score used
  • LR weighted: outputs of Binary EM, with likelihood ratios downweighted
  • Log LR weighted: outputs of Binary EM, with log likelihood ratios downweighted
  • EM (8): multinomial EM approach with 8 bins having upper bounds [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1]. Jaccard similarity score (with padded underscores on bigrams)
  • EM (15): multinomial EM approach with 15 bins having upper bounds [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.925, 0.95, 0.975, 0.999, 1]. As above

PRL:

  • Jaro: multinomial EM approach with 8 bins using Jaro string comparator
  • Jaro-Winkler: As above but with Jaro-Winkler string comparator

SLIDE 13

Experiment

  • Correct links identified and used to construct precision-recall plots
  • Plots show, for any given threshold, the precision and recall based on false positives (fp), true positives (tp), false negatives (fn) and true negatives, and can be used to compare approaches

    $$\text{Recall} = \frac{tp}{tp + fn} \qquad \text{Precision} = \frac{tp}{tp + fp}$$

  • Good approaches will produce curves in the upper right of the plot

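Precision and recall at a given threshold can be computed as in this Python sketch. The scores and truth labels are made up for illustration, and returning a precision of 1.0 when no pair is linked is an assumed convention.

```python
def precision_recall(scores, is_match, threshold):
    """Link every pair whose score is >= threshold, then count outcomes."""
    tp = fp = fn = 0
    for score, truth in zip(scores, is_match):
        linked = score >= threshold
        if linked and truth:
            tp += 1
        elif linked:
            fp += 1
        elif truth:
            fn += 1
    # Assumed conventions for empty denominators.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical matching scores and true match status for five pairs.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
is_match = [True, True, False, True, False]

# One (precision, recall) point per threshold traces out the curve.
curve = [precision_recall(scores, is_match, t) for t in (0.1, 0.5, 0.7)]
```

Lowering the threshold links more pairs, which raises recall and usually lowers precision; sweeping the threshold gives the points of a precision-recall plot.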
SLIDE 14

Experiment

[Figure: two panels, low perturbation and high perturbation]

  • All approaches perform better with a low level of perturbation
  • Binary EM without similarity scores performs the worst
  • Downweighting log likelihood ratios outperforms downweighting of likelihood ratios
  • Multinomial EM outperforms Binary EM, with no clear difference between the 8-category and 15-category Jaccard score schemes
  • Jaro schemes provide the best performance, although these are not privacy preserving

SLIDE 15

Discussion

  • PPPRL does not allow clerical review, and one threshold is determined based on the posterior probability of a correct link
  • PPPRL can be tailored to different types of variables via the choice/design of the tokenization scheme
  • So far dealt with 1 to 1 matching
  • Multinomial EM offers improved classification over the unweighted and weighted binary EM schemes
  • Under trusted Carole, the Jaro and Jaro-Winkler schemes outperformed the padded bigram tokenization scheme under PPPRL

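The posterior probability of a correct link follows from Bayes' rule given the prior match proportion p and a pair's likelihood ratio. The Python sketch below shows that calculation with a single classification threshold; the function names, the threshold of 0.9 and the example values are hypothetical.

```python
def posterior_match_probability(p, likelihood_ratio):
    """P(match | comparison) = p * LR / (p * LR + 1 - p)."""
    weighted = p * likelihood_ratio
    return weighted / (weighted + (1.0 - p))

def classify(p, likelihood_ratio, threshold=0.9):
    """Single threshold: declare a link if the posterior is high enough."""
    return posterior_match_probability(p, likelihood_ratio) >= threshold

# Hypothetical prior match proportion and likelihood ratios.
p = 0.1
weak = posterior_match_probability(p, 9.0)       # prior odds 1:9 cancel LR 9
strong = posterior_match_probability(p, 900.0)
```

Because there is no clerical review class, every pair falls on one side of the threshold or the other.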
SLIDE 16

Thank you for your attention