Privacy‐Preserving Record Linkage, Elizabeth Ashley Durham



SLIDE 1

11/25/2010 1

Privacy‐Preserving Record Linkage

Elizabeth Ashley Durham

Health Information Privacy Lab
Department of Biomedical Informatics
Vanderbilt University

Wednesday, 24 November, 2010

Record linkage

Set of records from Vanderbilt        Set of records from Emory

First Name   Last Name                First Name   Last Name
john         smith                    jon          smyth
lucille      ball                     taylor       swift
bill         clinton                  william      clinton
hillary      clinton                  jon          bon jovi

SLIDE 2

Privacy‐preserving record linkage (PPRL)

Set of records from Vanderbilt        Set of records from Emory

First Name   Last Name                First Name   Last Name
john         smith                    jon          smyth
lucille      ball                     taylor       swift
bill         clinton                  william      clinton
hillary      clinton                  jon          bon jovi

[Figure: a barrier labeled “POLICY” stands between the two sets of records]

PPRL applications in healthcare

sharing patient data for research

SLIDE 3

The NIH requires researchers to share de‐identified patient data

  • U.S. National Institutes of Health (NIH) data sharing policy
      – “Data should be made as widely & freely available as possible”
      – Researchers who receive ≥ $500,000 must develop a data sharing plan or describe why data sharing is not possible
      – Derived data must be shared in a manner that is devoid of “identifiable information”
  • NIH‐supported genome‐wide association studies policy
      – Researchers funded for genome‐wide association studies must share data

Duplicates: a flaw in the current model for sharing de‐identified data

Vanderbilt → NIH:   V1: flu, fatal    V2: flu, fatal
Emory → NIH:        E1: flu, fatal    E2: flu, surv
Records at NIH:     flu, fatal    flu, fatal    flu, fatal    flu, surv

Vanderbilt
ID   First Name   Last Name   Diagnosis   Outcome
V1   john         smith       flu         fatal
V2   lucille      ball        flu         fatal

Emory
ID   First Name   Last Name   Diagnosis   Outcome
E1   jon          smyth       flu         fatal
E2   taylor       swift       flu         surv

SLIDE 4

Fragmented data: a flaw in the current model for sharing de‐identified data

Vanderbilt → NIH:   V1: flu, ??       V2: flu, fatal
Emory → NIH:        E1: ??, fatal     E2: flu, surv
Records at NIH:     flu, ??    ??, fatal    flu, fatal    flu, surv

Vanderbilt
ID   First Name   Last Name   Diagnosis   Outcome
V1   john         smith       flu         ??
V2   lucille      ball        flu         fatal

Emory
ID   First Name   Last Name   Diagnosis   Outcome
E1   jon          smyth       ??          fatal
E2   taylor       swift       flu         surv

PPRL can improve the model for sharing de‐identified data and enable more effective medical research

Encoded identifiers from Vanderbilt:   V1: H(john), H(smith)    V2: H(lucille), H(ball)
Encoded identifiers from Emory:        E1: H(jon), H(smyth)     E2: H(taylor), H(swift)
Linkage result:                        V1‐E1
Vanderbilt → NIH:                      V1: flu, ??       V2: flu, fatal
Emory → NIH:                           E1: ??, fatal     E2: flu, surv
Records at NIH:                        flu, fatal    flu, fatal    flu, surv

where H denotes a hash function

Vanderbilt
ID   First Name   Last Name   Diagnosis   Outcome
V1   john         smith       flu         ??
V2   lucille      ball        flu         fatal

Emory
ID   First Name   Last Name   Diagnosis   Outcome
E1   jon          smyth       ??          fatal
E2   taylor       swift       flu         surv

SLIDE 5

PPRL applications in healthcare

improving patient care

[Figure: several separate records, each labeled “john ….”]

Other PPRL applications

  • Business
  • Counter‐terrorism efforts

SLIDE 6

Roadmap

  • Definition
  • Motivation
  • Record linkage
  • Privacy‐preserving record linkage
      – Background
      – Experimental design
      – Experimental results
      – Discussion
      – Open questions in record linkage
      – Conclusion

SLIDE 7

Steps in record linkage

Blocking → Field comparison → Record pair comparison → Record pair classification → Matches / Non‐matches

A few assumptions…
  1) common schema
  2) common method of data standardization
  3) records from an institution have been deduplicated (i.e., record linkage has been applied within each institution such that an individual is represented by only a single record within an institution)

SLIDE 8

Blocking: sample dataset

Set of records from Vanderbilt        Set of records from Emory

First Name   Last Name                First Name   Last Name
john         smith                    jon          smyth
lucille      ball                     taylor       swift
bill         clinton                  william      clinton
hillary      clinton                  jon          bon jovi

Blocking

No blocking: |Vanderbilt| × |Emory| = 4 × 4 = 16 record pair comparisons
Blocking (first letter of last name): 5 record pair comparisons

[Figure: grid of the Vanderbilt records (john smith, lucille ball, bill clinton, hillary clinton) against the Emory records, with each compared pair marked as match or non‐match]
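The two comparison counts on this slide can be reproduced with a short sketch. It assumes records are (first name, last name) tuples and uses the blocking key named on the slide, the first letter of the last name; the function names are illustrative only and do not come from the talk.

```python
from collections import defaultdict
from itertools import product

# Sample records from the slide, as (first name, last name) tuples.
vanderbilt = [("john", "smith"), ("lucille", "ball"),
              ("bill", "clinton"), ("hillary", "clinton")]
emory = [("jon", "smyth"), ("taylor", "swift"),
         ("william", "clinton"), ("jon", "bon jovi")]


def blocking_key(record):
    """Blocking key used on the slide: first letter of the last name."""
    return record[1][0]


def candidate_pairs(left, right, key=blocking_key):
    """Return only the cross-set pairs whose records share a blocking key."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[key(rec)].append(rec)
    return [(a, b) for a in left for b in blocks.get(key(a), [])]


all_pairs = list(product(vanderbilt, emory))       # no blocking
blocked_pairs = candidate_pairs(vanderbilt, emory)  # with blocking
print(len(all_pairs), len(blocked_pairs))
```

Without blocking every Vanderbilt record is compared with every Emory record; with blocking only records that fall into the same block (S, B, or C here) are compared.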

SLIDE 9

Blocking: another perspective

Block S (2 record pair comparisons)
  Vanderbilt: john smith
  Emory:      jon smyth, taylor swift

Block B (1 record pair comparison)
  Vanderbilt: lucille ball
  Emory:      jon bon jovi

Block C (2 record pair comparisons)
  Vanderbilt: bill clinton, hillary clinton
  Emory:      william clinton

SLIDE 10

The field comparison step of record linkage

Fields:                    First Name   Last Name
Record V1:                 john         smith
Record E1:                 jon          smyth

Similarity Function

Field comparison vector:   0.75         0.8

SLIDE 11

The record pair comparison step of record linkage

Fields:                    First Name   Last Name
Record V1:                 john         smith
Record E1:                 jon          smyth

Similarity Function

Field comparison vector:   0.75         0.8

Record pair similarity “score”: 0.79

SLIDE 12

The record pair classification step of record linkage

Vanderbilt records   Emory records   Record pair similarity “score”   Record pair classification
john smith           jon smyth       +7                               Match
john smith           taylor swift    +3                               Non‐match
lucille ball         jon smyth       +0                               Non‐match
lucille ball         taylor swift    +0                               Non‐match
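Classification by score can be sketched as a simple threshold rule. The slide does not state the cut-off, so the threshold of 5 below is a hypothetical value chosen only because it separates +7 from +3; the scores themselves are the ones shown on the slide.

```python
# Similarity scores from the slide, keyed by (Vanderbilt record, Emory record).
scores = {
    ("john smith", "jon smyth"): 7,
    ("john smith", "taylor swift"): 3,
    ("lucille ball", "jon smyth"): 0,
    ("lucille ball", "taylor swift"): 0,
}

THRESHOLD = 5  # hypothetical cut-off; the slide does not give one


def classify(score, threshold=THRESHOLD):
    """Label a record pair as a match when its score reaches the threshold."""
    return "Match" if score >= threshold else "Non-match"


labels = {pair: classify(score) for pair, score in scores.items()}
for pair, label in labels.items():
    print(pair, label)
```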

SLIDE 13

How do we do all of this in a privacy‐preserving manner?

Blocking → Field comparison → Record pair comparison → Record pair classification → Matches / Non‐matches

SLIDE 14

Background

Blocking → Field comparison → Record pair comparison → Record pair classification → Matches / Non‐matches

Field comparison: binary or approximate
Record pair comparison: Fellegi‐Sunter (FS) or Winkler FS

SLIDE 15

Binary Field Comparison

Fields:                    First Name   Last Name   City
Record V1:                 john         smith       nashville
Record E1:                 jon          smyth       nashville

SHA(Record V1):            xy9l         br3f        xt0uv
SHA(Record E1):            nw2          vwer        xt0uv

equal?

Field Comparison Vector:   0            0           1

where SHA refers to the Secure Hash Algorithm
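A minimal sketch of binary field comparison on hashed values follows. The slide says only "SHA", so the use of SHA-256 from Python's standard hashlib module is an assumption, as are the function names; the short codes on the slide (xy9l, br3f, ...) are illustrative stand-ins for real digests.

```python
import hashlib


def encode(value: str) -> str:
    """Hash one field value. SHA-256 is an assumption; the slide says only 'SHA'."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def encode_record(record):
    """Hash every field of a record independently."""
    return [encode(value) for value in record]


def binary_compare(encoded_a, encoded_b):
    """Field comparison vector: 1 where the hashed fields are equal, else 0."""
    return [1 if a == b else 0 for a, b in zip(encoded_a, encoded_b)]


v1 = encode_record(["john", "smith", "nashville"])
e1 = encode_record(["jon", "smyth", "nashville"])
vector = binary_compare(v1, e1)
print(vector)
```

Because a hash changes completely when a single character changes, "john" versus "jon" and "smith" versus "smyth" count as full disagreements here, which is the limitation the approximate comparison addresses.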

SLIDE 16

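Approximate Field Comparison (Schnell 2009)

record V1: john → bigrams: _j  jo  oh  hn  n_
record E1: jon  → bigrams: _j  jo  on  n_

Hash functions h1 and h2 map each bigram to bit positions that are set to 1, giving one bit vector per record, α and β.

$\text{Dice coefficient}(\alpha, \beta) = \frac{2\,|\alpha \cap \beta|}{|\alpha| + |\beta|} = \frac{2 \cdot 5}{13} \approx 0.77$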
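The encoding and the Dice coefficient can be sketched as follows. The filter length (32 bits), the way the two hash functions are derived from SHA-256, and all function names are assumptions made for illustration; they are not the parameters used on the slide or in Schnell 2009, so the similarity printed for john/jon will generally differ from the slide's 0.77.

```python
import hashlib


def bigrams(name: str) -> set:
    """Character bigrams of the name, padded with '_' on both sides."""
    padded = f"_{name}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(name: str, size: int = 32, num_hashes: int = 2) -> set:
    """Return the set of bit positions set to 1 for this name.

    Each bigram is hashed num_hashes times (salted SHA-256, an assumption)
    and every resulting position modulo `size` is switched on.
    """
    positions = set()
    for gram in bigrams(name):
        for k in range(num_hashes):
            digest = hashlib.sha256(f"{k}:{gram}".encode("utf-8")).digest()
            positions.add(int.from_bytes(digest[:4], "big") % size)
    return positions


def dice(a: set, b: set) -> float:
    """Dice coefficient 2|a ∩ b| / (|a| + |b|) over two sets of set-bit positions."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


alpha = bloom_encode("john")
beta = bloom_encode("jon")
print(round(dice(alpha, beta), 2))
```

The slide's own arithmetic corresponds to 5 shared bit positions out of 13 set bits in total, for example `dice(set(range(7)), set(range(2, 8)))`, which evaluates to 10/13.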

SLIDE 17

Fellegi‐Sunter (FS) (Fellegi 1969)

Conditional probability vectors:

Fields:          First Name   Last Name
Match:           0.8          0.9
Non‐match:       0.05         0.02

Weight vectors:

Fields:          First Name   Last Name
Agreement:       1.2          1.95        e.g. $\log \frac{0.9}{0.02}$
Disagreement:    ‐0.68        ‐1          e.g. $\log \frac{1 - 0.9}{1 - 0.02}$

Fellegi‐Sunter (FS) (Fellegi 1969)

Fields:                    First Name   Last Name
Agreement weights:         1.2          1.95
Disagreement weights:      ‐0.68        ‐1

Record V1:                 john         smith
Record E1:                 jon          smyth
Field comparison vector:   0            0
Weight vector:             ‐0.68        ‐1

Record pair similarity score: Σ = ‐1.68
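The weight computation can be sketched as below, using the conditional probabilities from the slide. The logarithm base is an inference: base 10 reproduces the printed 1.2, ‐0.68 and ‐1. With the last-name values as printed, log10(0.9 / 0.02) is about 1.65 rather than 1.95, so the printed agreement weight may rest on a different non-match probability. Function names are illustrative.

```python
import math

# Conditional probabilities from the slide, per field (first name, last name).
m = [0.8, 0.9]    # P(field agrees | record pair is a match)
u = [0.05, 0.02]  # P(field agrees | record pair is a non-match)


def agreement_weight(m_i, u_i):
    """Weight added when a field agrees: log10(m / u). Base 10 is an inference."""
    return math.log10(m_i / u_i)


def disagreement_weight(m_i, u_i):
    """Weight added when a field disagrees: log10((1 - m) / (1 - u))."""
    return math.log10((1 - m_i) / (1 - u_i))


def fs_score(comparison, m, u):
    """Sum the agreement weight for each 1 and the disagreement weight for each 0."""
    return sum(
        agreement_weight(m_i, u_i) if agrees else disagreement_weight(m_i, u_i)
        for agrees, m_i, u_i in zip(comparison, m, u)
    )


# john smith vs. jon smyth: both hashed fields disagree.
score = fs_score([0, 0], m, u)
print(round(score, 2))
```

The slide's ‐1.68 is the sum of the two rounded weights (‐0.68 and ‐1); summing the unrounded weights gives a value a hundredth or so higher.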

SLIDE 18

Winkler’s Modification to Fellegi‐Sunter (Porter 1997)

Fields:                    First Name   Last Name
Agreement weights:         1.2          1.95
Disagreement weights:      ‐0.68        ‐1

Record V1:                 john         smith
Record E1:                 jon          smyth
Field comparison vector:   0.75         0.8
Weight vector:             0.73         1.36

0.73 is the 75th percentile of the range from ‐0.68 to 1.2; 1.36 is the 80th percentile of the range from ‐1 to 1.95.

Record pair similarity score: Σ = 2.09
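The numbers on this slide are consistent with a linear interpolation between the disagreement and agreement weights, scaled by the field similarity. The sketch below reproduces them under that assumption; it is not a transcription of the adjustment published in Porter 1997, and the function name is illustrative.

```python
def winkler_weight(similarity, agree_w, disagree_w):
    """Interpolate linearly between the disagreement and agreement weights.

    similarity = 0 yields the disagreement weight, similarity = 1 the
    agreement weight. Linear interpolation is an assumption that happens
    to reproduce the slide's numbers.
    """
    return disagree_w + similarity * (agree_w - disagree_w)


agree = [1.2, 1.95]        # agreement weights (first name, last name)
disagree = [-0.68, -1.0]   # disagreement weights
similarities = [0.75, 0.8]  # approximate field comparison vector

weights = [winkler_weight(s, a, d)
           for s, a, d in zip(similarities, agree, disagree)]
score = sum(weights)
print([round(w, 2) for w in weights], round(score, 2))
```

Compared with the binary case, a near-miss such as john/jon now contributes a positive weight instead of the full disagreement penalty.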

SLIDE 19


Experimental design

  • Dataset: ~6 million records from the North Carolina Voter Registration dataset
  • Fields: first name, middle name, last name, birth state, sex, race, street name, street type, street suffix, city, state
  • Computational resources: 2 GHz dual‐core PC with 4 GB of memory

[Figure: records such as “JOHN SMITH” pass through a Data Corrupter (Pudjijono 2009), which outputs corrupted copies such as “jon smith” and “jon smyth”; labeled “x 100”]

SLIDE 20

Experimental results: accuracy

results withheld pending publication

SLIDE 21

Experimental results: run time

results withheld pending publication

SLIDE 22

Discussion

Approaches compared on accuracy and runtime:

  • binary field comparison & FS
  • approximate field comparison & Winkler‐FS

Limitations

  • Controlled environment

SLIDE 23

Open questions in record linkage

  1. Enforcing one‐to‐one linkage
  2. Decentralized record linkage

SLIDE 24

One‐to‐one linkage

Blocking → Field comparison → Record pair comparison → Record pair classification → Matches / Non‐matches

Reminder: the record pair classification step of record linkage

Vanderbilt records   Emory records   Record pair similarity “score”   Record pair classification
john smith           jon smyth       +7                               Match
john smith           taylor swift    +3                               Non‐match
lucille ball         jon smyth       +0                               Non‐match
lucille ball         taylor swift    +0                               Non‐match

SLIDE 25

One‐to‐one linkage: sample dataset

Set of records from Vanderbilt                Set of records from Emory

First Name   Last Name   City                 First Name   Last Name   City
john         smith       nashville            jon          smyth       nashville
bill         clinton     washington dc        taylor       swift       nashville
hillary      clinton     washington dc        william      clinton     washington dc

One‐to‐one linkage

Record from Vanderbilt           Record from Emory                Score   Classification
bill clinton, washington dc      william clinton, washington dc   +2      Match
hillary clinton, washington dc   william clinton, washington dc   +2      Match
john smith, nashville            william clinton, washington dc   +0      Non‐match
bill clinton, washington dc      taylor swift, nashville          +0      Non‐match
hillary clinton, washington dc   taylor swift, nashville          +0      Non‐match
john smith, nashville            taylor swift, nashville          +1      Non‐match
bill clinton, washington dc      jon smyth, nashville             +0      Non‐match
hillary clinton, washington dc   jon smyth, nashville             +0      Non‐match
john smith, nashville            jon smyth, nashville             +1      Non‐match

SLIDE 26

One‐to‐one linkage

[Figure: two panels, “predicted” and “actual”, each showing links drawn between the Vanderbilt records (john smith, bill clinton, hillary clinton) and the Emory records (jon smyth, taylor swift, william clinton)]

One‐to‐one linkage

[Figure: the nine candidate record pairs drawn as links between the two sets, labeled with their scores: +2, +2, +1, +1, +0, +0, +0, +0, +0]
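The one-to-one problem in this example can be made concrete with a short sketch. It assumes the score is the number of fields that agree exactly and that a pair is a match at a score of 2 or more; both rules are inferred from the scores and labels shown for the nine pairs, not stated in the talk. The code only detects the violation and does not propose a fix, since the talk lists enforcement as an open question.

```python
from collections import Counter

# Sample records as (first name, last name, city) tuples.
vanderbilt = [("john", "smith", "nashville"),
              ("bill", "clinton", "washington dc"),
              ("hillary", "clinton", "washington dc")]
emory = [("jon", "smyth", "nashville"),
         ("taylor", "swift", "nashville"),
         ("william", "clinton", "washington dc")]

THRESHOLD = 2  # inferred: +2 pairs are matches, +1 pairs are non-matches


def score(a, b):
    """Number of fields on which the two records agree exactly (an inferred rule)."""
    return sum(x == y for x, y in zip(a, b))


# Classify every pair independently, as in the record pair classification step.
matches = [(v, e) for v in vanderbilt for e in emory
           if score(v, e) >= THRESHOLD]

# A one-to-one linkage allows each record at most one partner.
links_per_emory = Counter(e for _, e in matches)
violations = [e for e, count in links_per_emory.items() if count > 1]
print(matches)
print(violations)
```

Classifying each pair independently links william clinton to both bill clinton and hillary clinton, while john smith and jon smyth, which agree only on city, fall below the threshold.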

SLIDE 27

Open questions in record linkage

  1. Enforcing one‐to‐one linkage
  2. Decentralized record linkage

Centralized framework

[Figure: two flows, each labeled “encoded records”]

SLIDE 28

De‐centralization of record linkage

[Figure: side‐by‐side diagrams labeled “centralized” and “de‐centralized”]

SLIDE 29

Conclusion

  • Privacy‐preserving record linkage can inform and improve medical research
  • Privacy‐preserving record linkage can improve patient care

References

  • Christen P, Pudjijono A. Accurate Synthetic Generation of Realistic Personal Information. Proceedings of the 13th Pacific‐Asia Conference on Advances in Knowledge Discovery and Data Mining. 2009.
  • Fellegi I, Sunter A. A theory for record linkage. J Amer Stat Assoc. 1969; 64: 1183–210.
  • Porter E, Winkler W. Approximate string comparison and its effect on an advanced record linkage system. Research Report RR97/02, U.S. Census Bureau, 1997.
  • Schnell R, Bachteler T, and Reiher J. Privacy‐preserving record linkage using Bloom filters. BMC Medical Informatics and Decision Making (9). 2009.

SLIDE 30

Acknowledgements

U.S. National Library of Medicine grant 2‐T15LM07450‐06
U.S. National Institutes of Health R01 LM009989

Thank You

Contact: ea.durham@vanderbilt.edu
http://hiplab.mc.vanderbilt.edu/

Publications:

  • E Durham, M Kantarcioglu, Y Xue, and B Malin. Private medical record linkage with approximate matching. Proceedings of the American Medical Informatics Association. 2010 November.