Linking with sensitive identifiers in a national statistical institute
Rainer Schnell
German Record Linkage Center, University of Duisburg-Essen
GSS Data Linking Symposium, Wed 23 Oct 2019, London, United Kingdom
Using registers and administrative data in Official Statistics
statistics.
Fundamental Rights 2018) summarizes the GDPR: "It also recognises the importance of the compilation of data in registries for research purposes (...). For this reason, the regulation allows the processing of data for these purposes, without the data subjects’ consent, provided the relevant safeguards are in place."
Attack models
according to a protocol, but they are curious.
structures.
The data environment
not be judged based on the data alone.
we have to consider (among other issues):
attack),
data.
"Whether data is anonymous or not (and therefore personal or not) is a function of the relationship between that data and its environment" (Elliot et al. 2018).
might cause more harm than applying a weak encryption for PPRL.
Bloom Filters for PPRL
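[Figure: the names SAHRA and SARAH are split into the bigrams SA, AH, HR, RA and SA, AR, RA, AH. Each bigram set is hashed into a Bloom filter (A for SAHRA, B for SARAH). Number of bits set: |A| = 7, |B| = 6, |A ∩ B| = 5.]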
Schnell/Bachteler/Reiher (2009)
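The idea in the figure can be sketched in a few lines. This is an illustration only, not the authors' implementation: unpadded bigrams, HMAC-SHA256 as the keyed hash, the filter length of 30 bits and k = 2 hash functions are all assumptions, so the bit counts will differ from those in the figure. Similarity is computed with the Dice coefficient, 2|A ∩ B| / (|A| + |B|); with the counts in the figure this gives 2 · 5 / (7 + 6) ≈ 0.77.

```python
import hashlib
import hmac


def bigrams(name):
    """Split a string into its set of unpadded bigrams."""
    return {name[i:i + 2] for i in range(len(name) - 1)}


def bloom_filter(tokens, key, length=30, k=2):
    """Return the set of bit positions that the tokens switch on.

    length and k are hypothetical parameters, not taken from the slide.
    """
    positions = set()
    for token in tokens:
        for j in range(k):
            # Keyed hash: the j-th hash function is simulated by prefixing j.
            message = f"{j}:{token}".encode("utf-8")
            digest = hmac.new(key, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % length)
    return positions


def dice(a, b):
    """Dice coefficient of two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


secret = b"shared secret key"  # hypothetical key known only to the data owners
filter_a = bloom_filter(bigrams("SAHRA"), secret)
filter_b = bloom_filter(bigrams("SARAH"), secret)
similarity = dice(filter_a, filter_b)
```

Because both names share the bigrams SA, RA and AH, the bits set by these bigrams appear in both filters, so the similarity stays high despite the transposed letters.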
Attacking encryptions
In his textbook Martin (2017) writes: "Probably the most important lesson is (...) that security of a cryptosystem is only ever relative to our understanding of attacks." and a few lines later: "(...) we can never guarantee security against an unknown future."
Progress in attacking PPRL methods
q-grams of identifiers (Vidanage et al. 2019).
data (ONS 2013) was shown to have some vulnerabilities by Culnane/Rubinstein/Teague (2017) using subgraph matching.
standardized personal identifiers.
Developing attacks needs time
[Timeline figure]
Bloom Filter PPRL (Schnell 2009)
    CSSP-Attack (Kuzu 2011)
    Bigram-Frequency-Attack (Niedermeyer 2014)
    Pattern-Mining-Attack (Christen 2019)
Matching anonymous data (ONS 2013)
    Subgraph-Matching (Culnane 2017)
?
PPRL using Multiple Match-Keys
personal identifiers.
individual.
MinHash Bloom filters
several minwise hashes (MinHash).
1. First, a random key is hashed and represented as a bit vector.
2. This bit vector is split into sub-parts of a certain length (i.e. 4 × 8 bit splits).
3. For each sub-part, a randomly filled lookup table is built.
4. This table contains the hash values.
5. For each element (e.g. a single bigram), this is repeated k independent times, where k is the desired number of hash functions.
6. The minimum hash value of each of the k steps is retained and XORed with l random numbers, where l is the desired output bit vector length.
7. The least significant bit of each resulting XORed number is used.
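One possible reading of steps 1 to 6, up to the point where the minimum is retained, is sketched here. It is not the method's reference implementation. Assumptions: each bigram is first mapped to a 32-bit integer with HMAC-SHA256, the 32 bits are split into 4 × 8-bit sub-parts, each sub-part indexes its own random table of 256 entries, and the table entries are XORed (tabulation hashing). The final reduction to an l-bit vector (steps 6 and 7) is left out, since the list does not say how the k minima are combined.

```python
import hashlib
import hmac
import random


def bigrams(text):
    """Split a string into its set of unpadded bigrams."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def make_tables(rng, parts=4):
    """One randomly filled lookup table (256 entries) per 8-bit sub-part."""
    return [[rng.getrandbits(32) for _ in range(256)] for _ in range(parts)]


def tabulation_hash(value, tables):
    """Look up every 8-bit sub-part of a 32-bit value and XOR the entries."""
    result = 0
    for part, table in enumerate(tables):
        byte = (value >> (8 * part)) & 0xFF
        result ^= table[byte]
    return result


def token_to_int(token, key):
    """Map a token to a 32-bit integer with a keyed hash (assumption)."""
    digest = hmac.new(key, token.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def minhash_signature(tokens, key, k=8, seed=2019):
    """Return the k minimum hash values, one per independent hash function."""
    values = [token_to_int(token, key) for token in tokens]
    if not values:
        raise ValueError("at least one token is required")
    rng = random.Random(seed)  # hypothetical shared seed for the tables
    table_sets = [make_tables(rng) for _ in range(k)]
    return [min(tabulation_hash(v, tables) for v in values)
            for tables in table_sets]


def estimated_jaccard(sig_a, sig_b):
    """Share of matching minima, the usual MinHash similarity estimate."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Both parties must use the same key and the same table seed; otherwise the signatures of identical identifiers would not agree.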
Encoding numerical values into Bloom filters
interest into a Bloom Filter.
estimated by the intersection of the Bloom Filters.
Method
parameter choice.
tolerated difference between two numerical value pairs and depends on the intended
\[ d_{max} = \frac{max_v - min_v}{100} \cdot p \]

\[ L_i = \begin{cases} v - (b - i) \cdot d_{intv} & i < b \\ v + (i - b) \cdot d_{intv} & i > b \\ v & i = b \end{cases} \]
Example
Two intervals $L_1$ and $L_2$ are created for the values $v_1$ and $v_2$. The width of the intervals is $b = 2$, with a maximum tolerated difference of $d_{max} = 4$. This gives interval steps $d_{intv}$ of 1.

\[ v_1 = 25, \quad v_2 = 26, \quad d_{max} = 4, \quad b = 2, \quad d_{intv} = \frac{d_{max}}{2 \cdot b} = 1 \]

This will generate the lists:

\[ L_1 = [23, 24, 25, 26, 27] \]
\[ L_2 = [24, 25, 26, 27, 28] \]

These lists of numbers are mapped into Bloom filters.
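The list construction in this example can be reproduced with a short sketch. The function names are hypothetical, and the index i is assumed to run from 0 to 2b, which yields the five values shown above. For i < b, the expression v − (b − i) · d_intv equals v + (i − b) · d_intv, so one formula covers all three cases.

```python
def max_tolerated_difference(min_v, max_v, p):
    """d_max as p percent of the value range."""
    return (max_v - min_v) / 100 * p


def neighbour_values(v, d_max, b):
    """List of 2b + 1 values around v, spaced d_intv = d_max / (2 * b) apart."""
    d_intv = d_max / (2 * b)
    return [v + (i - b) * d_intv for i in range(2 * b + 1)]


list_1 = neighbour_values(25, d_max=4, b=2)  # values 23 to 27
list_2 = neighbour_values(26, d_max=4, b=2)  # values 24 to 28
shared = set(list_1) & set(list_2)           # values present in both lists
```

Here four of the five values coincide, so Bloom filters built from the two lists share most of their bits. As the distance between two values grows, fewer list entries coincide, and beyond d_max the lists no longer overlap at all.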
Encoding Hierarchical Codes
into Bloom filters.1
1 Schnell/Borgs: Encoding hierarchical classification codes for Privacy-preserving Record Linkage using Bloom filters, DINA 2019
Constructing Hierarchy Preserving Bloom filters (HPBFs)
To be encoded: 2143
Code length j = 4
Stream length modifier c = 2
BF length l = 40
Positional index i ∈ {j, j − 1, j − 2, ..., 1}

The code prefix is built digit by digit: Code = Code[j − i + 1] for i = j, then Code += Code[j − i + 1] for each smaller i. Each prefix is hashed, and the hash seeds a PRNG that draws n = c ∗ i positions from the range {1 ... l}.

i = 4: prefix 2 (digit added: 2), H4 = HMAC(2, key), S4 = PRNG(Seed = H4, n = 8) = {1, 2, 5, 7, 10, 16, 19, 28}
i = 3: prefix 21 (digit added: 1), H3 = HMAC(21, key), S3 = PRNG(Seed = H3, n = 6) = {7, 10, 14, 17, 20, 21}
i = 2: prefix 214 (digit added: 4), H2 = HMAC(214, key), S2 = PRNG(Seed = H2, n = 4) = {17, 21, 26, 27}
i = 1: prefix 2143 (digit added: 3), H1 = HMAC(2143, key), S1 = PRNG(Seed = H1, n = 2) = {32, 39}

Resulting Bloom filter (positions 1 to 40), bits set at: 1, 2, 5, 7, 10, 14, 16, 17, 19, 20, 21, 26, 27, 28, 32, 39.
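The construction can be sketched as follows; this is an illustration under stated assumptions, not the implementation of Schnell/Borgs. HMAC-SHA256 stands in for the HMAC, Python's random.Random stands in for the PRNG, and positions are drawn without replacement. The function names are hypothetical, and the drawn positions will not match the sets on the slide.

```python
import hashlib
import hmac
import random


def prefix_positions(prefix, i, key, c=2, l=40):
    """Draw n = c * i positions from {1 ... l}, seeded by HMAC(prefix, key)."""
    digest = hmac.new(key, prefix.encode("utf-8"), hashlib.sha256).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    # Sampling without replacement is an assumption; it requires c * i <= l.
    return set(rng.sample(range(1, l + 1), c * i))


def hpbf(code, key, c=2, l=40):
    """Set of bit positions of a hierarchy preserving Bloom filter."""
    j = len(code)
    positions = set()
    for i in range(j, 0, -1):
        # i = j gives the first digit only, i = 1 gives the full code.
        prefix = code[: j - i + 1]
        positions |= prefix_positions(prefix, i, key, c, l)
    return positions


secret = b"shared secret key"  # hypothetical key
bits_2143 = hpbf("2143", secret)
bits_2149 = hpbf("2149", secret)  # differs only in the last digit
```

Because short prefixes receive the largest share of positions (n = c ∗ i is largest for i = j), two codes that agree on their leading digits share most of their bits, while a difference at the lowest level of the hierarchy changes only a few.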
Technical summary
problem.
POB) and a separate password has been impossible to attack (so far).
anonymisation".
willing to invest years or millions of Euros.
environment for quite a while.
Conclusion
"Privacy is a goal that we cannot achieve. That is no excuse not to strive for it. We do the same with other goals, such as justice." (Karsten Weber 2018)
References
Culnane, Chris/Benjamin I. P. Rubinstein/Vanessa Teague (2017): Vulnerabilities in the Use of Similarity Tables in Combination with Pseudonymisation to Preserve Data Privacy in the UK Office for National Statistics’ Privacy-Preserving Record Linkage. online at arXiv. url: https://arxiv.org/abs/1712.00871.
Culnane, Chris/Benjamin I. P. Rubinstein/Vanessa Teague (2018): Options for encoding names for data linking at the Australian Bureau of Statistics. arXiv: 1802.07975 [cs.CR].
Elliot, Mark/Elaine Mackey/Kieron O’Hara/Caroline Tudor (2016): The Anonymisation Decision-Making Framework. Manchester: UKAN.
Elliot, Mark/Kieron O’Hara/Charles Raab/Christine M. O’Keefe/Elaine Mackey/Chris Dibben/Heather Gowans/Kingsley Purdam/Karen McCullagh (2018): Functional Anonymisation: Personal Data and the Data Environment. In: Computer Law & Security Review 34 (2): 204–221.
European Union Agency for Fundamental Rights (2018): Handbook on European Data Protection Law: 2018 Edition. Luxembourg: Publications Office of the European Union.
Kuzu, Mehmet/Murat Kantarcioglu/Elizabeth Durham/Bradley Malin (2011): A constraint satisfaction cryptanalysis of Bloom filters in private record linkage. In: The 11th Privacy Enhancing Technologies Symposium: 27–29 July 2011; Waterloo, Canada.
Martin, Keith (2017): Everyday Cryptography: Fundamental Principles and Applications. Oxford: Oxford University Press.
Niedermeyer, Frank/Simone Steinmetzer/Martin Kroll/Rainer Schnell (2014): Cryptanalysis of Basic Bloom Filters Used for Privacy Preserving Record Linkage. In: Journal of Privacy and Confidentiality 6 (2): 59–69.
Randall, Sean M./Adrian P. Brown/Anna M. Ferrante/James H. Boyd (2019): Privacy Preserving Linkage Using Multiple Dynamic Match Keys. In: International Journal of Population Data Science 4 (1): 15.
Ritchie, Felix/Jim Smith (2018): Confidentiality and Linked Data. Paper published as part of the National Statistician’s Quality Review. London: 1–34. url: https://gss.civilservice.gov.uk/wp-content/uploads/2018/12/12-12-18_FINAL_Jim_Smith_Felix_Ritchie_article.pdf.
Schnell, Rainer/Tobias Bachteler/Jörg Reiher (2009): Privacy-preserving record linkage using Bloom filters. In: BMC Medical Informatics and Decision Making 9 (41).
Smith, Duncan (2017): Secure pseudonymisation for privacy-preserving probabilistic record
Vatsalan, Dinusha/Peter Christen (2016): Privacy-Preserving Matching of Similar Patients. In: Journal of Biomedical Informatics 59: 285–298.
Vidanage, Anushka/Thilina Ranbaduge/Peter Christen/Rainer Schnell (2019): Efficient Pattern Mining Based Cryptanalysis for Privacy-Preserving Record Linkage. In: 2019 IEEE 35th International Conference on Data Engineering ICDE 2019. Los Alamitos: IEEE: 1698–1701.