Linking with sensitive identifiers in a national statistical institute
Rainer Schnell
German Record Linkage Center, University of Duisburg-Essen
GSS Data Linking Symposium, Wed 23 Oct 2019, London, United Kingdom
Using registers and administrative data in Official Statistics
• Population-covering databases are increasingly used in Official Statistics.
• Linking these databases on a micro-level (persons) is a powerful research tool.
• Luckily, the GDPR is quite favourable for research in general and especially for statistics.
• The Handbook on European data protection law (European Union Agency for Fundamental Rights 2018) summarizes the GDPR: "It also recognises the importance of the compilation of data in registries for research purposes (...). For this reason, the regulation allows the processing of data for these purposes, without the data subjects' consent, provided the relevant safeguards are in place."
• The GDPR repeatedly refers to the use of pseudonyms as a standard technique.
• Linking with pseudonyms is denoted as Privacy-Preserving Record Linkage (PPRL).
• So the problem is the security of PPRL.
Attack models
• A security analysis requires a model for an attack.
• We don't have standard models of attacks for research or NSI settings.
• Usually, we hide that with an "honest, but curious" assumption: All parties behave according to a protocol, but they are curious.
• What exactly that "curiosity" implies is unclear.
• Does that limit the amount of effort needed for an attack?
• Do we have to take into account an irrational effort?
• Then we have an "intruder model".
• Who will attack an NSI? State actors? Hackers?
• This kind of attack is most likely handled by technical measures.
• Prevention of insider attacks is challenging and would cause complicated organisational structures.
• In the real world (outside academia or NSIs), most data leakages seem to be due to error or human engineering.
The data environment
• The security or privacy of a technique, or of the result of applying a technique, cannot be judged based on the data alone.
• In a series of publications, Mark Elliot and others (Elliot et al. 2016, 2018) made clear that we have to consider (among other issues):
  • the motivation of an adversary,
  • the potential consequences of re-identification (which will affect the motivation for an attack),
  • the auxiliary data that could be used for re-identification,
  • the governance structures, data security and other infrastructural properties surrounding the data.
• Therefore: "Whether data is anonymous or not (and therefore personal or not) is a function of the relationship between that data and its environment" (Elliot et al. 2018).
• Currently, this insight is rarely used in the evaluation of PPRL solutions.
• Hence, the evaluations by cryptographers are often misleading: Not linking data in an NSI might cause more harm than applying a weak encryption for PPRL.
Bloom Filters for PPRL
[Figure: the names SARAH and SAHRA are split into bigrams (SA, AR, RA, AH and SA, AH, HR, RA), and each bigram sets bits in one of two Bloom filters, A and B. Numbers of set bits: |A| = 7, |A ∩ B| = 5, |B| = 6.]
Schnell/Bachteler/Reiher (2009)
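The bigram encoding on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: the filter length, the number of hash functions `k`, the key, the choice of HMAC-SHA1 and HMAC-SHA256 for double hashing, and the use of the Dice coefficient as the similarity measure are all assumptions made for the example.

```python
import hashlib
import hmac


def bigrams(name):
    """Split a string into its set of consecutive 2-grams (no padding)."""
    return {name[i:i + 2] for i in range(len(name) - 1)}


def bloom_encode(name, length=30, k=2, key=b"shared-secret"):
    """Return the set of bit positions set by hashing each bigram k times.

    length, k and key are illustrative assumptions, not recommended values.
    """
    positions = set()
    for gram in bigrams(name):
        data = gram.encode("utf-8")
        h1 = int.from_bytes(hmac.new(key, data, hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(key, data, hashlib.sha256).digest(), "big")
        # Double hashing: the i-th position is (h1 + i * h2) mod length.
        for i in range(k):
            positions.add((h1 + i * h2) % length)
    return positions


def dice(a, b):
    """Dice coefficient of two sets of set-bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


similarity = dice(bloom_encode("SARAH"), bloom_encode("SAHRA"))
```

With the bit counts shown in the figure, a Dice coefficient would be 2 · 5 / (7 + 6) ≈ 0.77. The value computed by the sketch depends on the assumed parameters and will generally differ.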
Attacking encryptions
In his textbook, Martin (2017) writes:
"Probably the most important lesson is (...) that security of a cryptosystem is only ever relative to our understanding of attacks."
and a few lines later:
"(...) we can never guarantee security against an unknown future."
Progress in attacking PPRL methods
• The initial attack on Bloom filters was due to Kuzu et al. (2011).
• This attack is basically a frequency alignment of cleartext and basic Bloom filters.
• The next attack was due to Niedermeyer et al. (2014).
• They used a frequency analysis of encoded bigrams and aligned these frequencies.
• The most recent attack on Bloom filters used pattern mining of bit patterns set by q-grams of identifiers (Vidanage et al. 2019).
• Regarding approaches not based on Bloom filters, the ONS method for linking anonymous data (ONS 2013) was shown to have some vulnerabilities by Culnane/Rubinstein/Teague (2017) using subgraph matching.
• This approach is currently under test for attacking methods based on combining hashes of standardized personal identifiers.
Developing attacks needs time
[Figure: timeline of PPRL methods and the attacks on them. Bloom filter PPRL (Schnell 2009): CSSP attack (Kuzu 2011), bigram frequency attack (Niedermeyer 2014), pattern-mining attack (Christen 2019). Matching anonymous data (ONS 2013): subgraph matching (Culnane 2017), followed by "?".]
PPRL using Multiple Match-Keys
• Randall et al. (2019) suggested a new PPRL method based on multiple match-keys.
• Based on previous clear-text linkage, a set of keys for linkage is selected.
• For each record, a number of hashes are created, each from a different set of concatenated personal identifiers.
• These hashes are denoted as match-keys. Each match-key will directly identify an individual.
• Missing identifiers cause blank match-keys.
• To reduce the number of match-keys, redundant combinations are removed.
• Two records with the same value for a particular match-key are designated a match.
• The protocol does not perform as well as field-level Bloom filters, but the authors think it may offer better privacy protection.
• However, currently no security analysis is available.
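The match-key idea can be illustrated with a short Python sketch. The field names, the three key definitions in `MATCH_KEYS`, the normalisation step, the separator and the use of HMAC-SHA256 are hypothetical choices for this example. In the method described above, the set of keys is selected from previous clear-text linkage, which is not shown here.

```python
import hashlib
import hmac

# Hypothetical match-key definitions: each tuple lists the fields to concatenate.
MATCH_KEYS = [
    ("first_name", "last_name", "dob"),
    ("last_name", "dob", "postcode"),
    ("first_name", "dob", "postcode"),
]


def match_keys(record, key=b"shared-secret"):
    """Return one keyed hash per match-key definition.

    A match-key is left blank (None) if any of its fields is missing.
    """
    result = []
    for fields in MATCH_KEYS:
        values = [record.get(f) for f in fields]
        if any(v is None or v == "" for v in values):
            result.append(None)  # blank match-key
            continue
        message = "|".join(str(v).strip().upper() for v in values)
        digest = hmac.new(key, message.encode("utf-8"), hashlib.sha256)
        result.append(digest.hexdigest())
    return result


def is_match(rec_a, rec_b):
    """Two records match if they agree on at least one non-blank match-key."""
    pairs = zip(match_keys(rec_a), match_keys(rec_b))
    return any(a is not None and a == b for a, b in pairs)


r1 = {"first_name": "Sarah", "last_name": "Smith", "dob": "1980-01-01", "postcode": "AB1 2CD"}
r2 = {"first_name": "Sahra", "last_name": "Smith", "dob": "1980-01-01", "postcode": "AB1 2CD"}
r3 = {"first_name": "Sarah", "last_name": "Smith", "postcode": "AB1 2CD"}
```

Here `r1` and `r2` differ in the first name but still agree on the second match-key, while `r3` lacks the date of birth, so all of its keys in this hypothetical set are blank.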
MinHash Bloom filters
• D. Smith (2017) suggested estimating the Jaccard similarity of two strings by using several minwise hashes (MinHash).
• To adapt MinHash to Bloom filters, several steps are required:
  1. First, a random key is hashed and represented as a bit vector.
  2. This bit vector is split into sub-parts of a certain length (e.g. 4 splits of 8 bits each).
  3. For each sub-part, a randomly filled lookup table is built.
  4. This table contains the hash values.
  5. For each element (e.g. a single bigram), this is repeated k independent times, where k is the desired number of hash functions.
  6. The minimum hash value of each of the k steps is retained and XORed with l random numbers, where l is the desired output bit vector length.
  7. The least significant bit of each resulting XORed number is used.
• This gives l independent but similarity-preserving bits.
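The general principle behind these steps, keeping one bit of a minwise hash per output position, can be sketched as follows. This is a deliberately simplified 1-bit MinHash and not the construction described above: it replaces the lookup tables and the XOR step with one keyed BLAKE2b hash per output bit. The function names, the output length and the seeding scheme are assumptions for illustration only.

```python
import hashlib


def bigrams(s):
    """Set of consecutive 2-grams of a string (assumes len(s) >= 2)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def keyed_hash(element, seed):
    """64-bit hash of an element; the seed selects the hash function."""
    digest = hashlib.blake2b(
        element.encode("utf-8"),
        digest_size=8,
        key=seed.to_bytes(8, "big"),
    ).digest()
    return int.from_bytes(digest, "big")


def one_bit_minhash(s, length=64):
    """For each output position, keep the lowest bit of the minimum hash."""
    grams = bigrams(s)
    return [min(keyed_hash(g, j) for g in grams) & 1 for j in range(length)]


def estimate_jaccard(bits_a, bits_b):
    """Bits agree with probability J + (1 - J) / 2, hence J ~ 2 * p - 1."""
    agree = sum(x == y for x, y in zip(bits_a, bits_b)) / len(bits_a)
    return max(0.0, 2 * agree - 1)


same = estimate_jaccard(one_bit_minhash("SARAH"), one_bit_minhash("SARAH"))
close = estimate_jaccard(one_bit_minhash("SARAH"), one_bit_minhash("SAHRA"))
```

Two strings share the minimum under a given hash function with probability equal to their Jaccard similarity, so the fraction of agreeing bits carries the similarity while each bit on its own reveals little. With only 64 bits the estimate is noisy.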
MinHash Bloom filters
• Empirical tests for this approach have not been published yet.
• It is theoretically secure if the independence assumption for each bit position holds.
• However, as this method retains similarities, frequency attacks might still work.
• This is the subject of ongoing research.
Encoding numerical values into Bloom filters
• Standard Bloom filters can compare numerical identifiers only as strings.
• This is neither distance- nor order-preserving.
• Recently, Vatsalan/Christen (2016) suggested an encoding of numerical values.
• This approach allows calculating approximate distances on encrypted data.
• The idea is to map a generated interval, centered at the numerical value of interest, into a Bloom filter.
• The overlap between two intervals, a proxy for the numerical distance, is estimated by the intersection of the Bloom filters.
Method
• An interval of width $2 \cdot b$ is built for each value $v$, where $b$ is a domain-dependent parameter choice.
• The width of each interval step is given by
  \[ d_{intv} = \frac{d_{max}}{2 \cdot b}, \]
  where $d_{max}$ is the highest tolerated difference between two numerical value pairs and depends on the intended application. $d_{max}$ can be calculated as:
  \[ d_{max} = (max_v - min_v) \cdot \frac{p}{100}. \]
• $p$ is typically set to 5.
• The elements $L_i$ of the interval $L$ with $0 \le i \le 2 \cdot b$ are then given by:
  \[ L_i = \begin{cases} v - (b - i) \cdot d_{intv} & i < b \\ v & i = b \\ v + (i - b) \cdot d_{intv} & i > b \end{cases} \]
Example
Two intervals $L_1$ and $L_2$ are created for the values $v_1$ and $v_2$. The interval parameter is $b = 2$, with a maximum tolerated difference of $d_{max} = 4$. This gives interval steps $d_{intv}$ of 1.
\[ v_1 = 25, \quad v_2 = 26, \quad d_{max} = 4, \quad b = 2, \quad d_{intv} = \frac{d_{max}}{2 \cdot b} = 1 \]
This will generate the lists:
\[ L_1 = [23, 24, 25, 26, 27] \]
\[ L_2 = [24, 25, 26, 27, 28] \]
These lists of numbers are mapped into Bloom filters.
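The interval construction and the example values can be reproduced with a small Python sketch. The function names are hypothetical, and the overlap is computed directly on the lists rather than on Bloom filters, as a stand-in for the intersection of the encoded filters.

```python
def interval(v, b, d_max):
    """Return the 2*b + 1 interval elements L_0 .. L_2b centered at v.

    The step width is d_intv = d_max / (2 * b). The three cases of the
    definition (i < b, i = b, i > b) collapse to v + (i - b) * d_intv.
    """
    d_intv = d_max / (2 * b)
    return [v + (i - b) * d_intv for i in range(2 * b + 1)]


def overlap(l1, l2):
    """Number of elements shared by two intervals."""
    return len(set(l1) & set(l2))


def dice(l1, l2):
    """Dice coefficient of the two element sets (an assumed similarity measure)."""
    return 2 * overlap(l1, l2) / (len(set(l1)) + len(set(l2)))


L1 = interval(25, b=2, d_max=4)  # 23, 24, 25, 26, 27
L2 = interval(26, b=2, d_max=4)  # 24, 25, 26, 27, 28
shared = overlap(L1, L2)         # 4 shared elements
```

In an actual encoding, each list element would then be hashed into the Bloom filter in the same way as a q-gram. The sketch ignores one practical issue: two intervals only share elements if their values fall on a common grid, which holds in this example because the values differ by a multiple of the step width.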
Encoding Hierarchical Codes
• Sometimes, linkage has to use hierarchical identifiers such as ISCO or ICD codes.
• Codes of this kind are usually encoded
  • either as an unordered set of q-grams
  • or as an exact hash.
• Recently, we suggested an encoding technique (HPBFs) for mapping this kind of code into Bloom filters. [1]
• This mapping preserves similarities of hierarchical codes despite encryption.
• Using HPBFs in PPRL settings will improve linkage quality compared to previous methods.
[1] Schnell/Borgs: Encoding hierarchical classification codes for Privacy-preserving Record Linkage using Bloom filters, DINA 2019