
Using registers and administrative data in Official Statistics

Linking with sensitive identifiers in a national statistical institute. Rainer Schnell, German Record Linkage Center, University of Duisburg-Essen. GSS Data Linking Symposium, Wed 23 Oct 2019, London, United Kingdom.


  1. Linking with sensitive identifiers in a national statistical institute. Rainer Schnell, German Record Linkage Center, University of Duisburg-Essen. GSS Data Linking Symposium, Wed 23 Oct 2019, London, United Kingdom.

  2. Using registers and administrative data in Official Statistics
     • Population-covering databases are increasingly used in Official Statistics.
     • Linking these databases at the micro level (persons) is a powerful research tool.
     • Luckily, the GDPR is quite favourable for research in general and especially for statistics.
     • The Handbook on European data protection law (European Union Agency for Fundamental Rights 2018) summarizes the GDPR: "It also recognises the importance of the compilation of data in registries for research purposes (...). For this reason, the regulation allows the processing of data for these purposes, without the data subjects' consent, provided the relevant safeguards are in place."
     • The GDPR repeatedly refers to the use of pseudonyms as a standard technique.
     • Linking with pseudonyms is denoted as Privacy-Preserving Record Linkage (PPRL).
     • So the problem is the security of PPRL.

  3. Attack models
     • A security analysis requires a model for an attack.
     • We don't have standard models of attacks for research or NSI settings.
     • Usually, we hide that with an "honest, but curious" assumption: all parties behave according to a protocol, but they are curious.
     • What exactly that "curiosity" implies is unclear.
     • Does that limit the amount of effort invested in an attack?
     • Do we have to take into account an irrational effort?
     • Then we have an "intruder model".
     • Who will attack an NSI? State actors? Hackers?
     • This kind of attack is most likely handled by technical measures.
     • Prevention of insider attacks is challenging and would require complicated organisational structures.
     • In the real world (outside academia or NSIs), most data leakages seem to be due to error or human engineering.

  4. The data environment
     • The security or privacy of a technique, or of the result of applying a technique, cannot be judged based on the data alone.
     • In a series of publications, Mark Elliot and others (Elliot et al. 2016, 2018) made clear that we have to consider (among other issues):
         • the motivation of an adversary,
         • the potential consequences of re-identification (which will affect the motivation for an attack),
         • the auxiliary data that could be used for re-identification,
         • the governance structures, data security and other infrastructural properties surrounding the data.
     • Therefore: "Whether data is anonymous or not (and therefore personal or not) is a function of the relationship between that data and its environment" (Elliot et al. 2018).
     • Currently, this insight is rarely used in the evaluation of PPRL solutions.
     • Hence, the evaluations by cryptographers are often misleading: not linking data in an NSI might cause more harm than applying a weak encryption for PPRL.

  5. Bloom Filters for PPRL
     [Figure: the bigrams of the names SAHRA (SA, AH, HR, RA) and SARAH (SA, AR, RA, AH) are hashed into two Bloom filters A and B. Bit counts shown: |A| = 7, |A ∩ B| = 5, |B| = 6. Source: Schnell/Bachteler/Reiher (2009).]
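The encoding in the figure can be sketched roughly as follows. This is a minimal illustration, not the configuration used by Schnell/Bachteler/Reiher (2009): the filter length of 30 bits, the two keyed HMAC-SHA-256 hash functions, the secret key, and the function names are all assumptions made for the example. Similarity is computed with the Dice coefficient 2·|A ∩ B| / (|A| + |B|); reading the slide's counts as |A| = 7, |A ∩ B| = 5 and |B| = 6, this gives 10/13 ≈ 0.77.

```python
import hashlib
import hmac


def bigrams(name):
    """Return the set of unpadded bigrams, e.g. SARAH -> {SA, AR, RA, AH}."""
    return {name[i:i + 2] for i in range(len(name) - 1)}


def bloom_filter(name, length=30, k=2, key=b"illustrative-secret"):
    """Hash every bigram with k keyed hash functions.

    Returns the set of bit positions that are set to 1.
    length, k and key are assumed values chosen for illustration only.
    """
    positions = set()
    for gram in bigrams(name):
        for j in range(k):
            # Prefixing the message with j gives k different hash functions.
            message = bytes([j]) + gram.encode("utf-8")
            digest = hmac.new(key, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest, "big") % length)
    return positions


def dice(a, b):
    """Dice coefficient of two sets of set-bit positions."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


A = bloom_filter("SAHRA")
B = bloom_filter("SARAH")
print(len(A), len(A & B), len(B), round(dice(A, B), 3))
```

Because three of the four bigrams are shared, the two filters overlap heavily even though the names differ, which is what makes the encoding error-tolerant and also what later frequency-based attacks exploit.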

  6. Attacking encryptions
     In his textbook, Martin (2017) writes: "Probably the most important lesson is (...) that security of a cryptosystem is only ever relative to our understanding of attacks." And a few lines later: "(...) we can never guarantee security against an unknown future."

  7. Progress in attacking PPRL methods
     • The initial attack on Bloom filters was due to Kuzu et al. (2011).
     • This attack is basically a frequency alignment of cleartext and basic Bloom filters.
     • The next attack was due to Niedermeyer et al. (2014).
     • They used a frequency analysis of encoded bigrams and aligned these frequencies.
     • The most recent attack on Bloom filters used pattern mining of bit patterns set by q-grams of identifiers (Vidanage et al. 2019).
     • Regarding approaches not based on Bloom filters, the ONS method for linking anonymous data (ONS 2013) was shown to have some vulnerabilities by Culnane/Rubinstein/Teague (2017) using subgraph matching.
     • This approach is currently being tested for attacking methods based on combining hashes of standardized personal identifiers.

  8. Developing attacks needs time
     [Figure: timeline of methods and attacks. Bloom Filter PPRL (Schnell 2009): CSSP attack (Kuzu 2011), bigram-frequency attack (Niedermeyer 2014), pattern-mining attack (Christen 2019). Matching anonymous data (ONS 2013): subgraph-matching attack (Culnane 2017), followed by "?".]

  9. PPRL using Multiple Match-Keys
     • Randall et al. (2019) suggested a new PPRL method based on multiple match-keys.
     • Based on previous clear-text linkage, a set of keys for linkage is selected.
     • For each record, a number of hashes are created, each from a different set of concatenated personal identifiers.
     • These hashes are denoted as match-keys. Each match-key will directly identify an individual.
     • Missing identifiers cause blank match-keys.
     • To reduce the number of match-keys, redundant combinations are removed.
     • Two records with the same value for a particular match-key are designated a match.
     • The protocol does not perform as well as field-level Bloom filters, but the authors think it may offer better privacy protection.
     • However, currently no security analysis is available.
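The match-key idea can be illustrated with a small sketch. It is not the protocol of Randall et al. (2019): the field names, the three field combinations, the normalisation step, the secret key and the use of HMAC-SHA-256 are all hypothetical choices made for this example, and the records are made up.

```python
import hashlib
import hmac

# Hypothetical key definitions: each tuple lists the fields that are
# concatenated and hashed into one match-key.
MATCH_KEY_FIELDS = [
    ("first_name", "last_name", "date_of_birth"),
    ("last_name", "date_of_birth", "postcode"),
    ("first_name", "date_of_birth", "postcode"),
]


def match_keys(record, secret=b"illustrative-secret"):
    """Return one keyed hash per field combination (None = blank match-key)."""
    keys = []
    for fields in MATCH_KEY_FIELDS:
        values = [record.get(field) for field in fields]
        if any(value is None or value == "" for value in values):
            keys.append(None)  # a missing identifier yields a blank match-key
            continue
        message = "|".join(v.strip().upper() for v in values).encode("utf-8")
        keys.append(hmac.new(secret, message, hashlib.sha256).hexdigest())
    return keys


def is_match(keys_a, keys_b):
    """Two records match if they agree on at least one non-blank match-key."""
    return any(a is not None and a == b for a, b in zip(keys_a, keys_b))


# Made-up records for illustration.
r1 = {"first_name": "Sarah", "last_name": "Miller",
      "date_of_birth": "1980-02-01", "postcode": "AB1 2CD"}
r2 = {"first_name": "Sahra", "last_name": "Miller",
      "date_of_birth": "1980-02-01", "postcode": "AB1 2CD"}
r3 = {"first_name": "Sahra", "last_name": "Miller",
      "date_of_birth": "1980-02-01", "postcode": ""}

print(is_match(match_keys(r1), match_keys(r2)))  # agree on the key without first name
print(is_match(match_keys(r1), match_keys(r3)))  # typo plus missing postcode
```

The sketch shows why several keys are needed: an exact hash tolerates no typing error, so error tolerance comes only from having another key that omits the erroneous field.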

  10. MinHash Bloom filters
     • D. Smith (2017) suggested estimating the Jaccard similarity of two strings by using several minwise hashes (MinHash).
     • To adapt MinHash to Bloom filters, several steps are required:
         1. First, a random key is hashed and represented as a bit vector.
         2. This bit vector is split into sub-parts of a certain length (e.g. 4 × 8-bit splits).
         3. For each sub-part, a randomly filled lookup table is built.
         4. This table contains the hash values.
         5. For each element (e.g. a single bigram), this is repeated k independent times, where k is the desired number of hash functions.
         6. The minimum hash value of each of the k steps is retained and XORed with l random numbers, where l is the desired output bit vector length.
         7. The least significant bit of each resulting XORed number is used.
     • This gives l independent but similarity-preserving bits.
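The principle behind similarity-preserving bits can be shown with a simplified 1-bit minwise hashing sketch. It is not Smith's construction: instead of lookup tables and XOR with random numbers, it assumes one independent keyed SHA-256 hash function per output bit, and the key, the length l = 512 and the function names are illustrative. Under the idealised assumption of independent, uniform hash values, two bigram sets with Jaccard similarity J agree on a given bit with probability (1 + J) / 2, so J can be estimated as 2 · agreement − 1.

```python
import hashlib


def bigrams(s):
    """Set of unpadded bigrams of a string (needs at least two characters)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def keyed_hash(position, gram, key=b"illustrative-secret"):
    """A different 64-bit hash function for every output bit position."""
    data = key + position.to_bytes(4, "big") + gram.encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def one_bit_minhash(s, l=512):
    """For each of l hash functions keep only the lowest bit of the minimum."""
    grams = bigrams(s)
    return [min(keyed_hash(pos, g) for g in grams) & 1 for pos in range(l)]


def agreement(x, y):
    """Fraction of bit positions on which two signatures agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


sig_a = one_bit_minhash("SARAH")
sig_b = one_bit_minhash("SAHRA")   # shares 3 of 5 distinct bigrams, J = 0.6
sig_c = one_bit_minhash("JONES")   # shares no bigram with SARAH, J = 0

print(2 * agreement(sig_a, sig_b) - 1)  # rough estimate of J for SARAH/SAHRA
print(2 * agreement(sig_a, sig_c) - 1)  # rough estimate of J for SARAH/JONES
```

Each output bit on its own looks like a fair coin flip, yet similar inputs still agree on more bits than dissimilar ones, which is why frequency-based attacks cannot be ruled out for this kind of encoding.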

  11. MinHash Bloom filters
     • Empirical tests for this approach have not been published yet.
     • It is theoretically secure if the independence assumption for each bit position holds.
     • However, as this method retains similarities, frequency attacks might still work.
     • This is the subject of ongoing research.

  12. Encoding numerical values into Bloom filters
     • Standard Bloom filters can compare numerical identifiers only as strings.
     • This is neither distance- nor order-preserving.
     • Recently, Vatsalan/Christen (2016) suggested an encoding of numerical values.
     • This approach allows calculating approximate distances on encrypted data.
     • The idea is to generate an interval centred on the numerical value of interest and map it into a Bloom filter.
     • The overlap between two intervals, which serves as a proxy for the numerical distance, is estimated by the intersection of the Bloom filters.

  13. Method
     • An interval of width $2 \cdot b$ is built for each value $v$, where $b$ is a domain-dependent parameter choice.
     • The width of each interval step is given by $d_{\mathrm{intv}} = d_{\mathrm{max}} / (2 \cdot b)$, where $d_{\mathrm{max}}$ is the highest tolerated difference between two numerical values and depends on the intended application.
     • $d_{\mathrm{max}}$ can be calculated as:
       $$d_{\mathrm{max}} = (\max v - \min v) \cdot \frac{p}{100}$$
     • $p$ is typically set to 5.
     • The elements $L_i$ of the interval $L$ with $0 \le i \le 2 \cdot b$ are then given by:
       $$L_i = \begin{cases} v - (b - i) \cdot d_{\mathrm{intv}} & i < b \\ v & i = b \\ v + (i - b) \cdot d_{\mathrm{intv}} & i > b \end{cases}$$

  14. Example
     Two intervals $L_1$ and $L_2$ are created for the values $v_1$ and $v_2$. The intervals are built with $b = 2$ and a maximum tolerated difference of $d_{\mathrm{max}} = 4$. This gives interval steps $d_{\mathrm{intv}}$ of 1:
       $$v_1 = 25, \quad v_2 = 26, \quad d_{\mathrm{max}} = 4, \quad b = 2, \quad d_{\mathrm{intv}} = \frac{d_{\mathrm{max}}}{2 \cdot b} = 1$$
     This will generate the lists:
       $$L_1 = [23, 24, 25, 26, 27], \quad L_2 = [24, 25, 26, 27, 28]$$
     These lists of numbers are mapped into Bloom filters.
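The interval construction can be reproduced with a few lines of code. This is a sketch of the list-generation step only: the function names are made up for the example, the value range 0 to 80 is a hypothetical input, and the final mapping of the lists into Bloom filters is left out, so the overlap is computed on the plain lists rather than on filter bits.

```python
def d_max_from_range(v_min, v_max, p=5):
    """Highest tolerated difference as p percent of the value range."""
    return (v_max - v_min) * p / 100


def interval_values(v, b, d_max):
    """Return the 2*b + 1 interval elements L_0 .. L_2b centred on v.

    v - (b - i) * d_intv and v + (i - b) * d_intv are the same expression,
    so a single formula covers the cases i < b, i = b and i > b.
    """
    d_intv = d_max / (2 * b)
    return [v + (i - b) * d_intv for i in range(2 * b + 1)]


L1 = interval_values(25, b=2, d_max=4)
L2 = interval_values(26, b=2, d_max=4)

# Exact overlap of the two lists; with Bloom filters this is only approximated.
# Note: comparing floats for equality is safe here because the step is 1.
shared = set(L1) & set(L2)
overlap = 2 * len(shared) / (len(L1) + len(L2))

print(L1)
print(L2)
print(sorted(shared), overlap)
print(d_max_from_range(0, 80))  # hypothetical value range
```

For the slide's values, the two lists share four of their five elements, so a difference of one step between 25 and 26 shows up as a high but not perfect overlap.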

  15. Encoding Hierarchical Codes
     • Sometimes, linkage has to use hierarchical identifiers such as ISCO or ICD codes.
     • Such codes are usually encoded
         • either as an unordered set of q-grams
         • or as an exact hash.
     • Recently, we suggested an encoding technique (HPBFs) for mapping this kind of code into Bloom filters [1].
     • This mapping preserves similarities of hierarchical codes despite encryption.
     • Using HPBFs in PPRL settings will improve linkage quality compared to previous methods.
     [1] Schnell/Borgs: Encoding hierarchical classification codes for Privacy-preserving Record Linkage using Bloom filters, DINA 2019.
