Using registers and administrative data in Official Statistics

Linking with sensitive identifiers in a national statistical institute
Rainer Schnell, German Record Linkage Center, University of Duisburg-Essen
GSS Data Linking Symposium, Wed 23 Oct 2019, London, United Kingdom

Using registers and administrative data in Official Statistics
• Databases covering the whole population are increasingly used in Official Statistics.
• Linking these databases on a micro-level (persons) is a powerful research tool.
• Luckily, the GDPR is quite favourable for research in general and especially for statistics.
• The Handbook on European data protection law (European Union Agency for Fundamental Rights 2018) summarizes the GDPR: It also recognises the importance of the compilation of data in registries for research purposes (...). For this reason, the regulation allows the processing of data for these purposes, without the data subjects’ consent, provided the relevant safeguards are in place.
• The GDPR repeatedly refers to the use of pseudonyms as a standard technique.
• Linking with pseudonyms is called Privacy-Preserving Record Linkage (PPRL).
• So the problem is the security of PPRL.

Attack models
• A security analysis requires a model for an attack.
• We don’t have standard models of attacks for research or NSI settings.
• Usually, we hide that with an "honest, but curious" assumption: All parties behave according to a protocol, but they are curious.
• What exactly that "curiosity" implies is unclear.
• Does that limit the amount of effort spent on an attack?
• Do we have to take into account an irrational effort?
• Then we have an "intruder model".
• Who will attack an NSI? State actors? Hackers?
• This kind of attack is most likely handled by technical measures.
• Prevention of insider attacks is challenging and would require complicated organisational structures.
• In the real world (outside academia or NSIs), most data leakages seem to be due to error or social engineering.

The data environment
• The security or privacy of a technique, or of the result of applying a technique, cannot be judged based on the data alone.
• In a series of publications, Mark Elliot and others (Elliot et al. 2016, 2018) made clear that we have to consider (among other issues):
  • the motivation of an adversary,
  • the potential consequences of re-identification (which will affect the motivation for an attack),
  • the auxiliary data that could be used for re-identification,
  • the governance structures, data security and other infrastructural properties surrounding the data.
• Therefore: "Whether data is anonymous or not (and therefore personal or not) is a function of the relationship between that data and its environment" (Elliot et al. 2018).
• Currently, this insight is rarely used in the evaluation of PPRL solutions.
• Hence, the evaluations by cryptographers are often misleading: Not linking data in an NSI might cause more harm than applying a weak encryption for PPRL.

Bloom Filters for PPRL
[Figure: the bigrams of the names SARAH (SA, AR, RA, AH) and SAHRA (SA, AH, HR, RA) are hashed into two Bloom filters A and B. Number of bits set: |A| = 7, |A ∩ B| = 5, |B| = 6. Source: Schnell/Bachteler/Reiher (2009)]
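The encoding on this slide can be sketched in a few lines. This is a minimal illustration, not the implementation of Schnell/Bachteler/Reiher (2009): the filter length, the number of hash functions `k`, the HMAC-SHA256 construction and the key `hypothetical-key` are all assumptions made for the example. The Dice coefficient is used here as the similarity measure on the two bit sets.

```python
import hashlib
import hmac


def bigrams(name):
    """Return the set of adjacent letter pairs of a name (no padding)."""
    return {name[i:i + 2] for i in range(len(name) - 1)}


def bloom_encode(name, length=64, k=2, secret=b"hypothetical-key"):
    """Return the set of bit positions set by the bigrams of a name.

    Each bigram is hashed k times with a keyed hash (an assumption of this
    sketch); every hash value selects one bit position in the filter.
    """
    bits = set()
    for gram in bigrams(name):
        for j in range(k):
            digest = hmac.new(secret, f"{j}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % length)
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two sets of bit positions."""
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


filter_a = bloom_encode("SARAH")
filter_b = bloom_encode("SAHRA")
print(len(filter_a), len(filter_a & filter_b), len(filter_b))
print(round(dice(filter_a, filter_b), 3))
```

With the counts shown on the slide (|A| = 7, |A ∩ B| = 5, |B| = 6), the Dice coefficient would be 2 · 5 / (7 + 6) ≈ 0.77. The counts printed by the sketch depend on the assumed parameters and will generally differ.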

Attacking encryptions
In his textbook Martin (2017) writes: "Probably the most important lesson is (...) that security of a cryptosystem is only ever relative to our understanding of attacks." and a few lines later: "(...) we can never guarantee security against an unknown future."

Progress in attacking PPRL methods
• The initial attack on Bloom filters was due to Kuzu et al. (2011).
• This attack is basically a frequency alignment of cleartext and basic Bloom filters.
• The next attack was due to Niedermeyer et al. (2014).
• They used a frequency analysis of encoded bigrams and aligned these frequencies.
• The most recent attack on Bloom filters used pattern mining of bit patterns set by q-grams of identifiers (Vidanage et al. 2019).
• Regarding approaches not based on Bloom filters, the ONS method for linking anonymous data (ONS 2013) was shown to have some vulnerabilities by Culnane/Rubinstein/Teague (2017) using subgraph matching.
• This approach is currently being tested as an attack on methods based on combining hashes of standardized personal identifiers.
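The basic idea of a frequency alignment can be sketched as follows. This is a deliberately naive illustration and not the attack of Kuzu et al. (2011) or Niedermeyer et al. (2014): it only ranks encodings and cleartext reference values by frequency and pairs them by rank. The encodings `e1` to `e3` and the name counts are made-up toy data.

```python
from collections import Counter


def frequency_alignment(encodings, reference_names):
    """Guess a mapping from encodings to names by aligning frequency ranks.

    The most frequent encoding is paired with the most frequent name in the
    public reference list, the second most frequent with the second, and so on.
    """
    ranked_encodings = [e for e, _ in Counter(encodings).most_common()]
    ranked_names = [n for n, _ in Counter(reference_names).most_common()]
    return dict(zip(ranked_encodings, ranked_names))


# Toy data: three distinct encodings and a hypothetical reference list.
observed = ["e1"] * 5 + ["e2"] * 3 + ["e3"] * 1
reference = ["MARIA"] * 50 + ["ANNA"] * 30 + ["SARAH"] * 10
guess = frequency_alignment(observed, reference)
print(guess)  # {'e1': 'MARIA', 'e2': 'ANNA', 'e3': 'SARAH'}
```

The sketch only works when the frequency distributions of the encoded file and the reference data are similar and when frequent values are clearly separated, which is why the published attacks add further steps.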

Developing attacks needs time
[Figure: timeline of methods and attacks. Bloom Filter PPRL (Schnell 2009): CSSP-Attack (Kuzu 2011), Bigram-Frequency-Attack (Niedermeyer 2014), Pattern-Mining-Attack (Christen 2019). Matching anonymous data (ONS 2013): Subgraph-Matching (Culnane 2017), followed by "?".]

PPRL using Multiple Match-Keys
• Randall et al. (2019) suggested a new PPRL method based on multiple match-keys.
• Based on previous clear-text linkage, a set of keys for linkage is selected.
• For each record, a number of hashes are created, each from a different set of concatenated personal identifiers.
• These hashes are called match-keys. Each match-key will directly identify an individual.
• Missing identifiers cause blank match-keys.
• To reduce the number of match-keys, redundant combinations are removed.
• Two records with the same value for a particular match-key are designated a match.
• The protocol does not perform as well as field-level Bloom filters, but the authors think it may offer better privacy protection.
• However, currently no security analysis is available.
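The match-key idea can be sketched as follows. This is an illustration under assumptions, not the protocol of Randall et al. (2019): the field names, the three field combinations in `MATCH_KEY_FIELDS`, the HMAC-SHA256 hash and the secret are all hypothetical choices.

```python
import hashlib
import hmac

# Hypothetical field combinations; in the method described above they would
# be selected on the basis of a previous clear-text linkage.
MATCH_KEY_FIELDS = [
    ("first_name", "last_name", "dob"),
    ("last_name", "dob", "postcode"),
    ("first_name", "dob", "postcode"),
]


def match_keys(record, secret=b"hypothetical-shared-secret"):
    """Return one keyed hash per field combination, or None if a field is missing."""
    keys = []
    for i, fields in enumerate(MATCH_KEY_FIELDS):
        values = [record.get(f) for f in fields]
        if any(v is None or v == "" for v in values):
            keys.append(None)  # a missing identifier gives a blank match-key
            continue
        message = f"{i}|" + "|".join(str(v).strip().upper() for v in values)
        keys.append(hmac.new(secret, message.encode("utf-8"), hashlib.sha256).hexdigest())
    return keys


def is_match(keys_a, keys_b):
    """Two records match if they agree on at least one non-blank match-key."""
    return any(a is not None and a == b for a, b in zip(keys_a, keys_b))


rec_a = {"first_name": "Sarah", "last_name": "Miller", "dob": "1980-02-01", "postcode": "AB1 2CD"}
rec_b = {"first_name": "Sahra", "last_name": "Miller", "dob": "1980-02-01", "postcode": "AB1 2CD"}
rec_c = {"first_name": "Sarah", "last_name": "Miller", "dob": "1980-02-01", "postcode": ""}
print(is_match(match_keys(rec_a), match_keys(rec_b)))  # True: key 2 ignores the first name
```

Because each key is an exact hash, a typing error only prevents a match on the keys that contain the affected field. This is the reason for using several keys per record.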

MinHash Bloom filters
• D. Smith (2017) suggested estimating the Jaccard similarity of two strings by using several minwise hashes (MinHash).
• To adapt MinHash to Bloom filters, several steps are required:
  1 First, a random key is hashed and represented as a bit vector.
  2 This bit vector is split into sub-parts of a certain length (e.g. 4 × 8-bit splits).
  3 For each sub-part, a randomly filled lookup table is built.
  4 This table contains the hash values.
  5 For each element (e.g. a single bigram), this is repeated k independent times, where k is the desired number of hash functions.
  6 The minimum hash value of each of the k steps is retained and XORed with l random numbers, where l is the desired output bit vector length.
  7 The least significant bit of each resulting XORed number is used.
• This gives l independent but similarity-preserving bits.
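The principle behind such bit vectors can be illustrated with plain one-bit minwise hashing. The sketch below is a simplification and does not reproduce the lookup-table construction listed above: it uses one keyed BLAKE2b hash per output bit (an assumption of this example), takes the minimum hash value over all bigrams and keeps its least significant bit. Two strings then agree on a bit with probability J + (1 − J)/2, where J is the Jaccard similarity of their bigram sets, so J can be estimated from the share of agreeing bits.

```python
import hashlib


def bigrams(text):
    """Return the set of adjacent letter pairs (text needs at least 2 letters)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def keyed_hash(element, index, secret):
    """64-bit keyed hash of an element; `index` selects the hash function."""
    digest = hashlib.blake2b(
        element.encode("utf-8"), key=secret, salt=index.to_bytes(4, "big")
    ).digest()
    return int.from_bytes(digest[:8], "big")


def one_bit_minhash(text, length=256, secret=b"hypothetical-key"):
    """Return `length` bits: the lowest bit of the minimum hash per hash function."""
    grams = bigrams(text)
    return [min(keyed_hash(g, i, secret) for g in grams) & 1 for i in range(length)]


def estimate_jaccard(bits_a, bits_b):
    """Invert P(agree) = J + (1 - J) / 2 to estimate the Jaccard similarity."""
    agree = sum(x == y for x, y in zip(bits_a, bits_b)) / len(bits_a)
    return 2 * agree - 1


bits_a = one_bit_minhash("SARAH")
bits_b = one_bit_minhash("SAHRA")
# The true Jaccard similarity of the bigram sets is 3/5; the estimate is noisy.
print(estimate_jaccard(bits_a, bits_b))
```

The estimate becomes more precise with a larger `length`, at the cost of longer encodings.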

MinHash Bloom filters
• Empirical tests for this approach have not been published yet.
• It is theoretically secure if the independence assumption for each bit position holds.
• However, as this method retains similarities, frequency attacks might still work.
• This is the subject of ongoing research.

Encoding numerical values into Bloom filters
• Standard Bloom filters can compare numerical identifiers only as strings.
• This is neither distance- nor order-preserving.
• Recently, Vatsalan/Christen (2016) suggested an encoding of numerical values.
• This approach allows calculating approximate distances on encrypted data.
• The idea is to map a generated interval centred at the numerical value of interest into a Bloom filter.
• The overlap between two intervals, which serves as a proxy for the numerical distance, is estimated by the intersection of the Bloom filters.

Method
• For each value $v$, an interval with $b$ steps on either side of $v$ ($2 \cdot b$ steps in total) is built, where $b$ is a domain-dependent parameter choice.
• The width of each interval step is given by
  $$d_{intv} = \frac{d_{max}}{2 \cdot b},$$
  where $d_{max}$ is the highest tolerated difference between two numerical values and depends on the intended application. $d_{max}$ can be calculated as:
  $$d_{max} = \frac{max_v - min_v}{100} \cdot p.$$
• $p$ is typically set to 5.
• The elements $L_i$ of the interval $L$ with $0 \le i \le 2 \cdot b$ are then given by:
  $$L_i = \begin{cases} v - (b - i) \cdot d_{intv} & i < b \\ v + (i - b) \cdot d_{intv} & i > b \\ v & i = b \end{cases}$$
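The interval definition translates directly into code. The function names are chosen for this sketch, and the value range 0 to 80 in the usage lines is a hypothetical example. Since v − (b − i) · d_intv and v + (i − b) · d_intv are the same expression, a single formula covers all three cases.

```python
def d_max_from_range(min_v, max_v, p=5):
    """Highest tolerated difference as p percent of the value range."""
    return (max_v - min_v) * p / 100


def interval_elements(v, b, d_max):
    """Return the 2*b + 1 elements L_0, ..., L_2b of the interval around v."""
    d_intv = d_max / (2 * b)  # width of one interval step
    # v + (i - b) * d_intv gives v for i = b and the neighbours for i != b.
    return [v + (i - b) * d_intv for i in range(2 * b + 1)]


d_max = d_max_from_range(0, 80)         # 5 percent of a range of 80 -> 4.0
print(interval_elements(25, 2, d_max))  # [23.0, 24.0, 25.0, 26.0, 27.0]
```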

Example
Two intervals $L_1$ and $L_2$ are created for the values $v_1$ and $v_2$. With $b = 2$ steps on either side and a maximum tolerated difference of $d_{max} = 4$, the interval step $d_{intv}$ is 1:
$$v_1 = 25, \quad v_2 = 26, \quad d_{max} = 4, \quad b = 2, \quad d_{intv} = \frac{d_{max}}{2 \cdot b} = 1$$
This will generate the lists:
$$L_1 = [23, 24, \mathbf{25}, 26, 27]$$
$$L_2 = [24, 25, \mathbf{26}, 27, 28]$$
These lists of numbers are mapped into Bloom filters.
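How the overlap of the two lists carries over to the Bloom filters can be sketched as follows. The filter length, the number of hash functions, the HMAC-SHA256 construction and the key are assumptions of this example and are not taken from Vatsalan/Christen (2016).

```python
import hashlib
import hmac


def bloom_encode(elements, length=128, k=2, secret=b"hypothetical-key"):
    """Return the set of bit positions set by the list elements.

    Elements are hashed via their string form, so both parties must format
    numbers identically (e.g. always 24, never 24.0).
    """
    bits = set()
    for element in elements:
        for j in range(k):
            digest = hmac.new(secret, f"{j}:{element}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % length)
    return bits


def dice(a, b):
    """Dice coefficient of two sets."""
    return 2 * len(a & b) / (len(a) + len(b))


L1 = [23, 24, 25, 26, 27]
L2 = [24, 25, 26, 27, 28]

print(dice(set(L1), set(L2)))                    # 0.8 on the clear-text lists
print(dice(bloom_encode(L1), bloom_encode(L2)))  # approximation from the filters
```

The two values differ only through hash collisions. For values further apart than d_max the lists no longer share elements, and the filters agree only on bits set by chance.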

Encoding Hierarchical Codes
• Sometimes, linkage has to use hierarchical identifiers such as ISCO or ICD codes.
• Such codes are usually encoded
  • either as an unordered set of q-grams
  • or as an exact hash.
• Recently, we suggested an encoding technique (HPBFs) for mapping such codes into Bloom filters [1].
• This mapping preserves similarities of hierarchical codes despite encryption.
• Using HPBFs in PPRL settings will improve linkage quality compared to previous methods.
[1] Schnell/Borgs: Encoding hierarchical classification codes for Privacy-preserving Record Linkage using Bloom filters, DINA 2019
