Using registers and administrative data in Official Statistics


Linking with sensitive identifiers in a national statistical institute. Rainer Schnell, German Record Linkage Center, University of Duisburg-Essen. GSS Data Linking Symposium, Wed 23 Oct 2019, London, United Kingdom.


SLIDE 1

Linking with sensitive identifiers in a national statistical institute

Rainer Schnell

German Record Linkage Center, University of Duisburg-Essen

GSS Data Linking Symposium, Wed 23 Oct 2019, London, United Kingdom

SLIDE 2

Using registers and administrative data in Official Statistics

  • Population covering databases are increasingly used in Official Statistics.
  • Linking these databases on a micro-level (persons) is a powerful research tool.
  • Luckily, the GDPR is quite favourable for research in general and especially for statistics.
  • The Handbook on European data protection law (European Union Agency for Fundamental Rights 2018) summarizes the GDPR: "It also recognises the importance of the compilation of data in registries for research purposes (...). For this reason, the regulation allows the processing of data for these purposes, without the data subjects’ consent, provided the relevant safeguards are in place."

  • The GDPR repeatedly refers to the use of pseudonyms as a standard technique.
  • Linking with pseudonyms is denoted as Privacy-Preserving Record Linkage.
  • So the problem is the security of PPRL.

SLIDE 3

Attack models

  • A security analysis requires a model for an attack.
  • We don’t have standard models of attacks for research or NSI settings.
  • Usually, we hide that with an "honest, but curious" assumption: All parties behave according to a protocol, but they are curious.

  • What exactly that "curiosity" implies is unclear.
  • Does that limit the amount of effort needed for an attack?
  • Do we have to take into account an irrational effort?
  • Then we have an "intruder model".
  • Who will attack an NSI? State Actors? Hackers?
  • This kind of attack is most likely handled by technical measures.
  • Prevention of insider attacks is challenging and would cause complicated organisational structures.
  • In the real world (outside academia or NSIs), most data leakages seem to be due to error or human engineering.

SLIDE 4

The data environment

  • The security or privacy of a technique, or of the result of applying a technique, cannot be judged based on the data alone.

  • In a series of publications, Mark Elliot and others (Elliot et al. 2016, 2018) made clear that we have to consider (among other issues):
      • the motivation of an adversary,
      • the potential consequences of re-identification (which will affect the motivation for an attack),
      • the auxiliary data that could be used for re-identification,
      • the governance structures, data security and other infrastructural properties surrounding the data.

  • Therefore:

"Whether data is anonymous or not (and therefore personal or not) is a function of the relationship between that data and its environment" (Elliot et al. 2018).

  • Currently, this insight is rarely used in the evaluation of PPRL solutions.
  • Hence, the evaluations by cryptographers are often misleading: Not linking data in an NSI might cause more harm than applying a weak encryption for PPRL.

SLIDE 5

Bloom Filters for PPRL

[Figure: the names SAHRA and SARAH are split into bigrams (SA, AH, HR, RA and SA, AR, RA, AH), which are hashed into two Bloom filters A and B. Number of bits set: |A| = 7, |B| = 6, |A ∩ B| = 5.]

Schnell/Bachteler/Reiher (2009)
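The encoding in the figure can be sketched roughly as follows. This is a minimal illustration, not the reference implementation: the key, the filter length of 1000 bits, k = 2 hash functions, the choice of HMAC-SHA1 and HMAC-SHA256, and the double-hashing scheme are all assumptions made here. Bloom filters are compared with the Dice coefficient; with the counts from the figure this would be 2 · 5 / (7 + 6) ≈ 0.77.

```python
import hashlib
import hmac


def bigrams(name):
    """Return the set of consecutive 2-character substrings of a string."""
    return {name[i:i + 2] for i in range(len(name) - 1)}


def bloom_positions(name, key, length=1000, k=2):
    """Return the set of bit positions set by hashing each bigram k times."""
    positions = set()
    for gram in bigrams(name):
        # Two keyed hashes per bigram (assumed choice of hash functions).
        h1 = int.from_bytes(hmac.new(key, gram.encode(), hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(key, gram.encode(), hashlib.sha256).digest(), "big")
        # Double hashing (an assumption): position_i = (h1 + i * h2) mod length
        for i in range(k):
            positions.add((h1 + i * h2) % length)
    return positions


def dice(a, b):
    """Dice coefficient of two sets of bit positions."""
    return 2 * len(a & b) / (len(a) + len(b))


key = b"hypothetical-secret-key"  # placeholder, not a real key
A = bloom_positions("SAHRA", key)
B = bloom_positions("SARAH", key)
print(sorted(bigrams("SAHRA")), sorted(bigrams("SARAH")))
print(round(dice(A, B), 3))
```

Because the two names share the bigrams SA, AH and RA, their filters share bit positions, so the similarity survives the encoding even though the cleartext is never compared.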

SLIDE 6

Attacking encryptions

In his textbook, Martin (2017) writes: "Probably the most important lesson is (...) that security of a cryptosystem is only ever relative to our understanding of attacks." and a few lines later: "(...) we can never guarantee security against an unknown future."

SLIDE 7

Progress in attacking PPRL methods

  • The initial attack on Bloom filters was due to Kuzu et al. (2011).
  • This attack is basically a frequency alignment of cleartext and basic Bloom filters.
  • The next attack was due to Niedermeyer et al. (2014).
  • They used a frequency analysis of encoded bigrams and aligned these frequencies.
  • The most recent attack on Bloom filters used pattern mining of bit patterns set by q-grams of identifiers (Vidanage et al. 2019).
  • Regarding approaches not based on Bloom filters, the ONS method for linking anonymous data (ONS 2013) was shown to have some vulnerabilities by Culnane/Rubinstein/Teague (2017) using subgraph matching.
  • This approach is currently under test for attacking methods based on combining hashes of standardized personal identifiers.
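The basic idea of a frequency alignment can be illustrated with a toy example. This sketch is not any of the published attacks, which are considerably more elaborate; the encoded values, the reference names and the function name `frequency_alignment` are all invented for illustration. It only shows why an encoding that maps equal cleartext to equal codes leaks information when an attacker knows the frequency distribution of the cleartext.

```python
from collections import Counter


def frequency_alignment(encoded_values, reference_names):
    """Guess a mapping encoding -> name by matching frequency ranks."""
    # Rank encodings and reference names from most to least frequent.
    enc_ranked = [e for e, _ in Counter(encoded_values).most_common()]
    ref_ranked = [n for n, _ in Counter(reference_names).most_common()]
    # Align the two rankings position by position.
    return dict(zip(enc_ranked, ref_ranked))


# Invented data: encoded identifiers seen by the attacker ...
encoded = ["x9"] * 5 + ["q2"] * 3 + ["k7"] * 1
# ... and a public reference list with a similar name distribution.
reference = ["SMITH"] * 50 + ["JONES"] * 30 + ["SCHNELL"] * 10

guess = frequency_alignment(encoded, reference)
print(guess)  # {'x9': 'SMITH', 'q2': 'JONES', 'k7': 'SCHNELL'}
```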

SLIDE 8

Developing attacks needs time

[Figure: timeline of methods and the attacks published against them]

  • Bloom Filter PPRL (Schnell 2009): CSSP attack (Kuzu 2011), bigram frequency attack (Niedermeyer 2014), pattern-mining attack (Christen 2019)
  • Matching anonymous data (ONS 2013): subgraph matching (Culnane 2017)
  • ?

SLIDE 9

PPRL using Multiple Match-Keys

  • Randall et al. (2019) suggested a new PPRL method based on multiple match-keys.
  • Based on previous clear-text linkage, a set of keys for linkage is selected.
  • For each record, a number of hashes are created, each from a different set of concatenated personal identifiers.
  • These hashes are denoted as match-keys. Each match-key will directly identify an individual.
  • Missing identifiers cause blank match-keys.
  • To reduce the number of match-keys, redundant combinations are removed.
  • Two records with the same value for a particular match-key are designated a match.
  • The protocol does not perform as well as field-level Bloom filters, but the authors think it may offer better privacy protection.
  • However, currently no security analysis is available.
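The match-key idea described above can be sketched as follows. This is an illustration under stated assumptions, not the protocol of Randall et al.: the field names, the three identifier combinations in `MATCH_KEY_FIELDS`, the separator, the upper-casing and the use of HMAC-SHA256 are all hypothetical choices made here.

```python
import hashlib
import hmac

# Hypothetical identifier combinations; in the method described above, the
# set of keys is selected on the basis of previous clear-text linkage.
MATCH_KEY_FIELDS = [
    ("first_name", "last_name", "dob"),
    ("last_name", "dob", "postcode"),
    ("first_name", "dob", "postcode"),
]


def match_keys(record, secret):
    """Return one keyed hash per identifier combination (None if blank)."""
    keys = []
    for fields in MATCH_KEY_FIELDS:
        values = [record.get(f) for f in fields]
        if any(v is None or v == "" for v in values):
            keys.append(None)  # a missing identifier gives a blank match-key
            continue
        message = "|".join(v.strip().upper() for v in values).encode()
        keys.append(hmac.new(secret, message, hashlib.sha256).hexdigest())
    return keys


def is_match(keys_a, keys_b):
    """Two records match if any non-blank match-key is identical."""
    return any(a is not None and a == b for a, b in zip(keys_a, keys_b))


secret = b"hypothetical-shared-secret"
rec_a = {"first_name": "Sarah", "last_name": "Meyer", "dob": "1980-05-01", "postcode": "47057"}
rec_b = {"first_name": "Sahra", "last_name": "Meyer", "dob": "1980-05-01", "postcode": "47057"}
rec_c = {"first_name": "Sarah", "last_name": "Meyer", "dob": "", "postcode": "47057"}

print(is_match(match_keys(rec_a, secret), match_keys(rec_b, secret)))  # True
print(is_match(match_keys(rec_a, secret), match_keys(rec_c, secret)))  # False
```

In this invented example, the misspelt first name is tolerated because the second match-key does not use it, while the record with a missing date of birth produces only blank match-keys and therefore cannot match.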

SLIDE 10

MinHash Bloom filters

  • D. Smith (2017) suggested estimating the Jaccard similarity of two strings by using several minwise hashes (MinHash).

  • To adapt MinHash to Bloom filters, several steps are required:

    1. First, a random key is hashed and represented as a bit vector.
    2. This bit vector is split into sub-parts of a certain length (i.e. 4 ∗ 8 bit splits).
    3. For each subpart, a randomly filled lookup table is built.
    4. This table contains the hash values.
    5. For each element (e.g. a single bigram), this is repeated k independent times, where k is the desired number of hash functions.
    6. The minimum hash value of each of the k steps is retained and XORed with l random numbers, where l is the desired output bit vector length.
    7. The resulting least significant bit of each resulting XORed number is used.

  • This gives l independent but similarity-preserving bits.
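The principle behind these steps can be sketched with a simplified one-bit MinHash. This is not the lookup-table construction described above: as an assumption for illustration, each of the l output bits uses its own keyed SHA-256 hash instead of tabulation and XOR masks, and the names `keyed_hash`, `minhash_bits` and `estimate_jaccard` are hypothetical. For two strings with Jaccard similarity J, corresponding bits agree with probability of roughly J + (1 − J)/2, so J can be estimated as 2 · agreement − 1.

```python
import hashlib


def bigrams(s):
    """Return the set of bigrams of a string (assumes len(s) >= 2)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def keyed_hash(gram, j, secret):
    """64-bit hash of a bigram for output position j (assumed construction)."""
    data = secret + j.to_bytes(4, "big") + gram.encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_bits(s, secret, l=1024):
    """For each of l hash functions, keep the least significant bit of the minimum."""
    grams = bigrams(s)
    return [min(keyed_hash(g, j, secret) for g in grams) & 1 for j in range(l)]


def estimate_jaccard(bits_a, bits_b):
    """Estimate Jaccard similarity from the share of agreeing bits."""
    agree = sum(a == b for a, b in zip(bits_a, bits_b)) / len(bits_a)
    return 2 * agree - 1


secret = b"hypothetical-secret"
a = minhash_bits("SAHRA", secret)
b = minhash_bits("SARAH", secret)
c = minhash_bits("MEYER", secret)
print(estimate_jaccard(a, b))  # should be near the true value 3/5
print(estimate_jaccard(a, c))  # should be near 0 (no shared bigrams)
```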

SLIDE 11

MinHash Bloom filters

  • Empirical tests for this approach have not been published yet.
  • It is theoretically secure if the independence assumption of each bit position holds.
  • However, as this method retains similarities, frequency attacks might still work.
  • This is the subject of ongoing research.

SLIDE 12

Encoding numerical values into Bloom filters

  • Standard Bloom Filters can compare numerical identifiers only as strings.
  • This is neither distance- nor order-preserving.
  • Recently, Vatsalan/Christen (2016) suggested an encoding of numerical values.
  • This approach allows calculating approximate distances on encrypted data.
  • The idea is to map a generated interval centered at the numerical value of interest into a Bloom Filter.
  • The amount of overlap between two intervals, estimated by the intersection of the Bloom Filters, serves as a proxy for the numerical distance.

SLIDE 13

Method

  • An interval of width 2 · b is built for each value v, where b is a domain-dependent parameter choice.
  • The width of each interval step is given by $d_{intv} = d_{max} / (2 \cdot b)$, where $d_{max}$ is the highest tolerated difference between a pair of numerical values and depends on the intended application.
  • $d_{max}$ can be calculated as:

$$d_{max} = \frac{max_v - min_v}{100} \cdot p$$

  • p is typically set to 5.
  • The elements $L_i$ of the interval L with 0 ≤ i ≤ 2 · b are then given by:

$$L_i = \begin{cases} v - (b - i) \cdot d_{intv} & i < b \\ v & i = b \\ v + (i - b) \cdot d_{intv} & i > b \end{cases}$$

SLIDE 14

Example

Two intervals L1 and L2 are created for the values v1 and v2. The intervals use b = 2 (that is, a width of 2 · b = 4 steps), with a maximum tolerated difference of dmax = 4. This gives interval steps dintv of 1.

v1 = 25, v2 = 26, dmax = 4, b = 2, dintv = dmax / (2 · b) = 1

This will generate the lists:

L1 = [23, 24, 25, 26, 27]
L2 = [24, 25, 26, 27, 28]

These lists of numbers are mapped into Bloom filters.
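The interval construction of the previous two slides can be reproduced with a few lines. This is a minimal sketch: the function names `d_max` and `interval` and the value range 0 to 80 are assumptions for illustration. Since v − (b − i) · dintv equals v + (i − b) · dintv, the three cases of the definition collapse into one expression.

```python
def d_max(min_v, max_v, p=5):
    """Highest tolerated difference: p percent of the value range."""
    return (max_v - min_v) / 100 * p


def interval(v, b, dmax):
    """Return the 2*b + 1 interval elements L_0 ... L_{2b} centred at v."""
    d_intv = dmax / (2 * b)
    # v - (b - i) * d_intv for i < b, v for i = b, v + (i - b) * d_intv for i > b
    return [v + (i - b) * d_intv for i in range(2 * b + 1)]


L1 = interval(25, b=2, dmax=4)
L2 = interval(26, b=2, dmax=4)
print(L1)  # [23.0, 24.0, 25.0, 26.0, 27.0]
print(L2)  # [24.0, 25.0, 26.0, 27.0, 28.0]

# The overlap of the two lists is what the Bloom filter intersection approximates.
print(len(set(L1) & set(L2)))  # 4

# With a hypothetical value range of 0 to 80 and p = 5, dmax is about 4.
print(d_max(0, 80))
```

Each list element would then be hashed into a Bloom filter in the same way as a q-gram, so that values close to each other share most of their set bits.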

SLIDE 15

Encoding Hierarchical Codes

  • Sometimes, linkage has to use hierarchical identifiers such as ISCO or ICDs.
  • Such codes are usually encoded either as an unordered set of q-grams or as an exact hash.
  • Recently, we suggested an encoding technique (HPBFs) for mapping this kind of code into Bloom filters [1].
  • This mapping preserves similarities of hierarchical codes despite encryption.
  • Using HPBFs in PPRL settings will improve linkage quality compared to previous methods.

[1] Schnell/Borgs: Encoding hierarchical classification codes for Privacy-preserving Record Linkage using Bloom filters, DINA 2019

SLIDE 16

Constructing Hierarchy Preserving Bloom filters (HPBFs)

To be encoded: 2143

Code length j = 4, stream length modifier c = 2, BF length l = 40, positional index i ∈ {j, j − 1, j − 2, ..., 1}

[Figure: the code is built up digit by digit, starting with Code = Code[j − i + 1] for i = j and continuing with Code += Code[j − i + 1] for each smaller i. Each prefix is hashed with an HMAC; the hash is used as the seed of a PRNG that draws n = c ∗ i positions from the range {1 ... l}.]

  • i = 4: Code = 2, H4 = HMAC(2, key), S4 = PRNG(Seed = H4, n = c ∗ i = 8) = {1, 2, 5, 7, 10, 16, 19, 28}
  • i = 3: Code = 21, H3 = HMAC(21, key), S3 = PRNG(Seed = H3, n = c ∗ i = 6) = {7, 10, 14, 17, 20, 21}
  • i = 2: Code = 214, H2 = HMAC(214, key), S2 = PRNG(Seed = H2, n = c ∗ i = 4) = {17, 21, 26, 27}
  • i = 1: Code = 2143, H1 = HMAC(2143, key), S1 = PRNG(Seed = H1, n = c ∗ i = 2) = {32, 39}

Resulting Bloom filter (positions 1 to 40), 16 bits set: 1, 2, 5, 7, 10, 14, 16, 17, 19, 20, 21, 26, 27, 28, 32, 39
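The construction in the figure can be sketched as follows. This is a hedged illustration, not the authors' implementation: HMAC-SHA256, Python's `random.Random` as the PRNG, drawing positions without replacement, and the function names `hpbf_levels` and `hpbf` are assumptions made here, so the positions produced will differ from those shown on the slide.

```python
import hashlib
import hmac
import random


def hpbf_levels(code, key, c=2, l=40):
    """Map each prefix of `code` to the set of bit positions it sets."""
    j = len(code)
    levels = {}
    for i in range(j, 0, -1):
        prefix = code[: j - i + 1]  # i = j gives the first digit only
        digest = hmac.new(key, prefix.encode(), hashlib.sha256).digest()
        # The keyed hash seeds the PRNG, so equal prefixes give equal positions.
        rng = random.Random(int.from_bytes(digest, "big"))
        # Shorter (more general) prefixes set more bits: n = c * i.
        levels[prefix] = set(rng.sample(range(1, l + 1), c * i))
    return levels


def hpbf(code, key, c=2, l=40):
    """Return the union of all positions set for a hierarchical code."""
    positions = set()
    for level_positions in hpbf_levels(code, key, c, l).values():
        positions |= level_positions
    return positions


key = b"hypothetical-secret-key"
levels = hpbf_levels("2143", key)
print({prefix: len(pos) for prefix, pos in levels.items()})
print(len(hpbf("2143", key) & hpbf("2144", key)))  # codes sharing the prefix 214
print(len(hpbf("2143", key) & hpbf("2999", key)))  # codes sharing only the prefix 2
```

Two codes that agree on a longer prefix are guaranteed to share all positions set by that prefix and its shorter prefixes, which is how the hierarchy is preserved in the encoded form.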

SLIDE 17

Technical summary

  • Encrypting error-prone identifiers in a similarity-preserving way has proven to be a nontrivial problem.

  • Many different techniques for encryption and hardening have been suggested.
  • Most of the suggested techniques have not been subject to systematic attacks.
  • Testing even the currently available techniques will take years.
  • Applying different encryptions depending on a stable identifier (for example, DOB or POB) and a separate password has been impossible to attack (so far).
  • The privacy standard set by the German Federal Data Protection laws is "de facto anonymisation".

  • Here, the gain of a successful attack is considered as low compared to the effort needed.
  • Currently available PPRL will perhaps not withstand an attack by a "motivated attacker" willing to invest years or millions of Euros.
  • But it will withstand attacks by a bored analyst or system administrator in a controlled environment for quite a while.

  • And the GDPR does not require more than that for official statistics.

SLIDE 18

Conclusion

"Privacy is a goal that we cannot achieve. That is no excuse not to strive for it. We do the same with other goals, such as justice." (Karsten Weber 2018)

SLIDE 19

References

Culnane, Chris/Benjamin I. P. Rubinstein/Vanessa Teague (2017): Vulnerabilities in the Use of Similarity Tables in Combination with Pseudonymisation to Preserve Data Privacy in the UK Office for National Statistics’ Privacy-Preserving Record Linkage. online at arXiv. url: https://arxiv.org/abs/1712.00871.

Culnane, Chris/Benjamin I. P. Rubinstein/Vanessa Teague (2018): Options for encoding names for data linking at the Australian Bureau of Statistics. arXiv: 1802.07975 [cs.CR].

Elliot, Mark/Elaine Mackey/Kieron O’Hara/Caroline Tudor (2016): The Anonymisation Decision-Making Framework. Manchester: UKAN.

Elliot, Mark/Kieron O’Hara/Charles Raab/Christine M. O’Keefe/Elaine Mackey/Chris Dibben/Heather Gowans/Kingsley Purdam/Karen McCullagh (2018): Functional Anonymisation: Personal Data and the Data Environment. In: Computer Law & Security Review 34 (2): 204–221.

European Union Agency for Fundamental Rights (2018): Handbook on European Data Protection Law: 2018 Edition. Luxembourg: Publications Office of the European Union.

SLIDE 20

References

Kuzu, Mehmet/Murat Kantarcioglu/Elizabeth Durham/Bradley Malin (2011): A constraint satisfaction cryptanalysis of Bloom filters in private record linkage. In: The 11th Privacy Enhancing Technologies Symposium: 27–29 July 2011; Waterloo, Canada.

Martin, Keith (2017): Everyday Cryptography: Fundamental Principles and Applications. Oxford: Oxford University Press.

Niedermeyer, Frank/Simone Steinmetzer/Martin Kroll/Rainer Schnell (2014): Cryptanalysis of Basic Bloom Filters Used for Privacy Preserving Record Linkage. In: Journal of Privacy and Confidentiality 6 (2): 59–69.

Randall, Sean M./Adrian P. Brown/Anna M. Ferrante/James H. Boyd (2019): Privacy Preserving Linkage Using Multiple Dynamic Match Keys. In: International Journal of Population Data Science 4 (1): 15.

Ritchie, Felix/Jim Smith (2018): Confidentiality and Linked Data. Paper published as part of the National Statistician’s Quality Review. London: 1–34. url: https://gss.civilservice.gov.uk/wp-content/uploads/2018/12/12-12-18_FINAL_Jim_Smith_Felix_Ritchie_article.pdf.

SLIDE 21

References

Schnell, Rainer/Tobias Bachteler/Jörg Reiher (2009): Privacy-preserving record linkage using Bloom filters. In: BMC Medical Informatics and Decision Making 9 (41).

Smith, Duncan (2017): Secure pseudonymisation for privacy-preserving probabilistic record linkage. In: Journal of Information Security and Applications 34 (2): 271–279.

Vatsalan, Dinusha/Peter Christen (2016): Privacy-Preserving Matching of Similar Patients. In: Journal of Biomedical Informatics 59: 285–298.

Vidanage, Anushka/Thilina Ranbaduge/Peter Christen/Rainer Schnell (2019): Efficient Pattern Mining Based Cryptanalysis for Privacy-Preserving Record Linkage. In: 2019 IEEE 35th International Conference on Data Engineering ICDE 2019. Los Alamitos: IEEE: 1698–1701.
