
DPM 2011 - Leuven, Belgium, 15-16 September 2011

A. Karakasidis 1, V. S. Verykios 2 and P. Christen 3

1 Department of Computer and Communication Engineering, University of Thessaly, Volos, Greece, akarakasidis@inf.uth.gr
2 School of Science and Technology, Hellenic Open University, Patras, Greece, verykios@eap.gr
3 ANU College of Engineering and Computer Science, The Australian National University, Canberra, Australia, peter.christen@anu.edu.au

A. Karakasidis - University of Thessaly

• Approximate matching without common unique identifiers
• Integration without compromising privacy
• Examples:
  • Merging medical data
  • Locating tax evaders

• Let µ be a privacy metric for privacy-preserving record linkage (PPRL).
• Consider a plaintext database D_pt and its ciphered equivalent D_c.
• µ represents the ability to infer data from D_pt using data from D_c.
• Higher values of µ mean higher inference ability.

• A PPRL method is considered to offer sufficient privacy guarantees if the value of its privacy metric µ does not exceed a predetermined privacy threshold δ.

• Given data sources A and B, we wish to perform record matching between datasets R_A and R_B so that, at the end of the process, the privacy metric for source A, µ_A, does not exceed δ_A.
• This is a more flexible definition.

• Three ways of providing privacy:
  • Suppression
  • Perturbation
  • Generalization

• Inherent generalization characteristics
• Retain the first letter of the name and drop all other occurrences of a, e, h, i, o, u, w, y.
• Replace consonants with digits as follows (after the first letter):
  • b, f, p, v => 1
  • c, g, j, k, q, s, x, z => 2
  • d, t => 3
  • l => 4
  • m, n => 5
  • r => 6
• Two adjacent letters with the same number are coded as a single number.
• Continue until you have one letter and three numbers. If you run out of letters, fill in 0s until there are three numbers.
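The encoding rules above can be sketched in a few lines of Python. This is a minimal illustration of the simplified rules as listed on the slide, not a full implementation of standard Soundex: in particular it assumes that any dropped letter (including h and w) separates two consonants, so equal digits on either side of it are both kept. The function name `soundex` and the `CODES` table are names chosen for this sketch.

```python
# Digit table taken from the slide; letters not listed here are dropped.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a one-letter, three-digit code for `name` (simplified rules)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0].upper()
    digits = []
    # The first letter's own digit counts when collapsing adjacent duplicates.
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        # Assumption: a dropped letter resets the "previous digit".
        prev = code
    # Pad with zeros, then truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


codes = {name: soundex(name) for name in ["Fortson", "Johnson", "Johnsen", "Miller"]}
```

Under these rules "Johnson" and "Johnsen" receive the same code, which is the generalization property the matching protocol relies on.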

• Based on Soundex inherent privacy
• Using a trusted third party
• Fake codes to enhance privacy

Example setting: sources A and B, third party C.

Source B
ID  Sndx  Surname
1   A100
2   F632  Fortson
3   J525  Johnsen
4   M346

Source A
ID  Sndx  Surname
1   F632
2   J525  Johnson
3   K364
4   M460  Miller

Sources send data to the third party.

The third party joins the Soundex codes.

The third party returns the matching identifiers: A[1,2], B[2,3].

Sources determine identifiers.

Sources request data directly from each other. A requests: J525. B requests: J525, F632.

Sources deliver data: A delivers data and B delivers data.
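The walk-through above can be traced with a small Python sketch. It is only an illustration under stated assumptions: the rows are read off the example tables, a record with no surname is assumed to be a fake record, and all function names (`encoded_view`, `third_party_join`, `codes_to_request`) are hypothetical.

```python
# Records are (id, soundex, surname). Assumption: surname None marks a fake record.
source_a = [(1, "F632", None), (2, "J525", "Johnson"),
            (3, "K364", None), (4, "M460", "Miller")]
source_b = [(1, "A100", None), (2, "F632", "Fortson"),
            (3, "J525", "Johnsen"), (4, "M346", None)]


def encoded_view(records):
    """What a source sends to the third party: identifiers and codes only."""
    return [(rid, code) for rid, code, _ in records]


def third_party_join(view_a, view_b):
    """Join on Soundex code and return the matching identifiers per source."""
    codes_a = {code for _, code in view_a}
    codes_b = {code for _, code in view_b}
    ids_a = [rid for rid, code in view_a if code in codes_b]
    ids_b = [rid for rid, code in view_b if code in codes_a]
    return ids_a, ids_b


def codes_to_request(records, matched_ids):
    """A source ignores matches on its own fake records before requesting data."""
    return sorted(code for rid, code, surname in records
                  if rid in matched_ids and surname is not None)


ids_a, ids_b = third_party_join(encoded_view(source_a), encoded_view(source_b))
requests_a = codes_to_request(source_a, ids_a)  # codes A asks B about
requests_b = codes_to_request(source_b, ids_b)  # codes B asks A about
```

With these inputs the third party returns identifiers [1, 2] to A and [2, 3] to B, and A ends up requesting only J525 because its F632 record is assumed to be fake.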

• Need for a privacy metric
• Use of information theory
• Calculation of Entropy
• Calculation of Information Gain
• Calculation of Relative Information Gain

• Entropy: the amount of information in a message.
• Entropy indicates how predictable the values of a set are.
• Low entropy of X means low uncertainty and, as a result, high predictability of X's values.
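As a small illustration of the point above, the following Python sketch estimates Shannon entropy (in bits) from observed value frequencies. The function name `entropy` and the sample lists are chosen for this example only.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits, estimated from the frequencies in `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# One repeated value is fully predictable; four equally likely values are not.
h_constant = entropy(["Miller", "Miller", "Miller", "Miller"])
h_uniform = entropy(["Miller", "Fortson", "Johnson", "Johnsen"])
```

Here `h_constant` is zero and `h_uniform` is two bits, matching the reading that low entropy means high predictability.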

• Conditional entropy: quantification of the amount of uncertainty in predicting the value of the discrete random variable Y given X.
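A minimal sketch of this quantity, assuming the usual definition H(Y|X) = Σ p(x) · H(Y | X = x) estimated from observed (x, y) pairs. The function names and the sample pairs are illustrative only.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy in bits, estimated from the frequencies in `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(pairs):
    """H(Y|X) in bits for a list of (x, y) observations."""
    ys_by_x = defaultdict(list)
    for x, y in pairs:
        ys_by_x[x].append(y)
    total = len(pairs)
    # Weight the entropy of Y within each x-group by the group's frequency.
    return sum(len(ys) / total * entropy(ys) for ys in ys_by_x.values())


# X is the Soundex code, Y is the surname.
pairs = [("J525", "Johnson"), ("J525", "Johnsen"),
         ("M460", "Miller"), ("F632", "Fortson")]
h_surname_given_code = conditional_entropy(pairs)
```

Only the J525 group leaves any uncertainty about the surname, so the result for this toy sample is half a bit.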

• Information Gain: reflects the difficulty of inferring the original text (Y), knowing its enciphered version (X).
• It expresses how much the knowledge of X's value reduces the uncertainty of inferring Y.
• Lower Information Gain means that it is more difficult to infer the original text from the cipher.

• Information Gain depends on the size of the measured dataset.
• Relative Information Gain, on the other hand, provides a normalized scale.
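The two measures can be sketched as follows, assuming the standard definitions IG(Y|X) = H(Y) - H(Y|X) and RIG(Y|X) = IG(Y|X) / H(Y); whether the authors normalize in exactly this way is an assumption here. All function names and the sample data are illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy in bits, estimated from the frequencies in `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(pairs):
    """H(Y|X) in bits for a list of (x, y) observations."""
    ys_by_x = defaultdict(list)
    for x, y in pairs:
        ys_by_x[x].append(y)
    total = len(pairs)
    return sum(len(ys) / total * entropy(ys) for ys in ys_by_x.values())


def information_gain(pairs):
    """IG(Y|X) = H(Y) - H(Y|X)."""
    return entropy([y for _, y in pairs]) - conditional_entropy(pairs)


def relative_information_gain(pairs):
    """RIG(Y|X) = IG(Y|X) / H(Y); defined as 0 when Y carries no uncertainty."""
    h_y = entropy([y for _, y in pairs])
    return information_gain(pairs) / h_y if h_y > 0 else 0.0


# X is the Soundex code, Y is the surname.
pairs = [("J525", "Johnson"), ("J525", "Johnsen"),
         ("M460", "Miller"), ("F632", "Fortson")]
ig = information_gain(pairs)
rig = relative_information_gain(pairs)
```

For this toy sample the Soundex code removes 1.5 of the 2 bits of uncertainty about the surname, giving a relative value of 0.75 that lies in [0, 1].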

• Uniform Ciphertext / Uniform Plaintext (UCUP)
• Uniform Ciphertexts by Swapping Plaintexts (UCSP)
• k-anonymous Ciphertexts (kaC)

• Intuitive approach
• To reduce RIG, plaintexts and ciphertexts appear an equal number of times
• Inject fake records so that all ciphers map to an equal number of surnames

• Calculate the average number of plaintext occurrences ⌈K⌉ for each Soundex code.
• For Soundex codes with more than ⌈K⌉ occurrences, remove the redundant plaintext occurrences.
• Add an equal number of fake occurrences for Soundex codes with fewer than ⌈K⌉ appearances.
• Each Soundex code appears exactly ⌈K⌉ times.
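The steps above can be sketched as follows. This is one reading of the slide, not the authors' implementation: it assumes ⌈K⌉ is the total number of records divided by the number of distinct Soundex codes, rounded up, and it uses labeled placeholder strings where a real system would generate fake surnames that encode to the right Soundex code. The function name `uniform_by_swapping` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def uniform_by_swapping(records):
    """records: list of (soundex_code, surname).

    Returns (kept, removed, k): every code occurs exactly k times in `kept`;
    `removed` holds the surplus plaintexts that were taken out.
    """
    groups = defaultdict(list)
    for code, surname in records:
        groups[code].append(surname)
    # Assumed reading of the slide: average occurrences per code, rounded up.
    k = math.ceil(len(records) / len(groups))
    kept, removed = [], []
    for code, names in groups.items():
        kept.extend((code, n) for n in names[:k])
        removed.extend((code, n) for n in names[k:])
        # Pad under-represented codes with placeholder fake records.
        for i in range(k - len(names)):
            kept.append((code, f"<fake {i + 1}>"))
    return kept, removed, k


records = [("J525", "Johnson"), ("J525", "Johnsen"), ("J525", "Jonson"),
           ("J525", "Jansen"), ("M460", "Miller"), ("F632", "Fortson")]
kept, removed, k = uniform_by_swapping(records)
occurrences = Counter(code for code, _ in kept)
```

In this sample two surplus J525 records are removed and one fake record is added to each of the other two codes, so the dataset keeps its original size.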

• For: avoids oversized datasets.
• Against: removed plaintexts will have to be matched separately.

• Same intuition as Sweeney's k-anonymity
• Create datasets so that each Soundex code maps to at least k surnames.
• Parametric approach with k as its tuning parameter.
• For each Soundex code with fewer than k surnames, we inject fake surnames.
• Tunable by means of the k parameter.
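A minimal sketch of the injection step described above, assuming that "at least k surnames" refers to distinct surnames per Soundex code. The placeholder strings stand in for generated fake surnames, and the function name `k_anonymous_ciphertexts` is hypothetical.

```python
from collections import defaultdict


def k_anonymous_ciphertexts(records, k):
    """records: list of (soundex_code, surname).

    Returns the records plus enough fake entries that every code maps to
    at least k distinct surnames. Nothing is removed.
    """
    surnames_by_code = defaultdict(set)
    for code, surname in records:
        surnames_by_code[code].add(surname)
    fakes = []
    for code, names in surnames_by_code.items():
        # Codes that already have k or more surnames get no fake records.
        for i in range(k - len(names)):
            fakes.append((code, f"<fake {code} {i + 1}>"))
    return list(records) + fakes


records = [("J525", "Johnson"), ("J525", "Johnsen"), ("M460", "Miller")]
padded = k_anonymous_ciphertexts(records, k=3)
```

Larger values of k add more fake records, so k trades dataset size against how many surnames hide behind each code.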

• Four datasets with different distributions
• Real-world and synthetic data
• Study on a single (Surname) field

• Assess the amount of information hidden by Soundex
• Calculate:
  • Entropy H(Surname)
  • Conditional Entropy H(Surname|Soundex)


• The drop in RIG represents how much privacy we gain.
• Quantitatively measure the inherent reduction in RIG that the Soundex algorithm provides.

• Use fake records in order to further reduce RIG
• Results for:
  • UCUP
  • UCSP
  • kaC

• Determine the privacy gain of each fake injection strategy
• Measure results for all four distributions

• Determine the impact on data quality
• Estimate the number of additional records required by each strategy
• Gather results for all four distributions
