CS573 Data Privacy and Security: Statistical Databases (PowerPoint presentation)

SLIDE 1

CS573 Data Privacy and Security: Statistical Databases

Li Xiong

SLIDE 2

Today

Statistical databases

Definitions
Early query restriction methods
Output perturbation and differential privacy

SLIDE 3

Statistical Data Release

[Figure: example statistical summaries, including a population count by city and counts by Age (20 to 50) and Diagnosis]

Release statistical summary of the data (vs. individual records)
Useful for analysis and learning
Medical statistics
Query log statistics – frequent search terms
Still need rigorous inference control

SLIDE 4

Statistical Database

A statistical database is a database that provides statistics on subsets of records
Statistics may be computed as SUM, MEAN, MEDIAN, COUNT, MAX and MIN of records
Inference control is needed to prevent inference from statistics to individual records

SLIDE 5

Methods

Data perturbation/anonymization
Query restriction
Output perturbation

SLIDE 6

Data Perturbation

SLIDE 7

Query Restriction

SLIDE 8

Output Perturbation

[Figure: diagram with the labels Query, Query Results, and Results]

SLIDE 9

Methods

Data perturbation/anonymization
Query restriction

Query set size control
Query set overlap control
Query auditing

Output perturbation

SLIDE 10

Query Set Size Control

A query-set size control limits the number of records that must be in the result set
It allows the query result to be released only if the size of the query set |C| satisfies the condition K <= |C| <= L - K, where L is the size of the database and K is a parameter that satisfies 0 <= K <= L/2
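The size check K <= |C| <= L - K can be sketched as follows. This is only an illustration: the record layout, the helper name `answer_count`, and the choice to return `None` for a refused query are assumptions, not part of the lecture.

```python
def answer_count(records, predicate, k):
    """Return COUNT over the query set, or None if the size control refuses it."""
    size_db = len(records)                      # L in the slide
    if not 0 <= k <= size_db / 2:
        raise ValueError("K must satisfy 0 <= K <= L/2")
    query_set = [r for r in records if predicate(r)]
    if k <= len(query_set) <= size_db - k:      # K <= |C| <= L - K
        return len(query_set)
    return None                                 # refused: query set too small or too large


# Hypothetical toy database with L = 6 records.
records = [
    {"sex": "F", "age": 30}, {"sex": "F", "age": 41}, {"sex": "F", "age": 55},
    {"sex": "M", "age": 42}, {"sex": "M", "age": 29}, {"sex": "M", "age": 63},
]
allowed = answer_count(records, lambda r: r["sex"] == "F", k=2)      # |C| = 3
too_small = answer_count(records, lambda r: r["age"] == 42, k=2)     # |C| = 1
too_large = answer_count(records, lambda r: r["age"] != 42, k=2)     # |C| = 5
```

The upper bound L - K matters because a very large query set reveals its small complement once the total count is known.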

SLIDE 11

Query Set Size Control

SLIDE 12

Tracker

Q1: Count ( Sex = Female ) = A
Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B
What if B = A+1?

SLIDE 13

Tracker

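Q1: Count ( Sex = Female ) = A
Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B
If B = A+1, exactly one record matches (Age = 42 & Sex = Male & Employer = ABC)

Q3: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC & Diagnosis = Schizophrenia) )
If Q3 = A+1 the individual has the diagnosis; if Q3 = A he does not
Positively or negatively compromised!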
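The tracker attack can be replayed on a toy table. Everything in this sketch is made up for illustration (the six records, the threshold K = 2, and the helper `restricted_count`); it only shows that each query passes the size control while the combination isolates one person.

```python
def restricted_count(records, predicate, k):
    """COUNT with query-set size control: answer only if K <= |C| <= L - K."""
    size = sum(1 for r in records if predicate(r))
    return size if k <= size <= len(records) - k else None


# Hypothetical records; exactly one matches Age = 42 & Sex = Male & Employer = ABC.
records = [
    {"sex": "F", "age": 30, "employer": "XYZ", "diagnosis": "Flu"},
    {"sex": "F", "age": 42, "employer": "ABC", "diagnosis": "Asthma"},
    {"sex": "F", "age": 55, "employer": "ABC", "diagnosis": "Flu"},
    {"sex": "M", "age": 42, "employer": "ABC", "diagnosis": "Schizophrenia"},
    {"sex": "M", "age": 29, "employer": "XYZ", "diagnosis": "Flu"},
    {"sex": "M", "age": 63, "employer": "XYZ", "diagnosis": "Asthma"},
]
K = 2


def is_female(r):
    return r["sex"] == "F"


def is_target(r):
    return r["age"] == 42 and r["sex"] == "M" and r["employer"] == "ABC"


def target_with_diagnosis(r):
    return is_target(r) and r["diagnosis"] == "Schizophrenia"


# Asking about the individual directly is refused (query set of size 1).
direct = restricted_count(records, target_with_diagnosis, K)

# Padding the query with the "Sex = Female" tracker makes every query set large enough.
a = restricted_count(records, is_female, K)                                         # Q1
b = restricted_count(records, lambda r: is_female(r) or is_target(r), K)            # Q2
q3 = restricted_count(records, lambda r: is_female(r) or target_with_diagnosis(r), K)

target_is_unique = (b - a == 1)        # B = A + 1
target_has_diagnosis = (q3 - a == 1)   # positive compromise; q3 == a would be negative
```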

SLIDE 14

Query set size control

If the threshold value K is large, it will restrict too many queries

And it still does not guarantee protection from compromise

The database can be easily compromised within 4-5 queries

SLIDE 15

Query Set Overlap Control

Basic idea: successive queries must be checked against the number of common records.
If the number of records a query has in common with any previous query exceeds a given threshold, the requested statistic is not released.
A query q(C) is only allowed if | q(C) ∩ q(D) | ≤ r, r > 0, for every previously answered query q(D), where r is set by the administrator
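A minimal sketch of the overlap rule, assuming query sets are represented as sets of record ids and that a single user's history is kept in memory. The class name `OverlapControl` is hypothetical.

```python
class OverlapControl:
    """Refuse a query whose query set shares more than r records with any answered one."""

    def __init__(self, r):
        self.r = r
        self.answered = []          # query sets (sets of record ids) already released

    def allow(self, query_set):
        query_set = frozenset(query_set)
        # Compare the new query set with every previously answered one.
        if any(len(query_set & old) > self.r for old in self.answered):
            return False
        self.answered.append(query_set)
        return True


control = OverlapControl(r=1)
first = control.allow({1, 2, 3})      # nothing answered yet
second = control.allow({3, 4, 5})     # overlap {3} has size 1, within the threshold
third = control.allow({1, 2, 6})      # overlap {1, 2} with the first query exceeds r
```

Each call scans the whole history, which is the processing overhead such a control incurs as the number of answered queries grows.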

SLIDE 16

Query-set-overlap control

Statistics for a set and its subset cannot both be released – limiting usefulness
High processing overhead – every new query compared with all previous ones
Multiple users – need to keep user profiles, need to consider collusion between users
Still no formal privacy guarantee

SLIDE 17

Auditing

Keep up-to-date logs of all queries made by each user and check for possible compromise when a new query is issued
Excessive computation and storage requirements
Only “efficient” methods for special types of queries

SLIDE 18

Audit Expert (Chin 1982)

Query auditing method for SUM queries
A SUM query can be considered as a linear equation

$\sum_{i=1}^{L} a_i x_i = q$

where $a_i \in \{0, 1\}$ indicates whether record i belongs to the query set, $x_i$ is the sensitive value, and q is the query result

A set of SUM queries can be thought of as a system of linear equations

Maintains the binary matrix representing linearly independent queries and updates it when a new query is issued

A row with all 0s except for the ith column indicates disclosure of $x_i$
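The disclosure test can be illustrated with plain Gaussian elimination. This sketch is not the incremental update procedure of Audit Expert: it simply recomputes a reduced row echelon form with exact fractions each time, and the names `rref` and `SumAuditor` are hypothetical.

```python
from fractions import Fraction


def rref(rows):
    """Reduced row echelon form of `rows` (exact arithmetic), zero rows dropped."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    pivot_row = 0
    for col in range(n_cols):
        # Look for a row at or below pivot_row with a nonzero entry in this column.
        found = next((r for r in range(pivot_row, len(m)) if m[r][col] != 0), None)
        if found is None:
            continue
        m[pivot_row], m[found] = m[found], m[pivot_row]
        pivot = m[pivot_row][col]
        m[pivot_row] = [v / pivot for v in m[pivot_row]]
        for r in range(len(m)):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
    return m[:pivot_row]


class SumAuditor:
    """Answer a SUM query only if no single x_i becomes solvable."""

    def __init__(self):
        self.basis = []   # reduced, linearly independent rows of answered queries

    def ask(self, indicator):
        """`indicator` is the 0/1 row (a_1, ..., a_L) of the new query."""
        reduced = rref(self.basis + [list(indicator)])
        # A row with exactly one nonzero entry pins down one sensitive value.
        if any(sum(1 for v in row if v != 0) == 1 for row in reduced):
            return False
        self.basis = reduced
        return True


auditor = SumAuditor()
ok_all = auditor.ask([1, 1, 1])        # x1 + x2 + x3
ok_pair = auditor.ask([1, 1, 0])       # x1 + x2 would reveal x3 by subtraction
```

A linearly dependent query leaves `basis` unchanged in size, which matches the idea of storing only independent queries.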

SLIDE 19

Audit Expert

Only stores linearly independent queries
Not all queries are linearly independent

Q1: Sum(Sex=M)
Q2: Sum(Sex=M AND Age>20)
Q3: Sum(Sex=M AND Age<=20)
Here Q3 = Q1 - Q2, so only two of the three need to be stored

SLIDE 20

Audit Expert

O(L^2) time complexity
Further work reduced this to O(L) time and space when the number of queries < L
Only for SUM queries

SLIDE 21

Auditing – recent developments

Online auditing

“Detect and deny” queries that violate the privacy requirement
Denials themselves may implicitly disclose sensitive information

Offline auditing

Check whether a privacy requirement has been violated after the queries have been executed
Detects violations but does not prevent them

SLIDE 22

Methods

Data perturbation/anonymization
Query restriction
Output perturbation

Differential privacy

SLIDE 23

Differential Privacy

Differential privacy requires the outcome to be formally indistinguishable when run with and without any particular record in the data set

E.g.: Q = select count() where Age = [20,30] and Diagnosis = B

[Figure: a user sends Q to an output perturbation interface; on D1 (Bob in) it returns A(D1), on D2 (Bob out) it returns A(D2)]
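The indistinguishability requirement is commonly written as the inequality below. The symbols are assumptions chosen to match the labels in this deck: A is the randomized interface, D1 and D2 are databases that differ in one record (Bob in, Bob out), S is any set of possible outputs, and alpha is the privacy parameter (written epsilon in much of the literature).

```latex
\Pr[A(D_1) \in S] \;\le\; e^{\alpha} \, \Pr[A(D_2) \in S]
\qquad \text{for all } S \subseteq \mathrm{Range}(A)
```

A smaller alpha forces the two output distributions closer together, which means stronger privacy.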

SLIDE 24

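Differential Privacy

Laplace mechanism

A(D) = Q(D) + Y, where Y is drawn from $\mathrm{Lap}(S(Q)/\alpha)$

Query sensitivity

$S(Q) = \max_{D_1, D_2} |Q(D_1) - Q(D_2)|$ over all D1, D2 that differ in one record

[Figure: a user sends Q to a differentially private interface; on D1 (Bob in) it returns A(D1) = Q(D1) + Y1, on D2 (Bob out) it returns A(D2) = Q(D2) + Y2]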
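A minimal sketch of the Laplace mechanism for a COUNT query, using only the Python standard library. It assumes the usual calibration, noise scale = sensitivity / alpha, with sensitivity 1 for a count; the record layout and function names are made up for illustration.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(records, predicate, alpha, rng=random):
    """Answer a COUNT query as Q(D) + Y with Y ~ Laplace(sensitivity / alpha)."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0   # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / alpha, rng)


def in_range(r):
    # Q = select count() where Age = [20,30] and Diagnosis = B
    return 20 <= r["age"] <= 30 and r["diagnosis"] == "B"


rng = random.Random(0)   # seeded only to make the example repeatable
records = [{"age": a, "diagnosis": d} for a in range(18, 60) for d in "AB"]
noisy_answer = private_count(records, in_range, alpha=0.5, rng=rng)
```

The true answer here is 11; each call returns a different noisy value, and a smaller alpha spreads those values further from the truth.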

SLIDE 25

Composition of Differential Privacy

Sequential composition [McSherry SIGMOD 09]

Let Mi each provide alpha_i-differential privacy. The sequence of Mi provides (sum of alpha_i)-differential privacy

Parallel composition

If Di are disjoint subsets of the original database and Mi provides alpha-differential privacy for each Di, then the sequence of Mi provides alpha-differential privacy.

[Figure: a user sends Q1, Q2, … to a differentially private interface; on D1 (Bob in) it returns A1(D1), A2(D1), …, on D2 (Bob out) it returns A1(D2), A2(D2), …]
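The two composition rules amount to simple bookkeeping: costs add up when mechanisms see the same data, and the largest cost applies when they see disjoint subsets. The helper names and the `Budget` class below are hypothetical, a sketch of how an interface might track a total alpha.

```python
def sequential_cost(alphas):
    """Privacy cost when every mechanism runs on the same data: the sum."""
    return sum(alphas)


def parallel_cost(alphas):
    """Privacy cost when each mechanism runs on a disjoint subset: the maximum."""
    return max(alphas)


class Budget:
    """Tracks the remaining alpha under sequential composition."""

    def __init__(self, total):
        self.remaining = total

    def spend(self, alpha):
        if alpha > self.remaining:
            raise ValueError("privacy budget exhausted")
        self.remaining -= alpha


costs = [0.5, 0.25, 0.25]
same_data = sequential_cost(costs)     # three queries on the whole database
disjoint = parallel_cost(costs)        # one query per disjoint partition

budget = Budget(total=1.0)
budget.spend(0.5)                      # 0.5 left for all later queries
```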

SLIDE 26

Differential Privacy

Is unfettered access to raw data truly essential?
Is released data sufficient (does it provide a sufficient utility guarantee)?

[Figure: Raw Data passes through a Privacy mechanism to produce Released Data (counts by Age, Diagnosis, city) for the User]

SLIDE 27

Challenges

Differential privacy cost accumulates quickly with number of queries

Typical tasks require multiple queries or multiple steps
Need to support multiple users

Impossible to guarantee utility for all (any) data or all (any) applications

SLIDE 28

Possible Middle Ground

Guaranteed utility for certain applications

Counting queries, classification, logistic regression

Guaranteed utility for certain kinds of data

Use prior or domain knowledge about data
Use intermediate results (differentially private)

[Figure: Raw Data passes through a Privacy mechanism to produce Released Data for the User; the mechanism is informed by Prior or domain knowledge, Target Applications, and an Intermediate Result]

SLIDE 29

Our Research: Adaptive Differentially Private Data Release

Data knowledge
Dense and “smooth” data
High dimensional and sparse data
Dynamic data
Application knowledge
Query workload
Specific tasks

SLIDE 30

Histogram Example


SLIDE 31

Strategy I: Baseline Cell Partitioning

[Figure: Age × Diagnosis cell histogram (Age 20, 30; Diagnosis A, B) before and after perturbation; the cell queries Q are answered through an alpha-DP interface]

Q1: count() where Age = 20, Diagnosis = A
Q2: count() where Age = 20, Diagnosis = B
…

Goal: to release a differentially private histogram to support random predicate queries

Q: select count() where Age = [20,30] and Income = 40K
If a query predicate consists of multiple cells or partitions, it will have aggregated perturbation error
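The aggregated error can be made concrete with a small sketch. It assumes each cell count is perturbed independently with Laplace noise of scale 1/alpha (cells are disjoint, so the whole histogram costs alpha under parallel composition). The cell counts and function names are invented for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_cell_histogram(cell_counts, alpha, rng):
    """Perturb every cell count independently with Laplace(1/alpha) noise."""
    return {cell: count + laplace_noise(1.0 / alpha, rng)
            for cell, count in cell_counts.items()}


def range_count(histogram, cells):
    """Answer a predicate query by summing the released counts of the cells it covers."""
    return sum(histogram[c] for c in cells)


def noise_variance(num_cells, alpha):
    """Variance of the summed noise: each Laplace(1/alpha) term contributes 2/alpha^2."""
    return num_cells * 2.0 / alpha ** 2


# Hypothetical true counts for cells (Age, Diagnosis).
cells = {(20, "A"): 10, (20, "B"): 20, (30, "A"): 40, (30, "B"): 20}
rng = random.Random(0)
released = noisy_cell_histogram(cells, alpha=1.0, rng=rng)

one_cell = range_count(released, [(20, "A")])
two_cells = range_count(released, [(20, "B"), (30, "B")])   # Diagnosis = B, Age in [20,30]
```

With alpha = 1, a one-cell answer carries noise variance 2 and a two-cell answer carries 4, so error grows linearly with the number of cells a predicate spans.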

SLIDE 32

Strategy II: Hierarchical Partitioning

[Figure: three-level hierarchy over Age (20, 30) and Diagnosis (A, B), from the full domain down to individual cells; each level is allotted alpha/3]

Large perturbation error due to the small share of the privacy budget available at each level

SLIDE 33

DPCube Strategy: Two phase partitioning

[Figure: Age × Diagnosis cell histogram (Age 20, 30; Diagnosis A, B) next to a partition histogram in which some cells are merged into one partition]

If a query predicate is contained in a published partition, the answer has to be estimated, typically based on a uniform distribution assumption. This introduces an approximation error.
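Under the uniform distribution assumption, a released partition count is spread evenly over the cells of the partition. The sketch below uses made-up numbers and a hypothetical helper `estimate_count` to show where the approximation error comes from.

```python
def estimate_count(partition_count, partition_cells, query_cells):
    """Estimate a query answer by assuming records are spread evenly over the partition."""
    covered = len(set(query_cells) & set(partition_cells))
    return partition_count * covered / len(partition_cells)


# Hypothetical partition of four cells released with a single count of 80.
partition_cells = [(20, "A"), (20, "B"), (30, "A"), (30, "B")]
true_cells = {(20, "A"): 5, (20, "B"): 35, (30, "A"): 20, (30, "B"): 20}
partition_count = sum(true_cells.values())

estimate = estimate_count(partition_count, partition_cells, [(20, "A")])
approximation_error = abs(estimate - true_cells[(20, "A")])
```

The estimate for cell (20, A) is 80 / 4 = 20 while the true count is 5; the error is small only when the partition really is close to uniform.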

SLIDE 34

DPCube Strategy: Two phase partitioning

[Figure: the Age × Diagnosis data (Age 20, 30; Diagnosis A, B) is processed in two phases]

1. Cell Partitioning: produces the cell histogram
2. Multi-dimensional Partitioning: produces the partition histogram

SLIDE 35

Partitioning Algorithm

Define a uniformity (randomness) measure H(Dt) for a partition Dt

information gain, variance

Recursive algorithm Partition(Dt) for a given partition Dt

Find the best splitting point (e.g. largest information gain) and partition the data into Dt1 and Dt2
Partition(Dt1) and Partition(Dt2)
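A minimal one-dimensional sketch of the recursive Partition(Dt) step. The choices here are assumptions for illustration, not necessarily those of DPCube: the uniformity measure is the sum of squared deviations from the mean, the stopping rule is a fixed threshold, and privacy budget handling is left out entirely.

```python
def sse(counts):
    """Sum of squared deviations from the mean: 0 for a perfectly uniform run."""
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts)


def partition(counts, lo=0, hi=None, threshold=1.0):
    """Recursively split counts[lo:hi] until every piece is close to uniform.

    Returns a list of (start, end) index ranges, end exclusive.
    """
    if hi is None:
        hi = len(counts)
    segment = counts[lo:hi]
    if len(segment) == 1 or sse(segment) <= threshold:
        return [(lo, hi)]
    # Best splitting point: the one that leaves the two halves most uniform.
    best = min(range(lo + 1, hi),
               key=lambda s: sse(counts[lo:s]) + sse(counts[s:hi]))
    return partition(counts, lo, best, threshold) + partition(counts, best, hi, threshold)


cell_counts = [10, 10, 10, 50, 50, 50]
pieces = partition(cell_counts)
```

For these counts the first split falls between the third and fourth cells, after which both halves are uniform and the recursion stops.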

SLIDE 36

Privacy and Utility of the Released Histogram

The released data satisfies alpha-differential privacy
Support for count queries and other OLAP queries and learning tasks
Formal utility results: (epsilon,delta)-usefulness
Experimental results for partition histogram

CENSUS dataset, 1M tuples, 4 attributes: Age (79), Education (14), Occupation (23), and Income (100)
Report absolute error and relative error for random count queries

SLIDE 37

DPCube Result Example

Original histogram

Diff. Private Cell histogram
Diff. private partition histogram
Diff. Private Estimated Cell histogram

SLIDE 38

Experimental Results: Comparison with other partitioning strategies

Higher alpha (lower privacy) results in lower error (higher utility)
Kd-tree based approach outperforms others
Cell partitioning is comparable in absolute error but suffers in relative error due to the sparsity of the data

SLIDE 39

High dimensional sparse data

Many real-world datasets are high dimensional and sparse
Web search log data, web transactions, etc.
A direct application of the 2-phase approach

Cell histogram highly inaccurate
Computationally not scalable


SLIDE 40

Top-down recursive partitioning

Recursively partition the spaces that have sufficient density
Use a context-free taxonomy tree
Dynamically allocate and keep track of the budget

SLIDE 41

Adaptive Hierarchical Strategy

Data is sparse and high dimensional

1a. Overall count
1b. Partitioning of non-sparse regions
2a. Partition count
2b. Partitioning of non-sparse regions
…
n. Partition count

SLIDE 42