CS573 Data Privacy and Security
Statistical Databases
Li Xiong
Today
Statistical databases
- Definitions
- Early query restriction methods
- Output perturbation and differential privacy
Statistical Data Release
[Figure: population counts by Age and city, and by Age and Diagnosis]
- Release statistical summary of the data (vs. individual records)
- Useful for analysis and learning
- Medical statistics
- Query log statistics – frequent search terms
- Still need rigorous inference control
Statistical Database
- A statistical database is a database that provides statistics on subsets of records
- Statistics such as SUM, MEAN, MEDIAN, COUNT, MAX and MIN may be computed over records
- Inference control prevents inference from statistics to individual records
Methods
- Data perturbation/anonymization
- Query restriction
- Output perturbation
[Figure: three approaches side by side: Data Perturbation, Query Restriction, and Output Perturbation, each showing queries and results exchanged with the database]
Methods
- Data perturbation/anonymization
- Query restriction
  - Query set size control
  - Query set overlap control
  - Query auditing
- Output perturbation
Query Set Size Control
- A query-set size control limits the number of records that must be in the result set
- Query results are displayed only if the size of the query set |C| satisfies K <= |C| <= L - K, where L is the size of the database and K is a parameter that satisfies 0 <= K <= L/2
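The size condition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `restricted_count`, the record layout, and the toy data are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence

def restricted_count(db: Sequence[dict],
                     predicate: Callable[[dict], bool],
                     k: int) -> Optional[int]:
    """Return COUNT over the query set, or None if the size check fails."""
    size_l = len(db)  # L: size of the database
    if not 0 <= k <= size_l / 2:
        raise ValueError("K must satisfy 0 <= K <= L/2")
    query_set = [record for record in db if predicate(record)]
    # Release the statistic only if K <= |C| <= L - K.
    if k <= len(query_set) <= size_l - k:
        return len(query_set)
    return None

# Hypothetical toy database with L = 6 records.
db = [{"sex": "F", "age": 30}, {"sex": "F", "age": 41}, {"sex": "M", "age": 42},
      {"sex": "M", "age": 35}, {"sex": "F", "age": 42}, {"sex": "M", "age": 50}]

females = restricted_count(db, lambda r: r["sex"] == "F", k=2)   # answered
one_person = restricted_count(db, lambda r: r["age"] == 50, k=2)  # refused
everyone = restricted_count(db, lambda r: True, k=2)              # refused
```

The upper bound L - K matters because a query set that covers almost the whole database reveals its small complement by subtraction.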
Tracker
- Q1: Count ( Sex = Female ) = A
- Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B
What if B = A+1?
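Tracker
- Q1: Count ( Sex = Female ) = A
- Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B
If B = A+1, exactly one record matches (Age = 42 & Sex = Male & Employer = ABC)
- Q3: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC & Diagnosis = Schizophrenia) )
If Q3 = B, that individual has the diagnosis; if Q3 = A, he does not. Positively or negatively compromised!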
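The tracker can be replayed on a toy database. This is a sketch under stated assumptions: the `count` helper, the six records, and K = 2 are invented for illustration; only the three queries follow the slide.

```python
def count(db, predicate, k):
    """COUNT with query set size control: answer only if K <= |C| <= L - K."""
    n = sum(1 for record in db if predicate(record))
    return n if k <= n <= len(db) - k else None

# Hypothetical records; exactly one male, aged 42, employed by ABC.
db = [
    {"sex": "F", "age": 30, "employer": "XYZ", "diagnosis": "Flu"},
    {"sex": "F", "age": 42, "employer": "ABC", "diagnosis": "Asthma"},
    {"sex": "F", "age": 55, "employer": "ABC", "diagnosis": "Flu"},
    {"sex": "M", "age": 42, "employer": "ABC", "diagnosis": "Schizophrenia"},
    {"sex": "M", "age": 35, "employer": "XYZ", "diagnosis": "Flu"},
    {"sex": "M", "age": 61, "employer": "XYZ", "diagnosis": "Asthma"},
]
K = 2

def is_female(r):
    return r["sex"] == "F"

def is_target(r):
    return r["age"] == 42 and r["sex"] == "M" and r["employer"] == "ABC"

def has_dx(r):
    return r["diagnosis"] == "Schizophrenia"

# Asking about the target directly is refused (query set of size 1).
direct = count(db, is_target, K)

# Padding the query set with the "tracker" (Sex = Female) gets past the check.
a = count(db, is_female, K)                                            # Q1
b = count(db, lambda r: is_female(r) or is_target(r), K)               # Q2
c = count(db, lambda r: is_female(r) or (is_target(r) and has_dx(r)), K)  # Q3

target_is_unique = (b == a + 1)
# Q3 - Q1 is 1 if the target has the diagnosis and 0 otherwise.
target_has_diagnosis = (c - a == 1) if target_is_unique else None
```

Every individual query passes the size check, yet their differences isolate a single record.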
Query set size control
- If the threshold value k is large, it restricts too many queries
- It still does not guarantee protection from compromise
- The database can be easily compromised within a frame of 4-5 queries
Query Set Overlap Control
- Basic idea: successive queries must be checked against the number of common records
- If the number of records a query has in common with any previous query exceeds a given threshold, the requested statistic is not released
- A query q(C) is only allowed if |q(C) ∩ q(D)| ≤ r, r > 0, for every previously answered query q(D), where r is set by the administrator
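A minimal sketch of the overlap check follows. The class name `OverlapController`, the use of record positions as identifiers, and the toy data are assumptions for illustration only.

```python
class OverlapController:
    """Answer COUNT queries only if they overlap every earlier one in <= r records."""

    def __init__(self, r):
        if r <= 0:
            raise ValueError("r must be positive")
        self.r = r
        self.answered = []  # query sets (frozensets of record ids) released so far

    def count(self, db, predicate):
        query_set = frozenset(i for i, rec in enumerate(db) if predicate(rec))
        # Compare against every previously answered query set.
        if any(len(query_set & prev) > self.r for prev in self.answered):
            return None  # statistic not released
        self.answered.append(query_set)
        return len(query_set)

# Hypothetical data: six records with an age attribute.
db = [{"age": a} for a in (21, 25, 28, 33, 38, 47)]
ctrl = OverlapController(r=1)

under_30 = ctrl.count(db, lambda rec: rec["age"] < 30)   # answered
over_30 = ctrl.count(db, lambda rec: rec["age"] >= 30)   # disjoint, answered
under_40 = ctrl.count(db, lambda rec: rec["age"] < 40)   # superset of first, refused
```

The log of answered query sets grows with every query, and each new query is compared with all of them.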
Query-set-overlap control
- Statistics for a set and its subset cannot both be released, limiting usefulness
- High processing overhead: every new query is compared with all previous ones
- Multiple users: need to keep a profile per user and to consider collusion between users
- Still no formal privacy guarantee
Auditing
- Keep up-to-date logs of all queries made by each user and check for possible compromise when a new query is issued
- Excessive computation and storage requirements
- Only “efficient” methods for special types of queries
Audit Expert (Chin 1982)
- Query auditing method for SUM queries
- A SUM query can be considered as a linear equation $\sum_{i=1}^{L} a_i x_i = q$, where $a_i \in \{0, 1\}$ is whether record i belongs to the query set, $x_i$ is the sensitive value, and $q$ is the query result
- A set of SUM queries can be thought of as a system of linear equations
- Maintains the binary matrix representing linearly independent queries and updates it when a new query is issued
- A row with all 0s except for the ith column indicates disclosure
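The auditing idea can be sketched with ordinary Gauss-Jordan elimination over exact fractions. This is not the Audit Expert algorithm itself, only an illustration of the disclosure test; the class name `SumAuditor` and the example queries are assumptions. It relies on the fact that a unit vector lies in the row space of a matrix exactly when it appears as a row of the reduced row echelon form.

```python
from fractions import Fraction

class SumAuditor:
    """Deny a SUM query if answering it would determine some single x_i."""

    def __init__(self, num_records):
        self.n = num_records
        self.basis = []  # reduced row echelon form of the answered queries

    @staticmethod
    def _rref(rows):
        rows = [list(r) for r in rows]
        pivot_row = 0
        ncols = len(rows[0]) if rows else 0
        for col in range(ncols):
            pr = next((i for i in range(pivot_row, len(rows))
                       if rows[i][col] != 0), None)
            if pr is None:
                continue
            rows[pivot_row], rows[pr] = rows[pr], rows[pivot_row]
            pivot = rows[pivot_row][col]
            rows[pivot_row] = [x / pivot for x in rows[pivot_row]]
            for i in range(len(rows)):
                if i != pivot_row and rows[i][col] != 0:
                    factor = rows[i][col]
                    rows[i] = [x - factor * y
                               for x, y in zip(rows[i], rows[pivot_row])]
            pivot_row += 1
        # Drop zero rows: only linearly independent queries are stored.
        return [r for r in rows if any(x != 0 for x in r)]

    def allow(self, query_set):
        vec = [Fraction(1 if i in query_set else 0) for i in range(self.n)]
        candidate = self._rref(self.basis + [vec])
        # A row with a single non-zero entry pins down one sensitive value.
        if any(sum(1 for x in row if x != 0) == 1 for row in candidate):
            return False
        self.basis = candidate
        return True

auditor = SumAuditor(num_records=4)
results = [
    auditor.allow({0, 1, 2}),  # x0 + x1 + x2
    auditor.allow({0, 1}),     # would reveal x2 by subtraction
    auditor.allow({2, 3}),     # x2 + x3
    auditor.allow({0, 1, 3}),  # combined with the others, reveals x2
]
```

A linearly dependent query leaves the stored matrix unchanged, so it is answered without increasing storage.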
Audit Expert
- Only stores linearly independent queries
- Not all queries are linearly independent
  - Q1: Sum(Sex=M)
  - Q2: Sum(Sex=M AND Age>20)
  - Q3: Sum(Sex=M AND Age<=20)
  - Here Q1 = Q2 + Q3
Audit Expert
- O(L^2) time complexity
- Further work reduced this to O(L) time and space when the number of queries < L
- Only for SUM queries
Auditing – recent developments
- Online auditing
  - “Detect and deny” queries that violate the privacy requirement
  - Denials themselves may implicitly disclose sensitive information
- Offline auditing
  - Check whether a privacy requirement has been violated after the queries have been executed
  - Detects violations; does not prevent them
Methods
- Data perturbation/anonymization
- Query restriction
- Output perturbation
Differential privacy
Differential privacy requires the outcome to be formally indistinguishable when run with and without any particular record in the data set
E.g.: Q = select count() where Age = [20,30] and Diagnosis = B
[Figure: output perturbation. The user issues Q; the answer is A(D1) on D1 (Bob in) and A(D2) on D2 (Bob out)]
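Differential Privacy
- Differential privacy: a mechanism $A$ is $\alpha$-differentially private if, for all data sets $D_1$ and $D_2$ differing in one record and every set of outputs $O$, $\Pr[A(D_1) \in O] \le e^{\alpha} \Pr[A(D_2) \in O]$
- Laplace mechanism: $A(D) = Q(D) + Y$, where $Y$ is drawn from $\mathrm{Lap}(\Delta_Q / \alpha)$
- Query sensitivity: $\Delta_Q = \max |Q(D_1) - Q(D_2)|$ over all data sets $D_1$, $D_2$ differing in one record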
[Figure: differentially private interface. The user issues Q; on D1 (Bob in) the interface returns A(D1) = Q(D1) + Y1, and on D2 (Bob out) it returns A(D2) = Q(D2) + Y2]
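A minimal sketch of the Laplace mechanism for a count query, using only the standard library. The helper names, the toy records, and the choice alpha = 0.5 are assumptions for illustration. A count has sensitivity 1 because adding or removing one record changes it by at most 1, and a Laplace sample is drawn here as the difference of two exponential samples.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def private_count(db, predicate, alpha, rng):
    """Answer a COUNT query with Laplace noise calibrated to sensitivity 1."""
    true_count = sum(1 for record in db if predicate(record))
    sensitivity = 1.0  # one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / alpha, rng)

def q(record):
    # Q = select count() where Age = [20,30] and Diagnosis = B
    return 20 <= record["age"] <= 30 and record["diagnosis"] == "B"

# Hypothetical data: D2 is D1 without Bob's record.
d2 = [{"age": 25, "diagnosis": "B"}, {"age": 28, "diagnosis": "A"},
      {"age": 41, "diagnosis": "B"}]
d1 = d2 + [{"age": 22, "diagnosis": "B"}]  # Bob in

rng = random.Random(0)
a_d1 = private_count(d1, q, alpha=0.5, rng=rng)  # A(D1) = Q(D1) + Y1
a_d2 = private_count(d2, q, alpha=0.5, rng=rng)  # A(D2) = Q(D2) + Y2
```

The true answers differ by 1, but a single noisy answer gives only limited evidence about which database produced it; a smaller alpha means more noise.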
Composition of Differential Privacy
- Sequential composition [McSherry SIGMOD 09]
  - Let each Mi provide alpha_i-differential privacy. The sequence of Mi provides (sum of alpha_i)-differential privacy
- Parallel composition
  - If Di are disjoint subsets of the original database and Mi provides alpha-differential privacy for each Di, then the sequence of Mi provides alpha-differential privacy
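The two composition rules suggest a simple budget accountant. This is a sketch only: the class `BudgetAccountant` and its method names are hypothetical. It charges the sum of the parameters for mechanisms run on the same data, and the maximum for mechanisms run on disjoint subsets.

```python
class BudgetAccountant:
    """Track cumulative privacy cost against a total budget."""

    def __init__(self, total_alpha):
        self.total = total_alpha
        self.spent = 0.0

    def _charge(self, cost):
        if self.spent + cost > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += cost

    def charge_sequential(self, alphas):
        """Mechanisms run on the same data: costs add up."""
        self._charge(sum(alphas))

    def charge_parallel(self, alphas):
        """Mechanisms run on disjoint subsets: cost is the largest one."""
        self._charge(max(alphas))

accountant = BudgetAccountant(total_alpha=1.0)
accountant.charge_sequential([0.2, 0.3])    # two queries on the full database
accountant.charge_parallel([0.1, 0.1, 0.1])  # one query per disjoint partition
remaining = accountant.total - accountant.spent
```

Under this accounting, a histogram over disjoint cells costs no more than a single cell, which is why partition-based releases are attractive.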
[Figure: differentially private interface. The user issues Q1, Q2, …; on D1 (Bob in) the interface returns A1(D1), A2(D1), …, and on D2 (Bob out) it returns A1(D2), A2(D2), …]
Differential Privacy
- Is unfettered access to raw data truly essential?
- Is released data sufficient (does it provide a sufficient utility guarantee)?
[Figure: raw data passes through a privacy mechanism to become released data for the user; example release: counts by Age, city, and Diagnosis]
Challenges
- Differential privacy cost accumulates quickly with the number of queries
  - Typical tasks require multiple queries or multiple steps
  - Need to support multiple users
- Impossible to guarantee utility for all (any) data or all (any) applications
Possible Middle Ground
- Guaranteed utility for certain applications
  - Counting queries, classification, logistic regression
- Guaranteed utility for certain kinds of data
  - Use prior or domain knowledge about data
  - Use intermediate results (differentially private)
[Figure: raw data passes through a privacy mechanism to become released data for the user; the mechanism also draws on prior or domain knowledge, target applications, and intermediate results]
Our Research: Adaptive Differentially Private Data Release
- Data knowledge
  - Dense and “smooth” data
  - High dimensional and sparse data
  - Dynamic data
- Application knowledge
  - Query workload
  - Specific tasks
Histogram Example
Strategy I: Baseline Cell Partitioning
[Figure: original cell histogram and differentially private cell histogram over Age (20, 30) and Diagnosis (A, B)]
- Q1: count() where Age = 20, Diagnosis = A
- Q2: count() where Age = 20, Diagnosis = B
- …
- Each query Q is answered through an alpha-DP interface
- Goal: to release a differentially private histogram to support random predicate queries
- Q: select count() where Age = [20,30] and Income = 40K
- If a query predicate consists of multiple cells or partitions, it will have aggregated perturbation error
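The aggregation effect can be checked empirically. The sketch below perturbs each cell of a made-up 2x2 histogram with Laplace(1/alpha) noise and compares the spread of a one-cell answer with that of a four-cell answer. All names and counts are assumptions for illustration. Since each record falls in exactly one cell, parallel composition makes the whole noisy histogram alpha-differentially private.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def noisy_cell_histogram(counts, alpha, rng):
    """Perturb every cell count with Laplace(1/alpha) noise."""
    return {cell: n + laplace_noise(1.0 / alpha, rng)
            for cell, n in counts.items()}

def answer(hist, ages, diagnoses):
    """Answer a count query by summing the noisy cells the predicate covers."""
    return sum(hist[(a, d)] for a in ages for d in diagnoses)

# Made-up cell counts over Age x Diagnosis.
counts = {(20, "A"): 12, (20, "B"): 3, (30, "A"): 7, (30, "B"): 25}

def empirical_variance(ages, diagnoses, alpha, trials, rng):
    answers = [answer(noisy_cell_histogram(counts, alpha, rng), ages, diagnoses)
               for _ in range(trials)]
    mean = sum(answers) / trials
    return sum((x - mean) ** 2 for x in answers) / trials

rng = random.Random(0)
var_one_cell = empirical_variance([20], ["A"], 1.0, 5000, rng)
var_four_cells = empirical_variance([20, 30], ["A", "B"], 1.0, 5000, rng)
```

With independent noise per cell, the variance of an answer grows linearly in the number of cells it sums: about 2/alpha^2 for one cell and 8/alpha^2 for four.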
Strategy II: Hierarchical Partitioning
[Figure: hierarchy of histograms over Age (20, 30) and Diagnosis (A, B); each of the three levels receives privacy budget alpha/3]
- Large perturbation error due to small divided privacy budget at each level
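The cost of dividing the budget can be made concrete. As a sketch, assume each level answers count queries of sensitivity 1 with Laplace noise and that the budget alpha is split evenly over three levels, as in the figure. The variance of Laplace noise with scale b is 2b^2, so:

```latex
\mathrm{Var}\left[\mathrm{Lap}\left(\tfrac{1}{\alpha}\right)\right] = \frac{2}{\alpha^{2}},
\qquad
\mathrm{Var}\left[\mathrm{Lap}\left(\tfrac{1}{\alpha/3}\right)\right]
  = \frac{2 \cdot 3^{2}}{\alpha^{2}} = \frac{18}{\alpha^{2}}
```

Each released count therefore carries nine times the noise variance it would have under the full budget.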
DPCube Strategy: Two phase partitioning
[Figure: histogram over Age (20, 30) and Diagnosis (A, B) alongside a published partition histogram]
- If a query predicate is contained in a published partition, the answer has to be estimated, typically based on a uniform distribution assumption. This introduces an approximation error.
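The uniform distribution assumption amounts to scaling the partition count by the fraction of the partition that the query covers. A minimal sketch follows; the function name and the numbers are made up for illustration.

```python
def estimate_from_partition(partition_count, partition_cells, query_cells):
    """Estimate a count for query_cells assuming counts are spread evenly."""
    overlap = len(set(partition_cells) & set(query_cells))
    return partition_count * overlap / len(partition_cells)

# Hypothetical published partition covering four cells with total count 40.
partition_cells = [(20, "A"), (20, "B"), (30, "A"), (30, "B")]
estimate = estimate_from_partition(40, partition_cells, [(20, "A")])

# If the true count of cell (20, "A") were 25, the approximation error is 15.
approximation_error = abs(25 - estimate)
```

The estimate is exact only when the cells inside the partition really are uniform, which is why the partitioning step tries to group cells with similar counts.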
DPCube Strategy: Two phase partitioning
[Figure: original histogram over Age (20, 30) and Diagnosis (A, B), the cell histogram, and the partition histogram]
- 1. Cell Partitioning (cell histogram)
- 2. Multi-dimensional Partitioning (partition histogram)
Partitioning Algorithm
- Define a uniformity (randomness) measure H(Dt) for a partition Dt
  - e.g. information gain, variance
- Recursive algorithm Partition(Dt) for a given partition Dt
  - Find the best splitting point (e.g. largest information gain) and partition the data into Dt1 and Dt2
  - Call Partition(Dt1) and Partition(Dt2)
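The recursion can be sketched in one dimension. This simplification makes several assumptions not taken from the slides: cells are ordered along a single attribute, variance is the uniformity measure, the best split is the one minimizing the summed squared deviation of the two halves, and recursion stops when the variance falls below a hypothetical `threshold`. Privacy noise is left out of the sketch.

```python
from statistics import pvariance

def partition(counts, lo, hi, threshold):
    """Return half-open ranges (lo, hi) whose cell counts are near-uniform."""
    if hi - lo <= 1 or pvariance(counts[lo:hi]) <= threshold:
        return [(lo, hi)]

    def cost(split):
        # Summed squared deviation of both halves; lower means more uniform.
        left, right = counts[lo:split], counts[split:hi]
        return len(left) * pvariance(left) + len(right) * pvariance(right)

    best = min(range(lo + 1, hi), key=cost)  # best splitting point
    return (partition(counts, lo, best, threshold)
            + partition(counts, best, hi, threshold))

# Made-up cell counts along one attribute.
counts = [10, 11, 10, 50, 52, 51, 0, 0]
parts = partition(counts, 0, len(counts), threshold=1.0)

# Partition histogram: one total per partition.
partition_totals = [sum(counts[lo:hi]) for lo, hi in parts]
```

On this input the recursion first separates the empty cells, then splits the remainder into the low and high groups, so each published partition is close to uniform.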
Privacy and Utility of the Released Histogram
- The released data satisfies alpha-differential privacy
- Support for count queries and other OLAP queries and learning tasks
- Formal utility results: (epsilon, delta)-usefulness
- Experimental results for partition histogram
  - CENSUS dataset, 1M tuples, 4 attributes: Age (79), Education (14), Occupation (23), and Income (100)
  - Report absolute error and relative error for random count queries
DPCube Result Example
- Original histogram
- Diff. private cell histogram
- Diff. private partition histogram
- Diff. private estimated cell histogram
Experimental Results: Comparison with other partitioning strategies
- Higher alpha (lower privacy) results in lower error (higher utility)
- Kd tree based approach outperforms others
- Cell partitioning is comparable in absolute error but suffers in relative error due to the sparsity of the data
High dimensional sparse data
- Many real-world data are high dimensional and sparse
  - Web search log data, web transactions, etc.
- A direct application of the 2-phase approach
  - Cell histogram highly inaccurate
  - Computationally not scalable
Top-down recursive partitioning
- Recursively partition the spaces that have sufficient density
- Use a context-free taxonomy tree
- Dynamically allocate and keep track of the budget
Adaptive Hierarchical Strategy
Data is sparse and high dimensional
- 1a. Overall count
- 1b. Partitioning of non-sparse regions
- 2a. Partition count
- 2b. Partitioning of non-sparse regions
- n. Partition count