Privacy Cognizant Information Systems
Rakesh Agrawal
IBM Almaden Research Center
- Joint work with Srikant, Kiernan, Xu & Evfimievski
Thesis
– Protect the privacy and ownership of information
– Do not impede the flow of information
– U.S. and international regulations
– Legal proceedings against businesses
– “Consumer privacy apprehensions continue to plague the Web … these fears will hold back roughly $15 billion in e-Commerce revenue.” Forrester Research, 2001
– Most consumers are “privacy pragmatists.” Westin Surveys
– The right to privacy: the most cherished of human freedom (Warren & Brandeis, 1890)
– Web-commerce, e.g. recommendation service
– Must not slow down the speed of client interaction
– Must scale to very large number of clients
– Ship model to the clients
– Use oblivious computations
[Figure: three users' records: (35, 95,000, J.S. Bach, painting, nasa); (45, 60,000, baseball, cnn); (42, 85,000, camping, microsoft).]
[Figure: the same three records are collected and passed to the Mining Algorithm, which produces the Data Mining Model.]
[Figure: each user randomizes the record before sending it. True records: (35, 95,000, J.S. Bach, painting, nasa); (45, 60,000, baseball, cnn); (42, 85,000, camping, microsoft). Randomized records: (50, 65,000, Metallica, painting, nasa); (38, 90,000, soccer, fox); (32, 55,000, camping, linuxware).]
35 becomes 50 (35+15)
– Per-record randomization without considering other records
– Randomization parameters common across users
– Randomization techniques differ for numeric and categorical data
– Each attribute randomized independently
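A minimal sketch of per-record randomization for one numeric attribute, assuming additive noise drawn uniformly from [-spread, +spread]. The slide's "35 becomes 50" example corresponds to a draw of +15. The function name `randomize_numeric`, the `spread` parameter, and the value 30 are illustrative assumptions, not taken from the talk.

```python
import random

def randomize_numeric(value, spread, rng):
    """Return value + noise, where noise ~ Uniform[-spread, +spread].

    The same `spread` is used for every user (common randomization
    parameters), and each record is perturbed without looking at others.
    """
    return value + rng.uniform(-spread, spread)

rng = random.Random(0)          # seeded only to make the sketch repeatable
true_ages = [35, 45, 42]        # ages from the example records
SPREAD = 30                     # assumed noise range, not from the slides
sent_ages = [randomize_numeric(age, SPREAD, rng) for age in true_ages]
```

Only `sent_ages` would leave the client; the true ages stay with the user.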
True values Never Leave the User!
[Figure: the randomized records pass through Recovery and the Mining Algorithm to produce the Data Mining Model.]
Recovery of distributions, not individual records
Original values: $x_1, x_2, \ldots, x_n$
Randomizing values: $y_1, y_2, \ldots, y_n$
Observed values: $x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$
f_X^0 := Uniform distribution
j := 0
repeat
    f_X^{j+1}(a) := Bayes' Rule
    j := j+1
until (stopping criterion met)
(R. Agrawal & R. Srikant, SIGMOD 2000)
Converges to maximum likelihood estimate.
– – D. Agrawal & C.C. Aggarwal, PODS 2001.
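Bayes' Rule update, where $x_i + y_i$ is the $i$-th observed value and $f_Y$ is the known noise density:
$$f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y\big((x_i + y_i) - a\big)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y\big((x_i + y_i) - z\big)\, f_X^{j}(z)\, dz}$$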
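The iterative reconstruction can be sketched in a discretized form, assuming X is approximated by probability masses on a fixed set of bin midpoints and the noise is Gaussian with known standard deviation. The bin layout, the Gaussian choice, the synthetic data, and the fixed iteration count used as the stopping criterion are all assumptions made for illustration.

```python
import math
import random

def reconstruct(observed, midpoints, noise_density, iterations=50):
    """Estimate the distribution of X from observed values w_i = x_i + y_i."""
    n = len(observed)
    f = [1.0 / len(midpoints)] * len(midpoints)      # f_X^0 := uniform
    for _ in range(iterations):                      # assumed stopping rule
        new_f = [0.0] * len(midpoints)
        for w in observed:
            # Numerator terms f_Y(w - a) * f_X^j(a) for every bin a.
            weights = [noise_density(w - a) * fa for a, fa in zip(midpoints, f)]
            total = sum(weights)                     # discretized integral
            for k, wk in enumerate(weights):
                new_f[k] += wk / (total * n)         # average of posteriors
        f = new_f
    return f

SIGMA = 10.0                                         # assumed noise std. dev.

def gaussian(t):
    # Unnormalized Gaussian density; the constant cancels in Bayes' rule.
    return math.exp(-t * t / (2 * SIGMA * SIGMA))

rng = random.Random(1)
true_values = [rng.uniform(20, 30) for _ in range(500)] + \
              [rng.uniform(60, 70) for _ in range(500)]
observed = [x + rng.gauss(0, SIGMA) for x in true_values]
midpoints = [5 + 10 * k for k in range(10)]          # bins covering 0..100
estimate = reconstruct(observed, midpoints, gaussian)
```

`estimate[k]` is the estimated probability that a value falls in the bin centered at `midpoints[k]`; no individual `x_i` is recovered.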
– Reconstruct for each attribute once at the beginning
– For each attribute, first split by class, then reconstruct separately for each class.
– Reconstruct at each node
See SIGMOD 2000 paper for details.
– Original: unperturbed data without randomization.
– Randomized: perturbed data but without making any corrections for randomization.
[Figure: results for Fn 1 to Fn 5 (axis 50 to 100), series Original, Randomized, Reconstructed.]
[Figure: results over the range 10 to 200 (axis 40 to 100), series Original, Randomized, Reconstructed.]
– Rizvi & Haritsa [VLDB 02]
– Evfimievski, Srikant, Agrawal, & Gehrke [KDD-02]
Privacy Breach Control: Probabilistic limits on what
– Evfimievski, Srikant, Agrawal, & Gehrke [KDD-02]
– Evfimievski, Gehrke & Srikant [PODS-03]
How to build a decision-tree classifier on the union of two private databases (Lindell & Pinkas [Crypto 2000])
Basic Idea:
Find attribute with highest information gain privately
Independently split on this attribute and recurse
Selecting the Split Attribute
Given v1 known to DB1 and v2 known to DB2, compute (v1 + v2) log (v1 + v2) and output random shares of the answer
Given random shares, use Yao's protocol [FOCS 84] to compute information gain.
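To illustrate only what "random shares of the answer" means, here is a sketch that splits (v1 + v2) log (v1 + v2) into two additive shares. It is not the private protocol: this function sees both inputs, which the real protocol never allows. The fixed-point `SCALE`, the `MODULUS`, and the function names are assumptions for illustration.

```python
import math
import secrets

MODULUS = 2 ** 64      # assumed share space
SCALE = 10 ** 6        # assumed fixed-point precision for the real-valued answer

def share_x_log_x(v1, v2):
    """Return two random-looking shares that sum to (v1+v2)*ln(v1+v2).

    Stand-in for the oblivious computation: a trusted party computes the
    answer here, whereas the real protocol never reveals v1 or v2.
    """
    v = v1 + v2                                   # must be positive
    answer = round(v * math.log(v) * SCALE)
    share1 = secrets.randbelow(MODULUS)           # uniformly random share
    share2 = (answer - share1) % MODULUS          # complements share1
    return share1, share2

def combine(share1, share2):
    """Recombine the two shares into the (approximate) real answer."""
    return ((share1 + share2) % MODULUS) / SCALE

s1, s2 = share_x_log_x(30, 70)
```

Each share on its own is a uniformly random number; only their sum carries information.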
Trade-off
+ Accuracy
– Performance & scaling
– Association rules
– EM Clustering
– Trade off between the amount of privacy breach and performance
– Examination of other approaches (e.g. randomization based on swapping)
– What I may see or hear in the course of treatment … I will keep to myself.
– US (FIPA, 1974), Europe (OECD, 1980), Canada (1995), Australia (2000), Japan (2003)
Agrawal, Kiernan, Srikant & Xu: VLDB 2002.
Purpose Specification
Associate with data the purposes for collection
Consent
Obtain donor’s consent on the purposes
Limited Collection
Collect minimum necessary data
Limited Use
Run only queries that are consistent with the purposes
Limited Disclosure
Do not release data without donor’s consent
Limited Retention
Do not retain data beyond necessary
Accuracy
Keep data accurate and up-to-date
Safety
Protect against theft and other misappropriations
Openness
Allow donor access to data about the donor
Compliance
Verifiable compliance with the above principles
[Figure: architecture components: Privacy Policy, Privacy Metadata Creator, Privacy Metadata (in Store).]
For each purpose & piece
Different designs possible. Converts privacy policy into privacy metadata tables.
Purpose         | Table    | Attribute | External-recipients     | Retention | Authorized-users
purchase        | customer | name      | {delivery, credit-card} | 1 month   | {shipping, charge}
purchase        | customer | email     | empty                   | 1 month   | {shipping}
register        | customer | name      | empty                   | 3 years   | {registration}
register        | customer | email     | empty                   | 3 years   | {registration}
recommendations |          | book      | empty                   | 10 years  | {mining}
External-recipients supports Limited Disclosure; Retention supports Limited Retention.
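The privacy metadata rows above could be held as plain records and consulted before a query runs. This is a sketch under assumptions: the `PolicyRow` type and `is_authorized` helper are hypothetical names, and the table name of the recommendations row is left as `None` because it is not recoverable from the slide.

```python
from typing import NamedTuple, Optional, FrozenSet

class PolicyRow(NamedTuple):
    purpose: str
    table: Optional[str]          # None where the slide text is missing
    attribute: str
    external_recipients: FrozenSet[str]
    retention: str
    authorized_users: FrozenSet[str]

PRIVACY_METADATA = [
    PolicyRow("purchase", "customer", "name",
              frozenset({"delivery", "credit-card"}), "1 month",
              frozenset({"shipping", "charge"})),
    PolicyRow("purchase", "customer", "email", frozenset(), "1 month",
              frozenset({"shipping"})),
    PolicyRow("register", "customer", "name", frozenset(), "3 years",
              frozenset({"registration"})),
    PolicyRow("register", "customer", "email", frozenset(), "3 years",
              frozenset({"registration"})),
    PolicyRow("recommendations", None, "book", frozenset(), "10 years",
              frozenset({"mining"})),
]

def is_authorized(user, purpose, table, attribute):
    """True if some metadata row lets `user` read the attribute for `purpose`."""
    return any(
        row.purpose == purpose and row.table == table
        and row.attribute == attribute and user in row.authorized_users
        for row in PRIVACY_METADATA
    )
```

For example, `is_authorized("charge", "purchase", "customer", "name")` holds, while the same user asking for `email` is refused.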
[Figure: architecture components: Data Collection, Privacy Constraint Validator, Audit Info, Audit Trail, Privacy Metadata (in Store).]
Privacy policy compatible with user’s privacy preference? Audit trail for compliance.
Compliance Consent
[Figure: architecture components: Data Collection, Privacy Constraint Validator, Data Accuracy Analyzer, Audit Info, Audit Trail, Privacy Metadata (in Store).]
Data cleansing, e.g., errors in address.
Record Access Control
Associate set of purposes with each record.
Purpose Specification Accuracy
[Figure: architecture components: Queries, Attribute Access Control, Record Access Control, Privacy Metadata (in Store).]
“telemarketing” cannot see credit card info.
include “telemarketing” in set of purposes.
Safety Limited Use
cannot issue query tagged “charge”.
Safety
[Figure: architecture components: Queries, Query Intrusion Detector, Attribute Access Control, Record Access Control, Audit Info, Audit Trail, Privacy Metadata (in Store).]
Telemarketing query that asks for all phone numbers.
query intrusion detector
Safety Compliance
[Figure: architecture components: Data Retention Manager, Encryption Support, Privacy Metadata and Other data (in Store).]
Delete items in accordance with privacy policy. Additional security for sensitive data.
Data Collection Analyzer
Analyze queries to identify unnecessary collection, retention & authorizations.
Limited Retention Limited Collection Safety
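A minimal sketch of what a data retention manager might do, assuming each stored item records its purpose and collection date. The retention periods follow the privacy metadata example (1 month, 3 years, 10 years), approximated in days; the record layout and the function name `expired_items` are hypothetical.

```python
from datetime import date, timedelta

# Retention per purpose, approximated in days (assumption: 30-day months,
# 365-day years).
RETENTION = {
    "purchase": timedelta(days=30),
    "register": timedelta(days=3 * 365),
    "recommendations": timedelta(days=10 * 365),
}

def expired_items(items, today):
    """Return the items whose retention period has elapsed as of `today`."""
    return [
        item for item in items
        if today - item["collected"] > RETENTION[item["purpose"]]
    ]

items = [
    {"id": 1, "purpose": "purchase", "collected": date(2003, 1, 1)},
    {"id": 2, "purpose": "register", "collected": date(2003, 1, 1)},
]
to_delete = expired_items(items, today=date(2003, 6, 1))
```

Here only the purchase item is past its one-month retention and would be deleted.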
[Figure: full architecture: Privacy Policy, Data Collection, Queries, Privacy Metadata Creator, Privacy Constraint Validator, Data Accuracy Analyzer, Audit Info, Audit Trail, Query Intrusion Detector, Attribute Access Control, Record Access Control, Data Retention Manager, Encryption Support, Data Collection Analyzer, Privacy Metadata and Other data (in Store).]
Statistical Databases
– Provide statistical information (sum, count, etc.) without compromising sensitive information about individuals, [AW89]
Multilevel Secure Databases
– Multilevel relations, e.g., records tagged “secret”, “confidential”, …
Need to protect privacy in transactional databases that support daily operations.
– Cannot restrict queries to statistical queries.
– Cannot tag all the records “top secret”.
Privacy enforcement requires cell-level decisions (which may be different for different queries)
– How to minimize the cost of privacy checking?
Encryption to avoid data theft
– How to index encrypted data for range queries?
Intrusive queries from authorized users
– Query intrusion detection?
Identifying unnecessary data collection
– Assets info needed only if salary is below a threshold
– Queries only ask “Salary > threshold” for rent application
Forgetting data after the purpose is fulfilled
– Databases designed not to lose data
– Interaction with compliance
[Figure: query Q and result R under three architectures: Centralized, Federated, Mediator.]
[Figure: R holds a Shopping List; S holds a Technology List.]
Example 2: Govt. agencies sharing information on a need-to-know basis.
[Figure: Mayo Clinic; DNA Sequences and Drug Reactions are held separately.]
                 | Adverse Reaction | No Adv. Reaction
Sequence Present | ?                | ?
Sequence Absent  | ?                | ?
[Figure: R = {a, u, v, x}, S = {b, u, v, y}; R ∩ S = {u, v}.]
R must not know that S has b & y
S must not know that R has a & x
Count (R ∩ S)
R and S do not learn anything except that the result is 2.
Given:
– Two parties (honest-but-curious): R (receiver) and S (sender)
– Query Q spanning the tables R and S
– Additional (pre-specified) categories of information I
Compute the answer to Q and return it to R without revealing any additional information to either party, except for the information contained in I
– For intersection, intersection size & equijoin, I = { |R| , |S| }
– For equijoin size, I also includes the distribution of duplicates & some subset of information in R ∩ S
– Given two parties with inputs x and y, compute f(x,y) such that the parties learn only f(x,y) and nothing else.
– Can be solved by building a combinatorial circuit, and simulating that circuit [Yao86].
– Intersection of two relations of a million records each would require 144 days
Commutative encryption F is a computable function $f : \text{Key}\,F \times \text{Dom}\,F \to \text{Dom}\,F$, satisfying:
– For all $e, e' \in \text{Key}\,F$, $f_e \circ f_{e'} = f_{e'} \circ f_e$
(The result of encryption with two different keys is the same, irrespective of the order of encryption)
– Each $f_e$ is a bijection.
(Two different values will have different encrypted values)
– The distribution of $\langle x, f_e(x), y, f_e(y) \rangle$ is indistinguishable from the distribution of $\langle x, f_e(x), y, z \rangle$; $x, y, z \in_r \text{Dom}\,F$ and $e \in_r \text{Key}\,F$.
(Given a value x and its encryption $f_e(x)$, for a new value y, we cannot distinguish between $f_e(y)$ and a random value z. Thus we cannot encrypt y nor decrypt $f_e(y)$.)
Example: $f_e(x) = x^e \bmod p$
Commutativity: $(x^d \bmod p)^e \bmod p = x^{de} \bmod p = (x^e \bmod p)^d \bmod p$
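The power function can be checked exhaustively on a toy instance. This sketch assumes a small safe prime p = 1019 = 2·509 + 1, takes the domain to be the quadratic residues modulo p, and takes keys from 1 to 508; these parameters are illustrative assumptions and far too small for real use.

```python
P = 1019   # toy safe prime: P = 2*Q + 1 (assumption, insecure size)
Q = 509    # prime order of the quadratic-residue subgroup

def f(key, x):
    """Power-function encryption f_key(x) = x^key mod P."""
    return pow(x, key, P)

# Domain: the quadratic residues modulo P (squares of 1..P-1).
DOMAIN = sorted({pow(g, 2, P) for g in range(1, P)})

def commutes(d, e):
    """Check f_e(f_d(x)) == f_d(f_e(x)) for every x in the domain."""
    return all(f(e, f(d, x)) == f(d, f(e, x)) for x in DOMAIN)

def is_bijection(e):
    """Check that f_e maps the domain onto itself without collisions."""
    return {f(e, x) for x in DOMAIN} == set(DOMAIN)
```

Because the domain is a group of prime order Q, every key in 1..Q-1 is coprime to Q, which is why each `f_e` permutes the domain.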
Intersection protocol (R holds secret key r, S holds secret key s):
– S computes fs(S), shorthand for { fs(x) | x ∈ S }. We apply fs on h(S), where h is a hash function, not directly on S.
– S sends fs(S) to R; R computes fr(fs(S)), which equals fs(fr(S)) (commutative property).
– R sends fr(R) to S; S returns <y, fs(y)> for y ∈ fr(R). Since R knows <x, y=fr(x)>, R obtains <x, fs(fr(x))> for x ∈ R.
Intersection size: S returns fs(fr(R)), not <y, fs(y)> for y ∈ fr(R). R cannot map z ∈ fr(fs(R)) back to x ∈ R.
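The message flow of the intersection protocol can be sketched in a single process that plays both parties. The assumptions are loud ones: the modulus is the Mersenne prime 2^127 - 1 with keys chosen coprime to p - 1 so that each f_e is a bijection on 1..p-1, and h is SHA-256 reduced into that range. The talk does not give these parameters, and this choice is for illustration only, not a secure instantiation.

```python
import hashlib
import math
import secrets

P = 2 ** 127 - 1   # Mersenne prime, assumed modulus for this sketch

def h(value):
    """Hash a string into 1..P-1 (stand-in for the protocol's hash h)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def keygen():
    """Pick a secret exponent coprime to P-1 so that f_key is a bijection."""
    while True:
        key = secrets.randbelow(P - 3) + 2        # range 2..P-2
        if math.gcd(key, P - 1) == 1:
            return key

def f(key, x):
    return pow(x, key, P)

def intersection(r_values, s_values):
    """Simulate both parties; returns what R learns, i.e. R ∩ S."""
    r, s = keygen(), keygen()
    # R: encrypt own hashed values, remembering which x produced which y.
    y_of_x = {x: f(r, h(x)) for x in r_values}
    sent_to_s = sorted(y_of_x.values())           # sorted to hide the order
    # S: send fs(S), and <y, fs(y)> for every y received from R.
    fs_of_s = sorted(f(s, h(x)) for x in s_values)
    pairs = {y: f(s, y) for y in sent_to_s}
    # R: compute fr(fs(S)) and keep x whose fs(fr(x)) appears in it.
    fr_fs_s = {f(r, y) for y in fs_of_s}
    return {x for x, y in y_of_x.items() if pairs[y] in fr_fs_s}

result = intersection({"a", "u", "v", "x"}, {"b", "u", "v", "y"})
```

For intersection size, S would return only the encrypted values `pairs.values()` in shuffled order, so R could count matches without linking them to its own x.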
– Oblivious evaluation of n polynomials of degree n each.
– Oblivious evaluation of n^2 polynomials.
– Intersection protocols are similar to ours, but do not provide proofs of security.
– other database operations
– combination of operations
– the additional information disclosed
– approximation
M. Bawa, R. Bayardo, R. Agrawal. Privacy-preserving indexing of Documents on the Network. 29th Int'l Conf. on Very Large Databases (VLDB), Berlin, Sept. 2003.
R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. ACM Int’l Conf. On Management of Data (SIGMOD), San Diego, California, June 2003.
A. Evfimievski, J. Gehrke, R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS, San Diego, California, June 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An Xpath Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002. Expanded version in VLDB Journal 2003.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002.
R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. On Management of Data (SIGMOD), Dallas, Texas, May 2000.