Statistical Databases Query Auditing Li Xiong CS573 Data Privacy - - PowerPoint PPT Presentation

SLIDE 1

Statistical Databases – Query Auditing

Li Xiong

CS573 Data Privacy and Anonymity

Partial slides credit: Vitaly Shmatikov, Univ Texas at Austin

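SLIDE 2

Query Audit Problem

• Maintaining “privacy” of data

[Diagram: a user sends queries Q1, Q2, …, Qn to the database through an auditor and receives answers A1, A2, …, An. For a new query Q, the auditor asks: does the answer to Q, combined with the answers to Q1,…,Qn, reveal something? The user receives either the answer A or “Denied”.]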

SLIDE 3

Variations of the Problem

[Diagram: user, auditor, and database exchanging queries Q1, Q2, …, Qn and answers A1, A2, …, An, with the annotations below.]

• Database: list of real, integer, or Boolean values
• Query: specifies subset of the variables
• Answer: min, max, median, sum, average, or count of specified subset
• User: wants to learn value of some variable

SLIDE 4

Offline vs. Online

• Offline auditing
  • Given a collection of queries and answers to them, check whether anything “forbidden” was revealed
  • Detects privacy breaches after the fact
• Online auditing
  • Queries are presented to the auditor one at a time; the auditor checks if answering the current query (in combination with past answers) reveals “forbidden” information
  • Prevents privacy breaches on the fly

SLIDE 5

Disclosure measures

• Full disclosure (exact-value disclosure) – the exact value of a protected attribute is disclosed
• Partial disclosure (interval-based disclosure) – the disclosed range (difference between the upper and lower bounds) of the protected attribute is smaller than a predefined threshold
• Probability-based disclosure – the posterior distribution of the data after answering queries is significantly different from its prior distribution

SLIDE 6

Offline Auditors for Full Disclosure

• Sum queries
• Max and min queries
• Median and average queries

SLIDE 7

Audit Expert (Chin 1982)

• Query auditing method for SUM queries
• A SUM query can be considered as a linear equation

  a1 x1 + a2 x2 + … + an xn = q

  where ai (0 or 1) indicates whether record i belongs to the query set, xi is the sensitive value of record i, and q is the query result
• A set of SUM queries can be thought of as a system of linear equations
• Maintains the binary matrix representing linearly independent queries and updates it when a new query is issued
• A row with all 0s except for the ith column indicates disclosure of xi
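
The row-reduction idea behind Audit Expert can be sketched in a few lines. This is a minimal illustration, not the published algorithm: it assumes real-valued, unbounded data, rebuilds the reduced matrix from scratch instead of updating it incrementally, and the names `rref` and `disclosed_records` are hypothetical.

```python
from fractions import Fraction

def rref(rows):
    """Reduced row echelon form with exact rational arithmetic; zero rows are dropped."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    pivot_row = 0
    for col in range(n_cols):
        # Look for a row at or below pivot_row with a nonzero entry in this column.
        pr = next((r for r in range(pivot_row, len(m)) if m[r][col] != 0), None)
        if pr is None:
            continue
        m[pivot_row], m[pr] = m[pr], m[pivot_row]
        pivot = m[pivot_row][col]
        m[pivot_row] = [v / pivot for v in m[pivot_row]]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
    return [row for row in m if any(v != 0 for v in row)]

def disclosed_records(query_sets, n):
    """Indices i whose value x_i is determined by the answered SUM queries."""
    rows = [[1 if i in s else 0 for i in range(n)] for s in query_sets]
    disclosed = []
    for row in rref(rows):
        nonzero = [i for i, v in enumerate(row) if v != 0]
        if len(nonzero) == 1:  # all 0s except one column: that record is disclosed
            disclosed.append(nonzero[0])
    return sorted(disclosed)

# sum(x0, x1, x2) and sum(x0, x1) together reveal x2.
print(disclosed_records([{0, 1, 2}, {0, 1}], 3))  # [2]
```

Only the 0/1 coefficient matrix is inspected, so under these assumptions the check does not depend on the answers themselves.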

SLIDE 8

Offline Auditing for Full Disclosure

• Arbitrary combinations of aggregate queries
• It is unlikely there will be efficient on-line application algorithms for SUM + MAX queries, SUM + MIN queries, or SUM + MAX + MIN queries.
• Example hardness results:
  • Theorem. There is no polynomial time full-disclosure auditing algorithm for sum and max queries unless P=NP.

SLIDE 9

Auditing Sum Queries on Booleans

• Database: collection of secret Boolean variables
• Query: specifies subset S of variables
• Answer: sum of variables in S
• Privacy breach: after asking several queries, user learns the value of some secret variable(s)
• Auditing problem: given a set of Boolean equations, is there a variable that has the same value in all solutions?

Auditing Boolean attributes, Kleinberg, 2000

SLIDE 10

Why Is This Interesting?

• Linear Diophantine equations
• A query can be safe on real-valued, unbounded data, but reveal information when the data are discrete, with known bounds
• Hardness results: the auditing problem for Boolean values is coNP-complete.

Example:

  x + y + w = 1
  y + z = 1
  x + z = 1

Real: multiple solutions, secure
Boolean: unique solution, insecure (why?)
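
The Boolean case of the three-equation example can be checked by brute force. The sketch below is only an illustration: it enumerates all 0/1 assignments, which takes exponential time and is in line with the hardness result, and the helper name `boolean_audit` and the variable ordering x, y, z, w are assumptions made here.

```python
from itertools import product

def boolean_audit(n, equations):
    """equations: list of (indices, total).
    Returns {index: value} for every variable that takes the same value
    in all 0/1 solutions (empty if there is no solution)."""
    solutions = [
        bits for bits in product((0, 1), repeat=n)
        if all(sum(bits[i] for i in idx) == total for idx, total in equations)
    ]
    if not solutions:
        return {}
    first = solutions[0]
    return {
        i: first[i]
        for i in range(n)
        if all(s[i] == first[i] for s in solutions)
    }

# Variables ordered as x=0, y=1, z=2, w=3.
equations = [
    ((0, 1, 3), 1),  # x + y + w = 1
    ((1, 2), 1),     # y + z = 1
    ((0, 2), 1),     # x + z = 1
]
print(boolean_audit(4, equations))  # {0: 0, 1: 0, 2: 1, 3: 1}
```

The last two equations force x = y = 1 − z. If z were 0, then x + y would already be 2, contradicting the first equation, so z = 1, x = y = 0, and w = 1. Over the reals, any z gives a solution with w = 2z − 1, so nothing is pinned down.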

SLIDE 11

Offline Auditor for Partial Disclosure

• Partial disclosure – the disclosed range of the protected attribute is smaller than a predefined threshold
• Sum queries
• Interval-based disclosure: monitoring upper and lower bounds for all confidential attributes

Auditing interval-based inference. Li et al. 2002
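
To make interval-based disclosure concrete, here is a minimal sketch for a single SUM query. It assumes each attribute has publicly known a priori bounds, and the function names and the threshold value are hypothetical. With several queries the bounds generally have to be obtained by linear programming, which this sketch does not attempt.

```python
def bounds_after_sum(lower, upper, query_set, total):
    """Tighten per-attribute bounds using one answered SUM query.
    lower/upper: a priori bounds; query_set: indices summed; total: the answer."""
    new_lower, new_upper = list(lower), list(upper)
    for j in query_set:
        others_lo = sum(lower[k] for k in query_set if k != j)
        others_hi = sum(upper[k] for k in query_set if k != j)
        # x_j = total - (sum of the others), and the others lie within their bounds.
        new_lower[j] = max(lower[j], total - others_hi)
        new_upper[j] = min(upper[j], total - others_lo)
    return new_lower, new_upper

def partially_disclosed(lower, upper, threshold):
    """Indices whose interval width fell below the threshold."""
    return [j for j, (lo, hi) in enumerate(zip(lower, upper)) if hi - lo < threshold]

# Three attributes known to lie in [0, 10]; the user learns x0 + x1 = 19.
lo, hi = bounds_after_sum([0, 0, 0], [10, 10, 10], [0, 1], 19)
print(lo, hi)                           # [9, 9, 0] [10, 10, 10]
print(partially_disclosed(lo, hi, 2))   # [0, 1]
```

No value is disclosed exactly, yet x0 and x1 are each confined to an interval of width 1, which counts as a breach under a threshold of 2.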

SLIDE 12

Offline Auditor for Partial Disclosure

• New query
• A series of linear programming problems

SLIDE 13

Incremental evaluation of the LP

• Treats the auditing problem as a series of updation problems and updates the bounds with certain rules
• Horizontal updation – given the same set of queries and the bounds of one variable, how can the prior result be modified to get the bounds of other variables
• Vertical updation – given the same set of variables and their bounds under the previous queries, how can the prior result be modified to get the bounds when a new query arrives

An efficient online auditing approach to limit private data disclosure, Lu, 2009

SLIDE 14

Online Auditing

[Diagram: given previous queries Q1 … Qi, the user sends query Qi+1 to the database through the auditor and receives either the answer A or “Denied”. The auditor replies “Denied” if answering Qi+1 would cause a privacy breach.]

SLIDE 15

Online Auditing

• Given a sequence of queries and corresponding answers, and a new query, determine if the new query should be answered or denied in order to prevent a privacy breach.
• Earliest approaches
  • Query set size control
  • Query set overlap control
  • Limited data utility
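
The two earliest controls can be sketched as a small online auditor. This is an illustrative sketch only: the class name `SizeOverlapAuditor` and the parameters `min_size` and `max_overlap` are assumptions, not taken from the slides.

```python
class SizeOverlapAuditor:
    """Answers a query only if its query set is large enough and does not
    overlap too much with any previously answered query set."""

    def __init__(self, min_size, max_overlap):
        self.min_size = min_size        # query set size control
        self.max_overlap = max_overlap  # query set overlap control
        self.answered = []              # audit trail of answered query sets

    def allow(self, query_set):
        q = frozenset(query_set)
        if len(q) < self.min_size:
            return False
        if any(len(q & past) > self.max_overlap for past in self.answered):
            return False
        self.answered.append(q)
        return True

auditor = SizeOverlapAuditor(min_size=3, max_overlap=1)
print(auditor.allow({0, 1, 2}))  # True
print(auditor.allow({0, 1}))     # False: query set too small
print(auditor.allow({1, 2, 3}))  # False: overlaps {0, 1, 2} in two records
print(auditor.allow({2, 3, 4}))  # True
```

The decision looks only at the query sets, never at the data or the answers. Even in this tiny run half of the queries are refused, which hints at the limited utility noted above.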

SLIDE 16

Offline to Online?

• Can an offline auditor directly solve the online auditing problem?
• Denials leak information!

SLIDE 17

Sounds Familiar?

“On the advice of my counsel I respectfully and regretfully decline to answer the question based on my constitutional rights.”
Colonel Oliver North, on the Iran-Contra arms deal

“Mr. Chairman, I would like to answer the committee's questions, but on the advice of my counsel I respectfully decline to answer the question based on the protection afforded me under the Constitution of the United States.”
David Duncan, former auditor for Enron and partner in Arthur Andersen

[slide stolen from Kobbi Nissim]

SLIDE 18

Example: Sum/Max

• Variables di are real, privacy breached if adversary learns some di

User: Gimme sum(d1,d2,d3)
Auditor: Answer=15
User: Gimme max(d1,d2,d3)
Auditor: “Denied”
User: Oh well. Wait… there must be a reason why the second query was denied. The only possible reason for denial is if d1=d2=d3=5
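
The leak in this exchange can be reproduced with a small sketch. It assumes a naive auditor that looks at the true data and denies the max query only when answering it would fully disclose a value. The function names are hypothetical, and integer inputs are used to keep the arithmetic exact.

```python
def naive_max_decision(data):
    """Naive auditor, called after sum(data) has been answered.
    With real-valued data, sum and max together fix the values only when
    every value equals the max, so that is the only case it denies."""
    if max(data) * len(data) == sum(data):
        return "Denied"
    return max(data)

def attacker_inference(n, total, reply):
    """What the attacker concludes from the reply to the max query."""
    if reply == "Denied":
        # Denial happens only when all n values are equal.
        return [total / n] * n
    return None  # an answered max does not pin down any single value here

reply = naive_max_decision([5, 5, 5])
print(reply)                             # Denied
print(attacker_inference(3, 15, reply))  # [5.0, 5.0, 5.0]
print(naive_max_decision([3, 5, 7]))     # 7
```

The denial itself carries the secret, because the auditor's decision depended on data the attacker could not see.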

SLIDE 19

Online Audit

• Denials reduce the search space

[Diagram: regions showing the possible assignments to {d1,…,dn}, the assignments consistent with (q1,…,qi; a1,…,ai), and the assignments for which qi+1 is denied.]

SLIDE 20

One workaround

• Deny whenever the offline algorithm does; in addition, randomly deny queries.
• Issues
  • Leakage is not prevented
  • Have to remember which queries were randomly denied
  • Semantically determine whether two queries are equivalent

SLIDE 21

Simulatable Auditing

• Observation: denials have the potential to leak information if the auditor uses information that is unavailable to the attacker (the answer to the new query)
• Simulatable auditing: the attacker should be able to simulate or mimic the auditor's decisions to answer or deny a query
• Denials provably do not leak information

Simulatable Auditing, Kenthapadi, 2005

SLIDE 22

Simulatable Auditing

• An auditor is a function of Q, A and X
• An auditor is simulatable if there exists a simulator that is a function of only Q and A − ai+1 (the answers without the answer to the new query) and whose output is always equal to the auditor's.

[Diagram: the auditor receives qi+1, has access to the database and to q1,…,qi, and outputs “deny” or “answer qi+1”. The simulator sees only q1,…,qi and a1,…,ai and outputs the same deny-or-answer decision.]

SLIDE 23

Simulatable Auditing

[Diagram: possible assignments to {d1,…,dn}; assignments consistent with (q1,…,qi, a1,…,ai); qi+1 denied/allowed.]

SLIDE 24

Simulatable Auditing

• Query-set-size control
• Query-set-overlap control
• Audit expert for sum queries

SLIDE 25

Constructing Simulatable Auditor

• General sufficient condition: the auditor should determine if there is any possible dataset, consistent with all past responses, in which the answer to the current query would cause some element to be fully disclosed. If there is, the query is denied.

SLIDE 26

Example revisited: Sum/Max

User: Gimme sum(d1,d2,d3)
Auditor: Answer=a1
User: Gimme max(d1,d2,d3)
Auditor: “Denied”

• Simulatable auditor would always deny the max query following a sum query
• Lose some utility due to the requirement of simulatability
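
The contrast between a data-dependent auditor and a simulatable one can be sketched for this one example. The code is a minimal illustration restricted to a max query over the same set as a previously answered sum query, with real-valued data; the function names are hypothetical.

```python
from fractions import Fraction

def naive_decision(data):
    """Looks at the true data: denies max only when it would disclose a value."""
    return "Denied" if max(data) * len(data) == sum(data) else "Answer"

def simulatable_decision(n, sum_answer):
    """Uses only n and the past answer a1, both of which the attacker knows.
    It builds one dataset consistent with the past (all values equal to a1/n)
    and denies if answering max on that dataset would disclose a value."""
    witness = [Fraction(sum_answer) / n] * n
    return "Denied" if naive_decision(witness) == "Denied" else "Answer"

for data in ([5, 5, 5], [3, 5, 7]):
    print(data, naive_decision(data), simulatable_decision(len(data), sum(data)))
# [5, 5, 5] Denied Denied
# [3, 5, 7] Answer Denied
```

Both datasets have sum 15, so the simulatable decision is identical for them and the denial tells the attacker nothing new. The price is visible in the second row: a max query that would have been harmless is still refused.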

SLIDE 27

Challenges

• Privacy definition
  • Privacy of groups/families
• Algorithmic limitations
  • Simulatable algorithms computationally prohibitive
  • Most work on sum queries, some on max, min, median, hardness results on mixed queries
• Collusion
  • Reduced utility for legitimate users
  • Large audit trail
• Utility
  • Percentage of denials may not be the best measure