Differentially-Private Batch Query Answering: Exploiting the Workload vs. Exploiting the Data. Gerome Miklau, University of Massachusetts, Amherst. DIMACS Workshop on Recent Work on Differential Privacy across Computer Science, October 2012.


slide-1
SLIDE 1

Differentially-Private Batch Query Answering

Exploiting the Workload vs. Exploiting the Data

Gerome Miklau

University of Massachusetts, Amherst

DIMACS Workshop on Recent Work on Differential Privacy across Computer Science • October 2012

slide-5
SLIDE 5

Batch (non-interactive) query answering

  • Goal: release answers to all queries under ε- or (ε, δ)-differential privacy.
  • Given: a fixed set of queries, the "workload". A workload may arise from:
  • a complex data analysis task broken into simpler queries;
  • multiple users, each issuing one or more queries;
  • uncertainty about which query answers will eventually be needed: design the workload to include all queries possibly of interest.
  • Focus: linear counting queries, which include predicate counting queries, spatial queries, multi-dimensional range queries, marginals, data cubes, etc.

slide-13
SLIDE 13

Approach 1: workload-aware mechanisms

[Diagram] The analyst holds a workload W = {w1, w2, w3}; the server holds the database. The server (1) selects observation queries A = {a1, a2, a3}, (2) applies a standard Laplace or Gaussian mechanism to obtain noisy answers ai(D) + noise (the observations A), and (3) derives from them noisy estimates of the workload answers w1(D), w2(D), w3(D).

slide-14
SLIDE 14

Workload-aware mechanisms

| Workload | Observations | Citation | Type |
|---|---|---|---|
| low-order marginals | Fourier basis queries | [Barak, PODS '07] | fixed |
| all one-dim range queries | hierarchical ranges | [Hay, PVLDB '10] | fixed |
| all (multi-dim) range queries | Haar wavelet queries | [Xiao, ICDE '10] | fixed |
| 2-dim range queries | quad-tree queries | [Cormode, ICDE '12] | fixed |
| sets of data cubes | sets of data cubes | [Ding, SIGMOD '11] | optimized |
| set of linear queries | set of linear queries | [Li, PODS '10] [Li, PVLDB '12] | optimized |
| set of linear queries | set of linear queries | [Yuan, VLDB '12] | optimized |

  • Observations selected to match (only) the workload.

slide-23
SLIDE 23

Approach 2: data-aware mechanisms

[Diagram] The analyst holds workload W = {w1, w2, w3}; the server holds the database. The server first issues a test query T against the data and receives a noisy result T'. Using this noisy test of the dataset, it selects observation queries A = {a1, a2, a3}, applies a standard Laplace or Gaussian mechanism to obtain ai(D) + noise, and derives noisy estimates of the workload answers w1(D), w2(D), w3(D).

slide-24
SLIDE 24

Data-aware mechanisms

| Workload | Observations | Citation |
|---|---|---|
| 1D range queries | approx. v-optimal histogram | [Xu, ICDE '12] |
| 2D range queries | kd-tree queries | [Xiao, SDM '10] |
| 2D range queries | hybrid kd-tree queries | [Cormode, ICDE '12] |
| marginals | scaled workload queries | [Xiao, SIGMOD '11] |
| linear queries | subset of workload | [Hardt, NIPS '12] |

  • Observations selected to match properties of the database.
slide-25
SLIDE 25

Outline

  • 1. Preliminaries
  • 2. Approach 1: workload-aware
  • Fixed Observations
  • Optimized Observations
  • 3. Approach 2: data-aware
  • 4. Conclusions
slide-26
SLIDE 26

Frequency representation of the database

Relational database:

| name | gender | grade |
|---|---|---|
| Alice | Female | 91 |
| Bob | Male | 84 |
| Carl | Male | 82 |
| Dave | Male | 97 |
| Edwina | Female | 88 |
| Faith | Female | 78 |
| Ghita | Female | 85 |
| ... | ... | ... |

Frequency vector x over {gender, grade}: one count per (gender, grade) combination; these counts are the entries x1, x2, ..., xn.

| gender | grade | count |
|---|---|---|
| Male | 100 | 10 |
| Male | 99 | 13 |
| Male | 98 | 5 |
| Male | 97 | 7 |
| ... | ... | ... |
| Female | 100 | 15 |
| Female | 99 | 21 |
| Female | 98 | 4 |
| Female | 97 | 14 |
| Female | 96 | 9 |

slide-28
SLIDE 28

Frequency representation of the database

The same relational database can also be represented by a coarser frequency vector x, here over {grade} alone:

| grade | count | entry |
|---|---|---|
| 90-100 | 10 | x1 |
| 80-90 | 23 | x2 |
| 70-80 | 16 | x3 |
| 60-70 | 3 | x4 |
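As a concrete sketch of building such a frequency vector (with a hypothetical grade list standing in for the full table, since only seven rows are shown above), the counts can be computed by bucketing:

```python
# Build the {grade} frequency vector by bucketing grades into the slide's ranges:
# 90-100, 80-90, 70-80, 60-70.
def frequency_vector(grades, edges=(90, 80, 70, 60)):
    x = [0] * len(edges)
    for g in grades:
        for i, lo in enumerate(edges):
            if g >= lo:
                x[i] += 1
                break
    return x

# Hypothetical data: just the seven students visible in the table.
grades = [91, 84, 82, 97, 88, 78, 85]
print(frequency_vector(grades))  # [2, 4, 1, 0]
```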

slide-31
SLIDE 31

Linear counting queries

A linear counting query w computes a linear combination of the frequency vector counts:

w(D) = w1 x1 + w2 x2 + ... + wn xn,  each wi ∈ R

Written as a length-n row vector w = [w1, w2, w3, ..., wn], the query result is wx.

A set of linear counting queries is a matrix W, and the query result is Wx.

[Slide shows an example 0/1 workload matrix W.]
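A minimal sketch in Python of evaluating w(D) = wx and Wx, reusing the frequency vector x = (10, 23, 16, 3) that appears in the range-query example later in the deck:

```python
import numpy as np

# A single linear counting query as a length-n row vector: w(D) = w . x
x = np.array([10, 23, 16, 3])        # frequency vector (from the {grade} example)
w = np.array([0, 1, 1, 0])           # range(x2, x3) = x2 + x3
print(w @ x)                         # 39

# A set of queries stacked into a matrix W; the answer vector is W x.
W = np.array([[1, 1, 1, 1],          # total count
              [1, 1, 0, 0],          # x1 + x2
              [0, 0, 1, 1]])         # x3 + x4
print(W @ x)                         # [52 33 19]
```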

slide-37
SLIDE 37

Queries and workloads

  • 1-dimensional range queries: intervals
  • Marginals / data cube queries / contingency tables: aggregate over excluded dimensions
  • k-dimensional range queries: axis-aligned rectangles
  • Predicate counting queries: only 0 or 1 coefficients
  • Linear counting queries: arbitrary coefficients

[Diagram: nested query classes, from 1-dim ranges and marginals, through k-dim ranges and predicate counting queries, up to linear counting queries.]

slide-38
SLIDE 38

Privacy definitions & mechanisms

  • Differential privacy: a randomized algorithm A provides (ε, δ)-differential privacy if, for all neighboring databases D and D', and for any set of outputs S:

Pr[A(D) ∈ S] ≤ e^ε Pr[A(D') ∈ S] + δ

  • If δ = 0, standard ε-differential privacy: add Laplace(0, b) noise with b = ||q||1/ε.
  • If δ > 0, approximate (ε, δ)-differential privacy: add Gaussian(0, σ) noise with σ = ||q||2 (2 ln(2/δ))^{1/2}/ε.
  • The multi-query Laplace/Gaussian mechanism adds independent noise to each query answer.
  • Exponential mechanism
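A sketch of the multi-query Laplace mechanism as described above (the sensitivity is passed in by the caller; numpy's Laplace sampler supplies the noise):

```python
import numpy as np

def laplace_mechanism(true_answers, sensitivity, epsilon, rng=np.random.default_rng(0)):
    """Add independent Laplace(0, b) noise to each answer, with b = sensitivity / epsilon."""
    b = sensitivity / epsilon
    return true_answers + rng.laplace(loc=0.0, scale=b, size=np.shape(true_answers))

# Example: release two counts with sensitivity 1 under epsilon = 0.5.
print(laplace_mechanism(np.array([52.0, 33.0]), sensitivity=1.0, epsilon=0.5))
```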
slide-43
SLIDE 43

The sensitivity of a query matrix

  • For two neighboring databases D and D', their frequency vectors x and x' will differ in one position, by exactly 1.

[Diagram: Wx = y, with W the query matrix, x the frequency vector, and y = (y1, ..., y4) the answers. Changing one individual changes one entry of x by 1, so each answer yi changes by at most the corresponding entry in that column of W.]

The L1 sensitivity of a query matrix is the maximum L1 norm of its columns; for the example matrix, ||W||1 = 4.

The L2 sensitivity of a query matrix is the maximum L2 norm of its columns.
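Both sensitivities are simple column-norm computations; a sketch, illustrated here on the all-range workload for n = 4 used later in the deck (which has ||W||1 = 6):

```python
import numpy as np

def l1_sensitivity(W):
    """L1 sensitivity of a query matrix: maximum L1 norm over columns."""
    return np.abs(W).sum(axis=0).max()

def l2_sensitivity(W):
    """L2 sensitivity of a query matrix: maximum L2 norm over columns."""
    return np.sqrt((np.asarray(W) ** 2).sum(axis=0)).max()

# All-range workload for n = 4: ten queries, ||W||_1 = 6.
W = np.array([[1,1,1,1],[1,1,1,0],[0,1,1,1],[1,1,0,0],[0,1,1,0],
              [0,0,1,1],[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]])
print(l1_sensitivity(W))  # 6
```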

slide-44
SLIDE 44

Outline

  • 1. Preliminaries
  • 2. Approach 1: workload-aware
  • Fixed Observations
  • Optimized Observations
  • 3. Approach 2: data-aware
  • 4. Conclusions
slide-48
SLIDE 48

Answering all range queries

Goal: answer all range-count queries over x:

AllRange = { w | w = xi + ... + xj for 1 ≤ i ≤ j ≤ n }

For n = 4 the workload W contains ten queries:

w1 = range(x1, x4) = x1 + x2 + x3 + x4
w2 = range(x1, x3) = x1 + x2 + x3
w3 = range(x2, x4) = x2 + x3 + x4
w4 = range(x1, x2) = x1 + x2
w5 = range(x2, x3) = x2 + x3
w6 = range(x3, x4) = x3 + x4
w7, ..., w10 = range(xi, xi) = xi for i = 1, ..., 4

With x = (10, 23, 16, 3), the true answers (w1, ..., w10) are (52, 49, 42, 33, 39, 19, 10, 23, 16, 3).

slide-52
SLIDE 52

Method 1: basic Laplace mechanism

Submit the ten workload queries w1, ..., w10 directly. Since ||W||1 = 6 (x2 and x3 each appear in six queries), each private output w'i receives independent Laplace noise bi scaled by (6/ε).

For n = 4: sensitivity ||W||1 = 6; error per query 2(||W||1/ε)^2 = 72/ε^2.
For general n: sensitivity ||W||1 = O(n^2); error per query 2(||W||1/ε)^2 = O(n^4)/ε^2.

[Slide example: true answers (52, 49, 42, 33, 39, 19, 10, 23, 16, 3); Laplace noise (8.2, -5.4, -3.1, 6.6, -7.9, 2.4, -3.0, -4.9, 6.7, 4.6); private output (60.2, 44.6, 38.9, 39.6, 31.1, 21.4, 7.0, 18.1, 22.7, 7.6); Σ = 55.4.]

slide-56
SLIDE 56

Method 2: noisy frequency counts

Observation: submit the identity queries I (one query per count xi), with ||I||1 = 1. The Laplace mechanism returns noisy estimates zi = xi + bi with noise scaled by (1/ε).

Derived workload answers: each range query is answered by summing the relevant noisy counts, e.g. w'1 = z1 + z2 + z3 + z4, w'5 = z2 + z3, w'10 = z4.

For w = range(xi, xj): Error(w) = 2(j - i + 1)/ε^2, from 2/ε^2 for a single count up to 8/ε^2 for the full range (n = 4).
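A sketch of the error comparison between Method 1 and Method 2 for the n = 4 all-range workload:

```python
# Per-query variance of the two methods for the n = 4 all-range workload.
eps = 1.0

# Method 1: noise each workload query directly with scale ||W||_1/eps = 6/eps;
# every answer has the same variance 2*(6/eps)^2.
var_direct = 2 * (6 / eps) ** 2

# Method 2: noise the four counts (||I||_1 = 1) and sum; range(x_i, x_j)
# accumulates the variance of (j - i + 1) independent Laplace(1/eps) samples.
def var_noisy_counts(i, j, eps):
    return 2 * (j - i + 1) / eps ** 2

print(var_direct)                    # 72.0
print(var_noisy_counts(1, 4, eps))   # 8.0  (the full range)
print(var_noisy_counts(2, 2, eps))   # 2.0  (a single count)
```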

slide-63
SLIDE 63

Method 3: hierarchical observations

Hierarchical queries: recursively partition the domain, computing the sum over each interval [Hay, PVLDB '10]. For n = 4 the observation matrix H contains seven queries:

z1: x1 + x2 + x3 + x4
z2: x1 + x2    z3: x3 + x4
z4: x1   z5: x2   z6: x3   z7: x4

||H||1 = log n + 1 = 3, so the Laplace mechanism answers H with noise scaled by (3/ε), and workload answers are derived from z1, ..., z7.

Possible estimates for the query range(x2, x3) = x2 + x3:
  z5 + z6
  z2 - z4 + z6
  z1 - z4 - z7
Least-squares estimate: (6z1 + 3z2 + 3z3 - 9z4 + 12z5 + 12z6 - 9z7)/21
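The least-squares coefficients quoted above can be checked numerically; a sketch using the ordinary least squares pseudo-inverse of H:

```python
import numpy as np

# Hierarchy H for n = 4: root sum, two interval sums, four leaf counts.
H = np.array([[1, 1, 1, 1],
              [1, 1, 0, 0], [0, 0, 1, 1],
              [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)

# OLS pseudo-inverse H+ = (H^T H)^{-1} H^T maps noisy answers z to an estimate of x.
H_pinv = np.linalg.inv(H.T @ H) @ H.T

# Coefficients of the least-squares estimate of range(x2, x3) in terms of z1..z7.
w = np.array([0, 1, 1, 0], dtype=float)
coeffs = w @ H_pinv
print(np.round(coeffs * 21))  # [ 6.  3.  3. -9. 12. 12. -9.]
```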

slide-64
SLIDE 64

Error rates: workload of all range queries (ε-differential privacy, ε = 0.1, n = 1024)

[Plot: mean squared error vs. query width (as a fraction of the domain), comparing Noisy counts and Hierarchical (branching factor 2), from small ranges to big ranges.]

slide-65
SLIDE 65

Method 4: wavelet queries

Wavelet: use the Haar wavelet queries as observations [Xiao, ICDE '10]. For n = 4 the matrix Y is:

z1: x1 + x2 + x3 + x4
z2: x1 + x2 - x3 - x4
z3: x1 - x2
z4: x3 - x4

||Y||1 = log n + 1 = 3, so the Laplace mechanism answers Y with noise scaled by (3/ε), and workload answers are derived from z1, ..., z4.

Estimate for the query range(x2, x3) = x2 + x3:  0.5 z1 + 0 z2 - 0.5 z3 + 0.5 z4
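The wavelet estimate can be checked the same way; a sketch (Y is square for n = 4, so its pseudo-inverse is simply its inverse):

```python
import numpy as np

# Haar wavelet observation matrix Y for n = 4.
Y = np.array([[1, 1, 1, 1],
              [1, 1, -1, -1],
              [1, -1, 0, 0],
              [0, 0, 1, -1]], dtype=float)

# Coefficients expressing range(x2, x3) = x2 + x3 in terms of z1..z4.
w = np.array([0, 1, 1, 0], dtype=float)
print(np.round(w @ np.linalg.inv(Y), 6))  # [ 0.5  0.  -0.5  0.5]
```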

slide-67
SLIDE 67

Error: workload of all range queries (ε-differential privacy, ε = 0.1, n = 1024)

[Plot: mean squared error vs. query width (as a fraction of the domain), comparing Identity, Hierarchical (branching 2), Wavelet, and Hierarchical (branching 4).]

slide-68
SLIDE 68

Observations for the workload of all range queries

| Observations | Max/avg error (1-dim) | Notes |
|---|---|---|
| I (noisy counts) | O(n/ε^2) | very low sensitivity, but large ranges estimated badly |
| H (hierarchical) | O(log^3 n/ε^2) | low sensitivity, and every range query can be estimated using no more than ~log n output entries |
| Y (wavelet) | O(log^3 n/ε^2) | same guarantee via the Haar basis |

For k-dimensional ranges the corresponding bound is O(log^{3k} n/ε^2).

slide-69
SLIDE 69

Observations for alternative workloads

  • Workload: sets of 2D range queries. Observations: quad-tree queries, with the privacy budget ε allocated geometrically across levels (making some levels more accurate, others less) [Cormode, ICDE '12].
  • Workload: sets of low-order marginals. Observations: Fourier basis queries [Barak, PODS '07], defined by the recursion H_i = [ H_{i-1}  H_{i-1} ; H_{i-1}  -H_{i-1} ].

slide-72
SLIDE 72

Questions raised

  • Are these observations optimal for the targeted workloads?
  • Which observations should we use for other custom workloads?

The mechanisms below are non-adaptive; the next step is to adapt the observations to the workload.

| Workload | Observations | Citation |
|---|---|---|
| low-order marginals | Fourier basis queries | [Barak, PODS '07] |
| all one-dim range queries | hierarchical ranges | [Hay, PVLDB '10] |
| all (multi-dim) range queries | Haar wavelet queries | [Xiao, ICDE '10] |
| 2-dim range queries | quad-tree queries | [Cormode, ICDE '12] |

slide-73
SLIDE 73

Outline

  • 1. Preliminaries
  • 2. Approach 1: workload-aware
  • Fixed Observations
  • Optimized Observations
  • 3. Approach 2: data-aware
  • 4. Conclusions
slide-76
SLIDE 76

Laplace mechanism (matrix notation)

Laplace(W, x) = Wx + (||W||1/ε) b

where W is the m×n workload matrix, x the n×1 database (frequency) vector, ||W||1 the scalar sensitivity, and b an m×1 noise vector of independent samples from Laplace(1).

Error(w) = 2(||W||1/ε)^2 for each query w.

slide-85
SLIDE 85

The matrix mechanism: justification

➊ (Select observations) Choose a (full rank) query matrix A.
➋ (Apply Laplace) Use the Laplace mechanism to answer A: z = Ax + (||A||1/ε) b.
➌ (Derive answers) Compute an estimate x̂ of x using the answers z:
  • choose the x̂ that minimizes the squared error ||A x̂ - z||2^2;
  • the solution is the ordinary least squares estimator x̂ = A+ z, where A+ = (A^T A)^{-1} A^T.

Thm: x̂ is unbiased and has the least variance among all linear unbiased estimators.

Finally, compute the workload queries using the estimate: W x̂.
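Putting the three steps together; a sketch of the full pipeline, where the strategy matrix A, workload W, and frequency vector x are the running n = 4 examples:

```python
import numpy as np

def matrix_mechanism(W, A, x, eps, rng=np.random.default_rng(0)):
    """Sketch: answer strategy A via Laplace, reconstruct x by OLS, answer workload W."""
    sens = np.abs(A).sum(axis=0).max()             # ||A||_1
    z = A @ x + (sens / eps) * rng.laplace(size=A.shape[0])
    A_pinv = np.linalg.inv(A.T @ A) @ A.T          # A+ (A assumed full rank)
    x_hat = A_pinv @ z                             # unbiased estimate of x
    return W @ x_hat

# Running example: all-range workload (n = 4), hierarchical strategy, x = (10, 23, 16, 3).
W = np.array([[1,1,1,1],[1,1,1,0],[0,1,1,1],[1,1,0,0],[0,1,1,0],
              [0,0,1,1],[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]], float)
A = np.array([[1,1,1,1],[1,1,0,0],[0,0,1,1],
              [1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]], float)
x = np.array([10, 23, 16, 3], float)
print(matrix_mechanism(W, A, x, eps=1.0))   # ten noisy workload answers
```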
slide-91
SLIDE 91

The matrix mechanism

Given a workload W and any full-rank strategy matrix A, the following randomized algorithm is ε-differentially private:

Matrix_A(W, x) = Wx + (||A||1/ε) W A+ b,  b = Lap(1)

Here the mechanism is instantiated with observations A: it adds to the true answer Wx noise scaled by ||A||1 and transformed by W A+. Compare with the Laplace mechanism:

Laplace(W, x) = Wx + (||W||1/ε) b

slide-92
SLIDE 92

Instances of the matrix mechanism

Given a workload W of linear queries:

| Observation matrix A | Resulting mechanism |
|---|---|
| A = W | never worse than Laplace, sometimes better |
| A = identity matrix | a common baseline |
| A = Haar wavelet | [Xiao, ICDE '10] |
| A = tree-based | [Hay, PVLDB '10] [Cormode, ICDE '12] |
| A = Fourier basis | [Barak, PODS '07] |

slide-96
SLIDE 96

Observation matrices equivalent to wavelet

[Figure: the Haar wavelet observation matrix Y (||Y||1 = 3), an alternative matrix Y’ (||Y’||1 = 3, equivalent error for all queries), and a rescaled matrix Y’’ (||Y’’||1 = 2.414, lower error for all queries).]

The Haar wavelet observation matrix Y is dominated by the alternative matrix Y’’.
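For concreteness, the Haar wavelet strategy can be built recursively. A sketch (the helper is my own, assuming n is a power of 2) that reproduces ||Y||1 = 3 for n = 4:

```python
import numpy as np

def haar_matrix(n):
    """Unnormalized Haar wavelet observation matrix (n a power of 2)."""
    if n == 1:
        return np.array([[1.0]])
    H = haar_matrix(n // 2)
    top = np.kron(H, [1.0, 1.0])                    # coarser rows, widened
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])   # finest-level differences
    return np.vstack([top, bottom])

Y = haar_matrix(4)
col_l1 = np.abs(Y).sum(axis=0).max()   # ||Y||_1
```

For n = 4 every column participates in three nonzero rows, so col_l1 is 3, matching the slide; in general ||Y||1 = 1 + log2(n).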

slide-97
SLIDE 97

Error of the matrix mechanism

Given an observation matrix A and workload W, the error under the mechanism MatrixA is:

For a single query w in W: ErrorA(w) = (2/ε2)(||A||1)2 w(ATA)-1wT

Total error for workload W: TotalErrorA(W) = (2/ε2)(||A||1)2 trace( W(ATA)-1WT )

Error is independent of the input data.
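These error formulas are easy to evaluate numerically. A small numpy sketch (the function name is mine); note that for A = W = I it gives 2n/ε2, the familiar total error of n independent Laplace answers:

```python
import numpy as np

def total_error(W, A, eps):
    """TotalError_A(W) = (2/eps^2) * ||A||_1^2 * trace(W (A^T A)^{-1} W^T)."""
    sens = np.abs(A).sum(axis=0).max()      # ||A||_1
    return (2.0 / eps**2) * sens**2 * np.trace(W @ np.linalg.inv(A.T @ A) @ W.T)
```

Because the expression never touches the data vector x, the error can be published to the analyst before any query is answered.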

slide-98
SLIDE 98

Optimal selection of observations

Objective: given workload W, find the observation matrix A that minimizes the total error.

slide-104
SLIDE 104

Optimal selection of observations

Objective: given workload W, find the observation matrix A that minimizes the total error.

Privacy / optimization objective / problem type / runtime:
  • ε-DP: given W consisting of data cube queries, choose A consisting of data cube queries to minimize a simplified error measure (set-cover approximation, O(n)). [Ding, SIGMOD ’11]
  • ε-DP: given W, choose A to minimize TotalErrorA(W) (SDP with rank constraints, O(n8)). [Li, PODS ‘10]
  • (ε,δ)-DP: given W, choose A to minimize TotalErrorA(W) (SDP, O(n8)). [Li, PODS ‘10]
  • ε-DP: given W, choose AB ≈ W to minimize TotalErrorA(AB) (bi-convex optimization, O(n4)). [Yuan, VLDB ’12]
  • (ε,δ)-DP: given W, choose an optimal scaling of the eigenvectors of W to minimize TotalErrorA(W) (convex optimization, O(n4)). [Li, PVLDB ‘12]

slide-105
SLIDE 105

Approximately optimal selection of observations

Matrix mechanism under (ε,δ)-differential privacy:

  • Given W, choose a set of basis queries v1, v2, ..., vn for the observations (the eigenvectors of W).
  • Compute optimal scalars c1, c2, ..., cn to minimize error.
  • The resulting observation matrix is A with rows c1v1, c2v2, ..., cnvn.
  • Efficiently solvable, and achieves optimal error rates in practice.
  • Algorithm running time: O(n · rank(W)3)
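A hedged sketch of this construction (I interpret the basis as the eigenvectors of WTW so that it is defined for non-square W; the optimal scalars come from the convex program, which is not shown, so uniform scalars are used purely for illustration):

```python
import numpy as np

def eigen_strategy(W, c):
    """Strategy whose rows are c_i * v_i for eigenvectors v_i of W^T W.
    The scalars c would come from the convex optimization (not shown)."""
    _, V = np.linalg.eigh(W.T @ W)   # columns: orthonormal eigenvectors v_i
    return np.diag(c) @ V.T          # row i is c_i * v_i

W = np.triu(np.ones((4, 4)))         # toy workload of 1D range queries
A = eigen_strategy(W, np.ones(4))    # uniform scalars, illustration only
```

As long as every scalar is nonzero, the orthonormal eigenvectors guarantee A is full rank, as the mechanism requires.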
slide-106
SLIDE 106

Representative experimental findings

  • Benefit of fixed observations:
  • Error for W = {all range queries} can be reduced by a factor of 2-4 by using wavelet or hierarchical observations. [Xiao, ICDE ‘10] [Hay, PVLDB ‘10]
  • Benefit of optimized observations:
  • ε-DP: error reduced by 2-3 times compared with fixed observation methods. [Yuan, VLDB ’12]
  • (ε,δ)-DP: error reduced by 2-6 times on range and marginal workloads for which fixed observation methods were designed; up to 10 times reduction for ad hoc workloads. [Li, PVLDB ‘12]

Note 1: comparisons don’t depend on input data or privacy parameters.
Note 2: ratios based on root mean squared error.

slide-107
SLIDE 107

Lower bound on error

  • Given workload W with singular values λ1 ≥ ... ≥ λn, the minimum total error of the matrix mechanism is greater than or equal to the following (and the bound is tight):

Privacy / error lower bound:
  • ε-DP: (2/ε2)(1/n)(λ1 + ... + λn)2
  • (ε,δ)-DP: (2log(2/δ)/ε2)(1/n)(λ1 + ... + λn)2
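The ε-DP bound is straightforward to evaluate with an SVD. A sketch (the function name is mine); for W = I all singular values are 1, so the bound is 2n/ε2, which the identity strategy achieves:

```python
import numpy as np

def svd_lower_bound(W, eps):
    """eps-DP bound: (2/eps^2) * (1/n) * (lambda_1 + ... + lambda_n)^2."""
    n = W.shape[1]
    lam = np.linalg.svd(W, compute_uv=False)   # singular values of W
    return (2.0 / eps**2) * lam.sum() ** 2 / n
```

Comparing a candidate strategy's total error against this quantity shows how far it is from optimal.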

slide-108
SLIDE 108

Runtime complexity

  • Answering W using the Laplace/Gaussian mechanism takes O(|W|n) time.

Costs: fixed observations vs. optimized observations
  • 1. Select observations: none vs. ~O(n4)
  • 2. Apply standard mechanism: O(|A|n) vs. O(|A|n)
  • 3. Derive answers: O(|W|n) vs. O(|W|n2)

  • Because of data-independence, the observation matrix can be preprocessed: given a fixed workload W and observation matrix A, runtime is O(|W|n) after pre-computation of WA+, no worse than standard mechanisms.

slide-109
SLIDE 109

Summary: workload-aware mechanisms

Workload / observations / citation:

Fixed:
  • low-order marginals / Fourier basis queries / [Barak, PODS ‘07]
  • all one-dim range queries / hierarchical ranges / [Hay, PVLDB ‘10]
  • all (multi-dim) range queries / Haar wavelet queries / [Xiao, ICDE ‘10]
  • 2-dim range queries / quad-tree queries / [Cormode, ICDE ’12]

Optimized:
  • sets of data cubes / sets of data cubes / [Ding, SIGMOD ’11]
  • set of linear queries / set of linear queries / [Li, PODS ‘10] [Li, PVLDB ‘12]
  • set of linear queries / low-rank set of linear queries / [Yuan, VLDB ’12]

  • These methods can be seen as a generalization of the Laplace/Gaussian mechanism, with error rates significantly reduced and independent of data.

slide-110
SLIDE 110

Summary: workload-aware mechanisms

  • Benefits
  • Independence of data makes error analysis easy, error rates

publishable to analyst, and improves efficiency in some cases.

  • Limitations
  • Computational dependence on domain size, n.
  • Error dependence on epsilon: 1/ε2
  • For some workloads, there is no set of observations that can

help much.

  • Open questions
  • Alternative derivation methods: e.g. non-negative least squares
  • Relationship with “universal” error lower bounds for DP

slide-111
SLIDE 111

Outline

  • 1. Preliminaries
  • 2. Approach 1: workload-aware
  • Fixed Observations
  • Optimized Observations
  • 3. Approach 2: data-aware
  • 4. Conclusions
slide-113
SLIDE 113

(Recall) Approach 2: data-aware mechanisms

[Diagram: the analyst submits workload W = {w1, w2, w3}. The server selects observations A = {a1, a2, a3}, applies a standard Laplace or Gaussian mechanism to the database D to obtain a1(D) + noise, a2(D) + noise, a3(D) + noise, tests the dataset (noisy result T’ for test T), and derives noisy estimates of w1(D), w2(D), w3(D) for the analyst.]

Steps: select observations → apply standard mechanism → test dataset → derive workload answers.

slide-115
SLIDE 115

A basic intuition

  • Detect when additional observations won’t help much.
  • Challenges:
  • Balance the privacy budget between testing data and usable observations.
  • When possible, incorporate test observations into query answers.
  • Perturbation error vs. approximation error.

[Figure: counts x1 ... x10 grouped into near-uniform regions, e.g. x1+x2+x3, x4+x5, x6+x7+x8+x9+x10.]

slide-116
SLIDE 116
Data-aware histogram [Xu, ICDE ’12]

Workload: 1D range queries. Parameters: k, ε1, ε2 s.t. ε1 + ε2 = ε.

  • 1. Compute a private estimate of the k-bin, variance-optimal histogram using the exponential mechanism (budget ε1).
  • 2. Use the Laplace mechanism to get bin counts and all individual counts (budget ε2).
  • 3. Derive answers to workload queries using least squares.

slide-117
SLIDE 117

Techniques for spatial queries

  • Spatial queries are 2-dimensional counting queries (typically range queries).
  • kd-tree: a data-aware hierarchical space-partitioning data structure.
slide-122
SLIDE 122

Data-aware kd-tree (1) [Xiao, SDM ‘10]

Workload: 2D range queries. Parameters: p1, p2, ε1, ε2 s.t. ε1 + ε2 = ε (here ε1 = ε2 = ε/2).

  • 1. Use the Laplace mechanism to get noisy counts x’ (budget ε1).
  • 2. Build kd-tree K from x’, but stop splitting if:
  • the sum of counts in the current region is too small (p1), or
  • the counts in the current region are close to uniform (p2).
  • 3. Use the Laplace mechanism to get noisy counts K’ for all regions in K (budget ε2).
  • 4. Compute workload answers from K’ using least squares.
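The final least-squares derivation step recurs across these mechanisms. A toy numpy sketch with hypothetical observations (individual counts plus their total, names my own):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([3.0, 1.0, 4.0, 1.0])            # true (private) counts
A = np.vstack([np.eye(4), np.ones((1, 4))])   # observations: counts + total
y = A @ x + rng.laplace(scale=0.1, size=5)    # noisy answers to observations
x_hat, *_ = np.linalg.lstsq(A, y, rcond=None) # consistent estimate of x
W = np.triu(np.ones((4, 4)))                  # toy range-query workload
answers = W @ x_hat                           # derived workload answers
```

Least squares reconciles the redundant noisy observations into a single consistent estimate, which typically reduces variance compared with using any one observation alone.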

slide-123
SLIDE 123

Data-aware kd-tree (2) [Cormode, ICDE ’12]

Workload: 2D range queries. Parameters: l, k, ε1, ε2 s.t. ε1 + ε2 = ε (here ε1 = 0.3ε, ε2 = 0.7ε).

  • 1. Build a hybrid hierarchical structure:
  • l levels of kd-tree, using the exponential mechanism to compute medians (budget ε1);
  • remaining (k-l) levels: uniform quad-tree.
  • 2. Use the Laplace mechanism to get noisy counts (budget ε2).
  • 3. Derive workload query answers using least squares.

slide-124
SLIDE 124

Optimizing for relative error [Xiao, SIGMOD ’11]

Workload: marginals. Parameters: T, ε.

  • 1. Answer all workload queries using the Laplace mechanism with budget ε/T.
  • 2. Repeat T-1 times: refine the query answers by resampling queries with small values.
  • Final query answers have the same privacy cost as a single Laplace random variable, with the resulting error.

slide-125
SLIDE 125

Multiplicative weights [Hardt, NIPS ’12]

Workload: linear queries. Parameters: T, ε1, ε2 s.t. T(ε1 + ε2) = ε (here ε1 = ε2 = ε/2T per round).

  • Begin with a uniform estimate x0 of database x.
  • For i = 1...T:
  • Evaluate all workload queries using the current estimate xi-1; select an inaccurate query qi with the exponential mechanism (budget ε1).
  • Laplace mechanism: get a noisy estimate mi of qi (budget ε2).
  • Update xi-1 → xi using mi via multiplicative weights.
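A sketch of one multiplicative-weights update (the normalization convention, keeping total mass fixed, is my own assumption; the function name is mine):

```python
import numpy as np

def mw_update(x_est, q, m):
    """One multiplicative-weights step: shift mass toward the bins of query q
    when the noisy answer m exceeds the current estimate, away otherwise."""
    err = m - q @ x_est                         # how far off the estimate is
    total = x_est.sum()
    x_new = x_est * np.exp(q * err / (2.0 * total))
    return x_new * (total / x_new.sum())        # renormalize to fixed total
```

Because only one noisy query answer is consumed per round, the update concentrates the privacy budget on the queries where the current estimate is worst.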

slide-126
SLIDE 126

Multiplicative weights

  • Provably better dependence on ε than workload-aware techniques: squared error O(1/ε2/3) vs. O(1/ε2).
  • Observations customized to workload.
  • Very good accuracy for sparse datasets.
  • Output satisfies non-negativity constraints.
  • Must compute all workload queries T times.

[Hardt, NIPS ’12]

slide-127
SLIDE 127

Representative experimental findings

  • Building a data-aware histogram reduces error on range queries by 20-40% compared with fixed workload-aware methods like wavelet or tree-based. [Xu, ICDE ’12]
  • Neither of the data-aware kd-trees consistently outperforms the workload-aware quad-tree (on random sets of 2D range queries). [Cormode, ICDE ’12]
  • For reasonable privacy parameters and small workloads of random range queries on sparse data, multiplicative weights can reduce error by a factor of 10 over the matrix mechanism. [Hardt, NIPS ’12]
  • (But for other datasets, it can be outperformed by a factor of 10 by a fixed workload-aware method like wavelet.)

Note: ratios based on root mean squared error.

slide-128
SLIDE 128

Data-aware mechanisms

Workload / observations / citation:
  • 1D range queries / approx. v-optimal histogram / [Xu, ICDE ’12]
  • 2D range queries / kd-tree queries / [Xiao, SDM ‘10]
  • 2D range queries / hybrid kd-tree queries / [Cormode, ICDE ’12]
  • marginals / scaled workload queries / [Xiao, SIGMOD ’11]
  • linear queries / subset of workload / [Hardt, NIPS ’12]

  • Observations are selected to match properties of the database; generally efficient, but spending privacy budget on testing doesn’t always pay off.

slide-129
SLIDE 129

Summary: data-aware mechanisms

  • Benefits:
  • Lower error than Approach 1 in some cases.
  • Limitations:
  • Parameters for algorithms must be selected carefully.
  • Public error rates not available to analyst.
  • Techniques are data-aware, but are they workload-aware?
  • Open questions:
  • Evaluation highly dependent on workload, dataset, epsilon.

What are “real” data and workloads like? What properties of data determine error?

slide-130
SLIDE 130

Outline

  • 1. Preliminaries
  • 2. Approach 1: workload-aware
  • Fixed Observations
  • Optimized Observations
  • 3. Approach 2: data-aware
  • 4. Conclusions
slide-131
SLIDE 131

Outline

  • 1. Preliminaries
  • 2. Approach 1: workload-aware
  • Fixed Observations
  • Optimized Observations
  • 3. Approach 2: data-aware
  • 4. Conclusions
slide-132
SLIDE 132

Summary and conclusions

  • Two approaches to batch query answering, each of which provides significant error improvements by building on standard Laplace/Gaussian mechanisms with alternative observations.
  • Workload-aware methods ignore the input data, and choose observations solely by analyzing the workload.
  • Data-aware methods carefully (i.e. privately) exploit properties of the input data.
  • Both approaches are efficient for modestly sized domains.
slide-133
SLIDE 133
Workload-aware
  • Benefits:
  • Independence of data makes error analysis easy, error rates publishable to the analyst, and improves efficiency in some cases.
  • Limitations:
  • Computational dependence on domain size, n.
  • Error dependence on epsilon: 1/ε2.
  • For some workloads, there is no set of observations that can help much.
  • Open questions:
  • Alternative derivation methods: e.g. non-negative least squares.
  • Relationship with “universal” error lower bounds for DP.

Data-aware
  • Benefits:
  • Lower error than Approach 1 in some cases.
  • Limitations:
  • Parameters for algorithms must be selected carefully.
  • Public error rates not available to the analyst.
  • Techniques are data-aware, but are they workload-aware?
  • Open questions:
  • Evaluation highly dependent on workload, dataset, epsilon. What are “real” data and workloads like? What properties of data determine error?

slide-134
SLIDE 134

Open issues

  • What makes one workload “harder” to answer than another?
  • What makes one database “harder” to support accurately?
  • Can we avoid the computational dependence on the domain size n?
  • How do we analyze the error resulting from non-negative least squares, if applied in the derivation step of the matrix mechanism?

  • Methods for more expressive queries.
slide-135
SLIDE 135

References

[Barak, PODS ‘07] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Principles of Database Systems (PODS), 2007.

[McSherry, FOCS ‘07] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, 2007.

[McSherry, SIGMOD ’09] F. D. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In SIGMOD, 2009.

[Hay, PVLDB ‘10] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private queries through consistency. PVLDB, 2010.

[Xiao, ICDE ‘10] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms. In International Conference on Data Engineering (ICDE), 2010.

[Li, PODS ‘10] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor. Optimizing linear counting queries under differential privacy. In Principles of Database Systems (PODS), 2010.

[Xiao, SDM ‘10] Y. Xiao, L. Xiong, and C. Yuan. Differentially private data release through multidimensional partitioning. In Secure Data Management (SDM), 2010.

[Xiao, SIGMOD ’11] X. Xiao, G. Bender, M. Hay, and J. Gehrke. iReduct: Differential privacy with reduced relative errors. In SIGMOD, 2011.

[Ding, SIGMOD ’11] B. Ding, M. Winslett, J. Han, and Z. Li. Differentially private data cubes: optimizing noise sources and consistency. In SIGMOD, 2011.

[Cormode, ICDE ’12] G. Cormode, M. Procopiuc, D. Srivastava, E. Shen, and T. Yu. Differentially private spatial decompositions. In International Conference on Data Engineering (ICDE), 2012.

slide-136
SLIDE 136

References (cont’d)

[Xu, ICDE ’12] J. Xu, Z. Zhang, X. Xiao, Y. Yang, and G. Yu. Differentially private histogram publication. In ICDE, 2012.

[Li, PVLDB ‘12] C. Li and G. Miklau. An adaptive mechanism for accurate query answering under differential privacy. Proceedings of the VLDB Endowment (PVLDB), 2012.

[Yuan, VLDB ’12] G. Yuan, Z. Zhang, M. Winslett, X. Xiao, Y. Yang, and Z. Hao. Low-rank mechanism: optimizing batch queries under differential privacy. In VLDB, 2012.

[Hardt, NIPS ’12] M. Hardt, K. Ligett, and F. McSherry. A simple and practical algorithm for differentially private data release. In NIPS, 2012.