Learning to Reconstruct: Statistical Learning Theory and Encrypted Database Attacks (PowerPoint PPT Presentation)




SLIDE 1

Paul Grubbs, Marie-Sarah Lacharité, Brice Minaud, Kenny Paterson

Learning to Reconstruct: Statistical Learning Theory and Encrypted Database Attacks

pag225@cornell.edu, @pag_crypto

SLIDE 2

Database

Outsourced Databases

SLIDE 3

Database

Encrypting Outsourced Databases?

Encryption prevents querying!

SLIDE 4

Encrypted Database

Encrypted Databases

Efficient ones leak access patterns: the set of records matching each query

What can an attacker learn from access pattern leakage?

SLIDE 5

Encrypted Database

Database Reconstruction (DR)

With enough queries, an attacker can learn the data from access patterns alone! [KKNO], [LMP], [KPT]

Prior work needs huge numbers of queries, strong assumptions, or specific query types: [KKNO] needs ~10²⁶ queries for salary data; [LMP] requires a dense database; [KPT] handles kNN queries only.

SLIDE 6

Our Contributions

  • Enabling insight: access pattern leakage is a binary classification problem

Use statistical learning theory (SLT) to build and analyze attacks

  • New DR attacks on range queries

Generalize and improve [KKNO], [LMP] with SLT + PQ trees. On real data: with only 50 queries, predict salaries to within 2% error.

  • Generic reduction from DR with known queries to PAC learning
  • Give “minimal” attack for all query types via ε-nets

Instantiate with first DR attack for prefix queries

  • First general lower bound on #queries needed for DR

Full version: https://eprint.iacr.org/2019/011

SLIDE 8

Notation and Terminology

N: number of possible values; wlog the domain is [1, …, N]. E.g., N = 125 for age data.
Range query: a pair [a, b] with 1 ≤ a ≤ b ≤ N; it matches every record whose value lies in [a, b].
Database: a collection of records, each with a value in [1, …, N].

Example: records 1, 2, 3, 4 hold values 12, 8, 14, 9. The range query [7, 10] matches values 8 and 9, so its access pattern is records {2, 4}.

Full database reconstruction (DR): recovering exact record values.
Approximate DR: recovering all record values to within εN. ε = 0.05 is recovery within 5%; ε = 1/N is full DR.
Scale-free: query complexity independent of the number of records and of N.
Access pattern: which records match each query.
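The leakage model in the example above can be sketched in a few lines (hypothetical Python, not from the paper): the server stores encrypted records but still learns which record identifiers match each range query.

```python
# Hypothetical model of access-pattern leakage for range queries.
# The example database from the slide: record ids 1..4 with values 12, 8, 14, 9.
db = {1: 12, 2: 8, 3: 14, 4: 9}

def access_pattern(db, a, b):
    """What an efficient encrypted database leaks for the range query [a, b]:
    the set of identifiers of matching records (not the values themselves)."""
    return {rid for rid, value in db.items() if a <= value <= b}

print(access_pattern(db, 7, 10))  # the query [7, 10] matches records 2 and 4
```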

SLIDE 9

DR For Range Queries: Our Work

[KKNO16]: full DR in O(N⁴ log N) queries. Our work generalizes it.
[LMP18]: full DR for a dense DB in O(N log N) queries. Our work implies it, bypassing the [LMP] lower bound by relaxing to "sacrificial" reconstruction.

Three attacks (with the full-DR cost at ε = 1/N and the matching lower bound):

  • GeneralizedKKNO: O(ε⁻⁴ log ε⁻¹) for approx. DR; full DR in O(N⁴ log N); lower bound Ω(ε⁻⁴)
  • ApproxValue: O(ε⁻² log ε⁻¹) for approx. DR*; full DR in O(N² log N); lower bound Ω(ε⁻²)
  • ApproxOrder: O(ε⁻¹ log ε⁻¹) for approx. order reconstruction*; full DR in O(N log N); lower bound Ω(ε⁻¹ log ε⁻¹)

*Requires a mild hypothesis about the data.

With DB distribution info, ApproxOrder also yields approximate DR.

SLIDE 10

GeneralizedKKNO

Assume a uniform distribution on range queries + a static database. This induces a distribution f on the probability that each value in [1, N] is accessed: values near the endpoints are less probable, values near the middle more probable.

SLIDE 11

GeneralizedKKNO

Idea: for each record…

  • 1. Count #accesses to estimate f(value)
  • 2. Find value by "inverting" the f estimate

How many queries do we need to get an estimate sufficient for ε-approximate DR? Since f is symmetric, inverting the estimate yields two candidate values; more work is needed to break the symmetry. See paper for details.
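Under the uniform-query assumption, f has a simple closed form: v·(N+1−v) of the N(N+1)/2 possible ranges contain v. A quick sketch (hypothetical code, with N = 125 as in the age example) shows the symmetry that produces the two candidate values:

```python
# f(v): probability that a uniformly random range query [a, b] contains v.
# There are v choices of a <= v and (N + 1 - v) choices of b >= v.
N = 125

def f(v):
    total_ranges = N * (N + 1) // 2
    return v * (N + 1 - v) / total_ranges

# f is symmetric about (N + 1) / 2: inverting an estimate of f therefore
# yields TWO candidate values, v and N + 1 - v ("Two values!").
assert f(10) == f(N + 1 - 10)
assert f(1) < f((N + 1) // 2)  # endpoints are accessed least, the middle most
```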

SLIDE 12

Estimating a Probability

Set X with probability distribution D; let C ⊆ X be a set. Sample complexity: to estimate Pr(C) to within ε, you need O(1/ε²) samples.

slide-13
SLIDE 13

Estimating a Set of Probabilities

Now: a set of sets 𝓓. Goal: estimate all sets' probabilities simultaneously. A set S of samples drawn from X according to D is an ε-sample iff for all C in 𝓓:

    | Pr_D(C) − |S ∩ C| / |S| | ≤ ε

SLIDE 14

The ε-sample Theorem

How many points do we need to draw to get an ε-sample w.h.p.?

Vapnik & Chervonenkis 1971: if 𝓓 has VC dimension d, then O((d/ε²) · log(1/ε)) points suffice to get an ε-sample w.h.p. The bound does not depend on |𝓓|!
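A small simulation of the theorem in the talk's own setting (hypothetical code; constants chosen for the demo): 𝓓 is the family of "ranges containing x" sets, which has VC dimension 2, so a single draw of queries estimates every set's probability simultaneously.

```python
import random

random.seed(1)
N = 50
eps = 0.1
ranges = [(a, b) for a in range(1, N + 1) for b in range(a, N + 1)]

def f(x):
    # true probability that a uniform range query contains x
    return x * (N + 1 - x) / len(ranges)

m = 2000   # on the order of eps^-2 log eps^-1; constant chosen for the demo
sample = [random.choice(ranges) for _ in range(m)]

# The eps-sample condition must hold for ALL sets in the family at once.
is_eps_sample = all(
    abs(sum(a <= x <= b for (a, b) in sample) / m - f(x)) <= eps
    for x in range(1, N + 1)
)
assert is_eps_sample
```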

SLIDE 15

GeneralizedKKNO

Idea: for each record…

  • 1. Count #accesses to estimate f(value)
  • 2. Find value by "inverting" the f estimate

This is an ε-sample! X = range queries, 𝓓 = {{range queries ∋ x} : x ∈ [1, N]}, VC dim. = 2. We need O(ε⁻⁴ log ε⁻¹) queries (inverting f adds a square). Can we get rid of the squaring?
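Putting the pieces together, a toy end-to-end sketch of the counting-and-inverting idea (hypothetical code, not the authors' implementation; the query count and tolerance are chosen for the demo):

```python
import random

random.seed(2)
N = 100
values = {1: 20, 2: 55, 3: 80}   # hidden record values (toy database)
ranges = [(a, b) for a in range(1, N + 1) for b in range(a, N + 1)]

def f(v):
    # probability that a uniform range query contains v
    return v * (N + 1 - v) / len(ranges)

# Step 1: count #accesses per record over m uniform range queries.
m = 200_000
counts = dict.fromkeys(values, 0)
for _ in range(m):
    a, b = random.choice(ranges)
    for rid, v in values.items():
        if a <= v <= b:          # leaked: this record matched the query
            counts[rid] += 1

# Step 2: invert f -- pick the value whose f is closest to the estimate.
recovered = {
    rid: min(range(1, N + 1), key=lambda x: abs(f(x) - counts[rid] / m))
    for rid in values
}
for rid, v in values.items():
    # correct up to the reflection v <-> N + 1 - v, plus small sampling error
    assert min(abs(recovered[rid] - v), abs(recovered[rid] - (N + 1 - v))) <= 5
```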

SLIDE 16

GeneralizedKKNO

Same idea as before: count #accesses, estimate f, invert. Now additionally assume there exists at least one record in [N/8, 3N/8].

SLIDE 17

ApproxValue

Same counting-and-inverting idea, plus the assumption that at least one record lies in [N/8, 3N/8]: a more complex attack needs only O(ε⁻² log ε⁻¹) queries! See paper.

SLIDE 18

DR For Range Queries: Our Work

Three attacks:

  • GeneralizedKKNO: O(ε⁻⁴ log ε⁻¹) for approx. DR
  • ApproxValue: O(ε⁻² log ε⁻¹) for approx. DR*
  • ApproxOrder: O(ε⁻¹ log ε⁻¹) for approx. order reconstruction*

These attacks require iid uniform queries and an adversary who knows the query distribution. What can we do without these assumptions?

SLIDE 19

DR For Range Queries: Our Work

Three attacks:

  • GeneralizedKKNO: O(ε⁻⁴ log ε⁻¹) for approx. DR
  • ApproxValue: O(ε⁻² log ε⁻¹) for approx. DR*
  • ApproxOrder: O(ε⁻¹ log ε⁻¹) for approx. order reconstruction*

ApproxOrder reveals the order with no assumptions on the query distribution. See paper for details.

SLIDE 20

Conclusion

  • Enabling insight: access pattern leakage is a binary classification problem

Use statistical learning theory (SLT) to build and analyze attacks

  • New DR attacks on range queries

Generalize and improve [KKNO], [LMP] with SLT + PQ trees. On real data: with only 50 queries, predict salaries to within 2% error.

  • Generic reduction from DR with known queries to PAC learning
  • Give “minimal” attack for all query types via ε-nets

Instantiate with first DR attack for prefix queries

  • First general lower bound on #queries needed for DR

Full version: https://eprint.iacr.org/2019/011

Thanks for listening! Any questions?

SLIDE 22

Attack Simulation

ApproxValue experimental results (R = 1000 records), compared to the theoretical ε⁻² log ε⁻¹ ε-sample bound: maximum sacrificed symmetric value and maximum symmetric error, as a fraction of N, versus the number of queries (up to 500), for N = 100, 1000, 10000, 100000. Effective constants are ~1!

SLIDE 23

DR As Learning a Binary Classifier

X = range queries, 𝓓 = {{range queries ∋ x} : x ∈ [1, N]}. This formulation is not specific to range queries! Record values are binary classifiers; approximately learning the classifier => approximate DR.
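A two-line illustration of the reformulation (hypothetical code): a record with value v is exactly the binary classifier that labels a query positive iff the query's range contains v.

```python
# A record with value v, viewed as a binary classifier on range queries:
# it outputs 1 iff the query [a, b] contains v. Approximately learning this
# classifier from labeled queries (the access pattern) recovers v.
def classifier(v):
    return lambda q: int(q[0] <= v <= q[1])

h = classifier(42)
assert h((40, 50)) == 1
assert h((1, 10)) == 0
```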