m-Invariance and Dynamic Datasets based on: Xiaokui Xiao, Yufei Tao - - PowerPoint PPT Presentation

m invariance and dynamic datasets
SMART_READER_LITE
LIVE PREVIEW

m-Invariance and Dynamic Datasets based on: Xiaokui Xiao, Yufei Tao - - PowerPoint PPT Presentation

m-Invariance and Dynamic Datasets based on: Xiaokui Xiao, Yufei Tao m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets Slawomir Goryczka Panta rhei (Heraclitus) "everything is in a state of flux" To provide


slide-1
SLIDE 1

m-Invariance and Dynamic Datasets

based on: Xiaokui Xiao, Yufei Tao m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets

Slawomir Goryczka

slide-2
SLIDE 2

Panta rhei (Heraclitus)

"everything is in a state of flux"

  • To provide most recent anonymized data

publisher needs to re-publish them

  • Most of the current approaches do not consider

this!

  • Exception:

– Support only insertions of data – J.-W. Byun, Y. Sohn, E. Bertino, and N. Li Secure

anonymization for incremental datasets. (2006)

  • Where is the problem?
slide-3
SLIDE 3

Maybe it's simple?

We just need to ensure that:

  • Dataset is not published too often (movie effect)
  • We use different algorithm for each dataset

snapshot (“white” noise instead of the movie effect, but may be used to identify part of the data!)

  • Play with data to keep similar statistics of

attribute values – what with long time trends, i.e. flu pandemic, which change global and local statistics of the data

slide-4
SLIDE 4

Deletion of tuples

  • Deletion of data may introduce critical

absence:

slide-5
SLIDE 5

Deletion of tuples

  • Deletion of data may introduce critical

absence:

slide-6
SLIDE 6

Deletion of tuples

  • Deletion of data may introduce critical

absence:

slide-7
SLIDE 7

Deletion of tuples

  • Deletion of data may introduce critical

absence:

slide-8
SLIDE 8

Deletion of tuples

  • Deletion of data may introduce critical

absence:

Bob has dyspepsia

slide-9
SLIDE 9

Deletion of tuples

  • Deletion of data may introduce critical

absence:

Bob has dyspepsia Solution(?) Ignore deletions

slide-10
SLIDE 10

Counterfeit generalization

  • Add some counterfeit tuples to avoid critical

absence

  • Publish number and location of these tuples

(utility)

slide-11
SLIDE 11

Counterfeit generalization

  • Add some counterfeit tuples to avoid critical

absence

  • Publish number and location of these tuples

(utility)

slide-12
SLIDE 12

Counterfeit generalization

  • Add some counterfeit tuples to avoid critical

absence

  • Publish number and location of these tuples

(utility)

slide-13
SLIDE 13

Counterfeit generalization

(continued)

  • Crucial to preserve privacy is to ensure certain

invariance in all quasi-identifier groups that a tuple (here: Bob's tuple) is generalized to in different snapshots

  • Existing generalization schemas are special

cases of counterfeited generalization, where there is no counterfeits

  • Goal: minimize number of counterfeit tuples, but

ensure privacy among all snapshots. How?

slide-14
SLIDE 14

m-Invariance

m-unique

each QI group in anonymized table T*(j) contains ≥m tuples with different sensitive data among them

m-invariant

  • T*(j) is m-unique for all 1≤j≤n
  • For each tuple t, for each data snapshot where this

tuple appears, its QI generalized group have the same set of distinct sensitive values (For each QI generalized group its set of distinct sensitive values is constant – no problems with critical absence, but each tuple have limited number of QI generalized groups where it can belongs to)

slide-15
SLIDE 15

Privacy disclosure risk

  • Privacy disclosure

risk for tuple t: risk(t) = nis(t)/nrs

– nis(t) – number of

reasonable surjective functions that correctly reconstruct t

– nrs – number of all

reasonable surjections

slide-16
SLIDE 16

m-Invariance (properties)

  • If {T*(1), ..., T*(n)} is m-invariant, then

risk(i) ≤ 1/m, 1 ≤ i ≤ n

  • If {T*(1), ..., T*(n-1)} is m-invariant, then {T*(1),

..., T*(n)} is also m-invariant if and only if:

– T*(n) is m-unique – For any tuple its generalized QI

groups in snapshots T*(n-1) and T*(n) have the same signature (set of distinct sensitive values).

t ∈T n−1∩T n

slide-17
SLIDE 17

m-Invariant algorithm

  • n-th publication is allowed, only if T(n)-T(n-1)

is m-eligible, that is, at most 1/m of the tuples in T(n)-T(n-1) have an identical sensitive value

  • Algorithm (4 phases):

1.Division 2.Balancing 3.Assignment 4.Split

slide-18
SLIDE 18

m-Invariant algorithm

(continued)

  • Division – group tuples common for T*(n-1)

and T(n) with the same signature into one bucket

  • Balancing – balance number of tuples in

buckets using counterfeits if necessary (they have no value for QI attributes)

slide-19
SLIDE 19

m-Invariant algorithm

(continued)

  • Assignment – add tuples, which were not in

T*(n-1), but are in T(n) using similar steps to Dividing and Balancing

  • Split – split each bucket B into |B|/s QI

generalized groups where s (≥m) is the number

  • f values in the signature of B. Each group has

s tuples, taking the s sensitive values in the signature, respectively.

slide-20
SLIDE 20

m-Invariant algorithm

(continued)

  • Assignment – add tuples, which were not in

T*(n-1), but are in T(n) using similar steps to Dividing and Balancing

  • Split – split each bucket B into |B|/s QI

generalized groups where s (≥m) is the number

  • f values in the signature of B. Each group has

s tuples, taking the s sensitive values in the signature, respectively.

slide-21
SLIDE 21
  • Datasets (Tooc, Tsal):

– 400k tuples (600k in total) – Attributes: Age, Gender, Education, Birthplace,

Occupation, Salary

Experiments

slide-22
SLIDE 22

Pros and cons

  • Incremental
  • Small data

disturbance

  • High data utility

(measured as a median relative error for queries)

  • ...
  • Preserving current

statistics of attribute values – what if they change?

  • What about

continues attributes (numbers)?

  • ...
slide-23
SLIDE 23

Q & I*

* Ideas