

SLIDE 1

TOWARDS A QUALITY ASSESSMENT OF DISCLOSURE-LIMITED STATISTICAL DATA Lawrence H. Cox, Ph.D. National Center for Health Statistics LCOX@CDC.GOV

SLIDE 2

QUALITY-CONFIDENTIALITY TRADEOFF

To reduce risk of statistical disclosure to an acceptable level, statistical disclosure limitation (SDL) methods

  • abbreviate
  • eliminate
  • modify

original data.

Lowering disclosure risk typically forces reduction of data quality in terms of

  • accuracy
  • completeness
  • usability

Over the past 4 decades, SDL methods have been

  • studied/developed
  • improved/refined/implemented

with considerable success. At the same time, efforts to assess/control/assure quality were virtually absent.

SLIDE 3

This presentation

  • examines quality effects of three SDL methods for tabular data
  • explores quality-preserving methods

The three methods

  • rounding
  • complementary cell suppression
  • controlled tabular adjustment
SLIDE 4

HIGHLIGHTS

In view of time limitations, the take-home messages are:

Rounding

  • rounding keeps the data release intact
  • methods for quality-preserving rounding (preserving mean and variance; preserving distribution) are available to NSOs
  • rounding can limit disclosure effectively

Complementary cell suppression

  • has very negative effects on data quality, especially as the data release is not intact
  • in the absence of a mathematical model, in some cases suppression can be undone
  • the security of suppression hinges on a single quantity that often can be estimated
  • p-percent rules can be vulnerable
  • p/q-ambiguity rules are vulnerable

Controlled tabular adjustment

  • keeps the data release intact
  • can preserve key values and statistics
  • can preserve original distribution
  • effectively limits disclosure
SLIDE 5

ROUNDING

Rounding (base B): replace original data values x = qB + r, 0 ≤ r < B, by integer multiples R(x) = mB of an integer rounding base B.

Adjacent rounding (typical): |x − R(x)| < B
Zero-restricted rounding (typical): R(mB) = mB
Controlled rounding preserves additivity.

We are concerned with

  • effects of base B rounding on statistical properties of original data (data quality)
      • mean
      • variance/TMSE
      • distribution
  • effects on disclosure risk: P[x | R(x)]
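As a concrete illustration, here is a minimal Python sketch of zero-restricted 50/50 adjacent rounding as defined above (the function name is ours, not from the slides):

```python
import random

def round_50_50(x: int, B: int) -> int:
    """Zero-restricted 50/50 rounding of x to base B.

    Multiples of B are left unchanged (zero-restricted); any other value
    x = qB + r is rounded down to qB or up to (q + 1)B, each with
    probability 1/2 (adjacent rounding).
    """
    q, r = divmod(x, B)
    if r == 0:
        return x
    return (q + random.randint(0, 1)) * B
```

For example, `round_50_50(30, 10)` always returns 30, while `round_50_50(37, 10)` returns 30 or 40, each half the time.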
SLIDE 6

Principal issues in evaluating an SDL method:

(1) Is the method effective for limiting disclosure?
(2) Are its effects on data quality acceptable?

Examined these questions for four rounding rules

  • conventional rounding
  • modified conventional rounding
  • zero-restricted 50/50 rounding
  • unbiased rounding

We report only on zero-restricted 50/50 rounding.

We evaluate a rounding rule/base (1) in terms of the posterior probability of an original data value given its rounded value, and (2) in terms of the expected increase in total mean squared error and the expected difference between pre- and post-rounding distributions, as measured by a conditional Chi-square statistic.

SLIDE 7

We assume

  • r- and q-distributions independent
  • r ∼ Uniform{0, 1, …, B − 1}

(can be relaxed)

Focus on adjacent rounding

  • R(x) = qB or (q + 1)B
  • R(x) = qB + R(r) with R(r) = 0 or B

Zero-restricted 50/50 rounding

  • r = 0: round down
  • r ≠ 0: round down or up each with probability ½

Assumptions imply

  • E[x] = B·E[q] + E[r]
  • P[r] = P[r|q] = 1/B
  • V(x) = B²·V(q) + V(r)
SLIDE 8

EFFECTS OF ROUNDING ON MEAN, VARIANCE

For zero-restricted 50/50 rounding

  P[R(r) = 0] = (B + 1)/2B  and  P[R(r) = B] = (B − 1)/2B, thus

  E[R(r)] = (B − 1)/2  and  V[R(r)] = (B² − 1)/4

Expected value of x and R(x)

  Unrounded: qB + (B − 1)/2
  50/50:     qB + (B − 1)/2

Variance of x and R(x)

  Unrounded: B²·V[q] + (B² − 1)/12
  50/50:     B²·V[q] + (B² − 1)/4
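The mean and variance of R(r) stated on this slide can be checked empirically; a small Monte Carlo sketch under the slides' assumption that r is uniform on {0, …, B−1}:

```python
import random

def round_residual(r: int, B: int) -> int:
    """R(r) under zero-restricted 50/50 rounding: 0 if r = 0,
    else 0 or B each with probability 1/2."""
    return 0 if r == 0 else B * random.randint(0, 1)

B = 10
n = 200_000
draws = [round_residual(random.randrange(B), B) for _ in range(n)]
mean = sum(draws) / n
var = sum((d - mean) ** 2 for d in draws) / n
# theory: E[R(r)] = (B - 1)/2 = 4.5 and V[R(r)] = (B**2 - 1)/4 = 24.75
```

With 200,000 draws the simulated mean and variance land close to the theoretical 4.5 and 24.75.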

SLIDE 9

EFFECTS OF ROUNDING ON x-DISTRIBUTION

Use the conditional Chi-square statistic

  χ² = Σ_x U_x,  where U_x = [R(x) − x]²/x = [R(r) − r]²/x  (U_x = 0 when x = 0)

Degrees of freedom df determined by the tabular structure.

  E[U_x | x] = (r²/x)·P[R(r) = 0] + ((B − r)²/x)·P[R(r) = B]

  • d = #{x} = the number of x-observations
  • e = #{x < B}, viz., zeroes and confidential values

Using

  E[(R(r) − r)²] = (1/B)·Σ_{s=1}^{B−1} s² = (B − 1)(2B − 1)/6

can derive

  E[χ²] ≤ e·(B − 1)/2 + (d − e)·[(B − 1)(2B − 1)/6B]·E[1/q | q ≥ 1]

NSO can estimate E[1/q | q ≥ 1]. So, NSO can select B so that the expected conditional Chi-square value is not statistically significant.

SLIDE 10

EFFECTIVENESS FOR DISCLOSURE LIMITATION

Evaluate effectiveness of rounding for SDL in terms of the posterior predictive probabilities P[x = r | R(x) = 0].

Under 50/50 rounding:

  P[x = r | R(x) = 0] = 1/(B + 1) for r ≠ 0,  and 2/(B + 1) for r = 0

Confidentiality analysis

  • prior r-probabilities uniform on {0, 1, …, B−1}
  • ideally, posterior probabilities uniform on same set
  • or, if x = 0 is not a confidential value, then uniform over its B−1 nonzero values
  • if x = r = 0 is not confidential, under 50/50 rounding posterior probabilities are uniform over the confidential values

Reference: Cox and Kim (2006)
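The posterior probabilities quoted above follow from Bayes' rule with a uniform prior; a small sketch in exact rational arithmetic:

```python
from fractions import Fraction

def posterior_given_round_zero(B: int) -> dict:
    """P[x = r | R(x) = 0] under zero-restricted 50/50 rounding,
    with a uniform prior on r in {0, ..., B-1}."""
    prior = Fraction(1, B)
    # likelihood P[R(r) = 0 | r]: 1 if r = 0, else 1/2
    lik = {r: (Fraction(1) if r == 0 else Fraction(1, 2)) for r in range(B)}
    evidence = sum(prior * lik[r] for r in range(B))
    return {r: prior * lik[r] / evidence for r in range(B)}

post = posterior_given_round_zero(10)
# post[0] == 2/11 and post[r] == 1/11 for every r != 0
```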

SLIDE 11

COMPLEMENTARY CELL SUPPRESSION

p-PERCENT RULE

For magnitude data, each respondent (contributor) to the value of cell X contributes an individual amount, e.g.,

  • monthly sales for a clothing store
  • weekly payroll for a factory
  • number of patient visits for an emergency room

Cell value of X is x = sum of all contributions xᵢ to X:

  x = Σᵢ xᵢ;  x₁ ≥ x₂ ≥ … ≥ xᵢ ≥ …

The p-percent rule is designed to prevent narrow estimation of any contribution to a cell value by a second contributor or third party. It says: a tabulation cell X is a disclosure (sensitive) cell if, after subtracting the second largest contribution from the cell value, the remainder is within p-percent of the largest contribution.

Express p as a decimal (not a percent); e.g., 20% = 0.20.

Sensitivity expressed via

  S_p(X) = x₁ − (1/p)·Σ_{i≥3} xᵢ > 0

NB: Protecting largest from second largest protects all
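The sensitivity measure S_p(X) can be coded directly; a minimal sketch with a hypothetical cell (the contribution values are made up for illustration):

```python
def p_percent_sensitivity(contributions, p):
    """S_p(X) = x1 - (1/p) * (sum of contributions other than the two largest).
    The cell is sensitive (a disclosure cell) iff S_p(X) > 0.
    p is expressed as a decimal, e.g. 0.20 for 20 percent."""
    xs = sorted(contributions, reverse=True)
    return xs[0] - sum(xs[2:]) / p

# Hypothetical cell: x1 = 100, and the remainder after removing the two
# largest contributions is 15 < 0.20 * 100, so the cell is sensitive.
s = p_percent_sensitivity([100, 40, 10, 5], p=0.20)   # s = 25.0 > 0
```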

SLIDE 12

p/q-AMBIGUITY RULE

In addition to p-percent protection, data releaser assumes intruder can estimate any contribution within q-percent.

Express q as decimal: q < 1 and, of course, q >> p.

Sensitivity expressed via

  S_{p/q}(X) = x₁ − (q/p)·Σ_{i≥3} xᵢ > 0

Thus, the p/q-ambiguity rule is stricter than the p-percent rule, viz., all p-percent sensitive cells are p/q-sensitive. When q = 1: p/q-ambiguity rule = p-percent rule.

Disclosure limitation method must take into account the ability of the intruder to estimate within q-percent.
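A sketch of the p/q measure, including the q = 1 special case noted above (the contributions are hypothetical):

```python
def pq_sensitivity(contributions, p, q):
    """S_{p/q}(X) = x1 - (q/p) * (sum of contributions other than the two
    largest); sensitive iff > 0. With q = 1 this is exactly the p-percent rule."""
    xs = sorted(contributions, reverse=True)
    return xs[0] - (q / p) * sum(xs[2:])

cell = [100, 40, 10, 5]                       # hypothetical cell
s_p  = pq_sensitivity(cell, p=0.20, q=1.0)    # p-percent rule as special case
s_pq = pq_sensitivity(cell, p=0.20, q=0.5)    # q < 1: larger S, stricter rule
```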

SLIDE 13

CCS

  • suppress from publication all sensitive cells
  • the disclosure rule enables releaser to compute for each sensitive cell the minimum uncertainty in estimation required to protect the cell
  • that quantity is dependent on the distribution of contributions within the cell and differs from cell to cell and cell value to cell value
  • it is called X's protection limit r(X) = r
  • select other, nonsensitive cells whose suppression will render the tabulations safe according to the disclosure rule--the complementary suppressions
  • safe means that no interval for x finer than [x−r, x+r] is derivable from released tabulations
  • select the complementary suppressions optimally with respect to some information loss criterion, e.g.,
      • total value suppressed
      • total number of suppressions
      • Berg entropy
  • very complex mathematically/computationally
  • for the p/q-rule, the mathematical suppression model must take into account the ability of the intruder to estimate values to within q-percent

SLIDE 14

Mathematical models for CCS

Tabular structure is represented as Ay = 0; entries of A are −1, 0, +1.
Original data: a = (a₁, …, aₙ); Aa = 0.
Sensitive cell values: a_d(i), i = 1, …, s.
Protection: r_d(i), 0 < r_d(i) < a_d(i), and r_k = 0 otherwise.

CCS model:

  min Σ_{k=1}^{n} c_k z_k  subject to, for i = 1, …, s; j = 1, 2; k = 1, …, n:

    A y^{i,j} = 0
    a_k (1 − z_k) ≤ y_k^{i,j} ≤ a_k (1 + z_k)
    y_{d(i)}^{i,1} ≥ a_{d(i)} + r_{d(i)} z_{d(i)}
    y_{d(i)}^{i,2} ≤ a_{d(i)} − r_{d(i)} z_{d(i)}
    z_k = 0, 1;  z_{d(i)} = 1

Minimize number of cells suppressed: c_k = 1
Minimize total value suppressed: c_k = a_k
Minimize Berg entropy: c_k = log(1 + a_k)

SLIDE 15

Suppression done "by hand" can be vulnerable.

3×3×3 contingency table, all 27 internal entries suppressed; released marginals are, for each 3×3 slice, line totals (11, 5, 5) with the 11 rotating across slices and slice totals of (21), plus a 2-way marginal table with 1's on the diagonal and 10's off the diagonal:

  1   10  10
  10  1   10
  10  10  1

The suppressed table has a unique solution, and that solution contains three 1's--DISCLOSURE

SLIDE 16

This table has three marginal totals = 1, so would not be released--this example is unrealistic. However, if we create a 3×3×15 table by stacking five copies of this table, we obtain

  • a unique table
  • all marginals ≥ 5
  • fifteen 1's--DISCLOSURE

Similarly,

  D11  D12  D13    9 | 20
    6  D22  D23    6 | 20
  D31    5    5  D34 | 15
  D41    5    6  D44 | 25
  -------------------+---
   18   21   18   23 | 80

may appear protected, but in fact D11 = 1 can be deduced.

This example alone illustrates why CCS should NOT

  • be done "by hand" or "by inspection" or
  • by software based on "by hand/inspection" reasoning
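The deduction D11 = 1 can be verified by exhausting all tables consistent with the published values; a brute-force sketch:

```python
from itertools import product

# Published 4x4 table: row sums 20, 20, 15, 25; column sums 18, 21, 18, 23.
# Known entries leave, per line: D11+D12+D13 = 11, D22+D23 = 8,
# D31+D34 = 5, D41+D44 = 14, D11+D31+D41 = 12, D12+D22 = 11,
# D13+D23 = 7, D34+D44 = 8.

feasible_D11 = set()
for D31, D41 in product(range(6), range(15)):   # bounded by their row sums
    D34, D44 = 5 - D31, 14 - D41                # rows 3 and 4
    if D34 + D44 != 8:                          # column-4 constraint
        continue
    D11 = 12 - D31 - D41                        # column-1 constraint
    if D11 < 0:
        continue
    # rows 1-2 and columns 2-3 can always be completed non-negatively here
    feasible_D11.add(D11)

# feasible_D11 == {1}: every consistent table has D11 = 1 -- disclosure
```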
SLIDE 17

A simple but realistic scenario: a larger table in which the only suppressed cells are

  X(10)  B(5)
  C(7)   A(8)

X = sensitive cell
A, B, C = X's nonsensitive complementary suppressions
Totals and all other values (here, blanks) are released.

The essence of this example is

    x    b  | sum
    c    a  | sum
  ----------+----
  sum  sum  | Sum

where the sums = original totals reduced by released values.

SLIDE 18

Original data are

  x=10   b=5 | 15
  c=7    a=8 | 15
  -----------+---
   17    13  | 30

Let r = 2. X is protected if and only if no interval derivable for x is finer than [x−r, x+r] = [10−2, 10+2] = [8, 12].

This condition holds if X is

  • in an alternating cycle of suppressed cells
  • the cycle permits a flow of r = 2 units from x = 10 in both the + and − directions

SLIDE 19

The alternating cycle is

  X(10) +/−   B(5) −/+ | 15
  C(7)  −/+   A(8) +/− | 15
  ---------------------+---
    17          13     | 30

In the + direction, can move up to 5 units into X

  • more than 5 units would drive B negative

  X(15)   B(0)  | 15
  C(2)    A(13) | 15
  --------------+---
   17      13   | 30

In the − direction, can move up to 8 units out of X

  • more than 8 units would drive A negative

  X(2)    B(13) | 15
  C(15)   A(0)  | 15
  --------------+---
   17      13   | 30

In particular, can move r = 2 units in either direction--X is protected.

CCS is mathematical but also data dependent--a similar table with different data could fail to protect:

  X(10) +/−   B(5) −/+ | 15
  C(7)  −/+   A(1) +/− |  8
  ---------------------+---
    17           6     | 23
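The flow argument can be reproduced by enumerating all non-negative integer tables with the published margins; a short sketch:

```python
# Margins from the example: row totals 15 and 15, column totals 17 and 13.
# Suppressed cells: x, b (row 1) and c, a (row 2).
feasible_x = []
for x in range(0, 16):          # x cannot exceed its row total of 15
    b = 15 - x                  # row 1
    c = 17 - x                  # column 1
    a = 15 - c                  # row 2
    if min(b, c, a) >= 0 and b + a == 13:   # column 2 and non-negativity
        feasible_x.append(x)

lo, hi = min(feasible_x), max(feasible_x)
# lo, hi == 2, 15: the exact interval for x is [2, 15], so nothing finer
# than the protection interval [8, 12] is derivable and X is protected
```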

SLIDE 20

Verification that X is protected is demonstrated by exact interval estimates (bounds):

  X [2, 15]   B [0, 13] | 15
  C [2, 15]   A [0, 13] | 15
  ----------------------+---
    17           13     | 30

One data quality enhancement that has been discussed:

  • data releaser provides users with exact interval bounds for all suppressed cells

This could

  • assist unsophisticated users
  • save effort for sophisticated users (who could compute the intervals using linear programming)
  • demonstrate sufficiency of the disclosure limitation

These intervals are safe if a mathematical model is used. Can/should exact intervals be released?

Also suggested: releasing q and/or p would assist the analyst

  • has this case been made?
  • is this safe?

Can/should q and/or p be released? We examine these issues.

SLIDE 21

CCS, cycles and protection

l(x) = l, u(x) = u: exact bounds for sensitive cell value x

  X(10) +/−   B(5) −/+ | 15
  C(7)  −/+   A(8) +/− | 15
  ---------------------+---
    17          13     | 30

Cells with +/− have the same parity as x; cells with −/+ have opposite parity to x.

In general (and without assuming q-ambiguity)

  • maximum increase to x = minimum value with opposite parity to x (here, b = 5)
  • maximum decrease to x = minimum value with same parity as x (here, a = 8)
  • exact interval for x = [x−a, x+b] (here, = [2, 15])
  • half-width of exact interval = (b+a)/2 (here, = 6.5)
  • interval midpoint = x + (b−a)/2 (here, = 8.5)
  • bias in midpoint estimate of x = (b−a)/2 (here, = −1.5)

  x +/−   b −/+ | 15
  c −/+   a +/− | 15
  --------------+---
   17      13   | 30

SLIDE 22

Releaser provides exact intervals [l, u] for suppressed cells. Or not--the intruder can compute these for him/herself. Then intruder knows

  • l(x) = x − a: a of same parity, l(a) = 0
  • u(x) = x + b: b of opposite parity, l(b) = 0
  • so, intruder knows u(x) − l(x) = (x + b) − (x − a) = b + a
  • if intruder can determine (or closely estimate) a or b or b−a or b/a, then a, b and x are revealed
  • protection on a cycle hinges on a single quantity
SLIDE 23

Vulnerability of CCS under p/q-rule and intervals

Cell X is sensitive w.r.t. the p/q-rule and is suppressed. Cells A, B, C are complementary suppressions. NSO releases best interval estimates of the suppressed cells:

  X [lX, uX] +/−   B [lB, uB] −/+
  C [lC, uC] −/+   A [lA, uA] +/−

X, A, B, C unknown, but all positive; q expressed as a decimal.

  uX − lX = uB − lB = uA − lA = uC − lC = 2q·min{a, b, c, x}

Assume lA, lC, lX > lB (analogous results for other cases). Then a, c, x > b. By virtue of the p/q-rule, cycles and the simplex algorithm

  • lB = (1 − q)b
  • uB = (1 + q)b
  • lX = x − qb
  • uX = x + qb

Thus, if q is known, then A, B and X, C are revealed.

SLIDE 24

q is in fact knowable. For q < 1

  uB/lB = (1 + q)/(1 − q), therefore

  q = (uB − lB)/(uB + lB)

is revealed, as are

  • b = lB/(1 − q)
  • a = lA + qb
  • c = lC + qb
  • x = lX + qb

Conclusion: p/q-rule + exact intervals = complete disclosure

Reference: Cox (2008b)
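The algebra above can be traced numerically; a sketch in exact rationals, with hypothetical true values x = 10, a = 8, c = 7, b = 5 and q = 0.1 (so every interval has width 2·q·b = 1):

```python
from fractions import Fraction as F

# Published interval endpoints (what the intruder sees):
l_B, u_B = F(9, 2), F(11, 2)                   # (1 - q)*b, (1 + q)*b
l_X, l_A, l_C = F(19, 2), F(15, 2), F(13, 2)   # each true value minus q*b

q = (u_B - l_B) / (u_B + l_B)    # recovers q = 1/10
b = l_B / (1 - q)                # recovers b = 5
x, a, c = (l + q * b for l in (l_X, l_A, l_C))   # recovers 10, 8, 7
```

Every suppressed value is recovered exactly from the published endpoints alone.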

SLIDE 25

CONTROLLED TABULAR ADJUSTMENT

Two CTA methods

  • quality-preserving CTA (QP-CTA): Cox, Kelly and Patil (2004)
  • minimum discrimination information CTA (MDI-CTA): Cox, Orelien and Shah (2006)

Basic CTA methodology

  • replace sensitive cell values with safe values = values outside the protection interval
  • adjust nonsensitive cell values to restore additivity
  • nonsensitive adjustments typically small
SLIDE 26

MILP for basic CTA

  min Σ_{i=1}^{n} (yᵢ⁺ + yᵢ⁻)

subject to:

  A(y⁺ − y⁻) = 0
  rᵢ·Iᵢ ≤ yᵢ⁺ ≤ mᵢ·Iᵢ,  rᵢ·(1 − Iᵢ) ≤ yᵢ⁻ ≤ mᵢ·(1 − Iᵢ),  Iᵢ binary, i = 1, …, s
  0 ≤ yᵢ⁺, yᵢ⁻ ≤ eᵢ,  i = s+1, …, n

s = number of sensitive cells; n = number of cells
rᵢ = lower/upper protection limit for sensitive cell i
mᵢ = upper bound on adjustment to sensitive cell i
eᵢ = bound on adjustment to nonsensitive cell i (often, eᵢ = measurement error)
yᵢ = yᵢ⁺ − yᵢ⁻ = (net) adjustment to cell value i
a + y = adjusted (masked) data

Quality-preserving CTA (QP-CTA)

  • define L(y) = Σ_{i=1}^{n} aᵢ·(yᵢ⁺ − yᵢ⁻)/Var(a)
  • adjoin L(y) = 0 to the constraint system
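On the small table from the CCS slides, basic CTA can be illustrated by brute force (this toy search stands in for the MILP; the cost function "total absolute adjustment" is one of several possible choices):

```python
# Original 2x2 table (rows sum to 15, columns to 17 and 13):
#   x=10  b=5
#   c=7   a=8
# x is sensitive with protection limit r = 2: its published value must
# differ from 10 by at least 2. Adjustments along the alternating cycle
# (+d, -d, -d, +d) preserve all row and column totals.

r = 2
best = None
for d in range(-8, 9):                      # candidate net adjustments to x
    x, b, c, a = 10 + d, 5 - d, 7 - d, 8 + d
    if min(x, b, c, a) < 0 or abs(d) < r:   # keep cells non-negative; move x >= r
        continue
    cost = 4 * abs(d)                       # total absolute adjustment
    if best is None or cost < best[0]:
        best = (cost, (x, b, c, a))

# best == (8, (8, 7, 9, 6)): x published as 8, all totals intact
```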

SLIDE 27

(Nearly) actual magnitude table with disclosures

  167    317    1284  587   4490  3981  2442  1150     70(21)  | 14488
  57(1)  1487   172   667   1006  327   1683  1138     46(7)   |  6583
  616    202    1899  1098  2172  3825  4372  300(40)  787     | 15271
         36(10)       16(4)             65             140(40) |   257
  -------------------------------------------------------------+------
  840    2042   3355  2368  7668  8133  8562  2588     1043    | 36599

4×9 table with (protection limits): 7 sensitive cells

  D      317    1284  D     4490  3981  2442  1150  D    | 14488
  D      1487   172   667   1006  327   1683  D     D    |  6583
  616    D      1899  1098  2172  3825  4372  D     787  | 15271
         D            D                 65          D    |   257
  -------------------------------------------------------+------
  840    2042   3355  2368  7668  8133  8562  2588  1043 | 36599

Table after optimal suppression: 11 cells (30%) & 2759 units (7.5%) suppressed

  167    317    1276  587   4490  3981  2442  1150  91   | 14501
  56     1487   172   667   1006  327   1683  1138  39   |  6575
  617    196    1899  1095  2172  3825  4372  260   797  | 15233
         26           12                65          180  |   283
  -------------------------------------------------------+------
  840    2026   3347  2361  7668  8133  8562  2548  1107 | 36592

Table after controlled tabular adjustment

SLIDE 28

Quality characteristics of basic CTA

  • preserves additivity
  • can exempt selected cells from adjustment
  • far fewer (s) binary variables than CCS (n−s)
  • heuristics enable solutions based on LP relaxation
  • capacities on cell adjustments control local quality
  • proper objective functions encourage global quality

Univariate (one original data set a)

  • preserves means
  • preserves variances (approx)
  • assures (nearly) perfect correlation between original and adjusted data
  • because additivity is preserved, means along tabular equations (rows, cols, etc.) are preserved
  • other means can be preserved by incorporating appropriate constraints

Multivariate (two or more related original data sets a, b)

  • preserves Cov(a, b) = Cov(a + y, b + z)
  • preserves covariances, regressions
SLIDE 29

Minimum discrimination information CTA (MDI-CTA)

Kullback-Leibler minimum discrimination information

  • measures distance btwn 2 statistical distributions defined on a probability space Ω
  • first, P, is known; the second, Q*, is closest to P in MDI within a class of distributions:

    Q* = argmin_Q { I(Q : P) = Σ_{ω∈Ω} Q(ω)·log(Q(ω)/P(ω)) }

  • P = original distribution (table)
  • class = tables satisfying specified marginal totals (minimal sufficient statistics = MSS)
  • iterative proportional fitting (IPF) computes unique minimal MDI solution
  • IPF permits fixing a subset of the cell values
      * sensitive cells set at selected safe values
      * structural zeroes

MDI-CTA

  • arbitrary choice of safe sensitive cell values
  • conditional on choice and MSS, IPF computes minimal MDI solution
  • heuristic updates choice to improve MDI
  • terminate when MDI btwn original & adjusted tables is statistically insignificant
SLIDE 30

Quality characteristics of MDI-CTA

  • preserves additivity
  • relies on standard statistical algorithms available as software
  • typically computationally efficient
  • objective/heuristics tied to statistical criteria
  • exempts structural zeroes from adjustment, but
      * nonstructural zeroes fixed at zero
      * no control on extent of local changes
  • preserves original distribution

Reference: Cox (2008a)

SLIDE 31

REFERENCES

Cox, LH, JP Kelly and R Patil. Balancing quality and confidentiality for multivariate tabular data. In: Privacy in Statistical Databases, Lecture Notes in Computer Science 3050 (J Domingo-Ferrer and V Torra, eds.). Berlin: Springer-Verlag, 2004, 87-98.

Cox, LH and JJ Kim. Effects of rounding on the quality and confidentiality of statistical data. In: Privacy in Statistical Databases 2006, Lecture Notes in Computer Science 4302 (J Domingo-Ferrer and L Franconi, eds.). Heidelberg: Springer-Verlag, 2006, 48-56.

Cox, LH, JG Orelien and BV Shah. A method for preserving statistical distributions subject to controlled tabular adjustment. In: Privacy in Statistical Databases 2006, Lecture Notes in Computer Science 4302 (J Domingo-Ferrer and L Franconi, eds.). Heidelberg: Springer-Verlag, 2006, 1-11.

Cox, LH. An examination of two methods of controlled tabular adjustment that preserve data quality. Monographs of Official Statistics: UNECE/Eurostat Work Session on Data Confidentiality, Manchester, December 17-19, 2007, 2008. http://epp.eurostat.ec.europa.eu/portal/page?_pageid=3154,70730193,3154_70730647&_dad=portal&_schema=PORTAL.

SLIDE 32

Cox, LH. A data quality and data confidentiality assessment of complementary cell suppression. In: Privacy in Statistical Databases 2008, Lecture Notes in Computer Science (J Domingo-Ferrer, ed.). Heidelberg: Springer-Verlag, 2008, in press.