SLIDE 1
TOWARDS A QUALITY ASSESSMENT OF DISCLOSURE-LIMITED STATISTICAL DATA Lawrence H. Cox, Ph.D. National Center for Health Statistics LCOX@CDC.GOV
SLIDE 2 QUALITY-CONFIDENTIALITY TRADEOFF
To reduce the risk of statistical disclosure to an acceptable level, statistical disclosure limitation (SDL) methods
- abbreviate
- eliminate
- modify
original data
Lowering disclosure risk typically forces reduction of data quality in terms of
- accuracy
- completeness
- usability
Over the past 4 decades, SDL methods have been
- studied/developed
- improved/refined/implemented
with considerable success. At the same time, efforts to assess/control/assure quality were virtually absent
SLIDE 3 This presentation
- examines quality effects of three
SDL methods for tabular data
- explores quality-preserving methods
The three methods
- rounding
- complementary cell suppression
- controlled tabular adjustment
SLIDE 4 HIGHLIGHTS
In view of time limitations, the take-home messages are:
Rounding
- rounding keeps the data release intact
- methods for quality-preserving rounding
preserving mean, variance preserving distribution
- available to NSOs
- rounding can limit disclosure effectively
Complementary cell suppression
- has very negative effects on data quality,
especially as the data release is not intact
- in the absence of a mathematical model,
in some cases suppression can be undone
- the security of suppression hinges on a
single quantity that often can be estimated
- p-percent rules can be vulnerable
- p/q-ambiguity rules are vulnerable
Controlled tabular adjustment
- keeps the data release intact
- can preserve key values and statistics
- can preserve original distribution
- effectively limits disclosure
SLIDE 5 ROUNDING
Rounding (base B): replace original data values x = qB + r, 0 ≤ r < B, by integer multiples R(x) = mB of an integer rounding base B
Adjacent rounding (typical): |x − R(x)| < B
Zero-restricted rounding (typical): R(mB) = mB
Controlled rounding preserves additivity
We are concerned with
- effects of base B rounding on statistical
properties of original data (data quality)
- mean
- variance/TMSE
- distribution
- effects on disclosure risk: P[x | R(x)]
SLIDE 6 Principal issues in evaluating an SDL method (1) Is the method effective for limiting disclosure? (2) Are its effects on data quality acceptable? Examined these questions for four rounding rules
- conventional rounding
- modified conventional rounding
- zero-restricted 50/50 rounding
- unbiased rounding
We report only on zero-restricted 50/50 rounding
We evaluate rounding rule/base (1) in terms of the posterior probability of an original data value given its rounded value, and (2) in terms of the expected increase in total mean squared error and the expected difference between pre- and post-rounding distributions, as measured by a conditional Chi-square statistic
SLIDE 7 We assume
- r- and q-distributions independent
- r ∼ Uniform{0, 1, …, B−1} (can be relaxed)
Focus on adjacent rounding
- R(x) = qB or (q + 1)B
- R(x) = qB + R(r) with R(r) = 0 or B
Zero-restricted 50/50 rounding
- r = 0: round down
- r ≠ 0: round down or up each with probability ½
Assumptions imply
- E[x] = BE[q] + E[r]
- P[r] = P[r|q] = 1/B
- V(x) = B² V(q) + V(r)
SLIDE 8 EFFECTS OF ROUNDING ON MEAN, VARIANCE
For zero-restricted 50/50 rounding
P[R(r) = 0] = (B+1)/2B and P[R(r) = B] = (B−1)/2B, thus
E[R(r)] = (B−1)/2 and V[R(r)] = (B²−1)/4
Expected value of x and R(x)
- Unrounded: B E[q] + (B−1)/2
- 50/50: B E[q] + (B−1)/2 (mean preserved)
Variance of x and R(x)
- Unrounded: B² V[q] + (B²−1)/12
- 50/50: B² V[q] + (B²−1)/4
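These moments can be verified by enumerating the rounding outcomes directly. A minimal Python sketch (the function name and the exact-arithmetic choice are illustrative, not from the slides):

```python
from fractions import Fraction

def rr_moments(B):
    """Exact mean and variance of R(r) under zero-restricted 50/50
    rounding, with r uniform on {0, 1, ..., B-1}."""
    outcomes = []
    for r in range(B):
        if r == 0:
            outcomes.append((0, Fraction(1, B)))      # r = 0: always round down
        else:
            outcomes.append((0, Fraction(1, 2 * B)))  # round down w.p. 1/2
            outcomes.append((B, Fraction(1, 2 * B)))  # round up w.p. 1/2
    mean = sum(v * p for v, p in outcomes)
    var = sum(v * v * p for v, p in outcomes) - mean ** 2
    return mean, var

for B in (2, 3, 5, 10):
    m, v = rr_moments(B)
    assert m == Fraction(B - 1, 2)      # E[R(r)] = (B-1)/2 = E[r]: mean preserved
    assert v == Fraction(B * B - 1, 4)  # V[R(r)] = (B^2-1)/4 > V(r) = (B^2-1)/12
```

The enumeration confirms the slide's point: the mean survives rounding exactly, while the residual variance is inflated by a factor of three.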
SLIDE 9 EFFECTS OF ROUNDING ON x-DISTRIBUTION
Use the conditional Chi-square statistic
χ² = Σ_x U_x, where U_x = [R(x) − x]²/x = [R(r_x) − r_x]²/x (x = 0: U_x = 0)
Degrees of freedom df determined by the tabular structure
E[U_x | x] = (r_x²/x) P[R(r_x) = 0] + ((B − r_x)²/x) P[R(r_x) = B]
- d = #{x} = the number of x-observations
- e = #{x < B}, viz., zeroes and confidential values
Can derive, using
E[(R(r) − r)²/r] = (1/2B)(B² Σ_{s=1}^{B−1} (1/s) − B(B−1)) and
E[(R(r) − r)²] = (B−1)(2B−1)/6,
the bound
E[U] ≤ e B(B−1)/2 + (d − e) ((B−1)(2B−1)/6B) E[1/q | q ≥ 1]
NSO can estimate E[1/q | q ≥ 1]
So, NSO can select B so that the expected conditional Chi-square value is not statistically significant
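The tradeoff behind the base selection, namely that a larger B inflates the conditional Chi-square statistic, can be explored by simulation. A sketch with synthetic cell values (the data, function name, and trial count are all illustrative):

```python
import random

def expected_chi_square(xs, B, trials=2000, seed=1):
    """Monte Carlo estimate of E[sum_x (R(x) - x)^2 / x] under
    zero-restricted 50/50 base-B rounding (x = 0 contributes 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = 0.0
        for x in xs:
            if x == 0:
                continue
            q, r = divmod(x, B)
            if r == 0:
                rounded = x                               # multiples of B stay put
            else:
                rounded = (q + (rng.random() < 0.5)) * B  # 50/50 up or down
            u += (rounded - x) ** 2 / x
        total += u
    return total / trials

xs = list(range(1, 41))   # synthetic cell values
# Coarser bases inflate the statistic; the NSO picks B small enough that
# the expected value stays below the significance cutoff for the df.
assert expected_chi_square(xs, 2) < expected_chi_square(xs, 5)
```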
SLIDE 10 EFFECTIVENESS FOR DISCLOSURE LIMITATION
Evaluate effectiveness of rounding for SDL in terms of posterior predictive probabilities P[x = r | R(x) = 0]
Under 50/50 rounding, given R(x) = 0:
- P[x = r | R(x) = 0] = 1/(B+1) for r ≠ 0
- P[x = 0 | R(x) = 0] = 2/(B+1)
Confidentiality analysis
- prior r-probabilities uniform on {0, 1, …, B−1}
- ideally, posterior probabilities uniform on same set
- or, if x = 0 is not a confidential value, then uniform over its B−1 nonzero values
- if x = r = 0 is not confidential, under 50/50 rounding posterior probabilities are uniform over the confidential values
Reference: Cox and Kim (2006)
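These posterior probabilities follow from Bayes' rule with the uniform prior on r. A minimal Python sketch (the function name is illustrative, not from the slides):

```python
from fractions import Fraction

def posterior_given_zero(B):
    """P[x = r | R(x) = 0] under 50/50 rounding, prior r uniform on {0,...,B-1}."""
    prior = Fraction(1, B)
    # Likelihoods P[R = 0 | r]: certain for r = 0, one-half otherwise
    lik = {r: Fraction(1) if r == 0 else Fraction(1, 2) for r in range(B)}
    evidence = sum(prior * lik[r] for r in range(B))    # = (B+1)/(2B)
    return {r: prior * lik[r] / evidence for r in range(B)}

post = posterior_given_zero(5)
assert post[0] == Fraction(2, 6)                              # 2/(B+1)
assert all(post[r] == Fraction(1, 6) for r in range(1, 5))    # 1/(B+1)
```

The nonzero values are equally likely a posteriori, which is the near-uniformity the confidentiality analysis asks for.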
SLIDE 11 COMPLEMENTARY CELL SUPPRESSION
p-PERCENT RULE For magnitude data, each respondent (contributor) to the value of cell X contributes an individual amount, e.g.,
- monthly sales for a clothing store
- weekly payroll for a factory
- number of patient visits for an emergency room
Cell value of X is x = Σ_i x_i = sum of all contributions x_i to X, ordered x_1 ≥ x_2 ≥ ...
The p-percent rule is designed to prevent narrow estimation of any contribution to a cell value by a second contributor or third party. It says: A tabulation cell X is a disclosure (sensitive) cell if, after subtracting the second largest contribution from the cell value, the remainder is within p-percent of the largest contribution Express p as a decimal (not a percent); e.g., 20% = 0.20 Sensitivity expressed via
S_p(X) = x_1 − (1/p) Σ_{i≥3} x_i > 0
NB: Protecting largest from second largest protects all
SLIDE 12 p/q-AMBIGUITY RULE In addition to p-percent protection, data releaser assumes intruder can estimate any contribution within q-percent Express q as decimal: q < 1 and, of course, q >> p Sensitivity expressed via
S_{p/q}(X) = x_1 − (q/p) Σ_{i≥3} x_i > 0
Thus, p/q-ambiguity rule is stricter than p-percent rule, viz., all p-percent sensitive cells are p/q-sensitive When q = 1: p/q-ambiguity rule = p-percent rule Disclosure limitation method must take into account the ability of the intruder to estimate within q-percent
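Both sensitivity measures are direct to compute from the ordered contributions. A sketch with an illustrative cell (the contributions and the p, q values are made up):

```python
def s_p(contributions, p):
    """p-percent sensitivity: S_p(X) = x1 - (1/p) * sum_{i>=3} x_i."""
    xs = sorted(contributions, reverse=True)
    return xs[0] - sum(xs[2:]) / p

def s_pq(contributions, p, q):
    """p/q-ambiguity sensitivity: S_p/q(X) = x1 - (q/p) * sum_{i>=3} x_i."""
    xs = sorted(contributions, reverse=True)
    return xs[0] - (q / p) * sum(xs[2:])

cell = [100.0, 60.0, 30.0, 10.0]     # illustrative contributions, largest first
assert s_p(cell, 0.20) < 0           # not sensitive under the p-percent rule
assert s_pq(cell, 0.20, 0.4) > 0     # sensitive under the stricter p/q rule
```

Since q < 1 implies q/p < 1/p, every cell flagged by the p-percent rule is also flagged by the p/q rule, matching the "stricter" claim above.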
SLIDE 13 CCS
- suppress from publication all sensitive cells
- the disclosure rule enables releaser to compute for
each sensitive cell the minimum uncertainty in estimation required to protect the cell
- that quantity is dependent on the distribution of
contributions within the cell and differs from cell to cell and cell value to cell value
- it is called X’s protection limit r(X) = r
- select other, nonsensitive cells whose suppression
will render the tabulations safe according to the disclosure rule--the complementary suppressions
- safe means that no interval for x finer than
[x-r, x+r] is derivable from released tabulations
- select the complementary suppressions optimally with respect to some information loss criterion, e.g., total value suppressed, total number of suppressions, Berg entropy
- very complex mathematically/computationally
- for the p/q-rule, the mathematical suppression must
take into account the ability of the intruder to estimate values to within q-percent
SLIDE 14 Mathematical models for CCS
Tabular structure is represented as Ay = 0
Entries of A = −1, 0, +1
Original data: a = (a_1, ..., a_n); Aa = 0
Sensitive cell values: a_d(i), i = 1, ..., s
Protection: r_d(i), 0 < r_d(i) < a_d(i), and r_k = 0 otherwise
CCS Models
min Σ_k c_k z_k subject to, for i = 1, ..., s and j = 1, 2:
  A y^{i,j} = 0
  (1 − z_k) a_k ≤ y_k^{i,j} ≤ (1 + z_k) a_k, k = 1, ..., n
  y_{d(i)}^{i,1} ≥ a_{d(i)} + r_{d(i)}; y_{d(i)}^{i,2} ≤ a_{d(i)} − r_{d(i)}
  z_k ∈ {0, 1}; z_{d(i)} = 1
Minimize number of cells suppressed: c_k = 1
Minimize total value suppressed: c_k = a_k
Minimize Berg entropy: c_k = log(1 + a_k)
SLIDE 15
Suppression done “by hand” can be vulnerable 3x3x3 contingency table, all internal entries suppressed
All 27 internal entries are suppressed; the released 2-way marginals are

  11   5   5  (21)         1  10  10
   5  11   5  (21)        10   1  10
   5   5  11  (21)        10  10   1

These marginals admit a unique solution for the internal entries, and that unique solution contains three 1's--DISCLOSURE
SLIDE 16 This table has three marginal totals = 1, so would not be released--this example is unrealistic. However, if we create a 3x3x15 table by stacking five copies of this table, we obtain
- a unique table
- all marginals > 5
- fifteen 1’s--DISCLOSURE
Similarly,

    18   21   18   23    80
   D11  D12  D13    9    20
     6  D22  D23    6    20
   D31    5    5  D34    15
   D41    5    6  D44    25

may appear protected, but in fact D11 = 1 can be deduced
This example alone illustrates why CCS should NOT
- be done “by hand” or “by inspection” or
- by software based on “by hand/inspection” reasoning
SLIDE 17
A simple but realistic scenario (TOTAL row and TOTAL column released):

  X(10)  B(5)
  C(7)   A(8)

X = sensitive cell
A, B, C = X's nonsensitive complementary suppressions
Totals and all other values (here, blanks) are released
The essence of this example is

  sum   sum   Sum
   x     b    sum
   c     a    sum

sum's = original totals reduced by released values
SLIDE 18 Original data are

  17    13    30
  x=10  b=5   15
  c=7   a=8   15

Let r = 2
X is protected if and only if no interval derivable for x is finer than [x−r, x+r] = [10−2, 10+2] = [8, 12]
This condition holds if X is
- in an alternating cycle of suppressed cells
- the cycle permits a flow of r = 2 units from x = 10 in both the + and − directions
SLIDE 19 The alternating cycle is

  17         13        30
  X (10)+/-  B (5)-/+  15
  C (7)-/+   A (8)+/-  15

In the + direction, can move up to 5 units into X
- more than 5 units would drive B negative

  17      13      30
  X (15)  B (0)   15
  C (2)   A (13)  15

In the − direction, can move up to 8 units out of X
- more than 8 units would drive A negative

  17      13      30
  X (2)   B (13)  15
  C (15)  A (0)   15

In particular, can move r = 2 units in either direction--X is protected
CCS is mathematical but also data dependent--a similar table with different data could fail to protect

  17         6         23
  X (10)+/-  B (5)-/+  15
  C (7)-/+   A (1)+/-   8
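The flow argument yields the exact interval [2, 15] for x, which can be confirmed by sweeping x over the cycle. A brute-force sketch for the same table (names illustrative):

```python
def exact_bounds_for_x():
    """Feasible range of x in the 2x2 pattern with column totals 17, 13
    and row totals 15, 15: b = 15-x, c = 17-x, a = x-2."""
    feasible = []
    for x in range(0, 31):
        b = 15 - x          # row 1: x + b = 15
        c = 17 - x          # col 1: x + c = 17
        a = 15 - c          # row 2: c + a = 15, i.e. a = x - 2
        if min(a, b, c) >= 0 and b + a == 13:   # col 2: b + a = 13
            feasible.append(x)
    return min(feasible), max(feasible)

assert exact_bounds_for_x() == (2, 15)   # [8, 12] lies inside: X is protected
```

The maximum decrease (8 units, limited by a) and maximum increase (5 units, limited by b) fall out of the same nonnegativity checks.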
SLIDE 20 Verification that X is protected is demonstrated by exact interval estimates (bounds)

  17         13         30
  X [2, 15]  B [0, 13]  15
  C [2, 15]  A [0, 13]  15

One data quality enhancement that has been discussed
- data releaser provides users with exact interval bounds for all suppressed cells
This could
- assist unsophisticated users
- save effort for sophisticated users (who could
compute the intervals using linear programming)
- demonstrate sufficiency of the disclosure limitation
These intervals are safe if the mathematical model is used Can/should exact intervals be released? Also suggested: releasing q and/or p would assist the analyst
- has this case been made?
- is this safe?
Can/should q and/or p be released? We examine these issues
SLIDE 21 CCS, cycles and protection
l(x) = l, u(x) = u: exact bounds for sensitive cell value x

  17         13        30
  X (10)+/-  B (5)-/+  15
  C (7)-/+   A (8)+/-  15

Cells with +/- have the same parity as x
Cells with -/+ have opposite parity to x
In general (and without assuming q-ambiguity)
- maximum increase to x = minimum value with opposite parity to x (here, b = 5)
- maximum decrease to x = minimum value with same parity as x (here, a = 8)
- exact interval for x = [x−a, x+b] (here, = [2, 15])
- half-width of exact interval = (b+a)/2 (here, = 6.5)
- interval midpoint = x + (b−a)/2 (here, = 8.5)
- bias in midpoint estimate of x = (b−a)/2 (here, = −1.5)

  17    13    30
  x+/-  b-/+  15
  c-/+  a+/-  15
SLIDE 22 Releaser provides exact intervals [l, u] for suppressed cells Or, not—as intruder can compute these for him/herself Then intruder knows
- l(x) = x − a: a of same parity, l(a) = 0
- u(x) = x + b: b of opp. parity, l(b) = 0
- so, intruder knows u(x) − l(x) = (x + b) − (x − a) = b + a
- if intruder can determine (or closely estimate)
a or b or b-a or b/a, then a, b and x are revealed
- protection on a cycle hinges on a single quantity
SLIDE 23 Vulnerability of CCS under p/q-rule and intervals
Cell X is sensitive w.r.t. p/q-rule and is suppressed
Cells A, B, C are complementary suppressions
NSO releases best interval estimates of suppressed cells

  X [lX, uX] +/-  B [lB, uB] -/+
  C [lC, uC] -/+  A [lA, uA] +/-

X, A, B, C unknown, but all positive
q expressed as decimal
uX − lX = uB − lB = uA − lA = uC − lC = 2q min{a, b, c, x}
Assume lA, lC, lX > lB (analogous results for other cases)
Then a, c, x > b
By virtue of p/q-rule, cycles and simplex algorithm
- lB = (1 − q)b
- uB = (1 + q)b
- lX = x − qb
- uX = x + qb
Thus, if q is known, then A, B and X, C are revealed
SLIDE 24 q is in fact knowable
For q < 1
u_B/l_B = (1 + q)/(1 − q), therefore
q = (u_B − l_B)/(u_B + l_B) is revealed, as are
- b = l_B/(1 − q)
- a = l_A + qb
- c = l_C + qb
- x = l_X + qb
Conclusion: p/q-rule + exact intervals = complete disclosure
Reference: Cox (2008b)
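The recovery algebra can be exercised end to end. A sketch with made-up values (b = 10, q = 0.25, and the function name are illustrative):

```python
def recover_from_intervals(lB, uB, lA, lC, lX):
    """Undo p/q-protected suppression from published exact intervals,
    assuming B is the smallest suppressed cell: uB/lB = (1+q)/(1-q)."""
    q = (uB - lB) / (uB + lB)
    b = lB / (1 - q)
    return q, b, lA + q * b, lC + q * b, lX + q * b

# Forward direction: b = 10, q = 0.25 gives lB = 7.5, uB = 12.5, and the
# other lower bounds sit qb = 2.5 below a = 18, c = 9, x = 14.
q, b, a, c, x = recover_from_intervals(7.5, 12.5, lA=15.5, lC=6.5, lX=11.5)
assert (q, b) == (0.25, 10.0)
assert (a, c, x) == (18.0, 9.0, 14.0)
```

Publishing the exact intervals hands the intruder q, and with it every suppressed value, which is the slide's "complete disclosure" conclusion.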
SLIDE 25 CONTROLLED TABULAR ADJUSTMENT
Two CTA methods
- quality-preserving CTA (QP-CTA)
Cox, Kelly and Patil (2004)
- minimum discrimination information CTA
(MDI-CTA) Cox, Orelien and Shah (2006) Basic CTA Methodology
- replace sensitive cell values with safe values
= values outside the protection interval
- adjust nonsensitive cell values to restore additivity
- nonsensitive adjustments typically small
SLIDE 26 MILP for basic CTA
min Σ_{i=1}^n (y_i+ + y_i−)
subject to:
A(y+ − y−) = 0
r_i I_i ≤ y_i+ ≤ m_i I_i, r_i(1 − I_i) ≤ y_i− ≤ m_i(1 − I_i), I_i binary, i = 1, ..., s
0 ≤ y_i+, y_i− ≤ e_i, i = s+1, ..., n
s = number of sensitive cells; n = number of cells
r_i = lower/upper protection limit for sensitive cell i
m_i = upper bound on adjustment to sensitive cell i
e_i = bound on adjustment to nonsensitive cell i (often, e_i = measurement error)
y_i = y_i+ − y_i− = (net) adjustment to cell value
a + y = adjusted (masked) data
Quality-preserving CTA (QP-CTA)
L(y) = Σ_{i=1}^n a_i (y_i+ − y_i−)/Var(a_i)
Add L(y) = 0 to the constraint system
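A toy illustration of the basic CTA step (not the MILP itself): move the sensitive cell of the earlier 2x2 example to a safe value and repair additivity around its alternating cycle. The function name and the chosen adjustment are illustrative:

```python
def cta_adjust(table, i, j, delta):
    """Add delta to cell (i, j) of a 2x2 table and compensate around the
    alternating cycle so all row and column totals are unchanged."""
    t = [row[:] for row in table]
    t[i][j] += delta          # sensitive cell moved to a safe value
    t[i][1 - j] -= delta      # same row, opposite sign
    t[1 - i][j] -= delta      # same column, opposite sign
    t[1 - i][1 - j] += delta  # closes the cycle
    return t

original = [[10, 5], [7, 8]]                # x = 10 sensitive, protection limit r = 2
adjusted = cta_adjust(original, 0, 0, +3)   # 13 lies outside [8, 12]: safe
assert adjusted == [[13, 2], [4, 11]]
assert [sum(r) for r in adjusted] == [15, 15]          # row totals preserved
assert [sum(c) for c in zip(*adjusted)] == [17, 13]    # column totals preserved
```

The real MILP chooses these deltas, and the cells allowed to absorb them, to minimize total adjustment subject to the capacity bounds above.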
SLIDE 27
(Nearly) actual magnitude table with disclosures

    167     317    1284     587    4490    3981    2442    1150   70(21)   14488
  57(1)    1487     172     667    1006     327    1683    1138    46(7)    6583
    616     202    1899    1098    2172    3825    4372  300(40)     787   15271
      0   36(10)      0    16(4)      0       0      65       0  140(40)     257
    840    2042    3355    2368    7668    8133    8562    2588    1043    36599

4x9 Table With (Protection Limits): 7 Sensitive Cells

      D     317    1284       D    4490    3981    2442    1150       D    14488
      D    1487     172     667    1006     327    1683       D       D     6583
    616       D    1899    1098    2172    3825    4372       D     787    15271
      0       D       0       D       0       0      65       0       D      257
    840    2042    3355    2368    7668    8133    8562    2588    1043    36599

Table After Optimal Suppression: 11 Cells (30%) & 2759 Units (7.5%) Suppressed

    167     317    1276     587    4490    3981    2442    1150      91    14501
     56    1487     172     667    1006     327    1683    1138      39     6575
    617     196    1899    1095    2172    3825    4372     260     797    15233
      0      26       0      12       0       0      65       0     180      283
    840    2026    3347    2361    7668    8133    8562    2548    1107    36592

Table After Controlled Tabular Adjustment
SLIDE 28 Quality characteristics of basic CTA
- preserves additivity
- can exempt selected cells from adjustment
- far fewer (s) binary variables than CCS (n-s)
- heuristics enable solutions based on LP relaxation
- capacities on cell adjustments control local quality
- proper objective functions encourage global quality
Univariate (one original data set a)
- preserves means
- preserves variances (approx)
- assures (nearly) perfect correlation between original and adjusted data
- because additivity is preserved, means along tabular equations (rows, cols, etc.) are preserved
- other means can be preserved by incorporating
appropriate constraints Multivariate (two or more related original data sets a, b)
- Cov(a, b) = Cov(a + y, b + z)
- preserves covariances, regressions
SLIDE 29 Minimum discrimination information CTA (MDI-CTA) Kullback-Leibler minimum discrimination information
- measures distance btwn 2 statistical distributions defined on a probability space Ω
- first, P, is known and second, Q*, is closest to P in MDI within a class of distributions
Q* = argmin_Q { I(Q : P) = Σ_{ω∈Ω} Q(ω) log(Q(ω)/P(ω)) }
- P = original distribution (table)
- class = tables satisfying specified marginal totals (minimal sufficient statistics = MSS)
- iterative proportional fitting (IPF) computes unique minimal MDI solution
- IPF permits fixing a subset of the cell values
* sensitive cells set at selected safe values * structural zeroes MDI-CTA
- arbitrary choice of safe sensitive cell values
- conditional on choice and MSS, IPF computes
minimal MDI solution
- heuristic updates choice to improve MDI
- terminate when MDI btwn original & adjusted
tables is statistically insignificant
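Two-way IPF itself is only a few lines. A sketch (cell fixing, which MDI-CTA needs for the safe sensitive values and structural zeroes, is omitted; names are illustrative):

```python
def ipf(table, row_targets, col_targets, iters=50):
    """Iterative proportional fitting: alternately rescale rows and
    columns until the table matches the target marginals."""
    t = [row[:] for row in table]
    for _ in range(iters):
        for i, row in enumerate(t):                  # fit row totals
            s = sum(row)
            if s:
                t[i] = [v * row_targets[i] / s for v in row]
        for j in range(len(t[0])):                   # fit column totals
            s = sum(t[i][j] for i in range(len(t)))
            if s:
                for i in range(len(t)):
                    t[i][j] *= col_targets[j] / s
    return t

fitted = ipf([[1.0, 1.0], [1.0, 1.0]], row_targets=[3, 1], col_targets=[2, 2])
assert all(abs(sum(row) - rt) < 1e-9 for row, rt in zip(fitted, [3, 1]))
assert all(abs(sum(col) - ct) < 1e-9 for col, ct in zip(zip(*fitted), [2, 2]))
```

Cells that start at zero stay at zero under the multiplicative updates, which is how IPF exempts structural zeroes, and also why nonstructural zeroes remain fixed (the limitation noted on the next slide).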
SLIDE 30 Quality characteristics of MDI-CTA
- preserves additivity
- relies on standard statistical algorithms
available as software
- typically computationally efficient
- objective/heuristics tied to statistical criteria
- exempts structural zeroes from adjustment, but
* nonstructural zeroes fixed at zero * no control on extent of local changes
- preserves original distribution
Reference: Cox (2008 a)
SLIDE 31
REFERENCES
Cox, LH, JP Kelly and R Patil. Balancing quality and confidentiality for multivariate tabular data. In: Privacy in Statistical Databases, Lecture Notes in Computer Science 3050 (J Domingo-Ferrer and V Torra, eds.). Berlin: Springer-Verlag, 2004, 87-98.

Cox, LH and JJ Kim. Effects of rounding on the quality and confidentiality of statistical data. In: Privacy in Statistical Databases 2006, Lecture Notes in Computer Science 4302 (J Domingo-Ferrer and L Franconi, eds.). Heidelberg: Springer-Verlag, 2006, 48-56.

Cox, LH, JG Orelien and BV Shah. A method for preserving statistical distributions subject to controlled tabular adjustment. In: Privacy in Statistical Databases 2006, Lecture Notes in Computer Science 4302 (J Domingo-Ferrer and L Franconi, eds.). Heidelberg: Springer-Verlag, 2006, 1-11.

Cox, LH. An examination of two methods of controlled tabular adjustment that preserve data quality. Monographs of Official Statistics: UNECE/Eurostat Work Session on Data Confidentiality, Manchester, December 17-19, 2007, 2008. http://epp.eurostat.ec.europa.eu/portal/page?_pageid=3154,70730193,3154_70730647&_dad=portal&_schema=PORTAL.
SLIDE 32
Cox, LH. A data quality and data confidentiality assessment of complementary cell suppression. In: Privacy in Statistical Databases 2008, Lecture Notes in Computer Science (J Domingo-Ferrer, ed.). Heidelberg: Springer-Verlag, 2008, in press.