[PPT] - Confidentiality Protection in the Census Bureau’s Quarterly Workforce Indicators PowerPoint Presentation

SLIDE 2

August 10, 2005 - p. 1/31

Confidentiality Protection in the Census Bureau’s Quarterly Workforce Indicators

John M. Abowd1,2, Bryce E. Stephens2,3, and Lars Vilhuber1

1 Cornell University, 2 U.S. Census Bureau, 3 University of Maryland

SLIDE 3

Introduction

■ We will describe the confidentiality protection mechanism as applied to a new statistical product: the Quarterly Workforce Indicators (QWI).

■ The underlying data infrastructure was designed by the Longitudinal Employer-Household Dynamics Program at the Census Bureau and is described in detail in Abowd et al. (2005).

■ From a longitudinally integrated frame of state unemployment insurance wage records, we create measures of employment, wages, hiring, separation, job creation, and job destruction.

SLIDE 4

Core problem

■ Disclosure proofing is required to protect the information about individuals and businesses that contribute to
✦ (confidential) unemployment insurance (UI) wage records
✦ (confidential) Quarterly Census of Employment and Wages (QCEW, also known as ES-202) reports
✦ as well as information from Census Bureau demographic data that have been integrated with these sources.

■ Primary concern of the confidentiality protection mechanism is thus with small cells, i.e., cells that reflect data on few individuals or few firms.

SLIDE 5

Protection provided

■ In general, data are considered protected when aggregate cell values do not closely approximate data for any one respondent in the cell (Cox and Zayatz, 1993, p. 5).

■ In the QWI confidentiality protection scheme, confidential micro-data are considered protected by noise infusion if

1. any inference regarding the magnitude of a particular respondent’s data differs from the confidential quantity by at least c% even if that inference is made by a coalition of respondents with exact knowledge of their own answers,

or

2. any inference regarding the magnitude of an item is incorrect with probability no less than y%,

where c and y are confidential but generally large.

SLIDE 6

Quality of the disclosed data

■ The confidentiality-protected data must be inference-valid for a well-defined set of analyses

■ We show that
✦ the theoretical properties of the disclosure-proofing mechanism are designed to maintain analytical validity for trend analysis;
✦ in practice, the disclosure-proofed data are not biased;
✦ in practice, the time-series properties of the disclosure-proofed data remain intact.

SLIDE 7

Three-layer confidentiality protection in QWI (I)

■ Layer 1: Multiplicative noise infusion at the establishment level, with three very important properties

1. every establishment-level data item is distorted by some minimum amount

2. distortion amount and direction are time-invariant: data are always distorted in the same direction (increased or decreased) by the same percentage amount in every period

3. when estimates are aggregated, the effects of the distortion cancel out for the vast majority of the estimates

SLIDE 8

Three-layer confidentiality protection in QWI (II)

■ Layer 2: Weighting of estimates at higher levels (e.g., sub-state geography and industry detail)
✦ construct weights such that state-level beginning of quarter employment for all private employers matches the first month in quarter employment in QCEW.
✦ the establishment-level weight is used for every indicator in the QWIs

SLIDE 9

Three-layer confidentiality protection in QWI (III)

■ Layer 3: Small-cell editing (suppression or synthesizing)
✦ Some aggregate estimates are based on fewer than three persons or establishments.
✦ Currently, these estimates are suppressed and a flag is set to indicate suppression.
✦ In the next version of the disclosure-proofing system, these estimates will be replaced with synthetic values.

Note: Editing is only used when the combination of noise infusion and weighting may not distort the publication data with a high enough probability to meet the criteria laid out above.
✦ Count data such as employment are subject to editing.
✦ Continuous dollar measures like payroll are not.
✦ Regardless of small-cell editing, all published estimates are still substantially influenced by the noise that was infused in the first layer of the protection system.

SLIDE 10

Implementation of multiplicative noise model

■ A random fuzz factor $\delta_j$ is drawn for each establishment j:

$$
p(\delta_j) =
\begin{cases}
\dfrac{b - \delta}{(b - a)^2}, & \delta \in [a, b] \\[1ex]
\dfrac{b + \delta - 2}{(b - a)^2}, & \delta \in [2 - b, 2 - a] \\[1ex]
0, & \text{otherwise}
\end{cases}
$$

$$
F(\delta_j) =
\begin{cases}
0, & \delta < 2 - b \\[1ex]
\dfrac{(\delta + b - 2)^2}{2 (b - a)^2}, & \delta \in [2 - b, 2 - a] \\[1ex]
0.5, & \delta \in (2 - a, a) \\[1ex]
0.5 + \dfrac{(b - a)^2 - (b - \delta)^2}{2 (b - a)^2}, & \delta \in [a, b] \\[1ex]
1, & \delta > b
\end{cases}
$$

where a = 1 + c/100 and b = 1 + d/100 are constants chosen such that the true value is distorted by a minimum of c percent and a maximum of d percent.
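A minimal sketch of how a fuzz factor with this distribution could be drawn, assuming inverse-CDF sampling (the slides do not say how the draws are generated). The function names and the values c = 10 and d = 25 are hypothetical; the actual parameters are confidential.

```python
import math
import random


def fuzz_bounds(c, d):
    """Return (a, b) for a minimum distortion of c percent and a maximum of d percent."""
    return 1 + c / 100, 1 + d / 100


def fuzz_cdf(delta, a, b):
    """CDF F(delta) of the two-sided ramp distribution on [2-b, 2-a] and [a, b]."""
    width_sq = (b - a) ** 2
    if delta < 2 - b:
        return 0.0
    if delta <= 2 - a:
        return (delta + b - 2) ** 2 / (2 * width_sq)
    if delta < a:
        return 0.5
    if delta <= b:
        return 0.5 + (width_sq - (b - delta) ** 2) / (2 * width_sq)
    return 1.0


def draw_fuzz_factor(a, b, rng):
    """Draw one fuzz factor by inverting the CDF (an assumed sampling method)."""
    u = rng.random()
    if u < 0.5:
        # Lower branch: solve (delta + b - 2)^2 / (2 (b - a)^2) = u for delta.
        return 2 - b + (b - a) * math.sqrt(2 * u)
    # Upper branch: solve 0.5 + ((b - a)^2 - (b - delta)^2) / (2 (b - a)^2) = u.
    return b - (b - a) * math.sqrt(2 * (1 - u))


# Illustrative parameters only; the real c and d are confidential.
a, b = fuzz_bounds(c=10, d=25)
rng = random.Random(0)
draws = [draw_fuzz_factor(a, b, rng) for _ in range(10_000)]
```

Every draw lies at least c percent and at most d percent away from 1, and because the density is symmetric around 1 the average draw is close to 1, which is why the distortion tends to cancel when many establishments are aggregated.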

SLIDE 11

Distribution of Fuzz Factors

[Figure: density of the fuzz factor; horizontal axis marked at 2−b, 2−a, a, b]

SLIDE 12

Distorting magnitudes and counts

The exact implementation depends on the type of estimate:

■ Magnitudes and counts

$$
X^*_{jt} = \delta_j X_{jt},
$$

where $X_{jt}$ is an establishment-level statistic among B, E, M, F, A, S, H, R, FA, FS, FH, Wk, WFH, NA, NH, NR, NS
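As an illustration of why a time-invariant factor is compatible with trend analysis, the sketch below applies one fixed factor to a hypothetical establishment series and compares period-to-period growth ratios before and after distortion. The series, the factor 0.88, and the function name are made up for the example.

```python
def distort_series(values, delta):
    """Apply one establishment's time-invariant fuzz factor to every period."""
    return [delta * x for x in values]


# Hypothetical establishment: beginning-of-quarter employment over four quarters.
employment = [40, 44, 42, 50]
delta_j = 0.88  # assumed draw; it stays fixed for this establishment in all periods

distorted = distort_series(employment, delta_j)

# Quarter-to-quarter growth ratios are unchanged because delta_j cancels.
true_growth = [employment[t] / employment[t - 1] for t in range(1, len(employment))]
distorted_growth = [distorted[t] / distorted[t - 1] for t in range(1, len(distorted))]
```

At the level of a single establishment the levels shift by 12 percent while the growth ratios agree up to floating-point error.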

SLIDE 13

Distorting ratios

Ratios are distorted by distorting numerators (magnitudes), but using undistorted denominators:

$$
ZY^*_{jt} = \frac{Y^*_{jt}}{B(Y)_{jt}} = \delta_j \frac{Y_{jt}}{B(Y)_{jt}},
$$

This method is used for

■ average earnings (ZWk) and
■ average periods of nonemployment (ZN_) for various groups
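The ratio rule can be sketched as follows. The function name and the numbers are hypothetical, and returning None for a zero denominator is an assumption about how an uncomputable ratio might be represented.

```python
def distorted_ratio(numerator, denominator, delta):
    """Distorted ratio: distorted numerator (delta * Y) over the undistorted denominator B(Y).

    Returns None when the denominator is zero, since the ratio cannot be computed.
    """
    if denominator == 0:
        return None
    return delta * numerator / denominator


# Hypothetical establishment: 300,000 in quarterly earnings paid to 25 workers.
z_true = 300_000 / 25                        # undistorted average earnings
z_star = distorted_ratio(300_000, 25, 1.12)  # distorted average earnings
```

Because only the numerator is distorted, the published ratio differs from the true ratio by exactly the establishment's fuzz factor.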

SLIDE 14

Distorting flows

■ Distorted net job flow (JF) is computed at the aggregate level (k = geography, industry, or combination of the two for the appropriate age and sex categories) as the product of the aggregated, undistorted rate of growth and the aggregated distorted employment:

$$
JF^*_{kt} = G_{kt} \times \bar{E}^*_{kt} = JF_{kt} \times \frac{\bar{E}^*_{kt}}{\bar{E}_{kt}}.
$$

■ The formulas for distorting gross job creation (JC) and job destruction (JD) are similar.

■ The same logic is used to distort wage changes for subgroups
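A small numerical sketch of the net job flow rule. It assumes that the aggregated distorted employment for a cell is the sum of each establishment's employment multiplied by its own fuzz factor, and it ignores weighting; the function name and all numbers are hypothetical.

```python
def distorted_net_job_flow(job_flows, avg_employment, deltas):
    """Distorted net job flow for one cell k in one period t.

    job_flows, avg_employment, deltas: per-establishment lists for the cell.
    Computes G_kt (undistorted growth rate) times the distorted aggregate employment.
    """
    jf_k = sum(job_flows)                                         # JF_kt
    e_k = sum(avg_employment)                                     # aggregated undistorted employment
    e_k_star = sum(d * e for d, e in zip(deltas, avg_employment))  # aggregated distorted employment
    if e_k == 0:
        return None  # growth rate undefined for an empty cell
    g_k = jf_k / e_k                                              # undistorted rate of growth
    return g_k * e_k_star


# Hypothetical cell with three establishments.
jf_star = distorted_net_job_flow(
    job_flows=[5, -2, 3],
    avg_employment=[50, 20, 30],
    deltas=[1.1, 0.9, 1.2],
)
```

Here the true net flow is 6 on an employment base of 100, the distorted base is 109, and the distorted flow is 0.06 × 109 = 6.54, so the published flow keeps the true growth rate relative to the published employment level.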

SLIDE 15

Item suppression

■ Some disclosure risk remains for counts based on very few entities in a cell (fewer than three individuals or employers)

■ Variables affected are: B, E, M, F, A, S, H, R, FA, FH, FS, JC, JD, JF, FJC, FJD, FJF.
➜ item suppression based on the number of either workers or the number of employers that contribute data for that item in a cell k in time period t, where a cell represents a particular combination of geography × industry × age × sex.

■ Because of noise infusion, no complementary suppressions are needed

■ Some denominators may be zeroes: the ratio or rate cannot be computed.
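The suppression rule can be sketched as a simple threshold check. The threshold of three comes from the slide; the function name, the flag strings, and the choice to release cells with no contributors at all (suggested by the zero-count row of Table 3, but not stated in the slides) are assumptions.

```python
MIN_CONTRIBUTORS = 3  # "fewer than three individuals or employers"


def publish_item(value, n_contributors, min_contributors=MIN_CONTRIBUTORS):
    """Return (published_value, flag) for one item in one cell and period.

    n_contributors is the number of workers or employers (whichever applies
    to the item) contributing data. Cells with zero contributors are released
    here, which is an assumption rather than a documented rule.
    """
    if 0 < n_contributors < min_contributors:
        return None, "suppressed"
    return value, "released"


# Hypothetical distorted, weighted counts for three cells.
cells = [
    {"cell": "k1", "B": 2.2, "n": 2},
    {"cell": "k2", "B": 3.4, "n": 3},
    {"cell": "k3", "B": 0.0, "n": 0},
]
published = [publish_item(c["B"], c["n"]) for c in cells]
```

Only the first cell is withheld; no other cell needs to be suppressed to protect it, because the released values already carry the establishment-level noise.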

SLIDE 16

Economic concepts in QCEW and QWI

Difference in the economic concepts underlying the Quarterly Census of Employment and Wages (QCEW) and the QWI statistics:

■ QCEW: employment on the 12th day of the first month in the quarter ($QCEW_{1,jt}$)

■ QWI: several measures of employment, derived from reports of quarterly employment and wages of individual workers at particular employers (state UI accounts).

■ Key definition: beginning of quarter employment $B_{jt}$ = employees at establishment j in both quarter t and t − 1, and by inference, on the 1st day of quarter t.

SLIDE 17

Protection by Weighting

■ $QCEW_{1,jt}$ and $B_{jt}$ are not identical because
✦ they do not refer to exactly the same point in time,
✦ the in-scope establishments differ slightly, and
✦ they are computed from different universe data.

■ Actual differences are captured by the QWI weighting scheme: the time series of adjustment weights $w_t$ is defined by

$$
w_t \sum_j b_{jt} = \sum_j QCEW_{1,jt}
$$

■ All variables are weighted by $w_t$
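A minimal sketch of the weight for a single quarter, obtained by solving the defining equation for the weight. It assumes that the left-hand sum runs over establishment-level beginning-of-quarter employment in the state; the function name and the numbers are hypothetical.

```python
def adjustment_weight(qcew_month1, beginning_employment):
    """Weight w_t such that w_t * sum_j b_jt equals sum_j QCEW_1,jt."""
    total_b = sum(beginning_employment)
    if total_b == 0:
        raise ValueError("no beginning-of-quarter employment to weight")
    return sum(qcew_month1) / total_b


# Hypothetical state with three establishments in one quarter.
qcew = [105, 48, 52]   # first-month QCEW employment
b = [100, 50, 50]      # beginning-of-quarter employment from UI wage records

w_t = adjustment_weight(qcew, b)
weighted_total = sum(w_t * x for x in b)
```

With these numbers the weight is 205 / 200 = 1.025, and the weighted beginning-of-quarter total reproduces the QCEW total of 205.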

SLIDE 18

Protection by Imputation

■ No actual confidential micro-data measured at the establishment level in QWI

■ Workplace characteristics (geography, industry) are multiply-imputed for multi-unit employers
➜ these establishments are protected by a form of synthetic data.

SLIDE 19

Protection

Table 1: Small Cells: B, Raw vs. Weighted
(a) Illinois

Unweighted                     Weighted count
count              0       1       2       3       4  5 or more
0              99.33    0.66    0.00    0.00    0.00       0.00
1               0.10   96.76    3.13    0.00    0.00       0.00
2               0.01    2.00   84.68   13.26    0.04       0.01
3               0.01    0.01    3.42   75.72   20.26       0.59
4               0.00    0.00    0.01    4.49   67.62      27.87
5 or more       0.00    0.00    0.00    0.01    0.59      99.39

Total number of cells: 14,229,968. For details, see text.
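The rows of this table sum to 100, so it reads as a row-percentage cross-tabulation of cells by count category before and after a protection step. The sketch below builds such a table from two lists of cell counts; the function names, the rounding of non-integer counts to the nearest integer, and the sample data are assumptions for illustration only.

```python
from collections import Counter

LABELS = ["0", "1", "2", "3", "4", "5 or more"]


def bin_count(n):
    """Map a (possibly non-integer) count to a category index 0..5, with 5 meaning '5 or more'."""
    return min(int(round(n)), 5)


def transition_table(before, after):
    """Row percentages: share of cells in each 'before' category landing in each 'after' category."""
    pair_counts = Counter((bin_count(x), bin_count(y)) for x, y in zip(before, after))
    table = []
    for i in range(len(LABELS)):
        row_total = sum(pair_counts[(i, j)] for j in range(len(LABELS)))
        table.append([
            100 * pair_counts[(i, j)] / row_total if row_total else 0.0
            for j in range(len(LABELS))
        ])
    return table


# Hypothetical cell counts before and after weighting.
before = [1, 1, 1, 1, 2, 2, 6]
after = [1.0, 1.1, 0.9, 1.6, 2.2, 2.7, 6.3]
table = transition_table(before, after)
```

In this toy data three of the four cells with a raw count of 1 stay in category 1 and one moves to category 2, giving a row of 75 and 25 percent.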

SLIDE 20

Protection

Table 2: Small Cells: B, Undistorted vs. Distorted
(a) Illinois

Undistorted                    Distorted count
count              0       1       2       3       4  5 or more
0              99.86    0.14    0.00    0.00    0.00       0.00
1               0.91   95.75    3.34    0.00    0.00       0.00
2               0.00    4.27   87.25    8.47    0.00       0.00
3               0.00    0.00   10.69   77.20   12.11       0.00
4               0.00    0.00    0.00   14.73   67.49      17.78
5 or more       0.00    0.00    0.00    0.00    1.93      98.07

Total number of cells: 14,229,968. Both comparisons are for weighted data. For details, see text.

SLIDE 21

Protection

Table 3: Small Cells: B, Raw vs. Published
(a) Illinois

Unweighted                           Published count
count         Suppressed       0       1       2       3       4  5 or more
0                   0.79   99.21    0.00    0.00    0.00    0.00       0.00
1                  99.91    0.08    0.00    0.00    0.00    0.00       0.00
2                  94.02    0.01    0.00    0.00    5.87    0.09       0.01
3                  34.33    0.00    0.00    0.00   47.75   16.98       0.94
4                  25.87    0.00    0.00    0.00    5.56   43.24      25.32
5 or more          15.20    0.00    0.00    0.00    0.03    0.82      83.95

Total number of cells: 14,229,968. Raw is unweighted and undistorted. Published is after weighting, distorting, and suppression. For details, see text.

SLIDE 22

Analytic validity

Analytic validity is obtained when

■ the data display no bias and
■ the additional dispersion can be quantified so that statistical inferences can be adjusted to accommodate it.

Evidence on

■ time-series properties of the distorted and published data,
■ cross-sectional unbiasedness of the published data

Unit of analysis:

■ interior sub-state geography × industry × age × sex cell kt.
■ Sub-state geography = county
■ Industry classification = SIC (Division, 2- and 3-digit)

SLIDE 23

Distribution of the Error in the First Order Serial Correlation, Raw vs. Distorted Data

∆r = r − r∗

Percentile            B          A          S          F         JF
01            −0.069373  −0.049274  −0.052155  −0.066461  −0.007969
05            −0.041585  −0.031460  −0.032934  −0.039787  −0.004651
10            −0.028849  −0.022166  −0.023733  −0.027926  −0.002785
25            −0.011920  −0.009996  −0.010161  −0.011913  −0.001003
50             0.000571   0.000384   0.000768   0.000306   0.000044
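For readers who want to reproduce a statistic of this kind, the sketch below computes the error in the first-order serial correlation for one hypothetical cell. It assumes the coefficient is the Pearson correlation between the series and its one-period lag; the estimator actually used is not stated on the slide, and the series and fuzz factors are made up.

```python
import math


def first_order_autocorr(series):
    """Pearson correlation between x_t and x_{t-1} (an assumed estimator)."""
    current, lagged = series[1:], series[:-1]
    n = len(current)
    mean_c = sum(current) / n
    mean_l = sum(lagged) / n
    cov = sum((c - mean_c) * (l - mean_l) for c, l in zip(current, lagged))
    var_c = sum((c - mean_c) ** 2 for c in current)
    var_l = sum((l - mean_l) ** 2 for l in lagged)
    return cov / math.sqrt(var_c * var_l)


# Hypothetical cell made of two establishments with different fuzz factors.
est_1 = [40, 44, 42, 50, 47, 53]
est_2 = [20, 18, 25, 21, 27, 24]
delta_1, delta_2 = 1.12, 0.88

raw = [x + y for x, y in zip(est_1, est_2)]
distorted = [delta_1 * x + delta_2 * y for x, y in zip(est_1, est_2)]

# Error in the serial correlation: raw coefficient minus distorted coefficient.
delta_r = first_order_autocorr(raw) - first_order_autocorr(distorted)
```

For a cell with a single establishment the error is zero up to floating-point error, because a constant multiplicative factor does not change a correlation; differences arise only when establishments with different factors are aggregated.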

Cross-sectional unbiasedness: W1

[Figure: distribution for W1; vertical axis: Fraction (ticks at .05, .1, .15); horizontal axis: percent, marked at −d, −c, +c, +d]

SLIDE 31

Conclusion

■ First large-scale implementation of confidentiality protection by noise infusion
✦ absence of table-level or complementary suppressions, despite a significant number of count item suppressions
✦ all ratios and magnitudes are released without suppression

■ High-quality data:
✦ remarkable consistency in the serial correlation coefficients between the undistorted and the distorted data series at highly detailed levels
✦ little or no bias is induced on average by the confidentiality protection mechanism
✦ distributions of bias are tightly centered around the modal bias of zero
✦ properties of the raw data distortion can be used to correct inferences from statistical models

SLIDE 32

Download of the presentation and paper

This presentation and the accompanying paper can be downloaded from the Cornell VirtualRDC website http://lservices.ciser.cornell.edu/news/index.php?itemid=40 and soon on the U.S. Census LEHD website http://lehd.dsd.census.gov