Confidentiality Protection in the Census Bureau’s Quarterly Workforce Indicators. John M. Abowd (1,2), Bryce E. Stephens (2,3), and Lars Vilhuber (1). (1) Cornell University, (2) U.S. Census Bureau, (3) University of Maryland. August 10, 2005



slide-2
SLIDE 2

August 10, 2005 - p. 1/31

Confidentiality Protection in the Census Bureau’s Quarterly Workforce Indicators

John M. Abowd (1,2), Bryce E. Stephens (2,3), and Lars Vilhuber (1)

(1) Cornell University, (2) U.S. Census Bureau, (3) University of Maryland

slide-3
SLIDE 3

Introduction

■ We will describe the confidentiality protection mechanism as applied to a new statistical product: the Quarterly Workforce Indicators (QWI).

■ The underlying data infrastructure was designed by the Longitudinal Employer-Household Dynamics Program at the Census Bureau and is described in detail in Abowd et al. (2005).

■ From a longitudinally integrated frame of state unemployment insurance wage records, we create measures of employment, wages, hiring, separation, job creation, and job destruction.

slide-4
SLIDE 4

Core problem

■ Disclosure proofing is required to protect the information about individuals and businesses that contribute to
✦ (confidential) unemployment insurance (UI) wage records
✦ (confidential) Quarterly Census of Employment and Wages (QCEW, also known as ES-202) reports
✦ as well as information from Census Bureau demographic data that have been integrated with these sources.

■ The primary concern of the confidentiality protection mechanism is thus with small cells, i.e., cells that reflect data on few individuals or few firms.

slide-5
SLIDE 5

Protection provided

■ In general, data are considered protected when aggregate cell values do not closely approximate data for any one respondent in the cell (Cox and Zayatz, 1993, p. 5).

■ In the QWI confidentiality protection scheme, confidential micro-data are considered protected by noise infusion if
1. any inference regarding the magnitude of a particular respondent’s data differs from the confidential quantity by at least c% even if that inference is made by a coalition of respondents with exact knowledge of their own answers, or
2. any inference regarding the magnitude of an item is incorrect with probability no less than y%, where c and y are confidential but generally large.

slide-6
SLIDE 6

Quality of the disclosed data

■ The confidentiality-protected data must be inference-valid for a well-defined set of analyses.

■ We show that
✦ the theoretical properties of the disclosure-proofing mechanism are designed to maintain analytical validity for trend analysis;
✦ in practice, the disclosure-proofed data are not biased;
✦ in practice, the time-series properties of the disclosure-proofed data remain intact.

slide-7
SLIDE 7

Three-layer confidentiality protection in QWI (I)

■ Layer 1: Multiplicative noise infusion at the establishment level, with three very important properties:
1. every establishment-level data item is distorted by some minimum amount;
2. distortion amount and direction are time-invariant: data are always distorted in the same direction (increased or decreased) by the same percentage amount in every period;
3. when estimates are aggregated, the effects of the distortion cancel out for the vast majority of the estimates.

slide-8
SLIDE 8

Three-layer confidentiality protection in QWI (II)

■ Layer 2: Weighting of estimates at higher levels (e.g., sub-state geography and industry detail)
✦ construct weights such that state-level beginning-of-quarter employment for all private employers matches the first-month-in-quarter employment in QCEW.
✦ the establishment-level weight is used for every indicator in the QWIs

slide-9
SLIDE 9

Three-layer confidentiality protection in QWI (III)

■ Layer 3: Small-cell editing (suppression or synthesizing)
✦ Some aggregate estimates are based on fewer than three persons or establishments.
✦ Currently, these estimates are suppressed and a flag is set to indicate suppression.
✦ In the next version of the disclosure-proofing system, these estimates will be replaced with synthetic values.

Note: Editing is only used when the combination of noise infusion and weighting may not distort the publication data with a high enough probability to meet the criteria laid out above.

✦ Count data such as employment are subject to editing.
✦ Continuous dollar measures like payroll are not.
✦ Regardless of small-cell editing, all published estimates are still substantially influenced by the noise that was infused in the first layer of the protection system.

slide-10
SLIDE 10

Implementation of multiplicative noise model

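■ A random fuzz factor δj is drawn for each establishment j from the density

$$
p(\delta_j) =
\begin{cases}
\dfrac{b - \delta_j}{(b - a)^2}, & \delta_j \in [a, b] \\[1ex]
\dfrac{b + \delta_j - 2}{(b - a)^2}, & \delta_j \in [2 - b, 2 - a] \\[1ex]
0, & \text{otherwise}
\end{cases}
$$

with cumulative distribution function

$$
F(\delta_j) =
\begin{cases}
0, & \delta_j < 2 - b \\[1ex]
\dfrac{(\delta_j + b - 2)^2}{2 (b - a)^2}, & \delta_j \in [2 - b, 2 - a] \\[1ex]
0.5, & \delta_j \in (2 - a, a) \\[1ex]
0.5 + \dfrac{(b - a)^2 - (b - \delta_j)^2}{2 (b - a)^2}, & \delta_j \in [a, b] \\[1ex]
1, & \delta_j > b
\end{cases}
$$

where a = 1 + c/100 and b = 1 + d/100 are constants chosen such that the true value is distorted by a minimum of c percent and a maximum of d percent.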
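One way to draw from this distribution is to invert the CDF F piecewise. The sketch below does that in Python. The values c = 5 and d = 15 are purely hypothetical (the actual c and d are confidential), and the function name `draw_fuzz_factor` is an illustration, not part of the QWI production system.

```python
import math
import random


def draw_fuzz_factor(c, d, rng=random):
    """Draw one fuzz factor by inverting the piecewise CDF F.

    c and d are the minimum and maximum percentage distortion.
    """
    a = 1 + c / 100
    b = 1 + d / 100
    u = rng.random()  # uniform on [0, 1)
    if u < 0.5:
        # Lower ramp on [2 - b, 2 - a]: solve (x + b - 2)^2 / (2 (b - a)^2) = u
        return 2 - b + (b - a) * math.sqrt(2 * u)
    # Upper ramp on [a, b]: solve 0.5 + ((b - a)^2 - (b - x)^2) / (2 (b - a)^2) = u
    return b - (b - a) * math.sqrt(2 - 2 * u)


# Hypothetical parameters, chosen only for illustration.
rng = random.Random(0)
factors = [draw_fuzz_factor(5, 15, rng) for _ in range(10_000)]
```

Every draw lies between c and d percent away from 1, and the two ramps are mirror images around 1, so the factors average out to roughly 1 across many establishments.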

slide-11
SLIDE 11

Distribution of Fuzz Factors

[Figure: density of the fuzz factor; horizontal axis marked at 2 − b, 2 − a, a, and b.]

slide-12
SLIDE 12

Distorting magnitudes and counts

The exact implementation depends on the type of estimate:

■ Magnitudes and counts

$$X^{*}_{jt} = \delta_j X_{jt},$$

where $X_{jt}$ is an establishment-level statistic among B, E, M, F, A, S, H, R, FA, FS, FH, Wk, WFH, NA, NH, NR, NS.
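As a minimal sketch of this rule, the Python snippet below applies a time-invariant fuzz factor to two establishment series. The establishment names, employment figures, and fuzz factors are all hypothetical. It illustrates that period-to-period growth within an establishment is unchanged by the distortion, while levels are not.

```python
def distort(series, delta):
    """Apply X*_jt = delta_j * X_jt with the same delta_j in every period."""
    return [delta * x for x in series]


def growth(series):
    """Period-over-period growth ratios X_t / X_{t-1}."""
    return [series[t] / series[t - 1] for t in range(1, len(series))]


# Hypothetical establishments and fuzz factors (illustration only).
employment = {"est_A": [100, 110, 121], "est_B": [50, 45, 40]}
fuzz = {"est_A": 1.08, "est_B": 0.91}

distorted = {j: distort(x, fuzz[j]) for j, x in employment.items()}

# Cell totals in the first period: the upward and downward distortions partly offset.
raw_total = sum(x[0] for x in employment.values())
distorted_total = sum(x[0] for x in distorted.values())
```

Here the raw first-period total is 150 and the distorted total is about 153.5, a smaller relative distortion than either establishment carries on its own.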

slide-13
SLIDE 13

Distorting ratios

Ratios are distorted by distorting numerators (magnitudes), but using undistorted denominators:

$$ZY^{*}_{jt} = \frac{Y^{*}_{jt}}{B(Y)_{jt}} = \delta_j \frac{Y_{jt}}{B(Y)_{jt}}.$$

This method is used for

■ average earnings (ZWk) and
■ average periods of nonemployment (ZN_) for various groups
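The ratio rule can be sketched in a few lines of Python. The payroll, employment, and fuzz factor values below are hypothetical, and returning `None` for a zero denominator is an assumption made for this illustration.

```python
def distorted_ratio(numerator, denominator, delta):
    """Distort the numerator only: delta_j * Y_jt / B(Y)_jt."""
    if denominator == 0:
        return None  # assumption: signal "cannot be computed" with None
    return delta * numerator / denominator


# Hypothetical establishment: quarterly payroll, employment base, fuzz factor.
payroll = 250_000.0
employment = 50
delta_j = 1.08

average_earnings = distorted_ratio(payroll, employment, delta_j)
```

Because the denominator is undistorted, the published ratio differs from the true ratio by exactly the same percentage as the numerator (8 percent in this example).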

slide-14
SLIDE 14

Distorting flows

■ Distorted net job flow (JF) is computed at the aggregate level (k = geography, industry, or combination of the two for the appropriate age and sex categories) as the product of the aggregated, undistorted rate of growth and the aggregated distorted employment:

$$JF^{*}_{kt} = G_{kt} \times \bar{E}^{*}_{kt} = JF_{kt} \times \frac{\bar{E}^{*}_{kt}}{\bar{E}_{kt}}.$$

■ The formulas for distorting gross job creation (JC) and job destruction (JD) are similar.

■ The same logic is used to distort wage changes for subgroups.
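A minimal Python sketch of the flow formula follows. The inputs are taken as already aggregated to cell k, the numbers are hypothetical, and the `None` return for a zero employment base is an assumption of this sketch.

```python
def distorted_net_job_flow(jf, e_bar, e_bar_star):
    """JF*_kt = G_kt * Ebar*_kt, with G_kt = JF_kt / Ebar_kt (undistorted growth rate)."""
    if e_bar == 0:
        return None  # assumption: rate undefined when average employment is zero
    growth_rate = jf / e_bar
    return growth_rate * e_bar_star


# Hypothetical cell: net job flow of 12 on average employment of 400,
# with distorted average employment of 412.
jf_star = distorted_net_job_flow(12.0, 400.0, 412.0)
```

The distorted flow is scaled by the same factor as aggregate employment, so the implied growth rate JF*/Ē* equals the undistorted rate.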

slide-15
SLIDE 15

Item suppression

■ Some disclosure risk remains for counts based on very few entities in a cell (fewer than three individuals or employers).

■ Variables affected are: B, E, M, F, A, S, H, R, FA, FH, FS, JC, JD, JF, FJC, FJD, FJF.
➜ Item suppression is based on the number of workers or the number of employers that contribute data for that item in a cell k in time period t, where a cell represents a particular combination of geography × industry × age × sex.

■ Because of noise infusion, no complementary suppressions are needed.

■ Some denominators may be zero, in which case the ratio or rate cannot be computed.
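The suppression rule can be sketched as a simple check. The Python below assumes that an item is suppressed when either the number of contributing workers or the number of contributing employers is below three; the function name, flag strings, and example values are hypothetical.

```python
SUPPRESSION_THRESHOLD = 3  # "fewer than three individuals or employers"


def publish_count(value, n_workers, n_employers):
    """Return (published value, flag) for one item in one cell and period.

    Assumption: suppress if either contributor count is below the threshold.
    """
    if n_workers < SUPPRESSION_THRESHOLD or n_employers < SUPPRESSION_THRESHOLD:
        return None, "suppressed"
    return value, "published"


small_cell = publish_count(2.1, n_workers=2, n_employers=1)
large_cell = publish_count(57.3, n_workers=57, n_employers=4)
```

No other cell needs to be withheld to protect a suppressed one, since every published value already carries the first-layer noise.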

slide-16
SLIDE 16

Economic concepts in QCEW and QWI

Difference in the economic concepts underlying the Quarterly Census of Employment and Wages (QCEW) and the QWI statistics:

■ QCEW: employment on the 12th day of the first month in the quarter (QCEW1,jt)

■ QWI: several measures of employment, derived from reports of quarterly employment and wages of individual workers at particular employers (state UI accounts).

■ Key definition: Beginning of quarter employment Bjt: employees at establishment j in both quarter t and t − 1, and by inference, on the 1st day of quarter t.

slide-17
SLIDE 17

Protection by Weighting

■ QCEW1,jt and Bjt are not identical because
✦ they do not refer to exactly the same point in time,
✦ the in-scope establishments differ slightly, and
✦ they are computed from different universe data.

■ Actual differences are captured by the QWI weighting scheme: a time series of adjustment weights wt is defined by

$$w_t \sum_j b_{jt} = \sum_j QCEW_{1,jt}$$

■ All variables are weighted by wt.
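Solving the defining equation for the weight gives w_t = Σ_j QCEW_{1,jt} / Σ_j b_{jt}. The Python sketch below computes it for one quarter, reading b_jt as establishment j's beginning-of-quarter employment (an assumption about the notation); the establishment names and numbers are hypothetical.

```python
def adjustment_weight(qcew_month1, beginning_employment):
    """w_t such that w_t * sum_j b_jt equals sum_j QCEW_1,jt for one quarter."""
    total_b = sum(beginning_employment.values())
    if total_b == 0:
        raise ValueError("beginning-of-quarter employment sums to zero")
    return sum(qcew_month1.values()) / total_b


# Hypothetical state with three establishments in quarter t.
qcew = {"est_A": 102, "est_B": 48, "est_C": 260}  # first-month QCEW employment
b = {"est_A": 98, "est_B": 50, "est_C": 252}      # beginning-of-quarter employment

w_t = adjustment_weight(qcew, b)
weighted_b = {j: w_t * x for j, x in b.items()}
```

After weighting, the state-level total of beginning-of-quarter employment matches the QCEW total (410 in this example), while individual establishment values generally do not.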

slide-18
SLIDE 18

Protection by Imputation

■ no actual confidential micro-data measured at the establishment level in QWI

■ workplace characteristics (geography, industry) are multiply-imputed for multi-unit employers
➜ these establishments are protected by a form of synthetic data.

slide-19
SLIDE 19

Protection

Table 1: Small Cells: B, Raw vs. Weighted
(a) Illinois

Unweighted                     Weighted count
count              0        1        2        3        4   5 or more
0              99.33     0.66     0.00     0.00     0.00        0.00
1               0.10    96.76     3.13     0.00     0.00        0.00
2               0.01     2.00    84.68    13.26     0.04        0.01
3               0.01     0.01     3.42    75.72    20.26        0.59
4               0.00     0.00     0.01     4.49    67.62       27.87
5 or more       0.00     0.00     0.00     0.01     0.59       99.39

Total number of cells: 14,229,968. For details, see text.
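Each row of the table sums to roughly 100, so the entries appear to be row percentages: the share of cells with a given unweighted count that land in each weighted-count class. The Python sketch below shows how such a table could be tabulated. Rounding the weighted count to the nearest integer before classifying it is an assumption, and the input pairs are hypothetical.

```python
from collections import Counter

LABELS = ["0", "1", "2", "3", "4", "5 or more"]


def size_class(count):
    """Map a non-negative count to its class label (assumes nearest-integer rounding)."""
    return LABELS[min(round(count), 5)]


def row_percentages(pairs):
    """pairs: iterable of (unweighted count, weighted count), one pair per cell."""
    cells = Counter((size_class(raw), size_class(w)) for raw, w in pairs)
    row_totals = Counter()
    for (row, _), n in cells.items():
        row_totals[row] += n
    return {(row, col): 100.0 * n / row_totals[row]
            for (row, col), n in cells.items()}


# Hypothetical cells: (unweighted count, weighted count).
pairs = [(1, 1.02), (1, 0.97), (2, 2.6), (2, 2.1), (7, 7.4)]
table = row_percentages(pairs)
```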

slide-20
SLIDE 20

Protection

Table 2: Small Cells: B, Undistorted vs. Distorted
(a) Illinois

Undistorted                    Distorted count
count              0        1        2        3        4   5 or more
0              99.86     0.14     0.00     0.00     0.00        0.00
1               0.91    95.75     3.34     0.00     0.00        0.00
2               0.00     4.27    87.25     8.47     0.00        0.00
3               0.00     0.00    10.69    77.20    12.11        0.00
4               0.00     0.00     0.00    14.73    67.49       17.78
5 or more       0.00     0.00     0.00     0.00     1.93       98.07

Total number of cells: 14,229,968. Both comparisons are for weighted data. For details, see text.

slide-21
SLIDE 21

Protection

Table 3: Small Cells: B, Raw vs. Published
(a) Illinois

Unweighted                              Published count
count        Suppressed        0        1        2        3        4   5 or more
0                  0.79    99.21     0.00     0.00     0.00     0.00        0.00
1                 99.91     0.08     0.00     0.00     0.00     0.00        0.00
2                 94.02     0.01     0.00     0.00     5.87     0.09        0.01
3                 34.33     0.00     0.00     0.00    47.75    16.98        0.94
4                 25.87     0.00     0.00     0.00     5.56    43.24       25.32
5 or more         15.20     0.00     0.00     0.00     0.03     0.82       83.95

Total number of cells: 14,229,968. Raw is unweighted and undistorted. Published is after weighting, distorting, and suppression. For details, see text.

slide-22
SLIDE 22

Analytic validity

Analytic validity is obtained when

■ the data display no bias and
■ the additional dispersion can be quantified so that statistical inferences can be adjusted to accommodate it.

Evidence on

■ time-series properties of the distorted and published data,
■ cross-sectional unbiasedness of the published data

Unit of analysis:

■ interior sub-state geography × industry × age × sex cell kt.
■ Sub-state geography = county
■ Industry classification = SIC (Division, 2- and 3-digit)

slide-23
SLIDE 23

Distribution of the Error in the First Order Serial Correlation, Raw vs. Distorted Data

∆r = r − r∗

Percentile          B          A          S          F         JF
01          −0.069373  −0.049274  −0.052155  −0.066461  −0.007969
05          −0.041585  −0.031460  −0.032934  −0.039787  −0.004651
10          −0.028849  −0.022166  −0.023733  −0.027926  −0.002785
25          −0.011920  −0.009996  −0.010161  −0.011913  −0.001003
50           0.000571   0.000384   0.000768   0.000306  −0.000044
75           0.013974   0.011806   0.012891   0.012632   0.000776
90           0.030948   0.025152   0.026290   0.028299   0.002263
95           0.044233   0.033871   0.037198   0.040565   0.004375
99           0.078519   0.054415   0.060327   0.074212   0.007845

SIC-division × County, State of Illinois. r from AR(1) estimated for each cell’s time series.
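To make ∆r concrete, the Python sketch below estimates a first-order serial correlation for one cell before and after distortion and takes the difference. It uses the OLS slope of x_t on x_{t−1} with an intercept as the AR(1) estimate, which is an assumption about the estimator; the two establishment series and their fuzz factors are hypothetical.

```python
from statistics import mean


def ar1_coefficient(series):
    """OLS slope from regressing x_t on x_{t-1} with an intercept."""
    lagged, current = series[:-1], series[1:]
    m_lag, m_cur = mean(lagged), mean(current)
    sxy = sum((x - m_lag) * (y - m_cur) for x, y in zip(lagged, current))
    sxx = sum((x - m_lag) ** 2 for x in lagged)
    return sxy / sxx


# Hypothetical cell made of two establishments with time-invariant fuzz factors.
est_1 = [100, 104, 103, 108, 112, 110, 115, 119]
est_2 = [40, 38, 41, 37, 36, 39, 35, 34]
delta_1, delta_2 = 1.08, 0.91

raw_cell = [x + y for x, y in zip(est_1, est_2)]
distorted_cell = [delta_1 * x + delta_2 * y for x, y in zip(est_1, est_2)]
delta_r = ar1_coefficient(raw_cell) - ar1_coefficient(distorted_cell)

# A single-establishment cell is rescaled by one constant, so r is unchanged.
single_distorted = [delta_1 * x for x in est_1]
delta_r_single = ar1_coefficient(est_1) - ar1_coefficient(single_distorted)
```

In a one-establishment cell the time-invariant factor rescales the whole series and ∆r is zero up to rounding. A nonzero ∆r arises only when establishments with different factors are aggregated.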

slide-24
SLIDE 24

Distribution of the Error in the First Order Serial Correlation, Raw vs. Distorted Data, SIC Division, F

[Figure: "Distribution of Delta r IL, Division". Axis: Delta r(f), with ticks from −.1 to .1; additional marks at 25, 50, 75.]

slide-25
SLIDE 25

Distribution of the Error in the First Order Serial Correlation, Raw vs. Distorted Data, SIC Division, JF

[Figure: "Distribution of Delta r IL, Division". Axis: Delta r(jf), with ticks from −.1 to .1; additional marks at 25, 50, 75.]

slide-26
SLIDE 26

Distribution of the Error in the First Order Serial Correlation, Raw vs. Distorted Data, SIC Division, B

[Figure: "Distribution of Delta r IL, Division". Axis: Delta r(b), with ticks from −.1 to .1; additional marks at 25, 50, 75.]

slide-27
SLIDE 27

Distribution of the Error in the First Order Serial Correlation, Raw vs. Published Data, SIC Division, B

[Figure: "Distribution of Delta r IL, Division". Axis: Delta r(b), with ticks from −.1 to .1; additional marks at 25, 50, 75.]

slide-28
SLIDE 28

Distribution of the Error in the First Order Serial Correlation, Raw vs. Published Data, SIC2 B

[Figure: "Distribution of Delta r IL, SIC2". Axis: Delta r(b), with ticks from −.1 to .1; additional marks at 25, 50, 75.]

slide-29
SLIDE 29

Cross-sectional unbiasedness: B

[Figure. Vertical axis: Fraction (ticks at .05, .1, .15). Horizontal axis: percent, marked at −d, −c, +c, +d.]

slide-30
SLIDE 30

Cross-sectional unbiasedness: W1

[Figure. Vertical axis: Fraction (ticks at .05, .1, .15). Horizontal axis: percent, marked at −d, −c, +c, +d.]

slide-31
SLIDE 31

Conclusion

■ First large-scale implementation of confidentiality protection by noise infusion
✦ absence of table-level or complementary suppressions, despite a significant number of count item suppressions
✦ all ratios and magnitudes are released without suppression

■ High-quality data:
✦ remarkable consistency in the serial correlation coefficients between the undistorted and the distorted data series at highly detailed levels
✦ little or no bias is induced on average by the confidentiality protection mechanism
✦ distributions of bias are tightly centered around the modal bias of zero
✦ properties of the raw data distortion can be used to correct inferences from statistical models

slide-32
SLIDE 32

Download of the presentation and paper

This presentation and the accompanying paper can be downloaded from the Cornell VirtualRDC website http://lservices.ciser.cornell.edu/news/index.php?itemid=40 and soon on the U.S. Census LEHD website http://lehd.dsd.census.gov