2020 Disclosure Avoidance System (DAS) Presenter: John L. Eltinge

SLIDE 1

2020 Disclosure Avoidance System (DAS)

Presenter: John L. Eltinge, Assistant Director for Research and Methodology

Presenting materials originally from:
Simson L. Garfinkel, Senior Computer Scientist for Confidentiality and Data Access
John M. Abowd, Chief Scientist and Associate Director for Research and Methodology (ADRM)

SLIDE 2

Acknowledgements and Disclaimer

Almost all of the materials covered in the slides were originally prepared by Simson Garfinkel and John Abowd of the United States Census Bureau. The views expressed in this presentation are those of the authors and speaker, and do not necessarily represent the policies of the United States Census Bureau.

SLIDE 3

General Background

Essentially All Large-Scale Statistical Programs Require a Complex Balance of Multiple Dimensions of:

  • Quality
  • Risk (Including Disclosure Risk)
  • Cost

SLIDE 4

Disclosure Avoidance System: Purpose

  • The Disclosure Avoidance System (DAS) assures that the 2020 Census data products meet the legal requirements of Title 13, Section 9 of the U.S. Code.
  • The DAS is designed to prevent improper disclosures of data about individuals and establishments in the 2020 Census data products.
  • Stakeholders: All users of data from the 2020 Census.

SLIDE 5

Disclosure Avoidance System: Agenda

  • Project purpose: Why do we need a new DAS?
  • Noise injection and differential privacy: A brief tutorial
  • State of the project
  • Looking forward and conclusion

SLIDE 6

Project purpose:

Why we need a new disclosure avoidance system

SLIDE 7

We create statistics by collecting data, processing, and publishing.

[Diagram: RESPONDENT DATA → PROCESSING → PUBLISHED SUMMARY DATA]

SLIDE 8

Database reconstruction is a mathematical process that reverses this process.

[Diagram: PUBLISHED SUMMARY DATA → PROCESSING → RESPONDENT DATA]

SLIDE 10

Consider a census block:

PUBLISHED DATA

Category     Count
Age < 18     4
Age >= 18    6
Race 1       4
Race 2       4
Race 3       2

SLIDE 11

PUBLISHED DATA

Category     Count
Age < 18     4
Age >= 18    6
Race 1       4
Race 2       4
Race 3       2

RECONSTRUCTED DATA

[Figure labels: Race 1, Race 2, Race 3; record R1]

SLIDE 12

PUBLISHED DATA

Category     Count
Age < 18     4
Age >= 18    6
Race 1       4
Race 2       4
Race 3       2

RECONSTRUCTED DATA

[Figure labels: Race 1, Race 2, Race 3; records R1, R2]

SLIDE 13

[Figure labels: Race 1, Race 2, Race 3]

SLIDE 14

[Figure labels: AGE >= 18; AGE + RACE]

SLIDE 15

[Figure: ten records, each labeled RACE + AGE]

SLIDE 16

[Figure: ten records, each labeled RACE + AGE]

TWENTY CONFIDENTIAL VALUES

SLIDE 17

PUBLISHED DATA

Category     Count
Age < 18     4
Age >= 18    6
Race 1       4
Race 2       4
Race 3       2

FIVE PUBLISHED STATISTICS
TWENTY CONFIDENTIAL VALUES
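The gap between five published statistics and twenty confidential values can be explored with a small sketch. It assumes each of the block's ten residents has exactly one age group and one race, and it enumerates every age-by-race table whose margins reproduce the published counts. The function name and the cell encoding are illustrative choices, not part of the original slides.

```python
from itertools import product

# Published counts for the toy census block (taken from the slide).
AGE_COUNTS = {"<18": 4, ">=18": 6}
RACE_COUNTS = {"Race 1": 4, "Race 2": 4, "Race 3": 2}


def consistent_tables(age_counts, race_counts):
    """Return every age-by-race table whose margins match the published counts.

    A table is a dict mapping (age_group, race) -> number of people.
    """
    races = list(race_counts)
    minors_total = age_counts["<18"]
    tables = []
    # Choose how many minors fall in each race; the adults are then forced.
    ranges = [range(race_counts[r] + 1) for r in races]
    for minors in product(*ranges):
        if sum(minors) != minors_total:
            continue
        table = {}
        for race, m in zip(races, minors):
            table[("<18", race)] = m
            table[(">=18", race)] = race_counts[race] - m
        tables.append(table)
    return tables


tables = consistent_tables(AGE_COUNTS, RACE_COUNTS)
print(f"{len(tables)} candidate tables match the five published statistics")
```

With only these five statistics, 12 candidate tables remain. Each additional published tabulation adds constraints and shrinks the candidate set, which is the mechanism database reconstruction exploits.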

SLIDE 19

“This is the official form for all the people at this address.”

“It is quick and easy, and your answers are protected by law.”

SLIDE 20

2010 Census of Population and Housing

Total population: 308,745,538
Pieces of information per person: 6
Total pieces of information: 1,852,473,228

SLIDE 21

2010 Census Publication Schedule

PL94-171 Redistricting        2,771,998,263
Balance of Summary File 1     2,806,899,669
Summary File 2                2,093,683,376

SLIDE 22

2010 Census: Summary of Publications (approximate counts)

Publication                            Released counts (including zeros)
PL94-171 Redistricting                 2,771,998,263
Balance of Summary File 1              2,806,899,669
Summary File 2                         2,093,683,376
Public-use micro data sample           30,874,554
Lower bound on published statistics    7,703,455,862
Statistics/person                      25

SLIDE 23

The threat of database reconstruction

2010 Census statistics/person collected: 6
2010 Census statistics/person published: 25
Lower bound on collected statistics (308,745,538 x 6): 1,852,473,228
Lower bound on published statistics (25 statistics per person): 7,703,455,862
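The totals on these slides can be re-derived from the figures they quote. The short check below uses only numbers that appear in the slides; the variable names are illustrative.

```python
# Figures copied from the slides.
POPULATION_2010 = 308_745_538
ITEMS_PER_PERSON = 6

released_counts = {
    "PL94-171 Redistricting": 2_771_998_263,
    "Balance of Summary File 1": 2_806_899_669,
    "Summary File 2": 2_093_683_376,
    "Public-use micro data sample": 30_874_554,
}

# Lower bound on collected statistics: one value per item per person.
collected = POPULATION_2010 * ITEMS_PER_PERSON

# Lower bound on published statistics: sum of the released counts.
published = sum(released_counts.values())

# Published statistics per person, before rounding.
per_person = published / POPULATION_2010

print(f"collected: {collected:,}")
print(f"published: {published:,}")
print(f"published per person: {per_person:.2f}")
```

The published total is roughly four times the collected total, which is why the slides treat reconstruction as a realistic threat: there are far more published constraints than confidential unknowns.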

SLIDE 24

Two privacy mechanisms for the 2010 Census

Aggregation

SLIDE 25

Two privacy mechanisms for the 2010 Census

Aggregation
Swapping

SLIDE 26

Noise injection and differential privacy

CONTROLLED NOISE

SLIDE 27

Database reconstruction and noise injection

Category     Counts    Counts after NOISE
Age < 18     4         5
Age >= 18    6         5
Race 1       4         3
Race 2       4         5
Race 3       2         2

SLIDE 28

The more noise, the more privacy, and the less accuracy

Category     Counts    Little Noise
Age < 18     4         5
Age >= 18    6         5
Race 1       4         3
Race 2       4         5
Race 3       2         2

SLIDE 29

The more noise, the more privacy, and the less accuracy

Category     Counts    Little Noise    BIG NOISE
Age < 18     4         5               2
Age >= 18    6         5               8
Race 1       4         3               8
Race 2       4         5               1
Race 3       2         2               1

SLIDE 30

The more noise, the more privacy, and the less accuracy

BIG NOISE

Counts: 2 8 8 1 1

POSSIBILITY 1, POSSIBILITY 2, POSSIBILITY 3:

Category     Counts    Counts    Counts
Age < 18     8         4         3
Age >= 18    2         6         7
Race 1       3         4         5
Race 2       2         4         2
Race 3       5         2         3
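The "little noise" versus "big noise" trade-off can be sketched with the textbook Laplace mechanism. This is an assumption made for illustration: the slides do not say which noise distribution the DAS uses. The sketch also assumes each count has sensitivity 1, ignores how a budget would be split across the five statistics, and does no post-processing, so noisy counts may be negative or fractional. All function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean
    `scale` is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(true_counts, epsilon, rng):
    """Add Laplace noise with scale 1/epsilon to each count.

    Assumes sensitivity 1 per count (a simplification for illustration).
    """
    scale = 1.0 / epsilon
    return {k: v + laplace_noise(scale, rng) for k, v in true_counts.items()}


def mean_abs_error(true_counts, epsilon, trials, rng):
    """Average absolute error per statistic over repeated noisy releases."""
    total = 0.0
    for _ in range(trials):
        noisy = noisy_counts(true_counts, epsilon, rng)
        total += sum(abs(noisy[k] - true_counts[k]) for k in true_counts)
    return total / (trials * len(true_counts))


TRUE_COUNTS = {"Age < 18": 4, "Age >= 18": 6, "Race 1": 4, "Race 2": 4, "Race 3": 2}
rng = random.Random(2020)  # fixed seed so the sketch is repeatable

little_noise_error = mean_abs_error(TRUE_COUNTS, epsilon=2.0, trials=2000, rng=rng)
big_noise_error = mean_abs_error(TRUE_COUNTS, epsilon=0.25, trials=2000, rng=rng)
print(f"little noise (epsilon=2.0): mean abs error {little_noise_error:.2f}")
print(f"big noise (epsilon=0.25): mean abs error {big_noise_error:.2f}")
```

A smaller epsilon means a larger noise scale: the expected absolute error of Laplace noise equals its scale, so the error grows from about 0.5 to about 4 counts per statistic as epsilon drops from 2.0 to 0.25.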

SLIDE 31

Differential privacy is a tool for controlling the noise/accuracy trade-off

SLIDE 32

  • Differential privacy provides:
      • Provable bounds on the maximum privacy loss
      • Algorithms that allow policy makers to manage the trade-off between accuracy and privacy loss.

Final privacy-loss budget determined by the Data Stewardship Executive Policy Committee (DSEP) with recommendations from the Disclosure Review Board (DRB)

[Figure labels: Less Noise, MORE NOISE]

In 2017, the Census Bureau announced that it would use differential privacy for the 2020 Census.
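For reference, the "provable bound" can be stated with the standard textbook definition of ε-differential privacy. The notation is introduced here and does not come from the slides: $\mathcal{M}$ is the randomized release mechanism, $D$ and $D'$ are any two datasets differing in one person's record, and $S$ is any set of possible outputs.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
\qquad \text{for all neighboring } D, D' \text{ and all output sets } S
```

A smaller $\varepsilon$ forces the two output distributions closer together, which corresponds to more noise and a smaller privacy loss.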

SLIDE 33

State of the project

SLIDE 34

The “Disclosure Avoidance System” is part of the Census data processing pipeline

[Diagram: Decennial Response File → Census Unedited File → Census Edited File → Disclosure Avoidance System (Global Confidentiality Protection Process) → Microdata Detail File → Pre-specified tabular summaries (PL94-171, SF1, SF2); special tabulations and post-census research]

Input to the Disclosure Avoidance System: Privacy-loss Budget (ε), Accuracy Decisions, accuracy trade-offs

Red = Confidential Data
Blue = Privatized Data

SLIDE 35

Differential privacy has many advantages over swapping

Global Confidentiality Protection Process / Disclosure Avoidance System

  • Advantages:
      • Privacy guarantees are tunable and provable
      • Privacy guarantees are future-proof
      • Privacy guarantees are public and explainable
      • Protects against database reconstruction
  • Disadvantages:
      • Entire country must be processed at once for best accuracy
      • Every use of private data must be tallied in the privacy-loss budget
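The requirement that every use of private data be tallied can be pictured as a simple ledger. The sketch below assumes basic sequential composition, under which the epsilons of successive releases add up. The class and its method names are hypothetical and do not describe the DAS code.

```python
class PrivacyLossBudget:
    """Track cumulative privacy loss under basic sequential composition.

    Hypothetical helper for illustration: each query's epsilon is added
    to a running total, and a query is refused if it would exceed the
    overall budget.
    """

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.ledger = []  # (label, epsilon) pairs, in order of use

    @property
    def spent(self):
        return sum(eps for _, eps in self.ledger)

    @property
    def remaining(self):
        return self.total_epsilon - self.spent

    def charge(self, label, epsilon):
        """Record one use of the confidential data, or raise if over budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:  # small tolerance for float error
            raise RuntimeError(
                f"budget exceeded: {label!r} needs {epsilon}, "
                f"only {self.remaining:.3f} left"
            )
        self.ledger.append((label, epsilon))


budget = PrivacyLossBudget(total_epsilon=1.0)
budget.charge("age tabulation", 0.5)
budget.charge("race tabulation", 0.25)
print(f"spent={budget.spent}, remaining={budget.remaining}")
```

Once the ledger reaches the total, any further release from the confidential data has to be refused or the budget has to be raised by policy decision, which is why the slide lists this bookkeeping as a cost.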

SLIDE 36

We will make the DAS public!

  • Open source system
  • Source code published on the Internet
  • Testable with data from 1940 Census

SLIDE 37

Communications Strategy

  • Differential privacy is not widely known or understood outside academia
  • Most data users expect the same accuracy regardless of the level of detail
  • In 2000 and 2010 we used swapping with an undisclosed swap rate
    – The Census Bureau did not quantify the error rate

SLIDE 38

State of the DAS Project(s): Engineering & Science

  • ENGINEERING PROJECT: Building a Turnkey Batch-Oriented System
  • Creating a production system that runs within the 2018 End-to-End Census Test and 2020 Census production environments
    – Resource intensive, but only when actively in use
    – Based on Amazon Elastic Map Reduce technology
    – Reads CEF from the Census Data Lake
    – Processes using DAS algorithms and a commercial optimizer
    – Creates the Microdata Detail File
    – Saves results in the Census Data Lake

SLIDE 39

State of the DAS Project(s): Engineering & Science

  • SCIENCE PROJECT: Improving the differential privacy algorithms
  • We are steadily improving the accuracy/privacy trade-off
  • Progress requires interactive access to microdata from the 2010 Census, and continued access to high-performance computing on demand.

[Figure panel labels: By block; By block]

SLIDE 40

Looking forward

SLIDE 41

DAS Highlights: Good news!

  • The current “top-down” algorithm handles the PL94-171 queries and generates micro-data that meet the requirements to publish test files.
  • We’re sharing tables with Subject Matter Experts (SMEs) and discussing possible improvements
  • We will soon integrate the High-Dimensional Matrix Mechanism (HDMM) into our top-down algorithm, which will improve accuracy on requested tabulations
  • The Census Bureau is collecting “use cases” from our data users

SLIDE 42

FRN Notice

We want users of 2020 Census Data Products to tell us how they use our data!

First FRN: 83 FR 34111, 7/19/2018 -> 9/17/2018
Second FRN: 83 FR 50636, 10/09/2018 -> 11/08/2018

SLIDE 43

DAS Science Highlights: Challenges!

  • We have not yet addressed household queries or person-household joins, although we have in-progress research for both
    – Householder queries, e.g. “how many households are headed by someone aged 20-30?”
    – Person-household join, e.g. “how many children are in households headed by someone aged 20-30?”
  • Lack of scientists and engineers trained in differential privacy
  • Many open questions in mathematical statistics and methodology
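The two example queries can be written out against toy microdata to show how they differ. Every record, field name, and threshold below is made up for illustration: "aged 20-30" is read as inclusive, and "children" is taken to mean persons under 18.

```python
# Toy microdata; all records and field names are invented for illustration.
households = [
    {"hh_id": 1, "householder_age": 25},
    {"hh_id": 2, "householder_age": 47},
    {"hh_id": 3, "householder_age": 30},
]
persons = [
    {"hh_id": 1, "age": 25}, {"hh_id": 1, "age": 3}, {"hh_id": 1, "age": 1},
    {"hh_id": 2, "age": 47}, {"hh_id": 2, "age": 15},
    {"hh_id": 3, "age": 30}, {"hh_id": 3, "age": 28}, {"hh_id": 3, "age": 6},
]


def householder_in_range(household, low=20, high=30):
    """True if the householder's age falls in [low, high]."""
    return low <= household["householder_age"] <= high


# Householder query: touches only the household table.
n_households = sum(1 for h in households if householder_in_range(h))

# Person-household join: each person must be matched to their household.
selected = {h["hh_id"] for h in households if householder_in_range(h)}
n_children = sum(1 for p in persons if p["hh_id"] in selected and p["age"] < 18)

# One household can contribute several children to the joined count.
max_children_in_one_household = max(
    sum(1 for p in persons if p["hh_id"] == hh and p["age"] < 18)
    for hh in selected
)
print(n_households, n_children, max_children_in_one_household)
```

In the join, changing a single household can move the answer by more than one, so the amount of noise needed is harder to bound than for a simple count. That is one plausible reason such queries need separate research, though the slides do not spell out the obstacle.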

SLIDE 44

2020 Disclosure Avoidance System: Conclusions

  • We are using differential privacy to assure that published statistics do not violate the Census Bureau’s Title 13 obligations
  • This is a huge step forward for the Census Bureau
  • We have a working system and will use it for the 2018 End-to-End Census Test
    – For 2018 we are only producing the PL94-171 redistricting tabulations
  • There is a lot of scientific work that remains to be done
  • Contact: Simson.L.Garfinkel@census.gov John.M.Abowd@census.gov

SLIDE 45

QUESTIONS?