De-Identifying Education Data
Future of Privacy Forum Webinar, October 13, 2017


SLIDE 1

De-Identifying Education Data
Future of Privacy Forum Webinar, October 13, 2017

MICHAEL HAWES
Director of Student Privacy Policy
U.S. Department of Education

United States Department of Education
Student Privacy Policy and Assistance Division

SLIDE 2

What is PII?

SLIDE 3

Personally Identifiable Information

Captain Hook

SLIDE 4

Personally Identifiable Information

A one-handed pirate, with an irrational fear of crocodiles and ticking clocks

SLIDE 5

FERPA: Personally Identifiable Information (PII)

  • Direct Identifiers (1:1 relationship to student)
  • e.g., Name, SSN, Student ID Number

  • Indirect Identifiers (1:Many relationship to student)
  • e.g., Birthdate, Demographic Information

  • “Other information that, alone or in combination, is linked or linkable to a specific student that would allow a reasonable person in the school community, who does not have personal knowledge of the relevant circumstances, to identify the student with reasonable certainty.” (§ 99.3)

SLIDE 6

FERPA’s Confidentiality Standard

Can a “reasonable person” in the school community re-identify the individual with any reasonable certainty?

Tabular Data:

A small degree of uncertainty (“reasonable doubt”) is often sufficient. [e.g., “the rule of 3”]

Individual-level Data:

The abundance of data points for each individual, the availability of easy-to-use data-manipulation and data-mining tools, and the ability to link to external data sources make the risk of re-identification much higher.

SLIDE 7

FERPA vs. HIPAA’s “Safe Harbor”

SLIDE 8

PII? But I’m only releasing aggregate data …

Aggregate data tables can still contain PII if they report information on small groups, or individuals with unique or uncommon characteristics


SLIDE 9

How States are Doing It

[Figure: State Adoption of Sub-Group Suppression Rules (circa 2012), a bar chart of the number of States adopting each suppression rule (n < X)]

Under ESEA, States adopted minimum n-size rules to protect student privacy in aggregate reports, but the minimum n-size selected varies substantially from State to State, as does how States have interpreted and implemented those requirements.
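A minimum n-size rule of this kind takes only a few lines to sketch. The rule value (n < 5) and the field names below are illustrative, not any particular State's implementation:

```python
# A minimal sketch of a minimum n-size suppression rule. The rule value
# (n < 5) and the field names are illustrative, not from the webinar.

def apply_n_size_rule(rows, min_n=5):
    """Mask counts and percentages wherever a subgroup has fewer than
    min_n tested students."""
    out = []
    for row in rows:
        if row["n_tested"] < min_n:
            out.append({**row, "n_proficient": "*", "pct_proficient": "*"})
        else:
            out.append(dict(row))
    return out

report = apply_n_size_rule([
    {"subgroup": "A", "n_tested": 42, "n_proficient": 30, "pct_proficient": "71%"},
    {"subgroup": "B", "n_tested": 3,  "n_proficient": 1,  "pct_proficient": "33%"},
])
```

As the following slides show, this masking step alone is not sufficient: small cells can often be rebuilt from totals and related tables.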

SLIDE 10

Small cells increase disclosure risk…

BUT, suppressing the small cells may not be sufficient


SLIDE 11

Common Mistakes in Public Reporting

SLIDE 12

Population Size vs. Cell Size

Assume a minimum n-size rule of 5:

Subgroup     # Tested   # Proficient   % Proficient
Subgroup 1   6          1              16.7%

SLIDE 13

Population Size vs. Cell Size

Assume a minimum n-size rule of 5:

Subgroup     # Tested   # Proficient   % Proficient
Subgroup 1   6          1              16.7%

What if I'm that 1 student? I now know something about the other 5!
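The leak here is plain subtraction; a tiny sketch with the slide's values:

```python
# The arithmetic behind the slide: a subgroup member who knows their own
# result can subtract themselves out of the published cell.
n_tested, n_proficient = 6, 1      # published: 1 of 6 proficient

# From the one proficient student's point of view:
others_proficient = n_proficient - 1
others_tested = n_tested - 1
print(f"{others_proficient} of my {others_tested} classmates are proficient")
# prints "0 of my 5 classmates are proficient"
```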

SLIDE 14

Fixed Top/Bottom Coding Thresholds

Assume a minimum n-size rule of 5:

Subgroup     # Tested   # Proficient   % Proficient
Subgroup 1   8          *              <5%

SLIDE 15

Fixed Top/Bottom Coding Thresholds

Assume a minimum n-size rule of 5:

Subgroup     # Tested   # Proficient   % Proficient
Subgroup 1   8          *              <5%

0/8 = 0%; 1/8 = 12.5%. So “<5%” of 8 students = 0 students!

SLIDE 16

A Better Approach for Handling Extreme Values

Number of Students (denominator)    Top/Bottom Coding for Percentages
1-5                                 Suppressed
6-15                                <50%, ≥50%
16-30                               ≤20%, ≥80%
31-60                               ≤10%, ≥90%
61-300                              ≤5%, ≥95%
301-3,000                           ≤1%, ≥99%
3,001 or more                       ≤0.1%, ≥99.9%
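These sliding thresholds translate directly into code. The function name and return format below are illustrative, a sketch of the table rather than any official implementation:

```python
# A minimal sketch of denominator-based top/bottom coding, following the
# threshold table above; names and output format are illustrative.

def top_bottom_code(numerator, denominator):
    """Report numerator/denominator as a percentage, coding extreme values
    to ranges whose width grows as the denominator shrinks."""
    if denominator <= 5:
        return "Suppressed"
    pct = 100 * numerator / denominator
    # (max denominator, low cut, high cut): one entry per band in the table
    for max_n, low, high in [(15, 50, 50), (30, 20, 80), (60, 10, 90),
                             (300, 5, 95), (3000, 1, 99),
                             (float("inf"), 0.1, 99.9)]:
        if denominator <= max_n:
            if low == high:            # 6-15 students: only <50% or >=50%
                return "<50%" if pct < 50 else ">=50%"
            if pct <= low:
                return f"<={low}%"
            if pct >= high:
                return f">={high}%"
            return f"{round(pct)}%"
```

Note how this fixes the previous slide's leak: `top_bottom_code(1, 8)` and `top_bottom_code(0, 8)` both report "<50%", so the reader can no longer conclude that "<5%" of 8 students means exactly 0.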

SLIDE 17

What’s the missing number?

12 8 14 ? 6

SLIDE 18

What’s the missing number?

12 8 14 ? 6 44

SLIDE 19

What’s the missing number?

12 8 14 4 6 44

SLIDE 20

What’s the missing number?

12 8 14 [CENSORED] 6 44

SLIDE 21

What's the missing number?

Students by Gender: 20 24
Students by Subgroup: 12 8 14 4 6 44

SLIDE 22

Lack of Complementary Suppression

Subgroup       # Tested   Advanced   Proficient   Basic   Below Basic
Subgroup 1     11         0%         45%          36%     18%
Subgroup 2     1          *          *            *       *
All Students   12         0%         42%          42%     17%

SLIDE 23

Lack of Complementary Suppression

Subgroup       # Tested   Advanced   Proficient   Basic   Below Basic
Subgroup 1     11         0%         45%          36%     18%
Subgroup 2     1          *          *            *       *
All Students   12         0%         42%          42%     17%

SLIDE 24

Lack of Complementary Suppression

Subgroup       # Tested   Advanced   Proficient   Basic   Below Basic
Subgroup 1     11         0%         45%          36%     18%
Subgroup 2     1          *          *            100%    *
All Students   12         0%         42%          42%     17%
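The recovery is mechanical. A short sketch using only the published rows shows that even rounded percentages pin down the underlying counts, and the "protected" row falls out by subtraction (the helper name is illustrative):

```python
# A sketch of recovering a suppressed row by subtraction from the published
# rows above; counts_from_percents is an illustrative brute-force helper.
from itertools import product

def counts_from_percents(n, percents):
    """Find the integer counts (summing to n) whose rounded percentages
    match a published row."""
    for combo in product(range(n + 1), repeat=len(percents)):
        if sum(combo) == n and all(
            round(100 * c / n) == p for c, p in zip(combo, percents)
        ):
            return combo
    return None

# Published rows (Advanced, Proficient, Basic, Below Basic):
sub1 = counts_from_percents(11, [0, 45, 36, 18])          # Subgroup 1
all_students = counts_from_percents(12, [0, 42, 42, 17])  # All Students

# The suppressed Subgroup 2 row falls out by subtraction:
sub2 = tuple(a - s for a, s in zip(all_students, sub1))
print(sub2)  # (0, 0, 1, 0): the single Subgroup 2 student scored Basic
```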

SLIDE 25

The Trouble with Cell Size Rules

Remember: It's not just the small cells that are important. Bigger cells/values can still be disclosive if:

  • they are extreme values (e.g., ~0% or ~100% of students in a group), or
  • they can be used to calculate the values of protected cells elsewhere (in the same table, or even in another data release!)

SLIDE 26

Take Home Point: Consider All Reporting Levels

Education data are often reported in a multi-dimensional structure. To be effective, a disclosure avoidance methodology must consider all levels of aggregation.

SLIDE 27

Lack of Complementary Suppression

Subgroup       # Tested   Advanced   Proficient   Basic   Below Basic
Subgroup 1     11         0%         45%          36%     18%
Subgroup 2     1          *          *            100%    *
All Students   12         0%         42%          42%     17%

SLIDE 28

Complementary Suppression

Levels of aggregation: Subgroup → Grade → School → District → State → National

When using suppression to protect privacy, consider all the ways that data are aggregated. If a cell is suppressed (primary or complementary) at one level, it needs to be suppressed in at least one other reporting entity at the next level of aggregation. And make sure that those additional entities have proper complementary suppression too!
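Within a single published row with a known total, the complementary step can be sketched as follows. This is a simplified illustration, not a production algorithm; suppressing the smallest remaining cell is one common heuristic, and real implementations must repeat the check across every related table and level:

```python
# A sketch of complementary suppression within one published row whose
# total is also published; names and the heuristic are illustrative.

def complementary_suppress(cells, primary):
    """Suppress the primary cells plus, if needed, one more cell so the
    masked values cannot be recovered from the published total."""
    masked = dict(cells)
    for key in primary:
        masked[key] = "*"
    visible = [k for k, v in masked.items() if v != "*"]
    # With a published total, a single suppressed cell is recoverable by
    # subtraction, so also suppress the smallest visible cell.
    if visible and len(visible) == len(cells) - 1:
        masked[min(visible, key=lambda k: cells[k])] = "*"
    return masked

row = {"Subgroup 1": 12, "Subgroup 2": 8, "Subgroup 3": 14,
       "Subgroup 4": 4, "Subgroup 5": 6}        # published total: 44
masked = complementary_suppress(row, primary=["Subgroup 4"])
```

With only Subgroup 4 masked, a reader could compute 44 − 12 − 8 − 14 − 6 = 4; masking a second cell leaves two unknowns and one equation.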

SLIDE 29

Take Home Point: Data Releases by Others

When performing a disclosure risk analysis, you must consider data releases made by other organizations.

How schools, districts, states, and the Federal government release the same (or related) data may impact the re-identifiability of the data you (or they) release!

SLIDE 30

So What Are Your Options?

The 3 “Flavors” of Disclosure Avoidance Techniques:

  • Suppression
  • “Blurring”
  • Perturbation

SLIDE 31

Suppression

Definition:
Removing data to prevent the identification of individuals in small cells or with unique characteristics

Examples:
  • Cell Suppression
  • Row Suppression
  • Sampling

Effect on Data Utility:
  • Results in very little data being produced for small populations
  • Requires suppression of additional, non-sensitive data (e.g., complementary suppression)

Residual Risk of Disclosure:
  • Suppression can be difficult to perform correctly (especially for large multi-dimensional tables)
  • If additional data are available elsewhere, the suppressed data may be re-calculated

SLIDE 32

“Blurring”

Definition:
Reducing the precision of data that is presented to reduce the certainty of identification

Examples:
  • Aggregation
  • Percents
  • Ranges
  • Top/Bottom-Coding
  • Rounding

Effect on Data Utility:
  • Users cannot make inferences about small changes in the data
  • Reduces the ability to perform time-series or cross-case analysis

Residual Risk of Disclosure:
  • Generally low risk, but if row/column totals are published (or available elsewhere) then it may be possible to calculate the actual values of sensitive cells

SLIDE 33

Perturbation

Definition:
Making small changes to the data to prevent identification of individuals from unique or rare characteristics

Examples:
  • Data Swapping
  • Noise
  • Synthetic Data
  • Differential Privacy

Effect on Data Utility:
  • Can minimize loss of utility compared to other methods
  • May be seen as inappropriate for program data because it reduces the transparency and credibility of the data, which can have enforcement and regulatory implications

Residual Risk of Disclosure:
  • If someone has access to some (e.g., a single state's) original data, they may be able to reverse-engineer the perturbation rules used to alter the rest of the data
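A toy illustration of the "Noise" example: adding symmetric, Laplace-style noise to a count. The noise mechanism, scale, and clamping below are assumptions for illustration, not the Department's method (and clamping/rounding choices matter for formal privacy guarantees):

```python
# A toy sketch of perturbation via random noise. The Laplace-style noise
# (exponential magnitude with a random sign), the scale, and the clamping
# are illustrative assumptions, not an official mechanism.
import random

def perturb_count(true_count, scale=2.0, seed=None):
    """Add symmetric noise to a count, clamp at zero, and round back to an
    integer so exact small counts are never published."""
    rng = random.Random(seed)
    noise = rng.expovariate(1 / scale) * rng.choice([-1, 1])
    return max(0, round(true_count + noise))

noisy = [perturb_count(12, seed=s) for s in range(5)]
```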

SLIDE 34

Take Home Point: The method you choose matters!

There are many different types of disclosure avoidance methods, and limitless variations in how to apply them. But, each method impacts the usability of the data in different ways. The first question to ask when selecting a method should always be:

“How will these data be used?”

SLIDE 35

Some tips to consider:

  • You don't have to limit your plan to a single method – you can adopt multiple methods that complement each other (e.g., suppression and top/bottom coding)
  • If using suppression, be especially aware of row/column totals and related tables – complementary suppression will most likely be necessary
  • Don't rely on suppressing underlying population/subgroup counts – they are often known or can be inferred
  • When reporting in percentages, round to whole numbers whenever possible
  • Be especially careful with individual-level data – you will probably need to use some amount of perturbation!
  • Be sure to audit your results
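The audit step in the last tip can start as simply as re-scanning the published table for rule violations. A minimal sketch with illustrative field names (a real audit would also check totals, complements, and related releases):

```python
# A tiny audit sketch: scan a published table for cells that slipped
# below the minimum n-size rule; field names are illustrative.

def audit_min_n(rows, min_n=5):
    """Return the subgroups whose published counts violate min_n."""
    return [r["subgroup"] for r in rows
            if isinstance(r["n_tested"], int) and r["n_tested"] < min_n]

violations = audit_min_n([
    {"subgroup": "A", "n_tested": 42},
    {"subgroup": "B", "n_tested": 3},   # should have been suppressed
])
# violations == ["B"]
```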

SLIDE 36

It's all about risk

“The release of any data usually entails at least some element of risk. A decision to eliminate all risk of disclosure would curtail [data] releases drastically, if not completely. Thus, for any proposed release of [data] the acceptability of the level of risk of disclosure must be evaluated.”

Federal Committee on Statistical Methodology, “Statistical Working Paper #2”

SLIDE 37

Privacy Technical Assistance Center (PTAC) Resources

http://studentprivacy.ed.gov

  • Issue Briefs
  • Checklists
  • FAQs
  • Case Studies
  • Webinars
  • Etc.

Help Desk: PrivacyTA@ed.gov
On-site Assistance (site visits, trainings, etc.)

Selected PTAC Resources on Disclosure Avoidance:
  • Frequently Asked Questions—Disclosure Avoidance
  • Data De-identification: An Overview of Basic Terms
  • Case Study #5: Minimizing PII Access

SLIDE 38

Questions?