United States Department of Education
Student Privacy Policy and Assistance Division

De-Identifying Education Data
Future of Privacy Forum Webinar
October 13, 2017

MICHAEL HAWES
Director of Student Privacy Policy
U.S. Department of Education
A one-handed pirate, with an irrational fear of crocodiles and ticking clocks. (No name is given, yet the description identifies one individual with reasonable certainty.)
FERPA’s definition of personally identifiable information (PII) covers:
- Direct identifiers (1:1 relationship to student)
- Indirect identifiers (1:Many relationship to student)
- Other information that, alone or in combination, is “linked or linkable to a specific student that would allow a reasonable person in the school community, who does not have personal knowledge of the relevant circumstances, to identify the student with reasonable certainty.” (§ 99.3)
Can a “reasonable person” in the school community re-identify the individual with reasonable certainty?

Tabular Data: A small degree of uncertainty (“reasonable doubt”) is often sufficient. [e.g., “the rule of 3”]

Individual-level Data: The abundance of data points for each individual, and the availability of (and ability to link to) external data sources, make the risk of re-identification much higher.
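To make the “rule of 3” concrete, here is a minimal sketch in Python (function and variable names are hypothetical, not from any official ED tool) that suppresses any cell in a count table falling below a minimum cell size:

```python
# Minimal sketch: suppress table cells below a minimum n-size ("rule of 3").
MIN_CELL_SIZE = 3

def suppress_small_cells(table):
    """Replace any count below MIN_CELL_SIZE with '*'."""
    return [
        ["*" if cell < MIN_CELL_SIZE else str(cell) for cell in row]
        for row in table
    ]

counts = [[12, 2], [7, 5]]
print(suppress_small_cells(counts))  # [['12', '*'], ['7', '5']]
```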
Aggregate data tables can still contain PII if they report information on small groups or on individuals with unique or uncommon characteristics.
[Chart: State Adoption of Sub-Group Suppression Rules (circa 2012) – number of States adopting each suppression rule (n < X)]
Under ESEA, States adopted minimum n-size rules to protect student privacy in aggregate reports, but there is substantial variation across States in the minimum n-size selected. There is also substantial variation in how States have interpreted and implemented those minimum n-size requirements.
BUT, suppressing the small cells may not be sufficient
Assume a minimum n-size rule of 5:

Subgroup     # Tested   # Proficient   % Proficient
Subgroup 1   6          1              16.7%
What if I’m that 1 proficient student? I now know something about the other 5!
Assume a minimum n-size rule of 5, with the proficient count suppressed and the percentage bottom-coded:

Subgroup     # Tested   # Proficient   % Proficient
Subgroup 1   8          *              <5%
But the bottom-coding still leaks: with 8 students tested, the only possible values are 0/8 = 0% and 1/8 = 12.5%. So “<5%” of 8 students can only mean 0 students!
Number of Students (denominator)   Top/Bottom Coding for Percentages
1-5                                Suppressed
6-15                               <50%, ≥50%
16-30                              ≤20%, ≥80%
31-60                              ≤10%, ≥90%
61-300                             ≤5%, ≥95%
301-3,000                          ≤1%, ≥99%
3,001 or more                      ≤0.1%, ≥99.9%
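As an illustration, a minimal Python sketch of how a reporting pipeline might apply a coding schedule like the one above (the function name and boundary conventions are assumptions; actual state rules vary):

```python
# Sketch: apply a top/bottom-coding schedule like the table above.
SCHEDULE = [
    # (min denominator, max denominator, bottom cutoff %, top cutoff %)
    (6, 15, 50.0, 50.0),
    (16, 30, 20.0, 80.0),
    (31, 60, 10.0, 90.0),
    (61, 300, 5.0, 95.0),
    (301, 3000, 1.0, 99.0),
    (3001, float("inf"), 0.1, 99.9),
]

def code_percentage(numerator, denominator):
    """Return a privacy-protective display value for a percentage."""
    if denominator <= 5:
        return "*"  # groups of 1-5 students are suppressed outright
    pct = 100.0 * numerator / denominator
    for lo_n, hi_n, bottom, top in SCHEDULE:
        if lo_n <= denominator <= hi_n:
            if pct < bottom:
                return f"<{bottom:g}%"
            if pct >= top:
                return f">={top:g}%"
            return f"{pct:.0f}%"

print(code_percentage(1, 8))    # '<50%'
print(code_percentage(58, 60))  # '>=90%'
```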
[Charts: Students by Gender; Students by Subgroup]
Subgroup       # Tested   Advanced   Proficient   Basic   Below Basic
Subgroup 1     11         0%         45%          36%     18%
Subgroup 2     1          *          *            *       *
All Students   12         0%         42%          42%     17%
Subgroup       # Tested   Advanced   Proficient   Basic   Below Basic
Subgroup 1     11         0%         45%          36%     18%
Subgroup 2     1          *          *            100%    *
All Students   12         0%         42%          42%     17%

The “All Students” row gives the suppressed values away: 42% of 12 is 5 Basic students, and 36% of 11 is 4, so the single Subgroup 2 student must be the fifth – 100% Basic.
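The arithmetic above can be checked mechanically. A minimal Python sketch (illustrative names) that recovers the suppressed Subgroup 2 row from the published percentages:

```python
# Sketch: recover a suppressed row from published marginals.
# Counts come from rounding percentage * denominator to the nearest student.

def recover_counts(pcts, n_tested):
    return [round(p / 100.0 * n_tested) for p in pcts]

all_students = recover_counts([0, 42, 42, 17], 12)  # [0, 5, 5, 2]
subgroup_1 = recover_counts([0, 45, 36, 18], 11)    # [0, 5, 4, 2]

# Subtracting the rows reveals the single Subgroup 2 student's level:
subgroup_2 = [a - b for a, b in zip(all_students, subgroup_1)]
print(subgroup_2)  # [0, 0, 1, 0] -> the student scored Basic (100%)
```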
Remember: it’s not just the small cells that are important. Bigger cells/values can still be disclosive if they can be combined with cells elsewhere (in the same table, or even in another data release!) to recover protected values.
Education data are often reported in a multi-dimensional structure. To be effective, a disclosure avoidance methodology must consider all levels of aggregation.
Levels of aggregation: Subgroup → Grade → School → District → State → National
When using suppression to protect privacy, consider all the ways that data are aggregated. If a cell is suppressed (primary or complementary) at one level, it needs to be suppressed in at least one other reporting entity at the next level of aggregation. And make sure that those additional entities have proper complementary suppression too!
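As a sketch of such a consistency check (the data model and all names here are hypothetical):

```python
# Sketch: flag subgroup cells suppressed at exactly one school in a district.
# If only one school suppresses a cell, the district total reveals its value.

from collections import defaultdict

# (district, school, subgroup) -> published value; "*" means suppressed
published = {
    ("District A", "School 1", "Subgroup 2"): "*",
    ("District A", "School 2", "Subgroup 2"): 14,
    ("District A", "School 3", "Subgroup 2"): 9,
}

suppressed = defaultdict(list)
for (district, school, subgroup), value in published.items():
    if value == "*":
        suppressed[(district, subgroup)].append(school)

for (district, subgroup), schools in suppressed.items():
    if len(schools) == 1:
        print(f"{district} / {subgroup}: only {schools[0]} is suppressed, "
              "so its value can be recovered from the district total; "
              "apply complementary suppression to another school.")
```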
When performing a disclosure risk analysis, you must consider data releases made by other parties. How schools, districts, states, and the Federal government release the same (or related) data may impact the re-identifiability of the data you (or they) release!
The 3 “Flavors” of Disclosure Avoidance Techniques: Suppression, Blurring, and Perturbation
Suppression

Definition: Removing data to prevent the identification of individuals in small cells or with unique characteristics.

Examples: Cell Suppression, Row Suppression, Sampling

Effect on Data Utility:
- Results in very little data being produced for small populations
- Requires suppression of additional, non-sensitive data (e.g., complementary suppression)

Residual Risk of Disclosure:
- Suppression can be difficult to perform correctly (especially for large multi-dimensional tables)
- If additional data are available elsewhere, the suppressed data may be re-calculated
Blurring

Definition: Reducing the precision of the data that are presented, to reduce the certainty of identification.

Examples: Aggregation, Percents, Ranges, Top/Bottom-Coding, Rounding

Effect on Data Utility:
- Users cannot make inferences about small changes in the data
- Reduces the ability to perform time-series or cross-case analysis

Residual Risk of Disclosure:
- Generally low risk, but if row/column totals are published (or available elsewhere) then it may be possible to calculate the actual values of sensitive cells
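A minimal sketch of blurring by ranges and rounding (the bin edges and names are illustrative assumptions):

```python
# Sketch: blur exact counts into coarse ranges and round percentages.

def blur_count(n):
    """Report a count only as a coarse range."""
    for lo, hi, label in [(0, 0, "0"), (1, 5, "1-5"),
                          (6, 20, "6-20"), (21, 100, "21-100")]:
        if lo <= n <= hi:
            return label
    return ">100"

def blur_percentage(pct):
    """Round a percentage to the nearest 5 points."""
    return 5 * round(pct / 5)

print(blur_count(4), blur_percentage(16.7))  # 1-5 15
```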
Perturbation

Definition: Making small changes to the data to prevent identification of individuals from unique or rare characteristics.

Examples: Data Swapping, Noise, Synthetic Data, Differential Privacy

Effect on Data Utility:
- Can minimize loss of utility compared to other methods
- May be seen as inappropriate for program data because it reduces the transparency and credibility of the reported figures (and may have regulatory implications)

Residual Risk of Disclosure:
- If someone has access to some of the unaltered data (e.g., a single state’s), they may be able to infer the perturbation rules used to alter the rest of the data
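For the noise flavor, a minimal sketch of the classic Laplace mechanism from differential privacy, applied to a student count (the epsilon value and names are illustrative; this is not a production DP implementation):

```python
# Sketch: Laplace mechanism for a count -- the basic differential privacy idea.
# The sensitivity of a count is 1 (one student changes it by at most 1).

import numpy as np

def noisy_count(true_count, epsilon=1.0):
    scale = 1.0 / epsilon                 # sensitivity / epsilon
    noise = np.random.laplace(0.0, scale)
    # Rounding and clamping are post-processing; they preserve the guarantee
    return max(0, round(true_count + noise))

print(noisy_count(42))  # e.g., 41 or 43 -- varies from run to run
```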
There are many different types of disclosure avoidance methods, and limitless variations in how to apply them. But each method impacts the usability of the data in different ways. The first question to ask when selecting a method should always be:
“How will these data be used?”
Some practical guidance:
- You can adopt multiple methods that complement each other (e.g., suppression and top/bottom coding) – see the sketch below
- If you publish totals and related tables, complementary suppression will most likely be necessary
- Remember the underlying counts – they are often known or can be inferred
- … whenever possible
- … you will probably need to use some amount of perturbation!
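A minimal sketch of the first point above, combining a minimum n-size suppression rule with top/bottom coding (thresholds and names are hypothetical):

```python
# Sketch: combine suppression (small n) with top/bottom coding (extreme %).

def protect(numerator, denominator):
    if denominator < 5:        # suppress very small groups outright
        return "*"
    pct = 100.0 * numerator / denominator
    if pct <= 5.0:             # bottom-code extreme lows
        return "<=5%"
    if pct >= 95.0:            # top-code extreme highs
        return ">=95%"
    return f"{pct:.0f}%"

print(protect(1, 3), protect(0, 40), protect(18, 40))  # * <=5% 45%
```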
“The release of any data usually entails at least some element of disclosure risk. Eliminating any risk of disclosure would curtail [data] releases drastically, if not entirely. Thus, for each proposed release of [data], the acceptability of the level of risk must be evaluated.”

– Federal Committee on Statistical Methodology, “Statistical Working Paper #2”
http://studentprivacy.ed.gov
Help Desk: PrivacyTA@ed.gov
On-site Assistance (site visits, trainings, etc.)

Selected PTAC Resources on Disclosure Avoidance:
- Frequently Asked Questions – Disclosure Avoidance
- Data De-identification: An Overview of Basic Terms
- Case Study #5: Minimizing PII Access