United States Department of Education
Student Privacy Policy and Assistance Division

De-Identifying Education Data
Future of Privacy Forum Webinar
October 13, 2017

MICHAEL HAWES
Director of Student Privacy Policy
U.S. Department of Education
A one-handed pirate, with an irrational fear of crocodiles and ticking clocks. (No name is given, yet the description identifies one individual with reasonable certainty.)
FERPA’s definition of personally identifiable information (PII) covers:
- Direct identifiers (1:1 relationship to student)
- Indirect identifiers (1:Many relationship to student)
- Other information that, alone or in combination, is “linked or linkable to a specific student that would allow a reasonable person in the school community, who does not have personal knowledge of the relevant circumstances, to identify the student with reasonable certainty.” (§ 99.3)
Can a “reasonable person” in the school community re-identify the individual with reasonable certainty?

Tabular Data: A small degree of uncertainty (“reasonable doubt”) is often sufficient. [e.g., “the rule of 3”]

Individual-level Data: The abundance of data points for each individual, and the availability of (and ability to link to) external data sources, make the risk of re-identification much higher.
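To make the “rule of 3” concrete, here is a minimal sketch in Python (function and variable names are hypothetical, not from any official ED tool) that suppresses any cell in a count table falling below a minimum cell size:

```python
# Minimal sketch: suppress table cells below a minimum n-size ("rule of 3").
MIN_CELL_SIZE = 3

def suppress_small_cells(table):
    """Replace any count below MIN_CELL_SIZE with '*'."""
    return [
        ["*" if cell < MIN_CELL_SIZE else str(cell) for cell in row]
        for row in table
    ]

counts = [[12, 2], [7, 5]]
print(suppress_small_cells(counts))  # [['12', '*'], ['7', '5']]
```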
Aggregate data tables can still contain PII if they report information on small groups or on individuals with unique or uncommon characteristics.
[Chart: State Adoption of Sub-Group Suppression Rules (circa 2012) – number of States adopting each suppression rule (n < X)]
Under ESEA, States adopted minimum n-size rules to protect student privacy in aggregate reports, but there is substantial variation across States in the minimum n-size selected. There is also substantial variation in how States have interpreted and implemented those minimum n-size requirements.
BUT, suppressing the small cells may not be sufficient
Assume a minimum n-size rule of 5:

Subgroup     # Tested   # Proficient   % Proficient
Subgroup 1   6          1              16.7%
What if I’m that 1 proficient student? I now know something about the other 5!
Assume a minimum n-size rule of 5, with the proficient count suppressed and the percentage bottom-coded:

Subgroup     # Tested   # Proficient   % Proficient
Subgroup 1   8          *              <5%
But the bottom-coding still leaks: with 8 students tested, the only possible values are 0/8 = 0% and 1/8 = 12.5%. So “<5%” of 8 students can only mean 0 students!
Number of Students (denominator)   Top/Bottom Coding for Percentages
1-5                                Suppressed
6-15                               <50%, ≥50%
16-30                              ≤20%, ≥80%
31-60                              ≤10%, ≥90%
61-300                             ≤5%, ≥95%
301-3,000                          ≤1%, ≥99%
3,001 or more                      ≤0.1%, ≥99.9%
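As an illustration, a minimal Python sketch of how a reporting pipeline might apply a coding schedule like the one above (the function name and boundary conventions are assumptions; actual state rules vary):

```python
# Sketch: apply a top/bottom-coding schedule like the table above.
SCHEDULE = [
    # (min denominator, max denominator, bottom cutoff %, top cutoff %)
    (6, 15, 50.0, 50.0),
    (16, 30, 20.0, 80.0),
    (31, 60, 10.0, 90.0),
    (61, 300, 5.0, 95.0),
    (301, 3000, 1.0, 99.0),
    (3001, float("inf"), 0.1, 99.9),
]

def code_percentage(numerator, denominator):
    """Return a privacy-protective display value for a percentage."""
    if denominator <= 5:
        return "*"  # groups of 1-5 students are suppressed outright
    pct = 100.0 * numerator / denominator
    for lo_n, hi_n, bottom, top in SCHEDULE:
        if lo_n <= denominator <= hi_n:
            if pct < bottom:
                return f"<{bottom:g}%"
            if pct >= top:
                return f">={top:g}%"
            return f"{pct:.0f}%"

print(code_percentage(1, 8))    # '<50%'
print(code_percentage(58, 60))  # '>=90%'
```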
[Charts: Students by Gender; Students by Subgroup]
Subgroup       # Tested   Advanced   Proficient   Basic   Below Basic
Subgroup 1     11         0%         45%          36%     18%
Subgroup 2     1          *          *            *       *
All Students   12         0%         42%          42%     17%
Subgroup       # Tested   Advanced   Proficient   Basic   Below Basic
Subgroup 1     11         0%         45%          36%     18%
Subgroup 2     1          *          *            100%    *
All Students   12         0%         42%          42%     17%

The “All Students” row gives the suppressed values away: 42% of 12 is 5 Basic students, and 36% of 11 is 4, so the single Subgroup 2 student must be the fifth – 100% Basic.
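The arithmetic above can be checked mechanically. A minimal Python sketch (illustrative names) that recovers the suppressed Subgroup 2 row from the published percentages:

```python
# Sketch: recover a suppressed row from published marginals.
# Counts come from rounding percentage * denominator to the nearest student.

def recover_counts(pcts, n_tested):
    return [round(p / 100.0 * n_tested) for p in pcts]

all_students = recover_counts([0, 42, 42, 17], 12)  # [0, 5, 5, 2]
subgroup_1 = recover_counts([0, 45, 36, 18], 11)    # [0, 5, 4, 2]

# Subtracting the rows reveals the single Subgroup 2 student's level:
subgroup_2 = [a - b for a, b in zip(all_students, subgroup_1)]
print(subgroup_2)  # [0, 0, 1, 0] -> the student scored Basic (100%)
```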
Remember: it’s not just the small cells that are important. Bigger cells/values can still be disclosive if they can be combined with cells elsewhere (in the same table, or even in another data release!) to recover protected values.
Education data are often reported in a multi-dimensional structure. To be effective, a disclosure avoidance methodology must consider all levels of aggregation.
Levels of aggregation: Subgroup → Grade → School → District → State → National
When using suppression to protect privacy, consider all the ways that data are aggregated. If a cell is suppressed (primary or complementary) at one level, it needs to be suppressed in at least one other reporting entity at the next level of aggregation. And make sure that those additional entities have proper complementary suppression too!
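As a sketch of such a consistency check (the data model and all names here are hypothetical):

```python
# Sketch: flag subgroup cells suppressed at exactly one school in a district.
# If only one school suppresses a cell, the district total reveals its value.

from collections import defaultdict

# (district, school, subgroup) -> published value; "*" means suppressed
published = {
    ("District A", "School 1", "Subgroup 2"): "*",
    ("District A", "School 2", "Subgroup 2"): 14,
    ("District A", "School 3", "Subgroup 2"): 9,
}

suppressed = defaultdict(list)
for (district, school, subgroup), value in published.items():
    if value == "*":
        suppressed[(district, subgroup)].append(school)

for (district, subgroup), schools in suppressed.items():
    if len(schools) == 1:
        print(f"{district} / {subgroup}: only {schools[0]} is suppressed, "
              "so its value can be recovered from the district total; "
              "apply complementary suppression to another school.")
```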
When performing a disclosure risk analysis, you must consider data releases made by other parties. How schools, districts, states, and the Federal government release the same (or related) data may impact the re-identifiability of the data you (or they) release!
The 3 “Flavors” of Disclosure Avoidance Techniques: Suppression, Blurring, and Perturbation
Suppression

Definition: Removing data to prevent the identification of individuals in small cells or with unique characteristics.

Examples: Cell Suppression, Row Suppression, Sampling

Effect on Data Utility:
- Results in very little data being produced for small populations
- Requires suppression of additional, non-sensitive data (e.g., complementary suppression)

Residual Risk of Disclosure:
- Suppression can be difficult to perform correctly (especially for large multi-dimensional tables)
- If additional data are available elsewhere, the suppressed data may be re-calculated
Blurring

Definition: Reducing the precision of the data that are presented, to reduce the certainty of identification.

Examples: Aggregation, Percents, Ranges, Top/Bottom-Coding, Rounding

Effect on Data Utility:
- Users cannot make inferences about small changes in the data
- Reduces the ability to perform time-series or cross-case analysis

Residual Risk of Disclosure:
- Generally low risk, but if row/column totals are published (or available elsewhere) then it may be possible to calculate the actual values of sensitive cells
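A minimal sketch of blurring by ranges and rounding (the bin edges and names are illustrative assumptions):

```python
# Sketch: blur exact counts into coarse ranges and round percentages.

def blur_count(n):
    """Report a count only as a coarse range."""
    for lo, hi, label in [(0, 0, "0"), (1, 5, "1-5"),
                          (6, 20, "6-20"), (21, 100, "21-100")]:
        if lo <= n <= hi:
            return label
    return ">100"

def blur_percentage(pct):
    """Round a percentage to the nearest 5 points."""
    return 5 * round(pct / 5)

print(blur_count(4), blur_percentage(16.7))  # 1-5 15
```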
Perturbation

Definition: Making small changes to the data to prevent identification of individuals from unique or rare characteristics.

Examples: Data Swapping, Noise, Synthetic Data, Differential Privacy

Effect on Data Utility:
- Can minimize loss of utility compared to other methods
- May be seen as inappropriate for program data because it reduces the transparency and credibility of the reported figures (and may have regulatory implications)

Residual Risk of Disclosure:
- If someone has access to some of the unaltered data (e.g., a single state’s), they may be able to infer the perturbation rules used to alter the rest of the data
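For the noise flavor, a minimal sketch of the classic Laplace mechanism from differential privacy, applied to a student count (the epsilon value and names are illustrative; this is not a production DP implementation):

```python
# Sketch: Laplace mechanism for a count -- the basic differential privacy idea.
# The sensitivity of a count is 1 (one student changes it by at most 1).

import numpy as np

def noisy_count(true_count, epsilon=1.0):
    scale = 1.0 / epsilon                 # sensitivity / epsilon
    noise = np.random.laplace(0.0, scale)
    # Rounding and clamping are post-processing; they preserve the guarantee
    return max(0, round(true_count + noise))

print(noisy_count(42))  # e.g., 41 or 43 -- varies from run to run
```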
There are many different types of disclosure avoidance methods, and limitless variations in how to apply them. But each method impacts the usability of the data in different ways. The first question to ask when selecting a method should always be:
“How will these data be used?”
Some practical guidance:
- You can adopt multiple methods that complement each other (e.g., suppression and top/bottom coding) – see the sketch below
- If you publish totals and related tables, complementary suppression will most likely be necessary
- Remember the underlying counts – they are often known or can be inferred
- … whenever possible
- … you will probably need to use some amount of perturbation!
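A minimal sketch of the first point above, combining a minimum n-size suppression rule with top/bottom coding (thresholds and names are hypothetical):

```python
# Sketch: combine suppression (small n) with top/bottom coding (extreme %).

def protect(numerator, denominator):
    if denominator < 5:        # suppress very small groups outright
        return "*"
    pct = 100.0 * numerator / denominator
    if pct <= 5.0:             # bottom-code extreme lows
        return "<=5%"
    if pct >= 95.0:            # top-code extreme highs
        return ">=95%"
    return f"{pct:.0f}%"

print(protect(1, 3), protect(0, 40), protect(18, 40))  # * <=5% 45%
```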
“The release of any data usually entails at least some element of disclosure risk. Eliminating any risk of disclosure would curtail [data] releases drastically, if not entirely. Thus, for each proposed release of [data], the acceptability of the level of risk must be evaluated.”

– Federal Committee on Statistical Methodology, “Statistical Working Paper #2”
http://studentprivacy.ed.gov
Help Desk: PrivacyTA@ed.gov
On-site Assistance (site visits, trainings, etc.)

Selected PTAC Resources on Disclosure Avoidance:
- Frequently Asked Questions – Disclosure Avoidance
- Data De-identification: An Overview of Basic Terms
- Case Study #5: Minimizing PII Access