Differential Privacy Tabular Data Li Xiong Outline Tabular data - PowerPoint PPT Presentation

CS573 Data Privacy and Security Differential Privacy – Tabular Data Li Xiong

Outline • Tabular data and histogram/range queries • Algorithms for low dimensional data • Algorithms for high dimensional data

Example: statistics/synthetic data for medical records • Histograms • Cohort discovery: range queries – Select COUNT(*) from D Where A1 in I1 and A2 in I2 and … and Am in Im.
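The cohort-discovery query above can be sketched in a few lines of Python. This is a minimal, non-private illustration; the record layout, attribute names, and values are made up for the example, and intervals are assumed to be closed.

```python
from typing import Dict, List, Tuple

Record = Dict[str, float]
Interval = Tuple[float, float]  # closed interval [low, high] (assumption)


def range_count(records: List[Record], ranges: Dict[str, Interval]) -> int:
    """Count records whose value falls in the interval for every listed attribute."""
    return sum(
        all(lo <= rec[attr] <= hi for attr, (lo, hi) in ranges.items())
        for rec in records
    )


# Toy records with made-up values.
records = [
    {"age": 42, "income": 30_000},
    {"age": 31, "income": 60_000},
    {"age": 28, "income": 20_000},
]

# SELECT COUNT(*) FROM D WHERE age IN [25, 35] AND income IN [0, 50000]
cohort_size = range_count(records, {"age": (25, 35), "income": (0, 50_000)})
```

The exact count is what a differentially private algorithm has to approximate; the methods in the rest of the deck differ in how they add noise so that such counts stay useful.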

Example: statistical agencies: data publishing • A marginal over attributes A_1, …, A_k reports the count for each combination of attribute values. – aka cube, contingency table – E.g. 2-way marginal on EmploymentStatus and Gender • U.S. Census Bureau statistics can typically be derived from k-way marginals over different combinations of available attributes • Hundreds of marginals released https://factfinder.census.gov/ Module 3 Tutorial: Differential Privacy in the Wild
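As a small illustration of what a marginal is, the sketch below computes a (non-private) 2-way marginal with the standard library. The records and attribute values are invented for the example; only the attribute names come from the slide.

```python
from collections import Counter
from typing import Dict, List, Sequence


def marginal(records: List[Dict[str, str]], attrs: Sequence[str]) -> Counter:
    """Count records for each combination of values of the chosen attributes."""
    return Counter(tuple(rec[a] for a in attrs) for rec in records)


# Made-up records, for illustration only.
people = [
    {"EmploymentStatus": "Employed", "Gender": "F", "Region": "North"},
    {"EmploymentStatus": "Employed", "Gender": "F", "Region": "South"},
    {"EmploymentStatus": "Employed", "Gender": "M", "Region": "North"},
    {"EmploymentStatus": "Unemployed", "Gender": "M", "Region": "South"},
]

two_way = marginal(people, ("EmploymentStatus", "Gender"))
```

Combinations that never occur are simply absent from the `Counter` and read back as 0; a private release would have to add noise to those zero cells as well.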

Example: range queries over spatial data • Input: sensitive data D – BeijingTaxi dataset [1]: 4,268,780 records of (lat, lon) pairs of taxi pickup locations in Beijing, China in 1 month • Input: range query workload W – Shown is a workload of 3 range queries • Task: compute answers to workload W over private input D [Figure: scatter plot of input data] [1] Raw data from: Taxi trajectory open dataset, Tsinghua University, China. http://sensor.ee.tsinghua.edu.cn, 2009.

Problem variant: offline vs. online • Offline (batch): – Entire W given as input, answers computed in batch • Online (adaptive): – W is a sequence q_1, q_2, … that arrives online – Adaptive: analyst's choice for q_i can depend on answers a_1, …, a_{i−1}

Important aspects of problem: Data and query complexity • Data complexity – Dimensionality: number of attributes – Domain size: number of distinct attribute combinations – Many techniques specialized for low dimensional data • Query complexity – Given query workload vs. no query workload – Classes of queries: histograms, count queries, linear queries (sum, average), median …

Solution variants: query answers vs. synthetic data Two high-level approaches to solving problem 1. Direct: – Output of the algorithm is list of query answers 2. Synthetic data: – Algorithm constructs a synthetic dataset D’, which can be queried directly by analyst – Analyst can pose additional queries on D’ (though answers may not be accurate)

Categories of Methods • Nonparametric methods – release empirical distributions, i.e. histograms with differential privacy • Parametric methods – learn parameters of a distribution with differential privacy

Categories of Methods • Semi-parametric methods – DP marginal histograms (non-parametric) – Model dependence between attributes (parametric)

Outline • Tabular data and histogram/range queries • Algorithms for low dimensional data – Baseline – Partitioning algorithms: kd tree, quad tree, … – Transformation: Wavelet, Fourier Transform, … • Algorithms for high dimensional data

Baseline algorithm 1. Discretize attribute domain into cells 2. Add noise to cell counts (Laplace mechanism) – unit histogram 3. Use noisy counts to either… 1. Answer queries directly (assume distribution is uniform within cell) 2. Generate synthetic data (derive distribution from counts and sample) Limitations • Granularity of discretization – Coarse: detail lost – Fine: noise overwhelms signal • Noise accumulates: squared error grows linearly with range [Figure: scatter plot of input data]
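The three baseline steps can be sketched for a single numeric attribute as follows. This is a minimal illustration under stated assumptions: all values lie in [lo, hi], neighboring datasets differ by adding or removing one record (so the histogram has sensitivity 1), and the function names are hypothetical. Laplace noise is drawn as the difference of two exponential samples, using only the standard library.

```python
import random
from typing import List, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_unit_histogram(values: Sequence[float], lo: float, hi: float,
                         n_cells: int, epsilon: float,
                         rng: random.Random) -> List[float]:
    """Steps 1-2: discretize [lo, hi] into equal cells and add Laplace(1/epsilon) noise."""
    width = (hi - lo) / n_cells
    counts = [0] * n_cells
    for v in values:  # assumes lo <= v <= hi
        counts[min(int((v - lo) / width), n_cells - 1)] += 1
    return [c + laplace_noise(1.0 / epsilon, rng) for c in counts]


def answer_range(noisy: Sequence[float], lo: float, hi: float,
                 q_lo: float, q_hi: float) -> float:
    """Step 3.1: answer a range count, assuming data is uniform within each cell."""
    width = (hi - lo) / len(noisy)
    total = 0.0
    for i, c in enumerate(noisy):
        cell_lo = lo + i * width
        overlap = max(0.0, min(q_hi, cell_lo + width) - max(q_lo, cell_lo))
        total += c * overlap / width
    return total


def sample_synthetic(noisy: Sequence[float], lo: float, hi: float,
                     n: int, rng: random.Random) -> List[float]:
    """Step 3.2: clip negative counts to zero, pick cells in proportion, sample uniformly inside."""
    width = (hi - lo) / len(noisy)
    weights = [max(c, 0.0) for c in noisy]
    if sum(weights) <= 0:
        raise ValueError("all noisy counts are non-positive")
    cells = rng.choices(range(len(noisy)), weights=weights, k=n)
    return [lo + (i + rng.random()) * width for i in cells]
```

A query that covers r cells sums r independent Laplace(1/ε) noises, each with variance 2/ε², so its noise variance is 2r/ε². That is the "squared error grows linearly with range" limitation listed on the slide.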

DPCube: An early attempt [SDM 2010, ICDE 2012 demo] • Domain-based partitioning does not work very well – Equi-width: equal bucket range – Uniformity assumption • Data-driven partitioning – V-optimal: with the least frequency variance – Intuition: highest uniformity within each bucket September 20, 2016

Histograms (review) • Divide data into buckets and store average (sum) for each bucket • Partitioning rules: – Equi-width: equal bucket range – Equi-depth: equal frequency – V-optimal: with the least frequency variance
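To make the V-optimal rule concrete, the sketch below finds the split of a frequency vector into k contiguous buckets that minimizes the total within-bucket squared deviation from the bucket mean, using textbook dynamic programming. It is a non-private illustration of the partitioning rule only, not the DPCube procedure; the function name is hypothetical and k is assumed to be at most the number of cells.

```python
from typing import List, Sequence


def v_optimal_boundaries(freqs: Sequence[float], k: int) -> List[int]:
    """Return k+1 boundaries b_0=0 < ... < b_k=len(freqs); bucket j covers cells b_j..b_{j+1}-1."""
    n = len(freqs)
    prefix, prefix_sq = [0.0], [0.0]
    for f in freqs:
        prefix.append(prefix[-1] + f)
        prefix_sq.append(prefix_sq[-1] + f * f)

    def sse(i: int, j: int) -> float:
        # Sum of squared deviations from the mean over cells i..j-1.
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[b][j]: best total SSE when the first j cells are split into b buckets.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for b in range(1, k + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = cost[b - 1][i] + sse(i, j)
                if c < cost[b][j]:
                    cost[b][j], back[b][j] = c, i

    # Walk the back-pointers from the last bucket to the first.
    bounds, j = [n], n
    for b in range(k, 0, -1):
        j = back[b][j]
        bounds.append(j)
    return bounds[::-1]


boundaries = v_optimal_boundaries([1, 1, 1, 10, 10, 10], 2)
```

On this toy input the split falls between the three low-frequency cells and the three high-frequency cells, which is the "highest uniformity within each bucket" intuition. An equi-width split into two buckets would land in the same place here only because the data happens to be symmetric.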

DPCube [SecureDM 2010, ICDE 2012 demo] Original Records (Name, Age, Income, HIV+): Frank, 42, 30K, Y; Bob, 31, 60K, Y; Mary, 28, 20K, Y; … [Figure: Original Records → (ε/2-DP) DP unit Histogram → Multi-dimensional partitioning → (ε/2-DP) DP V-optimal Histogram → DP Interface] • Compute unit histogram counts with differential privacy • Use DP unit histogram for partitioning • Compute V-optimal histogram counts with differential privacy

Use kd-tree for partitioning • Choose dimension and splitting point to split (minimize variance) • Repeat until: – Count of this node less than threshold – Variance or entropy of this node less than threshold

DPCube [SecureDM 2010, ICDE 2012 demo] • Limitations: – DP unit histogram very noisy – Affects the accuracy of partitioning

Private Spatial decompositions [CPSSY 12] • Examples: quadtree, kd-tree • Build: partitioning with differential privacy • Release: a private description of data distribution (in the form of bounding boxes and noisy counts)

Building a Private kd-tree • Process to build a private kd-tree – Input: maximum height h, minimum leaf size L, data set – Choose dimension to split – Get (private) median in this dimension – Create child nodes and add noise to the counts – Recurse until: max height is reached, noisy count of this node is less than L, or budget along the root-leaf path has been used up • The entire PSD satisfies DP by the composition property
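The build process above might look as follows for 2-D points. This is a sketch under several assumptions that are not from the slides: the split dimension simply alternates with depth, the private median is chosen by the exponential mechanism over a fixed grid of candidate split points (utility = negative distance of the candidate's rank from n/2, sensitivity at most 1), the budget is split uniformly over levels, and each node spends half its share on the count and half on the median. All names are hypothetical.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[Tuple[float, float], Tuple[float, float]]  # ((x_lo, x_hi), (y_lo, y_hi))


@dataclass
class Node:
    box: Box
    noisy_count: float
    children: List["Node"] = field(default_factory=list)


def laplace_noise(scale: float, rng: random.Random) -> float:
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_median(coords: Sequence[float], lo: float, hi: float,
                   epsilon: float, rng: random.Random, n_candidates: int = 32) -> float:
    """Exponential mechanism over data-independent candidates strictly inside (lo, hi)."""
    cands = [lo + (hi - lo) * (i + 1) / (n_candidates + 1) for i in range(n_candidates)]
    half = len(coords) / 2
    utils = [-abs(sum(1 for c in coords if c < s) - half) for s in cands]
    top = max(utils)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp(epsilon * (u - top) / 2.0) for u in utils]
    return rng.choices(cands, weights=weights, k=1)[0]


def _build(points: List[Point], box: Box, depth: int, h: int,
           min_leaf: float, eps_node: float, rng: random.Random) -> Node:
    # Half of this node's budget goes to the count: Laplace scale 1 / (eps_node / 2).
    noisy = len(points) + laplace_noise(2.0 / eps_node, rng)
    node = Node(box, noisy)
    if depth == h or noisy < min_leaf:
        return node
    dim = depth % 2
    lo, hi = box[dim]
    # The other half goes to the private median.
    split = private_median([p[dim] for p in points], lo, hi, eps_node / 2.0, rng)
    halves = (
        ([p for p in points if p[dim] < split], (lo, split)),
        ([p for p in points if p[dim] >= split], (split, hi)),
    )
    for pts, span in halves:
        child_box = tuple(span if d == dim else box[d] for d in range(2))
        node.children.append(_build(pts, child_box, depth + 1, h, min_leaf, eps_node, rng))
    return node


def private_kd_tree(points: List[Point], box: Box, h: int, min_leaf: float,
                    epsilon: float, rng: random.Random) -> Node:
    """Uniform split: each of the h+1 levels on a root-leaf path gets epsilon / (h + 1)."""
    return _build(points, box, 0, h, min_leaf, epsilon / (h + 1), rng)
```

Nodes at the same depth cover disjoint regions, so they can each use the full per-level share, while the shares along any root-leaf path add up to at most ε. Only the boxes and noisy counts in the returned tree are meant to be released.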

Building PSDs – privacy budget allocation • Budget is split between medians and counts at each node – Trades off accuracy of division against accuracy of counts • Budget is split across levels of the tree – Privacy budget used along any root-leaf path should total ε (sequential composition) – Nodes at the same level cover disjoint regions, so each can use that level's full share (parallel composition) – Optimal budget allocation – Post-processing with consistency check
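Two simple ways to divide ε across the levels of a tree of height h are sketched below. Both are illustrative: the function names are hypothetical, and the geometric `ratio` is left as a free parameter rather than set to any value claimed to be optimal.

```python
from typing import List


def uniform_budget(epsilon: float, height: int) -> List[float]:
    """Give each of the height+1 levels (root = index 0) the same share."""
    return [epsilon / (height + 1)] * (height + 1)


def geometric_budget(epsilon: float, height: int, ratio: float) -> List[float]:
    """Grow the share by `ratio` per level from root to leaves; shares still sum to epsilon."""
    weights = [ratio ** d for d in range(height + 1)]
    total = sum(weights)
    return [epsilon * w / total for w in weights]


def split_node_budget(eps_level: float, median_fraction: float) -> tuple:
    """Split one level's share between the private median and the noisy count."""
    return eps_level * median_fraction, eps_level * (1.0 - median_fraction)


levels = geometric_budget(1.0, 4, 1.5)
```

With a ratio above 1, deeper levels receive more budget and therefore less noise. Because the shares along a root-leaf path compose sequentially, any allocation whose entries sum to ε keeps the whole tree ε-differentially private.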

Data Transformations • Can think of trees as a 'data-dependent' transform of input • Can apply other data transformations • General idea: – Apply transform to data – Add noise in the transformed space (based on sensitivity) – Publish noisy coefficients, or invert transform (post-processing) • Goal: pick a transform that preserves good properties of data – And which has low sensitivity, so noise does not corrupt the signal [Figure: Original Data → Transform → Coefficients → add Noise → Noisy Coefficients → Invert → Private Data]

Linear transformations • Approach – Discretize domain to finest granularity cells – Use Laplace mechanism to answer batch of queries, each of which is linear combination of cell counts • Examples – Hierarchical: Trees [HRMS10, QYL13], full height quadtree [CPSSY12] – Haar Wavelet [XWG10] – Discrete Fourier transform [BCDKMT07] • Inverting transformation – Some transformations (e.g. tree) have redundancy (over-constrained), so require pseudo-inverse
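The hierarchical example can be sketched as a binary tree of interval counts over the unit cells. This is a minimal illustration, assuming the number of cells is a power of two and that neighboring datasets differ by one added or removed record; it skips the pseudo-inverse (consistency) step and just answers a range from the fewest noisy nodes. Function names are hypothetical.

```python
import random
from typing import List, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def build_tree(counts: Sequence[float]) -> List[List[float]]:
    """levels[0] holds the cell counts; each higher level sums adjacent pairs, up to the root."""
    levels = [list(counts)]  # assumes len(counts) is a power of two
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels


def add_noise(levels: List[List[float]], epsilon: float,
              rng: random.Random) -> List[List[float]]:
    # One record changes exactly one node per level, so the L1 sensitivity
    # of the whole tree equals the number of levels.
    scale = len(levels) / epsilon
    return [[c + laplace_noise(scale, rng) for c in lvl] for lvl in levels]


def range_sum(levels: List[List[float]], lo: int, hi: int) -> float:
    """Sum of cells lo..hi-1 using the minimal set of tree nodes covering the range."""
    total, level = 0.0, 0
    while lo < hi:
        if lo % 2 == 1:  # left edge is a right child: take it and move inward
            total += levels[level][lo]
            lo += 1
        if hi % 2 == 1:  # right edge ends on a left child: take it and move inward
            hi -= 1
            total += levels[level][hi]
        lo, hi, level = lo // 2, hi // 2, level + 1
    return total


exact = build_tree([1, 2, 3, 4, 5, 6, 7, 8])
noisy = add_noise(exact, epsilon=1.0, rng=random.Random(0))
```

Each node carries more noise than a cell of the flat histogram would, but any range is covered by a number of nodes that grows only logarithmically with the number of cells, so long ranges accumulate far less error than summing individual noisy cells.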

Lossy transformations • Variants – Drop “small” coefficients: • Quad-tree with early stopping (noisy count threshold) • Fourier coefficients: EFPA [ACC12], [RN10] – Data-adaptive discretization: • PrivTree [ZXX16], KD-Tree [CPSSY12], DAWA [LHMY14], [DNRR15], [QYL13], [BLR08] – Data-adaptive measurement: • MWEM [HLM12], DualQuery [GAHRW14] – Randomized transforms: sketches and compressed sensing • JL Transform [BBDS12], Compressive mechanism [LZWY11] • “Inverting” transformation – Because they are lossy, they are under-constrained and require estimation • Error rates depend on input – Can be much lower (trades off small bias for lower variance) – Warrants careful empirical evaluation; algorithms are “data dependent”

[HMMCZ16] Empirical benchmarks • [HMMCZ16] propose a novel framework for standardized evaluation of privacy algorithms. • Study of algorithms for range query answering over 1D and 2D data • Benchmark website www.dpcomp.org • One finding from [HMMCZ16]: some data-dependent algorithms fail to offer benefits at larger scales (no. of tuples)

Outline • Tabular data and histogram/range queries • Algorithms for low dimensional data – Baseline – Partitioning algorithms: kd tree, quad tree, … – Transformation: Wavelet, Fourier Transform, … • Algorithms for high dimensional data – Copula functions [LXJ14] – Bayesian networks [ZCPSX14]
