PrivPy: Scalable and General Privacy-Preserving Data Mining Yi Li , - PowerPoint PPT Presentation

PrivPy: Scalable and General Privacy-Preserving Data Mining
Yi Li ∗, Yitao Duan †, Yu Yu §, Shouyao Zhao §, Wei Xu ∗
∗ Institute for Interdisciplinary Information Sciences, Tsinghua University
† NetEase Youdao
§ Shanghai Jiaotong University

Making use of data vs. data privacy
[Figure: data as an asset on one side, privacy compliance on the other]

Scenario 1: Multi-source data mining
• Private inputs of data owners
• Compute servers see nothing
• Parties get nothing other than the final results

Scenario 2: Inference with secret models and data
• Inputs: a private model and private data. Output: the inference result.
• Similar setting to federated learning, but here we also want to protect the model itself.

A nice theory provides a solution
• Secure multi-party computation (MPC): parties holding inputs $x_1, x_2, \ldots, x_n$ jointly compute $y = F(x_1, x_2, \ldots, x_n)$.
• We can compute any function $F()$ without revealing the inputs $x_i$.
• No noise is introduced in the computation, and nothing else is revealed.

Tons of cryptography-based solutions tell us …
• Many novel theoretical solutions
  - Secret Sharing (Shamir 1979)
  - Garbled Circuit (Yao 1986)
  - Fully Homomorphic Encryption (Gentry 2009)
• Even many "practical" solutions exist
  - Sharemind (2008)
  - TASTY (2010)
  - PICCO (2013)
  - SPDZ (2012)
  - SecureML (2017)
  - ABY3 (2018)
• But why are people still not using them to mine real-world data?

The gap between cryptography and data science

The Cryptography World | The Data Science World
Efficient bit-wise and integer operations | Efficient operations on real numbers
Fast single-number arithmetic | Fast vector and array operations
Theoretically innovative | Scalable system implementation
A custom and beautiful programming language | Familiar language with rich algorithm libraries

The gap is like that between a set of data structures and a relational database.

PrivPy attempts to bridge the gap
• A fast (4,2)-secret-sharing protocol and engine
• Python language with an automatic code optimizer
• NumPy types and libraries
• Runs non-trivial algorithms on real data
[Figure: PrivPy architecture. Front-end: convenient APIs, language interpreter, optimizer. Back-end: computation engines.]

Crypto preliminary: basic secret sharing
- Two semi-honest servers: $S_1$ and $S_2$
- A large (e.g. 256-bit) number $p$
- Computation in the field $\mathbb{Z}_p = \{0, 1, \ldots, p-1\}$
- A secret $u$ is split as $u = u_1 + u_2 \pmod{p}$, with $S_1$ holding $u_1$ and $S_2$ holding $u_2$
  - $u_1$: uniformly distributed in $\mathbb{Z}_p$
  - $u_2 := u - u_1 \pmod{p}$
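The two-server additive sharing described on this slide can be sketched in a few lines of plain Python. This is only an illustration: the modulus below (the prime 2^255 - 19) and the function names are assumptions chosen for the example, not parameters or APIs taken from PrivPy.

```python
import secrets

# Illustrative prime modulus of roughly 256 bits (an assumption, not PrivPy's parameter).
MODULUS = 2**255 - 19


def share(secret):
    """Split a secret into two additive shares modulo MODULUS."""
    first = secrets.randbelow(MODULUS)   # uniform in {0, ..., MODULUS - 1}
    second = (secret - first) % MODULUS  # chosen so that first + second = secret (mod MODULUS)
    return first, second


def reconstruct(first, second):
    """Recombine the two shares into the original secret."""
    return (first + second) % MODULUS


def add_shares(a, b):
    """Addition is local: each server adds the shares it already holds."""
    return (a[0] + b[0]) % MODULUS, (a[1] + b[1]) % MODULUS
```

Each share on its own is uniformly random, so a single server learns nothing about the secret; addition needs no communication, which is why multiplication is the operation that drives protocol design.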

Multiplication: our (4,2)-secret-sharing scheme
• Two auxiliary servers $S_a$ and $S_b$ compute the cross terms.
• Benefit: only one round of communication for multiplication.
[Figure: servers $S_1$, $S_2$, $S_a$ and $S_b$, each holding shares (and primed copies of shares) of $u$ and $v$, compute the four cross terms and produce the result shares.]
$w = u \times v = u_1 v_1 + u_1 v_2 + u_2 v_1 + u_2 v_2$
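The algebra behind the cross terms can be checked with a small plaintext simulation. The sketch below is not the (4,2) protocol itself: it computes all four partial products in one place, whereas in a real protocol no single server holds both shares of a value and the partial results are re-randomised before being exchanged. The modulus and function names are assumptions for illustration.

```python
import secrets

# Illustrative prime modulus (an assumption, not PrivPy's parameter).
MODULUS = 2**255 - 19


def share(secret):
    """Split a secret into two additive shares modulo MODULUS."""
    first = secrets.randbelow(MODULUS)
    return first, (secret - first) % MODULUS


def cross_terms(a_shares, b_shares):
    """Return the four partial products of two shared values."""
    a1, a2 = a_shares
    b1, b2 = b_shares
    return [
        (a1 * b1) % MODULUS,
        (a1 * b2) % MODULUS,
        (a2 * b1) % MODULUS,
        (a2 * b2) % MODULUS,
    ]


def product_from_terms(terms):
    """Summing the cross terms yields the product of the two secrets."""
    return sum(terms) % MODULUS
```

The terms `a1 * b2` and `a2 * b1` mix shares held by different servers, which is the part a two-server scheme cannot compute locally and the reason the slide introduces auxiliary servers.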

Using fixed-point to represent real numbers
Example: 010010011100100.11011001001…
• Integer part 010010011100100: fixed length $l - k$
• Decimal part 11011001001: fixed length $k$
Two approaches in existing systems:
• Use expensive bit-level operations: PICCO, Sharemind, SPDZ, etc.
• Support built-in fixed-point operations: SecureML, ABY3, PrivPy
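A fixed-point encoding of this kind can be sketched on plain integers as follows. The number of fractional bits is a hypothetical choice for the example, and the code works on plaintext only: on secret-shared values the truncation after a multiplication needs its own protocol step, which this sketch does not model.

```python
# Hypothetical number of bits for the fractional part (not a PrivPy parameter).
FRACTION_BITS = 16
SCALE = 1 << FRACTION_BITS


def encode(real_value):
    """Map a real number to an integer with FRACTION_BITS fractional bits."""
    return round(real_value * SCALE)


def decode(fixed_value):
    """Map a fixed-point integer back to a float."""
    return fixed_value / SCALE


def fixed_add(a, b):
    """Addition works directly on the integer encodings."""
    return a + b


def fixed_mul(a, b):
    """The raw product carries 2 * FRACTION_BITS fractional bits, so truncate once."""
    return (a * b) >> FRACTION_BITS
```

For instance, `decode(fixed_mul(encode(1.5), encode(2.25)))` evaluates to `3.375`, because both operands are exactly representable with 16 fractional bits.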

The PrivPy computation engine
[Figure: clients $C_1, \ldots, C_k, \ldots, C_n$ holding inputs $x$, $y$, $z$ submit a task config (Python code, data source address, result address) to four servers $S_1$, $S_2$, $S_a$ and $S_b$. Each server runs an engine with private operations (PO) and a secret-share (SS) store.]

The PrivPy computation engine
[Figure: each client splits its input into shares ($x_1, x_2$; $y_1, y_2$; $z_1, z_2$) and sends them to the servers. The engines run the private-operation (PO) protocols over their SS stores and return result shares that satisfy $res_1 + res_2 = res$.]

Python-compatible programming front-end
• Overload basic operations for private variables: +, -, ×, >, etc.

Most existing solutions define their own language
• Examples: PICCO, Obliv-C, SPDZ
• Why? Code written directly in Python has many pitfalls that result in inefficiency.

AST-level code optimization to avoid pitfalls
• Common factor extraction
• Auto vectorization
• More optimizations are still being added to the language front-end.

APIs: from basic OPs to algorithms
• Basic OPs: Add, Mul, Cmp, operating on secret shares $SS(d) = (d_1, d_2)$
• Derived OPs: Division, Sigmoid function, ReLU
  - Division: Newton-Raphson method
  - Sigmoid: Euler method
  - ReLU: comparison
  - Other functions: $e^x$, $\log(x)$, …
[Figure labels: Add, Mul, Cmp, Garbled circuit; Division, Sigmoid function, ReLU]
Sigmoid via the Euler method:
$y(x) = \frac{1}{1 + e^{-x}}$
$y'(x) = y(x)\,(1 - y(x))$
$y(x_{i+1}) = y(x_i) + y'(x_i)\,\Delta x = y(x_i) + y(x_i)\,(1 - y(x_i))\,\Delta x$
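To make the idea of derived operations concrete, here is a plaintext sketch that builds division, sigmoid and ReLU from addition, multiplication and comparison only. It runs on ordinary Python floats, and the iteration counts, step count and initial guess are assumptions for the example rather than PrivPy's settings. With the initial guess of 1.0, the Newton-Raphson reciprocal converges only for divisors in (0, 2), so a real implementation would need to scale its input first.

```python
def reciprocal_newton(divisor, iterations=10, guess=1.0):
    """Approximate 1/divisor with Newton-Raphson, using only add and mul.

    Converges when 0 < guess < 2/divisor; with guess=1.0 that means 0 < divisor < 2.
    """
    y = guess
    for _ in range(iterations):
        y = y * (2.0 - divisor * y)
    return y


def divide(numerator, divisor, iterations=10):
    """Division as multiplication by an approximate reciprocal."""
    return numerator * reciprocal_newton(divisor, iterations)


def sigmoid_euler(x, steps=1000):
    """Integrate y' = y * (1 - y) from x = 0 (where y = 0.5) with Euler steps."""
    y = 0.5
    dx = x / steps
    for _ in range(steps):
        y = y + y * (1.0 - y) * dx
    return y


def relu(x):
    """ReLU as one comparison followed by one multiplication."""
    return x * (1.0 if x > 0 else 0.0)
```

The Euler approximation trades accuracy for simplicity: more steps give a closer result at the cost of more multiplications, each of which would be a private operation in a secret-shared setting.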

APIs: arrays are first-class citizens
• Array is a built-in type
  - A = pp.sarr(…); B = pp.sarr(…)
  - Both A * B and A + B work
• The array type is essential for data mining: it reduces the number of ops, and thus the number of communication rounds
• Supports large arrays (e.g. 1 million × 5000, ~200 GB) using automatic disk buffer management
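The claim that array operations reduce the number of rounds can be illustrated with a toy cost model. The `RoundCounter` class below is hypothetical and stands in for a private engine under the simplifying assumption that every multiply call costs one communication round, regardless of how many elements it covers; it uses plain NumPy and does not reflect PrivPy's actual engine.

```python
import numpy as np


class RoundCounter:
    """Toy stand-in for a private engine: each mul call counts as one round."""

    def __init__(self):
        self.rounds = 0

    def mul(self, a, b):
        self.rounds += 1
        return np.multiply(a, b)


def multiply_elementwise_loop(engine, a, b):
    """One engine call per element: the number of rounds grows with the array size."""
    return np.array([engine.mul(x, y) for x, y in zip(a, b)])


def multiply_vectorized(engine, a, b):
    """A single engine call for the whole array: one round."""
    return engine.mul(a, b)
```

Under this model, multiplying two length-5 arrays costs 5 rounds in the loop version and 1 round in the vectorized version, with identical results.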

Beyond arrays: NumPy's broadcasting and ndarray
• Allow operations between arrays of different shapes
  - E.g. a scalar $x$, a 3 × 4 array $A$ and a 2 × 3 × 4 array $B$
  - $x + A$, $A * B$ and $x > B$ all work
  - Can even mix plaintext and ciphertext
• Ndarray methods, e.g. $y = f(w^{\top} \cdot X + b)$
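The broadcasting rules referred to here are NumPy's own, so they can be shown with plain NumPy arrays. The sketch below uses ordinary plaintext arrays with made-up values; it illustrates the shapes involved and makes no claim about how PrivPy implements the same semantics for private arrays. The helper name `dense_layer` and the default activation are assumptions for the example.

```python
import numpy as np

x = 2.0                              # a scalar
A = np.arange(12.0).reshape(3, 4)    # a 3 x 4 array
B = np.ones((2, 3, 4))               # a 2 x 3 x 4 array

scalar_plus_array = x + A            # scalar broadcast over A, shape (3, 4)
array_times_array = A * B            # A broadcast along the leading axis of B, shape (2, 3, 4)
scalar_compare = x > B               # element-wise comparison, boolean array of shape (2, 3, 4)


def dense_layer(w, X, b, f=np.tanh):
    """Compute y = f(w^T . X + b) for a weight vector w and a data matrix X."""
    return f(w.T @ X + b)
```

With `w` of shape `(4,)` and `X` of shape `(4, 5)`, `dense_layer(w, X, 0.0)` returns one output per column of `X`, that is, an array of shape `(5,)`.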

API example: neural network inference
[Figure: a private image and a private model are fed to the PrivPy engine, which returns the inference result.]

Basic operation performance
Throughput of basic operations (ops per second), LAN (10 Gbps)

Engine | Approach | Decimal multiplication | Comparison
PrivPy | SS | 10,473,532 | 1,282,027
HElib | FHE | 258 | -
Obliv-C | GC | 3,930 | 78,431
P4P+HE | SS+HE | 4,344 | -
SPDZ | SS with active security | 83,073 | 20,472
SPDZ+PrivPy (our thin wrapper) | SS with active security | 83,229 | 20,320

Real-world algorithm performance
Dataset: MNIST with 70,000 labeled handwritten digits
Algorithms:
• Logistic Regression (LR): trained using SGD
• Matrix Factorization (MF): decomposes an $m \times n$ matrix into an $m \times 5$ matrix and a $5 \times n$ matrix
• CNN: LeNet-5

Time of training/inference for 1 iteration (seconds)

Batch size | LAN (10 Gbps): LR training | LAN: MF training | LAN: CNN inference | WAN (50 Mbps): LR training | WAN: MF training | WAN: CNN inference
Single op | 5.3e-3 | 7.1e-3 | 9.6e-2 | 2.61 | 0.37 | 7.64
Batch (1000 ops) | 3.92 | 5.67 | 12.02 | 7.3 | 13.2 | 56.3

Conclusion and future work
• MPC can be useful in data mining, but there is a big gap to bridge
• PrivPy is an early attempt to make MPC practical for large datasets
  - Language, data types, function libraries
  - Scalable and efficient system implementation
  - Relies heavily on language-level optimizations
• PrivPy is an ongoing effort
  - Integrating with other privacy-preserving techniques: differential privacy, federated learning, trusted execution, etc.
  - More libraries, algorithms and compiler optimizations
Wei Xu: http://iiis.tsinghua.edu.cn/~weixu
Yi Li: xiaolixiaoyi@gmail.com
