PrivPy: Scalable and General Privacy-Preserving Data Mining



SLIDE 1

PrivPy: Scalable and General Privacy-Preserving Data Mining

Yi Li∗, Yitao Duan†, Yu Yu§, Shouyao Zhao§, Wei Xu∗

∗ Institute for Interdisciplinary Information Sciences, Tsinghua University
† NetEase Youdao
§ Shanghai Jiaotong University

SLIDE 2

Making use of data vs. data privacy

[Figure: privacy compliance vs. data as an asset]

SLIDE 3

Scenario 1: Multi-source data mining

  • Private inputs of data owners
  • Compute servers see nothing
  • Get nothing other than the final results

SLIDE 4

Scenario 2: Inference w/ secret models and data

[Figure: private data and a private model go in; the inference result comes out]

The setting is similar to federated learning, but here we also want to protect the model itself.

SLIDE 5

A nice theory provides a solution

  • Secure multi-party computation (MPC)

[Figure: parties hold inputs x1, x2, …, xn and jointly compute y = F(x1, x2, …, xn)]

  • We can compute any function F() without revealing the inputs xi.
  • No noise is introduced in the computation, and nothing beyond the result y is revealed.

SLIDE 6

Tons of cryptography-based solutions tell us …

Many novel theoretical solutions:

  • Secret Sharing (Shamir 1979)
  • Garbled Circuit (Yao 1986)
  • Fully Homomorphic Encryption (Gentry 2009)

Even many “practical” solutions exist:

  • Sharemind (2008)
  • TASTY (2010)
  • PICCO (2013)
  • SPDZ (2012)
  • SecureML (2017)
  • ABY3 (2018)

But why are people still not using them to mine real-world data?

SLIDE 7

The gap between cryptography and data science

The Cryptography World

  • Efficient bit-wise and integer operations
  • Fast single number arithmetic
  • Theoretically innovative
  • A custom and beautiful programming language

The Data Science World

  • Efficient operations on real numbers
  • Fast vector and array operations
  • Scalable system implementation
  • Familiar language with rich algorithm libraries

The gap is like a set of data structures vs. a relational database.

SLIDE 8

PrivPy attempts to bridge the gap

  • A fast (4,2)-secret-sharing protocol and engine
  • Python language with automatic code optimizer
  • NumPy types and libraries
  • Runs non-trivial algorithms on real data

[Figure: PrivPy architecture. Front-end: language, convenient APIs, interpreter, optimizer. Back-end: computation engines.]

SLIDE 9

Crypto preliminary: basic secret sharing

  • Two semi-honest servers: S1 and S2
  • A large (e.g. 256-bit) number $p$
  • Computation in the field $\mathbb{Z}_p = \{0, 1, \ldots, p-1\}$
  • Sharing: $\varphi(u) = (u_1, u_2)$, where $u_1$ is uniformly distributed in $\mathbb{Z}_p$ and $u_2 = u - u_1 \pmod{p}$
  • Reconstruction: $u_1 + u_2 = u \pmod{p}$
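The sharing rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not PrivPy's implementation: the function names and the concrete modulus (the Mersenne prime 2**127 - 1) are assumptions chosen for the example; the slide only says the modulus is large.

```python
import secrets

# Illustrative modulus (assumption): the Mersenne prime 2**127 - 1.
MODULUS = 2**127 - 1

def share(value, modulus=MODULUS):
    """Split value into two additive shares modulo `modulus`."""
    first = secrets.randbelow(modulus)   # uniform in {0, ..., modulus - 1}
    second = (value - first) % modulus   # chosen so the shares sum to value
    return first, second

def reconstruct(first, second, modulus=MODULUS):
    """Recombine two shares into the original value."""
    return (first + second) % modulus

def add_shares(a, b, modulus=MODULUS):
    """Each server adds the shares it holds; no communication is needed."""
    return (a[0] + b[0]) % modulus, (a[1] + b[1]) % modulus

if __name__ == "__main__":
    x, y = share(20), share(22)
    print(reconstruct(*add_shares(x, y)))  # 42
```

Each share on its own is a uniformly random number, so a single server learns nothing about the value it holds a share of. Addition needs no interaction, which is why multiplication is the operation that the next slide focuses on.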

SLIDE 10

Multiplication: our (4,2)-secret-sharing scheme

[Figure: each value $u$ is shared twice, as $(u_1, u_2)$ and as $(u_1', u_2')$, and likewise for $v$. The four servers S1, S2, Sa and Sb each hold two of these shares. The figure also shows intermediate values $t$.]

$$w = u \times v = u_1 v_1' + u_2 v_2' + u_2 v_1' + u_1 v_2'$$

  • Two auxiliary servers Sa and Sb compute the cross terms
  • Benefit: only one round of communication for ×
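The algebra behind the four-server product can be checked with a short simulation. This is only a sketch of the identity, under stated assumptions: the assignment of shares to servers in the comment is hypothetical, the modulus is an illustrative choice, and the masking randomness and re-sharing that a real protocol needs (so that individual product terms leak nothing) are omitted.

```python
import secrets

MODULUS = 2**127 - 1  # illustrative prime modulus (assumption)

def additive_shares(value, modulus=MODULUS):
    """Return one fresh additive sharing (first, second) of value."""
    first = secrets.randbelow(modulus)
    return first, (value - first) % modulus

def multiply(x, y, modulus=MODULUS):
    """Return the four local product terms, one per (hypothetical) server."""
    x1, x2 = additive_shares(x, modulus)     # plain sharing of x
    y1p, y2p = additive_shares(y, modulus)   # second ("primed") sharing of y
    # Assumed placement: S1 holds (x1, y1'), S2 holds (x2, y2'),
    # Sa holds (x1, y2') and Sb holds (x2, y1').
    return {
        "S1": x1 * y1p % modulus,
        "S2": x2 * y2p % modulus,
        "Sa": x1 * y2p % modulus,   # cross term
        "Sb": x2 * y1p % modulus,   # cross term
    }

def reveal(terms, modulus=MODULUS):
    """Sum the four terms; equals x * y modulo the modulus."""
    return sum(terms.values()) % modulus

if __name__ == "__main__":
    print(reveal(multiply(6, 7)))  # 42
```

Because every server already holds both factors of its term, each product is computed locally, which is consistent with the slide's claim that multiplication costs a single communication round.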

SLIDE 11

Using fixed-point to represent real numbers

010010011100100.11011001001…

  • Fixed-length $(l - k)$-bit integer part: 010010011100100
  • Fixed-length $k$-bit decimal part: 11011001001

  • Use expensive bit-level operations: PICCO, Sharemind, SPDZ, etc.
  • Support built-in fixed-point operations: SecureML, ABY3, PrivPy
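A fixed-point encoding of the kind shown on this slide can be sketched on plaintext integers as follows. The number of fractional bits, the ring size and all function names are assumptions for illustration; how PrivPy performs the truncation on secret shares is not described in the slides and is not shown here.

```python
FRAC_BITS = 16            # bits after the binary point (illustrative choice)
MODULUS = 2**64           # size of the integer ring (assumption)
SCALE = 1 << FRAC_BITS

def encode(real):
    """Map a real number to a ring element with FRAC_BITS fractional bits."""
    return round(real * SCALE) % MODULUS

def to_signed(element):
    """Interpret the upper half of the ring as negative numbers."""
    return element - MODULUS if element >= MODULUS // 2 else element

def decode(element):
    """Map a ring element back to a Python float."""
    return to_signed(element) / SCALE

def fixed_mul(a, b):
    """Multiply two encodings, then drop the extra FRAC_BITS of scale."""
    product = to_signed(a) * to_signed(b)      # carries 2 * FRAC_BITS of scale
    return (product >> FRAC_BITS) % MODULUS    # truncate back to FRAC_BITS

if __name__ == "__main__":
    print(decode(fixed_mul(encode(1.5), encode(-2.25))))  # -3.375
```

The truncation step after every multiplication is the part that is expensive when done with bit-level operations, which is why built-in fixed-point support matters for performance.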

SLIDE 12

The PrivPy computation engine

[Figure: clients C1, …, Ck, …, Cn hold private inputs x, y, z, …; servers S1, S2, Sa and Sb each run an SS Store (1, 2, a, b) and a PO Engine. A task config specifies the Python code, the data source address and the result address.]

SLIDE 13

The PrivPy computation engine

[Figure: clients C1, …, Cn send shares of their inputs (x1, x2, y1, y2, z1, z2) to the servers S1, S2, Sa and Sb, each of which runs an SS Store and a PO Engine. The servers execute the private-ops protocols and produce result shares res1 and res2.]

res1 + res2 = res

SLIDE 14

Python-compatible programming front-end

  • Overload basic operations for private variables: +, -, ×, >, etc.
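Operator overloading of this kind can be illustrated with a toy class. `PrivateValue` is a hypothetical name, not PrivPy's API, and both shares live in one process purely for demonstration. Only the operations that need no communication are shown (addition, subtraction, and multiplication by a public constant); private × private and comparison need a protocol round and are deliberately left out.

```python
import secrets

MODULUS = 2**127 - 1  # illustrative prime modulus (assumption)

class PrivateValue:
    """Toy private variable holding two additive shares (hypothetical class)."""

    def __init__(self, share1, share2):
        self.shares = (share1 % MODULUS, share2 % MODULUS)

    @classmethod
    def from_plain(cls, value):
        first = secrets.randbelow(MODULUS)
        return cls(first, value - first)

    def __add__(self, other):
        if isinstance(other, PrivateValue):      # private + private: share-wise
            return PrivateValue(self.shares[0] + other.shares[0],
                                self.shares[1] + other.shares[1])
        # private + public: add the constant to one share only
        return PrivateValue(self.shares[0] + other, self.shares[1])

    __radd__ = __add__

    def __neg__(self):
        return PrivateValue(-self.shares[0], -self.shares[1])

    def __sub__(self, other):
        return self + (-other)

    def __mul__(self, public):                   # private * public constant only
        if isinstance(public, PrivateValue):
            raise NotImplementedError("private * private needs a protocol round")
        return PrivateValue(self.shares[0] * public, self.shares[1] * public)

    __rmul__ = __mul__

    def reveal(self):
        return sum(self.shares) % MODULUS

if __name__ == "__main__":
    a, b = PrivateValue.from_plain(10), PrivateValue.from_plain(4)
    print((3 * a + b - 2).reveal())  # 32
```

The expression `3 * a + b - 2` reads like ordinary Python, and it also mixes plaintext constants with private values.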

SLIDE 15

Most existing solutions define their own language

[Figure: code samples in PICCO, Obliv-C and SPDZ]

Why? Code written naively in Python has many pitfalls that result in inefficiency.

SLIDE 16
AST-level code optimization to avoid pitfalls

  • Common factor extraction
  • Auto vectorization

Still adding more optimizations to the language front-end.
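To make the idea of an AST-level rewrite concrete, here is a small sketch using Python's standard `ast` module (Python 3.9+ for `ast.unparse`). It rewrites `a*b + a*c` into `a*(b + c)`, which saves one multiplication. The class and function names are hypothetical, and this is not PrivPy's optimizer; it only recognizes a shared left factor.

```python
import ast

class CommonFactor(ast.NodeTransformer):
    """Rewrite `a*b + a*c` into `a*(b + c)` when the left factors match."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # optimize sub-expressions first
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.BinOp)
                and isinstance(node.right, ast.BinOp)
                and isinstance(node.left.op, ast.Mult)
                and isinstance(node.right.op, ast.Mult)
                and ast.dump(node.left.left) == ast.dump(node.right.left)):
            inner = ast.BinOp(left=node.left.right, op=ast.Add(),
                              right=node.right.right)
            return ast.BinOp(left=node.left.left, op=ast.Mult(), right=inner)
        return node

def optimize(source):
    """Parse source, apply the rewrite, and return the new source text."""
    tree = CommonFactor().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

def count_mults(source):
    """Count multiplication operators in the source."""
    return sum(isinstance(n, ast.Mult) for n in ast.walk(ast.parse(source)))

if __name__ == "__main__":
    print(optimize("y = a * b + a * c"))  # y = a * (b + c)
```

Since multiplication is the operation that requires communication between servers, removing one multiplication removes one protocol invocation.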

SLIDE 17

APIs: from basic OPs to algorithms

[Figure: a secret-shared value SS(d) = (d1, d2) feeds the basic OPs (Add, Mul, Cmp); derived OPs (Division, Sigmoid function, ReLU) are built on top of them; a garbled-circuit component is also shown.]

  • Division: Newton-Raphson method
  • Sigmoid: Euler method
  • ReLU: comparison
  • Other functions: e^x, log(x), …
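As an illustration of deriving division from basic OPs, the sketch below computes a reciprocal with Newton-Raphson using only additions and multiplications. It runs on plain floats and assumes the divisor has already been scaled into [0.5, 1]; how PrivPy normalizes the divisor and picks its starting point is not stated in the slides, so the initial guess here is the standard textbook one, not necessarily PrivPy's.

```python
def reciprocal(b, iterations=5):
    """Approximate 1/b for 0.5 <= b <= 1 using only + and * on b."""
    r = 48 / 17 - (32 / 17) * b          # linear initial guess on [0.5, 1]
    for _ in range(iterations):
        r = r * (2 - b * r)              # Newton-Raphson step for f(r) = 1/r - b
    return r

def divide(a, b, iterations=5):
    """Compute a / b as a * (1/b)."""
    return a * reciprocal(b, iterations)

if __name__ == "__main__":
    print(divide(3.0, 0.75))
```

Each iteration costs two multiplications and roughly doubles the number of correct digits, so a handful of rounds is enough for fixed-point precision.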

Sigmoid via the Euler method:

$$y(x) = \frac{1}{1 + e^{-x}}$$

$$y'(x) = y(x)\,(1 - y(x))$$

$$y(x_{i+1}) = y(x_i) + y'(x_i)\,\Delta x = y(x_i) + y(x_i)\,(1 - y(x_i))\,\Delta x$$
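The Euler recurrence for the sigmoid can be tried out on plain floats. This is a minimal sketch: the starting point y(0) = 0.5 and the step count are choices made for the example, and the function name is hypothetical.

```python
import math

def sigmoid_euler(x, steps=1000):
    """Integrate y' = y(1 - y) from y(0) = 0.5 to x with fixed Euler steps."""
    y = 0.5                      # exact value of the sigmoid at 0
    dx = x / steps               # negative when x < 0, so we step backwards
    for _ in range(steps):
        y = y + y * (1 - y) * dx
    return y

if __name__ == "__main__":
    x = 1.0
    print(sigmoid_euler(x), 1 / (1 + math.exp(-x)))
```

Every step uses only additions and multiplications, so the exponential never has to be evaluated on private data. Accuracy is traded against the number of steps, that is, against the number of multiplications.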

SLIDE 18

APIs: arrays are first-class citizens

  • Array is a built-in type
      • A = pp.sarr(…); B = pp.sarr(…)
      • Both A * B and A + B work
  • Array type is essential for data mining: it reduces the number of ops, and thus the number of rounds
  • Supports large arrays (e.g. 1 million × 5000, ~200GB) using automatic disk buffer management

SLIDE 19

Beyond arrays: NumPy’s broadcasting and ndarray

  • Allow operations between arrays of different shapes
      • E.g. a scalar x, a 3 × 4 array A and a 2 × 3 × 4 array B
      • x + A, A * B and x > B all work
      • Can even mix plaintext and ciphertext
  • ndarray methods

$$y = f(w^\top \cdot X + b)$$
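The shape rule behind these examples is NumPy's broadcasting rule: shapes are aligned on their trailing dimensions, and two dimensions are compatible when they are equal or one of them is 1. The helper below is a standard-library sketch of that rule only (the function name is hypothetical); it computes result shapes and does not touch any data.

```python
from itertools import zip_longest

def broadcast_shape(*shapes):
    """Return the result shape under NumPy-style broadcasting."""
    result = []
    # Align shapes on their trailing dimensions; missing dimensions count as 1.
    for dims in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        sizes = {d for d in dims if d != 1}
        if len(sizes) > 1:
            raise ValueError(f"shapes {shapes} cannot be broadcast together")
        result.append(sizes.pop() if sizes else 1)
    return tuple(reversed(result))

if __name__ == "__main__":
    scalar, a, b = (), (3, 4), (2, 3, 4)
    print(broadcast_shape(scalar, a))  # (3, 4)
    print(broadcast_shape(a, b))       # (2, 3, 4)
    print(broadcast_shape(scalar, b))  # (2, 3, 4)
```

These three calls correspond to the scalar-plus-array, array-times-array and scalar-compared-with-array cases in the slide's example.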

SLIDE 20

API example: neural network inference

[Figure: a model and an image are fed to the PrivPy engine, which outputs the inference result]

SLIDE 21

Basic operation performance

Throughput of basic operations (ops per second), LAN (10Gbps)

Engine        Approach                  Decimal multiplication   Comparison
PrivPy        SS                        10,473,532               1,282,027
HElib         FHE                       258                      -
Obliv-C       GC                        3,930                    78,431
P4P+HE        SS+HE                     4,344                    -
SPDZ          SS with active security   83,073                   20,472
SPDZ+PrivPy   SS with active security   83,229                   20,320

SPDZ+PrivPy: our thin wrapper

SLIDE 22

Real world algorithm performance

Dataset: MNIST with 70,000 labeled handwritten digits

Algorithms:

  • Logistic Regression (LR): trained using SGD
  • Matrix Factorization (MF): decomposes an m × n matrix into an m × 5 matrix and a 5 × n matrix
  • CNN: LeNet-5

Time of training/inference for 1 iteration (seconds)

                   LAN (10Gbps)                                WAN (50Mbps)
Batch size         LR training   MF training   CNN inference   LR training   MF training   CNN inference
Single op          5.3e-3        7.1e-3        9.6e-2          2.61          0.37          7.64
Batch (1000 ops)   3.92          5.67          12.02           7.3           13.2          56.3

SLIDE 23

Conclusion and future work

  • MPC can be useful in data mining, but there is a big gap to bridge
  • PrivPy is an early attempt to make MPC practical for large datasets
      • Language, data types, function libraries
      • Scalable and efficient system implementation
      • Relies heavily on language-level optimizations
  • PrivPy is an ongoing effort
      • Integrating with other privacy-preserving techniques: differential privacy, federated learning, trusted execution, etc.
      • More libraries, algorithms and compiler optimizations

Wei Xu: http://iiis.tsinghua.edu.cn/~weixu
Yi Li: xiaolixiaoyi@gmail.com