Data Analytics WITHOUT Seeing the Data



slide-1
SLIDE 1

www.csiro.au

Data Analytics WITHOUT Seeing the Data

Max Ott … with input from the entire N1 Team max.ott@data61.csiro.au

slide-2
SLIDE 2

Challenge

Computation

Result

Confidential

Learn this! Learn NOTHING

Data Analytics Without Seeing the Data 2 |

slide-3
SLIDE 3

The Problem

How can we learn valuable insights from sensitive data from multiple organisations?

Insights

Sensitive data Sensitive data

Joint Analysis

Confidential Confidential


slide-4
SLIDE 4

Three Basic Building Blocks

  • Private computation
      • Arithmetic on encrypted numbers
  • Distributed, confidential analytics
      • Distributed algorithms, computation & protocols
  • Private Record Linkage
      • Privacy preserving record level matching


slide-5
SLIDE 5

Solution (1): Private computation

[Figure: 3 and 2 are each encrypted (E) into very long integers, truncated on the slide. The encrypted values are combined with an encrypted-domain "+" into another long integer. Decrypting that result (D) gives 5, the same as 3 + 2.]



slide-7
SLIDE 7

Solution (2): Distributed analytics

[Diagram: Dept 1 and Dept 2 each keep their Data and Compute inside their own confidentiality boundary. An N1 Coordinator, running N1 secure compute, exchanges messages containing encrypted data with them.]

Data always remains confidential to the source institution


slide-8
SLIDE 8

Solution (3): Private Record Linkage

Dataset A Dataset B

Tori Mckone 7/06/1921 F Tori Mackon 6/07/1921 F Victoria Mckon 7/06/1921 F

? ?


slide-9
SLIDE 9

Solution (3): Private Record Linkage

Name        | Hash     | Hash     | Name
Jane Doe    | a8bf342  | a8bf242  | Janet Doe
Paul Doe    | f72630b  | b3894f3  | Bob Doe
Jim Clark   | 14ce54   | 14ce54   | Jim Clark
Kate Clark  | a72bef4  | 672bef4  | Kat Clark
Shan Bo     | 7830530  | 7830530  | Shan Bo
Reg Pal     | 4bf6021  | 80ac364  | Joe Smith

One-way hash functions are applied on each side; Fuzzy Matching is done on the hashes.


slide-10
SLIDE 10

Use Cases

slide-11
SLIDE 11

Scoring

Model

Own Data Other Data

Quality

??


slide-12
SLIDE 12

Suspicious Activities

Need to report?

Model Builder


slide-13
SLIDE 13

Industry using Gov Data

Model Builder Own Data Gov Data


slide-14
SLIDE 14

Benchmarking

Own Data Model Builder


slide-15
SLIDE 15

Device Analytics


Model of normal behaviour

OK OK NG OK

Private Modeling

learn deploy

OK NG OK


slide-16
SLIDE 16

Private Computation

slide-17
SLIDE 17

Homomorphic encryption

Type                            | What it allows
Partial Homomorphic Encryption  | Allows either addition or multiplication of encrypted numbers
Somewhat Homomorphic Encryption | Allows evaluation of low order polynomials
Fully Homomorphic Encryption    | Allows evaluation of arbitrary functions

More general: towards Fully Homomorphic Encryption. Faster: towards Partial Homomorphic Encryption.


slide-18
SLIDE 18

Paillier Encryption

Encryption of m:

$c = g^{m} r^{n} \bmod n^{2}$

Addition of encrypted numbers:

$D\left(E(m_1) \cdot E(m_2) \bmod n^{2}\right) = m_1 + m_2 \bmod n$

Multiplication of encrypted number by a scalar:

$D\left(E(m_1)^{m_2} \bmod n^{2}\right) = m_1 m_2 \bmod n$


slide-19
SLIDE 19

Paillier Encryption

Encryption of m:

$c = g^{m} r^{n} \bmod n^{2}$

Addition of encrypted numbers:

$g^{m_1} \times g^{m_2} = g^{m_1 + m_2}$

Multiplication of encrypted number by a scalar:

$\left(g^{m_1}\right)^{m_2} = g^{m_1 m_2}$
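
The two Paillier identities can be checked numerically with a toy implementation. The sketch below is illustrative only and is not the N1 code: the primes 293 and 433, the choice g = n + 1, and all function names are assumptions made for this example, and keys this small offer no security.

```python
import math
import random


def generate_keypair(p=293, q=433):
    """Toy key generation. p and q are tiny demo primes (assumption), NOT secure."""
    n = p * q
    g = n + 1  # a common simple choice of g (assumption for this sketch)
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return (n, g), (lam, mu)


def encrypt(public_key, m):
    """c = g^m * r^n mod n^2, with random r coprime to n."""
    n, g = public_key
    n_sq = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(public_key, private_key, c):
    """m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) // n."""
    n, _ = public_key
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(public_key, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n, _ = public_key
    return (c1 * c2) % (n * n)


def mul_by_scalar(public_key, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    n, _ = public_key
    return pow(c, k, n * n)


public_key, private_key = generate_keypair()
c3 = encrypt(public_key, 3)
c2 = encrypt(public_key, 2)
total = decrypt(public_key, private_key, add_encrypted(public_key, c3, c2))
scaled = decrypt(public_key, private_key, mul_by_scalar(public_key, c3, 4))
print(total, scaled)  # prints: 5 12
```

Neither operation needs the private key, so a party holding only ciphertexts can do the arithmetic without learning 3 or 2. For real use, an established library with properly sized keys would be the safer choice.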


slide-20
SLIDE 20

Paillier Implementations

  • Python – open source
      • www.github.com/nicta/python-paillier
  • Java – open source
      • www.github.com/nicta/javallier
  • Javascript – still under closed development


slide-21
SLIDE 21

Distributed, Confidential Analytics

slide-22
SLIDE 22

Distributed Computing with a Twist

[Diagram: Org 1 and Org 2 each keep their Data and Compute inside their own confidentiality boundary. An N1 Coordinator, running N1 secure compute, exchanges messages containing ONLY encrypted data with them.]

Data always remains confidential to the source organisation


slide-23
SLIDE 23

Graph Computation Engine

[Diagram: a Coordinator and Workers spread across Domains. Each node is drawn as CE and DF blocks with Properties, and the nodes exchange Messages (M).]

Legend: M = JSON Message, CE = AKKA actors, DF = Data frames


slide-24
SLIDE 24

N1 Analytics Platform

  • Privacy Technologies: Partial homomorphic encryption, Private Record Linkage, Irreversible aggregation
  • Distributed Graph Computation Engine
  • Analytics: Statistics, Regression, Clustering
  • Machine Learning: Learn, Evaluate, Deploy
  • Data, Auth, Network


slide-25
SLIDE 25

Logistic Regression

Logistic function (evaluate):

$p(x;\theta) = \frac{1}{1 + e^{-\theta \cdot x}}$

Log likelihood (minimise for $\theta$):

$L(\theta) = \sum_{i=0}^{n} \left[ y_i \log p(x_i;\theta) + (1 - y_i) \log\left(1 - p(x_i;\theta)\right) \right]$

Requires "secure log" and "secure inverse" protocol using Paillier encryption
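
For orientation, here is a plaintext sketch of the two quantities on this slide, with a plain gradient loop. It is not the encrypted protocol: the "secure log" and "secure inverse" steps are not reproduced, and the toy data, learning rate and function names are assumptions for illustration. The loop climbs L(θ), which is the same as minimising −L(θ) (the sign convention is an assumption, since the slide only says "minimise").

```python
import math


def logistic(theta, x):
    """p(x; theta) = 1 / (1 + exp(-theta . x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(theta, X, y):
    """L(theta) = sum_i y_i log p_i + (1 - y_i) log(1 - p_i)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = logistic(theta, xi)
        total += yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total


def gradient_step(theta, X, y, lr=0.1):
    """One ascent step on L; dL/dtheta_k = sum_i (y_i - p_i) * x_ik."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = yi - logistic(theta, xi)
        for k, xik in enumerate(xi):
            grad[k] += err * xik
    return [t + lr * g for t, g in zip(theta, grad)]


# Toy data (hypothetical): first column is a constant bias term.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 3.5]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
initial_ll = log_likelihood(theta, X, y)
for _ in range(2000):
    theta = gradient_step(theta, X, y)
final_ll = log_likelihood(theta, X, y)
```

The exponential, the division and the logarithm are the steps that plain additive homomorphic arithmetic cannot do directly, which is presumably why dedicated secure protocols are needed for them.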


slide-26
SLIDE 26

Example Paillier Logistic Regression

[Diagram: Org A, Org B and N1Analytics, linked by a Coordinator and Worker (CE blocks). Components shown: Logistic Learner, Gradient Descent, Secure Log, Secure Inverse. Role labels shown: Private key holder, Features & labels, Features.]

Legend: M = JSON Message, CE = AKKA actors, DF = Data frames


slide-27
SLIDE 27

Performance

  • Learning
      • Learnt models have the same accuracy as unencrypted calculations
      • "Private learning" is (1000x) slower due to encrypted computations. Learning times are several hours.
  • Deployment
      • A score can be generated in real time (<50ms)
      • Customer data that contributes to the score remains private.


slide-28
SLIDE 28

Scaling

[Diagram: a Coordinator with Data Provider 1 and Data Provider 2, each backed by many Workers.]

[Plot: "Learning time scaling". Learning time in minutes (axis ticks 5, 10, 50, 100, 500) against number of cores (100 to 400). Series: 10,000x10 features, 100,000x10 features, 1,000,000x10 features.]


slide-29
SLIDE 29

Confidential Record Linkage

slide-30
SLIDE 30

Record Linkage Challenge

Dataset A Dataset B

Tori Mckone 7/06/1921 F Tori Mackon 6/07/1921 F Victoria Mckon 7/06/1921 F

? ?


slide-31
SLIDE 31

Solution (3): Private Record Linkage

Name        | Hash     | Hash     | Name
Jane Doe    | a8bf342  | a8bf242  | Janet Doe
Paul Doe    | f72630b  | b3894f3  | Bob Doe
Jim Clark   | 14ce54   | 14ce54   | Jim Clark
Kate Clark  | a72bef4  | 672bef4  | Kat Clark
Shan Bo     | 7830530  | 7830530  | Shan Bo
Reg Pal     | 4bf6021  | 80ac364  | Joe Smith

One-way hash functions are applied on each side; Fuzzy Matching is done on the hashes.


slide-32
SLIDE 32

Private Record Linkage

[Diagram: Company A and Company B each pass Personally Identifiable Information through a Hasher, using a Shared Secret Salt, to produce an Anonymous Bloom filter. N1 runs a Fuzzy Matcher on the Bloom filters and produces a Linkage Table.]

PII cannot be recovered from the hashes
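
A minimal sketch of how salted Bloom-filter hashing and fuzzy matching can fit together. The slide does not give the details, so everything specific here is an assumption for illustration: character bigrams as tokens, HMAC-SHA256 keyed with the shared salt, a 1024-bit filter with two hash functions, and the Dice coefficient as the similarity measure. The function names are hypothetical.

```python
import hashlib
import hmac


def bigrams(text):
    """Split a lower-cased string into its set of character bigrams."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bloom_encode(text, salt, num_bits=1024, num_hashes=2):
    """Return the set of bit positions set in the Bloom filter for `text`.

    Each bigram is hashed num_hashes times with HMAC-SHA256, keyed by the
    shared secret salt, so only parties holding the salt can build filters.
    """
    bits = set()
    for gram in bigrams(text):
        for k in range(num_hashes):
            message = f"{k}:{gram}".encode("utf-8")
            digest = hmac.new(salt, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % num_bits)
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient of two filters: 2|A and B| / (|A| + |B|)."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


salt = b"shared-secret-salt"  # hypothetical value agreed by both companies
kate = bloom_encode("Kate Clark", salt)
kat = bloom_encode("Kat Clark", salt)
joe = bloom_encode("Joe Smith", salt)

similar_score = dice(kate, kat)     # names differing by one letter
different_score = dice(kate, joe)   # unrelated names
```

Because similar names share most of their bigrams, their filters share most of their set bits, so the matcher can score near-matches without seeing either name. A matcher would then link pairs whose score exceeds a chosen threshold.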


slide-33
SLIDE 33

Probabilistic Record Linkage

Common categorical features (e.g. post code, age range, gender)

Record linkage can be a privacy issue


slide-34
SLIDE 34

Learning on Aggregated Data

Features $x_i$ and Labels $y_i$ (N instances, M features per instance) are transformed into Rados (R << N Rados, M features per Rado).

  • Irreversible transformation
  • Can provide differential privacy guarantees

$\pi_j = \frac{1}{2} \sum_{i} \left(y_i - \sigma_{ji}\right) x_i, \qquad \sigma_j \in \{-1, 1\}^{N}, \qquad y_i \in \{-1, 1\}$

Nock, Patrini, & Friedman, ICML 2015, http://jmlr.org/proceedings/papers/v37/nock15.html
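
The aggregation step can be sketched in a few lines. This follows the formula as it appears on the slide, with the sign written as y_i − σ_ji; since each σ_ji is drawn uniformly from {−1, 1}, flipping that sign gives the same distribution of Rados. The function names and toy data are assumptions for illustration, and nothing here adds the noise needed for a differential privacy guarantee.

```python
import random


def make_rado(X, y, sigma):
    """One Rado: pi = 1/2 * sum_i (y_i - sigma_i) * x_i.

    X is a list of N feature vectors of length M; y and sigma hold one
    value in {-1, 1} per instance. Only instances with sigma_i != y_i
    contribute, so a Rado is a signed sum over a random subset of rows.
    """
    m = len(X[0])
    rado = [0.0] * m
    for xi, yi, si in zip(X, y, sigma):
        weight = 0.5 * (yi - si)
        for k in range(m):
            rado[k] += weight * xi[k]
    return rado


def make_rados(X, y, num_rados, seed=0):
    """Draw num_rados random sign vectors and build one Rado for each."""
    rng = random.Random(seed)
    n = len(X)
    rados = []
    for _ in range(num_rados):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        rados.append(make_rado(X, y, sigma))
    return rados


# Toy data (hypothetical): N = 3 instances, M = 2 features.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1, -1, 1]
rados = make_rados(X, y, num_rados=2)
```

Each Rado has M entries regardless of N, so sharing R << N of them exposes aggregates rather than individual rows.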


slide-35
SLIDE 35

Example

[Plot: Test Accuracy against number of shared feature categories (x axis 50 to 300, y axis 0.6 to 0.9). Series: Accuracy from DP1, Accuracy from DP2, Accuracy from both.]

  • 10,000 instances
  • No label in DP2
  • 1 shared categorical feature
  • No entity resolution


slide-36
SLIDE 36

Current Capabilities of N1 platform

  • Standard data analytics techniques on confidential data:
      • Correlation analysis
      • Classification / prediction
      • Regression
      • Clustering / outlier detection
  • Automated private record linkage
  • Fine grained authorisation and access control

[Diagram: Dept 1, Org 2 and Comp3 connected through private record linkage and private analytics (Statistics, Classifiers, Anomaly Detection).]

Federated model – No central database

Data is kept local to the source


slide-37
SLIDE 37

www.csiro.au

Data Analytics Without Seeing the Data

Max Ott … with input from the entire N1 Team max.ott@data61.csiro.au