www.csiro.au
Data Analytics WITHOUT Seeing the Data
Max Ott … with input from the entire N1 Team max.ott@data61.csiro.au
Computation
Result
Confidential
Learn this! Learn NOTHING
Data Analytics Without Seeing the Data 2 |
How can we learn valuable insights from sensitive data from multiple organisations?
Insights
Sensitive data Sensitive data
Joint Analysis
Confidential Confidential
[Figure: values pass through encryption (E) and appear as very long ciphertext integers, e.g. 71175935987496430338623223060201843925208459762815635262949815592595…; a decryption step (D) follows.]
Compute
Data
Dept 2
Compute
Data
N1 Secure compute Confidentiality boundary
Data always remains confidential to the source institution
Dept 1
Compute
Coordinator
Messages containing encrypted data
Dataset A Dataset B
Tori Mckone     7/06/1921  F
Tori Mackon     6/07/1921  F
Victoria Mckon  7/06/1921  F
? ?
Name        Hash      Name       Hash
Jane Doe    a8bf342   Janet Doe  a8bf242
Paul Doe    f72630b   Bob Doe    b3894f3
Jim Clark   14ce54    Jim Clark  14ce54
Kate Clark  a72bef4   Kat Clark  672bef4
Shan Bo     7830530   Shan Bo    7830530
Reg Pal     4bf6021   Joe Smith  80ac364
Fuzzy Matching
One-way hash functions applied on each side
Model
Own Data Other Data
Quality
Need to report?
Model Builder
Model Builder Own Data Gov Data
Own Data Model Builder
Model of normal behaviour
OK OK NG OK
Private Modeling
learn deploy
OK NG OK
Partial Homomorphic Encryption: allows either addition or multiplication of encrypted numbers
Somewhat Homomorphic Encryption: allows evaluation of low-order polynomials
Fully Homomorphic Encryption: allows evaluation of arbitrary functions
More general towards fully homomorphic; faster towards partial homomorphic
Encryption of m (Paillier): $E(m) = g^{m} r^{n} \bmod n^{2}$
Addition of encrypted numbers: $D\big(E(m_1) \cdot E(m_2) \bmod n^{2}\big) = m_1 + m_2 \bmod n$
Multiplication of encrypted number by a scalar: $D\big(E(m_1)^{m_2} \bmod n^{2}\big) = m_1 m_2 \bmod n$, since $E(m_1)^{m_2} = g^{m_1 m_2} r^{n m_2} \bmod n^{2}$
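The two properties on this slide, addition of encrypted numbers and multiplication of an encrypted number by a scalar, can be tried out with a toy version of the Paillier scheme, which the deck names later for its secure protocols. The sketch below is illustrative only: the tiny primes, the choice g = n + 1, and all function names are assumptions made for this example, not the N1 implementation, and keys this small offer no security.

```python
import math
import secrets

def keygen(p=1789, q=1861):
    """Toy key pair from two small primes (far too small for real use)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                # simple, commonly used choice of g (assumption)
    mu = pow(lam, -1, n)     # inverse of lam mod n; valid when g = n + 1
    return (n, g), (lam, mu)

def encrypt(pub, m):
    """E(m) = g^m * r^n mod n^2 for a random r coprime to n."""
    n, g = pub
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(pub, priv, c):
    """D(c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) / n."""
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

def add_encrypted(pub, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (pub[0] ** 2)

def mul_scalar(pub, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    return pow(c, k, pub[0] ** 2)

pub, priv = keygen()
c1, c2 = encrypt(pub, 17), encrypt(pub, 25)
print(decrypt(pub, priv, add_encrypted(pub, c1, c2)))  # 42
print(decrypt(pub, priv, mul_scalar(pub, c1, 3)))      # 51
```

Only the holder of the private key can call `decrypt`; the party doing the arithmetic sees ciphertexts alone, which is what lets a coordinator combine values it cannot read.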
Compute
Data
Org 2
Compute
Data
N1 Secure compute Confidentiality boundary
Data always remains confidential to the source organisation
Org 1
Compute
Coordinator
Messages containing ONLY encrypted data
[Diagram: Domains containing CE and DF components; a Coordinator and Workers; Properties; Messages (M) passed between them]
Legend: M = JSON message; CE = Akka actors; DF = data frames
Privacy Technologies: Partial homomorphic encryption, Private Record Linkage, Irreversible aggregation
Distributed Graph Computation Engine
Analytics: Statistics, Regression, Clustering
Machine Learning: Learn, Evaluate, Deploy
Data, Auth, Network
Logistic function (evaluate): $p(x;\theta) = \frac{1}{1 + e^{-\theta \cdot x}}$
Log likelihood: $L(\theta) = \sum_{i=0}^{n} \left[ y_i \log p(x_i;\theta) + (1 - y_i) \log\left(1 - p(x_i;\theta)\right) \right]$
Learn: minimise $-L(\theta)$ for $\theta$
Requires “secure log” and “secure inverse” protocols using Paillier encryption
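As a plaintext reference for the two formulas on this slide, the sketch below evaluates the logistic function and the log likelihood and fits θ by gradient descent on the negative log likelihood. It deliberately leaves out the encrypted "secure log" and "secure inverse" protocols, which the slides do not specify; the data, learning rate, step count, and function names are assumptions chosen for illustration.

```python
import math

def logistic(x, theta):
    """p(x; theta) = 1 / (1 + exp(-theta . x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta, xs, ys):
    """L(theta) = sum_i y_i log p(x_i) + (1 - y_i) log(1 - p(x_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(x, theta)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient_step(theta, xs, ys, lr=0.1):
    """One gradient-descent step on the mean of -L(theta)."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        err = logistic(x, theta) - y   # derivative of -L w.r.t. theta . x
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return [t - lr * g / len(xs) for t, g in zip(theta, grad)]

# Hypothetical toy data: first component is a constant bias term, labels in {0, 1}.
xs = [(1.0, 0.5), (1.0, 1.5), (1.0, 3.0), (1.0, 4.0)]
ys = [0, 0, 1, 1]

theta_fit = [0.0, 0.0]
for _ in range(200):
    theta_fit = gradient_step(theta_fit, xs, ys)

print(log_likelihood([0.0, 0.0], xs, ys), log_likelihood(theta_fit, xs, ys))
```

In the encrypted setting each party would hold only part of `xs` and `ys`, so the `log` and the division inside `logistic` are the steps that need a dedicated protocol, since additive homomorphic encryption supports only sums and scalar products directly.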
[Diagram: Logistic Learner deployed across Org A, Org B and N1Analytics]
Roles shown: Private key holder; Features & labels; Features
Nodes: Coordinator, Worker
Components: Secure Log, Logistic Learner, Secure Inverse, Gradient Descent
Legend: M = JSON message; CE = Akka actors; DF = data frames
accuracy as unencrypted calculations
slower due to encrypted computations. Learning times are several hours.
time (<50ms)
the score remains private.
[Diagram: Coordinator, Data Provider 1 and Data Provider 2 with a pool of Workers]
[Plot: Learning time scaling. x-axis: Cores (ticks 100 to 400); y-axis: Minutes (ticks 5, 10, 50, 100, 500); series: ■ 100,000x10 features, ◆ 1,000,000x10 features]
Company A: Personally Identifiable Information -> Hasher -> Anonymous Bloom filter
Company B: Personally Identifiable Information -> Hasher -> Anonymous Bloom filter
Shared Secret Salt (used by both hashers)
N1: Fuzzy Matcher -> Linkage Table
PII cannot be recovered from the hashes
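The hasher and fuzzy matcher on this slide can be sketched roughly as follows. The slide gives only the components (shared secret salt, hasher, anonymous Bloom filter, fuzzy matcher); the character-bigram tokens, HMAC-SHA256 keyed with the salt, the filter size, and the Dice similarity score are assumptions borrowed from common Bloom-filter record-linkage practice, and the function names are hypothetical.

```python
import hashlib
import hmac

def bigrams(text):
    """Split a padded, lower-cased string into character bigrams."""
    s = f"_{text.lower().strip()}_"
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(text, salt, size=1024, num_hashes=4):
    """Return the set of Bloom filter bit positions switched on for `text`.

    Each bigram is hashed `num_hashes` times with HMAC-SHA256 keyed by the
    shared secret salt, so only parties holding the salt can build filters.
    """
    bits = set()
    for gram in bigrams(text):
        for k in range(num_hashes):
            digest = hmac.new(salt, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:4], "big") % size)
    return bits

def dice(a, b):
    """Dice coefficient of two bit-position sets: 1.0 means identical."""
    return 2 * len(a & b) / (len(a) + len(b))

salt = b"shared-secret-salt"   # hypothetical value agreed by both companies
mckone = bloom_encode("Tori Mckone", salt)
mackon = bloom_encode("Tori Mackon", salt)
smith = bloom_encode("Joe Smith", salt)

print(dice(mckone, mackon), dice(mckone, smith))
```

Because similar names share most of their bigrams, their filters overlap heavily, so the matcher can score near-duplicates such as "Tori Mckone" and "Tori Mackon" without seeing either name.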
Common categorical features (e.g. post code, age range, gender)
Record linkage can be a privacy issue
Features, Labels → Rados
N instances, M features per instance → R << N Rados, M features per Rado
Irreversible transformation
Can provide differential privacy guarantees
$\pi_j = \frac{1}{2} \sum_{i=1}^{N} (y_i - \sigma_{ji})\, x_i$, with $\sigma_{ji} \in \{-1, 1\}$ and $y_i \in \{-1, 1\}$
Nock, Patrini, & Friedman, ICML 2015, http://jmlr.org/proceedings/papers/v37/nock15.html
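A Rado (Rademacher observation, in the terminology of the cited paper) is a signed sum over a random subset of the instances. The sketch below computes R Rados from N labelled instances using the sign convention shown on this slide; the toy data, the fixed seed, and the function name `make_rados` are assumptions for illustration only.

```python
import random

def make_rados(xs, ys, num_rados, seed=0):
    """Build `num_rados` Rados: pi_j = 1/2 * sum_i (y_i - sigma_ji) * x_i.

    xs: list of N feature vectors with M features each
    ys: list of N labels in {-1, +1}
    Each sigma_ji is drawn uniformly from {-1, +1}.
    """
    rng = random.Random(seed)
    num_features = len(xs[0])
    rados = []
    for _ in range(num_rados):
        sigma = [rng.choice((-1, 1)) for _ in xs]
        rado = [0.0] * num_features
        for x, y, s in zip(xs, ys, sigma):
            weight = 0.5 * (y - s)   # equals y when s != y, otherwise 0
            for k in range(num_features):
                rado[k] += weight * x[k]
        rados.append(rado)
    return rados

# Hypothetical toy data: N = 3 instances, M = 2 features.
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ys = [1, -1, 1]
rados = make_rados(xs, ys, num_rados=2)
print(rados)
```

Each Rado therefore equals the sum of y_i x_i over the instances whose random sign disagrees with the label, which is why a single Rado does not expose any individual instance directly when many instances are mixed.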
[Plot: Test Accuracy against number of shared feature categories. x-axis: number of shared feature categories (ticks 50 to 300); y-axis: test accuracy (ticks 0.6 to 0.9); series: Accuracy from DP1, Accuracy from DP2, Accuracy from both]
Setup: 10,000 instances; no label in DP2; 1 shared categorical feature; no entity resolution
techniques on confidential data:
linkage
access control
Dept 1, Org 2, Comp 3
Private record linkage
Private analytics: Statistics, Classifiers, Anomaly Detection
Federated model – No central database
Data is kept local to the source