

slide-1
SLIDE 1

DIMACS 1 / 23

Privacy Protections as an Incentive for Collaborative Research on Human Health

Anand D. Sarwate

Department of Electrical and Computer Engineering Rutgers, the State University of New Jersey

April 24, 2017

Rutgers Sarwate

slide-2
SLIDE 2

DIMACS > Human health research 2 / 23

Human health research

There are many data sharing challenges in human health research

  • Secondary use of clinical data for research
  • Multi-site studies on QA or comparative effectiveness
  • Joint (secondary) analyses on aggregated research data

slide-6
SLIDE 6

DIMACS > Human health research 3 / 23

Institutions often want to share data

  • Different research groups using the same type of measurements want to do a joint analysis.
  • Sharing requires lawyers at each institution to generate Data Use Agreements.
  • The resulting months of negotiation make even small-scale collaboration too complicated.

slide-10
SLIDE 10

DIMACS > Human health research 4 / 23

Collaborative research systems

Research consortia are common in many research areas involving human health:

  • Foster collaborative research about a particular condition (Alzheimer’s, autism, breast cancer, etc.)
  • Automated sharing is challenging, but this is changing.

Goal: use privacy protections to encourage consortium growth.

slide-11
SLIDE 11

DIMACS > Human health research 5 / 23

COllaborative Informatics Neuroimaging Suite

  • End-to-end system for managing data for studies on the brain
  • Current usage: 37,903 participants in 42,961 scan sessions from 612 studies for a total of 486,955 clinical assessments.
  • Data from 34 states, 38 countries
  • Partners with research consortia such as the Autism Brain Imaging Data Exchange (ABIDE)

slide-16
SLIDE 16

DIMACS > Human health research 6 / 23

Example: schizophrenia research

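[Figure: sites holding data sets $D_1, D_2, \dots, D_M$ (features in $\mathbb{R}^d$) each run a private SVM, producing $w_{\text{priv},1}, w_{\text{priv},2}, \dots, w_{\text{priv},M}$. A private SVM aggregator with data $D_0$ maps its points to $\tilde{x}_i = W^\top x_i \in \mathbb{R}^M$ and runs a private SVM to obtain $w_{\text{priv}}$. Final classification rule: $\hat{y} = \operatorname{sgn}(w_{\text{priv}}^\top W^\top x)$.]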

  • Goal: build a system that can identify schizophrenia.
  • Data: MRIs from multiple studies (healthy controls and schizophrenics).
  • Algorithm: classification using machine learning (e.g. support vector machine).
  • Privacy risk: each study has to allow access to sensitive subject data.
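
A minimal sketch of how the final classification rule in the diagram composes its two stages. It assumes that W is the d × M matrix whose columns are the M site classifiers (an inference from the diagram labels, not something the slide states). The weights and data points are made-up numbers, and no privacy noise is added here.

```python
import numpy as np

def two_stage_predict(X, W, w_agg):
    """Apply y_hat = sgn(w_agg^T W^T x) to each row x of X.

    X     : (n, d) array of feature vectors
    W     : (d, M) array; column j is assumed to be site j's classifier
    w_agg : (M,) array; the aggregator's classifier on the M site scores
    """
    site_scores = X @ W              # row i is x_tilde_i = W^T x_i, in R^M
    return np.sign(site_scores @ w_agg)

# Hypothetical numbers: d = 2 features, M = 3 sites.
W = np.array([[1.0, 0.5, -1.0],
              [0.0, 1.0,  1.0]])
w_agg = np.array([1.0, 1.0, 0.0])    # this aggregator happens to ignore site 3
X = np.array([[ 1.0,  1.0],
              [-1.0, -2.0]])
print(two_stage_predict(X, W, w_agg))
```

The aggregator only ever sees M-dimensional score vectors, so the second-stage classifier is trained in a much lower dimension than the raw imaging features.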

slide-17
SLIDE 17

DIMACS > Status quo ante 7 / 23

State of the art: ENIGMA

http://enigma.ini.usc.edu

“The ENIGMA Network brings together researchers in imaging genomics to understand brain structure, function, and disease, based on brain imaging and genetic data.”

  • MA = meta-analysis: focused on
  • Goals: improve reproducibility, sample sizes
  • Validation: found genetic variations associated with neurophysiological characteristics (e.g. hippocampal/intracranial volumes)

slide-18
SLIDE 18

DIMACS > Status quo ante 8 / 23

Workflows in ENIGMA

http://enigma.ini.usc.edu

ENIGMA has 30+ working groups on diseases, genomics, population variation, and methods. To do a study:

  • Study proposal is approved by ENIGMA managers.
  • Analyses are performed at local sites and emailed to the ENIGMA manager as Excel spreadsheets.
  • Manager has to perform “manual” meta-analysis.

slide-23
SLIDE 23

DIMACS > Status quo ante 9 / 23

Low-hanging fruit: automate this

COINSTAC works in a different way: data is registered in the system and analyses are performed/aggregated automatically through message passing.

  • Study is proposed specifying data needed.
  • Local sites approve access to data.
  • Analyses are run and aggregated automatically.

This can be significantly faster than the ENIGMA approach.
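
As an illustration of the aggregation step that can be automated, here is a minimal sketch of a fixed-effect (inverse-variance) meta-analysis. The slides do not say which pooling rule ENIGMA or COINSTAC uses, so this is one common choice picked for illustration; the function name and the per-site numbers are hypothetical.

```python
import math

def fixed_effect_meta(effects, std_errors):
    """Inverse-variance (fixed-effect) pooling of per-site estimates."""
    if len(effects) != len(std_errors) or not effects:
        raise ValueError("need one standard error per effect, and at least one site")
    weights = [1.0 / se ** 2 for se in std_errors]   # precise sites count more
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, effects)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se

# Hypothetical per-site effect sizes and standard errors.
site_effects = [0.2, 0.4, 0.3]
site_std_errors = [0.1, 0.1, 0.2]
pooled, pooled_se = fixed_effect_meta(site_effects, site_std_errors)
print(f"pooled effect = {pooled:.3f}, standard error = {pooled_se:.3f}")
```

Each site only has to send two numbers per effect, which is the kind of message an automated system could collect instead of emailed spreadsheets.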


slide-24
SLIDE 24

DIMACS > COINSTAC 10 / 23

The COINSTAC workflow

In COINSTAC, research groups install the software and register their data in the system:

  • Form ongoing and ad-hoc “consortia” (slow, requires approval)
  • Once established, consortium members can initiate a joint analysis
  • Computation is performed locally and messages are passed between sites
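
To make "computation is performed locally and messages are passed" concrete, here is a minimal sketch of decentralized ridge regression. Each site sends only summary matrices, and the aggregator recovers exactly the fit it would get from pooled data. The function names, the regularization value, and the synthetic data are assumptions for illustration, not COINSTAC's actual API, and sharing these summaries carries no formal privacy guarantee by itself.

```python
import numpy as np

def local_message(X, y):
    """Computed at a site: only these summaries leave the site."""
    return X.T @ X, X.T @ y

def aggregate_ridge(messages, lam):
    """Computed at the aggregator from the sites' messages."""
    d = messages[0][0].shape[0]
    gram = sum(m[0] for m in messages)      # sum of X_j^T X_j over sites
    moment = sum(m[1] for m in messages)    # sum of X_j^T y_j over sites
    return np.linalg.solve(gram + lam * np.eye(d), moment)

# Synthetic data standing in for three sites of different sizes.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
sites = []
for n in (30, 50, 20):
    X = rng.normal(size=(n, 3))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    sites.append((X, y))

w_fed = aggregate_ridge([local_message(X, y) for X, y in sites], lam=1.0)

# Sanity check against a fit on the pooled raw data.
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
w_pooled = np.linalg.solve(X_all.T @ X_all + 1.0 * np.eye(3), X_all.T @ y_all)
print(np.allclose(w_fed, w_pooled))
```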

slide-25
SLIDE 25

DIMACS > COINSTAC 11 / 23

What’s in the medium term

COINSTAC prototype is currently “demo-able” but not up and running.

  • Compute more than summary statistics, ridge regression, etc.
  • Improve user interface and usability for practitioners, including visualization tools.
  • Initial subject focus for new results: addiction studies.
  • Incorporate/test differentially private methods for machine learning.

slide-26
SLIDE 26

DIMACS > COINSTAC 12 / 23

Focusing on “old” algorithms

Because the focus is on usability, we are working on methods popular in neuroimaging:

  • Feature discovery: ICA, IVA, NMF, deep learning, etc.
  • Regression and classification: ridge regression, LASSO, SVM, etc.
  • Visualization: t-SNE, network visualization, etc.

slide-30
SLIDE 30

DIMACS > COINSTAC 13 / 23

COINSTAC vs. other health data systems

COINSTAC is a solution that works for typical neuroimaging research initiatives.

  1. Data is “big” from the perspective of the domain area.
  2. Methods with asymptotic guarantees may not be ideal.
  3. Strong formal privacy may be trumped by utility requirements.

slide-31
SLIDE 31

DIMACS > Privacy challenges 14 / 23

Sad news: no privacy is enough?

[Figure: sites D1, D2, D3, …, DN and an aggregator]

From the perspective of IRBs and other regulatory bodies, decentralized/distributed algorithms may be “good enough.”

  • Getting them to work on the computing infrastructure is itself challenging.
  • Threat models and attack surface are different from “typical” data sharing scenarios.
  • Provides a useful test case for “newer” privacy technologies: differential privacy, multiparty computation.

slide-32
SLIDE 32

DIMACS > Privacy challenges 15 / 23

Making formal privacy guarantees

Currently working on making differentially private versions of existing algorithms. Differential privacy involves introducing randomization (e.g. noise) in computations.

  • Small number of subjects → larger noise → more error.
  • Neuroimaging data is high-dimensional: need some dimension reduction.
  • Preference for stronger (ε, 0) guarantees, but improved analyses give (ε, δ).
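
The "small number of subjects → larger noise" point can be seen in a minimal sketch of the Laplace mechanism for a mean, which gives an (ε, 0) guarantee. It assumes each value is clipped to a known range [lo, hi], that the number of subjects n is public, and that neighboring data sets differ by replacing one subject. The function names are hypothetical.

```python
import numpy as np

def laplace_scale(n, epsilon, lo=0.0, hi=1.0):
    """Laplace noise scale for the mean of n values clipped to [lo, hi]."""
    # Replacing one subject's value moves the clipped mean by at most (hi - lo) / n.
    sensitivity = (hi - lo) / n
    return sensitivity / epsilon

def private_mean(values, epsilon, lo=0.0, hi=1.0, rng=None):
    """Clipped mean plus Laplace noise calibrated to its sensitivity."""
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.clip(np.asarray(values, dtype=float), lo, hi)
    noise = rng.laplace(loc=0.0, scale=laplace_scale(len(clipped), epsilon, lo, hi))
    return clipped.mean() + noise

# The noise scale shrinks as 1/n for a fixed epsilon.
for n in (20, 200, 2000):
    print(n, laplace_scale(n, epsilon=0.5))
```

For a fixed ε, a study with 20 subjects needs noise 100 times larger than one with 2000, which is why pooling subjects across sites helps accuracy.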

slide-36
SLIDE 36

DIMACS > Privacy challenges 16 / 23

Dealing with federated infrastructure

Uploading all the data to EC2 or Azure is not acceptable (yet?).

  • Local storage overhead can be challenging.
  • Local processing costs are heterogeneous.
  • Communication can act as a real bottleneck.

slide-37
SLIDE 37

DIMACS > Privacy challenges 17 / 23

Compromises, compromises

At the moment we are making many compromises:

  • Utility first: practical values of ε for differential privacy are large.
  • Low communication: focus on one-shot aggregation over iterative methods.
  • Simple tasks: stick with developing distributed methods for known algorithms.
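
A minimal sketch of what one-shot aggregation can look like: each site fits its own model and sends a single d-dimensional estimate, and the aggregator takes a sample-size-weighted average. This is an illustrative scheme on synthetic data with hypothetical function names, not necessarily the method used in COINSTAC. The result is generally only an approximation of the pooled fit.

```python
import numpy as np

def local_estimate(X, y):
    """Least-squares fit computed entirely at one site."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w, len(y)

def one_shot_average(local_results):
    """Single round: sample-size-weighted average of the sites' estimates."""
    weights = np.array([n for _, n in local_results], dtype=float)
    estimates = np.stack([w for w, _ in local_results])
    return weights @ estimates / weights.sum()

# Synthetic data standing in for three sites.
rng = np.random.default_rng(1)
w_true = np.array([0.5, -1.0, 2.0])
results = []
for n in (200, 300, 250):
    X = rng.normal(size=(n, 3))
    y = X @ w_true + 0.01 * rng.normal(size=n)
    results.append(local_estimate(X, y))

w_avg = one_shot_average(results)
print(np.round(w_avg, 2))
```

An iterative method would exchange messages on every optimization step. Here each site communicates once, which matters when communication is the bottleneck.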

slide-38
SLIDE 38

DIMACS > Privacy challenges 18 / 23

Policy and privacy and systems, oh my!

Data sharing in health research may be different from open sharing or industry/academia sharing.

  • Different regulations around human subjects for experimental data or for PHI in clinical data.
  • Informed consent model allows subject-level and study-level privacy preferences.
  • Data sharing is contingent and possibly transient: revert to access only.

slide-39
SLIDE 39

DIMACS > Future opportunities 19 / 23

Recap

Shared-access models with “privacy protections” (formal or not) can encourage researchers to join consortia.

  • Benefits/risks align with the desires of data holders/researchers.
  • Data holders retain control over access and allowed computations.
  • Data users can use automated computations for hypothesis generation.

slide-43
SLIDE 43

DIMACS > Future opportunities 20 / 23

Some lessons learned

  • start small: variability of problem types is large
  • challenging to bridge gaps between algorithmist and developer
  • communication requirements are nontrivial

slide-48
SLIDE 48

DIMACS > Future opportunities 21 / 23

Another application: secondary use of clinical data

[Figure: a request (count of patients, classifier, aggregate statistic) is sent to a differentially private system, which returns a differentially private answer.]

The iDASH center at UCSD is working on larger-scale human health research involving clinical records.

  • Goal: to make clinical data warehouse more useful to researchers
  • Diverse range of problems in compression, genomics, NLP, etc. with privacy
  • Spurring research transition through data challenges, internships, etc.

The features of these problems are very different!
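
A minimal sketch of the "count of patients → differentially private answer" interaction, with a simple privacy budget so that repeated queries cannot average the noise away indefinitely. The class, its methods, and the toy records are hypothetical and are not iDASH's interface. It uses the Laplace mechanism with sensitivity 1 (adding or removing one patient changes a count by at most 1) and basic sequential composition.

```python
import numpy as np

class PrivateCountServer:
    """Hypothetical interface: answers count queries until the budget is spent."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self._remaining = float(total_epsilon)
        self._rng = np.random.default_rng(seed)

    def count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise RuntimeError("privacy budget exhausted")
        # Basic sequential composition: the epsilons of answered queries add up.
        self._remaining -= epsilon
        true_count = sum(1 for r in self._records if predicate(r))
        # Laplace noise with scale (sensitivity 1) / epsilon.
        return true_count + self._rng.laplace(scale=1.0 / epsilon)

    @property
    def remaining(self):
        return self._remaining

# Toy records; real clinical data would never be hard-coded like this.
patients = [{"age": 34, "dx": "A"}, {"age": 61, "dx": "B"}, {"age": 70, "dx": "A"}]
server = PrivateCountServer(patients, total_epsilon=1.0, seed=0)
print(server.count(lambda r: r["age"] > 60, epsilon=0.5))
print(server.remaining)
```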

slide-53
SLIDE 53

DIMACS > Future opportunities 22 / 23

Moving the hub forward w.r.t. health

[Figure: items separated by ≠ ≠ ≠]

  • Recognize that “medical data” is at best a placeholder and at worst semantically void.
  • Spend some time delineating the problem space and domain-specific challenges.
  • For theorists: can we get out of asymptopia?
  • For practitioners: what do you want to do versus how do you want to do it?