Center for Causal Discovery (CCD) of Biomedical Knowledge from Big Data



slide-1
SLIDE 1

Center for Causal Discovery (CCD) of Biomedical Knowledge from Big Data

University of Pittsburgh
Carnegie Mellon University
Pittsburgh Supercomputing Center
Yale University

PIs: Ivet Bahar, Jeremy Berg, Greg Cooper

slide-2
SLIDE 2

Outline

  • The U.S. NIH Big Data to Knowledge (BD2K) initiative
  • Why focus on the discovery of causal knowledge from big biomedical data?

  • Why establish a Center for Causal Discovery (CCD)?
  • What are some basic methods being used by CCD?
  • What are the goals of the CCD?
slide-3
SLIDE 3

NIH Big Data to Knowledge (BD2K) Initiative

For more information, see: https://datascience.nih.gov/bd2k/

The ability to harvest the wealth of information contained in biomedical Big Data will advance our understanding of human health and disease; however, a lack of appropriate tools, poor data accessibility, and insufficient training are major impediments to rapid translational impact. To meet this challenge, the U.S. National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative in 2012. BD2K is a trans-NIH initiative with the following major aims:

  • Facilitate broad use of biomedical digital assets
  • Conduct research and develop the methods, software, and tools needed to analyze biomedical Big Data
  • Enhance training in the development and use of methods and tools necessary for biomedical Big Data science
  • Support a data ecosystem that accelerates biomedical knowledge discovery

slide-4
SLIDE 4

NIH BD2K Centers of Excellence

  • The Centers of Excellence are part of the overall NIH BD2K initiative.
  • The goal is to develop and disseminate computational methods to assist biomedical researchers in using big data to significantly advance biomedical science.
  • Project components include research, software development and dissemination, training, and joint Center activities.
  • In September 2014, NIH began funding 11 BD2K Centers of Excellence.
  • Funding is for 4 years.
  • More information is available at: https://datascience.nih.gov/bd2k/funded-programs/centers

slide-5
SLIDE 5

Causal Discovery in Biomedicine

Science is centrally concerned with the discovery of causal relationships in nature.

  • Understanding
  • Prediction
  • Control

Examples:

  • Determine the genes and cell signaling pathways that cause breast cancer
  • Discover the clinical effects of a new drug
  • Uncover the mechanisms of pathogenicity of a recently mutated virus that is spreading rapidly in the population

slide-6
SLIDE 6

Why Establish a Center for Causal Discovery Now?

  • Algorithmic Advances

+

  • Availability of Big Biomedical Data
slide-7
SLIDE 7

Algorithmic Advances

  • In the past 25 years, there has been tremendous progress in the development of computational methods for representing and discovering causal networks from a combination of data and knowledge.
  • These methods are often applicable to biomedical data.

slide-8
SLIDE 8

Availability of Big Biomedical Data

  • The variety, richness, and quantity of biomedical data have been increasing very rapidly.
    – High-throughput molecular data (e.g., whole-genome sequencing)
    – Clinical EMR data
    – Population health data from social media and mobile sensors
  • The appropriate analysis of these data has great potential to advance biomedical science.

http://aldousvoice.files.wordpress.com/2014/06/database.jpg

slide-9
SLIDE 9

The Time Seems Right to Disseminate These Algorithms to Scientists to Use in Analyzing Biomedical Data for Causal Relationships

Big Biomedical Data → Causal Discovery Algorithms → Causal Networks

slide-10
SLIDE 10

Basic Causal Discovery Workflow

Causal Networks Prior Knowledge

Causal Analysis

Data

slide-11
SLIDE 11

Basic Causal Discovery Workflow

Causal Networks Prior Knowledge

Causal Analysis

Data

Both observational and experimental data

slide-13
SLIDE 13

Basic Causal Discovery Workflow

Causal Networks Prior Knowledge

Causal Analysis

Causal Hypotheses

Data

slide-14
SLIDE 14

Basic Causal Discovery Workflow

Causal Networks Prior Knowledge

Causal Analysis

Causal Hypotheses

Experiments Data

slide-17
SLIDE 17

An Example of Causal Network Discovery from Biomedical Data

Sachs K, et al. Protein-signaling networks learned from multi-parameter single-cell data of human T cells. Science 308 (2005) 523-529.

slide-18
SLIDE 18

A Portion of a Cell Signaling Network

(and Points of Experimental Intervention)

Sachs K, et al. Science 308 (2005) 523-529. (The figure above appears in this paper.)

slide-19
SLIDE 19

Overview of Experimental Design and Data Analysis

Sachs K, et al. Protein-signaling networks learned from multi-parameter single-cell data of human T cells. Science 308 (2005) 523-529. (The figure above appears in this paper.)

slide-20
SLIDE 20

Results of Causal Network Analysis for the Example

Sachs K, et al. Protein-signaling networks learned from multi-parameter single-cell data of human T cells. Science 308 (2005) 523-529. (The figure above appears in this paper.)

slide-21
SLIDE 21

Basic Components Needed to Learn Causal Networks from Data

  • Model representation
  • Model search
  • Model evaluation
slide-22
SLIDE 22

Model Representation: Causal Bayesian Networks (CBNs)

  • Nodes represent variables
  • Arcs represent direct causation
  • A directed acyclic graph
  • A variable is modeled as independent of its non-effects, given its causal parents

Example: A → B → C

slide-23
SLIDE 23

Model Representation: Causal Bayesian Networks (CBNs)

  • Nodes represent variables
  • Arcs represent direct causation
  • A directed acyclic graph
  • A variable is modeled as independent of its non-effects, given its causal parents

Example:

CBN structure: A → B → C

slide-24
SLIDE 24

Model Representation: Causal Bayesian Networks (CBNs)

  • Nodes represent variables
  • Arcs represent direct causation
  • A directed acyclic graph
  • A variable is modeled as independent of its non-effects, given its causal parents
  • There is a factorization of the joint probability distribution

Example:

CBN structure: A → B → C
CBN parameters: P(A, B, C) = P(A) P(B | A) P(C | B)
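A minimal sketch of the factorization above, assuming binary variables and made-up probability values. The numbers and the names `p_a`, `p_b_given_a`, and `p_c_given_b` are illustrative and do not come from the presentation. It builds the joint distribution for the chain A → B → C from its three local distributions.

```python
from itertools import product

# Hypothetical CBN parameters for the chain A -> B -> C (binary variables).
p_a = {1: 0.3, 0: 0.7}                                     # P(A)
p_b_given_a = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}}   # P(B | A), indexed [a][b]
p_c_given_b = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}}   # P(C | B), indexed [b][c]


def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b | a) * P(c | b)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


def prob_c_given_ab(c, a, b):
    """P(C=c | A=a, B=b), computed from the joint distribution."""
    return joint(a, b, c) / sum(joint(a, b, c2) for c2 in (0, 1))


# The eight joint probabilities should sum to 1.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
```

With these parameters, `prob_c_given_ab(1, a, 1)` is the same for a = 0 and a = 1, which illustrates that C is independent of its non-effect A given its causal parent B.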

slide-25
SLIDE 25

Model Search

  • The space of CBNs is very large
  • Heuristic search is generally applied in seeking to find the most likely CBNs
  • We search for the most likely CBN structures
  • Once a highly likely CBN structure is found, we can parameterize it using the data
  • We can also model average over highly probable substructures (e.g., a causal arc from X to Y)
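One way to picture heuristic search over CBN structures is greedy hill climbing with a decomposable score. The sketch below is an illustration only, not one of the CCD algorithms: it assumes binary variables, uses a BIC score, and considers just two moves (add an arc, delete an arc). All function names and the simulated chain data are hypothetical.

```python
import itertools
import math
import random
from collections import Counter


def local_bic(data, child, parents):
    """BIC contribution of one binary node given a list of parent indices."""
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marginal = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / marginal[pa]) for (pa, _), c in joint.items())
    n_params = 2 ** len(parents)  # one free parameter per parent configuration
    return loglik - 0.5 * n_params * math.log(n)


def has_cycle(parents):
    """True if the graph {node: set_of_parents} contains a directed cycle."""
    visiting, done = set(), set()

    def visit(v):
        if v in done:
            return False
        if v in visiting:
            return True
        visiting.add(v)
        if any(visit(p) for p in parents[v]):
            return True
        visiting.discard(v)
        done.add(v)
        return False

    return any(visit(v) for v in parents)


def hill_climb(data, n_vars):
    """Greedily add or delete single arcs while the total BIC improves."""
    parents = {v: set() for v in range(n_vars)}
    score = {v: local_bic(data, v, []) for v in parents}
    while True:
        best_gain, best_move = 1e-9, None
        for x, y in itertools.permutations(range(n_vars), 2):
            candidate = parents[y] ^ {x}  # toggle the arc x -> y
            if x not in parents[y]:       # adding an arc: reject cycles
                trial = dict(parents)
                trial[y] = candidate
                if has_cycle(trial):
                    continue
            gain = local_bic(data, y, sorted(candidate)) - score[y]
            if gain > best_gain:
                best_gain, best_move = gain, (y, candidate)
        if best_move is None:
            return parents
        y, candidate = best_move
        parents[y] = candidate
        score[y] = local_bic(data, y, sorted(candidate))


def sample_chain(n, seed=0):
    """Simulate n cases from a hypothetical chain 0 -> 1 -> 2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.random() < 0.5
        b = a if rng.random() < 0.9 else not a
        c = b if rng.random() < 0.9 else not b
        rows.append((int(a), int(b), int(c)))
    return rows


learned = hill_climb(sample_chain(2000), 3)
```

Because BIC gives the same score to Markov-equivalent structures, observational data alone may leave some arc directions undetermined; the sketch is meant to show the search loop, not to recover orientations reliably.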

slide-26
SLIDE 26

Model Search

[Figure: eight candidate CBN structures over the variables A, B, and C]

slide-27
SLIDE 27

Model Evaluation: Two Primary Approaches

  • 1. Constraint based
  • 2. Bayesian
slide-29
SLIDE 29

Model Evaluation: The Constraint-Based Approach

  • 1. Determine constraints that hold among the nodes (e.g., independence conditions based on statistical tests)
  • 2. Use the patterns of constraints to narrow the causal possibilities
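As an illustration of step 1, the sketch below applies a Pearson chi-square test of independence to a 2 × 2 table of counts for two binary variables. It is a minimal example under stated assumptions: the function name and the counts are hypothetical, all row and column totals must be nonzero, and practical constraint-based algorithms also test conditional independence, which is not shown here.

```python
import math


def chi2_independence_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts.

    table is ((a, b), (c, d)); rows index one variable, columns the other.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: the first table shows strong association, the second none.
stat_dep, p_dep = chi2_independence_2x2(((40, 10), (10, 40)))
stat_ind, p_ind = chi2_independence_2x2(((25, 25), (25, 25)))
```

A small p-value (as for the first table) would be recorded as a "dep" constraint, while a large one is treated as evidence for independence at the chosen significance level.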

slide-30
SLIDE 30

Constraint-Based Evaluation: An Example

Suppose in searching over CBNs we apply statistical tests to the observational data* on A, B, and C and obtain the following constraints:

  • A dep B
  • B dep C
  • A dep C

Which of the following models is consistent with the above constraints?

[Figure: two candidate causal models over A, B, and C]

* More generally, a combination of observational data, experimental data, and background knowledge can be provided as input.
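The narrowing step can be sketched by brute force for three variables. The code below enumerates every DAG over A, B, and C and keeps those in which all three pairs are marginally dependent. It rests on assumptions not stated on the slide: there are no latent variables, "X dep Y" means marginal dependence, and the distribution is faithful to the graph, so that two variables are dependent exactly when they share a common ancestor (counting each variable as its own ancestor). All names are illustrative.

```python
from itertools import combinations, product

NODES = ("A", "B", "C")
PAIRS = list(combinations(NODES, 2))


def is_acyclic(arcs):
    """Repeatedly strip nodes with no incoming arcs; fail if none can be stripped."""
    remaining, arcs = set(NODES), set(arcs)
    while remaining:
        sources = {v for v in remaining if not any(e == v for _, e in arcs)}
        if not sources:
            return False
        remaining -= sources
        arcs = {(c, e) for c, e in arcs if c not in sources}
    return True


def all_dags():
    """Yield every DAG over NODES as a set of (cause, effect) arcs."""
    # Each pair is unconnected (None), oriented forward (0), or backward (1).
    for choice in product((None, 0, 1), repeat=len(PAIRS)):
        arcs = set()
        for (x, y), c in zip(PAIRS, choice):
            if c == 0:
                arcs.add((x, y))
            elif c == 1:
                arcs.add((y, x))
        if is_acyclic(arcs):
            yield arcs


def ancestors_or_self(node, arcs):
    """The node together with all of its causal ancestors."""
    found, stack = {node}, [node]
    while stack:
        v = stack.pop()
        for cause, effect in arcs:
            if effect == v and cause not in found:
                found.add(cause)
                stack.append(cause)
    return found


def dependent(x, y, arcs):
    """Marginal dependence under faithfulness: x and y share an ancestor."""
    return bool(ancestors_or_self(x, arcs) & ancestors_or_self(y, arcs))


dags = list(all_dags())
consistent = [g for g in dags if all(dependent(x, y, g) for x, y in PAIRS)]
```

Under these assumptions, 15 of the 25 DAGs remain. The excluded graphs are those with an isolated variable and the colliders (for example A → B ← C, which implies that A and C are independent).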

slide-31
SLIDE 31

Constraint-Based Evaluation: An Example

Suppose in searching over CBNs we apply statistical tests to the observational data on A, B, and C and obtain the following constraints:

  • A dep B
  • B dep C
  • A dep C

Which of the following models is consistent with those constraints?

[Figure: the candidate model over A, B, and C that is consistent with the constraints]

slide-32
SLIDE 32

Several Key Causal Relationships

slide-33
SLIDE 33

Some Key Characteristics of Causal Discovery Problems
slide-34
SLIDE 34

Types of Big Data Problems Include …

  • Volume of data
    – Number of samples
    – Number of variables per sample
  • Variety of data – the different types of data
  • Velocity of data – how fast the data are being generated
  • Veracity of data – the uncertainty in the data (e.g., noise, biases)
slide-35
SLIDE 35

What is the Big Data Problem on which the CCD is Primarily Focused?

slide-36
SLIDE 36

Causal Network Discovery Methods Have Been Applied Successfully to Small Biomedical Datasets

Sachs K, et al. Protein-signaling networks learned from multi-parameter single-cell data of human T cells. Science 308 (2005) 523-529. (The figure above appears in this paper.)

slide-37
SLIDE 37

The Methods Have Also Been Successfully Applied to Medium-Sized Biomedical Datasets

Carro MS, et al. The transcriptional network for mesenchymal transformation of brain tumours. Nature 463 (2010) 318-325. (The figure above appears in this paper.)

slide-38
SLIDE 38

Most Algorithms Are Not Able to Handle Big Data Containing Many Thousands of Variables

Yang X, et al. Validation of candidate causal genes for obesity that affect shared metabolic pathways and networks. Nature Genetics 41 (2009) 415-423.

slide-39
SLIDE 39

The Number of Causal Models as a Function of the Number of Measured Variables*

Number of nodes    Number of causal models
1                  1
2                  3

* Assumes there are no latent variables and no directed cycles.

slide-40
SLIDE 40

The Number of Causal Models as a Function of the Number of Measured Variables*

Number of nodes    Number of causal models
1                  1
2                  3
3                  25
4                  543

* Assumes there are no latent variables and no directed cycles.

slide-41
SLIDE 41

The Number of Causal Models as a Function of the Number of Measured Variables*

Number of nodes    Number of causal models
1                  1
2                  3
3                  25
4                  543
5                  29,281
6                  3,781,503
7                  1.1 × 10^9
8                  7.8 × 10^11
9                  1.2 × 10^15
10                 4.2 × 10^18

* Assumes there are no latent variables and no directed cycles.
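The counts in the table can be reproduced with Robinson's recurrence for the number of labeled directed acyclic graphs, a(n) = Σ_{k=1..n} (−1)^(k+1) C(n, k) 2^(k(n−k)) a(n−k), with a(0) = 1. The slides do not say how the numbers were computed, so the sketch below is one assumed way to obtain them; the function name is illustrative.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of DAGs on n labeled nodes (no latent variables, no directed cycles)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


# Reproduce the table for 1 to 10 measured variables.
table = {n: count_dags(n) for n in range(1, 11)}
```

The super-exponential growth is why exhaustive evaluation of every model becomes infeasible beyond a handful of variables.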

slide-42
SLIDE 42

Our Big Data Problem:

Analyze biomedical datasets containing a large number of variables in order to generate plausible hypotheses of the causal relationships that hold among those variables

slide-43
SLIDE 43

Big Data Problems Being Pursued in CCD

  • Volume of data – the scale of the data
  • Number of samples
  • Number of variables per sample
  • Variety of data – the different types of data
  • Velocity of data – how fast the data are being generated
  • Veracity of data – the uncertainty in the data (e.g., noise, biases)
slide-45
SLIDE 45

Some Recent Progress in Developing Highly Efficient Causal Discovery Algorithms

Recently we have optimized a popular causal discovery algorithm (GES) to be much more efficient than before (FastGES).

  • Approach
    – Optimize the single-processor version
    – Parallelize the algorithm

slide-46
SLIDE 46

Preliminary Evaluation of FastGES

  • Evaluation method
    – Simulated data from a linear Gaussian model
    – Number of nodes (N) = number of edges
    – Number of samples = 1000
  • Results
    – N = 50,000: 13 minutes on a laptop with 8 cores and 16 GB RAM
    – N = 1,000,000: 18 hours with 40 cores and 384 GB RAM
  • For more information: http://arxiv.org/ftp/arxiv/papers/1507/1507.07749.pdf

N            Adjacency TPR   Adjacency TNR   Orientation TPR   Orientation TNR
50,000       99.3%           97.5%           98.2%             96.1%
1,000,000    99.9%           93.5%           99.9%             90.4%
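The slide does not define the reported metrics. As a rough guide, the sketch below shows one plausible way to compute an adjacency true positive rate (TPR) and true negative rate (TNR) by comparing the adjacencies of an estimated graph with those of the true graph, ignoring arc direction. The function name, the definitions, and the toy graphs are assumptions; the orientation metrics are not shown.

```python
from itertools import combinations


def adjacency_rates(nodes, true_edges, est_edges):
    """Return (TPR, TNR) for adjacencies, treating every edge as undirected.

    Assumes the true graph has at least one adjacent and one non-adjacent pair.
    """
    true_adj = {frozenset(e) for e in true_edges}
    est_adj = {frozenset(e) for e in est_edges}
    tp = fn = tn = fp = 0
    for pair in map(frozenset, combinations(nodes, 2)):
        if pair in true_adj:
            if pair in est_adj:
                tp += 1   # true adjacency that was found
            else:
                fn += 1   # true adjacency that was missed
        else:
            if pair in est_adj:
                fp += 1   # adjacency reported where none exists
            else:
                tn += 1   # correctly left unconnected
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical four-node example.
tpr, tnr = adjacency_rates(
    "ABCD",
    true_edges=[("A", "B"), ("B", "C"), ("C", "D")],
    est_edges=[("A", "B"), ("C", "B"), ("A", "D")],
)
```

This loops over all node pairs, which is fine for a toy graph but would need a sparse formulation at the scale of a million nodes.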

slide-47
SLIDE 47

Primary Goals of the CCD

  • Goal 1. Develop and implement state-of-the-art methods for causal modeling and discovery (CMD) of knowledge from biomedical big data
    – Make the best existing CMD methods available
    – Develop new CMD methods

slide-48
SLIDE 48

Primary Goals of the CCD

  • Goal 2. Investigate three biomedical projects
    – Evaluate the usefulness of CMD methods on these problems
    – Further drive the development of the CMD methods

slide-49
SLIDE 49

Primary Goals of the CCD

  • Goal 3. Disseminate CMD methods and knowledge widely to biomedical researchers and data scientists
    – Software
      • Algorithms
        – Implement a suite of causal discovery algorithms and make them available as application programming interfaces (APIs)
        – Open source and free
      • Desktop application: Develop an easy-to-use causal modeling and discovery (CMD) system with a desktop interface, which is open source and free
    – Training
    – Collaborative activities with other BD2K Centers

slide-50
SLIDE 50

Driving Biomedical Projects (DBPs)

  • Discovery of cell signaling networks in cancer
  • Discovery of the mechanisms of disease onset and progression in chronic obstructive pulmonary disease and idiopathic pulmonary fibrosis
  • Discovery of the functional (causal) connectivity of regions of the human brain from fMRI data

slide-51
SLIDE 51

Cancer DBP: Goal 1

  • Develop methods to identify driver (disease-causing) somatic genomic alterations (SGAs) of tumors

  • Big Data: The Cancer Genome Atlas (TCGA)
slide-52
SLIDE 52

Cancer DBP: Goal 1

  • Develop methods to identify driver (disease-causing) somatic genomic alterations (SGAs) of tumors
  • Big Data: The Cancer Genome Atlas (TCGA)

Available Cancer Types                    # Cases with Data
Acute Myeloid Leukemia [LAML]             200
Adrenocortical carcinoma [ACC]            80
Bladder Urothelial Carcinoma [BLCA]       412
Brain Lower Grade Glioma [LGG]            516
Breast invasive carcinoma [BRCA]          1098

slide-56
SLIDE 56

Cancer DBP: Goal 1

  • Develop methods to identify driver (disease-causing) somatic genomic alterations (SGAs) of tumors
  • Big Data: The Cancer Genome Atlas (TCGA)
  • Methods: Search for somatic alterations (A) that the data support as causing changes in the cellular behavior of tumors (G)

slide-58
SLIDE 58

Cancer DBP: Goal 1

  • Develop methods to identify driver (disease-causing) somatic genomic alterations (SGAs) of tumors
  • Big Data: The Cancer Genome Atlas (TCGA)
  • Methods: Search for somatic alterations (A) that the data support as causing changes in the cellular behavior of tumors (G)
  • General findings:
    – Found many known drivers of cancer
    – Also found some mutations not known to be drivers of cancer that we plan to test experimentally

slide-59
SLIDE 59

Population-Wide Modeling Approach

[Diagram labels: training set; apply population-wide learning method; population-wide model; patient case; inference; prediction]

slide-60
SLIDE 60

Personalized Modeling Approach

[Diagram labels: training set; patient case; apply a personalized learning method; personalized model; inference; prediction]

slide-61
SLIDE 61

Summary

  • The NIH BD2K initiative is focused on developing ways to enhance the translation of increasing amounts of digital data into biomedical knowledge.
  • Causal relationships are a central type of biomedical knowledge.
  • The Center for Causal Discovery (CCD) is focused on developing and making readily available algorithms and systems for generating plausible causal hypotheses from big biomedical data.
  • The CCD is exploring three example biomedical problems and is investigating methods for personalized causal modeling of health and disease.

slide-62
SLIDE 62

Additional Information

Spirtes P, Glymour C, Scheines R, Tillman R. Automated search for causal relations: Theory and practice. In Heuristics, Probability, and Causality: A Tribute to Judea Pearl, edited by Rina Dechter, Hector Geffner, and Joseph Halpern (College Publications, 2010, Chapter 28, pages 467-506). http://repository.cmu.edu/cgi/viewcontent.cgi?article=1423&context=philosophy

Kalisch M, Bühlmann P. Causal structure learning and inference: A selective review. Quality Technology and Quantitative Management, 11 (2014) 3-21. http://web.it.nctu.edu.tw/~qtqm/qtqmpapers/2014V11N1/2014V11N1_F1.pdf

Cooper GF, Bahar I, Becich MJ, Benos PV, Berg J, Espino JU, Jacobson RC, Kienholz M, Lee AV, Lu X, Scheines R, Center for Causal Discovery team. The Center for Causal Discovery of biomedical knowledge from Big Data. Journal of the American Medical Informatics Association 2015. PMID: 26138794

slide-63
SLIDE 63

Acknowledgements

  • Thanks to the 40+ members of the Center for Causal Discovery for their contributions to the Center activities that are described here.
  • The Center for Causal Discovery is supported by grant U54HG008540 awarded by the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The content of this presentation is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

slide-64
SLIDE 64

Thank you

gfc@pitt.edu www.ccd.pitt.edu