Extreme Spatio Temporal Data Analysis in Biomedical Informatics



slide-1
SLIDE 1

Extreme Spatio Temporal Data Analysis in Biomedical Informatics

  • Joel Saltz MD, PhD
  • Director, Center for Comprehensive Informatics

slide-2
SLIDE 2

Center for Comprehensive Informatics

Contributions

  • Computer Science: Methods and middleware for analysis and classification of very large datasets from low-dimensional spatio-temporal sensors; methods to carry out comparisons and change detection between sensor datasets
  • Biomedical: Mine whole slide image datasets to better predict outcome and response to treatments, generate basic insights into pathophysiology and identify new treatment targets

slide-3
SLIDE 3

Outline of Talk

  • Pathology: Analysis of Digitized Tissue for Research and Practice
  • Feature Clustering: Morphologic Tumor Subtypes in GBM Brain Tumors and Relationship to “omic” classifications
  • Whole Slide Image Analysis in Clinical Practice: Neuroblastoma
  • Tissue Flow: Multiplex Quantum Dot
  • HPC/BIGDATA Feature Pipeline
  • Pathology data analytic tools and techniques
slide-4
SLIDE 4

Whole Slide Imaging: Scale

slide-5
SLIDE 5

Pathology Computer Assisted Diagnosis

Shimada, Gurcan, Kong, Saltz

slide-6
SLIDE 6

Computerized Classification System for Grading Neuroblastoma

  • Background Identification
  • Image Decomposition (Multi-resolution levels)
  • Image Segmentation (EMLDA)
  • Feature Construction (2nd order statistics, Tonal Features)
  • Feature Extraction (LDA) + Classification (Bayesian)
  • Multi-resolution Layer Controller (Confidence Region)

[Figure: flowchart of the classification system. TESTING path: Image Tile, Initialization (I = L), Background? (Yes/No), Label, Create Image I(L), Segmentation, Feature Construction, Feature Extraction, Classification, Within Confidence Region? (Yes/No), I = I - 1, I > 1? (Yes/No). TRAINING path: Training Tiles, Down-sampling, Segmentation, Feature Construction, Feature Extraction, Classifier Training.]
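
The multi-resolution layer controller on this slide can be sketched roughly as below. This is a minimal illustration, not the authors' implementation. It assumes that level L is the coarsest (most down-sampled) resolution and that "within confidence region" can be modelled as a confidence score reaching a threshold. It omits the background check and the EMLDA, LDA and Bayesian stages. The names downsample, build_pyramid, classify_tile and min_confidence are hypothetical.

```python
from typing import Callable, List, Tuple

Tile = List[List[float]]  # grayscale tile stored as rows of pixel values


def downsample(tile: Tile) -> Tile:
    """Halve the resolution by averaging each 2x2 block of pixels."""
    return [
        [
            (tile[r][c] + tile[r][c + 1] + tile[r + 1][c] + tile[r + 1][c + 1]) / 4.0
            for c in range(0, len(tile[0]) - 1, 2)
        ]
        for r in range(0, len(tile) - 1, 2)
    ]


def build_pyramid(tile: Tile, levels: int) -> List[Tile]:
    """Return [level 1 (full resolution), ..., level `levels` (coarsest)]."""
    pyramid = [tile]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def classify_tile(
    tile: Tile,
    levels: int,
    classify: Callable[[Tile, int], Tuple[str, float]],
    min_confidence: float = 0.9,  # assumed stand-in for the confidence region
) -> Tuple[str, int]:
    """Classify coarse-to-fine; return (label, level at which it was accepted)."""
    pyramid = build_pyramid(tile, levels)
    level = levels  # I = L: start at the coarsest level (an assumption)
    while True:
        label, confidence = classify(pyramid[level - 1], level)
        # Accept a confident decision, or stop once full resolution is reached.
        if confidence >= min_confidence or level == 1:
            return label, level
        level -= 1  # I = I - 1: retry at the next finer resolution
```

The caller supplies classify, any function mapping (image, level) to (label, confidence). Tiles that are classified confidently at a coarse level never require processing of the higher-resolution data.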

slide-7
SLIDE 7

slide-8
SLIDE 8

Direct Study of Relationship Between vs

slide-9
SLIDE 9

In Silico Brain Tumor Center

Anaplastic Astrocytoma (WHO grade III)
Glioblastoma (WHO grade IV)

slide-10
SLIDE 10

Morphological Tissue Classification

Nuclei Segmentation Cellular Features

Lee Cooper, Jun Kong

Whole Slide Imaging

slide-11
SLIDE 11

Consensus clustering of morphological signatures

  • Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients
  • Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering

Nuclear Features Used to Classify GBMs

[Figure: nuclear feature plot for clusters 1, 2 and 3; Silhouette Area vs. # Clusters; Silhouette Value by Cluster.]
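
One common way to quantify co-clustering over repeated K-means runs is a consensus matrix: the fraction of runs in which two samples land in the same cluster. The sketch below is a minimal pure-Python illustration under stated assumptions. It is not the study's pipeline. The subsampling of 80% of the samples per run, the function names and the seed are all hypothetical choices.

```python
import random
from typing import List, Optional, Sequence

Point = Sequence[float]


def _dist2(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, rng: random.Random, iters: int = 50) -> List[int]:
    """Plain Lloyd's K-means with random initial centers; one label per point."""
    centers: List[Point] = rng.sample(points, k)
    labels: Optional[List[int]] = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: _dist2(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def consensus_matrix(points: List[Point], k: int, runs: int,
                     subsample: float = 0.8, seed: int = 0) -> List[List[float]]:
    """Entry [i][j] is the fraction of runs containing both i and j in which
    they were assigned to the same cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, int(subsample * n))
    for _ in range(runs):
        idx = rng.sample(range(n), size)  # random subset for this run
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, ia in enumerate(idx):
            for b, ib in enumerate(idx):
                sampled[ia][ib] += 1
                if labels[a] == labels[b]:
                    together[ia][ib] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
```

Repeating this for each candidate number of clusters and comparing how cleanly the matrix separates into blocks is one way to choose among the possibilities. The pairwise loops are quadratic in the number of samples, so this form only suits per-patient signatures, not individual nuclei.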

slide-12
SLIDE 12

Clustering identifies three morphological groups

  • Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
  • Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)
  • Prognostically significant (log-rank p = 4.5e-4)

[Figure: Feature Indices for the CC, CM and PB clusters; survival curves (Survival vs. Days) for CC, CM and PB.]
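
Survival curves of the kind compared by a log-rank test are commonly Kaplan-Meier estimates. Assuming that is what the plot shows, the estimator can be sketched as follows. The function name and any input data are hypothetical, and the log-rank test itself is not implemented here.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Kaplan-Meier survival estimate.

    times:  follow-up time per patient (e.g. days)
    events: 1 if death was observed at that time, 0 if the patient was censored
    Returns (time, estimated survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= removed
    return curve
```

Computing one such curve per cluster (CC, CM, PB) gives the curves that a log-rank test would then compare.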

slide-13
SLIDE 13

  • Cox proportional hazards
    – Gene expression class not significant (p = 0.58)
    – Morphology clustering p = 5.0e-3

Gene Expression Class Associations

[Figure: Subtype Percentage (%) by Cluster (CC, CM, PB) for the Classical, Mesenchymal, Neural and Proneural gene expression classes.]

slide-14
SLIDE 14

Clustering Validation

  • Separate set of 84 GBMs from Henry Ford Hospital
  • ClusterRepro: CC p = 7.2e-3, CM p = 1.3e-2

[Figure: Feature Indices for the CC, Mixed and CM clusters; survival curves (Survival vs. Months) for CC, Mixed and CM.]

slide-15
SLIDE 15

Associations

slide-16
SLIDE 16

Novel Pathology Modalities

Imaging

  • Excellent Spatial Resolution
  • Limited Molecular Resolution

Genomics

  • Excellent Molecular Resolution
  • Limited Spatial Resolution

1000’s of genes

slide-17
SLIDE 17

Quantum Dots

Professor Robin Bostick

slide-18
SLIDE 18

Imaging Pipeline – Feature Extraction

slide-19
SLIDE 19

Example Application: Cancer Stem Cell Niche

  • Cancer stem cells
    – Rare(?), proliferative cells, regenerative
    – Do they prefer to live near blood vessels, or necrosis?

slide-20
SLIDE 20

Extreme Spatio-Temporal Sensor Data Analytics

  • Leverage exascale data and computer resources to squeeze the most out of image, sensor or simulation data
  • Run lots of different algorithms to derive same features
  • Run lots of algorithms to derive complementary features
  • Data models and data management infrastructure to manage data products, feature sets and results from classification and machine learning algorithms

slide-21
SLIDE 21

Application Targets

  • Multi-dimensional spatial-temporal datasets
    – Microscopy image analyses
    – Biomass monitoring using satellite imagery
    – Weather prediction using satellite and ground sensor data
    – Large scale simulations
  • Can we analyze 100,000+ microscopy images per hour?
  • Correlative and cooperative analysis of data from multiple sensor modalities and sources
  • What-if scenarios and multiple design choices or initial conditions

slide-22
SLIDE 22

Biomass Monitoring (joint with ORNL)

  • Investigate changes in vegetation and land use
  • Combine hierarchical, multi-resolution coarse/fine-grained analytics into a unified framework
  • Changes identified using high temporal/low spatial resolution MODIS data
  • Segmentation and classification methods used to characterize changes using higher resolution data (e.g. multitemporal AWiFS data)
  • Segmentation and classification to identify man-made structures
slide-23
SLIDE 23

slide-24
SLIDE 24

Core Transformations

  • Data Cleaning and Low Level Transformations
  • Data Subsetting, Filtering, Subsampling
  • Spatio-temporal Mapping and Registration
  • Object Segmentation
  • Feature Extraction, Object Classification
  • Spatio-temporal Aggregation
  • Change Detection, Comparison, and Quantification
slide-25
SLIDE 25

Extreme DataCutter

DataCutter

  • Pipeline of filters connected through logical streams
  • In transit processing
  • Flow control between filters and streams
  • Developed 1990s-2000s; led to IBM System S

Extreme DataCutter

  • Two-level hierarchical pipeline framework
  • In transit processing
  • Coarse-grained components coordinated by a Manager that coordinates work on pipeline stages between nodes
  • Fine-grained pipeline operations managed at the node level
  • Both levels employ the filter/stream paradigm
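
The filter/stream paradigm can be illustrated with a small single-process sketch. This is not the DataCutter API. It assumes each filter is a plain function running in its own thread, each logical stream is a bounded queue, and flow control is simply a producer blocking when the downstream queue is full. The names run_filter, run_pipeline and capacity are hypothetical.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_END = object()  # sentinel marking the end of a stream


def run_filter(fn: Callable[[Any], Any], inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Apply fn to every item arriving on inbox and forward results to outbox."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)  # propagate end-of-stream downstream
            return
        outbox.put(fn(item))  # blocks while the downstream stream is full


def run_pipeline(source: Iterable[Any], filters: List[Callable[[Any], Any]],
                 capacity: int = 4) -> List[Any]:
    """Run the filters as a pipeline connected by bounded streams."""
    streams = [queue.Queue(maxsize=capacity) for _ in range(len(filters) + 1)]
    workers = [
        threading.Thread(target=run_filter, args=(fn, streams[i], streams[i + 1]))
        for i, fn in enumerate(filters)
    ]

    def feed() -> None:
        for item in source:
            streams[0].put(item)  # blocks when the first filter falls behind
        streams[0].put(_END)

    workers.append(threading.Thread(target=feed))
    for w in workers:
        w.start()

    results = []
    while True:  # drain the final stream in the calling thread
        item = streams[-1].get()
        if item is _END:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results
```

For example, run_pipeline(range(10), [lambda x: x + 1, lambda x: x * 2]) pushes ten items through two filters. In a two-level design of the kind described above, a similar structure would appear once between nodes and again within each node.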

slide-26
SLIDE 26

Extreme DataCutter – Two Level Model

slide-27
SLIDE 27

Node Level Work Scheduling

  • Features of Node Level Architectures
    – Nodes contain CPUs, GPUs
    – Each CPU contains multiple cores
    – GPU has complex internal architecture
    – Data locality within node
    – Data paths between CPUs and GPUs

Keeneland Node

slide-28
SLIDE 28

Node Level Work Scheduling

  • Attempt to minimize data movement
  • Identify and assign operations that perform well on GPU
  • Balance load between CPUs and GPUs
  • Prefetch data
  • Identify and use high bandwidth CPU/GPU data paths
  • Schedule exclusive GPU access for components (e.g. morphological reconstruction) requiring fine grained parallelism
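
Two of these goals, assigning GPU-friendly operations to the GPU and balancing load between CPUs and GPUs, can be sketched with a simple greedy heuristic. This is an illustrative assumption, not the scheduler used on Keeneland. It takes per-operation CPU and GPU time estimates as given, ignores data movement and prefetching, and uses hypothetical names throughout.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Op:
    name: str
    cpu_time: float  # estimated run time on one CPU core
    gpu_time: float  # estimated run time on one GPU


def schedule(ops: List[Op], n_cpu_cores: int, n_gpus: int) -> Tuple[Dict[str, str], float]:
    """Greedy earliest-finish-time assignment of operations to devices.

    Operations with the largest estimated GPU speedup choose first, so the
    GPUs tend to receive the work they accelerate most.
    Returns (operation name -> device name, overall finish time).
    """
    free_at = {f"cpu{i}": 0.0 for i in range(n_cpu_cores)}
    free_at.update({f"gpu{i}": 0.0 for i in range(n_gpus)})
    plan: Dict[str, str] = {}

    for op in sorted(ops, key=lambda o: o.cpu_time / o.gpu_time, reverse=True):

        def finish(device: str) -> float:
            cost = op.gpu_time if device.startswith("gpu") else op.cpu_time
            return free_at[device] + cost

        best = min(free_at, key=finish)  # device that would finish this op first
        free_at[best] = finish(best)
        plan[op.name] = best

    return plan, max(free_at.values())
```

With one CPU core and one GPU, an operation that is 8x faster on the GPU is placed there, while an operation with little speedup falls back to the CPU core once the GPU queue makes waiting more expensive than running on the CPU.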

slide-29
SLIDE 29

Node Level Work Scheduling

slide-30
SLIDE 30

Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)

slide-31
SLIDE 31

Control Structures for Handling Fine Grained/Runtime Dependent Parallelism in GPUs

Morphological Reconstruction: 8-15 fold speedup vs. one CPU core (Intel i7 2.66 GHz) on NVIDIA C2070 and GTX580 GPUs
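
For reference, grayscale morphological reconstruction by dilation can be written as a queue-driven propagation. The sketch below is a sequential pure-Python version with 8-connectivity, intended only to show why the work is runtime dependent: which pixels enter the queue depends on the image content. It is not the GPU implementation reported here, and the function name is hypothetical.

```python
from collections import deque
from typing import List

Image = List[List[int]]


def reconstruct(marker: Image, mask: Image) -> Image:
    """Grayscale reconstruction by dilation of marker under mask."""
    rows, cols = len(mask), len(mask[0])
    # The marker may never exceed the mask.
    out = [[min(m, k) for m, k in zip(m_row, k_row)] for m_row, k_row in zip(marker, mask)]
    pending = deque((r, c) for r in range(rows) for c in range(cols))
    while pending:
        r, c = pending.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    # A neighbour can rise to this pixel's value, capped by the mask.
                    candidate = min(out[r][c], mask[nr][nc])
                    if candidate > out[nr][nc]:
                        out[nr][nc] = candidate
                        pending.append((nr, nc))  # its neighbours need revisiting
    return out
```

In a one-row example with mask [1, 3, 3, 1, 5, 5] and a marker that is 3 at the second pixel and 0 elsewhere, the plateau of 3s is recovered, while the 5s are cut down to 1 because they are only reachable through a pixel whose mask value is 1.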

slide-32
SLIDE 32

Large Scale Data Management

  • Implemented with IBM DB2 for large scale pathology image metadata (~1 million markups per slide)
  • Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.
  • Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships
  • Highly optimized spatial query and analyses
slide-33
SLIDE 33

Spatial Centric – Pathology Imaging “GIS”

  • Point query: human marked point inside a nucleus
  • Window query: return markups contained in a rectangle
  • Spatial join query: algorithm validation/comparison
  • Containment query: nuclear feature aggregation in tumor regions
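
Three of these query types can be illustrated in plain Python on polygon markups. This is a sketch under simplifying assumptions, not the database implementation: markups are lists of (x, y) vertices, a nucleus counts as "in" a region when the mean of its vertices lies inside the region, and no spatial index is used. All function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Polygon = Sequence[Point]


def point_in_polygon(p: Point, poly: Polygon) -> bool:
    """Point query via ray casting: toggle on every edge crossed by a ray to +x."""
    x, y = p
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, list(poly[1:]) + [poly[0]]):
        if (y1 > y) != (y2 > y):  # edge spans the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bbox(poly: Polygon) -> Tuple[float, float, float, float]:
    """Axis-aligned bounding box as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def window_query(markups: Dict[str, Polygon],
                 window: Tuple[float, float, float, float]) -> List[str]:
    """Return names of markups fully contained in the rectangle window."""
    wx1, wy1, wx2, wy2 = window
    hits = []
    for name, poly in markups.items():
        x1, y1, x2, y2 = bbox(poly)
        if wx1 <= x1 and wy1 <= y1 and x2 <= wx2 and y2 <= wy2:
            hits.append(name)
    return hits


def mean_feature_in_region(markups: Dict[str, Polygon], feature: Dict[str, float],
                           region: Polygon) -> float:
    """Containment query: average a feature over nuclei inside a region."""
    values = []
    for name, poly in markups.items():
        cx = sum(x for x, _ in poly) / len(poly)  # vertex mean as a centroid proxy
        cy = sum(y for _, y in poly) / len(poly)
        if point_in_polygon((cx, cy), region):
            values.append(feature[name])
    if not values:
        raise ValueError("no markups fall inside the region")
    return sum(values) / len(values)
```

A production system would answer the same questions through a spatial index and the database's spatial predicates, so that a window query over a million markups per slide does not scan every polygon.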

slide-34
SLIDE 34

PAIS (Pathology Analytical Imaging Standards)

Supported by caBIG, R01 and ACTSI

  • PAIS Logical Model
    – 62 UML classes
    – markups, annotations, imageReferences, provenance
  • PAIS Data Representation
    – XML (compressed) or HDF5
  • PAIS Databases
    – loading, managing, querying and sharing data
    – Native XML DBMS or RDBMS + SDBMS

slide-35
SLIDE 35

Example Query for Integrative Studies

  • Find mean nuclear feature vector and covariance on tumor regions for each patient grouped by tumor subtype

PAIS: Example Queries

SELECT c.pais_uid, pc.subtype,
       AVG(area), AVG(perimeter), AVG(eccentricity),
       COVARIANCE(area, perimeter), COVARIANCE(area, eccentricity)
FROM pais.calculation_flat c,
     TCGA.PATIENT_CHARACTERISTIC pc,
     pais.patient p
WHERE p.patientid = pc.patient_id
  AND p.pais_uid = c.pais_uid
GROUP BY c.pais_uid, pc.subtype;

[Figure: Feature Indices for clusters 1-4; survival curves (Survival vs. Days) for Cluster 1, Cluster 2, Cluster 3 and Cluster 4.]
slide-36
SLIDE 36

Algorithm Validation: Intersection between Two Result Sets (Spatial Join)

PAIS: Example Queries

slide-37
SLIDE 37

VLDB 2012

Change Detection, Comparison, and Quantification

slide-38
SLIDE 38

Summary and Perspective

  • Large scale integrative data analytic methods and tools to integrate clinical, molecular, Pathology, Radiology data
  • Characterize new cancer subtypes and biomarkers, predict outcome, treatment response
  • Algorithms to quantify Pathology classification
  • HPC/BIGDATA analysis pipelines
slide-39
SLIDE 39

Importance:

  • Computer Science: general approaches to analysis and classification of very large datasets from low-dimensional spatio-temporal sensors
  • Biomedical: generate basic insights into pathophysiology, clues to new treatments, better ways of evaluating existing treatments and core infrastructure needed for comparative effectiveness research studies

slide-40
SLIDE 40

Thanks to:

  • In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
  • caGrid Knowledge Center: Joel Saltz, Mike Caliguiri, Steve Langella co-Directors; Tahsin Kurc, Himanshu Rathod Emory leads
  • caBIG In vivo imaging team: Eliot Siegel, Paul Mulhern, Adam Flanders, David Channon, Daniel Rubin, Fred Prior, Larry Tarbox and many others
  • In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
  • Emory ATC Supplement team: Tim Fox, Ashish Sharma, Tony Pan, Edi Schreibmann, Paul Pantalone
  • Digital Pathology R01: Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)
  • NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe
  • ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
  • NSF Scientific Workflow Collaboration: Vijay Kumar, Yolanda Gil, Mary Hall, Ewa Deelman, Tahsin Kurc, P. Sadayappan, Gaurang Mehta, Karan Vahi

slide-41
SLIDE 41

Thanks!