Tools, Techniques and Methods for Integrative Data Analytics
Joel Saltz MD, PhD
Director, Center for Comprehensive Informatics
Center for Comprehensive Informatics
Contributions
- Computer Science: Methods and middleware for analysis, classification of very large datasets from low dimensional spatio-temporal sensors; methods to carry out comparisons and change detection between sensor datasets
- Biomedical: Mine whole slide image datasets to better predict outcome and response to treatments, generate basic insights into pathophysiology and identify new treatment targets
- CFD: Quantitative characterization of spatio-temporal features generated by large scale simulations, comparisons with experimental results, uncertainty quantification
- Leverage exascale data and computer resources to squeeze the most out of image, sensor or simulation data
- Run lots of different algorithms to derive the same features
- Run lots of algorithms to derive complementary features
- Data models and data management infrastructure to manage data products, feature sets and results from classification and machine learning algorithms
Extreme Spatio-Temporal Data Analytics
Application Targets
- Multi-dimensional spatial-temporal datasets
  – Microscopy image analyses
  – Biomass monitoring using satellite imagery
  – Weather prediction using satellite and ground sensor data
  – Large scale simulations
- Can we analyze 100,000+ microscopy images per hour?
- Correlative and cooperative analysis of data from multiple sensor modalities and sources
- What-if scenarios and multiple design choices or initial conditions
Core Transformations
- Data Cleaning and Low Level Transformations
- Data Subsetting, Filtering, Subsampling
- Spatio-temporal Mapping and Registration
- Object Segmentation
- Feature Extraction, Object Classification
- Spatio-temporal Aggregation
- Change Detection, Comparison, and Quantification
Digital Pathology Analytics
Anaplastic Astrocytoma (WHO grade III); Glioblastoma (WHO grade IV)
Morphological Tissue Classification
Nuclei Segmentation Cellular Features
Lee Cooper, Jun Kong
Whole Slide Imaging
Whole Slide Imaging: Scale
Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental Results
Pathology Computer Assisted Diagnosis
Shimada, Gurcan, Kong, Saltz
Computerized Classification System for Grading Neuroblastoma
- Background Identification
- Image Decomposition (Multi-resolution levels)
- Image Segmentation (EMLDA)
- Feature Construction (2nd-order statistics, Tonal Features)
- Feature Extraction (LDA) + Classification (Bayesian)
- Multi-resolution Layer Controller (Confidence Region)
[Flowchart: multi-resolution classification. TRAINING branch: Training Tiles, Down-sampling, Segmentation, Feature Construction, Feature Extraction, Classifier Training. TESTING branch: Image Tile, Initialization (I = L), Background? (if yes, Label), Create Image I(L), Segmentation, Feature Construction, Feature Extraction, Classification, Within Confidence Region? (if no and I > 1, set I = I - 1 and repeat).]
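The testing loop in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes level L is the coarsest resolution and level 1 is full resolution, and the names `downsample`, `classify_multires`, `toy_classifier` and the 0.9 confidence threshold are all hypothetical. Segmentation, feature construction and the background check are folded into or omitted from the stand-in classifier.

```python
from typing import Callable, List, Tuple

Tile = List[List[float]]


def downsample(tile: Tile, factor: int) -> Tile:
    """Average non-overlapping factor x factor blocks.

    Assumes both tile dimensions are divisible by factor.
    """
    rows, cols = len(tile) // factor, len(tile[0]) // factor
    return [
        [
            sum(tile[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / factor ** 2
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def classify_multires(
    tile: Tile,
    classify: Callable[[Tile, int], Tuple[str, float]],
    num_levels: int,
    threshold: float = 0.9,  # hypothetical stand-in for the confidence region
) -> Tuple[str, int]:
    """Start at the coarsest level and refine only while confidence is low."""
    level = num_levels  # I = L
    while True:
        image = downsample(tile, 2 ** (level - 1))  # Create Image I(L)
        label, confidence = classify(image, level)
        # Accept the label if it is confident, or if no finer level remains.
        if confidence >= threshold or level == 1:
            return label, level
        level -= 1  # I = I - 1


def toy_classifier(image: Tile, level: int) -> Tuple[str, float]:
    """Hypothetical classifier: thresholds the mean intensity of the image."""
    values = [v for row in image for v in row]
    mean = sum(values) / len(values)
    label = "class_b" if mean > 0.5 else "class_a"
    return label, min(1.0, 0.5 + abs(mean - 0.5))


bright = [[1.0] * 4 for _ in range(4)]
ambiguous = [[0.55] * 4 for _ in range(4)]
print(classify_multires(bright, toy_classifier, num_levels=3))
print(classify_multires(ambiguous, toy_classifier, num_levels=3))
```

The point of the design is cost: a tile that is classified confidently at a coarse level never pays for full-resolution processing.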
Direct Study of Relationship Between vs
Consensus clustering of morphological signatures
Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients.
Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering.
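One way to quantify co-clustering over repeated K-means runs is sketched below. It is an illustration under stated assumptions, not the study's code: the subsampling step, the run count, the plain Lloyd's-algorithm `kmeans`, and the function names are all hypothetical choices. Each entry of the returned matrix is the fraction of runs in which two samples landed in the same cluster, counted over the runs where both were drawn.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, rng: random.Random,
           n_iter: int = 20) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


def co_clustering(points: List[Point], k: int, n_runs: int = 50,
                  subsample: float = 0.8, seed: int = 0) -> List[List[float]]:
    """Fraction of runs in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        idx = rng.sample(range(n), max(k, int(subsample * n)))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += int(labels[a] == labels[b])
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Two well-separated toy groups stand in for morphological signatures.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
matrix = co_clustering(data, k=2)
```

Entries near 1 or 0 indicate a stable grouping at that value of k; many intermediate entries suggest that k does not fit the data well.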
Nuclear Features Used to Classify GBMs
[Figure: plots of Silhouette Area vs. # Clusters and Silhouette Value by Cluster]
Clustering identifies three morphological groups
- Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
- Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)
- Prognostically significant (log-rank p = 4.5e-4)
[Figure: Feature Indices by cluster (CC, CM, PB); Survival vs. Days for CC, CM, PB]
Novel Pathology Modalities
Imaging: Excellent Spatial Resolution, Limited Molecular Resolution
Genomics: Excellent Molecular Resolution, Limited Spatial Resolution
1000’s of genes
Extreme DataCutter Prototype
DataCutter
- Pipeline of filters connected through logical streams
- In transit processing
- Flow control between filters and streams
- Developed 1990s-2000s; led to IBM System S
Extreme DataCutter
- Two level hierarchical pipeline framework
- In transit processing
- Coarse grained components coordinated by a Manager that assigns work on pipeline stages between nodes
- Fine grained pipeline operations managed at the node level
- Both levels employ filter/stream paradigm
- Bottom line – everything ends up as DAGs
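The filter/stream paradigm can be illustrated with a small sketch in which each filter runs in its own thread and streams are bounded queues, so a slow downstream filter applies back-pressure to upstream ones. This is not the DataCutter API; `run_filter`, `run_pipeline` and the `capacity` parameter are hypothetical names, and the sketch covers only a single linear pipeline on one node, not the two-level model.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_END = object()  # end-of-stream marker passed from filter to filter


def run_filter(fn: Callable[[Any], Any],
               inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Read items from one stream, transform them, write to the next."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)  # forward the marker so later filters stop too
            return
        outbox.put(fn(item))


def run_pipeline(source: Iterable[Any],
                 filters: List[Callable[[Any], Any]],
                 capacity: int = 4) -> List[Any]:
    """Connect filters with bounded streams and collect the final output."""
    # A bounded queue blocks the producer when full: simple flow control.
    streams = [queue.Queue(maxsize=capacity) for _ in range(len(filters) + 1)]
    workers = [
        threading.Thread(target=run_filter,
                         args=(fn, streams[i], streams[i + 1]))
        for i, fn in enumerate(filters)
    ]

    def feed() -> None:
        for item in source:
            streams[0].put(item)
        streams[0].put(_END)

    workers.append(threading.Thread(target=feed))
    for w in workers:
        w.start()

    results = []
    while True:  # drain the last stream on the calling thread
        item = streams[-1].get()
        if item is _END:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results


# Two toy filters standing in for pipeline stages.
output = run_pipeline(range(10), [lambda x: x + 1, lambda x: x * 2])
```

Because each stage has exactly one worker here, items leave the pipeline in the order they entered; replicating a filter across several workers would need explicit reordering or order-insensitive consumers.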
Extreme DataCutter – Two Level Model
Node Level Work Scheduling
Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)
Structured/Unstructured Grid Calculations with Unpredictable Runtime Dependencies
Key Kernel in Distance Transform, Morphological Reconstruction, Delaunay Triangulation
Control Structures for Handling Fine Grained/Runtime Dependent Parallelism in GPUs
Morphological Reconstruction:
8-15 fold speedup vs. one CPU core (Intel i7 2.66 GHz) on NVIDIA C2070 and GTX580 GPUs
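To show why morphological reconstruction has runtime-dependent parallelism, here is a minimal sequential sketch of grayscale reconstruction by dilation driven by a work queue. It is a CPU illustration only, not the GPU implementation measured above; the function name and the choice of 4-connectivity are assumptions. How many pixels re-enter the queue depends entirely on the image content, which is what makes the workload hard to predict.

```python
from collections import deque
from typing import List

Image = List[List[int]]


def reconstruct_by_dilation(marker: Image, mask: Image) -> Image:
    """Grow the marker under the mask until no pixel can increase further."""
    rows, cols = len(mask), len(mask[0])
    # Clamp the marker so it never exceeds the mask.
    out = [[min(marker[r][c], mask[r][c]) for c in range(cols)]
           for r in range(rows)]
    # Start with every pixel on the frontier; a pixel is re-queued
    # whenever its value is raised by a neighbor.
    frontier = deque((r, c) for r in range(rows) for c in range(cols))
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-connectivity
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # The neighbor may rise to this pixel's value, capped by mask.
                candidate = min(out[r][c], mask[nr][nc])
                if candidate > out[nr][nc]:
                    out[nr][nc] = candidate
                    frontier.append((nr, nc))
    return out


mask = [[1, 4, 4, 1, 3, 3]]
marker = [[0, 4, 0, 0, 0, 0]]
result = reconstruct_by_dilation(marker, mask)
```

In this toy row the seed fills its own plateau but can only leak into the second plateau at the height of the valley between them.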
“Speedup” relative to single CPU core
Large Scale Data Management
- Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.
- Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships
- Highly optimized spatial query and analyses
- Implemented in a variety of ways including optimized CPU/GPU, Hadoop/HDFS and IBM DB2
Spatial Centric – Pathology Imaging “GIS”
- Point query: human marked point inside a nucleus
- Window query: return markups contained in a rectangle
- Spatial join query: algorithm validation/comparison
- Containment query: nuclear feature aggregation in tumor regions
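The point and window queries listed above can be illustrated with a small in-memory sketch. It is not the PAIS schema or its SQL: the dictionary of named polygons, the function names, and the linear scan without a spatial index are all simplifying assumptions. The point query uses the standard ray-casting test.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def contains_point(polygon: Polygon, point: Point) -> bool:
    """Ray casting: count boundary crossings of a ray going in +x."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_query(markups: Dict[str, Polygon], point: Point) -> List[str]:
    """Names of markups whose boundary encloses the given point."""
    return [name for name, poly in markups.items()
            if contains_point(poly, point)]


def window_query(markups: Dict[str, Polygon], xmin: float, ymin: float,
                 xmax: float, ymax: float) -> List[str]:
    """Names of markups whose vertices all lie inside the rectangle."""
    return [name for name, poly in markups.items()
            if all(xmin <= px <= xmax and ymin <= py <= ymax
                   for px, py in poly)]


# Hypothetical markups: two square "nuclei".
markups = {
    "nucleus_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "nucleus_b": [(10, 10), (12, 10), (12, 12), (10, 12)],
}
hits = point_query(markups, (2, 2))
in_window = window_query(markups, 9, 9, 13, 13)
```

A production system would replace the linear scan with an index such as an R-tree or a database spatial extension, which is where the optimization effort described on these slides goes.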
Algorithm Validation: Intersection between Two Result Sets (Spatial Join)
PAIS: Example Queries
VLDB 2012
Change Detection, Comparison, and Quantification
CPU/GPU Methods for Comparing Many Polygons
- Cross-compare two sets of polygons, segmented by different algorithms or the same algorithm with different parameters
- Jaccard similarity of P and Q -- two sets of polygons representing the spatial boundaries of objects generated by two methods from the same image.
- PixelBox accepts an array of polygon pairs as input and computes their areas of intersection and union.
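For a single polygon pair, the Jaccard similarity is the area of intersection divided by the area of union, J(p, q) = |p ∩ q| / |p ∪ q|. The sketch below estimates both areas by testing pixel centers inside the pair's joint bounding box. It illustrates the pixel-counting idea only and is not the PixelBox algorithm or its GPU kernel; the function names are hypothetical and the result is exact only for polygons aligned to the pixel grid.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def contains(polygon: Polygon, x: float, y: float) -> bool:
    """Ray-casting point-in-polygon test."""
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def intersection_union_area(p: Polygon, q: Polygon) -> Tuple[int, int]:
    """Count unit pixels whose centers fall in both / in either polygon."""
    xs = [x for x, _ in p + q]
    ys = [y for _, y in p + q]
    inter = union = 0
    for px in range(math.floor(min(xs)), math.ceil(max(xs))):
        for py in range(math.floor(min(ys)), math.ceil(max(ys))):
            cx, cy = px + 0.5, py + 0.5  # pixel center
            in_p, in_q = contains(p, cx, cy), contains(q, cx, cy)
            inter += int(in_p and in_q)
            union += int(in_p or in_q)
    return inter, union


def jaccard(p: Polygon, q: Polygon) -> float:
    inter, union = intersection_union_area(p, q)
    return inter / union if union else 0.0


# Two overlapping 4 x 4 squares, offset by 2 in x.
p = [(0, 0), (4, 0), (4, 4), (0, 4)]
q = [(2, 0), (6, 0), (6, 4), (2, 4)]
areas = intersection_union_area(p, q)  # intersection 8, union 24
```

Each pixel test is independent of the others, which is why this style of computation maps naturally onto GPU threads.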
Performance Improvement from PixelBox (VLDB 2012)
Summary and Perspective
- Extreme spatio-temporal data analytics
- Quantitative characterization of spatio-temporal features generated by large scale simulations, comparisons with experimental results
- Methods and tools for extreme scale data analysis pipelines
- Uncertainty quantification, comparison with experimental results
Thanks to:
- In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
- caGrid Knowledge Center: Joel Saltz, Mike Caliguiri, Steve Langella co-Directors; Tahsin Kurc, Himanshu Rathod Emory leads
- caBIG In vivo imaging team: Eliot Siegel, Paul Mulhern, Adam Flanders, David Channon, Daniel Rubin, Fred Prior, Larry Tarbox and many others
- In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
- Emory ATC Supplement team: Tim Fox, Ashish Sharma, Tony Pan, Edi Schreibmann, Paul Pantalone
- Digital Pathology R01: Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)
- NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe
- ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
- NSF Scientific Workflow Collaboration: Vijay Kumar, Yolanda Gil, Mary Hall, Ewa Deelman, Tahsin Kurc, P. Sadayappan, Gaurang Mehta, Karan Vahi