Correlation Similarity Matrices on Multicore Clusters

Fast Parallel Construction of Correlation Similarity Matrices for Gene Co-Expression Networks on Multicore Clusters
Jorge González-Domínguez, María J. Martín
Computer Architecture Group, University of A Coruña, Spain
{jgonzalezd,mariam}@udc.es
International Conference on Computational Science ICCS 2017

1. Introduction
2. Parallel Construction of Similarity Matrices
3. Experimental Results
4. Conclusions

Introduction

Gene Co-Expression Networks (I)
- Graphical models to illustrate interactions among genes
- Connected groups of genes indicate biological relationships
  - Genes controlled by the same transcriptional regulatory program
  - Functionally related genes
  - Members of the same protein complex
  - More...
- Nodes represent genes. Edges represent interesting correlations.

Gene Co-Expression Networks (and II)

Calculation of Co-Expression Networks (I)
1. Read expression matrix
2. Construct similarity matrix
3. Calculate the threshold for the network
4. Construct the network discarding those elements lower than threshold

Calculation of Co-Expression Networks (II)
1. Read expression matrix
   - The intensity of fluorescence in microarrays or RNA-Seq experiments for each gene and sample
   - Quantifies the expression of that gene in that sample

Calculation of Co-Expression Networks (III)
2. Construct similarity matrix
   - Pearson’s or other correlation measure for each gene pair
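
As a rough illustration of this step, the sketch below computes Pearson's correlation for every gene pair of a tiny, made-up expression matrix (one row per gene, one column per sample). The function names `pearson` and `similarity_matrix` are hypothetical, only the lower triangle is stored because the matrix is symmetric, and genes with zero variance are not handled.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Assumption: neither gene is constant, so the denominator is non-zero.
    return cov / sqrt(var_x * var_y)

def similarity_matrix(expr):
    """Lower triangle (diagonal included) of the gene-by-gene correlation matrix."""
    rows = []
    for i, gene_i in enumerate(expr):
        row = [pearson(gene_i, expr[j]) for j in range(i)]
        row.append(1.0)  # a gene is perfectly correlated with itself
        rows.append(row)
    return rows

# Made-up expression values: 3 genes, 4 samples.
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
sim = similarity_matrix(expr)
for row in sim:
    print(row)
```

Row i of the result holds i correlations plus the diagonal entry, so the cost of a row grows with its index.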

Calculation of Co-Expression Networks (IV)
3. Calculate the threshold for the network
   - Based on the measures of the similarity matrices

Calculation of Co-Expression Networks (and V)
4. Construct the network discarding those elements lower than threshold
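
The final step can be sketched as a filter over the similarity matrix. The example below is only an illustration: the function name `build_network`, the sample scores, and the threshold value are made up, and comparing the absolute value of the correlation against the threshold is an assumption, since the slides only say that elements lower than the threshold are discarded.

```python
def build_network(sim, threshold):
    """Return the edges (i, j, r) whose |r| reaches the threshold.

    `sim` is a lower-triangular similarity matrix stored as a list of rows,
    where row i holds the scores for columns 0..i (diagonal included).
    """
    edges = []
    for i, row in enumerate(sim):
        for j in range(i):  # strictly below the diagonal, so no self-loops
            r = row[j]
            # Assumption: strong negative correlations are kept as well.
            if abs(r) >= threshold:
                edges.append((i, j, r))
    return edges

# Made-up similarity scores for four genes.
sim = [
    [1.0],
    [0.95, 1.0],
    [-0.10, 0.40, 1.0],
    [0.88, -0.91, 0.05, 1.0],
]
edges = build_network(sim, threshold=0.85)
print(edges)
```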

Background: RMTGeneNet
Scott M. Gibson, Stephen P. Ficklin, Sven Isaacson, Feng Luo, Frank A. Feltus, and Melissa C. Smith. Massive-Scale Gene Co-Expression Network Construction and Robustness Testing Using Random Matrix Theory. PLOS One, 8(2), 2013.
- Three modules:
  - Pearson’s correlation to construct similarity matrix
  - Random Matrix Theory (RMT) to calculate the threshold
  - Discard links with correlation value lower than threshold
- Networks with high robustness and sensitivity
- C++ implementation available at https://github.com/spficklin/RMTGeneNet

Goal of the work
- The RMTGeneNet module that constructs the similarity matrices takes most of the runtime
- Acceleration of the construction of similarity matrices with Pearson’s correlation
- MPICorMat
  - Improvement of memory accesses in the sequential computation
  - Exploitation of multicore clusters with MPI and OpenMP
  - Useful for large networks (Big Data)
  - It can substitute the first module of RMTGeneNet
  - Available at https://sourceforge.net/projects/mpicormat/

Parallel Construction of Similarity Matrices

Programming technologies
- MPI (Message Passing Interface)
  - De-facto standard for distributed memory systems
  - Several processes with associated local memory
  - Each process is associated with one core or a group of cores
  - Data exchange performed through communication routines (often the main performance bottleneck)
- OpenMP
  - Interface for shared memory systems
  - A set of compiler directives inserted in the code
  - Fork-join model: master thread creates slave threads that can perform different tasks

Data replication (and II)
2 nodes; 4 cores per node

Workload distribution
- All pairs (X,Y) with the same X are assigned to the same process
- Variable number of rows per process to balance the workload
- Similar computational cost for each pair
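
One way to obtain such a distribution is sketched below. It assumes that row i holds i pairs (columns 0 to i − 1), so blocks of early rows must be longer than blocks of late rows to contain a similar number of pairs. The greedy splitting rule and the name `row_ranges` are illustrative assumptions; the slides do not state how MPICorMat actually picks the boundaries.

```python
def row_ranges(num_rows, num_procs):
    """Split rows 0..num_rows-1 into contiguous blocks with similar pair counts.

    Row i contributes i pairs, so later blocks contain fewer rows.
    Returns one (first_row, last_row) tuple per process, both inclusive.
    """
    total_pairs = num_rows * (num_rows - 1) // 2
    ranges = []
    row = 0
    assigned = 0  # pairs handed out so far
    for p in range(num_procs):
        first = row
        goal = total_pairs * (p + 1) // num_procs  # cumulative target
        # Take rows until the cumulative count reaches this process's goal;
        # the last process always takes whatever rows remain.
        while row < num_rows and (assigned < goal or p == num_procs - 1):
            assigned += row
            row += 1
        ranges.append((first, row - 1))
    return ranges

for first, last in row_ranges(1000, 4):
    pairs = sum(range(first, last + 1))
    print(f"rows {first}-{last}: {last - first + 1} rows, {pairs} pairs")
```

Each block overshoots its target by less than one row, so the imbalance stays small when the number of rows is much larger than the number of processes.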

Pseudocode of MPICorMat

    Read input matrix M with the expression values;
    Calculate myIniRow and myLastRow;
    Initialize matrix of private scores myS := 1;
    Initialize iterator iter := 0;
    #pragma omp parallel for schedule(dynamic);
    foreach row i from myIniRow to myLastRow {
        foreach column j from 0 to i − 1 {
            myS[iter] := CalculatePearson(i, j);  # GSL routine
            iter++;
        }
        iter++;  # Score for diagonal elements is 1.0
    }
    Write partial result with MPI_File_Write(myS);
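
The loop of the pseudocode can be tried out with the sequential Python sketch below, which covers the work of a single process and leaves out MPI, OpenMP, GSL and file output. One deliberate difference is that the position in `my_s` is computed from the row index instead of being advanced with a shared counter; this is an assumption made here so that rows could be processed in any order, and the slides do not say how MPICorMat handles the iterator inside the parallel loop. The names `pearson`, `local_scores` and `row_offset` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient (stand-in for the GSL routine)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # assumes non-constant genes

def local_scores(expr, my_ini_row, my_last_row):
    """Packed scores for rows my_ini_row..my_last_row (row i has i + 1 entries)."""
    def row_offset(i):
        # Number of entries stored by the local rows that precede row i.
        return i * (i + 1) // 2 - my_ini_row * (my_ini_row + 1) // 2

    my_s = [1.0] * row_offset(my_last_row + 1)  # diagonal entries stay at 1.0
    for i in range(my_ini_row, my_last_row + 1):
        base = row_offset(i)
        for j in range(i):
            my_s[base + j] = pearson(expr[i], expr[j])
    return my_s

# Made-up expression values: 3 genes, 4 samples.
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
part_a = local_scores(expr, 0, 0)   # block of a first process
part_b = local_scores(expr, 1, 2)   # block of a second process
print(part_a + part_b)
```

Concatenating the blocks in process order gives the same packed lower triangle that a single process would produce for all rows.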

Data replication (I)
- Advantage
  - All processes have their own copy of the expression matrix
  - Communication avoidance: no communication during the matrix construction
- Drawback
  - Memory overhead
  - We reduced it thanks to several threads working over the same copy of the matrix

Data replication (and II)
- Only MPI
  - One process per core
  - M × N × 8 floats (8 processes: 2 nodes × 4 cores per node)
- MPI+OpenMP
  - One process per node, C threads per process
  - M × N × 2 floats (2 processes: one per node)
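
The factors 8 and 2 follow from counting one copy of the M × N expression matrix per MPI process on 2 nodes with 4 cores each. The sketch below redoes that count; the function name `replicated_values` is hypothetical, and the GDS3795 dimensions (54,675 genes, 200 samples) are borrowed from the dataset table purely as an example size.

```python
def replicated_values(genes, samples, nodes, cores_per_node, hybrid):
    """Total number of matrix values held in memory across the whole cluster.

    Every MPI process keeps its own copy of the expression matrix.
    hybrid=False: one process per core (only MPI).
    hybrid=True:  one process per node, threads share that copy (MPI+OpenMP).
    """
    processes = nodes if hybrid else nodes * cores_per_node
    return genes * samples * processes

genes, samples = 54675, 200  # example size taken from GDS3795
only_mpi = replicated_values(genes, samples, nodes=2, cores_per_node=4, hybrid=False)
mpi_openmp = replicated_values(genes, samples, nodes=2, cores_per_node=4, hybrid=True)
print(only_mpi, mpi_openmp, only_mpi // mpi_openmp)
```

The ratio between both configurations equals the number of cores per node, which is the saving obtained by letting threads share one copy.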

Experimental Results

System characteristics
- Hardware
  - 16 nodes connected through InfiniBand FDR
  - Two 8-core Intel Xeon E5-2660 Sandy Bridge processors per node (16 cores)
  - Non-Uniform Memory Access (NUMA) with 32MB per processor
- Software
  - OpenMPI v.1.7.2
  - Support for OpenMP v.3.0
  - GSL v.1.13 for Pearson’s correlation

Datasets
Real data downloaded from the Gene Expression Omnibus (GEO) Dataset Browser, available at the National Center for Biotechnology Information (NCBI) website.

Name      Number of Genes   Number of Samples
GDS5037   41,000            108
GDS3242   61,170            128
GDS3244   61,170            160
GDS3795   54,675            200

Summary of results (Runtime in seconds)
- MPICorMat: two processes per node (one per processor) and eight threads per process. Pearson’s correlation.
- TINGe: one process per core (no multithreading support). Mutual Information.

Cores   Tool         GDS5037     GDS3242     GDS3244     GDS3795
1       RMTGeneNet   5,336.51    13,124.76   16,004.83   15,470.96
1       TINGe        5,206.48    12,442.99   14,664.41   12,965.41
1       MPICorMat    2,539.96    6,652.75    8,365.14    8,139.81
16      TINGe        398.78      1,041.91    1,400.93    1,016.63
16      MPICorMat    186.81      488.93      595.44      572.06
256     TINGe        48.91       114.47      129.71      119.35
256     MPICorMat    13.84       33.46       41.02       39.26
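
Speedups can be derived from these runtimes with a few lines of Python. The sketch below divides each tool's single-core runtime by its runtime on more cores; taking each tool's own single-core run as the baseline is an assumption made here, since the slides do not state which baseline the scalability plots use.

```python
DATASETS = ["GDS5037", "GDS3242", "GDS3244", "GDS3795"]

# Runtimes in seconds, copied from the summary table, keyed by (tool, cores).
RUNTIME = {
    ("TINGe", 1): [5206.48, 12442.99, 14664.41, 12965.41],
    ("TINGe", 16): [398.78, 1041.91, 1400.93, 1016.63],
    ("TINGe", 256): [48.91, 114.47, 129.71, 119.35],
    ("MPICorMat", 1): [2539.96, 6652.75, 8365.14, 8139.81],
    ("MPICorMat", 16): [186.81, 488.93, 595.44, 572.06],
    ("MPICorMat", 256): [13.84, 33.46, 41.02, 39.26],
}

def speedups(tool, cores):
    """Speedup per dataset relative to the same tool running on one core."""
    base = RUNTIME[(tool, 1)]
    return [b / t for b, t in zip(base, RUNTIME[(tool, cores)])]

for tool in ("TINGe", "MPICorMat"):
    for cores in (16, 256):
        values = ", ".join(
            f"{name}: {s:.1f}" for name, s in zip(DATASETS, speedups(tool, cores))
        )
        print(f"{tool} on {cores} cores -> {values}")
```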

Scalability results
[Figure: speedup versus number of nodes (1, 2, 4, 8, 16) for MPICorMat and TINGe, one panel per dataset: GDS5037, GDS3242, GDS3244, GDS3795]

Conclusions
