SPRINT: a Simple Parallel R INTerface to High Performance Computing

SPRINT: a Simple Parallel R INTerface to High Performance Computing (HPC) and a Parallel R Function Library

Muriel Mewissen
Division of Pathway Medicine, University of Edinburgh, UK
useR! 2010, Gaithersburg, USA, 21st July 2010

Talk Outline

- Motivation: Pathway Biology
- High-throughput technologies, HPC & R
- SPRINT:
  • Functionality and Releases
  • Architecture
  • Performance
- Future work

What is Pathway Biology?

Pathway biology is a systems biology approach for understanding a biological process:

- empirically, by functional association of multiple gene products & metabolites
- computationally, by defining networks of cause-effect relationships.

Pathway models link the molecular, cellular and whole-organism levels.

[Figure: Microarray DB, Pathway DB, Pathway Analysis]

[Figure: Differentially expressed genes in neonates, control vs infected (FDR p>1x10^-5, FC ± 4)]

- High-throughput approaches to mapping and understanding the host response to infection.
- Targeting the host, NOT the “bug”, as an anti-infective strategy.

The story starts at the bedside.

High-throughput data requires high-throughput analysis

[Figure: very large post-genomic data, HPC, SPRINT, biological results]

Using all or many genes (exons, SNPs, …) will either crash or be very slow:

- Space limitation (“… cannot allocate vector of size …”)
- Time limitation

Issues with Existing Parallel R Packages

Parallel building blocks:

- Bespoke implementation each time
- Difficult to program: they require the scientist to also be a parallel programmer!
- Rmpi, rpvm, nws & sleigh

Task farms:

- Require substantial changes to existing scripts
- No data dependencies allowed: they cannot be used to solve certain classes of problems
- SNOW, R/Parallel, papply, BioPara, taskPR

SPRINT: Simple Parallel R INTerface

SPRINT aims to overcome limitations on data size and analysis time by providing easy access to HPC for all R users.

- An intelligent HPC harness:
  • Implemented in C & MPI
  • Scalable (RAM & CPU), portable and flexible
- R parallel function library:
  • Popular functions, complex functions, open to contributions
  • Optimized
- User friendly:
  • Aimed at biologists and biostatisticians
  • Minimal code changes, R interface

SPRINT User Requirement Survey

Function requested            #    SPRINT function   Release
Standard R functions          15
Permutation, bootstrapping    10   pmaxT()           Beta 2 (Jun 2010)
                                   pboot()           Beta 4 (TBC)
Machine learning algorithms   9    ppam()            Beta 3 (Soon)
Correlation functions         8    pcor()            Beta 1 (Jan 2010)
Normalisation                 8
Standard statistics           7
Matrix operations             7
Other                         12

- No GUI
- Full report available at www.r-sprint.org

SPRINT Architecture

Master (processor 0)                  Workers (processors 1-N)
Start R                               Start R
R imports the SPRINT framework        R imports the SPRINT framework
R script calls parallel R function
Command passed to SPRINT              Command received
         SPRINT parallel function executed by all nodes
Results return to R
Shutdown SPRINT                       Shutdown SPRINT
Exit R                                Exit R
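The command flow on this slide can be illustrated with a small serial simulation in plain R. This is only a sketch under stated assumptions: the real harness is written in C and MPI, and the names `registry`, `psum` and `run_command` are hypothetical, invented here for illustration. Each "rank" is simulated in turn rather than run as a separate process.

```r
# Hypothetical registry of parallel functions, keyed by command name.
# It stands in for whatever lookup the real C/MPI harness uses.
registry <- list(
  psum = function(rank, n_procs, data) {
    # Each process sums its own interleaved share of the data.
    idx <- which((seq_along(data) - 1) %% n_procs == rank)
    as.numeric(sum(data[idx]))
  }
)

# Simulated dispatch: the master (rank 0) names a command, every rank
# (master included) executes it, and the partial results are combined
# before being handed back to the calling R script.
run_command <- function(command, data, n_procs = 4) {
  fn <- registry[[command]]
  if (is.null(fn)) stop("unknown command: ", command)
  partial <- vapply(seq_len(n_procs) - 1,
                    function(rank) fn(rank, n_procs, data),
                    numeric(1))
  sum(partial)
}

result <- run_command("psum", data = 1:100, n_procs = 4)
print(result)
```

The point of the design is that the R script only ever sees `run_command()`-style calls; the partitioning and the reduction stay hidden behind the command name.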

Code Modification

Serial script:

    library(multtest)   # provides mt.maxT() and the golub data set
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
    quit(save="no")

SPRINT script:

    library("sprint")
    library(multtest)   # provides the golub data set
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
    pterminate()        # shut down SPRINT before leaving R
    quit(save="no")

SPRINT Pearson Correlation: pcor()

- Parallel implementation of cor().
- Uses the ff package (available from CRAN): memory-efficient storage of large data on disk with fast access functions. It implements fast memory-mapped access to flat files.
- ff objects can be created, stored, used and removed, almost like standard R RAM objects.
- Allows processing of datasets that do not fit into physical memory.
- ff objects are well suited to reading the same data from many R processes.

Input matrix size         Output matrix size   Serial run time       Parallel run time
11,000 x 320 (26.85 MB)   0.9 GB               63.18 secs            4.76 secs
22,000 x 320 (53.7 MB)    3.6 GB               Insufficient memory   13.87 secs
35,000 x 320 (85.44 MB)   9.12 GB              Crashed               36.64 secs
45,000 x 320 (109.86 MB)  15.08 GB             Crashed               42.18 secs

Benchmark on HECToR, the UK National Supercomputing Service, on 256 cores.
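Two things on this slide lend themselves to a short worked example: why the output matrix outgrows memory so quickly, and how a correlation matrix can be split into independent pieces. The sketch below is plain base R and is not pcor() itself; `cor_matrix_gib` and `blockwise_row_cor` are hypothetical helper names, and it assumes that correlations are taken between rows (which is what an 11,000 x 11,000 output from an 11,000 x 320 input implies).

```r
# Memory needed for an n x n matrix of doubles (8 bytes each), in GiB.
cor_matrix_gib <- function(n_rows) n_rows^2 * 8 / 1024^3

# Row-wise correlation computed one block of rows at a time.
# Each block depends only on its own rows plus the full input,
# so different blocks could be handed to different worker processes.
blockwise_row_cor <- function(x, block_size) {
  n <- nrow(x)
  out <- matrix(NA_real_, n, n)
  for (s in seq(1, n, by = block_size)) {
    rows <- s:min(s + block_size - 1, n)
    # cor() correlates columns, so transpose to correlate rows.
    out[rows, ] <- cor(t(x[rows, , drop = FALSE]), t(x))
  }
  out
}

set.seed(1)
x <- matrix(rnorm(50 * 20), nrow = 50)
full <- cor(t(x))                               # ordinary serial result
blocked <- blockwise_row_cor(x, block_size = 7) # same result, block by block

print(round(cor_matrix_gib(c(11000, 22000)), 1))
print(isTRUE(all.equal(full, blocked)))
```

The size helper gives roughly the output sizes reported in the benchmark, which grow with the square of the number of rows while the input grows only linearly.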

SPRINT Permutation Test: pmaxT()

- Parallel implementation of mt.maxT() from the multtest package (available from CRAN).

[Figure: Benchmark on HECToR, the UK National Supercomputing Service, on 512 cores for 150,000 permutations of a 6102 x 76 matrix]

Input matrix size   # Permutations   Serial run time (estimated)   Parallel run time
36,612 x 76         500,000          6 hrs                         73.18 secs
36,612 x 76         1,000,000        12 hrs                        146.64 secs
36,612 x 76         2,000,000        23 hrs                        290.22 secs
73,224 x 76         500,000          10 hrs                        148.46 secs
73,224 x 76         1,000,000        20 hrs                        294.61 secs
73,224 x 76         2,000,000        39 hrs                        591.48 secs

Benchmark on HECToR, the UK National Supercomputing Service, on 256 cores.
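Permutation tests parallelise well because each permutation is scored independently and only counts need to be combined at the end. The sketch below shows that idea in base R. It is a simplified single-step max-T procedure, not the step-down algorithm of mt.maxT() and not the pmaxT() implementation; `row_abs_t` and `count_exceed` are hypothetical helper names, and the "workers" are simulated by processing chunks of permutations one after another.

```r
# Absolute two-sample Welch t-statistic for every row of x; labels are 0/1.
row_abs_t <- function(x, labels) {
  a <- x[, labels == 0, drop = FALSE]
  b <- x[, labels == 1, drop = FALSE]
  se <- sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
  abs((rowMeans(a) - rowMeans(b)) / se)
}

# For one chunk of permuted label vectors (one per row of perms), count how
# often the maximum permuted statistic reaches each observed statistic.
count_exceed <- function(x, perms, observed) {
  counts <- numeric(length(observed))
  for (i in seq_len(nrow(perms))) {
    max_t <- max(row_abs_t(x, perms[i, ]))
    counts <- counts + (max_t >= observed)
  }
  counts
}

set.seed(42)
x <- matrix(rnorm(30 * 12), nrow = 30)   # 30 "genes", 12 samples
labels <- rep(c(0, 1), each = 6)
observed <- row_abs_t(x, labels)

n_perm <- 200
perms <- t(replicate(n_perm, sample(labels)))  # n_perm x 12 matrix

# All permutations in one go.
serial_counts <- count_exceed(x, perms, observed)

# The same permutations split into 4 chunks, as 4 workers might take them.
chunks <- split(seq_len(n_perm), rep(1:4, length.out = n_perm))
chunk_counts <- lapply(chunks, function(idx) {
  count_exceed(x, perms[idx, , drop = FALSE], observed)
})
parallel_counts <- Reduce(`+`, chunk_counts)

adj_p <- parallel_counts / n_perm   # single-step adjusted p-values
```

Because the per-chunk counts simply add up, the chunked run reproduces the serial counts exactly for the same set of permutations.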

SPRINT Clustering: ppam()

- Parallel implementation of pam() from the cluster package (available from CRAN).
- Optimisation of the serial version through memory and data storage management.
- Increased capacity by using external memory (ff objects).

Input data size   # Clusters   Serial run time, pam()   Parallel run time, ppam()
2400              12           11.3 secs                1.1 secs
2400              24           52.5 secs                2.2 secs
4800              12           83.3 secs                4.4 secs
4800              24           434.7 secs               15.9 secs
10000             12           17 mins                  22.3 secs
10000             24           99 mins                  77.1 secs
22374             24           Insufficient memory      270.5 secs

Benchmark on a shared memory cluster with 8 dual-core 2.6GHz AMD Opteron processors with 2GB of RAM per core.
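To see where the work in partitioning around medoids comes from, the sketch below scores every possible swap of a medoid with a non-medoid on a toy data set. It is an illustration in base R only, covering a single swap step, and makes no claim about how ppam() divides the work; `medoid_cost` and `best_swap` are hypothetical names. The relevant observation is that each candidate swap is scored independently from the same distance matrix.

```r
# Total cost of a set of medoids: every point contributes its distance
# to the nearest medoid. d is a full symmetric distance matrix.
medoid_cost <- function(d, medoids) {
  sum(apply(d[, medoids, drop = FALSE], 1, min))
}

# Score every (medoid out, non-medoid in) swap and return the cheapest.
# The candidates do not depend on each other, so they could be scored
# by different workers and the minimum taken at the end.
best_swap <- function(d, medoids) {
  candidates <- expand.grid(out = medoids,
                            into = setdiff(seq_len(nrow(d)), medoids))
  costs <- vapply(seq_len(nrow(candidates)), function(i) {
    trial <- c(setdiff(medoids, candidates$out[i]), candidates$into[i])
    medoid_cost(d, trial)
  }, numeric(1))
  best <- which.min(costs)
  list(medoids = sort(c(setdiff(medoids, candidates$out[best]),
                        candidates$into[best])),
       cost = costs[best])
}

# Two well-separated groups on a line, with a deliberately poor start:
# both initial medoids sit in the first group.
pts <- c(0, 1, 2, 10, 11, 12)
d <- as.matrix(dist(pts))
start <- c(1, 2)
start_cost <- medoid_cost(d, start)
step <- best_swap(d, start)
print(step)
```

The number of candidate swaps grows with both the data size and the number of clusters, which is consistent with the serial run times on the slide rising steeply in both columns.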

What next?

- SPRINT future releases:
  • Other distance metrics, bootstrapping, clustering, apply functionality, ...
- Open source project for and by the R community:
  • Tell us what functionality you want
  • Add your own functions to SPRINT
- Started in biology, but the statistical methods can be applied to any subject.

Division of Pathway Medicine and Edinburgh Parallel Computing Centre at University of Edinburgh.

EPCC Team:
- Terry Sloan
- Michal Piotrowski
- Savvas Petrou
- Bartek Dobrzelecki
- Jon Hill
- Florian Scharinger

DPM Team:
- Prof. Peter Ghazal
- Thorsten Forster
- Muriel Mewissen

http://www.r-sprint.org

This work was supported by the Wellcome Trust grant [086696/Z/08/Z], by an HECToR computational science and engineering support award (dCSE) administered by NAG Ltd and edikt2.
