Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications
Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox
School of Informatics, Pervasive Technology Institute, Indiana University
Introduction
- Fourth Paradigm – data-intensive scientific discovery
– DNA Sequencing machines, LHC
- Loosely coupled problems
– BLAST, Monte Carlo simulations, many image processing applications, parametric studies
- Cloud platforms
– Amazon Web Services, Azure Platform
- MapReduce Frameworks
– Apache Hadoop, Microsoft DryadLINQ
Cloud Computing
- On-demand computational services over the web
– Suits the spiky compute needs of scientists
- Horizontal scaling with no additional cost
– Increased throughput
- Cloud infrastructure services
– Storage, messaging, tabular storage
– Cloud-oriented service guarantees
– Virtually unlimited scalability
Amazon Web Services
- Elastic Compute Service (EC2)
– Infrastructure as a service
- Cloud Storage (S3)
- Queue service (SQS)
Instance Type         | Memory  | EC2 Compute Units | Actual CPU Cores | Cost per Hour
Large                 | 7.5 GB  | 4                 | 2 × ~2.0 GHz     | $0.34
Extra Large           | 15 GB   | 8                 | 4 × ~2.0 GHz     | $0.68
High CPU Extra Large  | 7 GB    | 20                | 8 × ~2.5 GHz     | $0.68
High Memory 4XL       | 68.4 GB | 26                | 8 × ~3.25 GHz    | $2.40
Microsoft Azure Platform
- Windows Azure Compute
– Platform as a service
- Azure Storage Queues
- Azure Blob Storage
Instance Type | CPU Cores | Memory | Local Disk Space | Cost per Hour
Small         | 1         | 1.7 GB | 250 GB           | $0.12
Medium        | 2         | 3.5 GB | 500 GB           | $0.24
Large         | 4         | 7 GB   | 1000 GB          | $0.48
ExtraLarge    | 8         | 15 GB  | 2000 GB          | $0.96
Classic cloud architecture
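The classic cloud architecture is a task queue with independent workers: a client enqueues one message per task, and worker instances poll the queue, process each task, and store the result. A minimal Python sketch, using an in-process queue and a dict in place of SQS/Azure Queues and S3/Blob storage (both stand-ins are assumptions for illustration):

```python
# Sketch of the classic cloud architecture: independent tasks go onto a
# queue; workers poll the queue, process, and store results.
import queue
import threading

task_queue = queue.Queue()   # stands in for SQS / Azure Storage Queue
result_store = {}            # stands in for S3 / Azure Blob storage
store_lock = threading.Lock()

def process(payload):
    # Placeholder for the real executable (e.g. Cap3 on one input file).
    return payload.upper()

def worker():
    while True:
        try:
            task_id, payload = task_queue.get(timeout=1)
        except queue.Empty:
            return  # no more work: this worker instance shuts down
        output = process(payload)
        with store_lock:
            result_store[task_id] = output
        task_queue.task_done()  # "delete the message" after success

# Client side: enqueue independent tasks, then start worker "instances".
for i, data in enumerate(["acgt", "ttga", "ccag"]):
    task_queue.put((i, data))

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result_store)  # all three results stored, keyed by task id
```

Deleting the queue message only after processing succeeds is what gives this pattern its fault tolerance: a failed worker's message reappears and is re-executed elsewhere.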
MapReduce
- General-purpose massive data analysis in brittle environments
– Commodity clusters
– Clouds
- Fault Tolerance
- Ease of use
- Apache Hadoop
– HDFS
- Microsoft DryadLINQ
MapReduce Architecture
[Figure: MapReduce architecture – the input data set is split into data files, map tasks run the executable (exe) on each file reading from HDFS, and an optional reduce phase collects results back into HDFS]
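The map/shuffle/optional-reduce flow can be sketched as a minimal word count, the canonical MapReduce example (illustrative only, not tied to the Hadoop or DryadLINQ APIs):

```python
# Minimal MapReduce sketch: map emits key/value pairs from independent
# input splits, a shuffle groups by key, reduce aggregates each group.
from collections import defaultdict

def map_phase(split):
    # Map task: one (word, 1) pair per word in its input split.
    return [(word, 1) for word in split.split()]

def reduce_phase(key, values):
    # Reduce task: aggregate all values seen for one key.
    return key, sum(values)

splits = ["the cat sat", "the cat ran"]          # input data set, pre-split
intermediate = defaultdict(list)
for split in splits:                             # map tasks (parallel in Hadoop)
    for key, value in map_phase(split):
        intermediate[key].append(value)          # shuffle: group by key
counts = dict(reduce_phase(k, v) for k, v in intermediate.items())
print(counts)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```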
Comparison of AWS/Azure, Hadoop, and DryadLINQ:
- Programming patterns – AWS/Azure: independent job execution; Hadoop: MapReduce; DryadLINQ: DAG execution, MapReduce + other patterns
- Fault tolerance – AWS/Azure: task re-execution based on a time-out; Hadoop: re-execution of failed and slow tasks; DryadLINQ: re-execution of failed and slow tasks
- Data storage – AWS/Azure: S3/Azure Storage; Hadoop: HDFS parallel file system; DryadLINQ: local files
- Environments – AWS/Azure: EC2/Azure, local compute resources; Hadoop: Linux cluster, Amazon Elastic MapReduce; DryadLINQ: Windows HPCS cluster
- Ease of programming – EC2: **, Azure: ***; Hadoop: ****; DryadLINQ: ****
- Ease of use – EC2: ***, Azure: **; Hadoop: ***; DryadLINQ: ****
- Scheduling & load balancing – AWS/Azure: dynamic scheduling through a global queue, good natural load balancing; Hadoop: data-locality and rack-aware dynamic task scheduling through a global queue, good natural load balancing; DryadLINQ: data-locality and network-topology-aware scheduling, static task partitions at the node level, suboptimal load balancing
Performance
- Parallel Efficiency
- Time per core per computation
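Parallel efficiency is conventionally E(p) = T(1) / (p · T(p)); when no single-core run is practical, it is often measured relative to a baseline core count p0 instead (an assumption about how these plots were produced):

```python
# Parallel efficiency relative to a baseline configuration:
# E = (p_base * t_base) / (p * t_p); E = 1.0 means perfect scaling.
def parallel_efficiency(t_base, p_base, t_p, p):
    """Efficiency of a run on p cores relative to a run on p_base cores."""
    return (p_base * t_base) / (p * t_p)

# Hypothetical timings: 1000 s on 16 cores, 280 s on 64 cores.
eff = parallel_efficiency(t_base=1000, p_base=16, t_p=280, p=64)
print(round(eff, 3))  # 0.893
```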
Cap3 – Sequence Assembly
- Assembles DNA sequences by aligning and
merging sequence fragments to construct whole genome sequences
- Increased availability of DNA Sequencers.
- Size of a single input file in the range of
hundreds of KBs to several MBs.
- Outputs can be collected independently, no
need of a complex reduce step.
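Because each input file is processed independently and no reduce step is needed, the Cap3 job is a plain map over files. A sketch using a thread pool, where `assemble` and the file names are hypothetical stand-ins for invoking the actual cap3 executable:

```python
# Pleasingly parallel pattern: one independent task per input file,
# outputs collected without any reduce step.
from concurrent.futures import ThreadPoolExecutor

def assemble(fasta_file):
    # Placeholder for: subprocess.run(["cap3", fasta_file], check=True)
    return fasta_file.replace(".fsa", ".contigs")

input_files = [f"reads_{i:04d}.fsa" for i in range(8)]  # hypothetical names
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(assemble, input_files))     # independent tasks
print(outputs[0])  # reads_0000.contigs
```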
Sequence Assembly Performance with different EC2 Instance Types
[Figure: amortized compute cost, compute cost (per-hour units), and compute time for Cap3 on different EC2 instance types]
Sequence Assembly in the Clouds
Cap3 parallel efficiency
Cap3 – per-core per-file time to process sequences (458 reads in each file)
Cost to process 4096 FASTA files*
- Amazon AWS total: $11.19
– Compute: 1 hour × 16 HCXL ($0.68 × 16) = $10.88
– 10,000 SQS messages = $0.01
– Storage, per GB per month = $0.15
– Data transfer out, per GB = $0.15
- Azure total: $15.77
– Compute: 1 hour × 128 small ($0.12 × 128) = $15.36
– 10,000 queue messages = $0.01
– Storage, per GB per month = $0.15
– Data transfer in/out, per GB = $0.10 + $0.15
- Tempest (amortized): $9.43
– 24 cores × 32 nodes, 48 GB per node
– Assumptions: 70% utilization, write-off over 3 years, including support
* ~ 1 GB / 1875968 reads (458 reads X 4096)
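The totals above are straight sums of compute, queue, storage, and transfer charges, which is easy to verify:

```python
# Reproducing the Cap3 cost totals as arithmetic:
# compute + queue messages + storage + data transfer.
aws_total = 16 * 0.68 + 0.01 + 0.15 + 0.15            # 16 HCXL instances, 1 hour
azure_total = 128 * 0.12 + 0.01 + 0.15 + 0.10 + 0.15  # 128 small instances, 1 hour
print(round(aws_total, 2), round(azure_total, 2))  # 11.19 15.77
```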
GTM & MDS Interpolation
- Finds an optimal user-defined low-dimensional
representation out of the data in high-dimensional space
– Used for visualization
- Multidimensional Scaling (MDS)
– With respect to pairwise proximity information
- Generative Topographic Mapping (GTM)
– Gaussian probability density model in vector space
- Interpolation
– Out-of-sample extensions designed to process much larger numbers of data points with a minor approximation trade-off
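One simple way to picture such an out-of-sample extension: embed a small in-sample subset with the full (expensive) algorithm, then place each remaining point cheaply from the embeddings of its nearest in-sample neighbors. The actual MDS and GTM interpolation algorithms are considerably more elaborate; this nearest-neighbor averaging is only an illustrative sketch:

```python
# Toy out-of-sample interpolation: average the low-dimensional coordinates
# of the k nearest already-embedded in-sample points.
def interpolate(new_point, sample_points, sample_embeddings, k=2):
    """Place new_point using the k nearest in-sample embeddings."""
    dist2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(range(len(sample_points)),
                     key=lambda i: dist2(new_point, sample_points[i]))[:k]
    dims = len(sample_embeddings[0])
    return tuple(sum(sample_embeddings[i][d] for i in nearest) / k
                 for d in range(dims))

# Toy data: three 3-D in-sample points with known 2-D embeddings.
high = [(0, 0, 0), (1, 0, 0), (10, 10, 10)]
low = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
print(interpolate((0.4, 0, 0), high, low))  # (0.5, 0.0)
```

Each new point depends only on the fixed in-sample subset, so interpolation itself is pleasingly parallel, which is why it maps well onto the cloud frameworks above.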
GTM Interpolation performance with different EC2 Instance Types
[Figure: amortized compute cost, compute cost (per-hour units), and compute time for GTM interpolation on different EC2 instance types]
- EC2 HM4XL: best performance
- EC2 HCXL: most economical
- EC2 Large: most efficient
Dimension Reduction in the Clouds - GTM interpolation
GTM Interpolation parallel efficiency
GTM Interpolation – time per core to process 100k data points
- 26.4 million PubChem data points
- DryadLINQ using 16-core machines with 16 GB memory; Hadoop using 8-core machines with 48 GB; Azure small instances with 1 core and 1.7 GB
Dimension Reduction in the Clouds - MDS Interpolation
- DryadLINQ on a 32-node × 24-core cluster with 48 GB per node; Azure using small instances
Next Steps
- AzureMapReduce
- AzureTwister
AzureMapReduce SWG
[Figure: alignment time (ms) vs. number of Azure small instances (32–160)]
SWG Pairwise Distance 10k Sequences
Time Per Alignment Per Instance
Conclusions
- Clouds offer attractive computing paradigms for
loosely coupled scientific computation applications.
- Infrastructure-based models as well as MapReduce-based frameworks offered good parallel efficiencies, given sufficiently coarse-grained task decompositions
- The higher level MapReduce paradigm offered a
simpler programming model
- Selecting an instance type which suits your application
can give significant time and monetary advantages.
Acknowledgements
- SALSA Group (http://salsahpc.indiana.edu/)
– Jong Choi
– Seung-Hee Bae
– Jaliya Ekanayake & others
- Chemical informatics partners
– David Wild
– Bin Chen
- Amazon Web Services for AWS compute credits
- Microsoft Research for technical support on
Azure & DryadLINQ
Thank You!!
- Questions?