A cloud framework for high throughput biological data processing

Martin Koehler¹, Yuriy Kaniovskyi¹, Siegfried Benkner¹, Volker Egelhofer², Wolfram Weckwerth²
¹ Faculty of Computer Science, ² Faculty of Life Sciences
University of Vienna, Austria
ISGC 2011, Biomedicine & Life Science Applications, 25.03.2011

Contents
• Description of Application
  – Mapping tryptic peptide fragmentation mass spectra data by utilizing the ProMEX database
• Description of Infrastructure
  – The Vienna Cloud Environment (VCE)
• Cloud-enabling molecular systems biology application
• Evaluation of ProMEX Application
• Ongoing work
Martin Köhler - 2

Problem statement of molecular systems biologists
• Molecular systems biology community has built
  – ProMEX database storing public mass spectral references
  – 23810 tryptic peptide product ion spectra entries
    • Chlamydomonas reinhardtii, Arabidopsis thaliana, …
  – http://promex.pph.univie.ac.at/promex/
• Application
  – Matching tryptic peptide fragmentation mass spectra data against mass spectral reference database (ProMEX)
• ProMEX algorithm
  – Matches input file to each entry in ProMEX database
  – Computes similarity of input file to database entries

ProMEX database – Web interface

Recast application to Cloud programming model and framework
• ProMEX database
  – Increasing amount of data
• User requests
  – Execution of parameter studies
  – Execution via Client API
• MoSys application
  – Algorithm compares input file with each database entry
  – Many independent tasks -> no need for communication
  – Use of massively parallel programming model
• Provisioning of MoSys application
  – On demand compute and storage resources
  – MapReduce Programming Model
  – Vienna Cloud Environment (VCE)

Recast application to MapReduce model
• Migrate ProMEX database to HDFS
• Each map task compares the input file with its local part of the database
• Similarity (percentage) of all entries reduced and sorted
• Parameter studies supported via DistributedCache
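The map/reduce split described on this slide can be sketched in a few lines of plain Python. This is only an illustration, not the ProMEX or Hadoop implementation: the slides do not state the similarity measure, so cosine similarity over binned peak intensities is used as a stand-in, and the `Spectrum` type and all function names are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Tuple

# Hypothetical representation: binned m/z value -> peak intensity.
Spectrum = Dict[int, float]
Scored = Tuple[str, float]  # (database entry id, similarity in percent)


def similarity_percent(a: Spectrum, b: Spectrum) -> float:
    """Cosine similarity scaled to 0..100 (assumed stand-in measure)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 100.0 * dot / norm if norm else 0.0


def map_task(query: Spectrum, local_entries: Iterable[Tuple[str, Spectrum]]) -> List[Scored]:
    """Compare the input spectrum with the entries stored in one database partition."""
    return [(entry_id, similarity_percent(query, s)) for entry_id, s in local_entries]


def reduce_task(partials: Iterable[List[Scored]]) -> List[Scored]:
    """Merge the per-partition scores and sort them, best match first."""
    merged = [pair for part in partials for pair in part]
    return sorted(merged, key=lambda p: p[1], reverse=True)


def run_job(query: Spectrum, partitions: List[List[Tuple[str, Spectrum]]]) -> List[Scored]:
    """One job: independent map tasks per partition, then a single reduce."""
    return reduce_task(map_task(query, part) for part in partitions)


def run_parameter_study(queries: Dict[str, Spectrum], partitions):
    """A parameter study repeats the job for every input file."""
    return {name: run_job(q, partitions) for name, q in queries.items()}
```

Because each `map_task` call touches only its own partition, the calls need no communication with each other, which is the property that makes the application a fit for MapReduce. In the Hadoop version the input files would be shipped to the map tasks via DistributedCache rather than passed as arguments.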

Vienna Cloud Environment (VCE)
• Cloud framework for on-demand supercomputing
  – Software as a Service (SaaS)
  – Web Service and Virtualization Technologies
• Partly developed within EU Projects:
  – FP5 GEMSS - Grid-enabling Medical Simulation Services
  – FP6 @neurIST - Integrated Biomedical Informatics for the Management of Cerebral Aneurysms
  – FP7 VPHShare - Virtual Physiological Human: Sharing for Healthcare
• Provisioning of virtual appliances
  – Scientific applications
  – Data access and integration
  – Scientific workflows
• VCE enables resources as
  – Web Services
  – hosted on virtual appliances

VCE Virtual Appliances
• Data Appliances
  – Data access and integration
  – Based on OGSA-DAI/DQP
  – Integrated data mediation engine
    • Global data scheme
    • Mapping rules
• Workflow Appliances
  – Integrated workflow enactment engine (WEEP)
  – Composed execution of BPEL workflows
• Application Appliances
  – Providing applications as Services
  – HPC applications (OpenMP, MPI)
  – MapReduce application
• Adaptive Execution framework

VCE MoSys Architecture
• VCE Web Service interface
• Resources
  – Local cluster resources
  – Private Cloud
• Job execution
  – DRMAA
  – Sun Grid Engine (SGE)
  – Dynamic creation and configuration of Hadoop execution environment

System Configuration
• Recasting and optimizing an application for MapReduce results in configuration challenges at three layers:
• Application layer
  – Configuring parameter studies
• Execution framework (Apache Hadoop)
  – Number of Reduce Tasks
  – Memory Allocation
  – Parallel tasks, …
• Resources
  – Compute Resources: number of nodes
  – Storage Resources (HDFS): block size, replication factor

Evaluation of application
• Utilization of CORA cluster resources
  – 72 CPU-cores (8 x 2 x Intel Xeon quad-core X5550)
  – 9088 GPU-cores (8 x 2 x Tesla C2050)
  – Main memory of 216 GB
  – Disk space of 64 TB
• Private Cloud environment
  – 4 nodes with two 12-core AMD Opteron™
  – Main memory of 192 GB
  – KVM Virtualization

Configuration of experiments
• First experiments on
  – Resource scaling
  – Parameter studies
• Database size
  – 100 GB and 500 GB
  – Block size: 128 MB, replication factor: 1
• Hadoop Configuration
  – 8 parallel map and reduce tasks per node
  – New JVM per task
  – 1 GB memory per JVM
• Parameter studies
  – Number of input files: 1 to 1000
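The settings on this slide can be written down as Hadoop properties, and the block size gives a rough idea of how many map tasks each database size produces. The sketch below rests on several assumptions that the slides do not confirm: the property names are those of Hadoop 0.20/1.x (the Hadoop version is not stated), "8 parallel map and reduce tasks per node" is read as 8 map slots plus 8 reduce slots, 1 GB is taken as 1024 MB, and one map task per HDFS block is assumed. The function names are hypothetical.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def hadoop_properties(block_size_mb=128, replication=1, tasks_per_node=8, jvm_heap_mb=1024):
    """Slide settings expressed as Hadoop 0.20/1.x property names (assumed version)."""
    return {
        "dfs.block.size": str(block_size_mb * MB),             # HDFS block size in bytes
        "dfs.replication": str(replication),                   # copies kept of each block
        "mapred.tasktracker.map.tasks.maximum": str(tasks_per_node),
        "mapred.tasktracker.reduce.tasks.maximum": str(tasks_per_node),
        "mapred.job.reuse.jvm.num.tasks": "1",                 # 1 = new JVM for every task
        "mapred.child.java.opts": f"-Xmx{jvm_heap_mb}m",       # heap per task JVM
    }


def block_count(database_gb, block_size_mb=128):
    """Number of HDFS blocks, i.e. map tasks if each block is one input split."""
    return math.ceil(database_gb * GB / (block_size_mb * MB))


def map_waves(database_gb, nodes, tasks_per_node=8, block_size_mb=128):
    """Rounds of map tasks needed when all map slots run in parallel."""
    return math.ceil(block_count(database_gb, block_size_mb) / (nodes * tasks_per_node))
```

Under these assumptions the 100 GB database splits into 800 blocks and the 500 GB database into 4000, so doubling the node count roughly halves the number of map waves as long as there are more blocks than map slots.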

Performance Results: Node Scaling

Performance Results: Parameter Study

Ongoing Work
• Performance tests regarding virtual execution nodes
  – Integration of VMs into the cluster as Hadoop execution nodes, hosted in private/public Clouds
• Adaptive Framework for automatic configuration of application, environment, and job
  – Based on MAPE-K loop (autonomic computing)
  – Hadoop environment configuration
    • Number of map/reduce tasks, memory allocation
    • Parallel tasks per node, Hadoop scheduler, …
  – HDFS configuration
    • Block size, replication factor
  – Resources
    • Data-local computation
    • Number of resources (CPUs)
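To make the MAPE-K idea concrete, the following is a minimal sketch of a Monitor-Analyze-Plan-Execute loop over a shared Knowledge store that tunes a single setting. It is not the VCE adaptive framework, which the slides only announce: the `Knowledge` class, the `tasks_per_node` key, the simple hill-climbing plan, and the `run_job` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


@dataclass
class Knowledge:
    """The K in MAPE-K: the current configuration plus everything observed so far."""
    config: Config
    history: List[Tuple[Config, float]] = field(default_factory=list)


def monitor(run_job: Callable[[Config], float], k: Knowledge) -> None:
    """Run the job under the current configuration and record its runtime."""
    k.history.append((dict(k.config), run_job(k.config)))


def analyze(k: Knowledge) -> bool:
    """Did the latest run beat the one before it? (The first run counts as progress.)"""
    return len(k.history) < 2 or k.history[-1][1] < k.history[-2][1]


def best_config(k: Knowledge) -> Config:
    """Fastest configuration observed so far."""
    return dict(min(k.history, key=lambda h: h[1])[0])


def plan(k: Knowledge, improved: bool) -> Config:
    """Hill climbing: keep raising the setting while it helps, else fall back."""
    if improved:
        return {**k.config, "tasks_per_node": k.config["tasks_per_node"] + 1}
    return best_config(k)


def execute(k: Knowledge, new_config: Config) -> None:
    """Apply the planned configuration."""
    k.config = new_config


def mape_k(run_job: Callable[[Config], float], k: Knowledge, max_iterations: int = 10) -> Config:
    for _ in range(max_iterations):
        monitor(run_job, k)
        improved = analyze(k)
        execute(k, plan(k, improved))
        if not improved:
            break
    k.config = best_config(k)  # never end on an untested configuration
    return k.config
```

A real framework would tune several of the listed knobs at once (block size, replication factor, memory allocation) and would need a less naive plan step; the sketch only shows how the four phases share state through the knowledge store.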

Questions?
