SLIDE 1

A cloud framework for high throughput biological data processing

Martin Koehler¹, Yuriy Kaniovskyi¹, Siegfried Benkner¹, Volker Egelhofer², Wolfram Weckwerth²
¹Faculty of Computer Science, ²Faculty of Life Sciences, University of Vienna, Austria
ISGC 2011, Biomedicine & Life Science Applications, 25.03.2011

SLIDE 2

Contents

  • Description of Application

Matching tryptic peptide fragmentation mass spectra against the ProMEX database

  • Description of Infrastructure

The Vienna Cloud Environment (VCE)

  • Cloud-enabling molecular systems biology application
  • Evaluation of ProMEX Application
  • Ongoing work

Martin Köhler - 2

SLIDE 3

Problem statement of molecular systems biologists

  • Molecular systems biology community has built

– ProMEX database storing public mass spectral references
– 23,810 tryptic peptide product ion spectra entries

  • Chlamydomonas reinhardtii, Arabidopsis thaliana, …

– http://promex.pph.univie.ac.at/promex/

  • Application

– Matching tryptic peptide fragmentation mass spectra data against mass spectral reference database (ProMEX)

  • ProMEX algorithm

– Matches input file to each entry in ProMEX database
– Computes similarity of input file to database entries
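The matching step amounts to scoring one query spectrum against every reference entry. As a hypothetical illustration only (the slides do not show ProMEX's actual scoring function), binned-peak cosine similarity is a common way to compare fragmentation spectra:

```python
from math import sqrt

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into m/z bins of the given width."""
    binned = {}
    for mz, intensity in peaks:
        b = int(mz // bin_width)
        binned[b] = binned.get(b, 0.0) + intensity
    return binned

def cosine_similarity(query, reference):
    """Cosine similarity between two binned spectra, in [0, 1]."""
    dot = sum(i * reference.get(b, 0.0) for b, i in query.items())
    norm_q = sqrt(sum(i * i for i in query.values()))
    norm_r = sqrt(sum(i * i for i in reference.values()))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / (norm_q * norm_r)

# An identical spectrum scores 1.0 against itself.
q = bin_spectrum([(175.1, 40.0), (288.2, 100.0)])
print(round(cosine_similarity(q, q), 3))  # 1.0
```

The per-entry scores are independent of one another, which is what makes the workload embarrassingly parallel.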

SLIDE 4

ProMEX database – Web interface

SLIDE 5

Recast application to Cloud programming model and framework

  • ProMEX database

– Increasing amount of data

  • User requests

– Execution of parameter studies
– Execution via Client API

  • MoSys application

– Algorithm compares input file with each database entry
– Many independent tasks -> no need for communication
– Use of massively parallel programming model

  • Provisioning of MoSys application

– On-demand compute and storage resources
– MapReduce programming model
– Vienna Cloud Environment (VCE)

SLIDE 6

Recast application to MapReduce model

  • Migrate ProMEX database to HDFS
  • Each map task compares the input file with its local part of the database

  • Similarity (percentage) of all entries reduced and sorted
  • Parameter studies supported via DistributedCache
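The recast above can be sketched as a pair of map and reduce functions. This is a simplified, Hadoop-free sketch: the entry format, the query shipped to each node via DistributedCache, and the placeholder similarity score are assumptions for illustration, not the ProMEX implementation.

```python
def similarity(query, entry):
    """Placeholder score: percentage of shared peak bins.
    (ProMEX's real scoring function is not reproduced here.)"""
    shared = len(query & entry)
    return 100.0 * shared / max(len(query), 1)

def mapper(db_lines, query):
    """One map task: compare the query against its local DB partition.
    Assumed line format: '<entry_id>\t<comma-separated m/z bins>'."""
    for line in db_lines:
        entry_id, bins = line.rstrip("\n").split("\t")
        entry = set(bins.split(","))
        yield similarity(query, entry), entry_id

def reducer(scored):
    """Collect all (score, entry_id) pairs and sort best matches first."""
    return sorted(scored, reverse=True)

# On Hadoop Streaming the mapper/reducer would read sys.stdin and emit
# tab-separated pairs; the query file reaches every node via DistributedCache.
query = {"175", "288", "402"}
db = ["pepA\t175,288,402", "pepB\t175,530"]
for score, pep in reducer(mapper(db, query)):
    print(pep, score)  # first line: pepA 100.0
```

Because map tasks never communicate, adding nodes only changes how many DB partitions are scored in parallel, not the result.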

SLIDE 7

Vienna Cloud Environment (VCE)

  • Cloud framework for on-demand supercomputing

– Software as a Service (SaaS)
– Web Service and virtualization technologies

  • Partly developed within EU Projects:

– FP 5 GEMSS - Grid-enabling Medical Simulation Services
– FP 6 @neurIST - Integrated Biomedical Informatics for the Management of Cerebral Aneurysms
– FP 7 VPHShare - Virtual Physiological Human: Sharing for Healthcare

  • Provisioning of virtual appliances

– Scientific applications
– Data access and integration
– Scientific workflows

  • VCE enables resources as

– Web Services hosted on virtual appliances

SLIDE 8

VCE Virtual Appliances

  • Data Appliances

– Data access and integration
– Based on OGSA-DAI/DQP
– Integrated data mediation engine

  • Global data schema
  • Mapping rules
  • Workflow Appliances

– Integrated workflow enactment engine (WEEP)
– Execution of composed BPEL workflows

  • Application Appliances

– Providing applications as services
– HPC applications (OpenMP, MPI)
– MapReduce applications

  • Adaptive Execution framework

SLIDE 9

VCE MoSys Architecture

  • VCE Web Service interface
  • Resources

– Local cluster resources
– Private Cloud

  • Job execution

– DRMAA
– Sun Grid Engine (SGE)
– Dynamic creation and configuration of Hadoop execution environment
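Dynamically creating a Hadoop execution environment per job involves emitting Hadoop's XML configuration files. A minimal sketch of that step (standard Hadoop 0.20-era property names; the host name and values are invented for illustration):

```python
import xml.etree.ElementTree as ET

def hadoop_conf(props):
    """Render a Hadoop *-site.xml configuration document from a
    property dict, as a dynamic per-job setup might do."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_text = hadoop_conf({
    "mapred.job.tracker": "node0:9001",         # assumed master host:port
    "mapred.tasktracker.map.tasks.maximum": 8,  # parallel map slots per node
})
print(xml_text)
```

In the VCE setting such files would be written into a per-job Hadoop configuration directory before the job is submitted through DRMAA/SGE.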

SLIDE 10

System Configuration

  • Recasting and optimizing an application for MapReduce results in configuration challenges at three layers:

  • Application layer

– Configuring parameter studies

  • Execution framework (Apache Hadoop)

– Number of reduce tasks
– Memory allocation
– Parallel tasks, …

  • Resources

– Compute resources: number of nodes
– Storage resources (HDFS): block size, replication factor

SLIDE 11

Evaluation of application

  • Utilization of CORA cluster resources

– 72 CPU cores (8 x 2 x quad-core Intel Xeon X5550)
– 9088 GPU cores (8 x 2 x Tesla C2050)
– Main memory of 216 GB
– Disk space of 64 TB

  • Private Cloud environment

– 4 nodes, each with two 12-core AMD Opteron™ CPUs
– Main memory of 192 GB
– KVM virtualization

SLIDE 12

Configuration of experiments

  • First experiments on

– Resource scaling
– Parameter studies

  • Database size

– 100 GB and 500 GB
– Block size 128 MB, replication factor 1

  • Hadoop Configuration

– 8 parallel map and reduce tasks per node
– New JVM per task
– 1 GB memory per JVM

  • Parameter studies

– Number of input files: 1 to 1000
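The Hadoop and HDFS settings above map onto standard Hadoop 0.20-era configuration properties, roughly as follows (a sketch; the values are taken from this slide, the property names from Hadoop's defaults):

```xml
<configuration>
  <!-- 8 parallel map and reduce tasks per node -->
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>8</value></property>
  <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>8</value></property>
  <!-- a fresh JVM for every task (no JVM reuse) -->
  <property><name>mapred.job.reuse.jvm.num.tasks</name><value>1</value></property>
  <!-- 1 GB heap per task JVM -->
  <property><name>mapred.child.java.opts</name><value>-Xmx1024m</value></property>
  <!-- HDFS: 128 MB blocks, replication factor 1 -->
  <property><name>dfs.block.size</name><value>134217728</value></property>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>
```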

SLIDE 13

Performance Results: Node Scaling

SLIDE 14

Performance Results: Parameter Study

SLIDE 15

Ongoing Work

  • Performance tests regarding virtual execution nodes

– Integration of VMs, hosted in private/public clouds, into the cluster as Hadoop execution nodes

  • Adaptive framework for automatic configuration of application, environment, and job

– Based on MAPE-K loop (autonomic computing)
– Hadoop environment configuration

  • Number of map/reduce tasks, memory allocation
  • Parallel tasks per node, Hadoop scheduler, …

– HDFS configuration

  • Block size, replication factor

– Resources

  • Data-local computation
  • Number of resources (CPUs)
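The MAPE-K idea behind the adaptive framework can be sketched as a generic control loop. This is a toy sketch of autonomic reconfiguration, not the VCE implementation; the metric, symptom, and doubling policy are invented for illustration.

```python
def mape_k_step(monitor, analyze, plan, execute, knowledge):
    """One iteration of a MAPE-K autonomic loop: Monitor -> Analyze ->
    Plan -> Execute, all sharing a Knowledge base."""
    metrics = monitor()                      # M: observe the running job
    symptoms = analyze(metrics, knowledge)   # A: detect deviations
    actions = plan(symptoms, knowledge)      # P: choose new configuration
    execute(actions)                         # E: apply it
    knowledge.setdefault("history", []).append((metrics, actions))
    return actions

# Toy policy: double the reduce-task count when map tasks finish too fast.
knowledge = {"reduce_tasks": 4}
mape_k_step(
    monitor=lambda: {"avg_map_seconds": 3.0},
    analyze=lambda m, k: ["maps_underutilized"] if m["avg_map_seconds"] < 5 else [],
    plan=lambda s, k: {"reduce_tasks": k["reduce_tasks"] * 2} if s else {},
    execute=lambda a: knowledge.update(a),
    knowledge=knowledge,
)
print(knowledge["reduce_tasks"])  # 8
```

In the Hadoop setting, the "execute" step would rewrite the job's configuration (task counts, heap sizes, HDFS parameters) before the next run.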

SLIDE 16

Questions?
