A cloud framework for high throughput biological data processing

1. A cloud framework for high throughput biological data processing
   Martin Köhler¹, Yuriy Kaniovskyi¹, Siegfried Benkner¹, Volker Egelhofer², Wolfram Weckwerth²
   ¹ Faculty of Computer Science, ² Faculty of Life Sciences, University of Vienna, Austria
   ISGC 2011, Biomedicine & Life Science Applications, 25.03.2011

2. Contents
   • Description of the application
     – Mapping tryptic peptide fragmentation mass spectra data by utilizing the ProMEX database
   • Description of the infrastructure
     – The Vienna Cloud Environment (VCE)
   • Cloud-enabling a molecular systems biology application
   • Evaluation of the ProMEX application
   • Ongoing work

3. Problem statement of molecular systems biologists
   • The molecular systems biology community has built
     – The ProMEX database, storing public mass spectral references
     – 23810 tryptic peptide product ion spectra entries
       • Chlamydomonas reinhardtii, Arabidopsis thaliana, …
     – http://promex.pph.univie.ac.at/promex/
   • Application
     – Matching tryptic peptide fragmentation mass spectra data against the mass spectral reference database (ProMEX)
   • ProMEX algorithm
     – Matches the input file against each entry in the ProMEX database
     – Computes the similarity of the input file to the database entries (sketched below)
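The slides do not spell out the scoring function. A common choice for matching a query spectrum against library entries is a normalized dot product (cosine) over m/z-binned peak intensities; the sketch below assumes that approach, and the class and method names are illustrative, not taken from the authors' code.

```java
// Illustrative only: ProMEX's actual scoring formula is not given in the
// slides. This sketch assumes cosine similarity over binned peak intensities.
public final class SpectralSimilarity {

  /** Bin a peak list of (m/z, intensity) pairs into a fixed-width intensity vector. */
  static double[] bin(double[][] peaks, double binWidth, double maxMz) {
    double[] bins = new double[(int) (maxMz / binWidth) + 1];
    for (double[] peak : peaks) {
      int idx = (int) (peak[0] / binWidth);   // peak[0] = m/z, peak[1] = intensity
      if (idx < bins.length) bins[idx] += peak[1];
    }
    return bins;
  }

  /** Cosine similarity of two binned spectra, scaled to a percentage. */
  static double similarityPercent(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) return 0;
    return 100.0 * dot / Math.sqrt(normA * normB);
  }
}
```

Scaled to a percentage, such a score matches the "similarity (percentage)" output the MapReduce recast on slide 6 refers to.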

4. ProMEX database – Web interface

5. Recasting the application to a Cloud programming model and framework
   • ProMEX database
     – Increasing amount of data
   • User requests
     – Execution of parameter studies
     – Execution via a client API
   • MoSys application
     – The algorithm compares the input file with each database entry
     – Many independent tasks -> no need for communication
     – Use of a massively parallel programming model
   • Provisioning of the MoSys application
     – On-demand compute and storage resources
     – MapReduce programming model
     – Vienna Cloud Environment (VCE)

6. Recasting the application to the MapReduce model
   • Migrate the ProMEX database to HDFS
   • Each map task compares the input file with its local part of the database
   • Similarities (percentages) of all entries are reduced and sorted
   • Parameter studies supported via the DistributedCache (see the sketch below)
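A minimal sketch of this recast, using the Hadoop 0.20-era MapReduce API that was current for this work. The database line layout, the SpectrumIO helper, and all class names are assumptions for illustration; SpectralSimilarity refers to the scoring sketch after slide 3.

```java
// Sketch of the MapReduce recast described above. Line format, SpectrumIO,
// and class names are illustrative assumptions, not the authors' code.
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PromexMatch {

  // Each map task scans its local split of the ProMEX database in HDFS and
  // scores every entry against the query spectrum, which is shipped to all
  // nodes via the DistributedCache.
  public static class MatchMapper
      extends Mapper<LongWritable, Text, DoubleWritable, Text> {
    private double[] query;   // binned query spectrum

    @Override
    protected void setup(Context ctx) throws IOException {
      Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
      query = SpectrumIO.loadBinned(cached[0]);   // hypothetical helper
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Assumed line layout: "<entryId>\t<binned intensities>"
      String[] parts = line.toString().split("\t", 2);
      double[] reference = SpectrumIO.parseBinned(parts[1]);   // hypothetical
      double percent = SpectralSimilarity.similarityPercent(query, reference);
      // Emitting the score as the key lets Hadoop's shuffle sort the results.
      ctx.write(new DoubleWritable(percent), new Text(parts[0]));
    }
  }

  // Scores arrive at the reducer already sorted; it only writes "entry -> percent".
  public static class MatchReducer
      extends Reducer<DoubleWritable, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(DoubleWritable percent, Iterable<Text> ids, Context ctx)
        throws IOException, InterruptedException {
      for (Text id : ids) ctx.write(id, percent);
    }
  }
}
```

Hadoop sorts map output keys in ascending order by default, so a descending DoubleWritable comparator (set via Job.setSortComparatorClass) would place the best matches first.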

7. Vienna Cloud Environment (VCE)
   • Cloud framework for on-demand supercomputing
     – Software as a Service (SaaS)
     – Web service and virtualization technologies
   • Partly developed within EU projects:
     – FP5 GEMSS – Grid-enabling Medical Simulation Services
     – FP6 @neurIST – Integrated Biomedical Informatics for the Management of Cerebral Aneurysms
     – FP7 VPHShare – Virtual Physiological Human: Sharing for Healthcare
   • Provisioning of virtual appliances
     – Scientific applications
     – Data access and integration
     – Scientific workflows
   • The VCE enables resources as web services hosted on virtual appliances

8. VCE Virtual Appliances
   • Data appliances
     – Data access and integration
     – Based on OGSA-DAI/DQP
     – Integrated data mediation engine
       • Global data schema
       • Mapping rules
   • Workflow appliances
     – Integrated workflow enactment engine (WEEP)
     – Composed execution of BPEL workflows
   • Application appliances
     – Provide applications as services
     – HPC applications (OpenMP, MPI)
     – MapReduce applications
   • Adaptive execution framework

9. VCE MoSys Architecture
   • VCE Web Service interface
   • Resources
     – Local cluster resources
     – Private Cloud
   • Job execution
     – DRMAA
     – Sun Grid Engine (SGE)
     – Dynamic creation and configuration of the Hadoop execution environment (see the sketch below)
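Job submission through DRMAA against SGE could look like the following sketch, using the standard org.ggf.drmaa Java binding. The slide only names the components, so the wrapper script and its arguments are placeholders for whatever the VCE actually launches.

```java
// Hedged sketch of DRMAA-based submission to Sun Grid Engine; the script
// name and arguments are placeholders, not the authors' actual job.
import org.ggf.drmaa.DrmaaException;
import org.ggf.drmaa.JobInfo;
import org.ggf.drmaa.JobTemplate;
import org.ggf.drmaa.Session;
import org.ggf.drmaa.SessionFactory;

public class SubmitHadoopJob {
  public static void main(String[] args) throws DrmaaException {
    Session session = SessionFactory.getFactory().getSession();
    session.init("");                        // connect to the default SGE cell
    try {
      JobTemplate jt = session.createJobTemplate();
      // Placeholder: a script that dynamically sets up a per-job Hadoop
      // environment on the allocated nodes and runs the MapReduce application.
      jt.setRemoteCommand("start-hadoop-and-run.sh");
      jt.setArgs(java.util.Arrays.asList("promex-match.jar", "input/spectra"));
      String jobId = session.runJob(jt);
      JobInfo info = session.wait(jobId, Session.TIMEOUT_WAIT_FOREVER);
      System.out.println("Job " + jobId + " finished, exit=" + info.getExitStatus());
      session.deleteJobTemplate(jt);
    } finally {
      session.exit();
    }
  }
}
```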

10. System Configuration
    • Recasting and optimizing an application for MapReduce results in configuration challenges at three layers:
    • Application layer
      – Configuring parameter studies
    • Execution framework (Apache Hadoop)
      – Number of reduce tasks
      – Memory allocation
      – Parallel tasks, …
    • Resources
      – Compute resources: number of nodes
      – Storage resources (HDFS): block size, replication factor

11. Evaluation of the application
    • Utilization of CORA cluster resources
      – 72 CPU cores (8 x 2 x Intel Xeon quad-core X5550)
      – 9088 GPU cores (8 x 2 x Tesla C2050)
      – 216 GB main memory
      – 64 TB disk space
    • Private Cloud environment
      – 4 nodes with two 12-core AMD Opteron™ processors each
      – 192 GB main memory
      – KVM virtualization

12. Configuration of experiments
    • First experiments on
      – Resource scaling
      – Parameter studies
    • Database size
      – 100 GB and 500 GB
      – Block size 128 MB, replication factor 1
    • Hadoop configuration (see the property sketch below)
      – 8 parallel map and reduce tasks per node
      – New JVM per task
      – 1 GB memory per JVM
    • Parameter studies
      – Number of input files: 1 to 1000
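For reference, the experiment settings above map roughly onto Hadoop 0.20-era configuration properties as follows. The slides do not show the authors' actual configuration files, and the tasktracker maximums are normally set cluster-wide in mapred-site.xml rather than per job, so treat this as an illustrative mapping.

```java
// Illustrative mapping of the slide's experiment settings to Hadoop 0.20-era
// property names; not the authors' actual configuration.
import org.apache.hadoop.conf.Configuration;

public class ExperimentConfig {
  static Configuration build() {
    Configuration conf = new Configuration();
    // HDFS: 128 MB block size, replication factor 1
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);
    conf.setInt("dfs.replication", 1);
    // 8 parallel map and reduce tasks per node (cluster-side settings)
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 8);
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 8);
    // A fresh 1 GB JVM per task (reuse factor 1 = no JVM reuse)
    conf.setInt("mapred.job.reuse.jvm.num.tasks", 1);
    conf.set("mapred.child.java.opts", "-Xmx1024m");
    return conf;
  }
}
```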

13. Performance Results: Node Scaling

14. Performance Results: Parameter Study

15. Ongoing Work
    • Performance tests regarding virtual execution nodes
      – Integration of VMs, hosted in private/public Clouds, into the cluster as Hadoop execution nodes
    • Adaptive framework for automatic configuration of application, environment, and job (see the skeleton below)
      – Based on the MAPE-K loop (autonomic computing)
      – Hadoop environment configuration
        • Number of map/reduce tasks, memory allocation
        • Parallel tasks per node, Hadoop scheduler, …
      – HDFS configuration
        • Block size, replication factor
      – Resources
        • Data-local computation
        • Number of resources (CPUs)
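As a rough illustration of the MAPE-K idea (monitor, analyze, plan, execute over a shared knowledge base), the skeleton below shows the control-loop shape such an adaptive framework could take; everything here is illustrative, not the authors' design.

```java
// Illustrative MAPE-K control-loop skeleton; M = monitored metrics,
// P = a planned (re)configuration. Not the authors' actual framework.
public interface MapeK<M, P> {
  M monitor();            // collect metrics (task runtimes, HDFS stats, load)
  boolean analyze(M m);   // decide whether reconfiguration is worthwhile
  P plan(M m);            // derive new settings (tasks, memory, block size, ...)
  void execute(P p);      // apply them to the Hadoop/HDFS/resource layers
}

final class AdaptiveLoop<M, P> {
  private final MapeK<M, P> loop;
  AdaptiveLoop(MapeK<M, P> loop) { this.loop = loop; }

  // One MAPE iteration; the shared "knowledge" would live inside the MapeK
  // implementation (e.g. a history of past metrics and decisions).
  void step() {
    M metrics = loop.monitor();
    if (loop.analyze(metrics)) {
      loop.execute(loop.plan(metrics));
    }
  }
}
```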

16. Questions?
