A cloud framework for high throughput biological data processing - - PowerPoint PPT Presentation
A cloud framework for high throughput biological data processing - - PowerPoint PPT Presentation
A cloud framework for high throughput biological data processing Martin Koehler 1 , Yuriy Kaniovskyi 1 , Siegfried Benkner 1 , Volker Egelhofer 2 , Wolfram Weckwerth 2 Faculty of Computer Science 1 Faculty of Life Sciences 2 University of Vienna,
Contents
- Description of Application
Mapping tryptid peptide fragmentation mass spectra data by utilizing the ProMEX database
- Description of Infrastructure
The Vienna Cloud Environment (VCE) The Vienna Cloud Environment (VCE)
- Cloud-enabling molecular systems biology application
- Evaluation of ProMEX Application
- Ongoing work
Martin Köhler - 2
Problem statement of molecular systems biologists
- Molecular systems biology community has built
– ProMEX database storing public mass spectral references – 23810 tryptic peptide product ion spectra entries
- Chlamydomonas reinhardii, Arabidopsis thaliana, …
– http://promex.pph.univie.ac.at/promex/
- Application
– Matching tryptic peptide fragmentation mass spectra data against mass spectral reference database (ProMEX)
- ProMEX algorithm
– Matches input file to each entry in ProMEX database – Computes similarity of input file to database entries
Martin Köhler - 3
ProMEX database – Web interface
Martin Köhler - 4
Recast application to Cloud programming model and framework
- ProMEX database
– Increasing amount of data
- User requests
– Execution of parameter studies – Execution via Client API –
- MoSys application
- MoSys application
– Algorithm compares input file with each database entry – Many independent tasks -> no need for communication – Use of massively parallel programming model
- Provisioning of MoSys application
– On demand compute and storage resources – MapReduce Programming Model – Vienna Cloud Environment (VCE)
Martin Köhler - 5
Recast application to MapReduce model
- Migrate ProMEX database to HDFS
- Each map task compares
– input file with local part of database
- Similarity (percentage) of all entries reduced and sorted
- Parameter studies supported via DistributedCache
Martin Köhler - 6
Vienna Cloud Environment (VCE)
- Cloud framework for on-demand supercomputing
– Software as a Service (SaaS) – Web Service and Virtualization Technologies
- Partly developed within EU Projects:
– FP 5 GEMSS - Grid-enabling Medical Simulation Services – FP 6 @neurIST - Integrated Biomedical Informatics for the Management of Cerebral Aneurysms – FP 7 VPHShare - Virtual Physiological Human: Sharing for Healthcare
- Provisioning of virtual appliances
– Scientific applications – Data access and integration – Scientific workflows
- VCE enables resources as
– Web Services – hosted on virtual appliances
Martin Köhler - 7
VCE Virtual Appliances
- Data Appliances
– Data access and integration – Based on OGSA-DAI/DQP – Integrated data mediation engine
- Global data scheme
- Mapping rules
- Workflow Appliances
– Integrated Workflow enactment engine (WEEP) – composed execution of BPEL workflows
- Application Appliances
– Providing applications as Services – HPC applications (OpenMP, MPI) – MapReduce application
- Adaptive Execution framework
Martin Köhler - 8
VCE MoSys Architecture
- VCE Web Service interface
- Resources
– Local cluster resources – Private Cloud
- Job execution
– DRMAA – Sun Grid Engine (SGE) – Dynamic creation and configuration of Hadoop execution environment
Martin Köhler - 9
System Configuration
- Recasting and optimizing an application for MapReduce
results in configuration challenges at three layers:
- Application layer
– Configuring parameter studies
- Execution framework (Apache Hadoop)
- Execution framework (Apache Hadoop)
– Number of Reduce Tasks – Memory Allocation – Parallel tasks,…
- Resources
– Compute Resources: number of nodes – Storage Resources (HDFS): block size, replication factor
Martin Köhler - 10
Evaluation of application
- Utilization of CORA cluster resources
– 72 CPU-cores (8 x 2 x Intel Xeon quad-core X5550) – 9088 GPU-cores (8 x 2 x Tesla C2050) – main memory of 216GB – Disk space of 64TB – Disk space of 64TB
- Private Cloud environment
– 4 nodes with two 12-core AMD Opteron™ – main memory is 192GB – KVM Virtualization
Martin Köhler - 11
Configuration of experiments
- First experiments on
– Resource scaling – Parameter studies
- Database size
– 100 GB and 500 GB – Block Size 128 MB, Replication factor: 1 – Block Size 128 MB, Replication factor: 1
- Hadoop Configuration
– 8 parallel map and reduce tasks per node – new JVM per task – 1GB memory per JVM
- Parameter studies
– Number of input files: 1 to 1000
Martin Köhler - 12
Performance Results: Node Scaling
Martin Köhler - 13
Performance Results: Parameter Study
Martin Köhler - 14
Ongoing Work
- Performance tests regarding virtual execution nodes
– Regarding to integration of VMs into Cluster as Hadoop execution nodes hosted in private/public Clouds
- Adaptive Framework for automatic configuration of
application, environment, and job – Based on MAPE-K loop (autonomic computing) – Hadoop environment configuration
- Number of map/reduce tasks, memory allocation
- Parallel tasks per node, hadoop scheduler, …
– HDFS configuration
- Block size, replication factor
– Resources
- Data-local computation
- Number of resources (cpus)
Martin Köhler - 15
Questions? Questions?
Martin Köhler - 16