SLIDE 1

3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers

MTAGS 2010

Improving Many-Task Computing in Scientific Workflows Using P2P Techniques

Jonas Dias, Eduardo Ogasawara, Daniel de Oliveira, Esther Pacitti, Marta Mattoso

COPPE, Federal University of Rio de Janeiro, Brazil
INRIA & LIRMM, Montpellier, France

SLIDE 2

Introduction

  • Scientific Experiments
  • Petascale Computing
      – Behavior of hundreds of thousands of processors
      – Parallel execution failures
  • Scientific Workflows
      – Represent the chaining of activities of an experiment
      – Scientific Workflow Management Systems (SWfMS)

11/15/2010 Improving Many-Task Computing in Scientific Workflows Using P2P Techniques 2

[Figure: Typical Scientific Workflow: Pre-processing, Execution Kernel, Post-processing]

SLIDE 3

Experiment Execution

  • The same workflow may run several times
      – 5000 parameter combinations to try
      – 3 workflow variations
      – Total of 15000 instances to be executed
  • Motivation to parallelize
      – Obtain the results in a timely manner
      – Clusters, Grids and Clouds
  • Utility Computing model
      – Deliver the answers while they are still needed
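The instance count above is the product of the parameter combinations and the workflow variations. A minimal sketch of that enumeration follows. The variation names are hypothetical, and each parameter combination is treated as an opaque index, since the slides give only the counts:

```python
from itertools import product

# Hypothetical labels; the slides only state the counts (5000 and 3).
PARAMETER_COMBINATIONS = range(5000)
WORKFLOW_VARIATIONS = ("variation_a", "variation_b", "variation_c")


def enumerate_instances(combinations, variations):
    """Yield one (variation, combination) pair per workflow instance."""
    for variation, combination in product(variations, combinations):
        yield variation, combination


instances = list(enumerate_instances(PARAMETER_COMBINATIONS, WORKFLOW_VARIATIONS))
print(len(instances))  # 15000
```

Each pair is independent of the others, which is what makes the experiment a natural fit for parallel execution.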

SLIDE 4

Difficulties in Workflow Parallelism

  • MPI
      – Complex and legacy codes
      – Dynamic resource management
      – A job’s process may fail
          • Compromises the whole execution
          • Resubmitting relies on the scientist’s manual control
      – Not feasible for a huge number of tasks
  • Grid Schedulers
      – Submit many jobs simultaneously
      – Waiting time on resource management queues

SLIDE 5

MTC Workflow Parallelism

  • Many-task computing (MTC)
      – Improves Parameter Sweep and Data Parallelism
  • HPC Cluster Systems
      – Jobs are not easy to set up for submission
      – Centralized control
      – Compute nodes may fail
  • Open Issues
      – Best approaches to set up an experiment execution
      – Load balancing
      – Dynamic resource management
      – Controlling failures
          • What has failed and needs to be rescheduled?

SLIDE 6

MTC, Workflows and Clusters

  • The Heracles Approach
      – Approach to execute workflow activities
          • More transparent setup
          • Load balancing
          • Quality of service
          • Distributed provenance gathering
      – Uses the P2P model
          • To be implemented in a cluster scheduler
          • Not a P2P infrastructure

SLIDE 7

Heracles Overview

[Figure: layered diagram with the labels Scientific Workflow Management System, Workflow MTC Scheduler, Heracles, Cluster]

SLIDE 13

Heracles Structure

[Figure: Heracles structure diagram. Labels: SWfMS; Workflow Instances; Wrapper; Workflow MTC Scheduler; Cluster Scheduling; Heracles; Task; Execution Monitoring; Distributed Table; Executer; Overlay Handler; Heracles Process; Resource Manager; Node; Process; Cluster]

SLIDE 14

P2P view

[Figure: cluster layout (Resource Manager, Nodes, Processes) and the Heracles virtual P2P network view of the processes]

SLIDE 15

Heracles

SLIDE 16

Transparency

  • Set the deadline, not the number of nodes
  • Heracles controls the number of nodes involved
      – Partial efficiency of the execution
      – Automatically updates the number of processors needed
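The slides do not give the rule Heracles uses to size the allocation. As a rough sketch of deadline-driven sizing, the hypothetical function below estimates the cores needed from the remaining work, the time left and an observed efficiency. The function name, its parameters and all numbers are assumptions for illustration only:

```python
import math


def cores_needed(remaining_tasks, avg_task_hours, hours_to_deadline, efficiency=1.0):
    """Estimate how many cores are needed to finish before the deadline.

    Assumes every task takes avg_task_hours on one core and that only a
    fraction `efficiency` (0 < efficiency <= 1) of each core-hour is useful.
    """
    if hours_to_deadline <= 0:
        raise ValueError("deadline has already passed")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    core_hours = remaining_tasks * avg_task_hours / efficiency
    return math.ceil(core_hours / hours_to_deadline)


# Re-estimate as the run progresses (made-up numbers).
print(cores_needed(15000, 0.5, 100))      # 75
print(cores_needed(9000, 0.5, 40, 0.5))   # 225
```

Calling the function again with fresh measurements is one way to picture the automatic update: when efficiency drops or the deadline approaches, the estimate rises.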

SLIDE 17

Dynamic Scheduling example

[Figure: completed tasks per hour and processing cores plotted against hours; annotated values: 173 tasks per hour, 64 cores]

SLIDE 18

Efficiency

[Figure: efficiency, on a scale up to 1, plotted against hours]

SLIDE 19

Load Balancing

  • Clusters depend on the head node’s control
  • Tasks can have their autonomy
      – Like P2P dynamic control
  • Hierarchical organization
      – Based on P2P hierarchical networks
      – Group leaders
      – Working nodes
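The hierarchical organization above can be pictured as a partition of the allocated nodes into groups, each with one leader and several working nodes. The slides do not say how groups are sized or how leaders are chosen. The sketch below assumes fixed-size groups and picks the first node of each group as leader, purely for illustration:

```python
def build_groups(node_ids, group_size):
    """Split nodes into groups; the first node of each group acts as leader."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    groups = []
    for start in range(0, len(node_ids), group_size):
        members = node_ids[start:start + group_size]
        groups.append({"leader": members[0], "workers": members[1:]})
    return groups


groups = build_groups([f"node{i}" for i in range(8)], group_size=4)
for group in groups:
    print(group["leader"], "->", group["workers"])
```

With this layout, decisions for a group are made by its leader, so not every decision has to pass through the cluster head node.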

SLIDE 20

Quality of Service

  • Job’s process failure
      – Hard to reschedule with traditional approaches
      – Manual rescheduling is not feasible
      – How to record it in the provenance collection
  • P2P model can help
      – Autonomy of the nodes
      – Unfinished or failed tasks can be rescheduled
      – Provenance may register all execution attempts or only the last one
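The two provenance options in the last item differ only in what happens when a task is recorded a second time. A small sketch of both policies, using a hypothetical in-memory `ProvenanceLog` class (the slides do not describe the actual provenance store or its schema):

```python
class ProvenanceLog:
    """Records task execution attempts under one of two policies."""

    def __init__(self, keep_all_attempts=True):
        self.keep_all_attempts = keep_all_attempts
        self._attempts = {}

    def record(self, task_id, node, outcome):
        attempt = {"node": node, "outcome": outcome}
        if self.keep_all_attempts:
            self._attempts.setdefault(task_id, []).append(attempt)
        else:
            # Overwrite: only the most recent attempt survives.
            self._attempts[task_id] = [attempt]

    def attempts(self, task_id):
        return list(self._attempts.get(task_id, []))


full = ProvenanceLog(keep_all_attempts=True)
last = ProvenanceLog(keep_all_attempts=False)
for log in (full, last):
    log.record("t1", "node3", "failed")
    log.record("t1", "node5", "finished")
print(len(full.attempts("t1")), len(last.attempts("t1")))  # 2 1
```

Keeping every attempt preserves the failure history for later analysis. Keeping only the last attempt gives a smaller record that shows just the final outcome.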

SLIDE 21

When to reschedule?

  • Group leaders are responsible for the decision
      – Distributed table data
  • Status of the tasks on the distributed table
      – Pending, running or finished
  • Average execution time of a task
  • To reschedule means to change the status of the task to pending
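One way to read the points above: a group leader scans the task table, compares how long each running task has been active with the average execution time, and resets overdue tasks to pending. The sketch below uses a local dictionary in place of the distributed table. The `slack` multiplier and the field names are assumptions, since the slides do not state the exact decision rule:

```python
PENDING, RUNNING, FINISHED = "pending", "running", "finished"


def reschedule_stalled(table, now, avg_exec_time, slack=2.0):
    """Set stalled running tasks back to pending and return their ids.

    `table` maps task id -> {"status": ..., "started_at": ...}. A running
    task is treated as stalled once it has run longer than
    slack * avg_exec_time; `slack` is an assumed tuning knob.
    """
    limit = slack * avg_exec_time
    rescheduled = []
    for task_id, entry in table.items():
        if entry["status"] == RUNNING and now - entry["started_at"] > limit:
            entry["status"] = PENDING
            entry["started_at"] = None
            rescheduled.append(task_id)
    return rescheduled


table = {
    "t1": {"status": RUNNING, "started_at": 0.0},
    "t2": {"status": RUNNING, "started_at": 90.0},
    "t3": {"status": FINISHED, "started_at": 10.0},
    "t4": {"status": PENDING, "started_at": None},
}
print(reschedule_stalled(table, now=100.0, avg_exec_time=20.0))  # ['t1']
```

Because rescheduling is only a status change, any idle node that later looks for pending work can pick the task up again.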

SLIDE 22

Case Study

  • Analyze the impact of churn events on task execution on clusters
      – Many workflow activities to be executed
      – Activities are decomposed into tasks
          • Tasks suffer from churn events
      – Activities producing 512, 1024, 2048 and 4096 tasks
      – Tasks are classified as small, medium and large
      – Seven days simulated
      – Calibrated using real experiment data

SLIDE 23

Rescheduling Types

  • Manual Rescheduling
      – The scientist checks the activity status every twelve hours
      – If a failure happens, all the tasks of the activity are rescheduled
  • Automatic Rescheduling
      – Only the task that has failed is rescheduled
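The difference between the two policies can be made concrete by counting how many tasks are executed again after one round of failures. This sketch is not the simulator used in the case study. It ignores the twelve-hour checking delay, and the failure counts are made up:

```python
def tasks_rerun(activity_sizes, failed_tasks, policy):
    """Count task re-executions for one round of failures.

    activity_sizes: {activity: number of tasks}
    failed_tasks:   {activity: number of failed tasks in that activity}
    policy:         "manual" reruns every task of a failed activity,
                    "automatic" reruns only the failed tasks.
    """
    total = 0
    for activity, failures in failed_tasks.items():
        if failures == 0:
            continue
        if policy == "manual":
            total += activity_sizes[activity]
        elif policy == "automatic":
            total += failures
        else:
            raise ValueError(f"unknown policy: {policy}")
    return total


sizes = {"a1": 512, "a2": 1024}
failed = {"a1": 3, "a2": 0}  # made-up failure counts
print(tasks_rerun(sizes, failed, "manual"))     # 512
print(tasks_rerun(sizes, failed, "automatic"))  # 3
```

Under the manual policy the cost of a single failure grows with the size of the activity, which matters for activities that produce thousands of tasks.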

SLIDE 24

Small Tasks

SLIDE 25

Medium Tasks

SLIDE 26

Large Tasks

SLIDE 27

Conclusions

  • Empowering the execution of scientific experiments
      – Scientific workflow parallelization on huge clusters
      – Many-task computing
      – Process failures, poor load balancing, usability issues
  • Heracles Approach
      – Transparency, load balancing and quality of service
      – Using the P2P model on clusters
  • The case study showed the gains of automatic rescheduling

SLIDE 28

Future Work

  • Analyze the advantages that MTC schedulers can achieve when using the full Heracles approach
  • Using Heracles in real experiments
      – Implementing it on real schedulers such as Hydra
  • Evaluate other fault-tolerance mechanisms, such as redundant executions

SLIDE 29

Acknowledgements
