Parallelization techniques: Applying Map, Reduce and Cross concepts




SLIDE 1

bioKepler - September, 2012

bioKepler.org

Ilkay ALTINTAS, Ph.D.

Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD
Lab Director, Scientific Workflow Automation Technologies
altintas@sdsc.edu

Parallelization techniques: Applying Map, Reduce and Cross concepts using bioActors

SLIDE 2

What is Parallelization?

SLIDE 3

What is Parallelization?

SLIDE 4

Distributed Computing Environments

  • Figure 1: Grids and Clouds Overview

Grid Computing aims to “enable resource sharing and

Figure 1 from: Ian Foster, Yong Zhao, Ioan Raicu, Shiyong Lu. “Cloud Computing and Grid Computing 360-Degree Compared”. Grid Computing Environments Workshop (GCE), 2008.

SLIDE 5

Parallelization Solutions in Distributed Environments

  • Traditional parallel programming interfaces
    – Examples: MPI and OpenMP
    – Hard to implement
    – Original sequential tools cannot be reused

  • Parallel job execution
    – Examples: SGE and Condor
    – Original sequential tools can be reused
    – Create small jobs by splitting data or tasks
    – Hard to achieve data locality for each job

  • Data parallel job execution
    – Examples: Hadoop and Stratosphere
    – Original sequential tools can be reused
    – Support customized and automatic data partitioning and distribution
    – Support data locality for each job through a special distributed file system, e.g., HDFS

SLIDE 6

Data Parallel Task Execution

  • Static executables run as processes
  • Independent data items are assigned to processes

[Figure: data items D1 to D8 assigned to processes P1 to P4]
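
The idea on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the original deck: the "static executable" is a stand-in (the Python interpreter running a one-line script that upper-cases its argument), the data items `d1` to `d8` are hypothetical, and a pool of four worker threads plays the role of P1 to P4. The threads only launch and wait on the separate OS processes.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a static executable: a tiny script that upper-cases its argument.
# (Hypothetical; any command-line tool that takes one input would do.)
TOOL = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]


def run_tool(item: str) -> str:
    """Run the executable as its own OS process on a single data item."""
    done = subprocess.run(TOOL + [item], capture_output=True, text=True, check=True)
    return done.stdout.strip()


def run_data_parallel(items, workers=4):
    """Assign independent data items to `workers` concurrent slots (P1..P4)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whatever order the items finish in.
        return list(pool.map(run_tool, items))


if __name__ == "__main__":
    data = [f"d{i}" for i in range(1, 9)]  # D1..D8
    print(run_data_parallel(data))
```

Because the items are independent, no process needs to see another's input or output, which is what makes this kind of assignment straightforward to scale.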

SLIDE 7

Distributed Data Parallel (DDP) Task Execution

  • Static executables run as processes on distributed environments
  • Independent data items are assigned to processes

[Figure: data items D1 to D8 assigned to processes P1 to P4]

SLIDE 8

MapReduce: A Typical DDP Execution Pattern

  • Chop the data (value) based on a feature of interest (key)
  • Iterate a function on each value
  • Order the intermediate data products (intermediate values)
  • Stitch the intermediate values
  • Can execute using a specialized engine
    – Examples: Hadoop and Nephele
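
The four steps above can be sketched in plain Python, without any engine. This is an illustrative sketch only: the function names `map_phase`, `shuffle_phase` and `reduce_phase` and the sample records are assumptions made for the example, and a real engine such as Hadoop would distribute each phase across machines.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records, fn):
    """Iterate `fn` on every value, keeping its key."""
    return [(key, fn(value)) for key, value in records]


def shuffle_phase(pairs):
    """Order the intermediate pairs by key and group values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_phase(grouped, fn):
    """Stitch each key's intermediate values into one result."""
    return {key: fn(values) for key, values in grouped}


# Hypothetical input: (organism, sequence) pairs; the organism is the key.
records = [
    ("ecoli", "ACGT"),
    ("yeast", "GGA"),
    ("ecoli", "TT"),
    ("yeast", "CCCCC"),
]

lengths = map_phase(records, len)    # sequence length per record
grouped = shuffle_phase(lengths)     # [("ecoli", [4, 2]), ("yeast", [3, 5])]
totals = reduce_phase(grouped, sum)  # {"ecoli": 6, "yeast": 8}
print(totals)
```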

SLIDE 9

Many Other DDP Patterns

  • Images taken from: http://www.stratosphere.eu

SLIDE 10

Distributed Data-Parallel bioActors

  • Set of steps to execute a bioinformatics tool in a DDP environment
  • Customized from the ExecutionChoice actor
  • Includes:
    – Data-parallel patterns, e.g., Map, Reduce, Cross, All-Pairs, to specify data grouping
    – I/O to interface with storage
    – Data format specifying how to split and join
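
To make the "data format" ingredient concrete, here is a hedged sketch of what a split and join rule could look like for FASTA input. It assumes records begin with a `>` header line; the helpers `read_records`, `split_fasta` and `join_outputs` are hypothetical names for this example and are not bioKepler APIs.

```python
def read_records(text):
    """Return FASTA records, each a header line plus its sequence lines."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith(">") and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))
    return records


def split_fasta(text, n_parts):
    """Split FASTA text into at most `n_parts` chunks without cutting a record."""
    records = read_records(text)
    if not records:
        return []
    size = -(-len(records) // n_parts)  # ceiling division
    return ["".join(records[i:i + size]) for i in range(0, len(records), size)]


def join_outputs(outputs):
    """Join per-chunk outputs by concatenation (suits line-oriented results)."""
    return "".join(outputs)


fasta = ">q1\nACGT\n>q2\nGGA\n>q3\nTT\n"
parts = split_fasta(fasta, 2)
print(parts)
```

Splitting on record boundaries rather than on byte offsets is the design point: a tool that receives half a sequence would produce wrong results without reporting an error.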

SLIDE 11

A Workflow with Three bioActors

  • BLASTALL

SLIDE 12

Configuring the BLASTALL bioActor

SLIDE 13

Inside the LocalExecution Tab

  • External Execution

SLIDE 14

Inside the MapReduce Tab

  • Stratosphere Blast

SLIDE 15

Inside the MapReduce Tab

SLIDE 16

BLASTALL with MapReduce

SLIDE 17

Inside the Stratosphere Blast

SLIDE 18

DDP BLAST Workflow via Splitting Query Sequences

  • Switch director to work with other DDP engines, such as Hadoop
  • Execute with data partition
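
The point that the workflow stays the same while the engine changes can be illustrated with a small Python analogy. This is not Kepler or Hadoop code: `serial_engine` and `threaded_engine` are hypothetical stand-ins for two directors, `toy_search` is a stand-in for BLAST that only finds exact substring matches, and the reference and query sequences are made up.

```python
from concurrent.futures import ThreadPoolExecutor

REFERENCE = "TTACGTGGATCCA"  # hypothetical reference sequence


def toy_search(partition):
    """Stand-in for BLAST on one query partition: report exact substring hits."""
    return [(qid, REFERENCE.find(seq)) for qid, seq in partition if seq in REFERENCE]


def partition_queries(queries, size):
    """Split the query list into partitions of at most `size` records."""
    return [queries[i:i + size] for i in range(0, len(queries), size)]


def serial_engine(fn, partitions):
    """Execute partitions one after another."""
    return [fn(part) for part in partitions]


def threaded_engine(fn, partitions):
    """Execute partitions concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fn, partitions))


def run_workflow(queries, engine, size=2):
    """Same workflow logic regardless of which engine executes the partitions."""
    per_partition = engine(toy_search, partition_queries(queries, size))
    return [hit for hits in per_partition for hit in hits]  # merge step


queries = [("q1", "ACGT"), ("q2", "GGGG"), ("q3", "GATC"), ("q4", "CCA")]
print(run_workflow(queries, serial_engine))
print(run_workflow(queries, threaded_engine))
```

Both calls print the same hits, since only the execution strategy differs between them.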

SLIDE 19

DDP BLAST Workflow using Cross and Reduce

  • Reference data partition for each execution
  • Query data partition for each execution
  • Same reduce sub-workflow as in the Map workflow
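
A rough sketch of the Cross and Reduce combination follows, assuming both the reference and the query data have already been partitioned. Cross pairs every reference partition with every query partition, and Reduce then gathers the hits per query. The names `cross`, `toy_align` and `reduce_by_query` and all sequences are hypothetical; `toy_align` is a stand-in for BLAST that only checks exact substring matches.

```python
from collections import defaultdict
from itertools import product


def cross(ref_parts, query_parts, fn):
    """Cross: call `fn` once for every (reference partition, query partition) pair."""
    hits = []
    for ref_part, query_part in product(ref_parts, query_parts):
        hits.extend(fn(ref_part, query_part))
    return hits


def toy_align(ref_part, query_part):
    """Stand-in for BLAST: emit (query id, reference id) on exact substring match."""
    return [
        (qid, rid)
        for rid, rseq in ref_part
        for qid, qseq in query_part
        if qseq in rseq
    ]


def reduce_by_query(hits):
    """Reduce: gather all hits that share a query id."""
    grouped = defaultdict(list)
    for qid, rid in hits:
        grouped[qid].append(rid)
    return {qid: sorted(rids) for qid, rids in grouped.items()}


# Hypothetical data, already split into two partitions each.
ref_parts = [[("r1", "AACGTT")], [("r2", "GGACGA")]]
query_parts = [[("q1", "ACG")], [("q2", "CGTT"), ("q3", "TTTT")]]

result = reduce_by_query(cross(ref_parts, query_parts, toy_align))
print(result)
```

With two reference partitions and two query partitions, the tool runs four times. The reduce step is what reassembles one answer per query from those separate executions.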

SLIDE 20

What if the bioActor I need is not available?

  • ExecutionChoice
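
As a loose analogy only, the idea of one workflow step offering several interchangeable ways to execute can be sketched in Python. The class `ExecutionChoiceSketch` and its methods are hypothetical and do not mirror the Kepler actor's real interface; the choice names reuse the LocalExecution and MapReduce tab names from the earlier slides.

```python
from typing import Callable, Dict, List


class ExecutionChoiceSketch:
    """Toy analogue: one step, several interchangeable ways to execute it."""

    def __init__(self) -> None:
        self._choices: Dict[str, Callable[[List[str]], List[str]]] = {}

    def add_choice(self, name: str, fn: Callable[[List[str]], List[str]]) -> None:
        """Register an alternative execution under a name."""
        self._choices[name] = fn

    def run(self, choice: str, data: List[str]) -> List[str]:
        """Execute the step using the selected alternative."""
        if choice not in self._choices:
            raise KeyError(f"unknown choice {choice!r}; available: {sorted(self._choices)}")
        return self._choices[choice](data)


def local_execution(data: List[str]) -> List[str]:
    """Run the (stand-in) tool once over all input."""
    return [item.upper() for item in data]


def partitioned_execution(data: List[str], size: int = 2) -> List[str]:
    """Split the input, run the same tool per partition, then merge."""
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    return [out for part in parts for out in local_execution(part)]


step = ExecutionChoiceSketch()
step.add_choice("LocalExecution", local_execution)
step.add_choice("MapReduce", partitioned_execution)

data = ["acgt", "gga", "tt"]
print(step.run("LocalExecution", data))
print(step.run("MapReduce", data))
```

Wrapping a new tool then amounts to supplying one function per execution alternative, while the surrounding workflow only refers to the step by name.
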

SLIDE 21

DDP bioActor Usage Model

[Figure: DDP bioActor usage model]

  • bioActor Library: A1, A2, An, DDP Blast, DDP Generic
  • User: Workflow Developer
  • 1. Search
  • 2a. Choose Specific
  • 2b. Choose Generic
  • 2b. Create Sub-Workflow
  • 3. Add to Workflow
  • 4a. Execute
  • 4b. Add to Larger Workflow
  • 4c. Save in Library
  • Other diagram labels: Workflow, DDP Director, Results

SLIDE 22

NEXT: Kepler Interface and Introductory Examples on Using Kepler

Daniel Crawl

  • 1st Workshop on bioKepler Tools and Its Applications