Parallelization techniques: Applying Map, Reduce and Cross concepts using bioActors
Ilkay ALTINTAS, Ph.D.
Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD
Lab Director, Scientific Workflow Automation Technologies
altintas@sdsc.edu
bioKepler.org bioKepler - September, 2012

What is Parallelization?

Distributed Computing Environments
Figure 1 (Grids and Clouds Overview) from: “Cloud Computing and Grid Computing 360-Degree Compared”, Ian Foster, Yong Zhao, Ioan Raicu, Shiyong Lu. Grid Computing Environments Workshop (GCE), 2008.
Grid Computing aims to “enable resource sharing and

Parallelization Solutions in Distributed Environments
• Traditional parallel programming interfaces
  – Examples: MPI and OpenMP
  – Hard to implement
  – Original sequential tools cannot be reused
• Parallel job execution
  – Examples: SGE and Condor
  – Original sequential tools can be reused
  – Create small jobs by splitting data or tasks
  – Hard to achieve data locality for each job
• Data parallel job execution
  – Examples: Hadoop and Stratosphere
  – Original sequential tools can be reused
  – Support customized and automatic data partition and distribution
  – Support data locality for each job through a special distributed file system, HDFS

Data Parallel Task Execution
• Static executables run as processes
• Independent data items are assigned to processes
[Diagram: P1 handles D1 and D5; P2 handles D2 and D6; P3 handles D3 and D7; P4 handles D4 and D8]
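The assignment in the diagram above looks like a round-robin distribution of independent data items over four processes. A minimal sketch of that idea follows; the function name `assign_round_robin` and the string labels are illustrative assumptions, not part of bioKepler or Kepler.

```python
from collections import defaultdict


def assign_round_robin(items, num_processes):
    """Assign independent data items to processes in round-robin order.

    Hypothetical helper for illustration; labels P1..Pn mirror the slide.
    """
    assignment = defaultdict(list)
    for index, item in enumerate(items):
        process = f"P{index % num_processes + 1}"
        assignment[process].append(item)
    return dict(assignment)


# Eight data items spread over four processes, as in the diagram.
items = [f"D{i}" for i in range(1, 9)]
assignment = assign_round_robin(items, 4)
for process, data in assignment.items():
    print(process, data)
```

Because the items are independent, each process can work through its own list without coordinating with the others.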

Distributed Data Parallel (DDP) Task Execution
• Static executables run as processes on distributed environments
• Independent data items are assigned to processes
[Diagram: processes P1 to P4 and data items D1 to D8]

MapReduce: A Typical DDP Execution Pattern
• Chop the data (value) based on a feature of interest (key)
• Iterate a function on each value
• Order the intermediate data products (intermediate values)
• Stitch the intermediate values
• Can execute using a specialized engine
  – Examples: Hadoop and Nephele
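The four steps above can be sketched in a few lines of single-process Python. This is only an illustration of the pattern, not how Hadoop or Nephele implement it; `map_reduce`, `count_bases`, and `total` are hypothetical names, and the input reads are made-up sample data.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, map_fn, reduce_fn):
    """Run a map function, order the intermediate pairs, then reduce per key."""
    # Map: iterate map_fn on each value; it yields (key, intermediate value) pairs.
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Order: sort by key so that pairs sharing a key become adjacent.
    intermediate.sort(key=itemgetter(0))
    # Reduce: stitch together the intermediate values of each key.
    return {
        key: reduce_fn(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    }


def count_bases(sequence):
    """Map step: emit (base, 1) for every base in a sequence."""
    for base in sequence:
        yield base, 1


def total(key, values):
    """Reduce step: sum the counts collected for one base."""
    return sum(values)


reads = ["ACGT", "AACC", "GGTA"]  # made-up sample input
base_counts = map_reduce(reads, count_bases, total)
print(base_counts)  # {'A': 4, 'C': 3, 'G': 3, 'T': 2}
```

A real DDP engine runs the map and reduce calls on different nodes and moves the intermediate pairs between them; the sketch keeps only the data flow.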

Many Other DDP Patterns
Images taken from: http://www.stratosphere.eu

Distributed Data-Parallel bioActors
• Set of steps to execute a bioinformatics tool in a DDP environment
• Customized from the ExecutionChoice actor
• Includes:
  – Data-parallel patterns, e.g., Map, Reduce, Cross, All-Pairs, etc., to specify data grouping
  – I/O to interface with storage
  – Data format specifying how to split and join

A Workflow with Three bioActors
[Screenshot label: BLASTALL]

Configuring the BLASTALL bioActor

Inside the LocalExecution Tab
[Screenshot label: External Execution]

Inside the MapReduce Tab
[Screenshot label: Stratosphere Blast]

BLASTALL with MapReduce

Inside the Stratosphere Blast

DDP BLAST Workflow via Splitting Query Sequences
• Switch the director to work with other DDP engines, such as Hadoop
• Execute with data partition
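Splitting the query sequences means cutting the query file into partitions that each hold whole records, running the tool once per partition, and joining the outputs. A minimal sketch, assuming FASTA-formatted queries: `split_fasta` and `fake_blast` are hypothetical helpers, and `fake_blast` only echoes query IDs instead of invoking BLAST.

```python
def split_fasta(fasta_text, records_per_partition):
    """Split FASTA text into partitions that contain whole records only."""
    records, current = [], []
    for line in fasta_text.splitlines():
        # A header line (">") starts a new record.
        if line.startswith(">") and current:
            records.append(current)
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append(current)
    return [
        "\n".join(line for record in records[i:i + records_per_partition]
                  for line in record) + "\n"
        for i in range(0, len(records), records_per_partition)
    ]


def fake_blast(query_partition):
    """Hypothetical stand-in for one BLAST run: return the query IDs it received."""
    return [line[1:].split()[0]
            for line in query_partition.splitlines()
            if line.startswith(">")]


queries = ">q1\nACGT\n>q2\nGGCC\nTTAA\n>q3\nAAAA\n"  # made-up sample input
partitions = split_fasta(queries, 2)
# Map: one independent run per partition; then join the outputs in order.
results = [hit for part in partitions for hit in fake_blast(part)]
print(len(partitions), results)
```

The partitions never share a record, so each run is independent and can be placed on a different node by the DDP engine.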

DDP BLAST Workflow using Cross and Reduce
• Same Reduce sub-workflow as in the Map workflow
• Reference data partition for each execution
• Query data partition for each execution
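With Cross, every query partition is paired with every reference partition, and Reduce then merges the per-pair outputs that belong to the same query partition. The sketch below illustrates that data flow only; `cross`, `reduce_by_key`, and `exact_match` are hypothetical names, `exact_match` is a substring check standing in for BLAST, and the sequences are made up.

```python
from collections import defaultdict
from itertools import product


def exact_match(queries, references):
    """Hypothetical stand-in for BLAST: (query, reference name) for substring hits."""
    return [(q, name) for q in queries
            for name, seq in references.items() if q in seq]


def cross(query_parts, reference_parts, fn):
    """Cross: apply fn to every (query partition, reference partition) pair."""
    for (q_id, queries), (_, references) in product(
            query_parts.items(), reference_parts.items()):
        yield q_id, fn(queries, references)


def reduce_by_key(pairs):
    """Reduce: merge the hit lists produced for the same query partition."""
    merged = defaultdict(list)
    for key, hits in pairs:
        merged[key].extend(hits)
    return {key: sorted(hits) for key, hits in merged.items()}


query_parts = {"Q1": ["ACG", "TTT"], "Q2": ["GGA"]}
reference_parts = {
    "R1": {"ref_a": "AACGT"},
    "R2": {"ref_b": "TTTGGA", "ref_c": "CCGGAC"},
}
hits = reduce_by_key(cross(query_parts, reference_parts, exact_match))
print(hits)
```

Two query partitions and two reference partitions give four independent executions, each seeing one partition of each input.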

What if the bioActor I need is not available?
[Screenshot label: ExecutionChoice]

DDP bioActor Usage Model
[Diagram: a user (workflow developer) works with the bioActor library (A1, A2, ..., An) and a DDP director. Labeled steps: 1. Search; 2a. Choose specific DDP bioActor (e.g., DDP Blast); 2b. Choose generic DDP bioActor and create sub-workflow; 3. Add to workflow; 4a. Execute (results); 4b. Add to larger workflow; 4c. Save in library]

NEXT: Kepler Interface and Introductory Examples on Using Kepler
Daniel Crawl
1st Workshop on bioKepler Tools and Its Applications
