Parallelization techniques: Applying Map, Reduce and Cross concepts

Parallelization techniques: Applying Map, Reduce and Cross concepts using bioActors. Ilkay ALTINTAS, Ph.D., Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD; Lab Director, Scientific Workflow Automation Technologies.


  1. Parallelization techniques: Applying Map, Reduce and Cross concepts using bioActors
     Ilkay ALTINTAS, Ph.D.
     Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD
     Lab Director, Scientific Workflow Automation Technologies
     altintas@sdsc.edu
     bioKepler.org, bioKepler - September, 2012

  2. What is Parallelization?

  3. What is Parallelization?

  4. Distributed Computing Environments
     Figure 1: Grids and Clouds Overview. From: "Cloud Computing and Grid Computing 360-Degree Compared", Ian Foster, Yong Zhao, Ioan Raicu, Shiyong Lu. Grid Computing Environments Workshop (GCE), 2008.
     Grid Computing aims to “enable resource sharing and

  5. Parallelization Solutions in Distributed Environments
     • Traditional parallel programming interfaces
       – Examples: MPI and OpenMP
       – Hard to implement
       – Original sequential tools cannot be reused
     • Parallel job execution
       – Examples: SGE and Condor
       – Original sequential tools can be reused
       – Create small jobs by splitting data or tasks
       – Hard to achieve data locality for each job
     • Data parallel job execution
       – Examples: Hadoop and Stratosphere
       – Original sequential tools can be reused
       – Support customized and automatic data partitioning and distribution
       – Support data locality for each job through a special distributed file system (HDFS)

  6. Data Parallel Task Execution
     • Static executables run as processes
     • Independent data items are assigned to processes
     [Figure: data items assigned to processes. P1: D1, D5; P2: D2, D6; P3: D3, D7; P4: D4, D8]
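
The assignment shown on this slide can be sketched in Python as a round-robin split of eight data items over four workers. This is only an illustration under stated assumptions: the function names are hypothetical, `str.lower` stands in for an unchanged sequential tool, and a thread pool stands in for the separate processes the slide describes.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_round_robin(items, n_workers):
    """Deal items out in turn: item i goes to worker i % n_workers."""
    partitions = [[] for _ in range(n_workers)]
    for i, item in enumerate(items):
        partitions[i % n_workers].append(item)
    return partitions


def run_worker(partition, tool):
    """Apply the same unchanged tool to every item of one partition."""
    return [tool(item) for item in partition]


def run_data_parallel(items, n_workers, tool):
    """Run one worker per partition (threads here, processes on the slide)."""
    partitions = assign_round_robin(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda part: run_worker(part, tool), partitions))


data = [f"D{i}" for i in range(1, 9)]       # D1 .. D8
partitions = assign_round_robin(data, 4)    # partitions[0] is P1's share
results = run_data_parallel(data, 4, str.lower)
```

Because the data items are independent, no worker needs to see another worker's partition, which is what lets the sequential tool be reused without modification.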

  7. Distributed Data Parallel (DDP) Task Execution
     • Static executables run as processes on distributed environments
     • Independent data items are assigned to processes
     [Figure: data items assigned to distributed processes. P1: D1, D5; P2: D2, D6; P3: D3, D7; P4: D4, D8]

  8. MapReduce: A Typical DDP Execution Pattern
     • Chop the data (value) based on a feature of interest (key)
     • Iterate a function on each value
     • Order the intermediate data products (intermediate values)
     • Stitch the intermediate values
     • Can execute using a specialized engine
       – Examples: Hadoop and Nephele
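
The four steps on this slide can be sketched in a few lines of single-machine Python. This is a toy illustration, not how Hadoop or Nephele is implemented; `map_reduce`, `length_mapper`, `sum_reducer`, and the sample records are all hypothetical names and data chosen for the example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, group-and-order, and reduce over (key, value) records."""
    intermediate = defaultdict(list)
    # Iterate the map function on each value; it emits (key, value) pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Order the intermediate values by key, then stitch each group together.
    return {k: reduce_fn(k, intermediate[k]) for k in sorted(intermediate)}


def length_mapper(sample_id, sequence):
    """Emit the length of one sequence fragment under its sample id."""
    yield sample_id, len(sequence)


def sum_reducer(sample_id, lengths):
    """Stitch the per-fragment lengths into one total per sample."""
    return sum(lengths)


# Data already chopped into (key, value) records: key = sample, value = fragment.
records = [("s1", "ACGT"), ("s2", "GG"), ("s1", "AC")]
totals = map_reduce(records, length_mapper, sum_reducer)
```

In a real engine the map calls and the reduce calls for different keys run on different nodes; only the grouping step requires moving data between them.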

  9. Many Other DDP Patterns
     Images taken from: http://www.stratosphere.eu

  10. Distributed Data-Parallel bioActors
     • Set of steps to execute a bioinformatics tool in a DDP environment
     • Customized from the ExecutionChoice actor
     • Includes:
       – Data-parallel patterns, e.g., Map, Reduce, Cross, All-Pairs, etc., to specify data grouping
       – I/O to interface with storage
       – Data format specifying how to split and join

  11. A Workflow with Three bioActors
     [Figure label: BLASTALL]

  12. Configuring the BLASTALL bioActor

  13. Inside the LocalExecution Tab
     [Figure label: External Execution]

  14. Inside the MapReduce Tab
     [Figure label: Stratosphere Blast]

  15. Inside the MapReduce Tab

  16. BLASTALL with MapReduce

  17. Inside the Stratosphere Blast

  18. DDP BLAST Workflow via Splitting Query Sequences
     • Switch director to work with other DDP engines, such as Hadoop
     • Execute with data partition
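
Splitting the query sequences can be sketched as parsing a FASTA file into records and cutting the record list into fixed-size partitions, each of which is then handed to one execution of the tool. The sketch below is an assumption about the general idea, not the bioKepler implementation: all function names are hypothetical, and `toy_tool` merely reports sequence lengths where a real workflow would run BLAST on the partition.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) records."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def partition_records(records, size):
    """Cut the record list into consecutive partitions of at most `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def toy_tool(partition):
    """Hypothetical stand-in for one BLAST run on one query partition."""
    return [(header, len(seq)) for header, seq in partition]


fasta = ">q1\nACGT\nAC\n>q2\nGG\n>q3\nTTT\n"
queries = parse_fasta(fasta)
parts = partition_records(queries, 2)
# Map: run the tool once per partition; then join the per-partition outputs.
merged = [row for part in parts for row in toy_tool(part)]
```

Records are never split across partitions, so each execution sees whole query sequences and the outputs can simply be concatenated.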

  19. DDP BLAST Workflow using Cross and Reduce
     • Same Reduce sub-workflow as in the Map workflow
     • Reference data partition for each execution
     • Query data partition for each execution
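
When the reference data is partitioned as well, every query partition has to meet every reference partition, and the partial results for each query are merged afterwards. A minimal sketch of that Cross-then-Reduce shape follows. The names are hypothetical, and `toy_search` (exact substring matching) is only a stand-in for a BLAST execution on one pair of partitions.

```python
from collections import defaultdict
from itertools import product


def cross(query_parts, ref_parts, fn):
    """Run fn once for every (query partition, reference partition) pair."""
    pairs = []
    for q_part, r_part in product(query_parts, ref_parts):
        pairs.extend(fn(q_part, r_part))
    return pairs


def toy_search(query_part, ref_part):
    """Hypothetical stand-in for BLAST: report references containing the query."""
    return [(qid, rid)
            for qid, qseq in query_part
            for rid, rseq in ref_part
            if qseq in rseq]


def reduce_by_query(pairs):
    """Merge the hits found in different executions, grouped by query id."""
    grouped = defaultdict(list)
    for qid, rid in pairs:
        grouped[qid].append(rid)
    return {qid: sorted(rids) for qid, rids in sorted(grouped.items())}


query_parts = [[("q1", "ACG")], [("q2", "TTT")]]
ref_parts = [[("r1", "AACGT")], [("r2", "GACGA"), ("r3", "CCC")]]
hits = reduce_by_query(cross(query_parts, ref_parts, toy_search))
```

With 2 query partitions and 2 reference partitions, Cross produces 4 independent executions; the Reduce step is what reassembles one result list per query.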

  20. What if the bioActor I need is not available?
     [Figure label: ExecutionChoice]

  21. DDP bioActor Usage Model
     [Figure: usage model diagram. User: Workflow Developer. bioActor Library: A1, A2, ..., An. Step labels: 1. Search; 2a. Choose; 2b. Choose; 2b. Create Sub-Workflow; 3. Add to Workflow; 4a. Execute; 4b. Add to Larger Workflow; 4c. Save in Library. Other labels: Specific DDP, Generic DDP, DDP Blast, Generic, DDP Director, Workflow, Results]

  22. NEXT: Kepler Interface and Introductory Examples on Using Kepler
     Daniel Crawl
     1st Workshop on bioKepler Tools and Its Applications
