Tutorial
http://www.iterativemapreduce.org/
Workshop
Jaliya Ekanayake
Community Grids Laboratory, Digital Science Center
Pervasive Technology Institute
Indiana University
SALSA

Acknowledgements to: Team at IU
– Seung-Hee Bae
– Jong Choi
– Saliya Ekanayake
– Geoffrey Fox
– Thilina Gunarathne
– Ryan Hartman
– Adam Lee
– Hui Li
– Judy Qiu
– Binging Shang
– Stephen Wu
– Ruan Yang

Resources
• Twister Website
  – http://www.iterativemapreduce.org/
• Twister Tutorial Package
  – http://salsahpc.indiana.edu/tutorial/apps/Twister.zip
• Naradabrokering
  – http://www.naradabrokering.org/software.htm
• Account Info
  – trainXXX@bigdata.india.futuregrid.org OR trainXXX@bigdata.sierra.futuregrid.org
• Request a New Node
  – qsub -I
• Tutorial Pages
  – http://salsahpc.indiana.edu/tutorial/twister-intro.html
  – http://salsahpc.indiana.edu/tutorial/twister_install.htm
  – http://salsahpc.indiana.edu/tutorial/twister_wordcount_user_guide.htm
  – http://salsahpc.indiana.edu/tutorial/twister_blast_user_guide.htm
  – http://salsahpc.indiana.edu/tutorial/twister_kmeans_user_guide.htm

Contents
• Twister: Runtime for Iterative MapReduce
• Sample Application & MapReduce algorithm
• Code Walkthrough
• Hands-on exercise

Motivation
• Data Deluge: experienced in many domains
• MapReduce: data centered, QoS
• Classic Parallel Runtimes (MPI): efficient and proven techniques
• Expand the applicability of MapReduce to more classes of applications
  – Map-Only
  – MapReduce
  – Iterative MapReduce (iterations)
  – More Extensions
[Figure: the four patterns above drawn as Input -> map -> reduce -> Output pipelines; the last panel is labelled Pij]

Iterative MapReduce using Existing Runtimes
[Figure: Main Program iterating over Map(Key, Value) and Reduce(Key, List<Value>), annotated as follows]
– Static data is loaded in every iteration
– Variable data is passed via e.g. the Hadoop distributed cache
– New map/reduce tasks are created in every iteration
– Intermediate data moves Local disk -> HTTP -> Local disk
– Reduce outputs are saved into multiple files
• Focuses mainly on single step map->reduce computations
• Considerable overheads from:
  – Reinitializing tasks
  – Reloading static data
  – Communication & data transfers

Iterative MapReduce using Twister
[Figure: Main Program iterating over Configure(), Map(Key, Value), Reduce(Key, List<Value>) and Combine(Map<Key,Value>), annotated as follows]
– Static data is loaded only once
– Long running map/reduce tasks (cached)
– Direct data transfer via pub/sub
– Combiner operation to collect all reduce outputs
• Distributed data access
• Distinction on static data and variable data (data flow vs. δ flow)
• Cacheable map/reduce tasks (long running tasks)
• Combine operation
• Support fast intermediate data transfers

Twister Programming Model
User program's process space:
    configureMaps(..)
    configureReduce(..)
    while(condition){
        runMapReduce(..)      // iterations: Map(), Reduce(), Combine() operation
        updateCondition()
    } //end while
    close()
Worker Nodes:
– Cacheable map/reduce tasks, with access to the Local Disk
– May send <Key,Value> pairs directly
– Communications/data transfers via the pub-sub broker network
Two configuration options:
1. Using local disks (only for maps)
2. Using pub-sub bus
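The control flow on this slide can be sketched in a few lines of Python. This is not the Twister API (Twister is a Java runtime); `IterativeDriver`, `configure_maps` and `run_map_reduce` are hypothetical, single-process stand-ins that only illustrate the idea of caching static data once and passing a small variable value on each iteration.

```python
from collections import defaultdict


class IterativeDriver:
    """Toy, single-process stand-in for an iterative MapReduce driver (hypothetical)."""

    def __init__(self, map_fn, reduce_fn, combine_fn):
        self.map_fn = map_fn
        self.reduce_fn = reduce_fn
        self.combine_fn = combine_fn
        self.partitions = []       # static data, kept across iterations
        self.configure_calls = 0

    def configure_maps(self, partitions):
        # Static data is handed to the map tasks once, before the loop starts.
        self.partitions = list(partitions)
        self.configure_calls += 1

    def run_map_reduce(self, variable_data):
        # Map: every cached partition sees the small per-iteration value.
        grouped = defaultdict(list)
        for part in self.partitions:
            for key, value in self.map_fn(part, variable_data):
                grouped[key].append(value)
        # Reduce: one call per key.
        reduced = {k: self.reduce_fn(k, vs) for k, vs in grouped.items()}
        # Combine: gather all reduce outputs into one value for the main program.
        return self.combine_fn(reduced)


# Example: move a guess toward the mean of the data, half a step per iteration.
def map_fn(part, guess):
    yield "diff", sum(x - guess for x in part)
    yield "n", len(part)


driver = IterativeDriver(
    map_fn=map_fn,
    reduce_fn=lambda key, values: sum(values),
    combine_fn=lambda reduced: reduced["diff"] / reduced["n"],
)
driver.configure_maps([[1, 2, 3], [4, 5, 6]])   # loaded once

guess, step = 0.0, float("inf")
while abs(step) > 1e-6:                          # while(condition)
    step = driver.run_map_reduce(guess)          # runMapReduce(..)
    guess += 0.5 * step                          # updateCondition()
```

Only `guess` changes between iterations; the partitions are configured a single time, which is the distinction between static and variable data that the slide draws.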

Twister Architecture
[Figure: architecture diagram with the following components]
• Master Node: Main Program, Twister Driver
• Pub/sub Broker Network (B = broker)
  – One broker serves several Twister daemons
• Worker Node (one per machine): Twister Daemon, Worker Pool with cacheable map/reduce tasks, Local Disk
• Scripts perform: data distribution, data collection, and partition file creation

Input/Output Handling
[Figure: Node 0, Node 1, ..., Node n, each with a common directory on its local disk (e.g. /tmp/twister_data); the Data Manipulation Tool works across them and produces the Partition File]
• Data Manipulation Tool:
  – Provides basic functionality to manipulate data across the local disks of the compute nodes
  – Data partitions are assumed to be files (in contrast to fixed-size blocks in Hadoop)
  – Supported commands:
    • mkdir, rmdir, put, putall, get, ls
    • Copy resources
    • Create Partition File

Partition File

File No   Node IP         Daemon No   File partition path
4         156.56.104.96   2           /home/jaliya/data/mds/GD-4D-23.bin
5         156.56.104.96   2           /home/jaliya/data/mds/GD-4D-0.bin
6         156.56.104.96   2           /home/jaliya/data/mds/GD-4D-27.bin
7         156.56.104.96   2           /home/jaliya/data/mds/GD-4D-20.bin
8         156.56.104.97   4           /home/jaliya/data/mds/GD-4D-23.bin
9         156.56.104.97   4           /home/jaliya/data/mds/GD-4D-25.bin
10        156.56.104.97   4           /home/jaliya/data/mds/GD-4D-18.bin
11        156.56.104.97   4           /home/jaliya/data/mds/GD-4D-15.bin

• Partition file allows duplicates
• One data partition may reside in multiple nodes
• In the event of a failure, the duplicates are used to reschedule the tasks
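As a rough illustration of how duplicates could support rescheduling, the sketch below parses four rows from the table on this slide and looks up a surviving replica. It assumes whitespace-separated columns and treats two entries with the same path as replicas of one partition; the real on-disk format and Twister's own lookup logic may differ, and `parse_partition_file` and `pick_replica` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Partition(NamedTuple):
    file_no: int
    node_ip: str
    daemon_no: int
    path: str


# Four rows copied from the slide; GD-4D-23.bin appears on both nodes.
SAMPLE = """\
4 156.56.104.96 2 /home/jaliya/data/mds/GD-4D-23.bin
5 156.56.104.96 2 /home/jaliya/data/mds/GD-4D-0.bin
8 156.56.104.97 4 /home/jaliya/data/mds/GD-4D-23.bin
9 156.56.104.97 4 /home/jaliya/data/mds/GD-4D-25.bin
"""


def parse_partition_file(text: str) -> list:
    """Parse 'file_no node_ip daemon_no path' lines (assumed layout)."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        file_no, node_ip, daemon_no, path = line.split(None, 3)
        entries.append(Partition(int(file_no), node_ip, int(daemon_no), path))
    return entries


def pick_replica(entries, path: str, failed_ips: set) -> Optional[Partition]:
    """Return the first copy of `path` that lives on a node that has not failed."""
    for entry in entries:
        if entry.path == path and entry.node_ip not in failed_ips:
            return entry
    return None


entries = parse_partition_file(SAMPLE)
failed = {"156.56.104.96"}
replica = pick_replica(entries, "/home/jaliya/data/mds/GD-4D-23.bin", failed)
lost = pick_replica(entries, "/home/jaliya/data/mds/GD-4D-0.bin", failed)
```

Here `replica` points at the copy on 156.56.104.97, while `lost` is `None` because GD-4D-0.bin has no duplicate in the sample rows.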

The use of pub/sub messaging
• Intermediate data transferred via the broker network
• Network of brokers used for load balancing
  – Different broker topologies
• Interspersed computation and data transfer minimizes large message load at the brokers
• Currently supports
  – NaradaBrokering
  – ActiveMQ
• E.g. 100 map tasks, 10 workers in 10 nodes: ~10 tasks are producing outputs at once
[Figure: map task queues -> Map workers -> Broker network -> Reduce()]

Scheduling
• Twister supports long running tasks
• Avoids unnecessary initializations in each iteration
• Tasks are scheduled statically
  – Supports task reuse
  – May lead to inefficient resource utilization
• Expects the user to randomize data distributions to minimize processing skew caused by skew in the data
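A small sketch of the two ideas on this slide: shuffling records before partitioning so that no partition inherits a skewed run of the input, and a fixed task-to-worker mapping that does not change between iterations. The round-robin rule and both function names are illustrative assumptions; the slide does not say how Twister maps tasks to workers.

```python
import random


def randomized_partitions(records, n_partitions, seed=0):
    """Shuffle the records, then deal them out so partitions have near-equal size."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_partitions] for i in range(n_partitions)]


def static_assignment(n_tasks, workers):
    """Fixed round-robin mapping of task index to worker (assumed policy)."""
    return {task: workers[task % len(workers)] for task in range(n_tasks)}


parts = randomized_partitions(range(100), 10)
assignment = static_assignment(100, [f"w{i}" for i in range(10)])
```

Because the mapping never changes, a task can keep its cached data on the same worker for every iteration; the price is that a slow or overloaded worker is not relieved by moving tasks elsewhere.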

Fault Tolerance
• Recover at iteration boundaries
• Does not handle individual task failures
• Assumptions:
  – Broker network is reliable
  – Main program & Twister Driver have no failures
• Any failure (hardware/daemons) results in the following fault handling sequence
  – Terminate currently running tasks (remove from memory)
  – Poll for currently available worker nodes (& daemons)
  – Configure map/reduce using static data (re-assign data partitions to tasks depending on the data locality)
  – Re-execute the failed iteration
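The recovery sequence above can be sketched as a retry loop around a whole iteration. Everything here is a simulation under stated assumptions: `WorkerFailure`, `run_iterations` and the callback names are hypothetical, and a failure is modelled as an exception raised by the iteration rather than by a real daemon.

```python
class WorkerFailure(Exception):
    """Raised by the toy iteration when a (simulated) worker dies."""


def run_iterations(iterate, configure, poll_workers, state, n_iterations, max_retries=3):
    workers = poll_workers()
    configure(workers)                      # static data assigned once up front
    log = []
    for i in range(n_iterations):
        for _attempt in range(max_retries + 1):
            try:
                new_state = iterate(i, state, workers)
                break
            except WorkerFailure:
                # Partial work of iteration i is dropped; `state` is untouched.
                log.append(("retry", i))
                workers = poll_workers()    # who is still available?
                configure(workers)          # re-assign partitions to survivors
        else:
            raise RuntimeError(f"iteration {i} failed {max_retries + 1} times")
        state = new_state                   # commit only at the iteration boundary
    return state, log


# Simulation: worker w1 dies during iteration 2.
alive = {"w0", "w1", "w2"}
configured = []


def poll_workers():
    return sorted(alive)


def configure(workers):
    configured.append(tuple(workers))


def iterate(i, state, workers):
    if i == 2 and "w1" in workers:
        alive.discard("w1")
        raise WorkerFailure("w1")
    return state + 1


final_state, log = run_iterations(iterate, configure, poll_workers, 0, 4)
```

The state is only replaced after an iteration completes, so a failed iteration is simply run again on the reconfigured set of workers, matching the "recover at iteration boundaries" policy.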

Twister API
1. configureMaps(PartitionFile partitionFile)
2. configureMaps(Value[] values)
3. configureReduce(Value[] values)
4. runMapReduce()
5. runMapReduce(KeyValue[] keyValues)
6. runMapReduceBCast(Value value)
7. map(MapOutputCollector collector, Key key, Value val)
8. reduce(ReduceOutputCollector collector, Key key, List<Value> values)
9. combine(Map<Key, Value> keyValues)

Twister Tutorial
• Complete Tutorial
  – http://salsahpc.indiana.edu/tutorial/twister-intro.html

Questions? <Break>

K-Means Clustering
[Figure: points in N-dimensional space, with the Euclidean distance to a cluster center marked]
• Point distributions in n-dimensional space
• Identify a given number of cluster centers
• Use Euclidean distance to associate points to cluster centers
• Refine the cluster centers iteratively

K-Means Clustering - MapReduce
[Figure: inside the Main Program's While(){ } loop, the n-th cluster centers are sent to the map tasks, each map task processes a data partition, and the reduce task produces the (n+1)-th cluster centers]
• Map tasks calculate the Euclidean distance from each point in their partition to each cluster center
• Map tasks assign points to cluster centers and sum the partial cluster center values
• Emit cluster center sums + number of points assigned
• Reduce task sums all the corresponding partial sums and calculates new cluster centers
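The steps above can be sketched in plain Python, independent of any MapReduce runtime. The function names are illustrative, the partitions are ordinary lists, and the choice to keep an empty cluster's old center is an assumption rather than something the slides specify.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance; it has the same nearest center as the true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(points, centers):
    """Assign each point to its nearest center; return per-center sums and counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def kmeans_reduce(partials, old_centers):
    """Add up the partial sums and counts from all map tasks; divide to get new centers."""
    dim = len(old_centers[0])
    new_centers = []
    for c, old in enumerate(old_centers):
        total = sum(counts[c] for _, counts in partials)
        if total == 0:
            new_centers.append(list(old))   # assumption: an empty cluster keeps its center
            continue
        new_centers.append(
            [sum(sums[c][d] for sums, _ in partials) / total for d in range(dim)]
        )
    return new_centers


def kmeans(partitions, centers, tol=1e-9, max_iter=100):
    """Main program loop: n-th centers in, (n+1)-th centers out, until they stop moving."""
    for _ in range(max_iter):
        partials = [kmeans_map(part, centers) for part in partitions]   # one map per partition
        new_centers = kmeans_reduce(partials, centers)
        shift = max(math.sqrt(sq_dist(a, b)) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers


partitions = [[(0, 0), (0, 2)], [(10, 10), (10, 12)]]
result = kmeans(partitions, [[0, 0], [10, 10]])
```

Each map call returns only one sum vector and one count per center, so the data sent to the reducer is proportional to the number of centers rather than the number of points.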

Code Walkthrough – Main Program

Code Walkthrough – map/reduce

Log in to FutureGrid Accounts
1. ssh trainXXX@bigdata.[india, sierra].futuregrid.org
2. [train200@s1 ~]$ qsub -I
3. Create 3 command line windows (shells)
   – ssh trainXXX@bigdata.[india, sierra].futuregrid.org
   – ssh sxx

Start NaradaBrokering
In the first command window (shell)
1. cd $NBHOME/bin
2. ./startbr.sh

Start Twister
In the second command window (shell)
• cd $TWISTER_HOME/bin
• ./start_twister.sh
• If you see something like below [console output not preserved]
  – Make sure you are logged into the reserved node using qsub -I
  – Edit twister.properties and change the following
    • daemon_port = 12500 //change this to something else

Run K-Means Clustering (1)
In the third command window (shell)
1. Go to the samples directory
   – cd $TWISTER_HOME/samples/kmeans/bin
2. Split data
   – The data is already partitioned and is in $TWISTER_HOME/samples/kmeans/input
3. Create a directory to hold these data
   – cd $TWISTER_HOME/bin
   – ./twister.sh mkdir kmeans
