Argobots and its Application to Charm++


  1. Argobots and its Application to Charm++
     Sangmin Seo, Assistant Computer Scientist
     Argonne National Laboratory, sseo@anl.gov
     April 19, 2016, Charm++ Workshop 2016

  2. Argo Concurrency Team
     • Argonne National Laboratory (ANL)
       – Pavan Balaji (co-lead)
       – Sangmin Seo
       – Abdelhalim Amer
       – Marc Snir
       – Pete Beckman (PI)
     • University of Illinois at Urbana-Champaign (UIUC)
       – Laxmikant Kale (co-lead)
       – Prateek Jindal
       – Jonathan Lifflander
     • University of Tennessee, Knoxville (UTK)
       – George Bosilca
       – Thomas Herault
       – Damien Genet
     • Pacific Northwest National Laboratory (PNNL)
       – Sriram Krishnamoorthy
     Past Team Members:
     • Cyril Bordage (UIUC)
     • Esteban Meneses (University of Pittsburgh)
     • Huiwei Lu (ANL)
     • Yanhua Sun (UIUC)

  3. Massive On-node Parallelism
     • The number of cores is increasing
     • Massive on-node parallelism is inevitable
     • Existing solutions do not effectively deal with such parallelism with respect to on-node threading/tasking systems or with respect to off-node communication in the presence of such tasks/threads
     • How to exploit?
     (Figure: core-level parallelism.)

  4. Shortcomings today? Pthreads (1/2)
     Nesting:

         int in[1000][1000], out[1000][1000];

         #pragma omp parallel for
         for (i = 0; i < 1000; i++) {
             petsc_voodoo(i);
         }

         petsc_voodoo(int x)
         {
             #pragma omp parallel for
             for (j = 0; j < 1000; j++)
                 out[x][j] = cosine(in[x][j]);
         }

     (Figure: Execution time for 36 threads in the outer loop. Series: GCC/pthreads, GCC/Argobots ULTs, GCC/Argobots tasks. x-axis: # OMP threads | Argobots ULTs/tasks (inner loop), 1 to 35. y-axis: time (s); lower is better.)
     Why is traditional OpenMP’s performance so bad? The compiler cannot analyze petsc_voodoo to know whether the function might ever block or yield, so it has to assume that it might. Each nested region therefore needs its own stack, and creating additional Pthreads for each nesting level is the simplest way to provide one.

  5. Shortcomings today? Pthreads (2/2)
     • Tasks of the application are mapped to a group of Pthreads (map & schedule)
     • How about these communications? Wait or context switch?
     • Work units intermixed with blocking calls (such as communication calls) can cause idle cores
     • Need lightweight mechanisms to switch tasks!
     (Figure: application tasks mapped and scheduled onto Pthreads running on four cores; legend: computation, communication.)

  6. Outline
     • Background
     • Argobots
     • Charm++ with Argobots
     • Other Programming Models
     • Summary

  7. User-Level Threads (ULTs)
     • What is a user-level thread (ULT)?
       – Provides thread semantics in user space
       – Execution model: cooperative timesharing
         • More than one ULT can be mapped to a single kernel thread
         • ULTs on the same OS thread do not execute concurrently
     • Where to use?
       – To better overlap computation and communication/IO
       – To exploit fine-grained task parallelism
     (Figure: timeline of ULT 1 and ULT 2 alternating through context switches; ULTs mapped onto kernel threads, kernel threads mapped onto cores.)
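The cooperative timesharing described on this slide can be sketched in a few lines of C. This is a minimal illustration built on the `ucontext` calls from glibc (`getcontext`, `makecontext`, `swapcontext`), assuming a Linux system; it is not the Argobots implementation, and the names `ult_yield`, `ult_body`, and `run_ults` are hypothetical. Two ULTs share one kernel thread and only switch when one of them yields.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define NUM_ULTS 2
#define STEPS 3
#define STACK_SIZE (64 * 1024)

static ucontext_t sched_ctx;            /* saved context of the scheduler loop */
static ucontext_t ult_ctx[NUM_ULTS];    /* saved context of each ULT */
static int finished[NUM_ULTS];          /* set by a ULT just before it returns */
static int trace[NUM_ULTS * STEPS];     /* ULT ids in the order their steps ran */
static int trace_len;

/* Cooperative yield: save this ULT's context and resume the scheduler. */
static void ult_yield(int id) {
    swapcontext(&ult_ctx[id], &sched_ctx);
}

/* Body of every ULT: record one step of "work", then give up the kernel thread. */
static void ult_body(int id) {
    for (int step = 0; step < STEPS; step++) {
        trace[trace_len++] = id;
        ult_yield(id);
    }
    finished[id] = 1;  /* returning switches to uc_link, i.e. the scheduler */
}

/* Run NUM_ULTS ULTs round-robin on the calling kernel thread.
 * Returns the total number of steps executed. */
int run_ults(void) {
    char *stacks[NUM_ULTS];
    int remaining = NUM_ULTS;

    trace_len = 0;
    for (int id = 0; id < NUM_ULTS; id++) {
        finished[id] = 0;
        stacks[id] = malloc(STACK_SIZE);      /* each ULT needs its own stack */
        assert(stacks[id] != NULL);
        getcontext(&ult_ctx[id]);
        ult_ctx[id].uc_stack.ss_sp = stacks[id];
        ult_ctx[id].uc_stack.ss_size = STACK_SIZE;
        ult_ctx[id].uc_link = &sched_ctx;     /* where to go when the body returns */
        makecontext(&ult_ctx[id], (void (*)(void))ult_body, 1, id);
    }

    while (remaining > 0) {
        for (int id = 0; id < NUM_ULTS; id++) {
            if (finished[id]) continue;
            swapcontext(&sched_ctx, &ult_ctx[id]);  /* run the ULT until it yields */
            if (finished[id]) remaining--;
        }
    }

    for (int id = 0; id < NUM_ULTS; id++) free(stacks[id]);
    return trace_len;
}
```

Calling `run_ults()` interleaves the two ULTs step by step, and because no second kernel thread is involved, the shared `trace` array needs no lock.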

  8. Pthreads vs. ULTs
     (Figure: average create & join time per thread (ns, log scale) vs. number of threads (1 to 2048), for pthread and ULT (Argobots).)
     • Average time for creating and joining one thread
       – pthread: 6.6 us - 21.2 us (avg. 34,953 cycles)
       – ULT (Argobots): 78 ns - 130 ns (avg. 191 cycles)
       – ULT is 64x - 233x faster than Pthread
     • How fast is ULT?
       – L1$ access: 1.112 ns, L2$ access: 5.648 ns, memory access: 18.4 ns
       – Context switch (2 processes): 1.64 us
     * measured using LMbench3
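As a rough cross-check of the numbers on this slide, the ratio of the two reported average cycle counts can be worked out directly. This assumes both averages were taken on the same machine at the same clock rate, which the slide does not state.

```latex
\[
\frac{\text{avg. cycles (pthread)}}{\text{avg. cycles (ULT)}}
  = \frac{34{,}953}{191} = 183
\]
```

The result, 183x, falls inside the 64x - 233x range quoted on the slide.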

  9. Growing Interest in ULTs
     • ULT and task libraries
       – Converse threads, Qthreads, MassiveThreads, Nanos++, Maestro, GnuPth, StackThreads/MP, Protothreads, Capriccio, StateThreads, TiNy-threads, etc.
     • OS support
       – Windows fibers, Solaris threads
     • Languages and programming models
       – Cilk, OpenMP task, C++11 task, C++17 coroutine proposal, Stackless Python, Go goroutines, etc.
     • Pros
       – Easy to use with a Pthreads-like interface
     • Cons
       – Runtime tries to do something smart (e.g., work-stealing)
       – This may conflict with the characteristics and demands of applications

  10. Argobots
     A low-level lightweight threading and tasking framework (http://collab.cels.anl.gov/display/argobots/)
     Overview
     • Separation of mechanisms and policies
     • Massive parallelism
       – Exec. Streams guarantee progress
       – Work Units execute to completion
         • User-level threads (ULTs) vs. Tasklet
     • Clearly defined memory semantics
       – Consistency domains
         • Provide Eventual Consistency
       – Software can manage consistency
     Argobots Innovations
     • Enabling technology, but not a policy maker
       – High-level languages/libraries such as OpenMP, Charm++ have more information about the user application (data locality, dependencies)
     • Explicit model:
       – Enables dynamism, but always managed by high-level systems
     (Figure: programming models (MPI, OpenMP, Charm++, PaRSEC, …) on top of Argobots; three execution streams, each on a processor core, with private pools and one shared pool holding work units. Legend: U = lightweight user-level thread, T = tasklet.)
     * Team members: Sangmin Seo, Abdelhalim Amer, Pavan Balaji (ANL), Laxmikant Kale, Prateek Jindal (UIUC)

  11. Argobots Execution Model
     • Execution Streams (ES)
       – Sequential instruction stream
         • Can consist of one or more work units
       – Mapped efficiently to a hardware resource
       – Implicitly managed progress semantics
         • One blocked ES cannot block other ESs
     • User-level Threads (ULTs)
       – Independent execution units in user space
       – Associated with an ES when running
       – Yieldable and migratable
       – Can make blocking calls
     • Tasklets
       – Atomic units of work
       – Asynchronous completion via notifications
       – Not yieldable, migratable before execution
       – Cannot make blocking calls
     • Scheduler
       – Stackable scheduler with pluggable strategies
     • Synchronization primitives
       – Mutex, condition variable, barrier, future
     • Events
       – Communication triggers
     (Figure: Argobots execution model, ES 1 through ES n, each with a scheduler and pools. Legend: pool, ULT, tasklet, event, scheduler.)

  12. Explicit Mapping ULT/Tasklet to ES
     • The user needs to map work units to ESs
     • No smart scheduling, no work-stealing unless the user wants to use them
     • Benefits
       – Allow locality optimization
         • Execute work units on the same ES
       – No expensive lock is needed between ULTs on the same ES
         • They do not run concurrently
         • A flag is enough
     (Figure: ULTs U0 to U5 and tasklets T1, T2 explicitly mapped to ES 1 and ES 2.)

  13. Stackable Scheduler with Pluggable Strategies
     • Associated with an ES
     • Can handle ULTs and tasklets
     • Can handle schedulers
       – Allows schedulers to be stacked hierarchically
     • Can handle asynchronous events
     • Users can write schedulers
       – Provides mechanisms, not policies
       – Replace the default scheduler
         • E.g., FIFO, LIFO, Priority Queue, etc.
     • ULT can explicitly yield to another ULT
       – Avoid scheduler overhead
     (Figure: yield() returns to the scheduler, which picks the next work unit; yield_to(target) switches directly to the target ULT.)
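The "mechanisms, not policies" idea on this slide can be illustrated with a small C sketch in which the scheduling loop is fixed and the strategy is a function pointer supplied by the user. This is not the Argobots scheduler API; every name here (`pool`, `pick_fn`, `pick_fifo`, `pick_lifo`, `sched_run`, `first_unit_run`) is hypothetical, and the pool is a plain fixed-size array to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_CAP 8

typedef void (*work_fn)(void *arg);
typedef struct { work_fn fn; void *arg; } work_unit;
typedef struct { work_unit units[POOL_CAP]; size_t count; } pool;

/* Pluggable strategy: given a non-empty pool, return the index to run next. */
typedef size_t (*pick_fn)(const pool *p);

static size_t pick_fifo(const pool *p) { (void)p; return 0; }           /* oldest first */
static size_t pick_lifo(const pool *p) { return p->count - 1; }         /* newest first */

/* Append a work unit; returns -1 if the pool is full. */
static int pool_push(pool *p, work_fn fn, void *arg) {
    if (p->count == POOL_CAP) return -1;
    p->units[p->count++] = (work_unit){ fn, arg };
    return 0;
}

/* Mechanism: run work units to completion in whatever order the strategy picks. */
static void sched_run(pool *p, pick_fn pick) {
    while (p->count > 0) {
        size_t i = pick(p);
        assert(i < p->count);
        work_unit u = p->units[i];
        for (size_t k = i + 1; k < p->count; k++)   /* close the gap */
            p->units[k - 1] = p->units[k];
        p->count--;
        u.fn(u.arg);
    }
}

static int order[POOL_CAP];      /* ids in the order they were executed */
static size_t order_len;

static void record_id(void *arg) { order[order_len++] = *(int *)arg; }

/* Push units 1, 2, 3 and report which one the given strategy ran first. */
int first_unit_run(pick_fn pick) {
    static int ids[3] = { 1, 2, 3 };
    pool p = { .count = 0 };
    order_len = 0;
    for (int i = 0; i < 3; i++) pool_push(&p, record_id, &ids[i]);
    sched_run(&p, pick);
    return order[0];
}
```

Swapping `pick_fifo` for `pick_lifo` changes the execution order without touching `sched_run`, which is the separation the slide describes; a priority queue would be one more `pick_fn`.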

  14. Performance: Create/Join Time
     • Ideal scalability
       – If the ULT runtime is perfectly scalable, the time should be the same regardless of the number of ESs
     (Figure: create/join time per ULT (cycles, log scale) vs. number of execution streams (workers), 1 to 72. Series: Qthreads, MassiveThreads (H), MassiveThreads (W), Argobots (ULT), Argobots (Tasklet).)

  15. Charm++ with Argobots
     Jonathan Lifflander, Prateek Jindal, Yanhua Sun, Laxmikant Kale
     University of Illinois at Urbana-Champaign (UIUC)

  16. Charm++ with Argobots
     • Goals
       – Test the completeness and performance of Argobots with the Charm++ programming model
       – Take advantage of Argobots features (tasklets, stackable schedulers, etc.) without modifying application codes
       – For Charm++ applications, interoperate with applications written in other models (MPI, Cilk, etc.)
     (Figure: software stack. Top: mini-apps and real-world applications, Charm++ model, intelligent runtime. Middle: Converse runtime (threading, messaging, scheduler) in the Charm++ infrastructure, Argobots (ULTs, tasks, scheduling, etc.) in Charm++ with Argobots. Bottom: communication libraries (MPI, uGNI, PAMI, Verbs).)
     * Team members: Laxmikant Kale, Jonathan Lifflander, Prateek Jindal (UIUC)

  17. Replacing the Converse Runtime with Argobots
     Converse runtime (threading, messaging, scheduler) is replaced by Argobots (ULTs, tasks, scheduling, etc.)
     • Converse
       – The active messaging layer in Charm++
     • Approaches
       – Each Charm++ Pthread inside a node (including the communication thread) is implemented as an Argobots ES
         • Create an ES for every Converse instance
       – A custom Argobots scheduler is created instead of using the Converse scheduler
       – Converse messages are enqueued into Argobots pools as tasklets
       – Converse threads (CthThread) are implemented on top of Argobots ULTs, with condition variables to implement suspend/resume
     • Only 180 lines of code had to be changed!
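The step "Converse messages are enqueued into Argobots pools as tasklets" can be pictured with the following C sketch. It uses neither the Converse API nor the Argobots API: the handler table, the `message` layout, and the functions `register_handler`, `enqueue_message`, `drain_pool`, and `demo_total` are all hypothetical stand-ins. The point it illustrates is that an active message (a handler index plus payload) wraps naturally into a run-to-completion work unit that a scheduler pops from a pool.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLERS 4
#define POOL_CAP 8

/* An active message: which handler to invoke, plus its data. */
typedef struct { int handler_idx; int payload; } message;
typedef void (*handler_fn)(message *msg);

/* A tasklet: a function run to completion, with no stack of its own. */
typedef struct { void (*fn)(void *); void *arg; } tasklet;

static handler_fn handlers[MAX_HANDLERS];
static int num_handlers;
static tasklet pool[POOL_CAP];
static size_t pool_head, pool_tail;

/* Register a handler and return the index that messages will carry. */
static int register_handler(handler_fn h) {
    assert(num_handlers < MAX_HANDLERS);
    handlers[num_handlers] = h;
    return num_handlers++;
}

/* Tasklet body: look up the message's handler and invoke it. */
static void deliver(void *arg) {
    message *msg = arg;
    handlers[msg->handler_idx](msg);
}

/* Wrap a message in a tasklet and append it to the pool. */
static void enqueue_message(message *msg) {
    assert(pool_tail < POOL_CAP);
    pool[pool_tail++] = (tasklet){ deliver, msg };
}

/* Scheduler loop: run every queued tasklet to completion, in FIFO order. */
static size_t drain_pool(void) {
    size_t ran = 0;
    while (pool_head < pool_tail) {
        tasklet t = pool[pool_head++];
        t.fn(t.arg);
        ran++;
    }
    return ran;
}

static int total;
static void add_payload(message *msg) { total += msg->payload; }

/* Send two messages to one handler and return the accumulated payload. */
int demo_total(void) {
    num_handlers = 0;
    pool_head = pool_tail = 0;
    total = 0;
    int idx = register_handler(add_payload);
    message a = { idx, 40 }, b = { idx, 2 };
    enqueue_message(&a);
    enqueue_message(&b);
    size_t ran = drain_pool();
    assert(ran == 2);
    return total;
}
```

Because a handler in this sketch never blocks or yields, a stackless tasklet is sufficient; code that must suspend and resume, like the Converse threads on the slide, would need a ULT instead.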
