Argobots and its Application to Charm++



SLIDE 1

Sangmin Seo

Assistant Computer Scientist Argonne National Laboratory sseo@anl.gov April 19, 2016

Argobots and its Application to Charm++

Charm++ Workshop 2016

SLIDE 2

Argo Concurrency Team

  • Argonne National Laboratory (ANL)
    – Pavan Balaji (co-lead)
    – Sangmin Seo
    – Abdelhalim Amer
    – Marc Snir
    – Pete Beckman (PI)
  • University of Illinois at Urbana-Champaign (UIUC)
    – Laxmikant Kale (co-lead)
    – Prateek Jindal
    – Jonathan Lifflander
  • University of Tennessee, Knoxville (UTK)
    – George Bosilca
    – Thomas Herault
    – Damien Genet
  • Pacific Northwest National Laboratory (PNNL)
    – Sriram Krishnamoorthy

Past Team Members:

  • Cyril Bordage (UIUC)
  • Esteban Meneses (University of Pittsburgh)
  • Huiwei Lu (ANL)
  • Yanhua Sun (UIUC)

SLIDE 3

Massive On-node Parallelism

  • The number of cores is increasing
  • Massive on-node parallelism is inevitable
  • Existing solutions do not deal effectively with such parallelism, either in on-node threading/tasking systems or in off-node communication in the presence of such tasks/threads
  • How to exploit?

[Figure: core-level parallelism]

SLIDE 4

Shortcomings today? Pthreads (1/2)

Nesting

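    int in[1000][1000], out[1000][1000];

    /* Outer parallel loop: each iteration calls a function the compiler cannot see into */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        petsc_voodoo(i);
    }

    /* The callee opens a second, nested parallel region */
    void petsc_voodoo(int x)
    {
        #pragma omp parallel for
        for (int j = 0; j < 1000; j++)
            out[x][j] = cosine(in[x][j]);
    }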

Execution time for 36 threads in the outer loop

Why is traditional OpenMP’s performance so bad? The compiler cannot analyze petsc_voodoo to determine whether it might ever block or yield, so it has to assume that it might. Each nested parallel region therefore needs its own stack, and creating additional Pthreads for each level of nesting is the simplest way to provide one.

[Plot: time (s), 0.0 to 3.5, versus number of OMP threads or Argobots ULTs/tasks in the inner loop, 1 to 35. Series: GCC/pthreads, GCC/Argobots ULTs, GCC/Argobots tasks. Lower is better.]

SLIDE 5

Shortcomings today? Pthreads (2/2)

Tasks of application mapped to a group of Pthreads

Need lightweight mechanisms to switch tasks!

[Figure: computation and communication work units are mapped and scheduled onto Pthreads, one per core (C).]

How about these communications? Wait or context switch?

Work units intermixed with blocking calls (such as communication calls) can cause idle cores.

SLIDE 6

Outline

  • Background
  • Argobots
  • Charm++ with Argobots
  • Other Programming Models
  • Summary

SLIDE 7

User-Level Threads (ULTs)

  • What is a user-level thread (ULT)?
    – Provides thread semantics in user space
    – Execution model: cooperative timesharing
      • More than one ULT can be mapped to a single kernel thread
      • ULTs on the same OS thread do not execute concurrently
  • Where to use?
    – To better overlap computation and communication/IO
    – To exploit fine-grained task parallelism
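The cooperative timesharing described above can be sketched in plain C. This is a minimal illustration, not Argobots code: it assumes a platform that provides POSIX `<ucontext.h>`, and every name in it (`run_ults`, `ult_yield`, `ult_body`, `trace`) is hypothetical. Two ULTs share one kernel thread, each has its own stack, and control changes hands only when a ULT yields.

```c
#include <assert.h>
#include <ucontext.h>

enum { NUM_ULTS = 2, STEPS = 3, STACK_SIZE = 64 * 1024 };

static ucontext_t sched_ctx;                 /* the scheduler loop on the kernel thread */
static ucontext_t ult_ctx[NUM_ULTS];         /* one saved context per ULT */
static char ult_stack[NUM_ULTS][STACK_SIZE]; /* one private stack per ULT */
static int ult_done[NUM_ULTS];
static int trace[NUM_ULTS * STEPS];          /* records which ULT ran at each step */
static int trace_len;

/* Cooperative yield: save this ULT's context and resume the scheduler. */
static void ult_yield(int id)
{
    swapcontext(&ult_ctx[id], &sched_ctx);
}

/* ULT body: record a step, then voluntarily give up the kernel thread. */
static void ult_body(int id)
{
    for (int step = 0; step < STEPS; step++) {
        assert(trace_len < NUM_ULTS * STEPS);
        trace[trace_len++] = id;
        ult_yield(id);
    }
    ult_done[id] = 1;
    /* Returning resumes uc_link, which is the scheduler context. */
}

/* Create the ULTs and run them round-robin until all finish.
   Returns the number of recorded steps. */
int run_ults(void)
{
    trace_len = 0;
    for (int i = 0; i < NUM_ULTS; i++) {
        ult_done[i] = 0;
        getcontext(&ult_ctx[i]);
        ult_ctx[i].uc_stack.ss_sp = ult_stack[i];
        ult_ctx[i].uc_stack.ss_size = STACK_SIZE;
        ult_ctx[i].uc_link = &sched_ctx;
        makecontext(&ult_ctx[i], (void (*)(void))ult_body, 1, i);
    }
    int remaining = NUM_ULTS;
    while (remaining > 0) {
        for (int i = 0; i < NUM_ULTS; i++) {
            if (ult_done[i])
                continue;
            swapcontext(&sched_ctx, &ult_ctx[i]); /* user-space context switch */
            if (ult_done[i])
                remaining--;
        }
    }
    return trace_len;
}
```

Calling `run_ults()` interleaves the two ULTs strictly at their yield points, so `trace` alternates between 0 and 1. No kernel thread is created or destroyed along the way, which is why such switches can be cheap.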

[Figure: timeline of ULT1 and ULT2 alternating on one kernel thread through context switches; ULTs are mapped onto kernel threads, which run on eight cores.]

SLIDE 8

Pthreads vs. ULTs

  • Average time for creating and joining one thread
    – pthread: 6.6 µs to 21.2 µs (avg. 34,953 cycles)
    – ULT (Argobots): 78 ns to 130 ns (avg. 191 cycles)
    – ULT is 64x to 233x faster than Pthread
  • How fast is ULT?
    – L1$ access: 1.112 ns, L2$ access: 5.648 ns, memory access: 18.4 ns
    – Context switch (2 processes): 1.64 µs

* measured using LMbench3

[Plot: average create-and-join time per thread (ns, log scale from 1 to 100,000) versus number of threads, 1 to 2048. Series: pthread, ULT (Argobots).]

SLIDE 9

Growing Interests in ULTs

  • ULT and task libraries
    – Converse threads, Qthreads, MassiveThreads, Nanos++, Maestro, GnuPth, StackThreads/MP, Protothreads, Capriccio, StateThreads, TiNy-threads, etc.
  • OS support
    – Windows fibers, Solaris threads
  • Languages and programming models
    – Cilk, OpenMP task, C++11 task, C++17 coroutine proposal, Stackless Python, Go coroutines, etc.
  • Pros
    – Easy to use with Pthreads-like interface
  • Cons
    – Runtime tries to do something smart (e.g., work-stealing)
    – This may conflict with the characteristics and demands of applications

SLIDE 10

Argobots

Overview

  • Separation of mechanisms and policies
  • Massive parallelism
    – Exec. Streams guarantee progress
    – Work Units execute to completion
      • User-level threads (ULTs) vs. Tasklet
  • Clearly defined memory semantics
    – Consistency domains
      • Provide Eventual Consistency
    – Software can manage consistency

Argobots Innovations

  • Enabling technology, but not a policy maker
    – High-level languages/libraries such as OpenMP, Charm++ have more information about the user application (data locality, dependencies)
  • Explicit model:
    – Enables dynamism, but always managed by high-level systems

[Figure: Argobots sits between the programming models (MPI, OpenMP, Charm++, PaRSEC, …) and the processor cores. Execution streams draw lightweight work units, user-level threads (U) and tasklets (T), from private pools and a shared pool.]

A low-level lightweight threading and tasking framework

(http://collab.cels.anl.gov/display/argobots/)

* Team members: Sangmin Seo, Abdelhalim Amer, Pavan Balaji (ANL), Laxmikant Kale, Prateek Jindal (UIUC)

SLIDE 11

Argobots Execution Model

  • Execution Streams (ES)
    – Sequential instruction stream
      • Can consist of one or more work units
    – Mapped efficiently to a hardware resource
    – Implicitly managed progress semantics
      • One blocked ES cannot block other ESs
  • User-level Threads (ULTs)
    – Independent execution units in user space
    – Associated with an ES when running
    – Yieldable and migratable
    – Can make blocking calls
  • Tasklets
    – Atomic units of work
    – Asynchronous completion via notifications
    – Not yieldable, migratable before execution
    – Cannot make blocking calls
  • Scheduler
    – Stackable scheduler with pluggable strategies
  • Synchronization primitives
    – Mutex, condition variable, barrier, future
  • Events
    – Communication triggers

[Figure: Argobots execution model. ES1 through ESn each run a scheduler (S) over a pool that holds ULTs (U), tasklets (T), events (E), and nested schedulers.]

SLIDE 12

Explicit Mapping ULT/Tasklet to ES

  • The user needs to map work units to ESs
  • No smart scheduling, no work-stealing unless the user wants to use it

[Figure: ES1 runs U0, U1, T1, T2, U2, U3; ES2 runs U4, U5.]

  • Benefits
    – Allow locality optimization
      • Execute work units on the same ES
    – No expensive lock is needed between ULTs on the same ES
      • They do not run concurrently
      • A flag is enough
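The explicit mapping above can be modeled in a few lines of C. This is a single-threaded sketch of the bookkeeping only, under the assumption that each ES owns one private FIFO pool; it does not use the Argobots API, and `map_to_es`, `run_es`, and `bump` are hypothetical names. The caller chooses the ES, nothing is ever stolen, and work units on one ES run one after another.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_ES = 2, POOL_CAP = 16 };

typedef void (*work_fn)(void *arg);

typedef struct {
    work_fn fn;
    void *arg;
} work_unit;

/* One private FIFO pool per execution stream (slots are not reused in this sketch). */
typedef struct {
    work_unit units[POOL_CAP];
    size_t head, tail; /* pop at head, push at tail */
} es_pool;

static es_pool pools[NUM_ES];

/* The caller, not the runtime, decides which ES gets the work unit.
   Returns 0 on success, -1 if that ES's pool is full. */
int map_to_es(int es, work_fn fn, void *arg)
{
    assert(es >= 0 && es < NUM_ES);
    es_pool *p = &pools[es];
    if (p->tail == POOL_CAP)
        return -1;
    p->units[p->tail++] = (work_unit){ fn, arg };
    return 0;
}

/* Run everything mapped to one ES in FIFO order; no work is taken from other ESs.
   Returns the number of work units executed. */
size_t run_es(int es)
{
    assert(es >= 0 && es < NUM_ES);
    es_pool *p = &pools[es];
    size_t ran = 0;
    while (p->head < p->tail) {
        work_unit u = p->units[p->head++];
        u.fn(u.arg);
        ran++;
    }
    return ran;
}

/* Example work unit: updates a plain int with no lock, which is safe only
   because work units mapped to the same ES never overlap in time. */
void bump(void *arg)
{
    ++*(int *)arg;
}
```

Mapping two `bump` units that share a counter to ES 0 and then calling `run_es(0)` leaves the counter at 2 without any atomic operation. That is the "a flag is enough" point in miniature; data shared across different ESs would still need real synchronization.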

SLIDE 13

Stackable Scheduler with Pluggable Strategies

  • Associated with an ES
  • Can handle ULTs and tasklets
  • Can handle schedulers
    – Allows schedulers to be stacked hierarchically
  • Can handle asynchronous events
  • Users can write schedulers
    – Provides mechanisms, not policies
    – Replace the default scheduler
      • E.g., FIFO, LIFO, Priority Queue, etc.
  • ULT can explicitly yield to another ULT
    – Avoid scheduler overhead
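One way to picture "mechanisms, not policies" is a scheduler loop whose only policy decision is a function pointer. The sketch below is a plain C model under that assumption, not the Argobots scheduler interface; `pick_fifo`, `pick_lifo`, `pick_priority`, and `run_scheduler` are hypothetical names. Stacked schedulers and events are left out.

```c
#include <assert.h>
#include <stddef.h>

enum { POOL_CAP = 16 };

typedef struct {
    int id;
    int priority; /* larger values run first under the priority strategy */
} work_unit;

typedef struct {
    work_unit units[POOL_CAP]; /* kept in insertion order */
    size_t count;
} pool;

/* A strategy returns the index of the next unit to run; the pool is non-empty. */
typedef size_t (*pick_fn)(const pool *p);

size_t pick_fifo(const pool *p) { (void)p; return 0; }
size_t pick_lifo(const pool *p) { return p->count - 1; }

size_t pick_priority(const pool *p)
{
    size_t best = 0;
    for (size_t i = 1; i < p->count; i++)
        if (p->units[i].priority > p->units[best].priority)
            best = i;
    return best;
}

/* Returns 0 on success, -1 if the pool is full. */
int pool_push(pool *p, work_unit u)
{
    if (p->count == POOL_CAP)
        return -1;
    p->units[p->count++] = u;
    return 0;
}

/* The loop is the fixed mechanism; `pick` is the pluggable policy.
   Writes unit ids to `order` in execution order and returns how many ran. */
size_t run_scheduler(pool *p, pick_fn pick, int *order)
{
    size_t ran = 0;
    while (p->count > 0) {
        size_t i = pick(p);
        assert(i < p->count);
        order[ran++] = p->units[i].id;
        /* Remove unit i while preserving the insertion order of the rest. */
        for (size_t j = i + 1; j < p->count; j++)
            p->units[j - 1] = p->units[j];
        p->count--;
    }
    return ran;
}
```

With units (id 1, priority 5), (id 2, priority 9), and (id 3, priority 1) pushed in that order, `pick_fifo` runs 1, 2, 3, `pick_lifo` runs 3, 2, 1, and `pick_priority` runs 2, 1, 3. Swapping the strategy changes the order without touching the loop.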

[Figure: a scheduler whose pool holds ULTs (U), tasklets (T), events (E), and stacked schedulers (S). Two arrows are labeled yield() and yield_to(target).]

SLIDE 14

Performance: Create/Join Time

  • Ideal scalability

– If the ULT runtime is perfectly scalable, the time should be the same regardless of the number of ESs

[Plot: create/join time per ULT (cycles, log scale from 10 to 10,000) versus number of execution streams (workers), 1 to 72. Series: Qthreads, MassiveThreads (H), MassiveThreads (W), Argobots (ULT), Argobots (Tasklet).]

SLIDE 15

Jonathan Lifflander, Prateek Jindal, Yanhua Sun, Laxmikant Kale
University of Illinois at Urbana-Champaign (UIUC)

Charm++ with Argobots

SLIDE 16

Charm++ with Argobots

  • Goals
    – Test the completeness and performance of Argobots with the Charm++ programming model
    – Take advantage of Argobots features (tasklets, stackable schedulers, etc.) without modifying application code
    – For Charm++ applications, interoperate with applications written in other models (MPI, Cilk, etc.)

[Figure: two software stacks. Charm++ infrastructure: mini-apps and real-world applications, Charm++ model, intelligent runtime, Converse runtime (threading, messaging, scheduler), communication libraries (MPI, uGNI, PAMI, Verbs). Charm++ with Argobots: Argobots (ULTs, tasks, scheduling, etc.) appears in the stack.]

* Team members: Laxmikant Kale, Jonathan Lifflander, Prateek Jindal (UIUC)

SLIDE 17

Replacing the Converse Runtime with Argobots

  • Converse
    – The active messaging layer in Charm++
  • Approaches
    – Each Charm++ Pthread inside a node (including the communication thread) is implemented as an Argobots ES
      • Create an ES for every Converse instance
    – A custom Argobots scheduler is created instead of using the Converse scheduler
    – Converse messages are enqueued into Argobots pools as tasklets
    – Converse threads (CthThread) are implemented on top of Argobots ULTs, with condition variables to implement suspend/resume
  • Only 180 lines of code had to be changed!
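The idea of enqueuing messages as tasklets and draining them with a custom scheduler can be sketched as follows. This is a hedged plain C model, not the actual Charm++ or Argobots code: the message layout, the handler table, and the names `register_handler`, `enqueue_message`, and `scheduler_drain` are all assumptions made for illustration. Each message is treated as a run-to-completion work unit that invokes the handler it names.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_HANDLERS = 8, POOL_CAP = 32 };

typedef void (*msg_handler)(void *payload);

/* Hypothetical message: the index of the handler to run, plus its payload. */
typedef struct {
    int handler;
    void *payload;
} message;

static msg_handler handlers[MAX_HANDLERS];
static int num_handlers;

static message pool[POOL_CAP]; /* ring buffer standing in for a work-unit pool */
static size_t pool_head, pool_len;

/* Register a handler; returns its index, or -1 if the table is full. */
int register_handler(msg_handler h)
{
    if (num_handlers == MAX_HANDLERS)
        return -1;
    handlers[num_handlers] = h;
    return num_handlers++;
}

/* Enqueue a message as a tasklet-like work unit. Returns 0, or -1 if the pool is full. */
int enqueue_message(int handler, void *payload)
{
    assert(handler >= 0 && handler < num_handlers);
    if (pool_len == POOL_CAP)
        return -1;
    pool[(pool_head + pool_len) % POOL_CAP] = (message){ handler, payload };
    pool_len++;
    return 0;
}

/* Custom scheduler loop: pop messages in FIFO order and run each handler
   to completion. Returns the number of messages processed. */
size_t scheduler_drain(void)
{
    size_t ran = 0;
    while (pool_len > 0) {
        message m = pool[pool_head];
        pool_head = (pool_head + 1) % POOL_CAP;
        pool_len--;
        handlers[m.handler](m.payload);
        ran++;
    }
    return ran;
}

/* Example handlers operating on an int payload. */
void add_ten(void *payload) { *(int *)payload += 10; }
void double_it(void *payload) { *(int *)payload *= 2; }
```

Because a handler never yields in this model, it needs no stack of its own, which is the property that makes tasklets a natural fit for message handlers. Threads that must suspend and resume are the case the slide assigns to ULTs instead.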

[Figure: the Converse runtime (threading, messaging, scheduler) is replaced by Argobots (ULTs, tasks, scheduling, etc.).]

SLIDE 18

LeanMD Performance: Runtime Comparison


  • Evaluation machine
    – 2 x Intel Xeon E5-2699 v3 (2.30GHz): 36 cores (72 threads)
  • LeanMD simulation
    – A total of 20 steps on a cell array of dimensions 7x7x7
    – 1-away XYZ configuration and 1000 atoms per cell

Achieved comparable performance although it is a very simple implementation

SLIDE 19

LeanMD Performance: Manual Implementation


  • Manual implementation of LeanMD using Argobots
    – Exploited both ULTs and tasklets
      • A ULT for managing a cell and a tasklet for managing the interaction between cells
    – Used futures for the waiting mechanism
    – Work stealing between pools
  • Better performance of our manual implementation implies that Charm++ with Argobots could be improved

SLIDE 20

Argobots: Interfaces for Shrink/Expand Events

[Figure: ES0 through ESn-1 each run a scheduler; events (E) arrive from the NRM through an Argobots socket. Programming model runtimes and applications (Charm++, CilkBot, PaRSEC, MPI+Argobots) supply callback functions.]

1. [Argobots] Connect to NRM using a socket on ABT_init()
2. [Runtimes/applications] Register callback functions for shrink/expand events
3. [Runtimes/applications] Deregister callback functions when they terminate
4. [Argobots] Disconnect from NRM on ABT_finalize()

Environment variables: ABT_ENV_POWER_EVENT_HOSTNAME, ABT_ENV_POWER_EVENT_PORT
Callback interfaces: ABT_event_add_callback(), ABT_event_del_callback()
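The register, ask, and deregister steps above can be illustrated with a small callback registry. The slide names ABT_event_add_callback() and ABT_event_del_callback() but does not show their signatures, so the sketch below deliberately uses its own hypothetical names and types (`add_callback`, `del_callback`, `ask_runtimes`, `event_cb`) rather than guessing at the real interface.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_CALLBACKS = 8 };

typedef enum { EVENT_SHRINK, EVENT_EXPAND } event_kind;

/* Hypothetical callback type: return 1 to approve the request, 0 to veto it. */
typedef int (*event_cb)(event_kind kind, int es_rank, void *user_arg);

typedef struct {
    int in_use;
    event_kind kind;
    event_cb cb;
    void *user_arg;
} cb_slot;

static cb_slot slots[MAX_CALLBACKS];

/* Register a callback for one event kind; returns a slot id, or -1 if full. */
int add_callback(event_kind kind, event_cb cb, void *user_arg)
{
    for (int i = 0; i < MAX_CALLBACKS; i++) {
        if (!slots[i].in_use) {
            slots[i] = (cb_slot){ 1, kind, cb, user_arg };
            return i;
        }
    }
    return -1;
}

/* Deregister a callback, e.g. when its runtime terminates. */
void del_callback(int id)
{
    assert(id >= 0 && id < MAX_CALLBACKS);
    slots[id].in_use = 0;
}

/* Ask every callback registered for this event kind; the request
   proceeds only if no registered runtime vetoes it. */
int ask_runtimes(event_kind kind, int es_rank)
{
    for (int i = 0; i < MAX_CALLBACKS; i++)
        if (slots[i].in_use && slots[i].kind == kind &&
            !slots[i].cb(kind, es_rank, slots[i].user_arg))
            return 0;
    return 1;
}

/* Example policy: veto any request that touches an ES rank below a
   floor passed through user_arg. */
int keep_low_ranks(event_kind kind, int es_rank, void *user_arg)
{
    (void)kind;
    return es_rank >= *(int *)user_arg;
}
```

A runtime that registers `keep_low_ranks` for `EVENT_SHRINK` with a floor of 2 blocks a request to stop ES 1 but allows one for ES 3; once it deregisters, nothing objects any more. The design keeps the decision with the programming model runtime, which knows what each ES is doing.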

SLIDE 21

Argobots: Shrink/Expand Event Handling

  • Shrinking

1. ES1 picks an event, which requests ES2 to be stopped
2. Ask the runtime using callbacks whether ES2 can be stopped
3. If OK, mark ES2 as needing to stop, so that it can be stopped when the scheduler on ES2 checks events
4. Notify the runtime that ES2 will be stopped
5. Create a ULT on ES0
6. When the scheduler on ES2 stops, ES2 is terminated
7. After ES2 is terminated, the ULT frees ES2 and sends a response to NRM

  • Any scheduler on any ES can check and handle events
  • ES0 cannot be stopped

[Figure: ES0, ES1, and ES2 each run a scheduler with pending events (E); arrows numbered 1 to 7 connect the ESs, the programming model runtime/application, and the NRM.]

SLIDE 22

Argobots: Shrink/Expand Event Handling

  • Expanding

1. ES0 picks an event, which requests the creation of ES2
2. Ask the runtime using callbacks whether it can create ES2
3. If OK, invoke a callback function so the runtime creates ES2
4. Create ES2
5. Send a response to NRM

[Figure: ES0, ES1, and ES2 each run a scheduler with pending events (E); arrows numbered 1 to 5 connect the ESs, the programming model runtime/application, and the NRM.]

SLIDE 23

Charm++ with Argobots: Implementation

  • Shrink/Expand Implementation
    – Charm++ maintains a set of pools for each scheduler, mapped to an ES
    – Ranks in Charm++ are virtualized by saving the mapping of pool to an ES
    – When an ES is removed, the associated pools are put into a global list
      • To maintain correctness in Charm++, the rank of any task/thread in the global list is derived from the pool (ranks are virtualized)
    – Shrink
      • Other ESs execute work units from orphaned pools with some added synchronization
    – Expand
      • A new ES is created and takes over a set of orphaned pools
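The pool bookkeeping described above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the Charm++ implementation: one pool per initial ES, a rank equal to the pool index, and hypothetical names throughout (`shrink_es`, `expand_es`, `virtual_rank`, `can_execute`). The added synchronization needed when several ESs serve an orphaned pool is omitted.

```c
#include <assert.h>

enum { NUM_POOLS = 4, MAX_ES = 4, ORPHANED = -1 };

/* pool_owner[p] is the ES currently serving pool p, or ORPHANED.
   The rank seen by work in pool p is derived from the pool, not from the ES. */
static int pool_owner[NUM_POOLS];
static int es_alive[MAX_ES];

/* Initial state: pool p is served by ES p and every ES is running. */
void init_mapping(void)
{
    for (int p = 0; p < NUM_POOLS; p++)
        pool_owner[p] = p;
    for (int e = 0; e < MAX_ES; e++)
        es_alive[e] = 1;
}

/* Virtualized rank: unchanged no matter which ES ends up running the pool. */
int virtual_rank(int pool)
{
    assert(pool >= 0 && pool < NUM_POOLS);
    return pool;
}

/* Shrink: stop an ES and orphan its pools.
   Returns the number of pools orphaned, or -1 if the request is refused. */
int shrink_es(int es)
{
    assert(es >= 0 && es < MAX_ES);
    if (es == 0 || !es_alive[es]) /* ES0 cannot be stopped */
        return -1;
    int orphaned = 0;
    for (int p = 0; p < NUM_POOLS; p++) {
        if (pool_owner[p] == es) {
            pool_owner[p] = ORPHANED;
            orphaned++;
        }
    }
    es_alive[es] = 0;
    return orphaned;
}

/* Expand: start an ES and let it take over all orphaned pools.
   Returns the number of pools adopted, or -1 if the ES is already running. */
int expand_es(int es)
{
    assert(es >= 0 && es < MAX_ES);
    if (es_alive[es])
        return -1;
    es_alive[es] = 1;
    int adopted = 0;
    for (int p = 0; p < NUM_POOLS; p++) {
        if (pool_owner[p] == ORPHANED) {
            pool_owner[p] = es;
            adopted++;
        }
    }
    return adopted;
}

/* A running ES may execute from its own pools and from any orphaned pool. */
int can_execute(int es, int pool)
{
    assert(es >= 0 && es < MAX_ES && pool >= 0 && pool < NUM_POOLS);
    return es_alive[es] &&
           (pool_owner[pool] == es || pool_owner[pool] == ORPHANED);
}
```

After `shrink_es(2)`, pool 2 is orphaned and any surviving ES may drain it, yet `virtual_rank(2)` still reports 2, so code that depends on its rank keeps working. A later `expand_es(2)` hands the pool back to a single owner.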

SLIDE 24

LeanMD Results of Shrinking/Expanding ESs

SLIDE 25

Argobots Ecosystem

[Figure: the Argobots ecosystem. The Argobots execution model (ES1 through ESn, each with a scheduler over ULTs, tasklets, and events) is surrounded by its users: MPI+Argobots (ULTs and ESs over MPI), Charm++ (applications, Charm++, Argobots runtime, communication libraries), CilkBots (a Cilk "worker" as an Argobots ES running an RWS ULT and fused ULTs 1 to N), PaRSEC (a task graph of PO, GE, TR, and SY tasks), OpenMP, Mercury RPC (RPC procs at origin and target), and OmpSs. External connections: GridFTP, Kokkos, RAJA, ROSE, TASCEL, XMP, etc.]

SLIDE 26

Summary

  • Massive on-node parallelism is inevitable
    – Need runtime systems utilizing such parallelism
  • Argobots
    – A lightweight low-level threading/tasking framework
    – Provides efficient mechanisms, not policies, to users (library developers or compilers)
      • They can build their own solutions
  • Charm++ with Argobots
    – Implemented by replacing the Converse runtime with Argobots
    – Achieved comparable performance on LeanMD
    – Incorporated the shrinking/expanding of ESs in Argobots in order to respond to external events (e.g., power capping)

SLIDE 27

Try Argobots

  • git repository
    – http://git.mcs.anl.gov/argo/argobots.git/
  • Documentation
    – Wiki
      • https://collab.cels.anl.gov/display/ARGOBOTS/
    – Doxygen
      • http://www.mcs.anl.gov/~sseo/public/argobots/

SLIDE 28

Funding Acknowledgments

Funding Grant Providers Infrastructure Providers

SLIDE 29

Programming Models and Runtime Systems Group

Group Lead

– Pavan Balaji (computer scientist and group lead)

Current Staff Members

– Abdelhalim Amer (postdoc) – Yanfei Guo (postdoc) – Rob Latham (developer) – Lena Oden (postdoc) – Ken Raffenetti (developer) – Sangmin Seo (assistant computer scientist) – Min Si (postdoc) – Min Tian (visiting scholar)

Past Staff Members

– Antonio Pena (postdoc) – Wesley Bland (postdoc) – Darius T. Buntinas (developer) – James S. Dinan (postdoc) – David J. Goodell (developer) – Huiwei Lu (postdoc) – Yanjie Wei (visiting scholar) – Yuqing Xiong (visiting scholar) – Jian Yu (visiting scholar) – Junchao Zhang (postdoc) – Xiaomin Zhu (visiting scholar)

Current and Recent Students

– Ashwin Aji (Ph.D.) – Abdelhalim Amer (Ph.D.) – Md. Humayun Arafat (Ph.D.) – Alex Brooks (Ph.D.) – Adrian Castello (Ph.D.) – Dazhao Cheng (Ph.D.) – James S. Dinan (Ph.D.) – Piotr Fidkowski (Ph.D.) – Priyanka Ghosh (Ph.D.) – Sayan Ghosh (Ph.D.) – Ralf Gunter (B.S.) – Jichi Guo (Ph.D.) – Yanfei Guo (Ph.D.) – Marius Horga (M.S.) – John Jenkins (Ph.D.) – Feng Ji (Ph.D.) – Ping Lai (Ph.D.) – Palden Lama (Ph.D.) – Yan Li (Ph.D.) – Huiwei Lu (Ph.D.) – Jintao Meng (Ph.D.) – Ganesh Narayanaswamy (M.S.) – Qingpeng Niu (Ph.D.) – Ziaul Haque Olive (Ph.D.) – David Ozog (Ph.D.) – Renbo Pang (Ph.D.) – Sreeram Potluri (Ph.D.) – Li Rao (M.S.) – Gopal Santhanaraman (Ph.D.) – Thomas Scogland (Ph.D.) – Min Si (Ph.D.) – Brian Skjerven (Ph.D.) – Rajesh Sudarsan (Ph.D.) – Lukasz Wesolowski (Ph.D.) – Shucai Xiao (Ph.D.) – Chaoran Yang (Ph.D.) – Boyu Zhang (Ph.D.) – Xiuxia Zhang (Ph.D.) – Xin Zhao (Ph.D.)

Advisory Board

– Pete Beckman (senior scientist) – Rusty Lusk (retired, STA) – Marc Snir (division director) – Rajeev Thakur (deputy director)

SLIDE 30

Q&A

  • Thank you for your attention!

Questions?
