
SLIDE 1

Sangmin Seo

Assistant Computer Scientist
Argonne National Laboratory
sseo@anl.gov
April 19, 2016

Argobots and its Application to Charm++

Charm++ Workshop 2016

SLIDE 2

Argo Concurrency Team

Argonne National Laboratory (ANL)

– Pavan Balaji (co-lead)
– Sangmin Seo
– Abdelhalim Amer
– Marc Snir
– Pete Beckman (PI)

University of Illinois at Urbana-Champaign (UIUC)

– Laxmikant Kale (co-lead)
– Prateek Jindal
– Jonathan Lifflander

University of Tennessee, Knoxville (UTK)

– George Bosilca
– Thomas Herault
– Damien Genet

Pacific Northwest National Laboratory (PNNL)

– Sriram Krishnamoorthy

Past Team Members:

– Cyril Bordage (UIUC)
– Esteban Meneses (University of Pittsburgh)
– Huiwei Lu (ANL)
– Yanhua Sun (UIUC)


SLIDE 3

Massive On-node Parallelism

The number of cores is increasing
Massive on-node parallelism is inevitable
Existing solutions do not effectively deal with such parallelism, with respect to either on-node threading/tasking systems or off-node communication in the presence of such tasks/threads
How to exploit?

[Figure: core-level parallelism]


SLIDE 4

Shortcomings today? Pthreads (1/2)

Nesting

int in[1000][1000], out[1000][1000];

#pragma omp parallel for
for (int i = 0; i < 1000; i++) {
    petsc_voodoo(i);
}

void petsc_voodoo(int x)
{
    /* nested parallel region, opened by every outer iteration */
    #pragma omp parallel for
    for (int j = 0; j < 1000; j++)
        out[x][j] = cosine(in[x][j]);
}

Execution time for 36 threads in the outer loop

Why is traditional OpenMP’s performance so bad? The compiler cannot analyze petsc_voodoo to know whether the function might ever block or yield, so it has to assume that it might. Therefore a stack is needed to facilitate it. Creating additional Pthreads for each nesting is the simplest way to achieve this.

[Figure: time (s) vs. number of OMP threads or Argobots ULTs/tasks in the inner loop (1 to 35), for GCC/pthreads, GCC/Argobots ULTs, and GCC/Argobots tasks; lower is better]


SLIDE 5

Shortcomings today? Pthreads (2/2)

Tasks of application mapped to a group of Pthreads

Need lightweight mechanisms to switch tasks!

[Figure: computation and communication tasks are mapped and scheduled onto Pthreads running on cores (C)]

How about these communications? Wait or context switch?
Work units intermixed with blocking calls (such as communication calls) can cause idle cores


SLIDE 6

Outline

Background
Argobots
Charm++ with Argobots
Other Programming Models
Summary


SLIDE 7

User-Level Threads (ULTs)

What is a user-level thread (ULT)?

– Provides thread semantics in user space
– Execution model: cooperative timesharing

More than one ULT can be mapped to a single kernel thread
ULTs on the same OS thread do not execute concurrently
Where to use?

– To better overlap computation and communication/IO
– To exploit fine-grained task parallelism
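
Cooperative timesharing of several ULTs on one kernel thread can be sketched with the POSIX ucontext calls (getcontext, makecontext, swapcontext). This is an illustrative model only and says nothing about how Argobots implements its ULTs; the names ult_yield, ult_body, and run_two_ults are hypothetical, and a system that provides ucontext (e.g., Linux with glibc) is assumed.

```c
#include <assert.h>
#include <string.h>
#include <ucontext.h>

#define NUM_ULTS   2
#define STACK_SIZE (64 * 1024)
#define TRACE_CAP  32

static ucontext_t sched_ctx;              /* context of the scheduler loop */
static ucontext_t ult_ctx[NUM_ULTS];      /* one saved context per ULT */
static char stacks[NUM_ULTS][STACK_SIZE]; /* each ULT owns a private stack */
static int done[NUM_ULTS];
static char trace[TRACE_CAP];             /* IDs of ULTs in execution order */
static int trace_len;

static void record(int id) {
    assert(trace_len < TRACE_CAP - 1);
    trace[trace_len++] = (char)('0' + id);
    trace[trace_len] = '\0';
}

/* Cooperative yield: save this ULT's context and resume the scheduler. */
static void ult_yield(int id) {
    swapcontext(&ult_ctx[id], &sched_ctx);
}

/* Each ULT performs two steps and yields after each one. */
static void ult_body(int id) {
    for (int step = 0; step < 2; step++) {
        record(id);
        ult_yield(id);
    }
    done[id] = 1; /* returning resumes uc_link, i.e., the scheduler */
}

/* Runs both ULTs round-robin on the calling kernel thread. */
const char *run_two_ults(void) {
    trace_len = 0;
    trace[0] = '\0';
    for (int id = 0; id < NUM_ULTS; id++) {
        done[id] = 0;
        getcontext(&ult_ctx[id]);
        ult_ctx[id].uc_stack.ss_sp = stacks[id];
        ult_ctx[id].uc_stack.ss_size = STACK_SIZE;
        ult_ctx[id].uc_link = &sched_ctx;
        makecontext(&ult_ctx[id], (void (*)(void))ult_body, 1, id);
    }
    int remaining = NUM_ULTS;
    while (remaining > 0) {
        for (int id = 0; id < NUM_ULTS; id++) {
            if (done[id]) continue;
            swapcontext(&sched_ctx, &ult_ctx[id]); /* context switch to ULT */
            if (done[id]) remaining--;
        }
    }
    return trace;
}
```

The two ULTs never run at the same time: control moves only at the explicit yield points, so the recorded trace alternates between them.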

[Figure: timeline in which ULT1 and ULT2 alternate through context switches; ULTs are mapped onto kernel threads, which run on cores]

SLIDE 8

Pthreads vs. ULTs

Average time for creating and joining one thread
pthread: 6.6us - 21.2us (avg. 34,953 cycles)
ULT (Argobots): 78ns - 130ns (avg. 191 cycles)
ULT is 64x - 233x faster than Pthread

– How fast is ULT?

L1$ access: 1.112ns, L2$ access: 5.648ns, memory access: 18.4ns
Context switch (2 processes): 1.64us

[Figure: average create-and-join time per thread (ns, log scale from 1 to 100000) vs. number of threads (1 to 2048), for pthread and ULT (Argobots)]

* measured using LMbench3

SLIDE 9

Growing Interests in ULTs

ULT and task libraries

– Converse threads, Qthreads, MassiveThreads, Nanos++, Maestro, GnuPth, StackThreads/MP, Protothreads, Capriccio, StateThreads, TiNy-threads, etc.

OS support

– Windows fibers, Solaris threads

Language and programming models

– Cilk, OpenMP task, C++11 task, C++17 coroutine proposal, Stackless Python, Go coroutines, etc.

Pros

– Easy to use with Pthreads-like interface

Cons

– Runtime tries to do something smart (e.g., work-stealing)
– This may conflict with the characteristics and demands of applications


SLIDE 10

Argobots

Overview

Separation of mechanisms and policies
Massive parallelism

– Exec. Streams guarantee progress
– Work Units execute to completion

User-level threads (ULTs) vs. Tasklet
Clearly defined memory semantics

– Consistency domains

Provide Eventual Consistency

– Software can manage consistency

Argobots Innovations

Enabling technology, but not a policy maker

– High-level languages/libraries such as OpenMP, Charm++ have more information about the user application (data locality, dependencies)

Explicit model:

– Enables dynamism, but always managed by high-level systems

[Figure: programming models (MPI, OpenMP, Charm++, PaRSEC, …) sit on top of Argobots; each execution stream runs on a processor core and takes lightweight work units, user-level threads (U) and tasklets (T), from private and shared pools]

A low-level lightweight threading and tasking framework

(http://collab.cels.anl.gov/display/argobots/)

* Team members: Sangmin Seo, Abdelhalim Amer, Pavan Balaji (ANL), Laxmikant Kale, Prateek Jindal (UIUC)


SLIDE 11

Argobots Execution Model

Execution Streams (ES)

– Sequential instruction stream

Can consist of one or more work units

– Mapped efficiently to a hardware resource
– Implicitly managed progress semantics

One blocked ES cannot block other ESs
User-level Threads (ULTs)

– Independent execution units in user space
– Associated with an ES when running
– Yieldable and migratable
– Can make blocking calls

Tasklets

– Atomic units of work
– Asynchronous completion via notifications
– Not yieldable, migratable before execution
– Cannot make blocking calls

[Figure: Argobots execution model with execution streams ES1 ... ESn; legend: scheduler (S), pool, ULT (U), tasklet (T), event (E)]

Scheduler

– Stackable scheduler with pluggable strategies

Synchronization primitives

– Mutex, condition variable, barrier, future

Events

– Communication triggers


SLIDE 12

Explicit Mapping ULT/Tasklet to ES

The user needs to map work units to ESs
No smart scheduling, no work-stealing unless the user wants to use it

ES1: U0 U1 T1 T2 U2 U3
ES2: U4 U5

Benefits

– Allow locality optimization

Execute work units on the same ES

– No expensive lock is needed between ULTs on the same ES

They do not run concurrently
A flag is enough


SLIDE 13

Stackable Scheduler with Pluggable Strategies

Associated with an ES
Can handle ULTs and tasklets
Can handle schedulers

– Allows schedulers to be stacked hierarchically

Can handle asynchronous events
Users can write schedulers

– Provides mechanisms, not policies
– Replace the default scheduler

E.g., FIFO, LIFO, Priority Queue, etc.
ULT can explicitly yield to another ULT

– Avoid scheduler overhead

[Figure: a scheduler whose pool holds ULTs (U), tasklets (T), events (E), and stacked schedulers (S); ULTs switch with yield() or yield_to(target)]
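
The split between mechanism (the loop that drains a pool) and policy (the strategy that picks the next work unit) can be sketched in plain C. This is an illustrative model only, not the Argobots scheduler interface; pool_t, pick_fn, run_scheduler, and the three pick_* strategies are hypothetical names, and "running" a work unit is reduced to recording its ID.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_CAP 16

typedef struct {
    int id;
    int priority; /* larger value = more urgent; only pick_priority uses it */
} work_unit_t;

typedef struct {
    work_unit_t units[POOL_CAP]; /* kept in arrival order */
    size_t count;
} pool_t;

/* Policy: given a non-empty pool, return the index of the next unit. */
typedef size_t (*pick_fn)(const pool_t *pool);

size_t pick_fifo(const pool_t *pool) { (void)pool; return 0; }

size_t pick_lifo(const pool_t *pool) { return pool->count - 1; }

size_t pick_priority(const pool_t *pool) {
    size_t best = 0;
    for (size_t i = 1; i < pool->count; i++)
        if (pool->units[i].priority > pool->units[best].priority)
            best = i;
    return best;
}

void pool_push(pool_t *pool, int id, int priority) {
    assert(pool->count < POOL_CAP);
    pool->units[pool->count++] = (work_unit_t){ id, priority };
}

/* Mechanism: drain the pool, delegating every choice to the policy.
 * Writes the IDs in execution order and returns how many units ran. */
size_t run_scheduler(pool_t *pool, pick_fn pick, int *order_out) {
    size_t n = 0;
    while (pool->count > 0) {
        size_t i = pick(pool);
        assert(i < pool->count);
        order_out[n++] = pool->units[i].id; /* stands in for execution */
        for (size_t j = i + 1; j < pool->count; j++)
            pool->units[j - 1] = pool->units[j]; /* preserve arrival order */
        pool->count--;
    }
    return n;
}
```

Swapping pick_fifo for pick_lifo or pick_priority changes the execution order without touching run_scheduler, which is the sense in which the strategy is pluggable.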


SLIDE 14

Performance: Create/Join Time

Ideal scalability

– If the ULT runtime is perfectly scalable, the time should be the same regardless of the number of ESs

[Figure: create/join time per ULT (cycles, log scale from 10 to 10000) vs. number of execution streams (workers, 1 to 72), for Qthreads, MassiveThreads (H), MassiveThreads (W), Argobots (ULT), and Argobots (Tasklet)]


SLIDE 15

Charm++ with Argobots

Jonathan Lifflander, Prateek Jindal, Yanhua Sun, Laxmikant Kale
University of Illinois at Urbana-Champaign (UIUC)

SLIDE 16

Charm++ with Argobots

Goals

– Test the completeness and performance of Argobots with the Charm++ programming model
– Take advantage of Argobots features (tasklets, stackable schedulers, etc.) without modifying application codes
– Let Charm++ applications interoperate with applications written in other models (MPI, Cilk, etc.)

[Figure: software stacks labeled "Charm++ infrastructure" and "Charm++ with Argobots", with layers: mini-apps and real-world applications; Charm++ model; Converse runtime (threading, messaging, scheduler); communication libraries (MPI, uGNI, PAMI, Verbs); intelligent runtime; Argobots (ULTs, tasks, scheduling, etc.)]

* Team members: Laxmikant Kale, Jonathan Lifflander, Prateek Jindal (UIUC)

SLIDE 17

Replacing the Converse Runtime with Argobots

Converse

– The active messaging layer in Charm++

Approaches

– Each Charm++ Pthread inside a node (including the communication thread) is implemented as an Argobots ES

Create an ES for every Converse instance

– A custom Argobots scheduler is created instead of using the Converse scheduler
– Converse messages are enqueued into Argobots pools as tasklets
– Converse threads (CthThread) are implemented on top of Argobots ULTs, with condition variables to implement suspend/resume

Only 180 lines of code had to be changed!

[Figure: Converse runtime (threading, messaging, scheduler) and Argobots (ULTs, tasks, scheduling, etc.)]

SLIDE 18

LeanMD Performance: Runtime Comparison


Evaluation machine

– 2 x Intel Xeon E5-2699 v3 (2.30GHz): 36 cores (72 threads)

LeanMD simulation

– A total of 20 steps on a cell array of dimensions 7x7x7
– 1-away XYZ configuration and 1000 atoms per cell

Achieved comparable performance although it is a very simple implementation

SLIDE 19

LeanMD Performance: Manual Implementation


Manual implementation of LeanMD using Argobots

– Exploited both ULTs and tasklets

A ULT for managing a cell and a tasklet for managing the interaction between cells

– Used futures for the waiting mechanism
– Work stealing between pools

Better performance of our manual implementation implies that Charm++ with Argobots could be improved

SLIDE 20

Argobots: Interfaces for Shrink/Expand Events

[Figure: execution streams ES0, ES1, ES2, ..., ESn-1, each with a scheduler, receive events (E) from the NRM through the Argobots socket; programming model runtimes and applications (Charm++, CilkBot, PaRSEC, MPI+Argobots) provide callback functions]

1. [Argobots] Connect to NRM using a socket on ABT_init()
2. [Runtimes/applications] Register callback functions for shrink/expand events
3. [Runtimes/applications] Deregister callback functions when they terminate
4. [Argobots] Disconnect from NRM on ABT_finalize()

Environment variables: ABT_ENV_POWER_EVENT_HOSTNAME, ABT_ENV_POWER_EVENT_PORT
Functions: ABT_event_add_callback(), ABT_event_del_callback()

SLIDE 21

Argobots: Shrink/Expand Event Handling

Shrinking

[Figure: ES0, ES1, and ES2, each with a scheduler and events (E); the NRM; and the programming model runtime/application. The numbers 1 to 7 mark the steps below.]

1. ES1 picks an event, which requests ES2 to be stopped
2. Ask the runtime using callbacks whether ES2 can be stopped
3. If OK, mark ES2 as needing to stop, so that it can be stopped when the scheduler on ES2 checks events
4. Notify the runtime that ES2 will be stopped
5. Create a ULT on ES0
6. When the scheduler on ES2 stops, ES2 is terminated
7. After ES2 is terminated, the ULT frees ES2 and sends a response to NRM

Any scheduler on any ES can check and handle events

ES0 cannot be stopped
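
The ask-then-notify part of the shrinking flow (steps 2 to 4) can be sketched with a small callback registry in plain C. This is a hedged model, not the Argobots event API: the signatures of ABT_event_add_callback() and ABT_event_del_callback() are not given on the slides, so reg_add_callback, reg_del_callback, handle_shrink_request, and the demo_* runtime callbacks are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CALLBACKS 8

typedef enum { EVENT_SHRINK, EVENT_EXPAND } event_kind_t;

/* "ask": may the change happen?  "act": the change has been decided. */
typedef bool (*ask_fn)(event_kind_t kind, int es_rank, void *user);
typedef void (*act_fn)(event_kind_t kind, int es_rank, void *user);

typedef struct {
    ask_fn ask;
    act_fn act;
    void *user;
    bool used;
} callback_t;

static callback_t registry[MAX_CALLBACKS];

/* Registers a callback pair; returns its slot ID, or -1 if full. */
int reg_add_callback(ask_fn ask, act_fn act, void *user) {
    assert(ask != NULL && act != NULL);
    for (int id = 0; id < MAX_CALLBACKS; id++) {
        if (!registry[id].used) {
            registry[id] = (callback_t){ ask, act, user, true };
            return id;
        }
    }
    return -1;
}

void reg_del_callback(int id) {
    assert(id >= 0 && id < MAX_CALLBACKS);
    registry[id].used = false;
}

/* Returns true if the ES was marked to stop. */
bool handle_shrink_request(int es_rank, bool *stop_flags, int num_es) {
    assert(es_rank >= 0 && es_rank < num_es);
    if (es_rank == 0) return false;          /* ES0 cannot be stopped */
    for (int id = 0; id < MAX_CALLBACKS; id++)          /* step 2: ask */
        if (registry[id].used &&
            !registry[id].ask(EVENT_SHRINK, es_rank, registry[id].user))
            return false;
    stop_flags[es_rank] = true;  /* step 3: that ES's scheduler sees this later */
    for (int id = 0; id < MAX_CALLBACKS; id++)          /* step 4: notify */
        if (registry[id].used)
            registry[id].act(EVENT_SHRINK, es_rank, registry[id].user);
    return true;
}

/* Hypothetical runtime side: refuse to go below a minimum number of ESs. */
typedef struct { int active_es; int min_es; int notified; } demo_runtime_t;

bool demo_ask(event_kind_t kind, int es_rank, void *user) {
    (void)es_rank;
    const demo_runtime_t *rt = user;
    return kind != EVENT_SHRINK || rt->active_es > rt->min_es;
}

void demo_act(event_kind_t kind, int es_rank, void *user) {
    (void)es_rank;
    demo_runtime_t *rt = user;
    rt->active_es += (kind == EVENT_EXPAND) ? 1 : -1;
    rt->notified++;
}
```

In this model the runtime keeps the final say: a shrink request is dropped as soon as any registered callback refuses, and the stop flag is only set after every callback has agreed.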


SLIDE 22

Argobots: Shrink/Expand Event Handling

Expanding

[Figure: ES0, ES1, and ES2, each with a scheduler and events (E); the NRM; and the programming model runtime/application. The numbers 1 to 5 mark the steps below.]

1. ES0 picks an event, which requests that ES2 be created
2. Ask the runtime using callbacks whether it can create ES2
3. If OK, invoke a callback function so the runtime creates ES2
4. Create ES2
5. Send a response to NRM

SLIDE 23

Charm++ with Argobots: Implementation

Shrink/Expand Implementation

– Charm++ maintains a set of pools for each scheduler, mapped to an ES
– Ranks in Charm++ are virtualized by saving the mapping of pool to an ES
– When an ES is removed, the associated pools are put into a global list

To maintain correctness in Charm++, the rank of any task/thread in the global list is derived from the pool (ranks are virtualized)

– Shrink

Other ESs execute work units from orphaned pools with some added synchronization

– Expand

A new ES is created and takes over a set of orphaned pools
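
The bookkeeping behind rank virtualization can be sketched as a pool-to-ES map in plain C. This is a simplified model under stated assumptions, not the Charm++ implementation: it assumes one initial pool per ES, represents the global list of orphaned pools by an ORPHANED marker, equates the rank with the pool index, and omits the added synchronization; pool_map_t, shrink_es, expand_es, and rank_of_pool are hypothetical names.

```c
#include <assert.h>

#define NUM_POOLS 4
#define ORPHANED  (-1)

typedef struct {
    /* pool index -> ES currently serving it, or ORPHANED */
    int pool_owner[NUM_POOLS];
} pool_map_t;

/* Assumption: initially pool i is served by ES i. */
void pool_map_init(pool_map_t *m) {
    for (int p = 0; p < NUM_POOLS; p++)
        m->pool_owner[p] = p;
}

/* The rank is derived from the pool, not from the ES that happens to
 * execute the work, so it survives shrink and expand unchanged. */
int rank_of_pool(int pool) {
    assert(pool >= 0 && pool < NUM_POOLS);
    return pool;
}

/* Shrink: orphan every pool of the removed ES; returns how many. */
int shrink_es(pool_map_t *m, int es) {
    int orphaned = 0;
    for (int p = 0; p < NUM_POOLS; p++) {
        if (m->pool_owner[p] == es) {
            m->pool_owner[p] = ORPHANED;
            orphaned++;
        }
    }
    return orphaned;
}

/* Expand: the new ES takes over all orphaned pools; returns how many. */
int expand_es(pool_map_t *m, int new_es) {
    assert(new_es != ORPHANED);
    int adopted = 0;
    for (int p = 0; p < NUM_POOLS; p++) {
        if (m->pool_owner[p] == ORPHANED) {
            m->pool_owner[p] = new_es;
            adopted++;
        }
    }
    return adopted;
}
```

Work queued in pool 2 keeps rank 2 whether it is run by its original ES, by another ES while the pool is orphaned, or by a newly created ES that adopted the pool.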


SLIDE 24

LeanMD Results of Shrinking/Expanding ESs


SLIDE 25

Argobots Ecosystem

[Figure: the Argobots runtime (ES1 ... ESn with schedulers, ULTs, tasklets, and events) at the center of an ecosystem that includes MPI+Argobots, Charm++, CilkBots, PaRSEC, OpenMP, Mercury RPC, OmpSs, and external connections (GridFTP, Kokkos, RAJA, ROSE, TASCEL, XMP, etc.)]

SLIDE 26

Summary

Massive on-node parallelism is inevitable

– Need runtime systems utilizing such parallelism

Argobots

– A lightweight low-level threading/tasking framework
– Provides efficient mechanisms, not policies, to users (library developers or compilers)

They can build their own solutions
Charm++ with Argobots

– Implemented by replacing the Converse runtime with Argobots
– Achieved comparable performance on LeanMD
– Incorporated the shrinking/expanding of ESs in Argobots in order to respond to external events (e.g., power capping)


SLIDE 27

Try Argobots

git repository
http://git.mcs.anl.gov/argo/argobots.git/
Documentation

– Wiki

https://collab.cels.anl.gov/display/ARGOBOTS/

– Doxygen

http://www.mcs.anl.gov/~sseo/public/argobots/


SLIDE 28

Funding Acknowledgments

Funding Grant Providers
Infrastructure Providers

SLIDE 29

Programming Models and Runtime Systems Group

Group Lead

– Pavan Balaji (computer scientist and group lead)

Current Staff Members

– Abdelhalim Amer (postdoc)
– Yanfei Guo (postdoc)
– Rob Latham (developer)
– Lena Oden (postdoc)
– Ken Raffenetti (developer)
– Sangmin Seo (assistant computer scientist)
– Min Si (postdoc)
– Min Tian (visiting scholar)

Past Staff Members

– Antonio Pena (postdoc)
– Wesley Bland (postdoc)
– Darius T. Buntinas (developer)
– James S. Dinan (postdoc)
– David J. Goodell (developer)
– Huiwei Lu (postdoc)
– Yanjie Wei (visiting scholar)
– Yuqing Xiong (visiting scholar)
– Jian Yu (visiting scholar)
– Junchao Zhang (postdoc)
– Xiaomin Zhu (visiting scholar)

Current and Recent Students

– Ashwin Aji (Ph.D.)
– Abdelhalim Amer (Ph.D.)
– Md. Humayun Arafat (Ph.D.)
– Alex Brooks (Ph.D.)
– Adrian Castello (Ph.D.)
– Dazhao Cheng (Ph.D.)
– James S. Dinan (Ph.D.)
– Piotr Fidkowski (Ph.D.)
– Priyanka Ghosh (Ph.D.)
– Sayan Ghosh (Ph.D.)
– Ralf Gunter (B.S.)
– Jichi Guo (Ph.D.)
– Yanfei Guo (Ph.D.)
– Marius Horga (M.S.)
– John Jenkins (Ph.D.)
– Feng Ji (Ph.D.)
– Ping Lai (Ph.D.)
– Palden Lama (Ph.D.)
– Yan Li (Ph.D.)
– Huiwei Lu (Ph.D.)
– Jintao Meng (Ph.D.)
– Ganesh Narayanaswamy (M.S.)
– Qingpeng Niu (Ph.D.)
– Ziaul Haque Olive (Ph.D.)
– David Ozog (Ph.D.)
– Renbo Pang (Ph.D.)
– Sreeram Potluri (Ph.D.)
– Li Rao (M.S.)
– Gopal Santhanaraman (Ph.D.)
– Thomas Scogland (Ph.D.)
– Min Si (Ph.D.)
– Brian Skjerven (Ph.D.)
– Rajesh Sudarsan (Ph.D.)
– Lukasz Wesolowski (Ph.D.)
– Shucai Xiao (Ph.D.)
– Chaoran Yang (Ph.D.)
– Boyu Zhang (Ph.D.)
– Xiuxia Zhang (Ph.D.)
– Xin Zhao (Ph.D.)

Advisory Board

– Pete Beckman (senior scientist)
– Rusty Lusk (retired, STA)
– Marc Snir (division director)
– Rajeev Thakur (deputy director)