The Chapel Tasking Layer Over Qthreads

Kyle B. Wheeler, Sandia National Laboratories*, and Richard C. Murphy, Sandia National Laboratories, and Dylan Stark, Sandia National Laboratories, and Bradford L. Chamberlain, Cray Inc.†

* Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
† This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0001.

ABSTRACT: This paper describes the applicability of the third-party qthread lightweight threading library for implementing the tasking layer for Chapel applications on conventional multisocket multicore computing platforms. A collection of Chapel benchmark codes was used to demonstrate the correctness of the qthread implementation and the performance gain provided by using an optimized threading/tasking layer. The experience of porting Chapel to use qthreads also provides insights into additional requirements imposed by a lightweight user-level threading library, some of which have already been integrated into Chapel, and others that are posed here as open issues for future work. The initial performance results indicate an immediate performance benefit from using qthreads over the native multithreading support in Chapel. Both task and data parallel applications benefit from lower overheads in thread management. Future work on improved synchronization semantics is likely to further increase the efficiency of the qthreads implementation.

KEYWORDS: Chapel, lightweight, threading, tasks

1. Introduction

It is increasingly recognized that, in order to obtain power and performance scalability, future hardware architectures will provide large amounts of parallelism. Taking full advantage of this parallelism requires an ability to specify the parallelism at multiple levels within a program. However, parallel programming is also widely recognized to be a difficult problem, and the set of programmers who can effectively leverage parallelism is a small fraction of those who are effective sequential programmers. Addressing the expressibility and programmability challenges is a problem of wide interest.

Chapel is a new parallel programming language being developed by Cray Inc. as part of DARPA's High Productivity Computing System program (HPCS). One of its main motivating themes is support for general parallel programming: data parallelism, task parallelism, concurrent programming, and arbitrary nestings of these styles. It also adopts a multiresolution language design in which higher-level features like arrays and data parallel loops are implemented in terms of lower-level features like classes and task parallelism. To this end, having a good implementation of Chapel's task parallel concepts is crucial since all parallelism is built in terms of it.

Task parallelism, in this case, refers not to the task/data parallelism distinction, but to the idea of a user-level threading concept, wherein tasks that can be executed in parallel are relatively short-lived and are created and destroyed rapidly. To maximize performance, applications must not only find parallel work, but must also match the amount of parallel work expressed to the available hardware. This latter task, however, is one best carried out by a runtime rather than the application itself.

Qthreads is a new lightweight threading, or tasking, library being developed by Sandia National Laboratories. The Qthreads runtime is designed to support dynamic programming and performance features not typically seen in either OpenMP or MPI systems. Parallel work is specified and the Qthreads runtime maps the work onto available hardware resources.

By comparing Qthreads' dynamic mapping of tasks to hardware against the default "FIFO" scheduling mechanism of the Chapel runtime, an accurate picture of the benefits of the Qthreads model can be obtained. In task parallelism situations, where cobegin is used, Qthreads can outperform the FIFO tasking layer by as much as 45%. In data parallelism situations, where forall and coforall are used, Qthreads can outperform the FIFO tasking layer by as much as 30%. Further work is planned to improve synchronization performance and eliminate additional bottlenecks.

Qthreads is described in more detail in Section 2. It is followed by a discussion of the Chapel tasking layer in Section 3. A discussion of the difficulties in mapping the Chapel tasking layer to the Qthreads API on single-node systems is in Section 4 and on multi-node systems is in Section 5. The results of our performance experiments are in Section 7.

2. Qthreads

Qthreads [4] is a cross-platform general purpose parallel runtime designed to support lightweight threading and synchronization within a flexible integrated locality framework. Qthreads directly supports programming with lightweight threads and a variety of synchronization methods, including both non-blocking atomic operations and potentially blocking full/empty bit (FEB) operations.

The Qthreads lightweight threading concept is intended to match future hardware threading environments more closely than existing concepts in three crucial aspects: anonymity, introspectable limited resources, and inherent localization. Unlike heavyweight threads, these threads do not support expensive features like per-thread identifiers, per-thread signal vectors, or preemptive multitasking. The thread scheduler in Qthreads presumes a cooperative-multitasking approach, which provides the flexibility to run threads in locations most convenient to the scheduler and the code. There are two scheduling regimes within qthreads: the single-threaded location mode, which does not use work-stealing, and the multi-threaded hierarchical location mode, which uses a shared work-queue between multiple workers in a single location and work-stealing between locations.

Blocking synchronization, such as when performing a FEB operation, triggers a user-space context switch. This context switch is done via function calls without trapping into the kernel, and therefore does not require saving as much state as a preemptive context switch does, such as signal masks and the full set of registers. This technique allows threads to process largely uninterrupted until data is needed that is not yet available, and allows the scheduler to attempt to hide communication latency by switching tasks when data is needed. Logically, this only hides communication latencies that take longer than a context switch.

3. Chapel Tasking Layer

Like many implementations of higher-level languages, the Chapel [2] compiler is implemented by compiling Chapel source code down to standard C. This permits the Chapel compiler to focus on high-level transformations and optimizations while leaving platform-specific targeting and optimizations to the native C compiler on each platform. Most of the lower-level code required to execute Chapel is implemented using Chapel's runtime libraries, which are also implemented in C and then linked to the generated code.

The Chapel runtime libraries are organized as a number of sub-interfaces, each of which implements a specific subset of functionality such as communication, task management, memory management, or timing routines. Each sub-interface is designed such that several distinct implementations can be supplied as long as each supports the interface's semantics. An end-user can select from among the implementation options via an environment variable.
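The sub-interface arrangement described above can be illustrated with a small C sketch. It is not taken from the Chapel runtime: the `tasking_layer` struct, the two toy implementations ("deferred" and "inline"), and the `DEMO_TASKS` environment variable are all hypothetical names chosen for illustration. The sketch only shows the general pattern of one interface, several interchangeable implementations with the same semantics, and selection by an environment variable.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical tasking sub-interface: a table of function pointers that
 * every implementation fills in with the same semantics. */
typedef void (*task_fn)(void *arg);

typedef struct {
    const char *name;
    void (*init)(void);
    void (*spawn)(task_fn fn, void *arg); /* start one task */
    void (*wait_all)(void);               /* return once all tasks are done */
} tasking_layer;

/* Implementation A ("deferred"): queue tasks, run them in FIFO order
 * when the caller waits. */
#define MAX_PENDING 64
static struct { task_fn fn; void *arg; } pending[MAX_PENDING];
static size_t pending_count;

static void deferred_init(void) { pending_count = 0; }

static void deferred_spawn(task_fn fn, void *arg) {
    if (pending_count < MAX_PENDING) {
        pending[pending_count].fn = fn;
        pending[pending_count].arg = arg;
        pending_count++;
    } else {
        fn(arg); /* queue full: fall back to running the task immediately */
    }
}

static void deferred_wait_all(void) {
    for (size_t i = 0; i < pending_count; i++)
        pending[i].fn(pending[i].arg);
    pending_count = 0;
}

/* Implementation B ("inline"): run every task at the point it is spawned. */
static void inline_init(void) {}
static void inline_spawn(task_fn fn, void *arg) { fn(arg); }
static void inline_wait_all(void) {}

static const tasking_layer layers[] = {
    { "deferred", deferred_init, deferred_spawn, deferred_wait_all },
    { "inline",   inline_init,   inline_spawn,   inline_wait_all   },
};

/* Look up an implementation by name; NULL or unknown names fall back to
 * the first entry, which acts as the default. */
const tasking_layer *select_tasking_layer(const char *requested) {
    size_t n = sizeof layers / sizeof layers[0];
    if (requested != NULL) {
        for (size_t i = 0; i < n; i++)
            if (strcmp(requested, layers[i].name) == 0)
                return &layers[i];
    }
    return &layers[0];
}

/* DEMO_TASKS is a made-up variable name for this sketch. */
const tasking_layer *tasking_layer_from_env(void) {
    return select_tasking_layer(getenv("DEMO_TASKS"));
}

static void add_one(void *arg) { *(int *)arg += 1; }

/* Client code sees only the interface, never a specific implementation. */
int run_demo(const tasking_layer *layer, int ntasks) {
    int counter = 0;
    layer->init();
    for (int i = 0; i < ntasks; i++)
        layer->spawn(add_one, &counter);
    layer->wait_all();
    return counter;
}
```

Calling `run_demo(tasking_layer_from_env(), 8)` returns the same count whichever implementation is selected, which is the property the text requires of each sub-interface: the implementations may differ in how and when tasks run, but not in the semantics the caller observes.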
