The Chapel Tasking Layer Over Qthreads

SLIDE 1

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

The Chapel Tasking Layer Over Qthreads

Kyle B. Wheeler, Richard C. Murphy, Dylan Stark, and Bradford L. Chamberlain

Wednesday, May 18, 2011

SLIDE 2

The Structure of Chapel’s Runtime

Chapel Runtime Support Libraries (written in C)

Tasks, Communication, Memory, Timers, Launchers, Standard, Threads

SLIDE 3

Chapel’s Tasking Layer

  • Role: Responsible for parallelism/synchronization
  • Main Focus:

    – support begin/cobegin/coforall statements
    – support synchronization variables

  • Main Features:

    – Startup/Teardown
    – Singleton Tasks
    – Task Lists
    – Synchronization
    – Control
    – Queries
    – ...serialization?

SLIDE 4

The FIFO Tasking Implementation

  • Work-queue model
    – Function calls for work execution
    – Centralized queue
  • Cons:
    – Task synchronization (sync) using thread synchronization (pthread_mutex_t)
      • Compute/sync overlap requires oversubscribing (#threads > #cpus)
      • Difficult to provide non-native (non-mutex) synchronization behavior
    – #Task-to-#thread mismatch creates unexpected deadlock potential
    – Does not support work stealing
    – Does not support CPU pinning
  • Pros:
    – Simple, easy to debug
    – Very portable
    – Uses native state management
      • stacks
      • thread/task-specific data
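
The work-queue model above can be sketched in C as a single queue guarded by one pthread_mutex_t, with each task run as a plain function call. This is only an illustration of the idea, not the Chapel runtime's code: work_queue, wq_push, wq_run_one, bump, and the fixed QUEUE_CAP are all hypothetical names chosen for the sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 64                 /* hypothetical fixed capacity */

typedef void (*task_fn)(void *);
typedef struct { task_fn fn; void *arg; } task;

typedef struct {
    pthread_mutex_t lock;            /* one lock guards the whole queue */
    task            items[QUEUE_CAP];
    size_t          head, count;     /* ring buffer: oldest item at head */
} work_queue;

void wq_init(work_queue *q) {
    pthread_mutex_init(&q->lock, NULL);
    q->head = q->count = 0;
}

/* Append a task at the tail. Returns 0 on success, -1 if the queue is full. */
int wq_push(work_queue *q, task_fn fn, void *arg) {
    int rc = -1;
    assert(fn != NULL);
    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_CAP) {
        size_t tail = (q->head + q->count) % QUEUE_CAP;
        q->items[tail].fn  = fn;
        q->items[tail].arg = arg;
        q->count++;
        rc = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Take the oldest task and run it. Returns 1 if a task ran, 0 if empty. */
int wq_run_one(work_queue *q) {
    task t;
    pthread_mutex_lock(&q->lock);
    if (q->count == 0) {
        pthread_mutex_unlock(&q->lock);
        return 0;
    }
    t = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    t.fn(t.arg);                     /* executed outside the lock */
    return 1;
}

/* Example task: increment the int that arg points to. */
void bump(void *arg) { ++*(int *)arg; }
```

Every worker thread would loop on wq_run_one. Because all workers share one lock and one queue, the sketch also shows why this design has no natural place for work stealing or locality.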

SLIDE 5

Challenges in Highly-Threaded Runtimes

  • Per-thread state
    – State vs threads
  • Locality
    – An afterthought in standard threading models
    – Communication and synchronization are expensive, easy to use accidentally
  • Synchronization
    – Hard to make portable, maintain guarantees
  • Every Machine is Different
    – Granularity of sharing (cacheline size)
    – Optimal number of threads (PU count)
    – Communication topology
    – Cache structure
    – Memory model
    – Synchronization Primitives (CMPXCHG vs TNS vs CASXA vs LDARX/STWCX)
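
One common way to hide the differences between synchronization primitives is to write against a portable compare-and-swap and let the compiler pick the instruction. The sketch below uses C11 <stdatomic.h>; a compiler would typically lower the loop to CMPXCHG on x86 or to a load-reserve/store-conditional pair on PowerPC. The function name cas_fetch_add is hypothetical, and this is not taken from Qthreads or the Chapel runtime.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically add delta to *target using a CAS retry loop.
 * Returns the value that was stored before the addition. */
uint64_t cas_fetch_add(_Atomic uint64_t *target, uint64_t delta) {
    assert(target != NULL);
    uint64_t seen = atomic_load(target);
    /* On failure, atomic_compare_exchange_weak reloads the current
     * value into `seen`, so each retry recomputes seen + delta. */
    while (!atomic_compare_exchange_weak(target, &seen, seen + delta)) {
        /* retry */
    }
    return seen;
}
```

The weak variant may fail spuriously on load-reserve/store-conditional machines, which is why it sits inside a loop.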

SLIDE 6

Qthreads Highlights

  • Lightweight User-level Threading (Tasking)
  • Platform portability
    – IA32/64, AMD64, PPC32/64, SparcV9, SST, Tilera
    – Linux, BSD, Solaris, MacOSX
  • Locality awareness
    – “Shepherd” as thread mobility domain & locality
  • Fine-grained synchronization semantics
    – Full/Empty Bits (64-bit & 60-bit)
    – Mutexes
    – Atomic operations (Incr & CAS)
  • Locality-aware Work-stealing Model

SLIDE 7

Chapel Single Locale Challenges

  • Startup & Teardown
    – Functions with unspecified scope
    – Synchronization primitives of unspecified scope
  • Unsupported Behavior
    – Limit on OS Threads
      • Default defined by hardware
    – Forced serialization of tasks
    – Task-local data

SLIDE 8

Chapel Multi-Locale Challenges

  • Communication (via GASNet)
    – Blocking system calls
      • Dedicated OS thread
      • Possibility for proxying internally
      • Temporary solution: Forked initialization thread
      • Future solution: explicit progress thread creation
    – External Task Operations
      • Task creation from outside the task library
        – Memory management issue
        – Also: synchronization issue…
      • Task synchronization outside the task library
        – Proxy-task using thread-level synchronization (pthread_mutex_t)

SLIDE 9

Future Work

  • Synchronization
    – Tasking interface assumes only mutex semantics
    – MTA/Qthreads interfaces provide fast FEB semantics
    – Implementing FEB semantics with a mutex implemented with FEB operations is silly and slow
  • Stack Space
    – Problem common to all tasking interfaces
    – Currently requires guess-and-check
    – Potential directions:
      • Technically possible to calculate stack requirements (e.g. gcc 4.6)
      • Technically possible to move stack variables to heap
        – Moves the memory management problem
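
To make the synchronization item on this slide concrete, here is a minimal sketch of full/empty-bit (FEB) semantics built on top of a mutex and condition variable, which is roughly the layering that results when the tasking interface offers only mutexes. All names (feb_word, feb_write_ef, feb_read_fe, feb_is_full) are hypothetical; this is not the Qthreads or Chapel API. On a runtime whose mutex is itself built from native FEB operations, every call below would pay for two layers of emulation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;   /* signalled whenever `full` flips */
    int             full;      /* the emulated full/empty bit */
    uint64_t        value;
} feb_word;

void feb_init(feb_word *w) {
    assert(w != NULL);
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->changed, NULL);
    w->full  = 0;              /* starts empty */
    w->value = 0;
}

/* write-when-Empty, leave Full: blocks while the word is full. */
void feb_write_ef(feb_word *w, uint64_t v) {
    pthread_mutex_lock(&w->lock);
    while (w->full)
        pthread_cond_wait(&w->changed, &w->lock);
    w->value = v;
    w->full  = 1;
    pthread_cond_broadcast(&w->changed);
    pthread_mutex_unlock(&w->lock);
}

/* read-when-Full, leave Empty: blocks while the word is empty. */
uint64_t feb_read_fe(feb_word *w) {
    uint64_t v;
    pthread_mutex_lock(&w->lock);
    while (!w->full)
        pthread_cond_wait(&w->changed, &w->lock);
    v = w->value;
    w->full = 0;
    pthread_cond_broadcast(&w->changed);
    pthread_mutex_unlock(&w->lock);
    return v;
}

/* Non-blocking query of the full/empty state. */
int feb_is_full(feb_word *w) {
    int f;
    pthread_mutex_lock(&w->lock);
    f = w->full;
    pthread_mutex_unlock(&w->lock);
    return f;
}
```

Note that each blocked reader or writer here parks an OS thread, which ties back to the oversubscription concern raised for the FIFO implementation.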

SLIDE 10

Performance: Raw Tasking

  • QuickSort
    – Naïve implementation (serial partitioning)
    – Uses recursive cobegin
    – Serialization threshold
      • For best comparison, set high to avoid serialization

[Figure: Execution Time (secs, log scale) vs. Array Elements (power of 2, 14 to 28); series: Qthreads, FIFO. Second panel: Ratio FIFO/Qt over the same range.]
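
The shape of this benchmark can be sketched in C with pthreads standing in for cobegin. The sketch assumes the serialization threshold is a recursion depth: above it, both halves are sorted concurrently; at or below it, they are sorted serially. That reading is an assumption, chosen because it matches "set high to avoid serialization". The names quicksort, qs_args, and thresh are hypothetical, and the benchmark itself was written in Chapel, not C.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Arguments handed to a spawned sorting task. */
typedef struct { int *a; long lo, hi; int depth, thresh; } qs_args;

/* Serial Lomuto partition over a[lo..hi]; returns the pivot's final index. */
static long partition(int *a, long lo, long hi) {
    int pivot = a[hi], t;
    long i = lo;
    for (long j = lo; j < hi; j++) {
        if (a[j] < pivot) { t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    }
    t = a[i]; a[i] = a[hi]; a[hi] = t;
    return i;
}

static void *qs_task(void *p);

static void qs_rec(int *a, long lo, long hi, int depth, int thresh) {
    if (lo >= hi) return;
    long mid = partition(a, lo, hi);
    if (depth < thresh) {
        /* "cobegin": left half in a new thread, right half in this one. */
        qs_args left = { a, lo, mid - 1, depth + 1, thresh };
        pthread_t tid;
        int spawned = (pthread_create(&tid, NULL, qs_task, &left) == 0);
        if (!spawned) qs_rec(a, lo, mid - 1, depth + 1, thresh);
        qs_rec(a, mid + 1, hi, depth + 1, thresh);
        if (spawned) pthread_join(tid, NULL);  /* `left` stays live until here */
    } else {
        /* Past the threshold: serialize both halves. */
        qs_rec(a, lo, mid - 1, depth + 1, thresh);
        qs_rec(a, mid + 1, hi, depth + 1, thresh);
    }
}

static void *qs_task(void *p) {
    qs_args *q = p;
    qs_rec(q->a, q->lo, q->hi, q->depth, q->thresh);
    return NULL;
}

/* Sort a[0..n-1]; spawn threads for the first `thresh` recursion levels. */
void quicksort(int *a, long n, int thresh) {
    assert(n >= 0);
    qs_rec(a, 0, n - 1, 0, thresh);
}
```

With one OS thread per spawned half, a high threshold quickly becomes expensive, which is the overhead a lightweight tasking layer is meant to reduce.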

SLIDE 11

Performance: Raw Tasking

  • Tree Exploration
    – Constructs binary tree
    – Assigns Unique ID
    – Computes sum of IDs
    – Uses recursive cobegin

[Figure: Execution Time (secs, log scale) vs. Tree Elements (power of 2, 12 to 28); series: Qthreads, FIFO. Second panel: Ratio FIFO/Qt over the same range.]

SLIDE 12

Performance: Data Parallel

  • HPCC RandomAccess
    – GUPS (random integer updates)
    – Stresses Memory System
    – Uses forall

[Figure: Execution Time (secs, log scale) vs. Number of Tasks (1 to 128); series: Qthreads, FIFO. Second panel: Ratio FIFO/Qt over the same range.]
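
For readers unfamiliar with the kernel, the random-update loop can be sketched serially in C. This assumes the usual HPCC formulation (a shift-and-XOR pseudo-random stream with polynomial 0x7, and XOR updates into a power-of-two table); the function names and the seed parameter are hypothetical, and the Chapel version measured here distributes the updates with forall rather than a serial loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLY 0x0000000000000007ULL

/* Advance the pseudo-random stream by one step. */
uint64_t next_random(uint64_t ran) {
    return (ran << 1) ^ (((int64_t)ran < 0) ? POLY : 0);
}

/* Initialize the table so that table[i] == i. */
void table_init(uint64_t *table, size_t size) {
    for (size_t i = 0; i < size; i++) table[i] = (uint64_t)i;
}

/* Apply n_updates XOR updates at pseudo-random indices.
 * size must be a power of two so that masking yields a valid index. */
void random_updates(uint64_t *table, size_t size,
                    uint64_t seed, size_t n_updates) {
    assert(size != 0 && (size & (size - 1)) == 0);
    uint64_t ran = seed;
    for (size_t i = 0; i < n_updates; i++) {
        ran = next_random(ran);
        table[ran & (size - 1)] ^= ran;   /* scattered read-modify-write */
    }
}
```

Because XOR is its own inverse, running the same update sequence a second time restores the initial table, which gives a simple correctness check for a serial run.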

SLIDE 13

Performance: Data Parallel

  • HPCC STREAM (-EP)
    – Memory Bandwidth & Vector Kernels
    – EP version avoids communication
    – Uses forall
    – Synchronization surprisingly important

[Figure: Execution Time (secs) vs. Number of Tasks (1 to 128); series: Qthreads, FIFO, Qthreads EP, FIFO EP. Second panel: Ratio FIFO/Qt over the same range; series: STREAM, STREAM-EP.]

SLIDE 14

Thank You!

Questions?
