Scheduling Task Parallelism on Multi-Socket Multicore Systems



  1. Scheduling Task Parallelism on Multi-Socket Multicore Systems. Stephen Olivier, UNC Chapel Hill; Allan Porterfield, RENCI; Kyle Wheeler, Sandia National Labs; Jan Prins, UNC Chapel Hill

  2. Outline
  • Introduction and Motivation
  • Scheduling Strategies
  • Evaluation
  • Closing Remarks

  3. Outline (section divider: Introduction and Motivation)

  4. Task Parallel Programming in a Nutshell
  • A task consists of executable code and an associated data context, with some bookkeeping metadata for scheduling and synchronization.
  • Tasks are significantly more lightweight than threads.
  • Dynamically generated and terminated at run time
  • Scheduled onto threads for execution
  • Used in Cilk, TBB, X10, Chapel, and other languages
  • Our work is on the recent tasking constructs in OpenMP 3.0.

  5. Simple Task Parallel OpenMP Program: Fibonacci

    int fib(int n)
    {
        int x, y;
        if (n < 2) return n;
        /* shared(x)/shared(y) are needed for correctness: variables
           referenced inside a task default to firstprivate in OpenMP 3.0,
           so without them the children's results would be written to
           private copies and lost */
        #pragma omp task shared(x)
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);
        #pragma omp taskwait    /* wait for both child tasks to complete */
        return x + y;
    }

  [Figure: call tree in which fib(10) spawns fib(9) and fib(8), fib(9) spawns fib(8) and fib(7), and so on.]

  6. Useful Applications
  • Recursive algorithms, e.g., mergesort
  • List and tree traversal
  • Irregular computations, e.g., the adaptive fast multipole method
  • Parallelization of while loops
  • Situations where programmers might otherwise write a difficult-to-debug low-level task pool implementation in pthreads
  [Figure: task tree of a recursive sort, with cilksort tasks spawning cilkmerge tasks.]

  7. Goals for Task Parallelism Support
  • Programmability
    - Expressiveness for applications
    - Ease of use
  • Performance & Scalability
    - Lack thereof is a serious barrier to adoption
    - Must improve software run time systems

  8. Issues in Task Scheduling
  • Load imbalance: uneven distribution of tasks among threads
  • Overhead costs: time spent creating, scheduling, synchronizing, and load balancing tasks rather than doing the actual computational work
  • Locality: task execution time depends on the time required to access the data used by the task

  9. The Current Hardware Environment
  • Shared memory is not a free lunch.
  • Data can be accessed without explicitly programmed messages as in MPI, but not always at equal cost.
  • However, OpenMP has traditionally been agnostic about the affinity between data and computation.
  • Most vendors have (often non-portable) extensions for thread layout and binding.
  • First-touch allocation is traditionally used to distribute data across memories on many systems, as in the sketch below.
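
  First-touch placement pairs naturally with OpenMP loops. Below is a minimal sketch of the usual idiom, assuming a demand-paged system where a page is physically allocated on the NUMA node of the thread that first writes it; the array, its size, and the loops are illustrative, not taken from the talk.

    #include <stdlib.h>

    #define N (1 << 24)   /* illustrative array size */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));

        /* First touch: initialize in parallel with the same static
           schedule as the compute loop, so each thread's pages are
           allocated on its own NUMA node. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* Compute loop with the same schedule reuses local pages. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = a[i] + 1.0;

        free(a);
        return 0;
    }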

  10. Example UMA System
  [Diagram: two chips of N cores, each with a cache ($), sharing one bus to a single memory.]
  • Incarnations include Intel server configurations prior to Nehalem and the Sun Niagara systems
  • Shared bus to memory

  11. Example Target NUMA System
  [Diagram: four chips of N cores, each with its own cache ($) and locally attached memory, connected point to point.]
  • Incarnations include Intel Nehalem/Westmere processors using QPI and AMD Opterons using HyperTransport.
  • Remote memory accesses typically have higher latency than local accesses, and contention may exacerbate this.

  12. Outline (section divider: Scheduling Strategies)

  13. Work Stealing
  • Studied and implemented in Cilk by Blumofe et al. at MIT
  • Now used in many task-parallel run time implementations
  • Allows dynamic load balancing with low critical-path overhead, since idle threads steal work from busy threads
  • Tasks are enqueued and dequeued LIFO but stolen FIFO, to exploit local caches (see the sketch below)
  • Challenges
    - Not well suited to the shared caches now common in multicore chips
    - Expensive off-chip steals on NUMA systems
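
  The queue discipline above can be made concrete with a small sketch: the owner pushes and pops at the tail (LIFO, good cache reuse), while thieves steal from the head (FIFO, taking the oldest tasks). Production runtimes such as Cilk use lock-free deques; the single mutex and all names here are simplifications for illustration only.

    #include <pthread.h>

    typedef struct task { struct task *next, *prev; } task_t;

    typedef struct {
        task_t *head, *tail;      /* doubly linked deque of ready tasks */
        pthread_mutex_t lock;
    } deque_t;

    void push_tail(deque_t *q, task_t *t)     /* owner enqueues LIFO */
    {
        pthread_mutex_lock(&q->lock);
        t->prev = q->tail; t->next = NULL;
        if (q->tail) q->tail->next = t; else q->head = t;
        q->tail = t;
        pthread_mutex_unlock(&q->lock);
    }

    task_t *pop_tail(deque_t *q)              /* owner dequeues LIFO */
    {
        pthread_mutex_lock(&q->lock);
        task_t *t = q->tail;
        if (t) { q->tail = t->prev;
                 if (q->tail) q->tail->next = NULL; else q->head = NULL; }
        pthread_mutex_unlock(&q->lock);
        return t;
    }

    task_t *steal_head(deque_t *q)            /* thief steals FIFO */
    {
        pthread_mutex_lock(&q->lock);
        task_t *t = q->head;
        if (t) { q->head = t->next;
                 if (q->head) q->head->prev = NULL; else q->tail = NULL; }
        pthread_mutex_unlock(&q->lock);
        return t;
    }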

  14. PDFS (Parallel Depth-First Schedule)
  • Studied by Blelloch et al. at CMU
  • Basic idea: schedule tasks in an order close to the serial order
  • If the sequential execution has good cache locality, PDFS should as well
  • Implemented most easily as a shared LIFO queue, as sketched below
  • Shown to make good use of shared caches
  • Challenges
    - Contention for the shared queue
    - Long queue access times across chips on NUMA systems
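
  A PDFS-style shared queue is even simpler to sketch: one LIFO stack shared by all threads, which keeps execution close to the serial depth-first order. The lock and names are illustrative; a real implementation would work to reduce contention on the single queue, which is exactly the challenge noted above.

    #include <pthread.h>

    typedef struct task { struct task *next; } task_t;

    static task_t *top;           /* one stack shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void push(task_t *t)          /* any thread enqueues LIFO */
    {
        pthread_mutex_lock(&lock);
        t->next = top;
        top = t;
        pthread_mutex_unlock(&lock);
    }

    task_t *pop(void)             /* any thread dequeues LIFO */
    {
        pthread_mutex_lock(&lock);
        task_t *t = top;
        if (t) top = t->next;
        pthread_mutex_unlock(&lock);
        return t;
    }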

  15. Our Hierarchical Scheduler
  • Basic idea: combine the benefits of work stealing and PDFS on multi-socket multicore NUMA systems
  • Intra-chip shared LIFO queue to exploit the shared L3 cache and provide natural load balancing among local cores
  • FIFO work stealing between chips for further low-overhead load balancing while maintaining L3 cache locality
  • Only one thief thread per chip performs work stealing, and only when the on-chip queue is empty
  • The thief steals enough tasks, if available, for all cores sharing the on-chip queue (see the sketch below)
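
  Putting the two together, a hedged sketch of the hierarchical discipline: a shared per-chip LIFO queue, plus one designated thief per chip that batch-steals FIFO from another chip when the local queue runs dry. The types and helpers (pop_tail, steal_batch_head, pick_victim_chip) are illustrative stand-ins in the style of the deque sketch above, not the actual Qthreads code.

    #define NCHIPS 4
    #define CORES_PER_CHIP 8

    typedef struct task { struct task *next; } task_t;
    typedef struct { task_t *head, *tail; /* plus a lock, omitted */ } queue_t;

    queue_t chip_queue[NCHIPS];   /* one shared queue per chip */

    /* Assumed helpers: shared LIFO pop, FIFO batch steal that moves up
       to n of the victim's oldest tasks, and victim selection. */
    task_t *pop_tail(queue_t *q);
    int steal_batch_head(queue_t *victim, queue_t *dest, int n);
    int pick_victim_chip(int chip);           /* e.g., round robin */

    task_t *next_task(int chip, int core)
    {
        task_t *t = pop_tail(&chip_queue[chip]);  /* on-chip LIFO: L3 reuse */
        if (t) return t;

        if (core != 0)          /* core 0 acts as the chip's only thief here */
            return NULL;        /* other cores simply retry later */

        /* Steal FIFO, taking enough of the victim's oldest tasks to feed
           every core that shares this chip's queue. */
        int victim = pick_victim_chip(chip);
        if (steal_batch_head(&chip_queue[victim], &chip_queue[chip],
                             CORES_PER_CHIP) > 0)
            return pop_tail(&chip_queue[chip]);
        return NULL;
    }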

  16. Implementation
  • We implemented our scheduler, as well as other schedulers (e.g., work stealing, centralized queue), as extensions to Sandia's Qthreads multithreading library.
  • We use the ROSE source-to-source compiler to accept OpenMP programs and generate transformed code with XOMP outlined functions for OpenMP directives and run time calls (illustrated below).
  • Our Qthreads extensions implement the XOMP functions.
  • ROSE-transformed application programs are compiled and executed with the Qthreads library.
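
  To make the pipeline concrete, here is a hypothetical sketch of what the outlining transformation looks like for the fib() task from slide 5. The XOMP_task call shape and all names (struct fib_args, OUT__fib_task) are illustrative guesses at the style of the generated code, not ROSE's exact interface.

    extern int fib(int n);

    /* Packed arguments for the outlined task body. */
    struct fib_args { int n; int *x; };

    /* Outlined body of "#pragma omp task shared(x) x = fib(n - 1);" */
    static void OUT__fib_task(void *arg)
    {
        struct fib_args *a = (struct fib_args *)arg;
        *a->x = fib(a->n - 1);
    }

    /* The directive itself is rewritten into a run time call along the
       lines of XOMP_task(OUT__fib_task, &args, sizeof(args), ...), and
       "#pragma omp taskwait" becomes a call such as XOMP_taskwait().
       In this work, the Qthreads extensions supply those XOMP_* entry
       points, scheduling each outlined task as a lightweight qthread. */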

  17. Outline (section divider: Evaluation)

  18. Evaluation Setup
  • Hardware: shared memory NUMA system with four 8-core Intel Xeon X7550 chips fully connected by QPI
  • Compilers and run time systems: ICC, GCC, Qthreads
  • Five Qthreads implementations:
    - Q: per-core FIFO queues with round-robin task placement
    - L: per-core LIFO queues with round-robin task placement
    - CQ: centralized queue
    - WS: per-core LIFO queues with FIFO work stealing
    - MTS: per-chip LIFO queues with FIFO work stealing

  19. Evaluation Programs
  • From the Barcelona OpenMP Tasks Suite (BOTS)
  • Described in the ICPP '09 paper by Duran et al.
  • Available for download online
  • Several of the programs have cut-off thresholds: no further tasks are created beyond a certain depth in the computation tree (see the sketch below)
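
  As an illustration of such a cut-off (reusing the Fibonacci code from slide 5; the threshold value and depth parameter are invented for the example), recursion below the threshold proceeds serially instead of creating tasks:

    #define CUTOFF 10   /* illustrative depth threshold */

    int fib_cutoff(int n, int depth)
    {
        int x, y;
        if (n < 2) return n;
        if (depth >= CUTOFF) {            /* serial below the cut-off */
            return fib_cutoff(n - 1, depth + 1)
                 + fib_cutoff(n - 2, depth + 1);
        }
        #pragma omp task shared(x)
        x = fib_cutoff(n - 1, depth + 1);
        #pragma omp task shared(y)
        y = fib_cutoff(n - 2, depth + 1);
        #pragma omp taskwait
        return x + y;
    }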

  20. Health Simulation Performance

  21. Health Simulation Performance: stock Qthreads scheduler (per-core FIFO queues)

  22. Health Simulation Performance: per-core LIFO queues

  23. Health Simulation Performance: per-core LIFO queues with FIFO work stealing

  24. Health Simulation Performance: per-chip LIFO queues with FIFO work stealing

  25. Health Simulation Performance

  26. Sort Benchmark

  27. NQueens Problem

  28. Fibonacci

  29. Strassen Multiply

  30. Protein Alignment
  [Charts annotated: for loop startup, single task startup]

  31. Sparse LU Decomposition
  [Charts annotated: for loop startup, single task startup]

  32. Per-Core Work Stealing vs. Hierarchical Scheduling
  • Per-core work stealing exhibits lower variability in performance on most benchmarks
  • Both the per-core work stealing and hierarchical scheduling Qthreads implementations had smaller standard deviations than ICC on almost all benchmarks
  [Chart: standard deviation as a percentage of the fastest time]

  33. Per-Core Work Stealing vs. Hierarchical Scheduling
  • Hierarchical scheduling benefit: significantly fewer remote steals observed on almost all programs

  34. Per-Core Work Stealing vs. Hierarchical Scheduling
  • Hierarchical scheduling benefit: fewer L3 misses, less QPI traffic, and fewer memory accesses, as measured by hardware performance counters on health and sort
  [Charts: Health, Sort]

  35. Stealing Multiple Tasks

  36. Outline (section divider: Closing Remarks)
