SLIDE 1

A tale of two schedulers

Noah Evans, Richard Barrett, Stephen Olivier, George Stelle
nevans@sandia.gov
6/26/17

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP

SLIDE 2

Outline

▪ Making Parallel Programming Easier
▪ Qthreads Chapel Support
▪ The two Qthreads schedulers (plus the old one)
  ▪ Sherwood
  ▪ Nemesis
  ▪ Distrib
▪ Performance evaluation
▪ Future work
▪ Conclusions

SLIDE 3

Making Parallel Programming Easier

▪ Typical parallel programming: MPI and BSP
  ▪ Downside: fiddly, lots of application programming effort
▪ Another strategy: push the complexity of parallel programs into the runtime
  ▪ The programmer specifies data dependencies and the smallest units of work
  ▪ This is the approach taken by the HPCS language Chapel

SLIDE 4

Solution: Multiresolution

▪ Ability to change underlying aspects of the language
▪ Write one program, compile it in different ways based on environment variables
▪ Choose the abstraction at compile time rather than in the code
▪ Goal: enable performance portability, reduce programmer effort

SLIDE 5

Chapel structure


[Diagram: Chapel runtime support libraries (written in C): Tasks, Communication, Memory, Timers, Launchers, Standard, Threads]

SLIDE 6

Qthreads Chapel Support

▪ Qthreads: a user-level tasking model
  ▪ Low-level, anonymous threads; cooperative, no signal handling
  ▪ Lighter weight than pthreads
▪ Distinguishing feature: Full/Empty Bits (FEBs) (see the sketch below)
  ▪ Models the Cray XMT FEB; primitives can be in hardware or software
▪ Default tasking layer for Chapel
▪ The Qthreads tasking model is also multiresolution: schedulers can be chosen at configure time
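To make the FEB idea concrete, here is a minimal sketch using the public Qthreads C API (error checking omitted): one task is forked, and the caller blocks on the full/empty bit that receives its return value. The `square` task itself is a made-up example.

```c
/* Minimal FEB sketch: fork one task and block on the full/empty bit
 * that receives its return value.  Error checking omitted. */
#include <qthread/qthread.h>
#include <stdio.h>

static aligned_t square(void *arg)
{
    aligned_t x = *(aligned_t *)arg;
    return x * x;                      /* return value fills the 'ret' FEB */
}

int main(void)
{
    aligned_t arg = 7, ret = 0, result = 0;

    qthread_initialize();              /* start shepherds/worker threads   */
    qthread_fork(square, &arg, &ret);  /* 'ret' is emptied, filled on exit */
    qthread_readFF(&result, &ret);     /* block until the task fills 'ret' */
    printf("square(7) = %lu\n", (unsigned long)result);
    return 0;
}
```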

SLIDE 7

Objective: A Qthreads scheduler for many-core Chapel

▪ Our old default scheduler was built for NUMA multicore machines using mutexes; our mutex-based schedulers don’t scale on many-core
▪ We’ve been working on schedulers that use lock-free methods and different scheduling strategies for many-core
▪ Evaluating two schedulers: Nemesis and a new scheduler, Distrib
  ▪ Nemesis is good for simple streaming tasks; Distrib is good for irregular jobs using work stealing

SLIDE 8

Sherwood

▪ The original work-stealing scheduler for Qthreads
▪ Idea: a double-ended queue optimized for NUMA multicore
  ▪ Front: LIFO scheduling for cache locality
  ▪ Back: bulk transfer of stolen jobs between NUMA domains
▪ Design: a mutex guards both ends of the double-ended queue (see the sketch below)
▪ However, locking at both ends of the queue prevents lock-free approaches
▪ Good for older multicore, but poor performance on many-core
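As a rough illustration (not the actual Qthreads source), the Sherwood idea can be sketched as a doubly linked deque guarded by one mutex: the owner pushes new tasks at the front (popping mirrors this, giving LIFO), and a thief detaches a chunk of the oldest tasks from the back in a single critical section. All names here are hypothetical.

```c
/* Hypothetical sketch of the Sherwood deque: one mutex covers both ends. */
#include <pthread.h>
#include <stddef.h>

typedef struct task { struct task *prev, *next; } task_t;

typedef struct {
    pthread_mutex_t lock;            /* single lock guards front and back */
    task_t *front, *back;
    size_t  len;
} sherwood_deque_t;

/* Owner: push newest task at the front (pop_front mirrors this -> LIFO). */
static void push_front(sherwood_deque_t *q, task_t *t)
{
    pthread_mutex_lock(&q->lock);
    t->prev = NULL;
    t->next = q->front;
    if (q->front) q->front->prev = t; else q->back = t;
    q->front = t;
    q->len++;
    pthread_mutex_unlock(&q->lock);
}

/* Thief: detach up to n of the oldest tasks from the back in bulk. */
static task_t *steal_back_bulk(sherwood_deque_t *q, size_t n)
{
    pthread_mutex_lock(&q->lock);
    task_t *first = NULL, *walk = q->back;
    size_t  got = 0;
    while (walk && got < n) { first = walk; walk = walk->prev; got++; }
    if (got) {
        if (walk) { walk->next = NULL; q->back = walk; }
        else      { q->front = q->back = NULL; }
        first->prev = NULL;          /* 'first' now heads a chain of 'got'  */
        q->len -= got;
    }
    pthread_mutex_unlock(&q->lock);
    return first;                    /* caller follows ->next to consume   */
}
```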

SLIDE 9

Nemesis

▪ An alternative to Sherwood
▪ Took an idea from MPICH2: the “Nemesis” lock-free queue [Buntinas et al., 2006] (see the sketch below)
▪ Scheduling is simple FIFO, with no load balancing
▪ Optimized for performance of streaming jobs
▪ No concept of work stealing or load balancing
▪ Spin-based backoff mechanism
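The flavor of that queue can be sketched with C11 atomics: enqueue is a single atomic swap on the tail, and only the owning worker dequeues, in FIFO order. This is an illustration of the published idea, not the Qthreads source; all names are hypothetical.

```c
/* Hypothetical sketch of a Nemesis-style lock-free FIFO (single consumer). */
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    _Atomic(struct node *) next;
} node_t;

typedef struct {
    _Atomic(node_t *) head;
    _Atomic(node_t *) tail;
} nq_t;

/* Any thread may enqueue: one atomic swap on the tail, no locks. */
static void nq_enqueue(nq_t *q, node_t *n)
{
    atomic_store(&n->next, (node_t *)NULL);
    node_t *prev = atomic_exchange(&q->tail, n);
    if (prev == NULL)
        atomic_store(&q->head, n);        /* queue was empty           */
    else
        atomic_store(&prev->next, n);     /* link behind the old tail  */
}

/* Only the owning worker dequeues (simple FIFO, no load balancing). */
static node_t *nq_dequeue(nq_t *q)
{
    node_t *n = atomic_load(&q->head);
    if (n == NULL)
        return NULL;
    node_t *next = atomic_load(&n->next);
    if (next != NULL) {
        atomic_store(&q->head, next);
    } else {
        atomic_store(&q->head, (node_t *)NULL);
        node_t *expected = n;
        /* A racing enqueue may not have linked its node yet. */
        if (!atomic_compare_exchange_strong(&q->tail, &expected, (node_t *)NULL)) {
            while ((next = atomic_load(&n->next)) == NULL)
                ;                          /* brief spin               */
            atomic_store(&q->head, next);
        }
    }
    return n;
}
```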

SLIDE 10

Distrib: the newest scheduler

▪ Builds on lessons learned from Nemesis, but adds work stealing
▪ Minimizes cache contention by spreading queues across cache lines
▪ LIFO scheduling
▪ Lightweight work stealing: steal one task at a time, using a predefined “steal ratio” that sets how many times to check the local queue before attempting to steal from other queues
▪ If there is nothing to steal, back off after a certain number of iterations (see the sketch below)
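A hypothetical sketch of the scheduling loop just described (not the Qthreads source): each worker owns a LIFO queue, checks it a “steal ratio” number of times per round, steals a single task otherwise, and backs off after repeated misses. The helper functions and constants below are assumptions for illustration.

```c
/* Hypothetical sketch of a Distrib-style worker loop with a steal ratio
 * and backoff.  The per-worker queue API below is assumed, not real. */
#include <sched.h>                   /* sched_yield() */
#include <stddef.h>

#define STEAL_RATIO   8              /* local checks per steal attempt (tunable) */
#define BACKOFF_LIMIT 64             /* empty rounds before yielding             */

typedef struct task task_t;

task_t *local_pop_lifo(int worker);  /* assumed: pop newest task from own queue  */
task_t *steal_one(int victim);       /* assumed: take one task from a victim     */
int     num_workers(void);           /* assumed */
void    run_task(task_t *t);         /* assumed */

void worker_loop(int me)
{
    unsigned misses = 0;
    for (;;) {
        task_t *t = NULL;

        /* Favor the local LIFO queue: check it STEAL_RATIO times per round. */
        for (int i = 0; i < STEAL_RATIO && t == NULL; i++)
            t = local_pop_lifo(me);

        /* Then attempt to steal a single task from another worker's queue. */
        if (t == NULL)
            t = steal_one((me + 1) % num_workers());   /* victim choice simplified */

        if (t != NULL) {
            run_task(t);
            misses = 0;
        } else if (++misses > BACKOFF_LIMIT) {
            sched_yield();           /* back off; leaves time for progress threads */
            misses = 0;
        }
    }
}
```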

SLIDE 11

Summary


Table 1: Qthreads schedulers

Scheduler | Queue                                        | Work stealing | Performance
Sherwood  | One per NUMA domain or one per worker thread | Yes           | Good for multicore with big caches
Nemesis   | Only one per worker thread                   | No            | Good for streaming and contended workloads
Distrib   | Only one per worker thread                   | Yes           | Good for contended workloads with imbalance

SLIDE 12

Performance Evaluation

▪ Want to see how much overhead LIFO scheduling and our minimal work stealing add compared to Nemesis
▪ Also use Sherwood as a baseline
▪ Questions to answer:
  ▪ What is the overhead of work stealing?
  ▪ How much does backoff matter?
  ▪ When should we use Nemesis and when should we use Distrib?

SLIDE 13

Experimental Design

▪ Knights Landing processor 7250
  ▪ 68 cores, 272 hardware threads, 1.6 GHz
  ▪ 16 GB of high-bandwidth memory (MCDRAM) on package, operated in cache mode
▪ Chapel 1.14, GCC 4.8.3 using -O3 and -march=native
▪ Performance comparisons using Linux’s perf tools for full-system profiling

SLIDE 14

Benchmark overview

▪ Quicksort: simple distributed quicksort
▪ SSCA#2: HPCS Scalable Synthetic Compact Applications graph analysis
▪ Stream: memory streaming benchmark
▪ Tree: constructs and sums a binary tree in parallel
▪ Graph500: two kernels, breadth-first search (BFS) and shortest path; the Chapel version only does BFS

SLIDE 15

Quicksort: Distrib load balancing is better (lower is better)

SLIDE 16

Same amount of actual work done, just better distributed

SLIDE 17

Distrib better for SSCA#2 (lower is better)

SLIDE 18

Nemesis FIFO better for Stream (higher is better)

SLIDE 19

Tree: Distrib better at scale

SLIDE 20

Distrib better for Graph500 (lower is better)

SLIDE 21

Experimental conclusions

▪ Distrib is better for most cases at scale
  ▪ Lightweight work stealing is the major reason
  ▪ Overhead makes it slower for small problems
▪ Nemesis is still better for streaming jobs with simple workflows

SLIDE 22

Future work

▪ Run all application progress threads in Qthreads (e.g., MPI and OpenFabrics asynchronous network threads)
  ▪ Right now Nemesis and Distrib back off to make time for progress threads
  ▪ If all components of the application use the runtime, there is no need to back off
▪ Is it possible to make Distrib perform better than Nemesis in all cases?
  ▪ Make work stealing zero-cost (turn it off with no overhead)
  ▪ Switch between LIFO and FIFO
  ▪ Dynamic schedulers?

SLIDE 23

Conclusions

▪ For most HPC use cases Distrib is better
▪ For heavy streaming Nemesis is more performant
▪ We can choose the best tool for the job, fitting Chapel’s multiresolution approach
▪ This helps solve a wide variety of HPC problems

SLIDE 24

Thank You
