SLIDE 1

CS 5220: Load Balancing

David Bindel 2017-11-09

SLIDE 2

Inefficiencies in parallel code

Poor single processor performance

  • Typically in the memory system
  • Saw this in matrix multiply assignment

SLIDE 3

Inefficiencies in parallel code

Overhead for parallelism

  • Thread creation, synchronization, communication
  • Saw this in moshpit and shallow water assignments

SLIDE 4

Inefficiencies in parallel code

Load imbalance

  • Different amounts of work across processors
  • Different speeds / available resources
  • Insufficient parallel work
  • All this can change over phases

SLIDE 5

Where does the time go?

  • Load balance looks like large sync cost
  • ... maybe so does ordinary synchronization overhead!
  • And spin-locks may make sync look like useful work
  • And ordinary time sharing can confuse things more
  • Can get some help from profiling tools

SLIDE 6

Many independent tasks

  • Simplest strategy: partition by task index
  • What if task costs are inhomogeneous?
  • Worse: what if expensive tasks all land on one thread?
  • Potential fixes (see the sketch below):
      • Many small tasks, randomly assigned to processors
      • Dynamic task assignment
  • Issue: what about scheduling overhead?
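
A minimal sketch of the first fix, assuming many independent tasks of uneven cost: shuffle the task indices once and deal them round-robin to p workers, so expensive tasks are unlikely to cluster on one worker. All names here are illustrative.

    #include <stdlib.h>

    /* Randomized static assignment: perm[] is scratch space for a permutation,
       owner[task] receives the worker (0..p-1) that will run that task. */
    void random_static_assign(int ntasks, int p, int *perm, int *owner)
    {
        for (int i = 0; i < ntasks; ++i) perm[i] = i;
        for (int i = ntasks - 1; i > 0; --i) {      /* Fisher-Yates shuffle */
            int j = rand() % (i + 1);
            int tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
        }
        for (int i = 0; i < ntasks; ++i)
            owner[perm[i]] = i % p;
    }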

SLIDE 7

Variations on a theme

How to avoid overhead? Chunks! (Think OpenMP loops; see the sketch below.)

  • Small chunks: good balance, large overhead
  • Large chunks: poor balance, low overhead

Variants:

  • Fixed chunk size (requires good cost estimates)
  • Guided self-scheduling (take ⌈(tasks left)/p⌉ work)
  • Tapering (size chunks based on variance)
  • Weighted factoring (GSS with heterogeneity)
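
A minimal sketch of the chunking tradeoff using OpenMP loop schedules. The loop body do_task and the problem size n are illustrative stand-ins; static hands out fixed chunks up front, while dynamic and guided assign chunks to threads as they become idle.

    #include <omp.h>

    double do_task(int i);   /* hypothetical per-iteration work of uneven cost */

    void run_tasks(int n, double *out)
    {
        /* Fixed chunk size: low scheduling overhead, may balance poorly. */
        #pragma omp parallel for schedule(static, 64)
        for (int i = 0; i < n; ++i) out[i] = do_task(i);

        /* Dynamic chunks: better balance, more scheduling overhead. */
        #pragma omp parallel for schedule(dynamic, 8)
        for (int i = 0; i < n; ++i) out[i] = do_task(i);

        /* Guided: large chunks first, tapering off (GSS-like). */
        #pragma omp parallel for schedule(guided)
        for (int i = 0; i < n; ++i) out[i] = do_task(i);
    }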

SLIDE 8

Static dependency and graph partitioning

  • Graph G = (V, E) with vertex and edge weights
  • Goal: even partition with small edge cut (comm volume)
  • Optimal partitioning is NP-complete – use heuristics
  • Tradeoff quality vs speed
  • Good software exists (e.g. METIS); see the sketch below
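
A sketch of what the call looks like, assuming the METIS 5.x C API: partition a small graph in CSR form (here a placeholder 4-vertex path 0–1–2–3) into two parts while minimizing the edge cut.

    #include <metis.h>

    void partition_example(void)
    {
        idx_t nvtxs = 4, ncon = 1, nparts = 2, objval;
        idx_t xadj[]   = {0, 1, 3, 5, 6};      /* CSR row pointers        */
        idx_t adjncy[] = {1, 0, 2, 1, 3, 2};   /* CSR adjacency lists     */
        idx_t part[4];                         /* output: vertex -> part  */

        METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                            NULL, NULL, NULL,           /* unit vertex/edge weights */
                            &nparts, NULL, NULL, NULL,  /* default balance/options  */
                            &objval, part);
        /* part[i] is the partition of vertex i; objval is the edge cut. */
    }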

SLIDE 9

The limits of graph partitioning

What if

  • We don’t know task costs?
  • We don’t know the communication/dependency pattern?
  • These things change over time?

May want dynamic load balancing! Even in the regular case: not every problem looks like an undirected graph!

SLIDE 10

Dependency graphs

So far: graphs for dependencies between unknowns. For dependencies between tasks or computations:

  • Arrow from A to B means that B depends on A
  • Result is a directed acyclic graph (DAG)

SLIDE 11

Example: Longest Common Subsequence

Goal: longest sequence of (not necessarily contiguous) characters common to strings S and T.

Recursive formulation:

    LCS[i, j] = max(LCS[i − 1, j], LCS[i, j − 1])   if S[i] ≠ T[j]
    LCS[i, j] = 1 + LCS[i − 1, j − 1]               if S[i] = T[j]

Dynamic programming: form a table of LCS[i, j].
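
A minimal serial sketch of the bottom-up table fill (names are illustrative; L is an (m+1) × (n+1) table with row 0 and column 0 pre-zeroed):

    /* Fill L[i][j] = length of an LCS of s[0..i-1] and t[0..j-1]. */
    void lcs_table(const char *s, int m, const char *t, int n, int **L)
    {
        for (int i = 1; i <= m; ++i)
            for (int j = 1; j <= n; ++j)
                if (s[i-1] == t[j-1])
                    L[i][j] = 1 + L[i-1][j-1];
                else
                    L[i][j] = (L[i-1][j] > L[i][j-1]) ? L[i-1][j] : L[i][j-1];
    }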

SLIDE 12

Dependency graphs

Can process entries in any order consistent with the dependencies. Available parallel work is limited early on and late! (See the wavefront sketch below.)
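
One such order is a wavefront over anti-diagonals: every entry with i + j = d depends only on diagonals d − 1 and d − 2, so each diagonal is a parallel loop. A minimal OpenMP sketch, assuming the same s, t, L as above; note how the short diagonals at the start and end limit the available parallelism:

    void lcs_table_wavefront(const char *s, int m, const char *t, int n, int **L)
    {
        for (int d = 2; d <= m + n; ++d) {
            int ilo = (d - n > 1) ? d - n : 1;   /* keep 1 <= j = d - i <= n */
            int ihi = (d - 1 < m) ? d - 1 : m;
            #pragma omp parallel for
            for (int i = ilo; i <= ihi; ++i) {
                int j = d - i;
                if (s[i-1] == t[j-1])
                    L[i][j] = 1 + L[i-1][j-1];
                else
                    L[i][j] = (L[i-1][j] > L[i][j-1]) ? L[i-1][j] : L[i][j-1];
            }
        }
    }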

SLIDE 13

Dependency graphs

Partition into coarser-grain tasks for locality?

SLIDE 14

Dependency graphs

Dependence between coarse tasks limits parallelism.

SLIDE 15

Alternate perspective

Recall the LCS recurrence:

    LCS[i, j] = max(LCS[i − 1, j], LCS[i, j − 1])   if S[i] ≠ T[j]
    LCS[i, j] = 1 + LCS[i − 1, j − 1]               if S[i] = T[j]

Two approaches to LCS:

  • Solve subproblems from bottom up
  • Solve from top down and memoize common subproblems

Parallel question: shared memoization (and synchronize) or independent memoization (and redundant computation)?
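
A minimal serial sketch of the top-down alternative: memoized recursion. Here memo is an (m+1) × (n+1) table initialized to −1 ("not yet computed"); names are illustrative. In a parallel version, memo is exactly the shared state the question above is about.

    int lcs_memo(const char *s, const char *t, int i, int j, int **memo)
    {
        if (i == 0 || j == 0) return 0;
        if (memo[i][j] >= 0) return memo[i][j];   /* reuse a memoized subproblem */
        int v;
        if (s[i-1] == t[j-1]) {
            v = 1 + lcs_memo(s, t, i-1, j-1, memo);
        } else {
            int a = lcs_memo(s, t, i-1, j,   memo);
            int b = lcs_memo(s, t, i,   j-1, memo);
            v = (a > b) ? a : b;
        }
        return memo[i][j] = v;
    }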

SLIDE 16

Load balancing and task-based parallelism

[Figure: example task DAG]

  • Task DAG captures data dependencies
  • May be known at outset or dynamically generated
  • Topological sort reveals parallelism opportunities

SLIDE 17

Basic parameters

  • Task costs
      • Do all tasks have equal costs?
      • Costs known statically, at creation, at completion?
  • Task dependencies
      • Can tasks be run in any order?
      • If not, when are dependencies known?
  • Locality
      • Should tasks be co-located to reduce communication?
      • When is this information known?

SLIDE 18

Task costs

  • Easy: equal unit cost tasks (branch-free loops)
  • Harder: different, known times (sparse MVM)
  • Hardest: costs unknown until completed (search)

SLIDE 19

Dependencies

  • Easy: dependency-free loop (Jacobi sweep)
  • Harder: tasks have predictable structure (some DAG)
  • Hardest: structure is dynamic (search, sparse LU)

SLIDE 20

Locality/communication

When do you communicate?

  • Easy: Only at start/end (embarrassingly parallel)
  • Harder: In a predictable pattern (elliptic PDE solver)
  • Hardest: Unpredictable (discrete event simulation)

SLIDE 21

A spectrum of solutions

How much we can do depends on cost, dependency, locality

  • Static scheduling
      • Everything known in advance
      • Can schedule offline (e.g. graph partitioning)
      • Example: Shallow water solver
  • Semi-static scheduling
      • Everything known at start of step (for example)
      • Can use offline ideas (e.g. Kernighan-Lin refinement)
      • Example: Particle-based methods
  • Dynamic scheduling
      • Don’t know what we’re doing until we’ve started
      • Have to use online algorithms
      • Example: most search problems

SLIDE 22

Search problems

  • Different set of strategies from physics sims!
  • Usually require dynamic load balance
  • Examples:
      • Optimal VLSI layout
      • Robot motion planning
      • Game playing
      • Speech processing
      • Reconstructing phylogeny
      • ...

SLIDE 23

Example: Tree search


  • Tree unfolds dynamically during search
  • May be common problems on different paths (graph)
  • Graph may or may not be explicit in advance

SLIDE 24

Search algorithms

Generic search:

    put root on stack/queue
    while stack/queue has work
        remove node n from stack/queue
        if n satisfies goal, return
        mark n as searched
        add viable unsearched children of n to stack/queue

(Can branch-and-bound.)

Variants: DFS (stack), BFS (queue), A∗ (priority queue), ...

SLIDE 25

Simple parallel search

[Figure: search tree with subtrees assigned to processors]

Static load balancing:

  • Each new task on an idle processor until all have a subtree
  • Not very effective without work estimates for subtrees!
  • How can we do better?

SLIDE 26

Centralized scheduling

[Figure: workers 0–3 requesting work ("Next?") from a central queue]

Idea: obvious parallelization of standard search

  • Locks on shared data structure (stack, queue, etc)
  • Or might be a manager task

SLIDE 27

Centralized scheduling

Teaser: What could go wrong with this parallel BFS?

    put root in queue
    fork
        obtain queue lock
        while queue has work
            remove node n from queue
            release queue lock
            process n, mark as searched
            obtain queue lock
            enqueue unsearched children of n
        release queue lock
    join

SLIDE 28

Centralized scheduling

Teaser: What could go wrong with this parallel BFS?

    put root in queue; workers_active = 0
    fork
        obtain queue lock
        while queue has work or workers_active > 0
            remove node n from queue; workers_active++
            release queue lock
            process n, mark as searched
            obtain queue lock
            enqueue unsearched children of n; workers_active--
        release queue lock
    join
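
A hedged pthreads sketch of the active-worker-counter idea: a worker exits only when the queue is empty and no peer is still expanding a node (whose children might yet be enqueued). The queue and node helpers are illustrative stubs, and the empty-but-not-done case simply retries rather than blocking.

    #include <pthread.h>

    typedef struct node node_t;
    node_t *queue_pop(void);                /* returns NULL if queue is empty */
    void    queue_push_children(node_t *n); /* enqueue unsearched children    */
    void    process(node_t *n);             /* search n, mark as searched     */

    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static int workers_active = 0;

    void *worker(void *arg)
    {
        for (;;) {
            pthread_mutex_lock(&qlock);
            node_t *n = queue_pop();
            if (n) workers_active++;
            int done = (n == NULL) && (workers_active == 0);
            pthread_mutex_unlock(&qlock);

            if (done) break;       /* queue empty and nobody can refill it  */
            if (!n)   continue;    /* empty for now, but peers still busy   */

            process(n);

            pthread_mutex_lock(&qlock);
            queue_push_children(n);
            workers_active--;
            pthread_mutex_unlock(&qlock);
        }
        return NULL;
    }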

SLIDE 29

Centralized task queue

  • Called self-scheduling when applied to loops
      • Tasks might be ranges of loop indices
      • Assume independent iterations
      • Loop body has unpredictable time (or do it statically)
  • Pro: dynamic, online scheduling
  • Con: centralized, so doesn’t scale
  • Con: high overhead if tasks are small

SLIDE 30

Beyond centralized task queue

[Figure: workers 0–3; an idle worker steals work directly from a peer ("Yoink!") instead of asking a central queue ("Next?")]

SLIDE 31

Beyond centralized task queue

Basic distributed task queue idea:

  • Each processor works on part of a tree
  • When done, get work from a peer
  • Or if busy, push work to a peer
  • Requires asynch communication

Also goes by work stealing, work crews, ... Implemented in OpenMP, Cilk, X10, CUDA, QUARK, SMPSs, ...
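
A minimal sketch of this style with OpenMP tasks: each child of a node becomes a task, and the OpenMP runtime balances the dynamically generated tasks across threads (typically by work stealing). The node type and the helpers is_goal and record_solution are illustrative.

    typedef struct node {
        struct node **children;
        int nchildren;
    } node_t;

    int  is_goal(node_t *n);
    void record_solution(node_t *n);

    void search(node_t *n)
    {
        if (is_goal(n)) { record_solution(n); return; }
        for (int i = 0; i < n->nchildren; ++i) {
            #pragma omp task firstprivate(i)
            search(n->children[i]);
        }
        #pragma omp taskwait   /* finish this subtree before returning */
    }

    void search_tree(node_t *root)
    {
        #pragma omp parallel
        #pragma omp single     /* one thread seeds the task pool */
        search(root);
    }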

SLIDE 32

Picking a donor

Could use:

  • Asynchronous round-robin
  • Global round-robin (keep current donor pointer at proc 0)
  • Randomized – optimal with high probability!
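
A tiny sketch of the randomized option: an idle worker polls a uniformly random peer other than itself (self is this worker's rank among p workers; names are illustrative).

    #include <stdlib.h>

    int pick_victim(int self, int p)
    {
        int v = rand() % (p - 1);          /* assumes p >= 2 */
        return (v >= self) ? v + 1 : v;    /* skip over self */
    }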

SLIDE 33

Diffusion-based balancing

  • Problem with random polling: communication cost!
  • But not all connections are equal
  • Idea: prefer to poll more local neighbors
  • Average out load with neighbors ⇒ diffusion!
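
A minimal sketch of one diffusion step, assuming each process can see its neighbors' current loads (here a 1D ring, so just left and right); alpha is an illustrative damping factor:

    /* New load after nudging toward the neighbors' loads. */
    double diffuse_step(double my_load, double left_load, double right_load,
                        double alpha)
    {
        return my_load + alpha * ((left_load  - my_load)
                                + (right_load - my_load));
    }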

SLIDE 34

Mixed parallelism

  • Today: mostly coarse-grain task parallelism
  • Other times: fine-grain data parallelism
  • Why not do both? Switched parallelism.

SLIDE 35

Takeaway

  • Lots of ideas, not one size fits all!
  • Axes: task size, task dependence, communication
  • Dynamic tree search is a particularly hard case!
  • Fundamental tradeoffs:
      • Overdecompose (load balance) vs keep tasks big (overhead, locality)
      • Steal work globally (balance) vs steal from neighbors (comm. overhead)
  • Sometimes hard to know when code should stop!
