Loris Marchal HDR defense Memory and data aware scheduling - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Loris Marchal — HDR defense Memory and data aware scheduling

committee:

  • Ümit Çatalyürek (reviewer), Georgia Tech.
  • Pierre Manneback, Polytech-Mons
  • Alix Munier Kordon, Univ. Paris 6
  • Cynthia Phillips (reviewer), Sandia Nat. Lab.
  • Yves Robert, ENS Lyon
  • Denis Trystram (reviewer), Grenoble INP

slide-2
SLIDE 2

2 / 46

Position and supervision

◮ CNRS researcher since 2007
◮ 4 PhD students:

  ◮ Mathias Jacquelin: 2008 – 2011 (with Y. Robert)
    (research scientist at Lawrence Berkeley Nat. Lab., USA)
  ◮ Julien Herrmann: 2012 – 2015 (with Y. Robert)
    (postdoc at Georgia Tech., USA)
  ◮ Bertrand Simon: 2015 – 2018 (with F. Vivien)
    (defense in July)
  ◮ Changjiang Gou: 2016 – . . . (with A. Benoit)

slide-3
SLIDE 3

3 / 46

Motivation and context – scientific computing

◮ Simulation of larger systems with better accuracy
◮ Need for better performance on larger data

slide-9
SLIDE 9

4 / 46

Increasing complexity of computing platforms

Evolution of computing platforms:

[figure: distributed-memory platform, several memories linked by a network]

◮ Single processor
◮ n processors with shared memory
◮ n processors with communication delays
◮ n multi-core processors with memory hierarchies
◮ n multi-core processors and k accelerators (GPUs)

My focus: optimize application mapping and task scheduling for memory constraints and data movement

slide-10
SLIDE 10

5 / 46

Contributions

◮ Part I. Task graph scheduling with limited memory

  ◮ Chapter 2. Memory-aware dataflow model
  ◮ Chapter 3. Peak Memory and I/O Volume on Trees
  ◮ Chapter 4. Peak memory of series-parallel task graphs
  ◮ Chapter 5. Hybrid scheduling with bounded memory
  ◮ Chapter 6. Memory-aware parallel tree processing

◮ Part II. Minimizing data movement for matrix computations

  ◮ Chapter 7. Matrix product for memory hierarchy
  ◮ Chapter 8. Data redistribution for parallel computing
  ◮ Chapter 9. Dynamic scheduling for matrix computations

slide-12
SLIDE 12

Outline of this talk

Introduction

  • 1. Scheduling tree-shaped task graphs with bounded memory
  • 2. Data redistribution for parallel computing

Research perspectives

slide-14
SLIDE 14

8 / 46

Modeling scientific applications as task graphs

◮ Scientific applications divided into rather independent modules (tasks)
◮ Tasks linked through data dependencies
◮ Directed Acyclic Graph of tasks
◮ Abundant literature about (theoretical) task graph scheduling
◮ Popularized by runtime schedulers (ParSec, StarPU, XKaapi, OpenMP 4):

  ◮ Express dependencies between tasks
  ◮ Write code for each task on (possibly several) processing units
  ◮ Choose task mapping at runtime
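The dependence-driven execution that such runtime schedulers implement can be sketched with Python's standard `graphlib` module. This is a toy illustration only, not any of the runtimes above; the task names and DAG are made up:

```python
from graphlib import TopologicalSorter

# Hypothetical 4-task DAG: C consumes the outputs of A and B, D consumes C.
deps = {'C': {'A', 'B'}, 'D': {'C'}}

ts = TopologicalSorter(deps)
ts.prepare()
schedule = []
while ts.is_active():
    for task in ts.get_ready():   # tasks whose dependencies are satisfied
        schedule.append(task)     # a runtime would map these onto workers
        ts.done(task)             # completing a task may release successors
print(schedule)
```

A real runtime overlaps the `get_ready` set across workers instead of draining it sequentially; the dependency bookkeeping is the same.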

slide-19
SLIDE 19

9 / 46

Task graph scheduling and memory

◮ Consider a simple task graph (tasks A–F)
◮ Tasks have durations and memory demands

[figure: two-processor schedule of tasks A–F over time]

◮ Peak memory: maximum memory usage
◮ Trade-off between peak memory and performance (time to solution)

slide-22
SLIDE 22

10 / 46

Going back to sequential processing

◮ Temporary data require memory
◮ Scheduling influences the peak memory

When minimum memory demand > available memory:

◮ Store some temporary data on a larger, slower storage (disk)
◮ Out-of-core computing, with Input/Output operations (I/O)
◮ Decide both scheduling and eviction scheme

slide-23
SLIDE 23

11 / 46

(Black) Pebble game (1970s)

[figure: example DAG with vertices A–F]

Rules of the game (possible moves):

  • 1. Put a pebble on a source vertex
  • 2. Remove a pebble from a vertex
  • 3. Put a pebble on a vertex if all its predecessors are pebbled

Objectives:

◮ Pebble all output vertices
◮ Minimize the number of pebbles used
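The three moves can be checked mechanically. Below is a minimal sketch of a black-pebble-game simulator; the DAG and the move sequence are hypothetical examples, not taken from the talk:

```python
def play(preds, moves):
    """preds: dict vertex -> list of predecessors.
    moves: list of ('pebble', v) or ('unpebble', v).
    Returns the maximum number of pebbles simultaneously on the graph."""
    pebbled, peak = set(), 0
    for op, v in moves:
        if op == 'pebble':
            # Rules 1/3: legal iff v is a source or all predecessors are pebbled
            assert all(p in pebbled for p in preds[v]), f"cannot pebble {v}"
            pebbled.add(v)
            peak = max(peak, len(pebbled))
        else:                      # Rule 2: remove a pebble
            pebbled.discard(v)
    return peak

# Tiny DAG: A and B feed C; C and D feed the output E.
preds = {'A': [], 'B': [], 'C': ['A', 'B'], 'D': [], 'E': ['C', 'D']}
moves = [('pebble', 'A'), ('pebble', 'B'), ('pebble', 'C'),
         ('unpebble', 'A'), ('unpebble', 'B'),
         ('pebble', 'D'), ('pebble', 'E')]
print(play(preds, moves))  # peak number of pebbles used by this strategy
```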

slide-24
SLIDE 24

12 / 46

Register allocation & pebble game

How to efficiently compute the following arithmetic expression with the minimum number of registers? 7 + (1 + x)(5 − z) − ((u − t)/(2 + z)) + v

[figure: expression tree of the arithmetic expression]

Pebble movements correspond to register operations:

  • 1. Pebbling a source vertex: load an input into a register
  • 2. Removing a pebble: discard the value in a register
  • 3. Pebbling a vertex: compute a value into a new register

Objective: use a minimal number of registers

slide-33
SLIDE 33

12 / 46

Register allocation & pebble game

How to efficiently compute the following arithmetic expression with the minimum number of registers? 7 + (1 + x)(5 − z) − ((u − t)/(2 + z)) + v

Complexity results

Problem on trees:

◮ Polynomial algorithm [Sethi & Ullman, 1970]

General problem on DAGs (common subexpressions):

◮ PSPACE-complete [Gilbert, Lengauer & Tarjan, 1980]
◮ Without re-computation: NP-complete [Sethi, 1973]

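For trees, the polynomial algorithm amounts to the Sethi–Ullman labeling: a leaf needs one register, and an inner node with subtree costs l ≥ r needs max(l, r + 1). A minimal sketch (the tree encoding and the k-ary generalization "evaluate the costlier subtree first" are my own illustration, not the original formulation):

```python
def regs_needed(node, children):
    """Sethi-Ullman labeling: minimum registers to evaluate the expression
    tree rooted at `node` without spilling. children[v] lists v's children."""
    kids = children.get(node, [])
    if not kids:
        return 1                       # a leaf value occupies one register
    # Evaluate subtrees from most to least demanding: while subtree i is
    # evaluated, the i results already computed each hold one register.
    costs = sorted((regs_needed(k, children) for k in kids), reverse=True)
    return max(c + i for i, c in enumerate(costs))

# (1 + x) * (5 - z): each + / - node needs 2 registers, the product needs 3.
children = {'*': ['+', '-'], '+': ['1', 'x'], '-': ['5', 'z']}
print(regs_needed('*', children))
```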

slide-36
SLIDE 36

13 / 46

Red-Blue pebble game (Hong & Kung 1981)

[figure: example DAG with vertices A–F]

Rules of the game (possible moves):

  • 1. Put a red pebble on a source vertex
  • 2. Remove a red pebble from a vertex
  • 3. Put a red pebble on a vertex if all its predecessors are red-pebbled
  • 4. Put a red pebble on a blue-pebbled vertex
  • 5. Put a blue pebble on a red-pebbled vertex
  • 6. Remove a blue pebble from a vertex
  • 7. Never use more than M red pebbles

Objective: pebble the graph with the minimum number of applications of rules 4/5

slide-37
SLIDE 37

14 / 46

Red-Blue pebble game and I/O complexity

[figure: expression tree with red and blue pebbles]

Analogy with out-of-core processing:

◮ red pebbles: memory slots
◮ blue pebbles: secondary storage (disk)
◮ red → blue: write to disk, evict from memory
◮ blue → red: read from disk, load into memory
◮ M: number of available memory slots

Objective: minimum number of I/O operations

slide-38
SLIDE 38

15 / 46

Red/Blue pebble game – Results

Idea of Hong & Kung:

◮ Partition the graph into sets with at most M reads and writes
◮ Number of sets needed ⇒ lower bound on I/Os

Lower bounds on I/Os:

◮ Product of two n × n matrices: Θ(n³ / √M)
◮ Other regular graphs (FFT)

Later extended to other matrix operations:

◮ Lower bounds
◮ Communication-avoiding algorithms
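The matching upper bound for matrix product comes from blocking: if three b × b tiles fit in fast memory (M ≈ 3b²), the tiled product moves Θ(n³/√M) words. A pure-Python sketch of the tiling, for illustration only (a real kernel would call BLAS):

```python
def tiled_matmul(A, B, b):
    """Product of two n x n matrices (lists of lists), computed tile by tile.
    Each of the (n/b)^3 tile-multiplications touches three b x b tiles, so
    with M ~ 3*b*b words of fast memory the data movement is about
    2*(n/b)^3 * b^2 = 2*n^3/b = Theta(n^3 / sqrt(M)) words."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # one tile-multiplication: only three b x b tiles are "hot"
                for i in range(i0, min(i0 + b, n)):
                    for k in range(k0, min(k0 + b, n)):
                        a_ik = A[i][k]
                        for j in range(j0, min(j0 + b, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```

The tile size b is the only tuning knob: larger b means fewer tile loads, up to the point where three tiles no longer fit in fast memory.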

slide-40
SLIDE 40

16 / 46

Summary

Three problems:

◮ Memory minimization (Black pebble game)
◮ I/O minimization for out-of-core processing (Red-Blue pebble game)
◮ Memory/Time tradeoff for parallel processing

Shift of focus:

◮ Pebble games limited to unit-size data
◮ Target coarse-grain tasks, with heterogeneous data sizes

slide-41
SLIDE 41

17 / 46

Tree-shaped task graphs

◮ Multifrontal sparse matrix factorization
◮ To cope with complex/heterogeneous platforms:

  ◮ Express the factorization as a task graph
  ◮ Schedule it with a specialized runtime

◮ Assembly/Elimination tree: task graph is an in-tree

Problem:

◮ Large temporary data
◮ Memory becomes a bottleneck
◮ Schedule trees with limited memory

slide-59
SLIDE 59

18 / 46

Tree traversal influences peak memory

[figure: example in-tree with node weights (temporary data) and edge weights (data sizes)]

◮ Nodes: tasks
◮ Node weight: temporary data (mi)
◮ Edges: dependencies (data)
◮ Edge weight: data size (di,j)
◮ Processing a node: load its inputs + its output + its temporary data
◮ Memory profile of one traversal (→): 4, 2, 6, 4, 8, 3, 14, 6, 9 (peak 14)
◮ Memory profile of another traversal (←): 11, 7, 9, 11, 9 (peak 11)

Focus on two problems:

◮ How to minimize the memory requirement of a tree?

  ◮ Best post-order traversal
  ◮ Optimal traversal

◮ Given an amount of available memory, how to efficiently process a tree?

  ◮ Parallel processing
  ◮ Goal: minimize processing time
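Such memory profiles can be reproduced mechanically. Here is a small simulator for the peak memory of a given sequential traversal; the tree encoding and the weights in the example are my own (the slide's figure is not reproduced here):

```python
def peak_memory(order, parent, m, d):
    """Peak memory of a sequential traversal of an in-tree.
    order: tasks in processing order (children before their parent);
    parent[v]: parent of v (None for the root); m[v]: temporary data of v;
    d[v]: size of v's output. While v runs, memory holds every output
    produced but not yet consumed, plus d[v] and m[v]; the inputs of v
    are freed when v completes."""
    held, peak = {}, 0
    for v in order:
        live = sum(held.values()) + d[v] + m[v]  # v's inputs are in `held`
        peak = max(peak, live)
        for c in [c for c in held if parent[c] == v]:
            del held[c]                          # v's inputs are now consumed
        held[v] = d[v]                           # v's output stays in memory
    return peak

# Tiny example: leaves a and b feed the root r.
parent = {'a': 'r', 'b': 'r', 'r': None}
m = {'a': 0, 'b': 0, 'r': 1}
d = {'a': 3, 'b': 2, 'r': 1}
print(peak_memory(['a', 'b', 'r'], parent, m, d))
```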

slide-66
SLIDE 66

19 / 46

Best post-order traversal for trees [Liu 86]

Post-Order: totally process a subtree before starting another one

[figure: root r with subtrees 1, …, n; subtree i has peak memory Pi and residual output di]

◮ For each subtree: peak memory Pi, residual memory di
◮ Given a processing order 1, …, n, the peak memory is:

  max( P1, d1 + P2, d1 + d2 + P3, …, Σ_{i<n} di + Pn, Σ_{i≤n} di + mr + dr )

◮ Optimal order: non-increasing Pi − di
◮ Best post-order traversal is optimal for unit-weight trees
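Liu's rule translates directly into a recursive computation. A minimal sketch (hypothetical tree encoding, same conventions as before: m[v] temporary data, d[v] output size):

```python
def best_postorder_peak(v, children, m, d):
    """Peak memory of the best post-order traversal of the subtree rooted
    at v, following Liu's rule: visit children by non-increasing Pi - di.
    children[v]: list of children of v."""
    pairs = sorted(
        ((best_postorder_peak(c, children, m, d), d[c])
         for c in children.get(v, [])),
        key=lambda pd: pd[0] - pd[1], reverse=True)
    held = peak = 0
    for P, res in pairs:
        peak = max(peak, held + P)   # child runs with earlier residuals held
        held += res                  # only the child's output remains
    # finally process v itself: all inputs + temporary data + own output
    return max(peak, held + m[v] + d[v])
```

Each term of the max on the slide appears once: `held + P` realizes the d1 + … + Pi terms, and the final line the Σ di + mr + dr term.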

slide-72
SLIDE 72

20 / 46

Post-Order vs. optimal traversals

◮ In some cases, it is interesting to stop within a subtree (if there exists a cut with small weight)
◮ For any K, it is possible to build a tree such that post-order uses K times as much memory as the optimal traversal

                                          actual assembly trees   random trees
  Fraction of non-optimal traversals              4.2%                61%
  Maximum increase compared to optimal             18%                22%
  Average increase compared to optimal              1%                12%

Optimal algorithms:

◮ First algorithm proposed by [Liu 87]: complex multi-way merge, O(n²)
◮ New algorithm: recursive exploration of the tree, O(n²), faster in practice
  [M. Jacquelin, L. Marchal, Y. Robert & B. Uçar, IPDPS 2011]
slide-73
SLIDE 73

21 / 46

Model for parallel tree processing

◮ p identical processors
◮ Shared memory of size M
◮ Task i has execution time pi
◮ Parallel processing of nodes ⇒ larger memory
◮ Trade-off time vs. memory: bi-objective problem

  ◮ Peak memory
  ◮ Makespan (total processing time)

[figure: example tree with temporary data mi and output sizes di]

slide-74
SLIDE 74

22 / 46

NP-Completeness in the pebble game model

Background:

◮ Makespan minimization NP-complete for trees (P|trees|Cmax)
◮ Polynomial with unit-weight tasks (P|pi = 1, trees|Cmax)
◮ Pebble game polynomial on trees

Pebble-game model:

◮ Unit execution times: pi = 1
◮ Unit memory costs: mi = 0, di = 1 (pebble edges; equivalent to the pebble game for trees)

Theorem

Deciding whether a tree can be scheduled using at most B pebbles in at most C steps is NP-complete.
[L. Eyraud-Dubois, L. Marchal, O. Sinnen, F. Vivien, TOPC 2015]

slide-75
SLIDE 75

23 / 46

Space-Time tradeoff

No guarantee on both memory and time simultaneously:

Theorem 1

There is no algorithm that is both an α-approximation for makespan minimization and a β-approximation for peak memory minimization when scheduling tree-shaped task graphs.

Lemma: for a schedule with peak memory M and makespan Cmax, M × Cmax ≥ 2(n − 1).
Proof idea: each edge stays in memory for at least 2 time steps.

Theorem 2

Any algorithm that is an α(p)-approximation for makespan and a β(p)-approximation for peak memory on p ≥ 2 processors satisfies α(p) β(p) ≥ 2p / (⌈log(p)⌉ + 2).
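The lemma's counting argument can be written out in one chain, with mem(t) the memory in use at step t, in the pebble-game model where each of the n − 1 edges has size 1 and every task runs for one step:

```latex
M \cdot C_{\max} \;\ge\; \sum_{t=1}^{C_{\max}} \mathrm{mem}(t)
  \;\ge\; \sum_{e \in E} \#\{t : e \text{ in memory at step } t\}
  \;\ge\; 2\,(n-1),
```

since every edge occupies memory at least during the step its producer runs and the step its consumer runs. Hence a schedule with peak memory M has makespan at least 2(n − 1)/M: making one objective small forces the other up, which is the source of the inapproximability results above.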

slide-76
SLIDE 76

24 / 46

How to cope with limited memory?

◮ When processing a tree on a given machine: bounded memory
◮ Objective: minimize processing time under this constraint
◮ NB: bounded memory ≥ memory for sequential processing
◮ Intuition:

  ◮ When data sizes ≪ memory bound: process many tasks in parallel
  ◮ When approaching the memory bound: limit parallelism, rely on a (memory-friendly) sequential traversal

Existing (system) approach:

◮ Book memory as in sequential processing

slide-89
SLIDE 89

25 / 46

Conservative approach: task activation

◮ From [Agullo, Buttari, Guermouche & Lopez 2013] ◮ Choose a sequential task order (e.g. best post-order) ◮ While memory available, activate tasks in this order:

book memory for their output + tmp. data

◮ Process only activated tasks (with given scheduling priority)

When a tasks completes:

◮ Free inputs ◮ Activate as many new tasks as possible ◮ Then, start scheduling activated tasks

completed running activated

◮ Can cope with very small memory bounds
◮ No memory reuse

slide-90
SLIDE 90

26 / 46

Refined activation: predict memory reuse

(figure: tree with activated, completed and running tasks)

◮ Follow the same activation approach
◮ When activating a node:
  ◮ check how much memory is already booked by its subtree
  ◮ book only what is missing (if needed)
◮ When completing a node:
  ◮ distribute the booked memory to all activated ancestors
  ◮ then release the remaining memory (if any)
◮ Proof of termination:
  ◮ based on a sequential schedule using less than the memory bound
  ◮ processes the whole tree without running out of memory
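A minimal sketch of this booking mechanism, under simplifying assumptions (a single memory need per node, booking tracked per subtree; all names are illustrative):

```python
class Node:
    """Tree node with a memory need and the amount of memory
    currently booked on behalf of its subtree (illustrative model)."""
    def __init__(self, need, parent=None):
        self.need, self.parent, self.booked = need, parent, 0

def activate(node, free_mem):
    """Refined activation: book only what the subtree is missing."""
    missing = max(0, node.need - node.booked)
    if missing > free_mem:
        return free_mem, False
    node.booked += missing
    return free_mem - missing, True

def complete(node, free_mem):
    """On completion, hand the booking up to the parent (memory
    reuse), then release whatever the parent does not need."""
    if node.parent is not None:
        reused = min(node.booked, node.parent.need - node.parent.booked)
        node.parent.booked += reused
        free_mem += node.booked - reused  # release the surplus
    else:
        free_mem += node.booked
    node.booked = 0
    return free_mem

root = Node(need=5)
leaf = Node(need=3, parent=root)
free_mem, ok = activate(leaf, 10)        # books 3 units for the leaf
free_mem = complete(leaf, free_mem)      # the 3 units migrate to root
free_mem, ok = activate(root, free_mem)  # books only the missing 2
```

The point of the refinement is visible in the last line: activating the root costs 2 units instead of 5, because its subtree already holds 3 booked units.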

slide-91
SLIDE 91

27 / 46

New makespan lower bound

Theorem (Memory-aware makespan lower bound).

Cmax ≥ (1/M) · Σi MemNeededi × ti

◮ M: memory bound
◮ Cmax: makespan (total processing time)
◮ MemNeededi: memory needed to process task i
◮ ti: processing time of task i

(figure: memory usage over time, bounded by M over the makespan; task i occupies MemNeededi units during ti)
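The bound follows from a memory-time area argument: each task i occupies MemNeededi units of the memory M for ti time units, so the total area Σi MemNeededi × ti must fit under the rectangle M × Cmax. A quick numerical check with illustrative task values:

```python
def mem_aware_lower_bound(tasks, M):
    """Cmax >= (1/M) * sum_i MemNeeded_i * t_i.
    `tasks` is a list of (MemNeeded_i, t_i) pairs."""
    return sum(mem * t for mem, t in tasks) / M

# Three tasks with memory bound M = 10:
# area = 4*2 + 6*3 + 5*2 = 36, hence Cmax >= 3.6
lb = mem_aware_lower_bound([(4, 2), (6, 3), (5, 2)], M=10)
```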

slide-92
SLIDE 92

28 / 46

Simulation on assembly trees

◮ Dataset: assembly trees of actual sparse matrices
◮ Algorithms:
  ◮ Activation from [Agullo et al., Europar 2013]
  ◮ MemBooking
◮ Sequential tasks (simple performance model)
◮ 8 processors (similar results for 2, 4, 16 and 32)
◮ Reference memory MPO: peak memory of the best sequential post-order
◮ Activation and execution orders: best sequential post-order
◮ Makespan normalized by max(CP, W/p, MemAwareLB)

slide-93
SLIDE 93

29 / 46

Simulations: total processing time

(figure: normalized makespan vs. normalized memory bound for the Activation and MemBooking heuristics)

◮ MemBooking is able to activate more nodes and increase parallelism
◮ Even under scarce memory conditions

[G. Aupy, C. Brasseur, L. Marchal, IPDPS 2017]

slide-94
SLIDE 94

30 / 46

Conclusion on memory-aware tree scheduling

Summary:

◮ Related to pebble games
◮ Well-known sequential algorithms for trees
◮ Parallel processing is difficult:
  ◮ complexity and inapproximability results
  ◮ efficient booking heuristics (with guaranteed termination)

Other contributions in this area:

◮ Optimal sequential algorithm for SP-graphs
◮ Complexity and heuristics for two types of cores (hybrid)
◮ I/O volume minimization: optimal sequential algorithm for homogeneous trees
◮ Guaranteed heuristic for memory-bounded parallel scheduling of DAGs

slide-95
SLIDE 95

Outline

Introduction

  • 1. Scheduling tree-shaped task graphs with bounded memory
  • 2. Data redistribution for parallel computing

Research perspectives

slide-96
SLIDE 96

32 / 46

Introduction

Distributed computing:

◮ Processors have their own memory
◮ Data transfers are needed, but costly (time, energy)
◮ Computing speed increases faster than network bandwidth
◮ Need to limit these communications

Following study:

◮ Data is originally (ill-)distributed
◮ The computation to be performed has a preferred data layout
◮ Should we redistribute the data? How?

slide-97
SLIDE 97

33 / 46

Data collection and storage

◮ Origin of data: sensors (e.g. satellites) that aggregate snapshots
◮ Data is partitioned and distributed before the computation:
  ◮ during the collection
  ◮ by a previous computation
◮ A computation kernel (e.g. a linear algebra kernel) must be applied to the data
◮ The initial data distribution may be inefficient for the computation kernel

slide-98
SLIDE 98

34 / 46

Data distribution and mapping

◮ A data distribution is usually defined to minimize the completion time of an algorithm
◮ Ex: 2D-cyclic
◮ There is not necessarily a single data distribution that maximizes this efficiency
◮ Goal: find the one-to-one mapping (subsets of data → processors) for which the cost of the redistribution is minimal

slide-102
SLIDE 102

35 / 46

Data distribution / Data partition

◮ Let P be a finite set of identical processors
◮ Let A be a finite set of data items
◮ Data distribution D : A → P:
  ∀a ∈ A, D(a) = p ⇔ a is hosted on processor p
◮ Data partition P : A → P:
  ∀a, b ∈ A, P(a) = P(b) ⇔ a and b are hosted by the same processor
◮ A data distribution D is compatible with the data partition P iff there exists a permutation σ such that ∀a ∈ A, D(a) = σ(P(a))
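The compatibility condition is easy to test: all items of a part must share one processor, and distinct parts must sit on distinct processors. A dictionary-based sketch (the data is illustrative):

```python
def compatible(D, P):
    """True iff there is a permutation sigma with D(a) = sigma(P(a)):
    each part is hosted on exactly one processor, and the induced
    part -> processor map is injective."""
    sigma = {}
    for a in D:
        part, proc = P[a], D[a]
        if sigma.setdefault(part, proc) != proc:
            return False  # one part split over two processors
    procs = list(sigma.values())
    return len(set(procs)) == len(procs)  # distinct proc per part

D = {"a": 0, "b": 0, "c": 1}        # distribution: item -> processor
P = {"a": "x", "b": "x", "c": "y"}  # partition: item -> part
ok = compatible(D, P)               # sigma = {x: 0, y: 1} exists
bad = compatible({"a": 0, "b": 1, "c": 1}, P)  # part x is split
```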

slide-103
SLIDE 103

36 / 46

Cost of redistribution

◮ Hardware symmetry assumption: the efficiency of the computation algorithm is a function of the data partition
◮ Unit-size assumption: all data items are of the same size
◮ Evaluation of the redistribution with two metrics:
  ◮ total volume of communication: the total number of data items sent from one processor to another
  ◮ number of parallel communication steps: bidirectional one-port model
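Both metrics are easy to compute for a given pair of distributions. In the sketch below (names and data illustrative), the step count uses the classical bipartite edge-coloring argument: in the bidirectional one-port model each processor sends and receives at most one item per step, so the number of steps needed equals the maximum, over processors, of the number of items sent or received:

```python
def redistribution_cost(D_ini, D_tar):
    """Return (total volume, parallel steps) for moving unit-size
    items from distribution D_ini to D_tar (item -> processor maps).
    Steps = max over processors of max(#sent, #received), which is
    achievable by decomposing the communication graph into matchings
    (Konig's edge-coloring theorem for bipartite multigraphs)."""
    sent, recv = {}, {}
    volume = 0
    for item, src in D_ini.items():
        dst = D_tar[item]
        if src != dst:
            volume += 1
            sent[src] = sent.get(src, 0) + 1
            recv[dst] = recv.get(dst, 0) + 1
    steps = max([*sent.values(), *recv.values()], default=0)
    return volume, steps

D_ini = {"a": 0, "b": 0, "c": 1, "d": 2}
D_tar = {"a": 1, "b": 1, "c": 1, "d": 0}
vol, steps = redistribution_cost(D_ini, D_tar)  # 3 items move
```

Here processor 0 sends two items, so two steps suffice even though three items move in total.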

slide-104
SLIDE 104

37 / 46

Best redistribution to given partition

◮ For many algorithms, we know ideal data partitions that minimize completion time
◮ There are |P|! data distributions compatible with the ideal partition

Best redistribution problem

Given an initial data distribution Dini, find the target data distribution Dtar compatible with the ideal data partition that minimizes the cost of the redistribution.

◮ Optimal algorithms for each metric
◮ Based on building bipartite graphs and computing a perfect matching
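For small instances the |P|! candidate mappings can even be enumerated directly; the matching-based algorithms above replace this enumeration by a minimum-cost perfect matching between parts and processors. An illustrative brute-force sketch for the total-volume metric (names and data are assumptions, not the original code):

```python
from itertools import permutations

def best_target_distribution(D_ini, P, procs):
    """Among all mappings part -> processor, pick the one minimizing
    the number of items that must move away from D_ini."""
    parts = sorted(set(P.values()))
    # already[p][q]: items of part p already hosted on processor q
    already = {p: {q: 0 for q in procs} for p in parts}
    for item, part in P.items():
        already[part][D_ini[item]] += 1
    n_items = len(P)
    best_vol, best_sigma = None, None
    for perm in permutations(procs, len(parts)):
        # items already in place stay; everything else moves
        vol = n_items - sum(already[p][q] for p, q in zip(parts, perm))
        if best_vol is None or vol < best_vol:
            best_vol, best_sigma = vol, dict(zip(parts, perm))
    return best_vol, best_sigma

D_ini = {"a": 0, "b": 0, "c": 1, "d": 1}
P = {"a": "x", "b": "x", "c": "y", "d": "y"}
vol, sigma = best_target_distribution(D_ini, P, [0, 1])
```

On this toy instance the initial distribution already realizes the partition, so the best mapping moves nothing; a mismatched mapping would move all four items.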

slide-105
SLIDE 105

38 / 46

Redistribution followed by computation kernel

◮ Non-overlapping phases assumption:
  Ttot = Tredist(Dini → Dtar) + Tcomp(Dtar)
◮ Closed formula for Tredist(Dini → Dtar), depending on the communication model
◮ No closed formula for Tcomp(Dtar) in the general case

slide-106
SLIDE 106

39 / 46

NP-completeness for 1D Stencil

◮ Consider the simple case of an iterative 1D stencil

(figure: stencil dependencies between step t and step t + 1)

◮ Simple closed formula for Tcomp^stencil(Dtar) for both communication models

Theorem

Finding the optimal distribution Dtar that minimizes
Ttot = Tredist(Dini → Dtar) + Tcomp^stencil(Dtar)
is NP-complete in the strong sense.

slide-107
SLIDE 107

40 / 46

Heuristics for redistribution + computation

Naïve options:

◮ Do not redistribute (owner-compute)
◮ Canonical redistribution to the target partition:
  ◮ processor i gets part i

Using the previous algorithms:

◮ Compute the best redistribution for each metric:
  ◮ total volume (vol)
  ◮ redistribution steps (steps)

slide-108
SLIDE 108

41 / 46

Experimental setup

◮ Implementation with the ParSEC runtime
◮ Initial distribution: random balanced distributions
◮ Target partition Ptar: optimal partition for the considered computation kernel (QR: 2D block-cyclic)
◮ ParSEC moves data from the initial distribution to the target compute location when needed (computation/communication overlap)
◮ Target distribution computed according to four heuristics:
  ◮ owner-compute (default heuristic of ParSEC)
  ◮ canonical redistribution to Ptar
  ◮ best redistribution for total volume (vol)
  ◮ best redistribution for number of steps (steps)

slide-109
SLIDE 109

42 / 46

Results on QR factorization

◮ Improvements in total completion time (redistribution + computation)
◮ Compared to owner-compute (no redistribution)
◮ Average over 50 matrices

n     canonical   Vol. algo.   Steps algo.
16    41.9%       39.5%        43.4%
34    64.1%       67.7%        66.4%
52    65.8%       70.5%        71.2%
70    70.8%       72.7%        71.4%
88    70.8%       72.6%        72.4%

Results on a skewed distribution (2D block-cyclic + 50% of tiles randomly moved):

n     canonical   Vol. algo.   Steps algo.
16    27.0%       28.1%        28.1%
34    20.6%       25.5%        22.1%
52    13.6%       25.8%        26.2%
70    12.7%       14.5%        4.8%
88    12.0%       15.7%        13.4%

Results for ChunkSet (Earth science application) [J. Herrmann et al., Parallel Computing 2016]

slide-110
SLIDE 110

43 / 46

Data redistribution – Conclusion

Summary:

◮ Algorithms that find the optimal target distribution for different redistribution metrics
◮ NP-completeness proof for minimizing the redistribution time followed by a computation kernel
◮ Experimental validation on ParSEC for the QR factorization kernel

slide-111
SLIDE 111

Outline

Introduction

  • 1. Scheduling tree-shaped task graphs with bounded memory
  • 2. Data redistribution for parallel computing

Research perspectives

slide-112
SLIDE 112

45 / 46

Perspectives – scheduling problems

Data locality is still a very timely research topic.

Memory-aware scheduling for distributed memories:

◮ Consider data movement at the same time
◮ Trade-off between performance and data movement
◮ Partition trees/graphs for both performance and memory

Memory-aware work-stealing:

◮ Work-stealing: distributed, dynamic scheduler
◮ Existing lower/upper bounds on data locality
◮ How to derive memory guarantees?
◮ Based on which pre-computed information?

slide-113
SLIDE 113

46 / 46

Perspectives – runtime schedulers

Collaboration with runtime experts:

◮ Started during the SOLHAR project
◮ Need to adapt our algorithms:
  ◮ lower scheduling complexity
  ◮ make the algorithms dynamic (graph gradually uncovered)
  ◮ distribute scheduling decisions
◮ Possible tools:
  ◮ hierarchical scheduling
  ◮ precompute memory information on the graph

❀ New scheduling problems!

slide-114
SLIDE 114

Outline

Introduction

  • 1. Scheduling tree-shaped task graphs with bounded memory

    Introduction
    Pebble games
    Tree-shaped task graphs
    Post-Order vs. optimal peak memory
    Parallel processing of trees – complexity
    Parallel processing of trees – algorithms

  • 2. Data redistribution for parallel computing

    Redistributing data
    Coupling redistribution and computation
    Performance evaluation
    Conclusion

Research perspectives