Periodic I/O Scheduling for Supercomputers, by Guillaume Aupy, Ana Gainaru, and Valentin Le Fèvre (PowerPoint PPT Presentation)




slide-1
SLIDE 1

Periodic I/O Scheduling for Supercomputers

Guillaume Aupy1, Ana Gainaru2, Valentin Le Fèvre3
1 – Inria & U. of Bordeaux
2 – Vanderbilt University
3 – ENS Lyon & Inria

PMBS Workshop, November 2017

Slides available at https://project.inria.fr/dash/

slide-2
SLIDE 2

IO congestion in HPC systems

Some numbers for motivation:

◮ Computational power keeps increasing (Intrepid: 0.56 PFlop/s, Mira: 10 PFlop/s, Aurora: 450 PFlop/s (?)).
◮ I/O bandwidth increases at a slower rate (Intrepid: 88 GB/s, Mira: 240 GB/s, Aurora: 1 TB/s (?)).

slide-3
SLIDE 3

IO congestion in HPC systems

Some numbers for motivation:

◮ Computational power keeps increasing (Intrepid: 0.56 PFlop/s, Mira: 10 PFlop/s, Aurora: 450 PFlop/s (?)).
◮ I/O bandwidth increases at a slower rate (Intrepid: 88 GB/s, Mira: 240 GB/s, Aurora: 1 TB/s (?)).

In other terms:
◮ Intrepid can process 160 GB for every PFlop;
◮ Mira can process 24 GB for every PFlop;
◮ Aurora will (?) process 2.2 GB for every PFlop.

Congestion is coming.
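The GB-per-PFlop ratios above follow directly from the quoted peak numbers (88/0.56 ≈ 157, which the slide rounds to 160); a quick check:

```python
# GB of I/O each machine can move per PFlop of compute,
# from the peak numbers quoted above (Aurora figures are projections).
machines = {
    "Intrepid": {"pflops": 0.56, "io_gbs": 88.0},
    "Mira":     {"pflops": 10.0, "io_gbs": 240.0},
    "Aurora":   {"pflops": 450.0, "io_gbs": 1000.0},  # projected (?)
}

for name, m in machines.items():
    gb_per_pflop = m["io_gbs"] / m["pflops"]
    print(f"{name}: {gb_per_pflop:.1f} GB per PFlop")
```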

slide-4
SLIDE 4

Burst buffers: the solution?

Simplistically:

◮ If I/O bandwidth is available: use it
◮ Else, fill the burst buffers
◮ When I/O bandwidth becomes available: empty the burst buffers.

If the burst buffers are big enough, it should work.

slide-5
SLIDE 5

Burst buffers: the solution?

Simplistically:

◮ If I/O bandwidth is available: use it
◮ Else, fill the burst buffers
◮ When I/O bandwidth becomes available: empty the burst buffers.

If the burst buffers are big enough, it should work... right?

slide-6
SLIDE 6

Burst buffers: the solution?

Simplistically:

◮ If I/O bandwidth is available: use it
◮ Else, fill the burst buffers
◮ When I/O bandwidth becomes available: empty the burst buffers.

If the burst buffers are big enough, it should work... right?

Average I/O occupation: the sum, over all applications, of the volume of I/O transferred divided by the time they execute, normalized by the peak I/O bandwidth.

slide-7
SLIDE 7

Burst buffers: the solution?

Simplistically:

◮ If I/O bandwidth is available: use it
◮ Else, fill the burst buffers
◮ When I/O bandwidth becomes available: empty the burst buffers.

If the burst buffers are big enough, it should work... right?

Average I/O occupation: the sum, over all applications, of the volume of I/O transferred divided by the time they execute, normalized by the peak I/O bandwidth.

Given a set of data-intensive applications running concurrently:

◮ on Intrepid, they have a max average I/O occupation of 25%

slide-8
SLIDE 8

Burst buffers: the solution?

Simplistically:

◮ If I/O bandwidth is available: use it
◮ Else, fill the burst buffers
◮ When I/O bandwidth becomes available: empty the burst buffers.

If the burst buffers are big enough, it should work... right?

Average I/O occupation: the sum, over all applications, of the volume of I/O transferred divided by the time they execute, normalized by the peak I/O bandwidth.

Given a set of data-intensive applications running concurrently:

◮ on Intrepid, they have a max average I/O occupation of 25%
◮ on Mira, they have an average I/O occupation of 120 to 300%!
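The definition of average I/O occupation can be made concrete with a small helper; all numbers here are invented for illustration, not the Intrepid/Mira measurements:

```python
def average_io_occupation(apps, peak_bw):
    """Sum over all applications of (I/O volume / execution time),
    normalized by the peak I/O bandwidth (cf. definition above)."""
    return sum(vol / time for vol, time in apps) / peak_bw

# (volume transferred in GB, execution time in s): illustrative values
apps = [(500.0, 100.0), (1200.0, 300.0), (90.0, 60.0)]
occ = average_io_occupation(apps, peak_bw=12.0)  # GB/s, hypothetical
print(f"average I/O occupation: {occ:.0%}")  # -> 88%
```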

slide-9
SLIDE 9

Previously in IO cong.

“Online” scheduling (Gainaru et al., IPDPS’15):

◮ When an application is ready to do I/O, it sends a message to an I/O scheduler;
◮ Based on the other applications running and a priority function, the I/O scheduler gives a GO or NOGO to the application;
◮ If the application receives a NOGO, it pauses until a GO instruction;
◮ Else, it performs its I/O.
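A minimal sketch of the GO/NOGO protocol described above. The single-I/O-at-a-time policy and numeric priorities are simplifying assumptions, not the actual IPDPS'15 priority functions:

```python
import heapq

class IOScheduler:
    """Toy centralized I/O scheduler: at most one application does I/O
    at a time; waiting requests are served by priority (lower = first)."""
    def __init__(self):
        self.busy = None
        self.waiting = []  # heap of (priority, app_id)

    def request_io(self, app_id, priority):
        if self.busy is None:
            self.busy = app_id
            return "GO"
        heapq.heappush(self.waiting, (priority, app_id))
        return "NOGO"        # the application pauses until a GO

    def io_done(self, app_id):
        assert self.busy == app_id
        self.busy = None
        if self.waiting:     # wake the highest-priority waiter
            _, nxt = heapq.heappop(self.waiting)
            self.busy = nxt
            return nxt       # this application now receives its GO
        return None

sched = IOScheduler()
print(sched.request_io("A", priority=2))  # GO
print(sched.request_io("B", priority=1))  # NOGO
print(sched.io_done("A"))                 # B now receives its GO
```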

slide-10
SLIDE 10

Previously in IO cong.

[Figure: three applications App(1), App(2), App(3) sharing the I/O bandwidth B over time]

slide-20
SLIDE 20

Previously in IO cong.

[Figure: the resulting online schedule of App(1), App(2), App(3) over three periods, bandwidth vs. time under the bound B]

Approximately 10% improvement in application performance, with a 5% gain in system performance, on Intrepid.
slide-21
SLIDE 21

This work

Assumption: applications follow I/O patterns¹ that we can obtain (based on historical data, for instance).

◮ We use this information to compute an I/O time schedule;
◮ Each application then knows its GO/NOGO information and uses it to perform I/O.

Spoiler: it works very well (or at least it seems to).

¹ Periodic pattern, to be defined.

slide-22
SLIDE 22

I/O characterization of HPC applis

Hu et al. 2016

  • 1. Periodicity: computation and I/O phases (write operations such as checkpoints).
  • 2. Synchronization: parallel identical jobs lead to synchronized I/O operations.
  • 3. Repeatability: jobs run several times with different inputs.
  • 4. Burstiness: short bursts of write operations.

Idea: use the periodic behavior to compute periodic schedules.

slide-23
SLIDE 23

Platform model

◮ N unit-speed processors, each equipped with an I/O card of bandwidth b
◮ Centralized I/O system with total bandwidth B

[Figure: model instantiation for the Intrepid platform, with b = 0.1 Gb/s per node and total central bandwidth B]

slide-24
SLIDE 24

Application Model

K periodic applications are already scheduled in the system: App(k) = (β(k), w(k), vol(k)io).

◮ β(k) is the number of processors onto which App(k) is assigned;
◮ w(k) is the computation time of a period;
◮ vol(k)io is the volume of I/O to transfer after the w(k) units of time:

time(k)io = vol(k)io / min(β(k) · b, B)

[Figure: App(1), App(2), App(3) alternating compute phases w(k) and I/O phases, bandwidth vs. time under the bound B]
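The application model translates directly into code; a minimal sketch, with hypothetical platform and application numbers (not taken from any of the machines discussed):

```python
from dataclasses import dataclass

@dataclass
class App:
    beta: int      # β(k): number of processors assigned
    w: float       # w(k): computation time of a period (s)
    vol_io: float  # vol(k)io: I/O volume per period (GB)

    def time_io(self, b, B):
        """I/O time of a period: the transfer is limited either by the
        aggregated node cards (β(k)·b) or by the central system (B)."""
        return self.vol_io / min(self.beta * b, B)

# Hypothetical application and platform numbers.
app = App(beta=64, w=100.0, vol_io=30.0)
print(app.time_io(b=0.1, B=3.0))  # min(6.4, 3.0) = 3 GB/s -> 10.0 s
```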

slide-25
SLIDE 25

Objectives

If App(k) runs during a total time Tk and performs n(k) instances, we define:

ρ(k) = w(k) / (w(k) + time(k)io),   ρ̃(k) = n(k) · w(k) / Tk

slide-26
SLIDE 26

Objectives

If App(k) runs during a total time Tk and performs n(k) instances, we define:

ρ(k) = w(k) / (w(k) + time(k)io),   ρ̃(k) = n(k) · w(k) / Tk

SysEfficiency: maximize peak performance (average number of Flops):

maximize (1/N) · Σk=1..K β(k) · ρ̃(k)

slide-27
SLIDE 27

Objectives

If App(k) runs during a total time Tk and performs n(k) instances, we define:

ρ(k) = w(k) / (w(k) + time(k)io),   ρ̃(k) = n(k) · w(k) / Tk

SysEfficiency: maximize peak performance (average number of Flops):

maximize (1/N) · Σk=1..K β(k) · ρ̃(k)

Dilation: minimize the largest slowdown (fairness between users):

minimize maxk=1..K ρ(k) / ρ̃(k)
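Both objectives can be checked numerically from the definitions above; a sketch in which the β(k), w(k), time(k)io, n(k) and Tk values are invented for illustration:

```python
def rho(w, t_io):
    # ρ(k): efficiency if App(k) never waits for bandwidth
    return w / (w + t_io)

def rho_tilde(n, w, T):
    # ρ̃(k): efficiency actually achieved over a total time T
    return n * w / T

# (β(k), w(k), time(k)io, n(k), Tk): hypothetical values
apps = [(64, 100.0, 10.0, 9, 1000.0),
        (32, 50.0, 25.0, 12, 1000.0)]
N = 128  # total number of processors

sys_eff = sum(b * rho_tilde(n, w, T) for b, w, _, n, T in apps) / N
dilation = max(rho(w, t) / rho_tilde(n, w, T) for _, w, t, n, T in apps)
print(sys_eff, dilation)
```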

slide-28
SLIDE 28

High-level constraints

◮ Applications are already scheduled on the machines: not (yet) our job to do it;

slide-29
SLIDE 29

High-level constraints

◮ Applications are already scheduled on the machines: not (yet) our job to do it;
◮ We want the schedule information distributed over the applis: the goal is not to add a new congestion point;

slide-30
SLIDE 30

High-level constraints

◮ Applications are already scheduled on the machines: not (yet) our job to do it;
◮ We want the schedule information distributed over the applis: the goal is not to add a new congestion point;
◮ Computing a full I/O schedule over all iterations of all applications would be too expensive, (i) in time and (ii) in space;

slide-31
SLIDE 31

High-level constraints

◮ Applications are already scheduled on the machines: not (yet) our job to do it;
◮ We want the schedule information distributed over the applis: the goal is not to add a new congestion point;
◮ Computing a full I/O schedule over all iterations of all applications would be too expensive, (i) in time and (ii) in space;
◮ We want a minimum overhead for applis users: otherwise, our guess is, users might not like it that much.
slide-33
SLIDE 33

High-level constraints

◮ Applications are already scheduled on the machines: not (yet) our job to do it;
◮ We want the schedule information distributed over the applis: the goal is not to add a new congestion point;
◮ Computing a full I/O schedule over all iterations of all applications would be too expensive, (i) in time and (ii) in space;
◮ We want a minimum overhead for applis users: otherwise, our guess is, users might not like it that much.

We introduce Periodic Scheduling.

slide-34
SLIDE 34

Periodic schedules

[Figure (a): Periodic schedule (phases): an Init phase, then n repetitions of a pattern of length T (boundaries at c, T+c, 2T+c, ..., nT+c), then a Clean up phase]

[Figure (b): Detail of I/O in a period/pattern: the vol(k)io transfers of each application within one period of length T under the bandwidth bound B, with initW(4)1, endW(4)1 and initIO(4)1 marked for Application 4]

slide-35
SLIDE 35

Periodic schedules

Time schedule vs. what Application 4 sees

[Figure: one period of the schedule (from c to T+c) under the bandwidth bound B, showing the vol(k)io transfers of all applications, with initW(4)1, endW(4)1 and initIO(4)1 marked]

◮ Distributed information
◮ Low complexity
◮ Minimum overhead

slide-36
SLIDE 36

Periodic schedules

Time schedule vs. what Application 4 sees

[Figure: what Application 4 sees over one period (from c to T+c): Compute, Idle, IO + bw, Idle, IO + bw, Compute, delimited by initW(4)1, endW(4)1 and initIO(4)1]

◮ Distributed information
◮ Low complexity
◮ Minimum overhead

slide-37
SLIDE 37

Finding a schedule

Obj: an algorithm with good SysEfficiency and Dilation performance.

Questions:
1. Pattern length T?
2. How many instances of each application?
3. How to schedule them efficiently?

[Figure: one pattern of length T, bandwidth vs. time under the bound B, with initW(4)1, endW(4)1 and initIO(4)1 marked]

slide-38
SLIDE 38

Finding a schedule

Obj: an algorithm with good SysEfficiency and Dilation performance.

Questions:
1. Pattern length T?
2. How many instances of each application?
3. How to schedule them efficiently?

[Figure: one pattern of length T, bandwidth vs. time under the bound B, with initW(4)1, endW(4)1 and initIO(4)1 marked]

Answers:
1. Iterative search with exponential growth of T (from Tmin to Tmax).
2. Bound on the number of instances of each application: O(maxk(w(k) + time(k)io) / mink(w(k) + time(k)io)).
3. Greedy insertion of instances, with priority to applis with small Dilation.
slide-39
SLIDE 39

Finding a schedule

Obj: an algorithm with good SysEfficiency and Dilation performance.

Questions:
1. Pattern length T?
2. How many instances of each application?
3. How to schedule them efficiently?

[Figure: one pattern of length T, bandwidth vs. time under the bound B, with initW(4)1, endW(4)1 and initIO(4)1 marked]

Answers:
1. Iterative search with exponential growth of T (from Tmin to Tmax).
2. Bound on the number of instances of each application: O(maxk(w(k) + time(k)io) / mink(w(k) + time(k)io)).
3. Greedy insertion of instances, with priority to applis with small Dilation.

◮ Distributed information
◮ Low complexity
◮ Minimum overhead
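The three answers can be sketched in code. This is a toy version: I/O phases are serialized at full bandwidth, feasibility is a crude volume check, and a simple utilization-based priority stands in for the paper's Dilation-driven greedy insertion:

```python
from dataclasses import dataclass

@dataclass
class App:
    name: str
    w: float     # compute time of one instance
    t_io: float  # I/O time of one instance

def greedy_pattern(apps, T):
    """Greedily pack instances into a pattern of length T, assuming
    I/O phases are serialized at full bandwidth (toy feasibility)."""
    counts = {a.name: 0 for a in apps}
    io_used = 0.0
    while True:
        fits = [a for a in apps
                if (counts[a.name] + 1) * (a.w + a.t_io) <= T
                and io_used + a.t_io <= T]
        if not fits:
            return counts
        # priority to the application with the lowest utilization so far
        a = min(fits, key=lambda a: counts[a.name] * (a.w + a.t_io) / T)
        counts[a.name] += 1
        io_used += a.t_io

def find_pattern(apps, T_min, T_max):
    """Iterative search over the pattern length with exponential growth,
    keeping the pattern with the best packed compute utilization."""
    best, best_util = None, -1.0
    T = T_min
    while T <= T_max:
        counts = greedy_pattern(apps, T)
        util = sum(counts[a.name] * a.w for a in apps) / T
        if util > best_util:
            best, best_util = (T, counts), util
        T *= 2  # exponential growth of the pattern length
    return best

apps = [App("A", w=3.0, t_io=1.0), App("B", w=1.0, t_io=1.0)]
print(greedy_pattern(apps, T=8.0))  # -> {'A': 2, 'B': 4}
```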
slide-40
SLIDE 40

Model validation (I)

Announcement: it is hard to find application data (w(k), vol(k)io, β(k)).

If you have some, let’s talk.

slide-41
SLIDE 41

Model validation (I)

◮ Four periodic behaviors from the literature to instantiate the applications.
◮ Comparison between simulations and a real machine (Jupiter @Mellanox: 640 cores, b = 0.01 GB/s, B = 3 GB/s).
◮ We use the IOR benchmark to instantiate the applications on the cluster (ideal world: no communication other than I/O transfers).

App(k)             w(k) (s)  vol(k)io (GB)  β(k)
Turbulence1 (T1)   70        128.2          32,768
Turbulence2 (T2)   1.2       235.8          4,096
AstroPhysics (AP)  240       423.4          8,192
PlasmaPhysics (PP) 7554      34304          32,768

Sets of applications (counts of instances per set; per-application column alignment lost in extraction): 1: 10; 2: 8, 1; 3: 6, 2; 4: 4, 3; 5: 2, 1; 6: 2, 4; 7: 1, 2; 8: 1, 1; 9: 5; 10: 1, 1.
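Plugging the application table into the platform model (time(k)io = vol(k)io / min(β(k)·b, B)) gives the per-period I/O times implied on Jupiter. The β(k) here are the traces' processor counts; on the 640-core cluster the IOR runs scale these down, so this is only a model-side sanity check:

```python
# I/O time per period on Jupiter: time(k)io = vol(k)io / min(β(k)·b, B)
b, B = 0.01, 3.0  # GB/s per node, central bandwidth (Jupiter)
table = {  # name: (w(k) s, vol(k)io GB, β(k)), from the table above
    "T1": (70.0, 128.2, 32768),
    "T2": (1.2, 235.8, 4096),
    "AP": (240.0, 423.4, 8192),
    "PP": (7554.0, 34304.0, 32768),
}
for name, (w, vol, beta) in table.items():
    t_io = vol / min(beta * b, B)  # here β(k)·b > B for every app
    print(f"{name}: time_io = {t_io:.1f} s (compute w = {w} s)")
```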

slide-42
SLIDE 42

Model validation (II)

[Figure (c): SysEfficiency/Upper bound per set (1–10), for Periodic (expe), Periodic (simu), Online (expe), Online (simu) and Congestion]

[Figure (d): Dilation per set (1–10), for the same five strategies]

The performance estimated by our model is accurate within 3.8% for periodic schedules and 2.3% for online schedules.

slide-43
SLIDE 43

Results

◮ Periodic: our periodic algorithm;
◮ Online: the best performance (Dilation or SysEff, resp.) of any online algorithm in Gainaru et al., IPDPS’15;
◮ Congestion: current performance on the machine.

Set  Dilation  SysEff
1    −9.33%    +17.94%
2    −13.81%   +7.01%
3    −15.81%   +8.60%
4    −1.46%    +1.09%
5    −0.49%    +0.62%
6    −2.90%    +6.96%
7    −0.49%    +0.73%
8    −0.00%    +0.00%
9    −0.40%    +0.10%
10   −0.59%    +0.10%

Figure: SysEfficiency/Upper bound per set (1–10), for Periodic (expe/simu), Online (expe/simu) and Congestion.

slide-44
SLIDE 44

Results

◮ Periodic: our periodic algorithm;
◮ Online: the best performance (Dilation or SysEff, resp.) of any online algorithm in Gainaru et al., IPDPS’15;
◮ Congestion: current performance on the machine.

Set  Dilation  SysEff
1    −9.33%    +17.94%
2    −13.81%   +7.01%
3    −15.81%   +8.60%
4    −1.46%    +1.09%
5    −0.49%    +0.62%
6    −2.90%    +6.96%
7    −0.49%    +0.73%
8    −0.00%    +0.00%
9    −0.40%    +0.10%
10   −0.59%    +0.10%

Figure: Dilation per set (1–10), for Periodic (expe/simu), Online (expe/simu) and Congestion.

slide-45
SLIDE 45

More data

◮ We generate more sets of applications.
◮ We simulate them on instantiations of Intrepid and Mira.

[Figure: SysEfficiency/Upper bound and Dilation across the generated sets, Periodic vs. Online (best), on Intrepid (left) and Mira (right)]

slide-46
SLIDE 46

List of open issues / Future steps

◮ Study of robustness: what if w(k) and vol(k)io are not exactly known?
◮ Integrating non-periodic applications
◮ Burst-buffer integration/modeling
◮ Coupling the application scheduler to the I/O scheduler
◮ Evaluation on real applications

slide-47
SLIDE 47

Conclusion

◮ Offline periodic scheduling algorithm for periodic applications. The algorithm scales well (its complexity depends on the number of applications, not on the size of the machine).
◮ Very good expected performance.
◮ Very precise model on very friendly benchmarks.
◮ Right now, more a proof of concept.

Paper, data, code: https://github.com/vlefevre/IO-scheduling-simu