

SLIDE 1

Capacity Planning of Supercomputers

Simulating MPI Applications at Scale

Tom Cornebize
Under the supervision of Arnaud Legrand
21 June 2017

Laboratoire d’Informatique de Grenoble Ensimag - Grenoble INP

SLIDE 2

Introduction

SLIDE 3

Top500

  • Sunway TaihuLight, China, #1: 93 Pflops, custom five-level hierarchy, 40,950 × 260 cores
  • Tianhe-2, China, #2: 34 Pflops, fat tree, 32,000 × 12 cores + 48,000 Xeon Phi
  • Piz Daint, Switzerland, #3: 20 Pflops, dragonfly, 5,272 × (8 cores + 1 GPU)
  • Stampede, United States, #20: 5 Pflops, fat tree, 6,400 × (8 cores + 1 Xeon Phi)

SLIDE 4

High Performance LINPACK (HPL)

Benchmark used to establish the Top500.
LU factorization: A = L × U
Complexity: flop(N) = 2/3 × N³ + 2 × N² + O(N)

[Figure: the N × N matrix A decomposed into factors L and U]

HPL main loop:
  allocate the matrix
  for k = N to 0 do
    allocate the panel
    various functions (max, swap, …)
    compute the inverse
    broadcast
    update

SLIDE 7

Open questions in HPC

  • Topology (torus, fat tree, dragonfly, etc.)
  • Routing algorithm
  • Scheduling (when? where?)
  • Workload (job size, behavior)

Keywords: capacity planning, co-design. Simulation may help.

SLIDE 8

Simulation of HPC applications

Off-line

  • P5: MPI_Recv at t=0.872s
  • P3: MPI_Wait at t=0.881s
  • P7: MPI_Send at t=1.287s
  • P5: MPI_Recv at t=1.568s
  • P7: MPI_Send at t=2.221s
  • P0: MPI_Recv at t=2.559s
  • P3: MPI_Wait at t=2.602s
  • P0: MPI_Send at t=3.520s
  • P1: MPI_Recv at t=4.257s
  • P2: MPI_Recv at t=4.514s
  • P6: MPI_Send at t=5.017s
  • P7: MPI_Recv at t=5.989s
  • P6: MPI_Recv at t=5.997s
  • P4: MPI_Send at t=6.107s
  • P6: MPI_Recv at t=6.534s
  • P2: MPI_Send at t=7.152s
  • P4: MPI_Recv at t=7.754s

[…] (excerpt of a time-stamped trace, replayed by the simulator)

On-line: the application itself is executed, with its computations benchmarked and its communications simulated.

Simgrid supports both approaches.

SLIDE 12

Objective: simulation of Stampede’s execution of HPL

Real execution:

  • Matrix of size 3,875,000
  • Using 6,006 MPI processes
  • About 2 hours

Requirement for the emulation of Stampede’s execution:

  • ≥ 3,875,000² × 8 bytes ≈ 120 terabytes of memory
  • ≥ 6,006 × 2 hours ≈ 500 days

Both estimates are very optimistic (the real requirements would be even larger).
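A quick back-of-the-envelope check of these figures (assuming Stampede's sustained HPL performance of roughly 5 Pflops, which is not stated on this slide):

  • Memory: 3,875,000² × 8 bytes ≈ 1.2 × 10¹⁴ bytes ≈ 120 terabytes
  • Work: flop(3,875,000) ≈ 2/3 × 3,875,000³ ≈ 3.9 × 10¹⁹ flop, i.e. ≈ 7,800 s, about two hours at 5 Pflops
  • Emulation time: 6,006 processes × 2 hours ≈ 12,000 CPU-hours ≈ 500 days if replayed sequentially on a single node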

SLIDE 13

Scalable HPL simulation

SLIDE 14

Methodology

Several optimizations. For each of them:

  • Evaluate the (possible) loss of prediction accuracy
  • Evaluate the (possible) gain of performance

Publicly available:

  • Laboratory notebook
  • Modified HPL
  • Scripts
  • Modifications to Simgrid (integrated in the main project)


SLIDE 17

Computation kernel sampling

dgemm and dtrsm account for ≥ 90% of the simulation time. Solution: model these functions and inject their predicted duration instead of computing them.

[HPL main loop diagram (see SLIDE 4)]

[Figure: linear regression of dgemm duration against m × n × k, and of dtrsm duration against m × n²]

Tdgemm(M, N, K) = M × N × K × 1.706348 × 10⁻¹⁰
Tdtrsm(M, N) = M × N² × 8.624970 × 10⁻¹¹
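A minimal sketch of how such a model can be injected, using the coefficients above. It assumes an SMPI helper that advances simulated time by a given number of seconds (something like smpi_execute_benched); the helper name and the way the wrappers are plugged into HPL are simplifications, not the exact patch.

    /* Sketch: advance simulated time instead of calling the real BLAS kernels. */
    #include <smpi/smpi.h>

    /* Coefficients obtained by linear regression (see above). */
    #define DGEMM_COEFF 1.706348e-10   /* seconds per m*n*k */
    #define DTRSM_COEFF 8.624970e-11   /* seconds per m*n^2 */

    static void simulated_dgemm(int m, int n, int k)
    {
        /* No computation: only the simulated clock moves forward. */
        smpi_execute_benched(DGEMM_COEFF * (double)m * (double)n * (double)k);
    }

    static void simulated_dtrsm(int m, int n)
    {
        smpi_execute_benched(DTRSM_COEFF * (double)m * (double)n * (double)n);
    }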

SLIDE 18

Computation pruning

68% of the remaining simulation time is still spent inside HPL itself.

[HPL main loop diagram (see SLIDE 4)]

Culprits:

  • Initialization and verification functions
  • Other BLAS and HPL functions

Solution: just skip them
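As an illustration of what "just skip them" can look like, here is a minimal sketch where the expensive calls are bypassed when running under the simulator; the flag and helper names are illustrative, not the actual HPL modification.

    #include <stdio.h>
    #include <stdbool.h>

    static const bool smpi_optimization = true;  /* illustrative: set when simulating */

    static void generate_matrix(void) { /* expensive random fill               */ }
    static void verify_solution(void) { /* expensive residual check            */ }
    static void factorize(void)       { /* the factorization we actually model */ }

    int main(void)
    {
        if (!smpi_optimization)
            generate_matrix();   /* pruned: its result does not change the timing */
        factorize();
        if (!smpi_optimization)
            verify_solution();   /* pruned: only checks numerical correctness     */
        puts("initialization and verification pruned");
        return 0;
    }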

SLIDE 21

Reducing the memory consumption

Memory consumption is still too large. Solution: use SMPI_SHARED_MALLOC.

[HPL main loop diagram (see SLIDE 4)]

[Diagram: many virtual pages mapped onto the same few physical pages]
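A minimal sketch of the idea: the matrix is allocated through SMPI's shared allocator, so the huge virtual allocation is backed by a small set of physical pages mapped over and over. The values read back are meaningless, which is acceptable once the kernels are no longer really computed. The wrapper below is illustrative; only the SMPI_SHARED_MALLOC / SMPI_SHARED_FREE macros come from SMPI.

    #include <stddef.h>
    #include <smpi/smpi.h>

    /* Illustrative wrapper: allocate HPL's N x N matrix of doubles through
     * SMPI's shared allocator instead of malloc: virtually n*n*8 bytes,
     * physically only a few pages. */
    double *allocate_matrix(size_t n)
    {
        return (double *)SMPI_SHARED_MALLOC(n * n * sizeof(double));
    }

    void free_matrix(double *a)
    {
        SMPI_SHARED_FREE(a);
    }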

SLIDE 26

Reducing the memory consumption #154

Problem: the panel buffers must remain contiguous.

[HPL main loop diagram (see SLIDE 4)]

Panel buffer layout: matrix parts | indices | matrix parts. The matrix parts can be shared; the indices must not be shared. Solution: SMPI_PARTIAL_SHARED_MALLOC, with an arbitrary number of shared and private blocks.
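A sketch of a partially shared panel allocation following the layout above. The macro is SimGrid's; the offset convention used here (pairs of begin/end byte offsets for the shared blocks) and the helper itself are assumptions to check against the SimGrid documentation.

    #include <stddef.h>
    #include <smpi/smpi.h>

    /* Illustrative helper: one contiguous panel laid out as
     * [ matrix part | indices | matrix part ].  The matrix parts are
     * shared (not really allocated); the index block stays private,
     * since its content is actually used (max, swap, ...). */
    void *allocate_panel(size_t matrix1_bytes, size_t index_bytes, size_t matrix2_bytes)
    {
        size_t total = matrix1_bytes + index_bytes + matrix2_bytes;
        size_t shared_offsets[] = {
            0,                           matrix1_bytes,   /* first shared block  */
            matrix1_bytes + index_bytes, total            /* second shared block */
        };
        return SMPI_PARTIAL_SHARED_MALLOC(total, shared_offsets, 2);
    }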

SLIDE 31

Reusing the panel buffers

At each iteration, every process allocates and deallocates a new panel buffer. Solution: reuse the buffers (their sizes are strictly decreasing). This needs to be done carefully.

[HPL main loop diagram (see SLIDE 4)]

[Diagram: initial panel buffer and current (smaller) buffer, each split into shared and private blocks]
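A sketch of the reuse strategy (names illustrative, not the actual patch): the buffer of the first, largest panel is kept and handed back at every later iteration, since panel sizes only decrease. The delicate part is that the shared and private sub-blocks of the smaller panels no longer line up with those of the initial buffer, as the diagram above suggests.

    #include <stddef.h>
    #include <stdlib.h>

    /* Illustrative reuse: allocate the panel once at its maximal size and
     * hand it back at every iteration.  In the simulated build the
     * allocation would be the partially shared one from the previous slide. */
    static void  *panel_buf      = NULL;
    static size_t panel_capacity = 0;

    void *get_panel(size_t bytes)
    {
        if (panel_buf == NULL) {             /* first and largest panel */
            panel_buf      = malloc(bytes);
            panel_capacity = bytes;
        }
        return bytes <= panel_capacity ? panel_buf : NULL;
    }

    void release_panels(void)
    {
        free(panel_buf);
        panel_buf      = NULL;
        panel_capacity = 0;
    }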

SLIDE 36

Using huge pages #168

Problem: at large scale, CPU utilization drops and the simulation time explodes. Reason: the page table becomes very large.

[HPL main loop diagram (see SLIDE 4)]

Matrix of size N ⇒ page table of size:

PTsize(N) = (N² × 8 bytes of allocated virtual memory) / (4,096 bytes per page) × 8 bytes per entry

PTsize(600,000) ≈ 5 GB

Solution: use huge pages
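The change (#168) backs SMPI's shared allocations with huge pages. As a standalone illustration of the underlying mechanism (standard Linux, not the SimGrid patch itself), the sketch below maps 1 GiB with 2 MB huge pages, which needs roughly 512 page-table entries instead of about 262,000 with 4 KiB pages; it requires huge pages reserved beforehand (vm.nr_hugepages). Inside SimGrid the feature is exposed through a configuration option pointing to a hugetlbfs mount; its exact name should be checked in the current documentation.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t bytes = 1UL << 30;   /* 1 GiB */
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   /* likely no reserved huge pages */
            return 1;
        }
        printf("1 GiB mapped with 2 MB huge pages at %p\n", p);
        munmap(p, bytes);
        return 0;
    }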

SLIDE 38

Scalability

SLIDE 39

Scalability

[Figure: simulation time (hours) and memory consumption (gigabytes), as a function of the matrix size (up to 4,000,000, for 512 to 4,096 processes) and as a function of the number of processes (up to 4,096, for matrix sizes from 500,000 to 4,000,000)]

SLIDE 40

Validation

SLIDE 41

What?

Let’s compare:

  • a real experiment
  • a vanilla simulation
  • an optimized simulation

Measuring the duration of HPL and its energy consumption

SLIDE 42

How?

Using Grid’5000:

  • Cluster Taurus, in Lyon
  • 16 nodes
  • 2 × Intel Xeon E5-2630 (6 cores per CPU, 2.3 GHz)
  • 32 GB RAM
  • 1 switch, 10 Gbps links
  • Hyperthreading and turbo-mode disabled

SLIDE 43

Result

[Figure: HPL duration (seconds) and energy consumption (joules) for different numbers of processes, matrix size 20,000; comparing real execution, vanilla simulation, and optimized simulation]

Prediction error: ≤ 12%. The simulation is systematically too optimistic:

  • No outliers in dgemm and dtrsm duration
  • Functions skipped
  • Optimistic network model

SLIDE 45

Conclusion

SLIDE 46

To sum up

  • Accurate and efficient simulation of HPL
  • Can reach the scale of the largest supercomputers
  • Small modifications to HPL (300 out of 34k lines)
  • Several improvements to Simgrid

SLIDE 47

Coming soon…

Capacity planning of supercomputers

[Figure: HPL performance estimation (Gflops) for different topologies (number of L2 switches), with 10 Mbps and 10 Gbps links, comparing random and sequential process mappings]

Also: failures, energy consumption.

SLIDE 48

Top500

Sunway TaihuLight, China, #1 Tianhe-2, China, #2 93 Pflops 34 Pflops Custom five level hierarchy Fat tree 40, 950 × 260 cores 32, 000 × 12 cores + 48, 000 Xeon Phi Piz Daint, Switzerland, #3 Stampede, United States, #20 20 Pflops 5 Pflops Dragonfly Fat tree 5, 272 × (8 cores + 1 GPU) 6, 400 × (8 cores + 1 Xeon Phi) 1/19

High Performance LINPACK (HPL) Benchmark used to establish the Top500 LU factorization, A = L × U Complexity: flop(N) = 2

3N3 + 2N2 + O(N) L U A N allocate the matrix for k = N to 0 do allocate the panel various functions (max, swap,…) compute the inverse broadcast update 2/19

Open questions in HPC

  • Topology (torus, fat tree, dragonfly, etc.)
  • Routing algorithm
  • Scheduling (when? where?)
  • Workload (job size, behavior)

Keywords: capacity planning, co-design Simulation may help 3/19 Simulation of HPC applications Off-line

  • P5: MPI_Recv at t=0.872s
  • P3: MPI_Wait at t=0.881s
  • P7: MPI_Send at t=1.287s
  • P5: MPI_Recv at t=1.568s
  • P7: MPI_Send at t=2.221s
  • P0: MPI_Recv at t=2.559s
  • P3: MPI_Wait at t=2.602s
  • P0: MPI_Send at t=3.520s
  • P1: MPI_Recv at t=4.257s
  • P2: MPI_Recv at t=4.514s
  • P6: MPI_Send at t=5.017s
  • P7: MPI_Recv at t=5.989s
  • P6: MPI_Recv at t=5.997s
  • P4: MPI_Send at t=6.107s
  • P6: MPI_Recv at t=6.534s
  • P2: MPI_Send at t=7.152s
  • P4: MPI_Recv at t=7.754s
[...] t

On-line

P0 P1 P2 P3 P4 P5 P6 P7 t

Simgrid: both approaches 4/19 Objective: simulation of Stampede’s execution of HPL Real execution:

  • Matrix of size 3,875,000
  • Using 6,006 MPI processes
  • About 2 hours

Requirement for the emulation of Stampede’s execution:

  • ≥ 3, 875, 0002 × 8 bytes ≈ 120 terabytes of memory
  • ≥ 6, 006 × 2 hours ≈ 500 days

Very optimistic 5/19 Methodology Several optimizations. For each of them:

  • Evaluate the (possible) loss of prediction accuracy
  • Evaluate the (possible) gain of performance

Publicly available:

  • Laboratory notebook
  • Modified HPL
  • Scripts
  • Modifications to Simgrid (integrated in the main project)
6/19

Computation kernel sampling dgemm and dtrsm ≥ 90% of the simulation time Solution: modeling these functions to inject their duration

allocate the matrix for k = N to 0 do allocate the panel various functions (max, swap,…) compute the inverse broadcast update
  • 5
10 0e+00 2e+10 4e+10 6e+10 8e+10 m * n * k Time (seconds) Linear regression of dgemm
  • 1
2 3 4 5 0e+00 2e+10 4e+10 6e+10 m * n^2 Time (seconds) Linear regression of dtrsm

Tdgemm(M, N, K) = M × N × K × 1.706348 × 10−10 Tdtrsm(M, N) = M × N2 × 8.624970 × 10−11 7/19 Computation pruning 68% of the simulation time spent in HPL

allocate the matrix for k = N to 0 do allocate the panel various functions (max, swap,…) compute the inverse broadcast update

Culprits:

  • Initialization and verification functions
  • Other BLAS and HPL functions

Solution: just skip them 8/19 Reducing the memory consumption Memory consumption still too large Solution: use SMPI_SHARED_MALLOC

allocate the matrix for k = N to 0 do allocate the panel various functions (max, swap,…) compute the inverse broadcast update

virtual physical 9/19 Reducing the memory consumption #154 Problem: panel buffers Must remain contiguous

allocate the matrix for k = N to 0 do allocate the panel various functions (max, swap,…) compute the inverse broadcast update

matrix parts indices matrix parts can be shared can be shared must not be shared Solution: SMPI_PARTIAL_SHARED_MALLOC Arbitrary number of shared and private blocks. 10/19 Reusing the panel buffers At each iteration, new allocation and deallo- cation by all processes Solution: reuse the buffers (sizes strictly de- creasing) Needs to be done carefully

allocate the matrix for k = N to 0 do / / / / / / / / / / / / / / / / / / / / / / / allocate the panel various functions (max, swap,…) compute the inverse broadcast update

shared private shared shared private shared initial buffer current buffer 11/19 Using huge pages #168 Problem: at large scales, CPU utilization drops and simulation time explodes Reason: the page table becomes very large

allocate the matrix for k = N to 0 do / / / / / / / / / / / / / / / / / / / / / / / allocate the panel various functions (max, swap,…) compute the inverse broadcast update huge pages

Matrix of size N ⇒ page table of size: PTsize(N) = N2 × 8 4, 096 × 8 allocated virtual memory page size entry size PTsize(600, 000) ≈ 5GB Solution: using huge pages 12/19 Scalability

  • 512
1,024 2,048 4,096 10 20 30 40 0e+00 1e+06 2e+06 3e+06 4e+06 Matrix size Simulation time (hours) Simulation time for different matrix sizes
  • 512
1,024 2,048 4,096 5 10 15 0e+00 1e+06 2e+06 3e+06 4e+06 Matrix size Memory consumption (gigabytes) Memory consumption for different matrix sizes Number of processes
  • 512
1,024 2,048 4,096
  • 500,000
1,000,000 2,000,000 4,000,000 10 20 30 40 1000 2000 3000 4000 Number of processes Simulation time (hours) Simulation time for different number of processes
  • 500,000
1,000,000 2,000,000 4,000,000 5 10 15 1000 2000 3000 4000 Number of processes Memory consumption (gigabytes) Memory consumption for different number of processes Matrix size
  • 500,000
1,000,000 2,000,000 4,000,000 13/19

What? Let’s compare:

  • a real experiment
  • a vanilla simulation
  • an optimized simulation

Measuring the duration of HPL and its energy consumption 14/19 How? Using Grid’5000:

  • Cluster Taurus, in Lyon
  • 16 nodes
  • 2 Intel Xeon E5-2630, 6 cores/CPU, 2.3GHz
  • 32GB RAM
  • 1 switch, 10Gbps links
  • Hyperthreading and turbo-mode disabled
15/19

Result

  • 20
40 60 80 50 100 150 200 Number of processes Duration (seconds) HPL duration for different numbers of processes Matrix size: 20,000
  • 10000
20000 30000 40000 50 100 150 200 Number of processes Energy consumption (joules) HPL energy consumption for different numbers of processes Matrix size: 20,000 Experiment type
  • Optimized simulation
Vanilla simulation Real execution

Prediction error: ≤ 12% Simulation systematically too optimistic

  • No outliers in dgemm and dtrsm duration
  • Functions skipped
  • Optimistic network model
16/19

Thanks for your attention! Any questions?

SLIDE 49

Computation kernel sampling

[Figure: performance estimation (Gflops) for different matrix sizes (64 MPI processes) and for different numbers of processes (matrix size 20,000), with kernel sampling disabled or enabled]

Prediction error: ≤ 10%

SLIDE 50

Computation kernel sampling

[Figure: simulation time (seconds) and memory consumption (bytes) for different matrix sizes (64 MPI processes) and for different numbers of processes (matrix size 20,000), with kernel sampling disabled or enabled]

SLIDE 51

Computation pruning

[Figure: performance estimation (Gflops) for different matrix sizes (64 MPI processes) and for different numbers of processes (matrix size 20,000), with computation pruning disabled or enabled]

Prediction error: ≤ 5%

SLIDE 52

Computation pruning

[Figure: simulation time (seconds) and memory consumption (bytes) for different matrix sizes (64 MPI processes) and for different numbers of processes (matrix size 20,000), with computation pruning disabled or enabled]

SLIDE 53

Reducing the memory consumption

[Figure: performance estimation (Gflops) for different matrix sizes (64 MPI processes) and for different numbers of processes (matrix size 20,000), with shared malloc disabled or enabled]

Prediction error: ≤ 1%

SLIDE 54

Reducing the memory consumption

[Figure: simulation time (seconds) and memory consumption (bytes) for different matrix sizes (64 MPI processes) and for different numbers of processes (matrix size 20,000), with shared malloc disabled or enabled]

SLIDE 55

Reusing the panel buffers

[Figure: performance estimation (Gflops) for different matrix sizes (64 MPI processes) and for different numbers of processes (matrix size 20,000), with panel reuse disabled or enabled]

Prediction error: ≤ 1%

SLIDE 56

Reusing the panel buffers

[Figure: simulation time (seconds) and memory consumption (bytes) for different matrix sizes (64 MPI processes) and for different numbers of processes (matrix size 20,000), with panel reuse disabled or enabled]

SLIDE 57

Using huge pages

[Figure: performance estimation (Gflops) for different matrix sizes up to 300,000 (64 MPI processes), with huge pages disabled or enabled]

Prediction error: ≤ 0.1%

SLIDE 58

Using huge pages

[Figure: simulation time (seconds) and memory consumption (bytes) for different matrix sizes up to 300,000 (64 MPI processes), with huge pages disabled or enabled]
