Capacity Planning of Supercomputers: Simulating MPI Applications at Scale
Tom Cornebize, under the supervision of Arnaud Legrand
21 June 2017
Laboratoire d'Informatique de Grenoble, Ensimag - Grenoble INP
Introduction
Top500
- Sunway TaihuLight, China, #1: 93 Pflops, custom five-level hierarchy, 40,960 × 260 cores
- Tianhe-2, China, #2: 34 Pflops, fat tree, 32,000 × 12 cores + 48,000 Xeon Phi
- Piz Daint, Switzerland, #3: 20 Pflops, dragonfly, 5,272 × (8 cores + 1 GPU)
- Stampede, United States, #20: 5 Pflops, fat tree, 6,400 × (8 cores + 1 Xeon Phi)
High Performance LINPACK (HPL)
Benchmark used to establish the Top500.
LU factorization: A = L × U
Complexity: flop(N) = (2/3)N³ + 2N² + O(N)

[Figure: N × N matrix A progressively factorized into L and U]

allocate the matrix
for k = N to 0 do
    allocate the panel
    various functions (max, swap, …)
    compute the inverse
    broadcast
    update
Open questions in HPC
- Topology (torus, fat tree, dragonfly, etc.)
- Routing algorithm
- Scheduling (when? where?)
- Workload (job size, behavior)
Keywords: capacity planning, co-design
Simulation may help
Simulation of HPC applications
Off-line
- P5: MPI_Recv at t=0.872s
- P3: MPI_Wait at t=0.881s
- P7: MPI_Send at t=1.287s
- P5: MPI_Recv at t=1.568s
- P7: MPI_Send at t=2.221s
- P0: MPI_Recv at t=2.559s
- P3: MPI_Wait at t=2.602s
- P0: MPI_Send at t=3.520s
- P1: MPI_Recv at t=4.257s
- P2: MPI_Recv at t=4.514s
- P6: MPI_Send at t=5.017s
- P7: MPI_Recv at t=5.989s
- P6: MPI_Recv at t=5.997s
- P4: MPI_Send at t=6.107s
- P6: MPI_Recv at t=6.534s
- P2: MPI_Send at t=7.152s
- P4: MPI_Recv at t=7.754s
[...]

On-line
[Figure: Gantt chart of MPI processes P0 to P7 executing over time]

Simgrid: both approaches
Objective: simulation of Stampede’s execution of HPL
Real execution:
- Matrix of size 3,875,000
- Using 6,006 MPI processes
- About 2 hours
Requirements for the emulation of Stampede's execution:
- ≥ 3,875,000² × 8 bytes ≈ 120 terabytes of memory
- ≥ 6,006 × 2 hours ≈ 500 days of sequential simulation time
Both estimates are very optimistic.
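To make the orders of magnitude explicit, here is a sketch of the arithmetic behind these bounds, assuming Stampede's roughly 5 Pflops from the Top500 slide (the 2-hour duration is the measured value):

$$N^2 \times 8\ \text{bytes} = (3.875\times 10^{6})^2 \times 8 \approx 1.2\times 10^{14}\ \text{bytes} \approx 120\ \text{TB}$$

$$\text{flop}(N) \approx \tfrac{2}{3}\,(3.875\times 10^{6})^3 \approx 3.9\times 10^{19}, \qquad \frac{3.9\times 10^{19}}{5\times 10^{15}\ \text{flop/s}} \approx 7{,}800\ \text{s} \approx 2\ \text{hours}$$

$$6{,}006\ \text{processes} \times 2\ \text{hours} \approx 12{,}000\ \text{hours} \approx 500\ \text{days of sequential emulation}$$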
Scalable HPL simulation
Methodology
Several optimizations. For each of them:
- Evaluate the (possible) loss of prediction accuracy
- Evaluate the (possible) performance gain
Publicly available:
- Laboratory notebook
- Modified HPL
- Scripts
- Modifications to Simgrid (integrated in the main project)
Computation kernel sampling
dgemm and dtrsm account for ≥ 90% of the simulation time.
Solution: model these two kernels and inject their predicted durations instead of executing them.
[Figure: linear regression of dgemm duration (seconds) against M × N × K]
[Figure: linear regression of dtrsm duration (seconds) against M × N²]
Tdgemm(M, N, K) = M × N × K × 1.706348 × 10⁻¹⁰
Tdtrsm(M, N) = M × N² × 8.624970 × 10⁻¹¹
Computation pruning
After kernel sampling, 68% of the simulation time is still spent executing HPL code.
Culprits:
- Initialization and verification functions
- Other BLAS and HPL functions
Solution: just skip them
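A self-contained sketch of the pruning pattern (the flag name SKIP_CHECKS is illustrative, not the name used in the modified HPL): phases that do not influence the predicted run time are simply compiled out for simulation.

```c
/* Minimal sketch of "computation pruning": matrix generation and result
 * verification do not affect the predicted makespan, so they are skipped
 * when the program is built for simulation (e.g. with -DSKIP_CHECKS). */
#include <stdio.h>

static void generate_matrix(void) { puts("generating random matrix"); }
static void factorize(void)       { puts("LU factorization (the timed part)"); }
static void verify_result(void)   { puts("checking backward error"); }

int main(void)
{
#ifndef SKIP_CHECKS
    generate_matrix();   /* skipped in simulation builds */
#endif
    factorize();
#ifndef SKIP_CHECKS
    verify_result();     /* skipped in simulation builds */
#endif
    return 0;
}
```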
Reducing the memory consumption
Memory consumption is still too large.
Solution: use SMPI_SHARED_MALLOC
[Figure: many virtual pages of the matrix mapped onto the same physical pages]
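A Linux-only sketch of the mechanism behind SMPI_SHARED_MALLOC (the general idea, not SimGrid's actual implementation): one small file-backed block is mapped over and over across a large virtual range, so the "matrix" occupies almost no physical memory.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BLOCK ((size_t)1 << 21)            /* 2 MiB of real memory */

/* Map the same BLOCK-sized file region at every offset of a large
 * virtual area (error handling omitted for brevity). */
static void *shared_malloc(size_t size, int fd)
{
    char *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    for (size_t off = 0; off + BLOCK <= size; off += BLOCK)
        mmap(base + off, BLOCK, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
    return base;
}

int main(void)
{
    FILE *backing = tmpfile();
    ftruncate(fileno(backing), BLOCK);
    size_t matrix_bytes = (size_t)1 << 32;  /* 4 GiB of virtual space */
    double *a = shared_malloc(matrix_bytes, fileno(backing));
    memset(a, 0, matrix_bytes);             /* physically touches only 2 MiB */
    printf("a[0] = %f\n", a[0]);
    return 0;
}
```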
Reducing the memory consumption #154
Problem: the panel buffers must remain contiguous.
[Figure: panel buffer layout: matrix parts | indices | matrix parts]
The matrix parts can be shared; the indices must not be shared.
Solution: SMPI_PARTIAL_SHARED_MALLOC, which accepts an arbitrary number of shared and private blocks.
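A hedged usage sketch: the interface assumed here (total size, an array of begin/end offsets of the shareable ranges, and their count) is my reading of SimGrid's smpi.h, not a confirmed signature; check the header before relying on it.

```c
/* Sketch: allocate a panel whose two matrix parts may alias the same
 * physical pages while the index area in the middle stays private.
 * The exact SMPI_PARTIAL_SHARED_MALLOC signature is an assumption. */
#include <smpi/smpi.h>

static void *allocate_panel(size_t matrix_bytes, size_t index_bytes)
{
    size_t total = matrix_bytes + index_bytes + matrix_bytes;
    /* Layout: [matrix part | indices | matrix part] */
    size_t shared_offsets[4] = {
        0,                           matrix_bytes,   /* first shared range  */
        matrix_bytes + index_bytes,  total           /* second shared range */
    };
    return SMPI_PARTIAL_SHARED_MALLOC(total, shared_offsets, 2);
}
```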
Reusing the panel buffers
At each iteration, every process allocates and deallocates a new panel buffer.
Solution: reuse the buffers (the panel sizes are strictly decreasing). This needs to be done carefully:
[Figure: initial buffer vs. current (smaller) buffer, each laid out as shared | private | shared blocks]
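A self-contained sketch of the reuse idea (illustrative sizes, not HPL's actual panel code): because successive panels only shrink, one allocation of the first, largest panel can serve every later iteration instead of a malloc/free pair per iteration.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t N = 1000, NB = 8;                  /* illustrative matrix and block sizes */
    size_t largest = N * NB * sizeof(double);
    double *panel = malloc(largest);          /* allocated once, before the loop */

    for (size_t k = N; k > 0; k -= NB) {
        size_t needed = k * NB * sizeof(double);
        /* Reuse the same buffer: needed <= largest always holds because
         * the panel sizes strictly decrease from one iteration to the next. */
        printf("iteration k=%zu reuses %zu of %zu bytes\n", k, needed, largest);
    }
    free(panel);
    return 0;
}
```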
Using huge pages #168
Problem: at large scale, CPU utilization drops and the simulation time explodes.
Reason: the page table becomes very large.

Matrix of size N ⇒ page table of size:
PTsize(N) = (N² × 8 bytes of allocated virtual memory) / (4,096-byte pages) × 8 bytes per entry
PTsize(600,000) ≈ 5 GB

Solution: use huge pages
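For scale, the same formula evaluated at the largest matrix size used in the scalability experiments (N = 4,000,000), assuming standard x86-64 page sizes (4 KiB regular pages, 2 MiB huge pages; the huge-page size is not stated on the slide):

$$\mathrm{PT}_{4\,\mathrm{KiB}}(4\times 10^{6}) = \frac{(4\times 10^{6})^2 \times 8}{4\,096} \times 8 \approx 2.5\times 10^{11}\ \text{bytes} \approx 250\ \text{GB}$$

$$\mathrm{PT}_{2\,\mathrm{MiB}}(4\times 10^{6}) = \frac{(4\times 10^{6})^2 \times 8}{2\,097\,152} \times 8 \approx 4.9\times 10^{8}\ \text{bytes} \approx 0.5\ \text{GB}$$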
Scalability
[Figure: simulation time (hours) and memory consumption (gigabytes) for matrix sizes up to 4,000,000, with 512, 1,024, 2,048 and 4,096 processes]
[Figure: simulation time (hours) and memory consumption (gigabytes) for 1,000 to 4,000 processes, with matrix sizes of 500,000, 1,000,000, 2,000,000 and 4,000,000]
Validation
What?
Let’s compare:
- a real experiment
- a vanilla simulation
- an optimized simulation
Measuring the duration of HPL and its energy consumption
How?
Using Grid’5000:
- Cluster Taurus, in Lyon
- 16 nodes
- 2 Intel Xeon E5-2630, 6 cores/CPU, 2.3GHz
- 32GB RAM
- 1 switch, 10Gbps links
- Hyperthreading and turbo-mode disabled
Result
[Figure: HPL duration (seconds) for 50 to 200 processes, matrix size 20,000]
[Figure: HPL energy consumption (joules) for 50 to 200 processes, matrix size 20,000]
Experiment types: optimized simulation, vanilla simulation, real execution
Prediction error: ≤ 12%
The simulation is systematically too optimistic:
- No outliers in dgemm and dtrsm duration
- Functions skipped
- Optimistic network model
Conclusion
To sum up
- Accurate and efficient simulation of HPL
- Can reach the scale of the largest supercomputers
- Small modifications to HPL (300 of 34k lines)
- Several improvements to Simgrid
Coming soon…
Capacity planning of supercomputers
[Figure: HPL performance estimation (Gflops) for different topologies (5 to 15 L2 switches), with 10 Mbps links]
[Figure: HPL performance estimation (Gflops) for different topologies (5 to 15 L2 switches), with 10 Gbps links]
Process mapping: random vs. sequential
Also: failures, energy consumption
Thanks for your attention! Any questions?
Computation kernel sampling
[Figure: performance estimation (Gflops) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without kernel sampling]
[Figure: performance estimation (Gflops) for different numbers of processes, matrix size 20,000, with and without kernel sampling]
Prediction error: ≤ 10%
Computation kernel sampling
[Figure: simulation time (seconds) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without kernel sampling]
[Figure: simulation time (seconds) for different numbers of processes, matrix size 20,000, with and without kernel sampling]
[Figure: memory consumption (bytes) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without kernel sampling]
[Figure: memory consumption (bytes) for different numbers of processes, matrix size 20,000, with and without kernel sampling]
Computation pruning
[Figure: performance estimation (Gflops) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without computation pruning]
[Figure: performance estimation (Gflops) for different numbers of processes, matrix size 20,000, with and without computation pruning]
Prediction error: ≤ 5%
Computation pruning
[Figure: simulation time (seconds) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without computation pruning]
[Figure: simulation time (seconds) for different numbers of processes, matrix size 20,000, with and without computation pruning]
[Figure: memory consumption (bytes) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without computation pruning]
[Figure: memory consumption (bytes) for different numbers of processes, matrix size 20,000, with and without computation pruning]
Reducing the memory consumption
[Figure: performance estimation (Gflops) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without shared malloc]
[Figure: performance estimation (Gflops) for different numbers of processes, matrix size 20,000, with and without shared malloc]
Prediction error: ≤ 1%
Reducing the memory consumption
[Figure: simulation time (seconds) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without shared malloc]
[Figure: simulation time (seconds) for different numbers of processes, matrix size 20,000, with and without shared malloc]
[Figure: memory consumption (bytes) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without shared malloc]
[Figure: memory consumption (bytes) for different numbers of processes, matrix size 20,000, with and without shared malloc]
Reusing the panel buffers
[Figure: performance estimation (Gflops) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without panel reuse]
[Figure: performance estimation (Gflops) for different numbers of processes, matrix size 20,000, with and without panel reuse]
Prediction error: ≤ 1%
Reusing the panel buffers
[Figure: simulation time (seconds) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without panel reuse]
[Figure: simulation time (seconds) for different numbers of processes, matrix size 20,000, with and without panel reuse]
[Figure: memory consumption (bytes) for matrix sizes 10,000 to 40,000, 64 MPI processes, with and without panel reuse]
[Figure: memory consumption (bytes) for different numbers of processes, matrix size 20,000, with and without panel reuse]
Using huge pages
[Figure: performance estimation (Gflops) for matrix sizes up to 300,000, 64 MPI processes, with and without huge pages]
Prediction error: ≤ 0.1%
Using huge pages
[Figure: simulation time (seconds) for matrix sizes up to 300,000, 64 MPI processes, with and without huge pages]
[Figure: memory consumption (bytes) for matrix sizes up to 300,000, 64 MPI processes, with and without huge pages]