SLIDE 1

Using Processor Partitioning to Evaluate the Performance of MPI, OpenMP and Hybrid Parallel Applications on Dual- and Quad-core Cray XT4 Systems

Xingfu Wu and Valerie Taylor
Department of Computer Science & Engineering, Texas A&M University
Xingfu Wu <wuxf@cs.tamu.edu>, http://prophesy.cs.tamu.edu
CUG2009, May 5, 2009, Atlanta, GA

SLIDE 2

Outline

  • Introduction: Processor Partitioning
  • Execution Platforms and Performance
  • NAS Parallel Benchmarks (MPI, OpenMP)
  • Gyrokinetic Toroidal code (GTC, hybrid)
  • Performance Modeling Using Prophesy System
  • Summary

SLIDE 3

Introduction

  • Chip multiprocessors (CMPs) are usually configured hierarchically to form a compute node of CMP cluster systems.
  • One issue is how many processor cores per node to use for efficient execution.
  • The best number of processor cores per node depends on the application characteristics and the system configuration.

SLIDE 4

Processor Partitioning

  • Quantify the performance gap resulting from using different numbers of processor cores per node for application execution (for which we use the term processor partitioning).
  • Understand how processor partitioning impacts system and application performance.
  • Investigate how and why an application is sensitive to communication and memory access patterns.

SLIDE 5

Processor Partitioning Scheme

  • A processor partitioning scheme NxM stands for N nodes with M processor cores per node (PPN); see the sketch below.
  • Using processor partitioning changes the memory access pattern and the communication pattern of an MPI program.
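
For concreteness, a minimal C sketch of how a job could report the NxM scheme it is running under. MPI_Comm_split_type is MPI-3, newer than this 2009 talk; on the XT4 the scheme was chosen at launch time, e.g. "aprun -n 8 -N 2" starts 8 ranks with 2 per node, a 4x2 partitioning.

    /* Hypothetical sketch: report the NxM partitioning scheme of a job. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int world, ppn, rank;
        MPI_Comm node_comm;
        MPI_Comm_size(MPI_COMM_WORLD, &world);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Group the ranks that share this node's memory and count them. */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_size(node_comm, &ppn);

        if (rank == 0)
            printf("Scheme %dx%d: %d nodes, %d cores per node\n",
                   world / ppn, ppn, world / ppn, ppn);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }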

SLIDE 6

Outline

  • Introduction
  • Execution Platforms and Performance
      • Memory Performance Analysis: STREAM benchmark
      • MPI Communication Performance Analysis: IMB benchmarks
  • NAS Parallel Benchmarks (MPI, OpenMP)
  • Gyrokinetic Toroidal code (GTC, hybrid)
  • Performance Modeling Using Prophesy System
  • Summary

SLIDE 7

Dual- and Quad-core Cray XT4

Configurations       Franklin            Jaguar
Total Cores          19,320              31,328
Total Nodes          9,660               7,832
Cores/chip           2                   4
Cores/Node           2                   4
CPU type             2.6 GHz Opteron     2.1 GHz Opteron
Memory/Node          4 GB                8 GB
L1 Cache/CPU         64/64 KB            64/64 KB
L2 Cache/chip        1 MB                2 MB
Network              3D-Torus            3D-Torus

SLIDE 8

STREAM Benchmark

  • Synthetic benchmarks, written in Fortran 77 and MPI or in C and OpenMP
  • Measure the sustainable memory bandwidth using the unit-stride TRIAD benchmark (a(i) = b(i) + q*c(i)), sketched below
  • The array size is 4M (2^22)
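
A minimal C/OpenMP TRIAD sketch, not the official STREAM source (which repeats the kernel many times and reports the best trial). TRIAD moves 24 bytes per element: two reads and one write.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1L << 22)              /* 4M elements, matching the slide */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        const double q = 3.0;

        #pragma omp parallel for      /* parallel first-touch initialization */
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];   /* the TRIAD kernel */
        t = omp_get_wtime() - t;

        printf("TRIAD bandwidth: %.2f MB/s\n", 24.0 * N / t / 1e6);
        free(a); free(b); free(c);
        return 0;
    }

Varying OMP_NUM_THREADS (or, for the MPI variant, the NxM launch scheme) reproduces the partitioning comparison on the next slide.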

SLIDE 9

Sustainable Memory Bandwidth

Franklin (dual-core):
  Scheme                     OpenMP (2 threads)   MPI 2x1    MPI 1x2
  Memory Bandwidth (MB/s)    3565.71              6710.89    4026.53

Jaguar (quad-core):
  Scheme                     OpenMP (4 threads)   MPI 4x1    MPI 2x2    MPI 1x4
  Memory Bandwidth (MB/s)    5606.77              10066.33   10066.33   5752.19

SLIDE 10

Intel’s MPI Benchmarks (IMB)

  • Provides a concise set of benchmarks targeted at measuring the most important MPI functions
  • Version 2.3, written in C and MPI
  • Uses PingPong to measure uni-directional intra-/inter-node latency and bandwidth (pattern sketched below)
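
A C/MPI sketch of the PingPong pattern, not IMB itself. Run it with exactly two ranks; placing them on the same node or on different nodes yields the intra- and inter-node curves on the next slide. Half the averaged round-trip time is the uni-directional latency, and size/latency is the bandwidth.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int reps = 1000;
        for (int size = 1; size <= 1 << 22; size *= 2) {   /* up to 4 MB */
            char *buf = malloc(size);
            MPI_Barrier(MPI_COMM_WORLD);
            double t = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double lat = (MPI_Wtime() - t) / reps / 2.0;   /* one-way time */
            if (rank == 0)
                printf("%8d bytes: %10.2f us  %10.2f MB/s\n",
                       size, lat * 1e6, size / lat / 1e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }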

SLIDE 11

Uni-directional Latency and Bandwidth

[Chart: Uni-directional intra-node latency comparison using PingPong; latency (us, log scale) vs. message size (bytes, log scale); Franklin vs. Jaguar]

[Chart: Uni-directional inter-node latency comparison using PingPong; same axes; Franklin vs. Jaguar]

[Chart: Uni-directional intra-node bandwidth comparison using PingPong; bandwidth (MB/s, log scale) vs. message size (bytes, log scale); Franklin vs. Jaguar]

[Chart: Uni-directional inter-node bandwidth comparison using PingPong; same axes; Franklin vs. Jaguar]

SLIDE 12

Lessons Learned from STREAM and IMB

  • Memory access patterns at different memory hierarchy levels affect sustainable memory bandwidth
  • The fewer the PPN, the higher the sustainable memory bandwidth
  • Using all cores per node does not result in the highest memory bandwidth
  • Intra-node MPI latency is much lower, and intra-node bandwidth much higher, than inter-node

SLIDE 13

Outline

  • Introduction: Processor Partitioning
  • Execution Platforms and Performance
  • NAS Parallel Benchmarks (MPI, OpenMP)
  • Gyrokinetic Toroidal code (GTC, hybrid)
  • Performance Modeling Using Prophesy System
  • Summary

SLIDE 14

NAS Parallel Benchmarks

  • NPB 3.2.1 (MPI and OpenMP)
  • CG, EP, FT, IS, MG, LU, BT, SP
  • Class B and C
  • Compiler: ftn with the options -O3 -fastsse on Franklin and Jaguar
  • Strong scaling

SLIDE 15

Performance Comparison

[Chart: Performance comparison of MPI and OpenMP on Franklin; time (s) for the MPI (-M) and OpenMP (-O) versions of CG, EP, FT, IS, MG, LU, BT, SP]

[Chart: Ratio of MPI to OpenMP performance on Franklin for each NPB benchmark]

[Chart: Performance comparison of MPI and OpenMP on Jaguar; same benchmarks]

[Chart: Ratio of MPI to OpenMP performance on Jaguar for each NPB benchmark]

SLIDE 16

Using Processor Partitioning

[Chart: Performance comparison using processor partitioning (1x4, 2x2, 4x1) on Jaguar (quad-core); time (s) for the NPB benchmarks, Class B]

[Chart: The same comparison for Class C]

SLIDE 17

Using Hardware Counters’ Performance Data

SP on Jaguar                 1x4        2x2        4x1        Diff. (%)
Runtime (s)                  275.35     179.98     128.01     115.10
D1+D2 hit ratio              92.80%     92.70%     92.70%
D1 hit ratio                 91.20%     91.10%     91.10%
D2 hit ratio                 52.80%     48.30%     45.10%
Mem-D1 BW (MB/s) per core    1212.99    1861.85    2605.24    114.78
L2-D1 BW (MB/s) per core     266.18     409.21     589.88     121.61
L2-Mem BW (MB/s) per core    510.43     992.14     1328.96    160.36
Comm. %                      1.10       1.10       1.90
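
The counter data above was gathered with the XT4's profiling tools; as a hypothetical illustration only, the following sketch derives one such metric, the D1 (L1 data cache) hit ratio = 1 - misses/accesses, with the classic PAPI high-level API around a dummy kernel. Event names and availability vary by platform.

    #include <papi.h>
    #include <stdio.h>

    static volatile double sink;   /* keeps the loop from being optimized away */

    static void kernel(void) {     /* dummy code region to measure */
        double s = 0.0;
        for (int i = 0; i < 1000000; i++) s += i * 0.5;
        sink = s;
    }

    int main(void) {
        int events[2] = { PAPI_L1_DCM, PAPI_L1_DCA };   /* misses, accesses */
        long long v[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_start_counters(events, 2);
        kernel();
        PAPI_stop_counters(v, 2);

        printf("D1 hit ratio: %.2f%%\n",
               100.0 * (1.0 - (double)v[0] / (double)v[1]));
        return 0;
    }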

SLIDE 18

Using Hardware Counters’ Performance Data

EP on Jaguar                 1x4        2x2        4x1        Diff. (%)
Runtime (s)                  28.47      28.39      28.36      0.388
D1+D2 hit ratio              99.10%     99.10%     99.10%
D1 hit ratio                 99.10%     99.10%     99.10%
D2 hit ratio                 3.30%      3.30%      3.30%
Mem-D1 BW (MB/s) per core    288.54     288.71     288.73     0.066
L2-D1 BW (MB/s) per core     2.65       2.64       2.65       0.189
L2-Mem BW (MB/s) per core    144.32     144.41     144.44     0.082
Comm. %                      -          -          -

SLIDE 19

Using Processor Partitioning

[Chart: Performance comparison using processor partitioning (2x2, 4x1) on Franklin (dual-core); time (s) for the NPB benchmarks, Class C]

[Chart: The same comparison for Class B]

SLIDE 20

Using Hardware Counters’ Performance Data

SP on Franklin               2x2        4x1        Difference (%)
Runtime (s)                  169.36     115.14     47.09
D1+D2 hit ratio              92.80%     92.80%
D1 hit ratio                 91.20%     91.10%
D2 hit ratio                 48.90%     45.30%
Mem-D1 BW (MB/s) per core    1962.18    2866.102   46.07
L2-D1 BW (MB/s) per core     433.072    647.878    49.60
L2-Mem BW (MB/s) per core    976.701    1424.925   45.89
Comm. %                      1.00       1.10

SLIDE 21

Lessons Learned from NPB

  • Using processor partitioning changes the memory access pattern and communication pattern of an MPI program.
  • Regarding the merits of using processor partitioning, the hardware performance counter data is conclusive.
  • Processor partitioning has a significant performance impact on an MPI program, except for embarrassingly parallel applications such as EP.
  • The memory bandwidth per core is the primary source of performance degradation when increasing the number of cores per node.

SLIDE 22

Outline

  • Introduction: Processor Partitioning
  • Execution Platforms and Performance
  • NAS Parallel Benchmarks (MPI, OpenMP)
  • Gyrokinetic Toroidal code (GTC, hybrid)
  • Performance Modeling Using Prophesy System
  • Summary

SLIDE 23

GTC Code

  • Gyrokinetic Toroidal Code (GTC)
  • A 3D particle-in-cell application developed at the Princeton Plasma Physics Laboratory to study turbulent transport in magnetic fusion
  • A flagship SciDAC fusion microturbulence code
  • 100 particles per cell and 100 time steps
  • Weak scaling
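
GTC itself is written in Fortran; the C sketch below only illustrates the hybrid MPI+OpenMP structure it uses (OpenMP threads across a node's cores for the particle loops, MPI between processes). The "push" arithmetic and the reduction standing in for the shift phase are placeholders, not the real gyrokinetic equations.

    #include <mpi.h>
    #include <omp.h>

    #define NP 100000                 /* particles per process (made up) */

    static double x[NP], v[NP];       /* particle positions and velocities */

    int main(int argc, char **argv) {
        int provided;
        /* FUNNELED: only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        const double dt = 0.01;
        for (int step = 0; step < 100; step++) {   /* 100 steps, as above */
            double local = 0.0, total;

            /* Threaded "pusher": update every particle owned by this rank. */
            #pragma omp parallel for reduction(+:local)
            for (int i = 0; i < NP; i++) {
                v[i] += dt;                        /* stand-in field kick */
                x[i] += dt * v[i];
                local += v[i] * v[i];
            }

            /* MPI phase: GTC's "shift" moves particles between processes;
             * a global reduction stands in for that communication here. */
            MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }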

SLIDE 24

PIC Steps of GTC Code

[Figure: the particle-in-cell steps of GTC, from Stephane Ethier’s talk at the 2005 BlueGene Applications Workshop]

SLIDE 25

Using Processor Partitioning for 4 Cores on Jaguar

4 Cores (times in s)    1x4       2x2       4x1       Diff. (%)
pusher                  361       346.7     342.2     5.49
shift                   16.07     11.76     10.33     55.57
charge                  304.8     296       292.1     4.35
poisson                 7.51      5.94      5.47      37.19
smooth                  2.46      2.14      2.07      18.95
field                   1.64      1.14      1.03      58.90
load                    4.34      3.88      3.68      18.05
Runtime                 698.45    668.2     657.47    6.23

SLIDE 26

Using Processor Partitioning for 64 Cores on Jaguar

64 Cores (times in s)   16x4      32x2      64x1      Diff. (%)
pusher                  386.8     368.7     365.3     5.89
shift                   55.12     34.6      22.69     142.93
charge                  333.8     317.2     313.9     6.34
poisson                 6.93      5.70      5.70      21.51
smooth                  2.53      2.23      2.06      22.84
field                   1.73      1.10      1.09      60.39
load                    4.53      3.72      3.72      21.79
Runtime                 792.54    733.85    715.16    10.82

SLIDE 27

Performance Comparison

[Chart: Performance of GTC; time (s) vs. number of cores (64 to 8192) for Franklin-MPI, Franklin-Hybrid, Jaguar-MPI and Jaguar-Hybrid]

[Chart: Relative speedup of GTC vs. number of cores for the same four configurations]

[Chart: Ratio of MPI to hybrid GTC runtime on Franklin and Jaguar, 64 to 8192 cores]

SLIDE 28

Performance Modeling
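
The modeling figure from this slide is not captured here. Prophesy fits analytical models to recorded runtimes; below is a minimal sketch, assuming a simple Amdahl-style model T(p) = a + b/p fitted by linear least squares. The timings are made up for illustration and are not data from this talk.

    #include <stdio.h>

    int main(void) {
        const double p[] = { 64, 128, 256, 512 };    /* core counts (made up) */
        const double t[] = { 700, 360, 190, 110 };   /* runtimes in s (made up) */
        const int n = 4;

        /* Least squares for t = a + b*x with x = 1/p. */
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double x = 1.0 / p[i];
            sx += x; sy += t[i]; sxx += x * x; sxy += x * t[i];
        }
        const double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        const double a = (sy - b * sx) / n;

        printf("T(p) ~ %.2f + %.2f/p, so T(1024) ~ %.2f s\n",
               a, b, a + b / 1024.0);
        return 0;
    }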

SLIDE 29

Summary

  • Used STREAM and IMB to understand how processor partitioning impacts system and application performance
  • Used processor partitioning to quantify the performance difference among different processor partitioning schemes for the NAS Parallel Benchmarks
  • Investigated how and why GTC is sensitive to communication and memory access patterns
  • Used processor partitioning to understand an application’s performance characteristics for optimizing the application in order to efficiently utilize all processors per node