
Using Processor Partitioning to Evaluate the Performance of MPI, OpenMP and Hybrid Parallel Applications on Dual- and Quad-core Cray XT4 Systems - PowerPoint PPT Presentation

  1. Using Processor Partitioning to Evaluate the Performance of MPI, OpenMP and Hybrid Parallel Applications on Dual- and Quad-core Cray XT4 Systems
     Xingfu Wu <wuxf@cs.tamu.edu> and Valerie Taylor
     Department of Computer Science & Engineering, Texas A&M University
     CUG2009, May 5, 2009, Atlanta, GA
     http://prophesy.cs.tamu.edu

  2. Outline
     - Introduction: Processor Partitioning
     - Execution Platforms and Performance
     - NAS Parallel Benchmarks (MPI, OpenMP)
     - Gyrokinetic Toroidal code (GTC, hybrid)
     - Performance Modeling Using the Prophesy System
     - Summary

  3. Introduction
     - Chip multiprocessors (CMPs) are usually configured hierarchically to form a compute node of CMP cluster systems.
     - One issue is how many processor cores per node to use for efficient execution.
     - The best number of processor cores per node depends on the application characteristics and the system configuration.

  4. Processor Partitioning
     - Quantify the performance gap resulting from using different numbers of processors per node for application execution (for which we use the term processor partitioning).
     - Understand how processor partitioning impacts system and application performance.
     - Investigate how and why an application is sensitive to communication and memory access patterns.

  5. Processor Partitioning Scheme
     - A processor partitioning scheme NxM stands for N nodes with M processor cores per node (PPN).
     - Using processor partitioning changes the memory access pattern and communication pattern of an MPI program (a placement sketch follows).
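On the Cray XT4, an NxM scheme is typically chosen at job launch; under ALPS something like `aprun -n 8 -N 2 ./app` requests 8 MPI ranks at 2 PPN, i.e., a 4x2 scheme (the exact launcher flags are an assumption here, not taken from the slides). A minimal MPI sketch, assuming nothing beyond standard MPI, that reports which node each rank lands on so a given placement can be verified:

```c
/* placement_check.c -- a sketch, not from the slides: each MPI rank
 * prints the node it runs on, so an NxM partitioning (N nodes, M PPN)
 * can be verified after launch, e.g. "aprun -n 8 -N 2 ./placement_check"
 * for a 4x2 scheme (flag usage assumed; check the site's aprun man page). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(node, &len);

    /* Ranks reporting the same node name share that node's memory. */
    printf("rank %d of %d on node %s\n", rank, size, node);

    MPI_Finalize();
    return 0;
}
```

Ranks that land on the same node share that node's memory bandwidth and can communicate through shared memory rather than the 3D torus, which is why the choice of M changes both patterns.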

  6. Outline
     - Introduction
     - Execution Platforms and Performance
       - Memory Performance Analysis: STREAM benchmark
       - MPI Communication Performance Analysis: IMB benchmarks
     - NAS Parallel Benchmarks (MPI, OpenMP)
     - Gyrokinetic Toroidal code (GTC, hybrid)
     - Performance Modeling Using the Prophesy System
     - Summary

  7. Dual- and Quad-core Cray XT4 Configurations

                       Franklin           Jaguar
     Total Cores       19,320             31,328
     Total Nodes       9,660              7,832
     Cores/Chip        2                  4
     Cores/Node        2                  4
     CPU Type          2.6 GHz Opteron    2.1 GHz Opteron
     Memory/Node       4 GB               8 GB
     L1 Cache/CPU      64/64 KB           64/64 KB
     L2 Cache/Chip     1 MB               2 MB
     Network           3D Torus           3D Torus

  8. STREAM Benchmark
     - Synthetic benchmark, written in Fortran 77 and MPI or in C and OpenMP.
     - Measures the sustainable memory bandwidth using the unit-stride TRIAD benchmark (a(i) = b(i) + q*c(i)).
     - The array size is 4M (2^22) elements; a sketch of the kernel follows.
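A minimal sketch of the TRIAD kernel in C and OpenMP, reconstructed from the formula on this slide rather than taken from the STREAM distribution; the timing scheme, repetition count, and initialization values are assumptions:

```c
/* triad.c -- a sketch of the STREAM TRIAD kernel a(i) = b(i) + q*c(i),
 * reconstructed from the slide, not the official STREAM source.
 * Compile e.g.: cc -O3 -fopenmp triad.c -o triad */
#include <stdio.h>
#include <omp.h>

#define N (1 << 22)          /* 4M elements, as on the slide */
#define NTIMES 10

static double a[N], b[N], c[N];

int main(void)
{
    const double q = 3.0;
    double best = 1.0e30;

    /* Initialize in parallel (also places pages via first-touch). */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 2.0; c[i] = 1.0; }

    for (int k = 0; k < NTIMES; k++) {
        double t = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];        /* unit-stride TRIAD */
        t = omp_get_wtime() - t;
        if (t < best) best = t;            /* keep the fastest trial */
    }

    /* TRIAD reads b and c and writes a: 3 arrays of 8-byte doubles. */
    printf("TRIAD bandwidth: %.2f MB/s\n", 3.0 * 8.0 * N / best / 1.0e6);
    return 0;
}
```

Each iteration moves 3 x 8 bytes per element, and with 2^22 elements each array is 32 MB, far larger than the XT4's 1-2 MB L2, so the measured rate reflects sustainable memory bandwidth rather than cache bandwidth.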

  9. Sustainable Memory Bandwidth

     Franklin                    MPI                  OpenMP
     Partitioning scheme         1x2       2x1        2 threads
     Memory Bandwidth (MB/s)     4026.53   6710.89    3565.71

     Jaguar                      MPI                             OpenMP
     Partitioning scheme         1x4       2x2        4x1        4 threads
     Memory Bandwidth (MB/s)     5752.19   10066.33   10066.33   5606.77

  10. Intel's MPI Benchmarks (IMB)
     - Provides a concise set of benchmarks targeted at measuring the most important MPI functions.
     - Version 2.3, written in C and MPI.
     - Uses PingPong to measure uni-directional intra- and inter-node latency and bandwidth (a sketch follows).
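A minimal ping-pong sketch in the spirit of IMB's PingPong (a reconstruction, not the IMB source; the message sizes and repetition count are assumptions): rank 0 sends to rank 1 and waits for the reply, so half the round-trip time is the uni-directional latency, and message size over that time is the bandwidth. Run as a 1x2 scheme it measures intra-node performance; as 2x1, inter-node.

```c
/* pingpong.c -- sketch in the spirit of IMB PingPong (not the IMB source).
 * Run with 2 ranks: a 1x2 scheme measures intra-node latency/bandwidth,
 * a 2x1 scheme measures inter-node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int bytes = 1; bytes <= (1 << 20); bytes *= 2) {
        char *buf = calloc(bytes, 1);
        MPI_Barrier(MPI_COMM_WORLD);
        double t = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {           /* send, then wait for the echo */
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {    /* echo the message back */
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* Uni-directional time is half the measured round trip. */
        double half = (MPI_Wtime() - t) / (2.0 * REPS);
        if (rank == 0)
            printf("%8d bytes  %10.2f us  %10.2f MB/s\n",
                   bytes, half * 1e6, bytes / half / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```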

  11. Uni-directional Latency and Bandwidth

     [Four log-log plots comparing Franklin and Jaguar using PingPong: intra-node latency, inter-node latency, intra-node bandwidth, and inter-node bandwidth. Latency in us and bandwidth in MB/s, plotted against message size from 1 byte to 10 MB.]

  12. Lessons Learned from STREAM and IMB
     - Memory access patterns at different memory hierarchy levels affect sustainable memory bandwidth.
     - The fewer the PPN, the higher the sustainable memory bandwidth.
     - Using all cores per node does not result in the highest memory bandwidth.
     - Intra-node MPI latency is much lower, and bandwidth much higher, than inter-node.

  13. Outline
     - Introduction: Processor Partitioning
     - Execution Platforms and Performance
     - NAS Parallel Benchmarks (MPI, OpenMP)
     - Gyrokinetic Toroidal code (GTC, hybrid)
     - Performance Modeling Using the Prophesy System
     - Summary

  14. NAS Parallel Benchmarks
     - NPB 3.2.1 (MPI and OpenMP)
     - CG, EP, FT, IS, MG, LU, BT, SP
     - Class B and Class C
     - Compiled with ftn and the options -O3 -fastsse on Franklin and Jaguar
     - Strong scaling (defined below)
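Strong scaling here means the problem size (Class B or C) is held fixed while the core count grows. The standard figures of merit (textbook definitions, not taken from the slides) are:

```latex
% Speedup and parallel efficiency for a fixed-size problem,
% where T(p) is the run time on p cores:
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}
```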
