SLIDE 1

Using Processor Partitioning to Evaluate the Performance of MPI, OpenMP and Hybrid Parallel Applications on Dual- and Quad-core Cray XT4 Systems

Xingfu Wu and Valerie Taylor
Department of Computer Science & Engineering, Texas A&M University
Xingfu Wu <wuxf@cs.tamu.edu>, http://prophesy.cs.tamu.edu
CUG2009, May 5, 2009, Atlanta, GA

SLIDE 2

Outline

  • Introduction: Processor Partitioning
  • Execution Platforms and Performance
  • NAS Parallel Benchmarks (MPI, OpenMP)
  • Gyrokinetic Toroidal code (GTC, hybrid)
  • Performance Modeling Using Prophesy System
  • Summary

SLIDE 3

Introduction

  • Chip multiprocessors (CMPs) are usually configured hierarchically to form a compute node of CMP cluster systems.
  • One issue is how many processor cores per node to use for efficient execution.
  • The best number of processor cores per node depends on the application characteristics and the system configuration.

SLIDE 4

Processor Partitioning

  • Quantify the performance gap resulting from using different numbers of processor cores per node for application execution (for which we use the term processor partitioning).
  • Understand how processor partitioning impacts system and application performance.
  • Investigate how and why an application is sensitive to communication and memory access patterns.

SLIDE 5

Processor Partitioning Scheme

  • A processor partitioning scheme NxM stands for N nodes with M processor cores per node (PPN); see the sketch below.
  • Using processor partitioning changes the memory access pattern and the communication pattern of an MPI program.
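
For concreteness, a minimal C sketch of how a job could report the NxM scheme it is running under. MPI_Comm_split_type is MPI-3, newer than this 2009 talk; on the XT4 the scheme was chosen at launch time, e.g. "aprun -n 8 -N 2" starts 8 ranks with 2 per node, a 4x2 partitioning.

    /* Hypothetical sketch: report the NxM partitioning scheme of a job. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int world, ppn, rank;
        MPI_Comm node_comm;
        MPI_Comm_size(MPI_COMM_WORLD, &world);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Group the ranks that share this node's memory and count them. */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_size(node_comm, &ppn);

        if (rank == 0)
            printf("Scheme %dx%d: %d nodes, %d cores per node\n",
                   world / ppn, ppn, world / ppn, ppn);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }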

SLIDE 6

Outline

  • Introduction
  • Execution Platforms and Performance
      • Memory Performance Analysis: STREAM benchmark
      • MPI Communication Performance Analysis: IMB benchmarks
  • NAS Parallel Benchmarks (MPI, OpenMP)
  • Gyrokinetic Toroidal code (GTC, hybrid)
  • Performance Modeling Using Prophesy System
  • Summary

SLIDE 7

Dual- and Quad-core Cray XT4

Configurations       Franklin            Jaguar
Total Cores          19,320              31,328
Total Nodes          9,660               7,832
Cores/chip           2                   4
Cores/Node           2                   4
CPU type             2.6 GHz Opteron     2.1 GHz Opteron
Memory/Node          4 GB                8 GB
L1 Cache/CPU         64/64 KB            64/64 KB
L2 Cache/chip        1 MB                2 MB
Network              3D-Torus            3D-Torus

SLIDE 8

STREAM Benchmark

  • Synthetic benchmarks, written in Fortran 77 and MPI or in C and OpenMP
  • Measure the sustainable memory bandwidth using the unit-stride TRIAD benchmark (a(i) = b(i) + q*c(i)), sketched below
  • The array size is 4M (2^22)
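
A minimal C/OpenMP TRIAD sketch, not the official STREAM source (which repeats the kernel many times and reports the best trial). TRIAD moves 24 bytes per element: two reads and one write.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1L << 22)              /* 4M elements, matching the slide */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        const double q = 3.0;

        #pragma omp parallel for      /* parallel first-touch initialization */
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];   /* the TRIAD kernel */
        t = omp_get_wtime() - t;

        printf("TRIAD bandwidth: %.2f MB/s\n", 24.0 * N / t / 1e6);
        free(a); free(b); free(c);
        return 0;
    }

Varying OMP_NUM_THREADS (or, for the MPI variant, the NxM launch scheme) reproduces the partitioning comparison on the next slide.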

SLIDE 9

Sustainable Memory Bandwidth

Franklin (dual-core):
  Scheme                     OpenMP (2 threads)   MPI 2x1    MPI 1x2
  Memory Bandwidth (MB/s)    3565.71              6710.89    4026.53

Jaguar (quad-core):
  Scheme                     OpenMP (4 threads)   MPI 4x1    MPI 2x2    MPI 1x4
  Memory Bandwidth (MB/s)    5606.77              10066.33   10066.33   5752.19

SLIDE 10

Intel’s MPI Benchmarks (IMB)

  • Provides a concise set of benchmarks targeted at measuring the most important MPI functions
  • Version 2.3, written in C and MPI
  • Uses PingPong to measure uni-directional intra-/inter-node latency and bandwidth (pattern sketched below)
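
A C/MPI sketch of the PingPong pattern, not IMB itself. Run it with exactly two ranks; placing them on the same node or on different nodes yields the intra- and inter-node curves on the next slide. Half the averaged round-trip time is the uni-directional latency, and size/latency is the bandwidth.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int reps = 1000;
        for (int size = 1; size <= 1 << 22; size *= 2) {   /* up to 4 MB */
            char *buf = malloc(size);
            MPI_Barrier(MPI_COMM_WORLD);
            double t = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double lat = (MPI_Wtime() - t) / reps / 2.0;   /* one-way time */
            if (rank == 0)
                printf("%8d bytes: %10.2f us  %10.2f MB/s\n",
                       size, lat * 1e6, size / lat / 1e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }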

SLIDE 11

Uni-directional Latency and Bandwidth

[Chart: Uni-directional intra-node latency comparison using PingPong; latency (us, log scale) vs. message size (bytes, log scale); Franklin vs. Jaguar]

[Chart: Uni-directional inter-node latency comparison using PingPong; same axes; Franklin vs. Jaguar]

[Chart: Uni-directional intra-node bandwidth comparison using PingPong; bandwidth (MB/s, log scale) vs. message size (bytes, log scale); Franklin vs. Jaguar]

[Chart: Uni-directional inter-node bandwidth comparison using PingPong; same axes; Franklin vs. Jaguar]

SLIDE 12

Lessons Learned from STREAM and IMB

  • Memory access patterns at different memory hierarchy levels affect sustainable memory bandwidth
  • The fewer the PPN, the higher the sustainable memory bandwidth
  • Using all cores per node does not result in the highest memory bandwidth
  • Intra-node MPI latency is much lower, and intra-node bandwidth much higher, than inter-node

SLIDE 13

Outline

  • Introduction: Processor Partitioning
  • Execution Platforms and Performance
  • NAS Parallel Benchmarks (MPI, OpenMP)
  • Gyrokinetic Toroidal code (GTC, hybrid)
  • Performance Modeling Using Prophesy System
  • Summary

SLIDE 14

NAS Parallel Benchmarks

  • NPB 3.2.1 (MPI and OpenMP)
  • CG, EP, FT, IS, MG, LU, BT, SP
  • Class B and C
  • Compiler: ftn with the options -O3 -fastsse on Franklin and Jaguar
  • Strong scaling

SLIDE 15

Performance Comparison

[Chart: Performance comparison of MPI and OpenMP on Franklin; time (s) for the MPI (-M) and OpenMP (-O) versions of CG, EP, FT, IS, MG, LU, BT, SP]

[Chart: Ratio of MPI to OpenMP performance on Franklin for each NPB benchmark]

[Chart: Performance comparison of MPI and OpenMP on Jaguar; same benchmarks]

[Chart: Ratio of MPI to OpenMP performance on Jaguar for each NPB benchmark]

SLIDE 16

Using Processor Partitioning

[Chart: Performance comparison using processor partitioning (1x4, 2x2, 4x1) on Jaguar (quad-core); time (s) for the NPB benchmarks, Class B]

[Chart: The same comparison for Class C]

SLIDE 17

Using Hardware Counters’ Performance Data

SP on Jaguar                 1x4        2x2        4x1        Diff. (%)
Runtime (s)                  275.35     179.98     128.01     115.10
D1+D2 hit ratio              92.80%     92.70%     92.70%
D1 hit ratio                 91.20%     91.10%     91.10%
D2 hit ratio                 52.80%     48.30%     45.10%
Mem-D1 BW (MB/s) per core    1212.99    1861.85    2605.24    114.78
L2-D1 BW (MB/s) per core     266.18     409.21     589.88     121.61
L2-Mem BW (MB/s) per core    510.43     992.14     1328.96    160.36
Comm. %                      1.10       1.10       1.90
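
The counter data above was gathered with the XT4's profiling tools; as a hypothetical illustration only, the following sketch derives one such metric, the D1 (L1 data cache) hit ratio = 1 - misses/accesses, with the classic PAPI high-level API around a dummy kernel. Event names and availability vary by platform.

    #include <papi.h>
    #include <stdio.h>

    static volatile double sink;   /* keeps the loop from being optimized away */

    static void kernel(void) {     /* dummy code region to measure */
        double s = 0.0;
        for (int i = 0; i < 1000000; i++) s += i * 0.5;
        sink = s;
    }

    int main(void) {
        int events[2] = { PAPI_L1_DCM, PAPI_L1_DCA };   /* misses, accesses */
        long long v[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_start_counters(events, 2);
        kernel();
        PAPI_stop_counters(v, 2);

        printf("D1 hit ratio: %.2f%%\n",
               100.0 * (1.0 - (double)v[0] / (double)v[1]));
        return 0;
    }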

SLIDE 18

Using Hardware Counters’ Performance Data

EP on Jaguar                 1x4        2x2        4x1        Diff. (%)
Runtime (s)                  28.47      28.39      28.36      0.388
D1+D2 hit ratio              99.10%     99.10%     99.10%
D1 hit ratio                 99.10%     99.10%     99.10%
D2 hit ratio                 3.30%      3.30%      3.30%
Mem-D1 BW (MB/s) per core    288.54     288.71     288.73     0.066
L2-D1 BW (MB/s) per core     2.65       2.64       2.65       0.189
L2-Mem BW (MB/s) per core    144.32     144.41     144.44     0.082
Comm. %                      -          -          -

SLIDE 19

Using Processor Partitioning

[Chart: Performance comparison using processor partitioning (2x2, 4x1) on Franklin (dual-core); time (s) for the NPB benchmarks, Class C]

[Chart: The same comparison for Class B]

SLIDE 20

Using Hardware Counters’ Performance Data

SP on Franklin               2x2        4x1        Difference (%)
Runtime (s)                  169.36     115.14     47.09
D1+D2 hit ratio              92.80%     92.80%
D1 hit ratio                 91.20%     91.10%
D2 hit ratio                 48.90%     45.30%
Mem-D1 BW (MB/s) per core    1962.18    2866.102   46.07
L2-D1 BW (MB/s) per core     433.072    647.878    49.60
L2-Mem BW (MB/s) per core    976.701    1424.925   45.89
Comm. %                      1.00       1.10

SLIDE 21

Lessons Learned from NPB

  • Using processor partitioning changes the memory access pattern and communication pattern of an MPI program.
  • Regarding the merits of using processor partitioning, the hardware performance counter data is conclusive.
  • Processor partitioning has a significant performance impact on an MPI program, except for embarrassingly parallel applications such as EP.
  • The memory bandwidth per core is the primary source of performance degradation when increasing the number of cores per node.

SLIDE 22

Outline

  • Introduction: Processor Partitioning
  • Execution Platforms and Performance
  • NAS Parallel Benchmarks (MPI, OpenMP)
  • Gyrokinetic Toroidal code (GTC, hybrid)
  • Performance Modeling Using Prophesy System
  • Summary

SLIDE 23

GTC Code

  • Gyrokinetic Toroidal Code (GTC)
  • A 3D particle-in-cell application developed at the Princeton Plasma Physics Laboratory to study turbulent transport in magnetic fusion
  • A flagship SciDAC fusion microturbulence code
  • 100 particles per cell and 100 time steps
  • Weak scaling
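
GTC itself is written in Fortran; the C sketch below only illustrates the hybrid MPI+OpenMP structure it uses (OpenMP threads across a node's cores for the particle loops, MPI between processes). The "push" arithmetic and the reduction standing in for the shift phase are placeholders, not the real gyrokinetic equations.

    #include <mpi.h>
    #include <omp.h>

    #define NP 100000                 /* particles per process (made up) */

    static double x[NP], v[NP];       /* particle positions and velocities */

    int main(int argc, char **argv) {
        int provided;
        /* FUNNELED: only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        const double dt = 0.01;
        for (int step = 0; step < 100; step++) {   /* 100 steps, as above */
            double local = 0.0, total;

            /* Threaded "pusher": update every particle owned by this rank. */
            #pragma omp parallel for reduction(+:local)
            for (int i = 0; i < NP; i++) {
                v[i] += dt;                        /* stand-in field kick */
                x[i] += dt * v[i];
                local += v[i] * v[i];
            }

            /* MPI phase: GTC's "shift" moves particles between processes;
             * a global reduction stands in for that communication here. */
            MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }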

SLIDE 24

PIC Steps of GTC Code

[Figure: the particle-in-cell steps of GTC, from Stephane Ethier’s talk at the 2005 BlueGene Applications Workshop]

SLIDE 25

Using Processor Partitioning for 4 Cores on Jaguar

4 Cores (times in s)    1x4       2x2       4x1       Diff. (%)
pusher                  361       346.7     342.2     5.49
shift                   16.07     11.76     10.33     55.57
charge                  304.8     296       292.1     4.35
poisson                 7.51      5.94      5.47      37.19
smooth                  2.46      2.14      2.07      18.95
field                   1.64      1.14      1.03      58.90
load                    4.34      3.88      3.68      18.05
Runtime                 698.45    668.2     657.47    6.23

SLIDE 26

Using Processor Partitioning for 64 Cores on Jaguar

64 Cores (times in s)   16x4      32x2      64x1      Diff. (%)
pusher                  386.8     368.7     365.3     5.89
shift                   55.12     34.6      22.69     142.93
charge                  333.8     317.2     313.9     6.34
poisson                 6.93      5.70      5.70      21.51
smooth                  2.53      2.23      2.06      22.84
field                   1.73      1.10      1.09      60.39
load                    4.53      3.72      3.72      21.79
Runtime                 792.54    733.85    715.16    10.82

SLIDE 27

Performance Comparison

[Chart: Performance of GTC; time (s) vs. number of cores (64 to 8192) for Franklin-MPI, Franklin-Hybrid, Jaguar-MPI and Jaguar-Hybrid]

[Chart: Relative speedup of GTC vs. number of cores for the same four configurations]

[Chart: Ratio of MPI to hybrid GTC runtime on Franklin and Jaguar, 64 to 8192 cores]

SLIDE 28

Performance Modeling
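
The modeling figure from this slide is not captured here. Prophesy fits analytical models to recorded runtimes; below is a minimal sketch, assuming a simple Amdahl-style model T(p) = a + b/p fitted by linear least squares. The timings are made up for illustration and are not data from this talk.

    #include <stdio.h>

    int main(void) {
        const double p[] = { 64, 128, 256, 512 };    /* core counts (made up) */
        const double t[] = { 700, 360, 190, 110 };   /* runtimes in s (made up) */
        const int n = 4;

        /* Least squares for t = a + b*x with x = 1/p. */
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double x = 1.0 / p[i];
            sx += x; sy += t[i]; sxx += x * x; sxy += x * t[i];
        }
        const double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        const double a = (sy - b * sx) / n;

        printf("T(p) ~ %.2f + %.2f/p, so T(1024) ~ %.2f s\n",
               a, b, a + b / 1024.0);
        return 0;
    }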

SLIDE 29

Summary

  • Used STREAM and IMB to understand how processor partitioning impacts system and application performance
  • Used processor partitioning to quantify the performance difference among different processor partitioning schemes for the NAS Parallel Benchmarks
  • Investigated how and why GTC is sensitive to communication and memory access patterns
  • Used processor partitioning to understand an application’s performance characteristics for optimizing the application in order to efficiently utilize all processors per node