SLIDE 1

Modeling Cache Sharing for MPI Programs on Multi-core Machines

Bin Bao, Chen Ding (University of Rochester)

Nov 10, 2011, The 10th Workshop on Compiler-Driven Performance

Thursday, November 10, 11

SLIDE 2

Multi-core Popularity

✤ More and more cluster machines use multi-core processors
✤ TOP500.org (June 2011): “Quad-core processors are used in 46.2 percent of the systems, while already 42.4 percent of the systems use processors with six or more cores.”

SLIDE 3

Programming on Cluster

✤ MPI (Message-Passing Interface) is still dominant
✤ Scalability issues, e.g.:
✤ Load balance
✤ Communication overhead
✤ Multicore: resource contention

SLIDE 4

Performance Impact of Resource Sharing

✤ Experimental studies: Chai et al. [CCGRID’07], Saini et al. [SC’09], etc.

[Figure: speedup (1 to 3) of cg, ep, ft, is, lu, mg under 1x2, 2x2, and 4x2 task placements]

✤ Intel Nehalem E5520 (4 cores, shared 8MB L3 cache)
✤ GCC 4.4.1, MPICH2 1.4.1

SLIDE 5

Goal: Modeling Cache Contention

✤ Tool: reuse distance (aka LRU stack distance): the number of distinct data elements accessed between two consecutive references to the same element

a b c c d a → rd(a) = 3

✤ Reuse distance can be used to calculate the cache miss rate
✤ Program A’s cache miss rate = P(A’s reuse distance ≥ cache size)

SLIDE 6

Locality (Reuse Distance) Scaling

✤ Strong scaling: fixed total problem size ✤ Fixed-distance reuses and scaled-distance reuses

[Example: an n×n matrix-vector multiply, A × b = c]
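A small assumed example shows where the two kinds of reuses come from: one task of an iterative, row-partitioned matrix-vector multiply touches n/nproc rows of A and all of b in every sweep, so cross-sweep reuses of A shrink with nproc while row-to-row reuses of b do not:

```python
def task_trace(n, nproc, sweeps=2):
    """Accesses of one task: its block of A rows plus the whole b vector,
    repeated for a few sweeps (e.g. an iterative solver)."""
    trace = []
    for _ in range(sweeps):
        for i in range(n // nproc):          # row block shrinks with nproc
            for j in range(n):
                trace.append(("A", i, j))    # reused once per sweep
                trace.append(("b", j))       # reused on every row
    return trace

def rd(trace, elem):
    """Distinct elements between the last two accesses to elem."""
    idx = [i for i, x in enumerate(trace) if x == elem]
    return len(set(trace[idx[-2] + 1 : idx[-1]]))

for nproc in (1, 2, 4):
    t = task_trace(8, nproc)
    print(nproc, rd(t, ("A", 0, 0)), rd(t, ("b", 0)))
# nproc  rd of A(0,0)  rd of b(0)
#   1        71            15      <- A: scaled-distance reuse
#   2        39            15
#   4        23            15      <- b: fixed-distance reuse
```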

SLIDE 9

Reuse Distance Reference Histogram

NAS-LU (B input)

[Figure: average reuse distance (1KB to 1GB) per reference partition (200 to 1000), for 1x2, 2x2, and 4x2 runs]

SLIDE 11

More Examples

[Figures: reuse distance reference histograms for CG, FT, IS, and MG: average reuse distance (1KB to 1GB) per reference partition, each for 1x2, 2x2, and 4x2 runs]

SLIDE 12

Linear Regression Based Reuse Distance Prediction

✤ For partition i: rd_i = a_i × (1/nproc) + b_i
✤ The a_i term captures scaled-distance reuses; the constant b_i captures fixed-distance reuses
✤ Each partition is fit independently
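For each partition, the coefficients can be fit by ordinary least squares on x = 1/nproc; below is a sketch with synthetic numbers, not the paper's data:

```python
def fit_partition(nprocs, rds):
    """Least-squares fit of rd = a * (1/nproc) + b for one partition."""
    xs = [1.0 / p for p in nprocs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(rds) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, rds)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Synthetic distances that follow rd = 1000/nproc + 50 exactly
nprocs = [2, 4, 8]
rds = [1000 / p + 50 for p in nprocs]
a, b = fit_partition(nprocs, rds)
print(round(a), round(b))    # 1000 50
print(a / 16 + b)            # extrapolated distance for 16 tasks (~112.5)
```

The same fit, run once per histogram partition, predicts the whole reuse distance histogram at task counts that were never measured.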

SLIDE 13

Cache Sharing

✤ General dilation model [Xiang et al. PPoPP’11]
✤ Symmetric MPI programs: uniform interleaving assumption

Task A: a b c d e f a (rd = 5)
Task B: k l m n o p k (rd = 5)
Tasks A&B interleaved: a b c k l m d e n o f p a k (rd'' = 11)

Program A: a b c d e f a (rd = 5)
Program B: k m m m n o n (footprint ft = 4)
Programs A&B interleaved: a k b c m d m e m f n o n a (rd' = rd + ft = 9)
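The interleaving arithmetic above can be checked directly; a strict round-robin interleaving (a sketch of the uniform interleaving assumption, not the paper's model code) yields the same distance as the slide's example:

```python
from itertools import chain

def reuse_distance_of(trace, elem):
    """Distinct elements strictly between the first two accesses to elem."""
    first = trace.index(elem)
    second = trace.index(elem, first + 1)
    return len(set(trace[first + 1 : second]))

def interleave(a, b):
    """Uniform interleaving: both tasks advance in lock step."""
    return list(chain.from_iterable(zip(a, b)))

A = list("abcdefa")    # task A, rd(a) = 5
B = list("klmnopk")    # a symmetric peer task

print(reuse_distance_of(A, "a"))                 # 5
print(reuse_distance_of(interleave(A, B), "a"))  # 11
```

The shared-cache distance 11 is A's own 5 plus the 6 distinct elements B touches inside the reuse window, i.e. rd' = rd + ft with the peer's footprint ft.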

SLIDE 14

Experiments

✤ Pin-based trace cache simulator (16-way LRU, 8MB)
✤ Performance counters (OProfile, LLC Misses)

[Diagram: one Pin tool per MPI task; the tasks feed a shared simulated cache whose blocks are guarded by a lock]
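As a stand-in for the Pin-based tool (whose internals are not in the slides), a toy 16-way LRU set-associative cache can be sketched; the 64-byte line size and the address stream are assumptions:

```python
from collections import OrderedDict

class LRUCache:
    """Toy set-associative cache with LRU replacement."""
    def __init__(self, num_sets, ways, line=64):
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.num_sets, self.ways, self.line = num_sets, ways, line
        self.accesses = self.misses = 0

    def access(self, addr):
        self.accesses += 1
        block = addr // self.line
        s = self.sets[block % self.num_sets]
        if block in s:
            s.move_to_end(block)         # refresh LRU position on a hit
        else:
            self.misses += 1
            if len(s) == self.ways:
                s.popitem(last=False)    # evict the least recently used
            s[block] = True

# 8MB, 16-way, 64B lines -> 8 * 2**20 / (16 * 64) = 8192 sets
cache = LRUCache(num_sets=8192, ways=16)
for addr in range(0, 1 << 20, 64):       # stream 1MB of cold data
    cache.access(addr)
print(cache.misses, cache.accesses)      # 16384 16384
```

To model sharing, each task's trace would call access() on the same cache object in turn, which is what the per-task Pin tools and the lock in the diagram coordinate.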

SLIDE 15

Cache Simulator vs Reuse Distance Based Calculation

[Figures: normalized memory traffic (0.4 to 1.6) of cg, ep, ft, is, lu, mg under 1x2, 2x2, and 4x2 placements]

(a) Cache Simulator (b) Reuse Distance Based Calculation

SLIDE 16

Reuse Distance Prediction

(a) Reuse Distance Based Calculation (b) Reuse Distance Prediction (8-task)

[Figures: normalized memory traffic (0.4 to 1.6) of cg, ep, ft, is, lu, mg under 1x2, 2x2, and 4x2 placements]

SLIDE 17

Hardware Performance Counter

[Figures: normalized memory traffic (0.4 to 1.6) of cg, ep, ft, is, lu, mg under 1x2, 2x2, and 4x2 placements]

(a) Cache Simulator (b) Performance Counter

SLIDE 19

Hardware Performance Counter

[Figures: normalized memory traffic (0.4 to 1.6) of cg, ep, ft, is, lu, mg under 1x2, 2x2, and 4x2 placements]

(a) Cache Simulator (b) Performance Counter

NAS FT source shown on the slide:

  do k = 1, d3
    do ii = 0, d1 - fftblock, fftblock
      do j = 1, d2
        do i = 1, fftblock
          y(i,j,1) = x(i+ii,j,k)
        enddo
      enddo
      call cfftz (is, logd2, d2, y, y(1, 1, 2))
      do j = 1, d2
        do i = 1, fftblock
          xout(i+ii,j,k) = y(i,j,1)
        enddo
      enddo
    enddo
  enddo

SLIDE 20

Summary & Future Work

✤ Reuse distance reference histograms show clear patterns
✤ Linear regression based reuse distance prediction
✤ Coarse-granularity uniform interleaving assumption
✤ Verified with a Pin-based cache simulator
✤ Future work: memory bandwidth contention modeling
