SLIDE 1

Modeling Cache Sharing for MPI Programs on Multi-core Machines

Bin Bao, Chen Ding (University of Rochester)

Nov 10, 2011, The 10th Workshop on Compiler-Driven Performance

Thursday, November 10, 11

SLIDE 2

Multi-core Popularity

✤ More and more cluster machines use multi-core processors
✤ TOP500.org (June 2011): “Quad-core processors are used in 46.2 percent of the systems, while already 42.4 percent of the systems use processors with six or more cores.”

SLIDE 3

Programming on Cluster

✤ MPI (Message-Passing Interface) is still dominant
✤ Scalability issues, e.g.:
✤ Load balance
✤ Communication overhead
✤ Multicore: resource contention

SLIDE 4

Performance Impact of Resource Sharing

✤ Experimental studies: Chai et al. [CCGRID’07], Saini et al. [SC’09], etc.

[Figure: speedup (1 to 3) of cg, ep, ft, is, lu, mg under 1x2, 2x2, and 4x2 task placements]

✤ Intel Nehalem E5520 (4 cores, shared 8MB L3 cache)
✤ GCC 4.4.1, MPICH2 1.4.1

SLIDE 5

Goal: Modeling Cache Contention

✤ Tool: reuse distance (aka LRU stack distance): the number of distinct data elements accessed between two consecutive references to the same element

a b c c d a → rd(a) = 3

✤ Reuse distance can be used to calculate the cache miss rate
✤ Program A’s cache miss rate = P(A’s reuse distance ≥ cache size)

SLIDE 6

Locality (Reuse Distance) Scaling

✤ Strong scaling: fixed total problem size ✤ Fixed-distance reuses and scaled-distance reuses

[Example: an n×n matrix-vector multiply, A × b = c]
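A small assumed example shows where the two kinds of reuses come from: one task of an iterative, row-partitioned matrix-vector multiply touches n/nproc rows of A and all of b in every sweep, so cross-sweep reuses of A shrink with nproc while row-to-row reuses of b do not:

```python
def task_trace(n, nproc, sweeps=2):
    """Accesses of one task: its block of A rows plus the whole b vector,
    repeated for a few sweeps (e.g. an iterative solver)."""
    trace = []
    for _ in range(sweeps):
        for i in range(n // nproc):          # row block shrinks with nproc
            for j in range(n):
                trace.append(("A", i, j))    # reused once per sweep
                trace.append(("b", j))       # reused on every row
    return trace

def rd(trace, elem):
    """Distinct elements between the last two accesses to elem."""
    idx = [i for i, x in enumerate(trace) if x == elem]
    return len(set(trace[idx[-2] + 1 : idx[-1]]))

for nproc in (1, 2, 4):
    t = task_trace(8, nproc)
    print(nproc, rd(t, ("A", 0, 0)), rd(t, ("b", 0)))
# nproc  rd of A(0,0)  rd of b(0)
#   1        71            15      <- A: scaled-distance reuse
#   2        39            15
#   4        23            15      <- b: fixed-distance reuse
```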

SLIDE 9

Reuse Distance Reference Histogram

NAS-LU (B input)

[Figure: average reuse distance (1KB to 1GB) per reference partition (200 to 1000), for 1x2, 2x2, and 4x2 runs]

SLIDE 11

More Examples

[Figures: reuse distance reference histograms for CG, FT, IS, and MG: average reuse distance (1KB to 1GB) per reference partition, each for 1x2, 2x2, and 4x2 runs]

SLIDE 12

Linear Regression Based Reuse Distance Prediction

✤ For partition i: rd_i = a_i × (1/nproc) + b_i
✤ The a_i term captures scaled-distance reuses; the constant b_i captures fixed-distance reuses
✤ Each partition is fit independently
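For each partition, the coefficients can be fit by ordinary least squares on x = 1/nproc; below is a sketch with synthetic numbers, not the paper's data:

```python
def fit_partition(nprocs, rds):
    """Least-squares fit of rd = a * (1/nproc) + b for one partition."""
    xs = [1.0 / p for p in nprocs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(rds) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, rds)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Synthetic distances that follow rd = 1000/nproc + 50 exactly
nprocs = [2, 4, 8]
rds = [1000 / p + 50 for p in nprocs]
a, b = fit_partition(nprocs, rds)
print(round(a), round(b))    # 1000 50
print(a / 16 + b)            # extrapolated distance for 16 tasks (~112.5)
```

The same fit, run once per histogram partition, predicts the whole reuse distance histogram at task counts that were never measured.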

SLIDE 13

Cache Sharing

✤ General dilation model [Xiang et al. PPoPP’11]
✤ Symmetric MPI programs: uniform interleaving assumption

Task A: a b c d e f a (rd = 5)
Task B: k l m n o p k (rd = 5)
Tasks A&B interleaved: a b c k l m d e n o f p a k (rd'' = 11)

Program A: a b c d e f a (rd = 5)
Program B: k m m m n o n (footprint ft = 4)
Programs A&B interleaved: a k b c m d m e m f n o n a (rd' = rd + ft = 9)
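The interleaving arithmetic above can be checked directly; a strict round-robin interleaving (a sketch of the uniform interleaving assumption, not the paper's model code) yields the same distance as the slide's example:

```python
from itertools import chain

def reuse_distance_of(trace, elem):
    """Distinct elements strictly between the first two accesses to elem."""
    first = trace.index(elem)
    second = trace.index(elem, first + 1)
    return len(set(trace[first + 1 : second]))

def interleave(a, b):
    """Uniform interleaving: both tasks advance in lock step."""
    return list(chain.from_iterable(zip(a, b)))

A = list("abcdefa")    # task A, rd(a) = 5
B = list("klmnopk")    # a symmetric peer task

print(reuse_distance_of(A, "a"))                 # 5
print(reuse_distance_of(interleave(A, B), "a"))  # 11
```

The shared-cache distance 11 is A's own 5 plus the 6 distinct elements B touches inside the reuse window, i.e. rd' = rd + ft with the peer's footprint ft.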

SLIDE 14

Experiments

✤ Pin-based trace cache simulator (16-way LRU, 8MB)
✤ Performance counters (OProfile, LLC Misses)

[Diagram: one Pin tool per MPI task; the tasks feed a shared simulated cache whose blocks are guarded by a lock]
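As a stand-in for the Pin-based tool (whose internals are not in the slides), a toy 16-way LRU set-associative cache can be sketched; the 64-byte line size and the address stream are assumptions:

```python
from collections import OrderedDict

class LRUCache:
    """Toy set-associative cache with LRU replacement."""
    def __init__(self, num_sets, ways, line=64):
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.num_sets, self.ways, self.line = num_sets, ways, line
        self.accesses = self.misses = 0

    def access(self, addr):
        self.accesses += 1
        block = addr // self.line
        s = self.sets[block % self.num_sets]
        if block in s:
            s.move_to_end(block)         # refresh LRU position on a hit
        else:
            self.misses += 1
            if len(s) == self.ways:
                s.popitem(last=False)    # evict the least recently used
            s[block] = True

# 8MB, 16-way, 64B lines -> 8 * 2**20 / (16 * 64) = 8192 sets
cache = LRUCache(num_sets=8192, ways=16)
for addr in range(0, 1 << 20, 64):       # stream 1MB of cold data
    cache.access(addr)
print(cache.misses, cache.accesses)      # 16384 16384
```

To model sharing, each task's trace would call access() on the same cache object in turn, which is what the per-task Pin tools and the lock in the diagram coordinate.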

SLIDE 15

Cache Simulator vs Reuse Distance Based Calculation

[Figures: normalized memory traffic (0.4 to 1.6) of cg, ep, ft, is, lu, mg under 1x2, 2x2, and 4x2 placements]

(a) Cache Simulator (b) Reuse Distance Based Calculation

SLIDE 16

Reuse Distance Prediction

(a) Reuse Distance Based Calculation (b) Reuse Distance Prediction (8-task)

[Figures: normalized memory traffic (0.4 to 1.6) of cg, ep, ft, is, lu, mg under 1x2, 2x2, and 4x2 placements]

SLIDE 17

Hardware Performance Counter

[Figures: normalized memory traffic (0.4 to 1.6) of cg, ep, ft, is, lu, mg under 1x2, 2x2, and 4x2 placements]

(a) Cache Simulator (b) Performance Counter

SLIDE 19

Hardware Performance Counter

[Figures: normalized memory traffic (0.4 to 1.6) of cg, ep, ft, is, lu, mg under 1x2, 2x2, and 4x2 placements]

(a) Cache Simulator (b) Performance Counter

NAS FT source shown on the slide:

  do k = 1, d3
    do ii = 0, d1 - fftblock, fftblock
      do j = 1, d2
        do i = 1, fftblock
          y(i,j,1) = x(i+ii,j,k)
        enddo
      enddo
      call cfftz (is, logd2, d2, y, y(1, 1, 2))
      do j = 1, d2
        do i = 1, fftblock
          xout(i+ii,j,k) = y(i,j,1)
        enddo
      enddo
    enddo
  enddo

SLIDE 20

Summary & Future Work

✤ Reuse distance reference histograms show clear patterns
✤ Linear regression based reuse distance prediction
✤ Coarse-granularity uniform interleaving assumption
✤ Verified with a Pin-based cache simulator
✤ Future work: memory bandwidth contention modeling
