SLIDE 1

Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications

Jeff Diamond¹, Martin Burtscher², John D. McCalpin³, Byoung-Do Kim³, Stephen W. Keckler¹,⁴, James C. Browne¹

¹University of Texas, ²Texas State, ³Texas Advanced Computing Center, ⁴NVIDIA

SLIDE 2

Trends In Supercomputers

[Figure: "Top Ten Supercomputers: Core Count". Y-axis: Total Cores; x-axis: Year. Tick values and legend not recoverable.]

SLIDE 3

Is multicore an issue?

[Figure: contribution of multicore chips to the TOP500, 1993–2010. Y-axis: percentage, 0% to 100%. Legend and remaining labels not recoverable.]

SLIDE 4

The Problem: Multicore Scalability

[Figure: two panels. "Interchip Scaling Performance": Speedup vs. Total Active Cores, One Core Per Chip. "Intranode Scaling Performance": Speedup vs. Active Cores Per Node. Tick values and legends not recoverable.]

SLIDE 5

The Problem: Multicore Scalability

[Figure: two panels. "Interchip Scaling Performance": Speedup vs. Total Active Cores, One Core Per Chip. "Intrachip Scaling Performance": Speedup vs. Active Cores Per Chip. Tick values and legends not recoverable.]

SLIDE 6

Optimizations Differ in Multicore

[Figure: bar chart comparing One Core Per Chip with Four Cores Per Chip. Tick values and remaining labels not recoverable.]

Base code vs Multicore Optimized code

SLIDE 7

Paper Contributions

l Studies multicore related bottlenecks l Identifies performance measurement challenges unique to multicore systems l Presents systematic approach to multicore performance analysis l Demonstrates principles of optimization

7

SLIDE 8

Talk Outline

l Introduction l Approach: An HPC Case Study l Multicore Measurement Issues l Optimization Example l Conclusion

8

SLIDE 9

Approach: An HPC Case Study

l Examine a real HPC application

¡ Major functions add variety

l What is a typical HPC application?

¡ Many exhibit low arithmetic intensity

l Typical of explicit / iterative solvers, stencils l Finite volume / elements / differences l CFD, Molecular dynamics, particle simulations, graph search, Sparse MM, etc.

9

SLIDE 10

l Application: HOMME

¡ High Order Method Modeling Environment ¡ 3-D Atmospheric Simulation from NCAR ¡ Required for NSF acceptance testing ¡ Excellent scaling, highly optimized ¡ Arithmetic Intensity typical of stencil codes

l Supercomputers:

¡ Ranger – 62,976 cores, 579 Teraflops

2.3 GHz quad core AMD Barcelona chips

¡ Longhorn – 2,048 cores + 512 GPUs

2.5 GHz quad core Intel Nehalem-EP chips

10

Approach: An HPC Case Study

SLIDE 11

Talk Outline

l Introduction l Approach: An HPC Case Study l Multicore Measurement Issues l Optimization Example l Conclusion

11

SLIDE 12

Multicore Performance Bottlenecks

[Diagram: memory hierarchy. Labels: Single Chip; Single DIMM; Private L1/L2 Cache; Shared L3 Cache; Shared Off-Chip BW; Shared DRAM Page Caches; Node Local DRAM. The chip is drawn with four L1 caches, four L2 caches, and one L3 cache.]

SLIDE 13

Disturbances Persist Longer

[Figure: "Function Performance Over Time". Y-axis: Execution Time (cycles); x-axis: Major Simulation Timesteps. Tick values not recoverable.]

SLIDE 14

Measurement Implications

[Figure: "Function Performance Over Time". Y-axis: Execution Time (cycles); x-axis: Major Simulation Timesteps. Tick values and annotations not recoverable.]

SLIDE 15

Measurements Must Be Lightweight

Action                      Cycles
Read Counter                     9
Read Four Counters              30
Call Function                   40
PAPI READ                      400
System Call                  5,000
TLB Page Initialization     25,000

Duration of major HOMME functions

Function Duration          Calls Per Second    % Exec Time
2,000 cycles or less                100,000            20%
2,000 to 10,000 cycles               20,000            10%
10K to 200K cycles                    1,600            15%
200K to 1M cycles                       200            15%
1M to 10M cycles                                        0%
10M or more cycles                        4            35%
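
To make the two tables concrete, the sketch below estimates how much a measurement perturbs a function when one counter read is placed before it and one after it. The function name `bracket_overhead` and the two-reads-per-call model are assumptions for illustration, not taken from the talk; the cycle costs are the ones listed above.

```python
# Illustrative sketch: relative cost of bracketing a function with two
# counter reads (one before, one after). The two-read model is an assumption.

def bracket_overhead(read_cycles: int, function_cycles: int) -> float:
    """Return measurement cycles as a fraction of the function's own cycles."""
    if function_cycles <= 0:
        raise ValueError("function_cycles must be positive")
    return 2 * read_cycles / function_cycles

# Costs from the table above (cycles per action).
RAW_COUNTER_READ = 9
PAPI_READ = 400

# A function at the upper edge of the shortest duration class.
SHORT_FUNCTION = 2_000

raw = bracket_overhead(RAW_COUNTER_READ, SHORT_FUNCTION)
papi = bracket_overhead(PAPI_READ, SHORT_FUNCTION)
print(f"raw counter: {raw:.1%}, PAPI: {papi:.1%}")
```

Under this model a 2,000-cycle function pays under 1% for two raw counter reads but 40% for two PAPI reads, and shorter functions in the same class pay proportionally more.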

SLIDE 16

Multicore Measurement Issues

l Performance issues in shared memory system

¡ Context Sensitive ¡ Nondeterministic ¡ Highly non local

l Measurement disturbance is significant

¡ Accessing memory or delaying core ¡ Hard to “bracket” measurement effects ¡ Disturbances can last billions of cycles ¡ Bottlenecks can be “bursty”

l Conclusion – need multiple tools

16

SLIDE 17

Talk Outline

l Introduction l Approach: An HPC Case Study l Multicore Measurement Issues l Optimization Example l Conclusion

17

SLIDE 18

Multicore Performance Bottlenecks

[Diagram: memory hierarchy. Labels: Single Chip; Single DIMM; Shared L3 Cache; Shared Off-Chip BW; Shared DRAM Page Caches; Node Local DRAM. The chip is drawn with four L1 caches, four L2 caches, and one L3 cache.]

SLIDE 19

Measurement Approach

l Find important functions l Compare performance counters at min/max core density l Identify key multicore bottleneck:

¡ L3 capacity – L3 miss rates increase with density ¡ Off-chip BW – BW usage at min density greater than share ¡ DRAM contention – DRAM page miss rates increase with density

l For small and medium functions, follow up with light weight / temporal measurements

19
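
One way to read the comparison step is as a small decision rule over counters collected at minimum and maximum core density. The sketch below is a hypothetical illustration: the `Counters` fields, the 10% growth threshold, the equal-share bandwidth test, and the sample numbers are all assumptions, not values from the talk.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Counters:
    """Per-function measurements at one core density (hypothetical fields)."""
    l3_miss_rate: float         # L3 misses per L3 access
    dram_page_miss_rate: float  # DRAM page misses per DRAM access
    offchip_bw_gbs: float       # off-chip bandwidth used by one core, GB/s

def find_bottlenecks(min_density: Counters, max_density: Counters,
                     chip_bw_gbs: float, cores_per_chip: int,
                     growth: float = 1.10) -> List[str]:
    """Flag the three bottlenecks listed above; `growth` is an assumed threshold."""
    found = []
    # L3 capacity: miss rate rises when more cores share the cache.
    if max_density.l3_miss_rate > growth * min_density.l3_miss_rate:
        found.append("L3 capacity")
    # Off-chip BW: a lone core already uses more than an equal share.
    if min_density.offchip_bw_gbs > chip_bw_gbs / cores_per_chip:
        found.append("off-chip BW")
    # DRAM contention: page miss rate rises with density.
    if max_density.dram_page_miss_rate > growth * min_density.dram_page_miss_rate:
        found.append("DRAM contention")
    return found

# Made-up sample numbers, for illustration only.
one_core = Counters(l3_miss_rate=0.05, dram_page_miss_rate=0.20, offchip_bw_gbs=4.0)
four_cores = Counters(l3_miss_rate=0.12, dram_page_miss_rate=0.21, offchip_bw_gbs=2.5)
print(find_bottlenecks(one_core, four_cores, chip_bw_gbs=10.0, cores_per_chip=4))
```

With these sample numbers the rule flags L3 capacity and off-chip BW but not DRAM contention, since the page miss rate grows by less than the assumed threshold.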

SLIDE 20

Important HOMME Loop

    ! Smooth each field at time level n0 using levels nm1 and np1
    do k=1,nlev
       do j=1,nv
          do i=1,nv
             T(i,j,k,n0) = T(i,j,k,n0) + smooth*(T(i,j,k,nm1) &
                  - 2.0D0*T(i,j,k,n0) + T(i,j,k,np1))
             v(i,j,1,k,n0) = v(i,j,1,k,n0) + smooth*(v(i,j,1,k,nm1) &
                  - 2.0D0*v(i,j,1,k,n0) + v(i,j,1,k,np1))
             v(i,j,2,k,n0) = v(i,j,2,k,n0) + smooth*(v(i,j,2,k,nm1) &
                  - 2.0D0*v(i,j,2,k,n0) + v(i,j,2,k,np1))
             div(i,j,k,n0) = div(i,j,k,n0) + smooth*(div(i,j,k,nm1) &
                  - 2.0D0*div(i,j,k,n0) + div(i,j,k,np1))
          end do
       end do
    end do
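
A rough count shows why a loop like this has low arithmetic intensity. The sketch below assumes 8-byte double-precision values, no reuse from cache, and 5 floating-point operations per update (two multiplies plus three additions or subtractions); these are simplifying assumptions, not figures from the talk, and the helper name is hypothetical.

```python
# Illustrative sketch: flops per byte moved for one array update in the loop.
# Assumes each operand is fetched from memory once and written back once.

def arithmetic_intensity(flops: int, loads: int, stores: int,
                         bytes_per_value: int = 8) -> float:
    """Return floating-point operations per byte of memory traffic."""
    bytes_moved = (loads + stores) * bytes_per_value
    return flops / bytes_moved

# One update reads the nm1, n0, and np1 values and writes the n0 value.
per_update = arithmetic_intensity(flops=5, loads=3, stores=1)
print(f"{per_update:.5f} flops/byte")
```

Under these assumptions each update performs about 0.16 floating-point operations per byte moved, so memory traffic rather than arithmetic limits the loop.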

SLIDE 21


Apply Microfission (First Line)

SLIDE 22

Loop Microfission

l Local, context free optimization l Each array processed independently

¡ Add high-level blocking to fit cache

l Reduces total DRAM banks accessed

¡ Statistically reduces DRAM page miss rate

l Reduces instantaneous working set size

¡ Helps with L3 capacity and off-chip BW

22
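
The restructuring can be sketched as follows. This illustrates loop fission in general and is written in Python for brevity rather than in the Fortran of the HOMME loop. The flat-list data layout, the function names, the sample values, and the second-difference form of the update are assumptions, and the high-level blocking step is omitted.

```python
from typing import Dict, List

Field = Dict[str, List[float]]  # time level ("nm1", "n0", "np1") -> values

def smooth_point(f: Field, i: int, smooth: float) -> float:
    """Smoothed n0 value at point i (assumed second-difference update)."""
    return f["n0"][i] + smooth * (f["nm1"][i] - 2.0 * f["n0"][i] + f["np1"][i])

def fused(fields: Dict[str, Field], smooth: float) -> None:
    """Original shape: each loop iteration touches every array."""
    n = len(next(iter(fields.values()))["n0"])
    for i in range(n):
        for f in fields.values():
            f["n0"][i] = smooth_point(f, i, smooth)

def fissioned(fields: Dict[str, Field], smooth: float) -> None:
    """Fissioned shape: finish one array before starting the next."""
    for f in fields.values():
        for i in range(len(f["n0"])):
            f["n0"][i] = smooth_point(f, i, smooth)

def make_fields(n: int = 4) -> Dict[str, Field]:
    """Build small made-up sample data for T, v1, v2, and div."""
    return {
        name: {"nm1": [float(i) for i in range(n)],
               "n0": [float(2 * i + k) for i in range(n)],
               "np1": [float(i * i) for i in range(n)]}
        for k, name in enumerate(["T", "v1", "v2", "div"])
    }

a, b = make_fields(), make_fields()
fused(a, 0.1)
fissioned(b, 0.1)
print(a == b)  # both orderings compute the same values
```

The sketch only demonstrates that the two loop orders are equivalent because the arrays do not depend on each other. Any performance effect comes from the memory system of the target machine and is not visible in Python.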

SLIDE 23

Microfission Results

[Figure: chart of microfission results. Y-axis: percentage, 0% to 140%. Remaining labels not recoverable.]

SLIDE 24

Talk Outline

l Introduction l Approach: An HPC Case Study l Multicore Measurement Issues l Optimization Example l Conclusion

24

SLIDE 25

Summary and Conclusions

• HPC scalability must include multicore
  ◦ Not well understood
  ◦ Requires new analysis and measurement techniques
  ◦ Optimizations differ from single-core
• Microfission is just one example
  ◦ Multicore locality optimization for shared caches
  ◦ Improves performance by 35%

SLIDE 26

Future Work

• Expect multicore observations to apply to other HPC applications with low arithmetic intensity
  ◦ Irregular parallel applications: adaptive meshes, heterogeneous workloads
  ◦ Irregular blocking applications: graph traversal
• Wider range of multicore (memory-focused) optimizations
  ◦ Recomputation
  ◦ Relocating data
  ◦ Temporary storage reduction
  ◦ Structural changes

SLIDE 27


Thank You

l Any Questions?

!"#$%&'()"#+",+-(.)/"%0+!1&2+$"+ 345677+89:;<+=>>?@A7=7+