
Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications



  1. Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications
     Jeff Diamond (1), Martin Burtscher (2), John D. McCalpin (3), Byoung-Do Kim (3), Stephen W. Keckler (1,4), James C. Browne (1)
     (1) University of Texas, (2) Texas State, (3) Texas Advanced Computing Center, (4) NVIDIA

  2. Trends In Supercomputers
     [Figure: chart of supercomputer trends by year; axis labels and values not recoverable from the extraction.]

  3. [Slide title and chart not recoverable from the extraction.]
     Callout on the chart: "Is multicore an issue?"

  4. The Problem: Multicore Scalability
     [Figure: speedup plots. Panels: "Interchip Scaling Performance" (x-axis: total active cores, one core per chip) and "Intranode Scaling Performance" (x-axis: active cores per node). Tick values and legend not recoverable from the extraction.]

  5. The Problem: Multicore Scalability
     [Figure: speedup plots. Panels: "Intrachip Scaling Performance" (x-axis: active cores per chip) and "Interchip Scaling Performance" (x-axis: total active cores, one core per chip). Tick values and legend not recoverable from the extraction.]

  6. Optimizations Differ in Multicore
     [Figure: bar charts; axis labels and values not recoverable from the extraction.]
     Caption: Base code vs. multicore-optimized code

  7. Paper Contributions
     - Studies multicore related bottlenecks
     - Identifies performance measurement challenges unique to multicore systems
     - Presents systematic approach to multicore performance analysis
     - Demonstrates principles of optimization

  8. Talk Outline
     - Introduction
     - Approach: An HPC Case Study
     - Multicore Measurement Issues
     - Optimization Example
     - Conclusion

  9. Approach: An HPC Case Study
     - Examine a real HPC application
       - Major functions add variety
     - What is a typical HPC application?
       - Many exhibit low arithmetic intensity
     - Typical of explicit / iterative solvers, stencils
     - Finite volume / elements / differences
     - CFD, Molecular dynamics, particle simulations, graph search, Sparse MM, etc.

  10. Approach: An HPC Case Study
     - Application: HOMME
       - High Order Method Modeling Environment
       - 3-D Atmospheric Simulation from NCAR
       - Required for NSF acceptance testing
       - Excellent scaling, highly optimized
       - Arithmetic Intensity typical of stencil codes
     - Supercomputers:
       - Ranger: 62,976 cores, 579 Teraflops
         - 2.3 GHz quad core AMD Barcelona chips
       - Longhorn: 2,048 cores + 512 GPUs
         - 2.5 GHz quad core Intel Nehalem-EP chips

  11. Talk Outline
     - Introduction
     - Approach: An HPC Case Study
     - Multicore Measurement Issues
     - Optimization Example
     - Conclusion

  12. Multicore Performance Bottlenecks
     [Diagram of a node with per-core L1 and L2 caches and an L3 cache. Resource labels: private L1/L2 cache, shared L3 cache, shared off-chip BW, shared DRAM page caches. Scope labels: single chip, node, local DRAM, single DIMM.]

  13. Disturbances Persist Longer
     [Figure: "Function Performance Over Time". Execution time (cycles) plotted against major simulation timesteps. Tick values not recoverable from the extraction.]

  14. Measurement Implications
     [Figure: "Function Performance Over Time". Execution time (cycles) plotted against major simulation timesteps, with three series. Legend and tick values not recoverable from the extraction.]

  15. Measurements Must Be Lightweight

     Action                      Cycles
     Read Counter                     9
     Read Four Counters              30
     Call Function                   40
     PAPI READ                      400
     System Call                  5,000
     TLB Page Initialization     25,000

     Duration of major HOMME functions:

     Function Duration           Calls Per Second    % Exec Time
     2,000 cycles or less                 100,000            20%
     2,000 to 10,000 cycles                20,000            10%
     10K to 200K cycles                     1,600            15%
     200K to 1M cycles                        200            15%
     1M to 10M cycles                           -             0%
     10M or more cycles                         4            35%
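
A rough worked estimate shows why these costs matter for the shortest functions. It assumes each call is bracketed by one read before and one after (two reads per call) and that the read cost simply adds to the function's run time. Both are simplifying assumptions, not statements from the slide. The symbols c_read (cycles per read) and t_func (function duration in cycles) are introduced here for illustration.

```latex
\[
\text{overhead} \approx \frac{2\,c_{\text{read}}}{t_{\text{func}}}
\]
\[
\text{PAPI read: } \frac{2 \times 400}{2000} = 40\%, \qquad
\text{four counters: } \frac{2 \times 30}{2000} = 3\%, \qquad
\text{one counter: } \frac{2 \times 9}{2000} = 0.9\%
\]
```

Under these assumptions, bracketing a 2,000-cycle function with PAPI reads would inflate it by roughly 40%, while direct counter reads stay in the low single digits.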

  16. Multicore Measurement Issues
     - Performance issues in shared memory system
       - Context Sensitive
       - Nondeterministic
       - Highly non local
     - Measurement disturbance is significant
       - Accessing memory or delaying core
       - Hard to “bracket” measurement effects
       - Disturbances can last billions of cycles
       - Bottlenecks can be “bursty”
     - Conclusion: need multiple tools

  17. Talk Outline
     - Introduction
     - Approach: An HPC Case Study
     - Multicore Measurement Issues
     - Optimization Example
     - Conclusion

  18. Multicore Performance Bottlenecks
     [Diagram of a node with per-core L1 and L2 caches and an L3 cache. Resource labels: shared L3 cache, shared off-chip BW, shared DRAM page caches. Scope labels: single chip, node, local DRAM, single DIMM.]

  19. Measurement Approach
     - Find important functions
     - Compare performance counters at min/max core density
     - Identify key multicore bottleneck:
       - L3 capacity: L3 miss rates increase with density
       - Off-chip BW: BW usage at min density greater than share
       - DRAM contention: DRAM page miss rates increase with density
     - For small and medium functions, follow up with light weight / temporal measurements
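
The three bottleneck tests on this slide can be sketched as a small Fortran module. This is only an illustration: the derived type `counters`, the subroutine `classify`, the relative tolerance `tol`, and the idea of passing the per-core bandwidth share as `bw_fair_share` are all hypothetical names and choices, not part of the talk or of any profiling library.

```fortran
! Hypothetical sketch of the slide's three checks; names and the
! tolerance-based comparison are assumptions, not the authors' tool.
module bottleneck_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  ! Counter-derived metrics for one function at one core density.
  type :: counters
    real(dp) :: l3_miss_rate         ! L3 misses per L3 access
    real(dp) :: offchip_bw           ! off-chip bandwidth used per core
    real(dp) :: dram_page_miss_rate  ! DRAM page misses per DRAM access
  end type counters

contains

  subroutine classify(at_min, at_max, bw_fair_share, tol, &
                      l3_bound, bw_bound, dram_bound)
    type(counters), intent(in) :: at_min, at_max  ! min / max core density
    real(dp), intent(in) :: bw_fair_share  ! per-core share at max density
    real(dp), intent(in) :: tol            ! relative increase treated as noise
    logical, intent(out) :: l3_bound, bw_bound, dram_bound

    ! L3 capacity: L3 miss rate increases with density.
    l3_bound = at_max%l3_miss_rate > (1.0_dp + tol)*at_min%l3_miss_rate

    ! Off-chip BW: usage at min density already exceeds the per-core share.
    bw_bound = at_min%offchip_bw > bw_fair_share

    ! DRAM contention: DRAM page miss rate increases with density.
    dram_bound = at_max%dram_page_miss_rate > &
                 (1.0_dp + tol)*at_min%dram_page_miss_rate
  end subroutine classify

end module bottleneck_sketch
```

A caller would fill two `counters` values from runs at minimum and maximum core density and pass a tolerance such as `0.10_dp`. The right tolerance depends on run-to-run noise, which the talk notes can be large.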

  20. Important HOMME Loop

     do k=1,nlev
        do j=1,nv
           do i=1,nv
              T(i,j,k,n0) = T(i,j,k,n0) + smooth*(T(i,j,k,nm1) &
                   - 2.0D0*T(i,j,k,n0) + T(i,j,k,np1))
              v(i,j,1,k,n0) = v(i,j,1,k,n0) + smooth*(v(i,j,1,k,nm1) &
                   - 2.0D0*v(i,j,1,k,n0) + v(i,j,1,k,np1))
              v(i,j,2,k,n0) = v(i,j,2,k,n0) + smooth*(v(i,j,2,k,nm1) &
                   - 2.0D0*v(i,j,2,k,n0) + v(i,j,2,k,np1))
              div(i,j,k,n0) = div(i,j,k,n0) + smooth*(div(i,j,k,nm1) &
                   - 2.0D0*div(i,j,k,n0) + div(i,j,k,np1))
           end do
        end do
     end do

  21. Apply Microfission (First Line)

  22. Loop Microfission
     - Local, context free optimization
     - Each array processed independently
       - Add high-level blocking to fit cache
     - Reduces total DRAM banks accessed
       - Statistically reduces DRAM page miss rate
     - Reduces instantaneous working set size
       - Helps with L3 capacity and off-chip BW
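
As an illustration of "each array processed independently", the sketch below splits the smoothing loop from slide 20 into one loop nest per array. It is a guess at the shape of the transformation, not the authors' code. The array extents, the three time levels, and the names `microfission_sketch`, `smooth_fused`, and `smooth_fissioned` are assumptions. The high-level blocking mentioned on the slide is omitted.

```fortran
! Sketch only: array shapes, the three time levels, and all names are
! assumptions for illustration, not taken from the HOMME source.
module microfission_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Fused form, as on slide 20: all arrays are updated in one loop body.
  subroutine smooth_fused(T, v, div, smooth, nv, nlev, nm1, n0, np1)
    integer, intent(in) :: nv, nlev, nm1, n0, np1
    real(dp), intent(in) :: smooth
    real(dp), intent(inout) :: T(nv,nv,nlev,3), v(nv,nv,2,nlev,3)
    real(dp), intent(inout) :: div(nv,nv,nlev,3)
    integer :: i, j, k
    do k = 1, nlev
      do j = 1, nv
        do i = 1, nv
          T(i,j,k,n0) = T(i,j,k,n0) + smooth*(T(i,j,k,nm1) &
               - 2.0D0*T(i,j,k,n0) + T(i,j,k,np1))
          v(i,j,1,k,n0) = v(i,j,1,k,n0) + smooth*(v(i,j,1,k,nm1) &
               - 2.0D0*v(i,j,1,k,n0) + v(i,j,1,k,np1))
          v(i,j,2,k,n0) = v(i,j,2,k,n0) + smooth*(v(i,j,2,k,nm1) &
               - 2.0D0*v(i,j,2,k,n0) + v(i,j,2,k,np1))
          div(i,j,k,n0) = div(i,j,k,n0) + smooth*(div(i,j,k,nm1) &
               - 2.0D0*div(i,j,k,n0) + div(i,j,k,np1))
        end do
      end do
    end do
  end subroutine smooth_fused

  ! Fissioned form: one loop nest per array, so each nest streams through
  ! a single array instead of touching all of them at once.
  subroutine smooth_fissioned(T, v, div, smooth, nv, nlev, nm1, n0, np1)
    integer, intent(in) :: nv, nlev, nm1, n0, np1
    real(dp), intent(in) :: smooth
    real(dp), intent(inout) :: T(nv,nv,nlev,3), v(nv,nv,2,nlev,3)
    real(dp), intent(inout) :: div(nv,nv,nlev,3)
    integer :: i, j, k, c
    do k = 1, nlev          ! T only
      do j = 1, nv
        do i = 1, nv
          T(i,j,k,n0) = T(i,j,k,n0) + smooth*(T(i,j,k,nm1) &
               - 2.0D0*T(i,j,k,n0) + T(i,j,k,np1))
        end do
      end do
    end do
    do k = 1, nlev          ! v only, both components
      do c = 1, 2
        do j = 1, nv
          do i = 1, nv
            v(i,j,c,k,n0) = v(i,j,c,k,n0) + smooth*(v(i,j,c,k,nm1) &
                 - 2.0D0*v(i,j,c,k,n0) + v(i,j,c,k,np1))
          end do
        end do
      end do
    end do
    do k = 1, nlev          ! div only
      do j = 1, nv
        do i = 1, nv
          div(i,j,k,n0) = div(i,j,k,n0) + smooth*(div(i,j,k,nm1) &
               - 2.0D0*div(i,j,k,n0) + div(i,j,k,np1))
        end do
      end do
    end do
  end subroutine smooth_fissioned

end module microfission_sketch
```

Each element update is independent of the others, so both routines should compute the same values. The difference is only in how many arrays are live in the caches and DRAM pages at any moment.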

  23. Microfission Results
     [Figure: bar chart of microfission results with two data series; labels and values not recoverable from the extraction.]

  24. Talk Outline
     - Introduction
     - Approach: An HPC Case Study
     - Multicore Measurement Issues
     - Optimization Example
     - Conclusion
