Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications



SLIDE 1

Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications

Jeff Diamond (1), Martin Burtscher (2), John D. McCalpin (3), Byoung-Do Kim (3), Stephen W. Keckler (1, 4), James C. Browne (1)

(1) University of Texas, (2) Texas State, (3) Texas Advanced Computing Center, (4) NVIDIA

SLIDE 2

Trends In Supercomputers

[Figure: chart for this slide; the axis labels, tick values, and legend were not recoverable from the extracted text.]

SLIDE 3

Is multicore an issue?

[Figure: chart for this slide; the axis labels, tick values, and legend were not recoverable from the extracted text.]

SLIDE 4

The Problem: Multicore Scalability

[Figure: performance charts for this slide; the panel titles, axis labels, and legends were not recoverable from the extracted text.]

SLIDE 5

The Problem: Multicore Scalability

[Figure: performance charts for this slide; the panel titles, axis labels, and legends were not recoverable from the extracted text.]

SLIDE 6

Optimizations Differ in Multicore

[Figure: chart for this slide; the axis labels, tick values, and legend were not recoverable from the extracted text.]

Base code vs. multicore-optimized code

SLIDE 7

Paper Contributions

• Studies multicore-related bottlenecks
• Identifies performance measurement challenges unique to multicore systems
• Presents a systematic approach to multicore performance analysis
• Demonstrates principles of optimization

SLIDE 8

Talk Outline

• Introduction
• Approach: An HPC Case Study
• Multicore Measurement Issues
• Optimization Example
• Conclusion

SLIDE 9

Approach: An HPC Case Study

• Examine a real HPC application
  ◦ Major functions add variety
• What is a typical HPC application?
  ◦ Many exhibit low arithmetic intensity
• Typical of explicit / iterative solvers, stencils
• Finite volume / elements / differences
• CFD, molecular dynamics, particle simulations, graph search, sparse MM, etc.

SLIDE 10

Approach: An HPC Case Study

• Application: HOMME
  ◦ High Order Method Modeling Environment
  ◦ 3-D atmospheric simulation from NCAR
  ◦ Required for NSF acceptance testing
  ◦ Excellent scaling, highly optimized
  ◦ Arithmetic intensity typical of stencil codes
• Supercomputers:
  ◦ Ranger: 62,976 cores, 579 teraflops
    ▪ 2.3 GHz quad-core AMD Barcelona chips
  ◦ Longhorn: 2,048 cores + 512 GPUs
    ▪ 2.5 GHz quad-core Intel Nehalem-EP chips
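The Ranger figures quoted on this slide can be cross-checked with a short calculation. The sketch below is illustrative only: the core count, clock rate, and 579-teraflop figure come from the slide, while the value of 4 double-precision flops per cycle per core is an assumption about the Barcelona core that the slide does not state.

```python
# Figures quoted on the slide for Ranger.
CORES = 62_976
CLOCK_HZ = 2.3e9
QUOTED_PEAK_TFLOPS = 579

# Assumption (not stated on the slide): each core retires up to
# 4 double-precision floating-point operations per cycle.
FLOPS_PER_CYCLE = 4


def peak_tflops(cores, clock_hz, flops_per_cycle):
    """Theoretical peak in teraflops: cores x clock x flops per cycle."""
    return cores * clock_hz * flops_per_cycle / 1e12


ranger_peak = peak_tflops(CORES, CLOCK_HZ, FLOPS_PER_CYCLE)
per_core_gflops = CLOCK_HZ * FLOPS_PER_CYCLE / 1e9

print(f"computed peak: {ranger_peak:.1f} TFLOPS (slide quotes {QUOTED_PEAK_TFLOPS})")
print(f"per-core peak: {per_core_gflops:.1f} GFLOPS")
```

Under that assumption the computed peak rounds to the quoted 579 teraflops, which suggests the slide's number is a theoretical peak rather than a measured rate.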

SLIDE 11

Talk Outline

• Introduction
• Approach: An HPC Case Study
• Multicore Measurement Issues
• Optimization Example
• Conclusion

SLIDE 12

Multicore Performance Bottlenecks

[Diagram: one chip with four L1 caches, four L2 caches, and one L3 cache, plus node-local DRAM. Labels: single chip; single DIMM; private L1/L2 cache; shared L3 cache; shared off-chip BW; shared DRAM page caches.]

SLIDE 13

Disturbances Persist Longer

[Figure: chart for this slide; the title, axis labels, and tick values were not recoverable from the extracted text.]

SLIDE 14

Measurement Implications

[Figure: chart for this slide; the title, axis labels, tick values, and annotations were not recoverable from the extracted text.]

SLIDE 15

Measurements Must Be Lightweight

Action                     Cycles
Read Counter                    9
Read Four Counters             30
Call Function                  40
PAPI READ                     400
System Call                 5,000
TLB Page Initialization    25,000

Duration of major HOMME functions

Function Duration         Calls Per Second    % Exec Time
2,000 cycles or less               100,000            20%
2,000 to 10,000 cycles              20,000            10%
10K to 200K cycles                   1,600            15%
200K to 1M cycles                      200            15%
1M to 10M cycles         [not recoverable]             0%
10M or more cycles                       4            35%
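The two tables together show why measurement cost matters for short functions. The sketch below is a rough illustration, not the authors' methodology: the cycle costs come from the first table, while the assumption that each measured call needs two reads (one at entry, one at exit) and the function names are hypothetical.

```python
# Cycle costs quoted in the first table on the slide.
MEASUREMENT_COST_CYCLES = {
    "read one counter": 9,
    "read four counters": 30,
    "PAPI read": 400,
    "system call": 5_000,
}


def overhead_fraction(function_cycles, cost_cycles, reads_per_call=2):
    """Added measurement cycles divided by the function's own cycles.

    Assumption: the function is bracketed by one read at entry and one
    at exit, hence reads_per_call=2.
    """
    return reads_per_call * cost_cycles / function_cycles


def overhead_report(function_cycles):
    """Overhead fraction for every measurement method in the table."""
    return {
        method: overhead_fraction(function_cycles, cost)
        for method, cost in MEASUREMENT_COST_CYCLES.items()
    }


# A function at the top of the shortest duration bucket (2,000 cycles).
short_function = overhead_report(2_000)
for method, fraction in short_function.items():
    print(f"{method:>20}: {fraction:.1%} overhead")
```

For a 2,000-cycle function, bracketing with PAPI reads adds 800 cycles, a 40% overhead, while two direct counter reads add under 1%. Functions of that length account for 20% of execution time in the second table, so they cannot simply be left unmeasured.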

SLIDE 16

Multicore Measurement Issues

• Performance issues in shared memory system
  ◦ Context sensitive
  ◦ Nondeterministic
  ◦ Highly non-local
• Measurement disturbance is significant
  ◦ Accessing memory or delaying core
  ◦ Hard to “bracket” measurement effects
  ◦ Disturbances can last billions of cycles
  ◦ Bottlenecks can be “bursty”
• Conclusion: need multiple tools

SLIDE 17

Talk Outline

• Introduction
• Approach: An HPC Case Study
• Multicore Measurement Issues
• Optimization Example
• Conclusion

SLIDE 18

Multicore Performance Bottlenecks

[Diagram: one chip with four L1 caches, four L2 caches, and one L3 cache, plus node-local DRAM. Labels: single chip; single DIMM; shared L3 cache; shared off-chip BW; shared DRAM page caches.]

SLIDE 19

Measurement Approach

• Find important functions
• Compare performance counters at minimum and maximum core density
• Identify the key multicore bottleneck:
  ◦ L3 capacity: L3 miss rates increase with density
  ◦ Off-chip BW: BW usage at minimum density is greater than the per-core share
  ◦ DRAM contention: DRAM page miss rates increase with density
• For small and medium functions, follow up with lightweight / temporal measurements
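The three bottleneck tests on this slide can be written as a small decision procedure. This is a minimal sketch under stated assumptions: the `CounterSample` record, the `suspected_bottlenecks` function, the 10% tolerance, and all sample numbers are hypothetical, and "share" is interpreted as the chip's off-chip bandwidth divided evenly among the cores active at maximum density.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Counter summary for one function at one core density (hypothetical layout)."""
    cores_per_chip: int          # active cores per chip during the run
    l3_miss_rate: float          # L3 misses / L3 accesses
    dram_page_miss_rate: float   # DRAM page misses / DRAM accesses
    bw_per_core_gbs: float       # off-chip bandwidth used per core, GB/s


def suspected_bottlenecks(min_density, max_density, chip_bw_gbs, tolerance=0.10):
    """Apply the slide's three comparisons; tolerance is an assumed noise margin."""
    findings = []
    # L3 capacity: miss rate rises as more cores share the L3.
    if max_density.l3_miss_rate > min_density.l3_miss_rate * (1 + tolerance):
        findings.append("L3 capacity")
    # Off-chip BW: one core alone already uses more than an even share.
    fair_share = chip_bw_gbs / max_density.cores_per_chip
    if min_density.bw_per_core_gbs > fair_share:
        findings.append("off-chip bandwidth")
    # DRAM contention: page miss rate rises with density.
    if max_density.dram_page_miss_rate > min_density.dram_page_miss_rate * (1 + tolerance):
        findings.append("DRAM contention")
    return findings


# Made-up numbers for illustration only.
one_core = CounterSample(1, l3_miss_rate=0.20, dram_page_miss_rate=0.30, bw_per_core_gbs=4.0)
four_cores = CounterSample(4, l3_miss_rate=0.35, dram_page_miss_rate=0.31, bw_per_core_gbs=2.4)
result = suspected_bottlenecks(one_core, four_cores, chip_bw_gbs=10.0)
print(result)
```

With these made-up samples the function flags L3 capacity and off-chip bandwidth but not DRAM contention, because the page miss rate rises by less than the assumed tolerance.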

SLIDE 20

Important HOMME Loop

    do k=1,nlev
      do j=1,nv
        do i=1,nv
          T(i,j,k,n0) = T(i,j,k,n0) + smooth*(T(i,j,k,nm1) &
                        - 2.0D0*T(i,j,k,n0) + T(i,j,k,np1))
          v(i,j,1,k,n0) = v(i,j,1,k,n0) + smooth*(v(i,j,1,k,nm1) &
                          - 2.0D0*v(i,j,1,k,n0) + v(i,j,1,k,np1))
          v(i,j,2,k,n0) = v(i,j,2,k,n0) + smooth*(v(i,j,2,k,nm1) &
                          - 2.0D0*v(i,j,2,k,n0) + v(i,j,2,k,np1))
          div(i,j,k,n0) = div(i,j,k,n0) + smooth*(div(i,j,k,nm1) &
                          - 2.0D0*div(i,j,k,n0) + div(i,j,k,np1))
        end do
      end do
    end do
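The loop above is a concrete case of the low arithmetic intensity mentioned earlier in the talk. The following sketch estimates flops per byte for one of its four updates. The counts are assumptions made for illustration: 8-byte double-precision values, no reuse of data already in cache, and no write-allocate traffic.

```python
# One update has the form: x0 = x0 + smooth * (xm1 - 2.0 * x0 + xp1)
# Flops: multiply by 2.0, subtract, add, multiply by smooth, add.
FLOPS_PER_UPDATE = 5

# Memory traffic per update: read xm1, x0, xp1 and write x0.
LOADS_PER_UPDATE = 3
STORES_PER_UPDATE = 1

# Assumption: double-precision values (suggested by the 2.0D0 literal).
BYTES_PER_VALUE = 8


def arithmetic_intensity(flops, loads, stores, bytes_per_value):
    """Floating-point operations per byte moved to or from memory."""
    return flops / ((loads + stores) * bytes_per_value)


intensity = arithmetic_intensity(
    FLOPS_PER_UPDATE, LOADS_PER_UPDATE, STORES_PER_UPDATE, BYTES_PER_VALUE
)
print(f"{intensity:.3f} flops per byte")
```

All four updates in the loop body have the same form, so the loop as a whole has the same intensity under these assumptions. At well under one flop per byte, its speed is likely set by the memory system rather than by the arithmetic units.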

SLIDE 21

Apply Microfission (First Line)

SLIDE 22

Loop Microfission

• Local, context-free optimization
• Each array processed independently
  ◦ Add high-level blocking to fit cache
• Reduces total DRAM banks accessed
  ◦ Statistically reduces DRAM page miss rate
• Reduces instantaneous working set size
  ◦ Helps with L3 capacity and off-chip BW
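The restructuring described on this slide can be sketched as follows. This is not the authors' Fortran implementation. It is a small Python illustration of the loop structure only, with hypothetical function names, flat lists standing in for the HOMME arrays, and an arbitrary block size. Python lists do not reproduce cache or DRAM behavior, so the sketch shows only that the transformation preserves results.

```python
def make_fields(n, names=("T", "v1", "v2", "div")):
    """Build stand-in fields, each with three time levels stored as flat lists."""
    return [
        {
            "name": name,
            "nm1": [float(i + k) for i in range(n)],
            "n0": [float(2 * i - k) for i in range(n)],
            "np1": [float(i * i % 7) for i in range(n)],
        }
        for k, name in enumerate(names)
    ]


def smooth_fused(fields, smooth):
    """Baseline: every field is updated at each index, so all arrays are live at once."""
    n = len(fields[0]["n0"])
    for idx in range(n):
        for f in fields:
            f["n0"][idx] += smooth * (f["nm1"][idx] - 2.0 * f["n0"][idx] + f["np1"][idx])


def smooth_microfissioned(fields, smooth, block=256):
    """Fissioned: within each block, one field is processed completely before the next."""
    n = len(fields[0]["n0"])
    for start in range(0, n, block):          # high-level blocking
        stop = min(start + block, n)
        for f in fields:                      # one array family at a time
            n0, nm1, np1 = f["n0"], f["nm1"], f["np1"]
            for idx in range(start, stop):
                n0[idx] += smooth * (nm1[idx] - 2.0 * n0[idx] + np1[idx])


fused = make_fields(1000)
fissioned = make_fields(1000)
smooth_fused(fused, 0.05)
smooth_microfissioned(fissioned, 0.05, block=128)
same = all(a["n0"] == b["n0"] for a, b in zip(fused, fissioned))
print("identical results:", same)
```

Each element still receives the same arithmetic in the same order, so the two versions agree exactly. The difference is that the inner loop of the fissioned version streams through three arrays instead of twelve.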

SLIDE 23

Microfission Results

[Figure: chart for this slide; the axis labels, tick values, and legend were not recoverable from the extracted text.]

SLIDE 24

Talk Outline

• Introduction
• Approach: An HPC Case Study
• Multicore Measurement Issues
• Optimization Example
• Conclusion

SLIDE 25

Summary and Conclusions

• HPC scalability must include multicore
  ◦ Not well understood
  ◦ Requires new analysis and measurement techniques
  ◦ Optimizations differ from single-core
• Microfission is just one example
  ◦ Multicore locality optimization for shared caches
  ◦ Improves performance by 35%

SLIDE 26

Future Work

• Expect multicore observations to apply to other HPC applications with low arithmetic intensity
  ◦ Irregular parallel applications: adaptive meshes, heterogeneous workloads
  ◦ Irregular blocking applications: graph traversal
• Wider range of multicore (memory-focused) optimizations
  ◦ Recomputation
  ◦ Relocating data
  ◦ Temporary storage reduction
  ◦ Structural changes

SLIDE 27

Thank You

• Any questions?