An Empirical Performance Study of Chapel Programming Language (PowerPoint PPT Presentation)



SLIDE 1

An Empirical Performance Study of Chapel Programming Language

Nan Dun✝ and Kenjiro Taura

The University of Tokyo

✝dun@logos.ic.i.u-tokyo.ac.jp

Monday, May 21, 12

SLIDE 2

Background

Modern parallel machines: massive parallelism (100K+ cores); heterogeneous architectures (CPUs + GPGPUs)
Modern parallel programming languages: designed for programmability, portability, robustness, and performance, e.g. Chapel, X10, and Fortress


SLIDE 3

Motivation

Programmability has been well illustrated: abstraction of parallelism
Performance is yet unknown: performance implications, performance tuning, language improvements

[Figure: "My First FMM Program in Chapel", relative elapsed time of Chapel vs. C]

The performance should not surprise newbies...


SLIDE 4

Agenda

Short overview of Chapel
Approach
Evaluation: microbenchmark results; suggestions for writing efficient Chapel programs; N-body FMM results
Conclusions


SLIDE 5

The Chapel Language

Developed by Cray Inc., initiated under the HPCS program in 2003
Designed to improve programmability: global-view model vs. fragmented model; abstraction of parallelism (task parallelism, data parallelism, etc.); object-oriented and generic programming
For more details: http://chapel.cray.com


SLIDE 6

Evaluation Approach

Chapel benchmarks (data structures, language features, etc.) are compiled to intermediate C code and then to an executable; an equivalent hand-written C implementation is compiled to its own executable. Performance results and generated assembly code are then compared between the two paths.


SLIDE 7

Environment

Xeon 2.33 GHz 8-core CPU, 32 GB memory
Linux 2.6.26, GCC 4.6.2, Chapel 1.4.0
Compile options:

$ chpl -o prog --fast prog.chpl   // Chapel
$ gcc -o prog -O3 -lm prog.c      // C

Use "--savec" to keep the intermediate C code; set CHPL_COMM=none for a single locale (the malloc series is used).

Synthesized benchmarks from N-Body simulations


SLIDE 8

Primitive Types (1/3)

[Figure: relative performance (vs. Cref) for add, sub, mul, div: int(32) vs. C int32, int(64) vs. C int64, real(32) vs. C float, real(64) vs. C double]

var res: int(32);
for i in 1..N do
  res = res + i;

// Generated C:
while (...) {
  T1 = ((_real32)(i));
  T2 = (resReal32 + T1);
  resReal32 = T2;
  i = ...;
}

# Corresponding assembly:
.L1046:
        cvtsi2ss %eax, %xmm0
        addl     $1, %eax
        cmpl     %eax, %r12d
        addss    %xmm2, %xmm0
        movaps   %xmm0, %xmm2
        jge      .L1046

The redundant instruction (movaps) can be removed by combining the T2 assignments.


SLIDE 9

Primitive Types (2/3)

[Figure: relative performance (vs. Cref) for add, sub, mul, div: int vs. C int, real vs. C double]

while (T80) {
  _ret42 = arrInt;
  _ret43 = (_ret42->origin);
  _ret_10 = (&(_ret42->blk));
  _ret_x110 = (*_ret_10)[0];
  T82 = (i5 * _ret_x110);
  T83 = (_ret43 + T82);
  _ret44 = (_ret42->factoredOffs);
  T84 = (T83 - _ret44);
  T85 = (_ret42->data);
  T86 = (&((T85)->_data[T84]));
  _ret45 = *(T86);
  T87 = (resInt / _ret45);
  resInt = T87;
  T88 = (i5 + 1);
  i5 = T88;
  T89 = (T88 != end5);
  T80 = T89;
}

$ gcc ... -ftree-vectorize -ftree-vectorizer-verbose=5

var arr: [1..N] int;  // int and real
for d in arr.domain do
  res = res + arr(d);  // read only


SLIDE 10

Primitive Types (3/3)

[Figure: relative performance (vs. Cref) for asg, add, sub, mul, div: int vs. C int, real vs. C double]

var arr: [1..N] int;  // int and real
for d in arr.domain do
  arr(d) = arr(d) + d;  // read + write

# Assembly of Chapel C mapping:
.L1046:
        cvtsi2sd %edx, %xmm1
        addl     $1, %edx
        movsd    (%rax), %xmm0
        divsd    %xmm1, %xmm0
        movsd    %xmm0, (%rax)
        addq     %rcx, %rax
        cmpl     %edx, %r12d
        jne      .L1046

# Assembly of hand-written C:
.L32:
        leal     (%rsi,%rax), %ecx
        movsd    (%rdx,%rax,8), %xmm0
        cvtsi2sd %ecx, %xmm1
        divsd    %xmm1, %xmm0
        movsd    %xmm0, (%rdx,%rax,8)
        addq     $1, %rax
        cmpq     %rdi, %rax
        jne      .L32

LEA instruction is executed by a separate addressing unit


SLIDE 11

Structured Types (1/3)

Tuple:
var Tuple: (real, real, real);
var 2D_Tuple: (Tuple, Tuple, Tuple);

Record:
record Record { var x, y, z: real; }
record 2D_Record { var x, y, z: Record; }

C mapping of Tuple:
double Tuple[3];
double 2D_Tuple[3][3];

C mapping of Record:
struct Record { double x, y, z; };
struct 2D_Record { struct Record x, y, z; };


SLIDE 12

Structured Types (2/3)

[Figure: relative performance (vs. Cref) for asg, add, sub, mul, div: tuple and tuple+ vs. C array, record and record+ vs. C struct, 2D-tuple and 2D-tuple+ vs. C 2D-array, 2D-record and 2D-record+ vs. C 2D-struct]


Walk through the array and manipulate each element


SLIDE 13

Structured Types (3/3)

Redundant address substitution in 2D-Tuple: 197 assembly instructions vs. 33 for Cref
Complex for GCC to optimize: data references, redundant operations
May be related to the construction of heterogeneous tuples

while (...) {
  _tmp_37 = (&(_ret57[0]));
  _tmp_x139 = (*_tmp_37)[0];
  _tmp_x239 = (*_tmp_37)[1];
  _tmp_x339 = (*_tmp_37)[2];
  ...
  chpl__tupleRestHelper(...)
  ...
  T297[0] = _tmp_x139;
  T297[1] = _tmp_x239;
  T297[0] = _tmp_x339;
  ...
}


SLIDE 14

Iterators for Loops (1/2)

iter myIter(min: int, max: int, step: int = 1) {
  while min <= max {
    yield min;
    min += step;
  }
}

// Nested loops
var dom = [1..N];  // or 1..N
for i in 1..M do
  for j in [1..N] do ...;        // domain
  for j in 1..N do ...;          // range
  for j in dom do ...;           // pre-defined domain
  for j in myIter(1, N) do ...;  // iterator


SLIDE 15

Iterators for Loops (2/2)

[Figure: elapsed time (usec, log scale) with inner loop = 1 and inner loop = 100, comparing domain [1..N], range 1..N, pre-defined domain, and user-defined iterator; annotated slowdowns: 890x, 42x, 6x, 1.2x]

// Domain
chpl__buildDomainExpr(...);
while (loop_variable) { ... }
chpl__autoDestroy(...);

// Range
_build_range(...);
while (loop_variable) { ... }

// Pre-defined domain
_ret10 = dom;
...
_ret12 = (T45._low);
_ret13 = (T45._high);
...
while (loop_variable) { ... }

// User-defined iterator
while (loop_variable) { ... }


SLIDE 16

Domain and Array

[Figure: relative performance and elapsed time (vs. Cref, log scale) for alloc, add, sub, mul, div: 1D, 2D, and 3D rectangular vs. associative domains]

var rctDom3D: domain(3) = [1..N, 1..N, 1..N];  // rectangular domain
var rctArr3D: [rctDom3D] real;
var irrDom3D: domain(3*int);                   // irregular (associative) domain
var irrArr3D: [irrDom3D] real;

Domain, i.e. the index set; array, i.e. the space allocation


SLIDE 17

Domain Maps (1/2)


var space = [1..N, 1..N];
var blockSpace = space dmapped Block(space);
var arrBlock: [blockSpace] real;
var cyclicSpace = space dmapped Cyclic(space);
var arrCyclic: [cyclicSpace] real;
var blkCycSpace = space dmapped BlockCyclic(space);
var arrBlkCyc: [blkCycSpace] real;
var replicatedSpace = space dmapped ReplicatedDist();
var arrRep: [replicatedSpace] real;

for d in arr.domain do
  on Locales(here.id) do
    /* arithmetic on arr(d) */

[Figure: Block and Cyclic distributions of array elements across locales L0 through L7]


SLIDE 18

Domain Maps (2/2)


[Figure: elapsed time (usec, log scale) for asg (wo), add, sub, mul, div on a single locale and on two locales, comparing Block, Cyclic, BlockCyclic, and Replicated distributions, read-only (ro) vs. read-write (rw)]

Only ~300 Kbps achieved, far below the 434 Mbps measured by iperf


SLIDE 19

Speeding Up the FMM Application

Manipulate a large array of structured elements: use record instead of tuple; optimize small inner loops
Auxiliary data structures: use a rectangular domain instead of an associative domain
Reduce locks to improve scalability (at the cost of extra computation in some cases)


SLIDE 20

Molecular Dynamics (1/2)

Fast Multipole Method (FMM): calculates the N-body interactions in O(N) time

[Figure: relative performance (vs. serial Cref) of the serial and parallel Chapel versions across phases: Total, Init, LeapFrog(1), BuildNebrList, EvalForces, Multipole, CalcWallForces, ApplyThermo, LeapFrog(2)]


SLIDE 21

Molecular Dynamics (2/2)

[Figure: speedup vs. number of threads (1 to 8) for problem sizes N = 8^3, 16^3, 32^3, and 64^3]


SLIDE 22

Related Work

Evaluations of the Chapel language:
Programmability [Chamberlain et al. '06, '07, '08, '11]
Performance potential [Barrett et al. '08]
HPCC benchmarks [Chamberlain et al. '11]: 95% for EP STREAM and 50% for RandomAccess
Task-parallel features [Weiland et al. '09]
Chapel on GPGPUs [Ren et al. '11]


SLIDE 23

Conclusions

Chapel can achieve performance comparable to C: 70%+ on a single locale (with the current v1.4.0)
Users should be aware of the performance implications: choose proper data structures; structure programs properly
Current performance penalties are FIXABLE by improving the Chapel compiler


SLIDE 24

Questions?
