Steve Deitz, Cray Inc.

- A new parallel language
  - Under development at Cray Inc.
  - Supported through the DARPA HPCS program
- Goals
  - Improve programmer productivity
  - Improve the programmability of parallel computers
  - Match or improve performance of MPI/UPC/CAF
  - Provide better portability than MPI/UPC/CAF
  - Improve robustness of parallel codes
  - Support multi-core and multi-node systems

- What is Chapel?
- Chapel’s Parallel Programming Model
- HPCC STREAM Triad in Chapel
- HPCC RA in Chapel
- Summary and Future Work

HPCC STREAM and RA in Chapel

- Programming model: the mental model of a programmer
- Fragmented models: programmers take the point of view of a single processor/thread
- SPMD models (Single Program, Multiple Data): fragmented models with multiple copies of one program
- Global-view models: programmers write code to describe the computation as a whole

Chapel: Background

Element          1      2      3      4      5      6
Initial state    2.0    0.0    0.0    0.0    0.0    12.0
Iteration 1      2.0    1.0    0.0    0.0    6.0    12.0
Iteration 2      2.0    1.0    0.5    3.0    6.0    12.0
Iteration 3      2.0    1.25   2.0    3.25   7.5    12.0
...
Steady state     2.0    4.0    6.0    8.0    10.0   12.0
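The iteration table can be reproduced with a short Python sketch. It assumes the update rule used in the code slide that follows, B(i) = (A(i-1)+A(i+1))/2, applied to every interior element while the two end values stay fixed. The function names `stencil_step` and `iterate` are illustrative and not taken from the slides.

```python
def stencil_step(A):
    """Return a new list where each interior element is the average of
    its two neighbors in A; the first and last elements are unchanged."""
    B = list(A)
    for i in range(1, len(A) - 1):
        B[i] = (A[i - 1] + A[i + 1]) / 2
    return B


def iterate(A, steps):
    """Apply stencil_step the given number of times."""
    for _ in range(steps):
        A = stencil_step(A)
    return A


initial = [2.0, 0.0, 0.0, 0.0, 0.0, 12.0]
state = initial
for it in range(1, 4):
    state = stencil_step(state)
    print(f"Iteration {it}: {state}")
```

Running further iterations moves the values toward the straight line between the two fixed end values, which is the steady state listed in the table.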

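Global-View vs. Fragmented Computation

[Figure: two panels, Global-View and Fragmented. The Global-View panel shows the averaging computation ( + )/2 = once over the whole array; the Fragmented panel shows the same computation split into separate pieces, each with its own ( + )/2 =.]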

Global-View vs. Fragmented Code

Global-View:

    def main() {
      var n = 1000;
      var A, B: [1..n] real;

      forall i in 2..n-1 do
        B(i) = (A(i-1)+A(i+1))/2;
    }

Fragmented (assumes p divides n):

    def main() {
      var n = 1000;
      var me = commID(), p = commProcs(),
          myN = n/p, myLo = 1, myHi = myN;
      var A, B: [0..myN+1] real;

      if me < p {
        send(me+1, A(myN));
        recv(me+1, A(myN+1));
      } else myHi = myN-1;
      if me > 1 {
        send(me-1, A(1));
        recv(me-1, A(0));
      } else myLo = 2;
      for i in myLo..myHi do
        B(i) = (A(i-1)+A(i+1))/2;
    }
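To make the contrast concrete, here is a minimal Python sketch that runs both formulations on one machine and checks that they agree. It is a simulation under stated assumptions: the p "processors" are simulated by a loop, and the send/recv pairs from the slide are replaced by direct reads of the neighbor's boundary value. The names `global_view_step` and `fragmented_step` are hypothetical.

```python
def global_view_step(A):
    """One stencil step written against the whole array (ends stay fixed)."""
    B = list(A)
    for i in range(1, len(A) - 1):
        B[i] = (A[i - 1] + A[i + 1]) / 2
    return B


def fragmented_step(A, p):
    """The same step computed block by block, as p simulated processors."""
    n = len(A)
    assert n % p == 0, "this sketch assumes p divides n"
    my_n = n // p
    B = []
    for me in range(p):
        # Local block with one ghost cell on each side (indices 0..my_n+1).
        local = [0.0] + list(A[me * my_n:(me + 1) * my_n]) + [0.0]
        my_lo, my_hi = 1, my_n
        if me < p - 1:
            # Stand-in for recv from the right neighbor.
            local[my_n + 1] = A[(me + 1) * my_n]
        else:
            my_hi = my_n - 1  # last block: global right end stays fixed
        if me > 0:
            # Stand-in for recv from the left neighbor.
            local[0] = A[me * my_n - 1]
        else:
            my_lo = 2  # first block: global left end stays fixed
        local_B = list(local)
        for i in range(my_lo, my_hi + 1):
            local_B[i] = (local[i - 1] + local[i + 1]) / 2
        B.extend(local_B[1:my_n + 1])  # drop the ghost cells
    return B


A = [2.0, 0.0, 0.0, 0.0, 0.0, 12.0]
print(global_view_step(A))
print(fragmented_step(A, 3))
```

The fragmented version needs the ghost cells, the per-block bounds, and the boundary special cases, none of which appear in the global-view version. That bookkeeping is the cost the slide is pointing at.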

[Slide: Fortran + Co-Array Fortran (CAF) source shown in small type across several columns. The listing includes 'cafnpb.h' and 'globals.h' and contains the subroutines rprj3, comm3, comm1p, give3, and take3. The multi-column listing did not survive extraction.]

The rprj3 routine in Chapel, from the same slide:

    def rprj3(S, R) {
      const Stencil = [-1..1, -1..1, -1..1],
            W: [0..3] real = (0.5, 0.25, 0.125, 0.0625),
            W3D = [(i,j,k) in Stencil] W((i!=0)+(j!=0)+(k!=0));

      forall inds in S.domain do
        S(inds) = + reduce [offset in Stencil]
                    (W3D(offset) * R(inds + offset*R.stride));
    }
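As a rough illustration of what the Chapel rprj3 computes, the following Python sketch applies the same 27-point weighted stencil, with the weight chosen by how many of the three offsets are nonzero. It makes simplifying assumptions that are not in the slides: the fine grid R is a dense n x n x n nested list, coarse point (a, b, c) is centered on fine point (2a+1, 2b+1, 2c+1), and boundary handling and inter-processor communication are ignored.

```python
from itertools import product

# Weights indexed by the number of nonzero offset components (0..3).
W = (0.5, 0.25, 0.125, 0.0625)
STENCIL = list(product((-1, 0, 1), repeat=3))
W3D = {off: W[sum(c != 0 for c in off)] for off in STENCIL}


def rprj3(R):
    """Restrict a dense n*n*n fine grid R to a coarse grid S.

    Assumption for this sketch: coarse index a maps to fine index 2*a + 1,
    so every stencil neighbor stays inside R.
    """
    n = len(R)
    m = (n - 1) // 2
    S = [[[0.0] * m for _ in range(m)] for _ in range(m)]
    for a, b, c in product(range(m), repeat=3):
        i, j, k = 2 * a + 1, 2 * b + 1, 2 * c + 1
        S[a][b][c] = sum(w * R[i + di][j + dj][k + dk]
                         for (di, dj, dk), w in W3D.items())
    return S


n = 5
R = [[[1.0] * n for _ in range(n)] for _ in range(n)]
S = rprj3(R)
print(S[0][0][0])
```

The 27 weights sum to 0.5 + 6(0.25) + 12(0.125) + 8(0.0625) = 4.0, so a constant fine grid of 1.0 restricts to 4.0 at every coarse point.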

- What is Chapel?
- Chapel’s Parallel Programming Model
- HPCC STREAM Triad in Chapel
- HPCC RA in Chapel
- Summary and Future Work

Given: m-element vectors A, B, C

Compute: forall i in 1..m do A(i) = B(i) + α * C(i);

[Figure: A = B + α * C drawn first as a single whole-vector operation, then as four independent chunks, each with its own + * =.]
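The triad and its chunked decomposition can be sketched in Python as follows. This is an illustration only, not a benchmark implementation: the chunks are processed one after another in a loop rather than in parallel, and the names `triad` and `triad_chunked` as well as the default of four chunks (matching the figure) are assumptions.

```python
def triad(B, C, alpha):
    """Whole-vector form: A(i) = B(i) + alpha * C(i) for every i."""
    return [b + alpha * c for b, c in zip(B, C)]


def triad_chunked(B, C, alpha, chunks=4):
    """Same result, computed one contiguous chunk of indices at a time.

    Each chunk touches a disjoint index range, so the chunks could be
    handed to separate tasks without any coordination between them.
    """
    m = len(B)
    A = [0.0] * m
    for t in range(chunks):
        lo = t * m // chunks
        hi = (t + 1) * m // chunks
        for i in range(lo, hi):
            A[i] = B[i] + alpha * C[i]
    return A


B = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
C = [0.5] * 8
alpha = 3.0
print(triad(B, C, alpha))
print(triad_chunked(B, C, alpha))
```

Because no element of A depends on any other, the chunk boundaries can be placed anywhere, which is what makes the triad a pure test of memory bandwidth rather than of communication.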
