 
An embedded C++ domain-specific language for stream parallelism
Authors: Dalvan Griebler (1, 2), Marco Danelutto (2), Massimo Torquati (2), Luiz Gustavo Fernandes (1)
E-mail: dalvan.griebler@acad.pucrs.br
1 Faculdade de Informática, Computer Science Graduate Program, Parallel Application Modeling Group (GMAP), PUCRS, Porto Alegre, Brazil
2 Computer Science Department, University of Pisa, Italy
International Conference on Parallel Computing (ParCo 2015), September 2015
Outline
1. Introduction
2. DSL Interface
3. Implementation
4. Results
5. Related Work
6. Conclusions
7. References
Introduction
Stream Processing [1]
- Application domains: data, image, and network processing.
- Behavior pattern: process an input stream and produce an output stream.
- Requirements: high throughput and low latency.
Our Domain (Stream Parallelism)
- Problem: developers who are not familiar with parallelism must rewrite sequential code and perform low-level optimizations.
- Goal: solve this problem, offering flexibility for fast code prototyping together with efficient parallel code generation.
- Solution: the standard C++ annotation (attribute) mechanism, processed with the GCC plugin technique to generate parallel code for the FastFlow runtime.
Motivation
- DSL-POPP: provides suitable building blocks for the Master/Slave and Pipeline patterns; the DSL consists of intrusive annotations processed by a dedicated compiler [2, 3].
- REPARA: intermediate attribute annotations that are not exposed to end users and are inserted by appropriate tools [4].
Contributions
- A domain-specific language for expressing stateless stream parallelism.
- An annotation-based programming interface that preserves the semantics of DSL-POPP's principles and adopts REPARA's C++ attribute style, avoiding significant rewriting of sequential programs.
Introducing C++ Attributes with the GCC Plugin Technique
Standard C++ Attributes
- Origin: from GNU C attributes (__attribute__((<name>))) to the standard C++ attribute syntax ([[attr-list]]) [5, 6].
- Advantage: attributes can be declared almost anywhere in a program (e.g., on types, classes, and code blocks), and the compiler is able to fully recognize them.
GCC Plugins
- Why use them?
- What are the drawbacks?
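To make the two syntaxes concrete, the sketch below applies the same GNU attribute, packed, once in the GNU C form and once in the standard C++11 form through the gnu:: attribute namespace. It assumes GCC or Clang (both accept the gnu:: scoped attributes) and a 4-byte int; the type and function names are illustrative only and unrelated to the DSL.

```cpp
#include <cassert>
#include <cstddef>

// GNU C attribute syntax: __attribute__((<name>))
struct __attribute__((packed)) GnuPacked {
    char tag;
    int value;
};

// Standard C++11 attribute syntax [[attr-list]], using the gnu:: attribute namespace
struct [[gnu::packed]] StdPacked {
    char tag;
    int value;
};

// Without the attribute the compiler normally pads `value` to its alignment boundary.
struct Padded {
    char tag;
    int value;
};

// Attributes also appertain to functions: [[gnu::unused]] silences the
// "defined but not used" warning for this static helper.
[[gnu::unused]] static std::size_t packed_size() { return sizeof(StdPacked); }
```

With 4-byte int, both packed structs occupy 5 bytes, while Padded is usually 8; the compiler recognizes the standard-syntax attribute exactly as it does the GNU form.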
Attributes for Stream Parallelism
- ToStream(): annotates a loop or block of code to implement a stream parallelism region.
- Stage(): annotates a block of code that computes the stream inside a ToStream region.
- Input(): indicates the input streams of the ID attributes (ToStream and Stage).
- Output(): indicates the output streams of the ID attributes (ToStream and Stage).
- Replicate(): auxiliary attribute that indicates the replication of a stage.
Table: The generalized C++ attributes for the DSL.
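The example slides name each annotated program with S (a sequential stage) and R(S) (a replicated stage). As an illustration of how the attributes map to that notation, the hypothetical helper pattern_of below takes the Stage attributes of one ToStream region in declaration order (true when Replicate is attached) and treats the ToStream code outside any Stage as the first S. This reading is inferred from the examples, not taken from the DSL implementation, and it covers only linear pipelines, not the feedback form S ↔ R(S).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical: one entry per [[Stage(...)]] block of a ToStream region,
// true when the Stage also carries Replicate(...).
std::string pattern_of(const std::vector<bool>& stage_replicated) {
    std::string out = "S";  // ToStream code outside any Stage: first sequential stage
    for (bool replicated : stage_replicated)
        out += replicated ? "->R(S)" : "->S";  // each Stage becomes S or R(S)
    return out;
}
```

For example, the first Sobel version (a replicated Stage followed by a plain Stage) yields "S->R(S)->S".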
How to use
[Figure: how to use the attributes]
Design Implementation
- High-level abstractions.
- Algorithmic skeleton flexibility [7].
- Interaction mode:
  - Computation activities: identified by the ToStream and Stage ID attributes.
  - Spatial constraints: identified by the Input and Output attributes.
  - Temporal constraints: defined by the order of the declarations and by their spatial constraints.
  - Interaction: based on the stream dependencies specified by the user, implemented with FastFlow's lock-free queues.
- Persistent nesting when the Replicate attribute is added (R replicates only the sequential code S inside a Stage attribute).
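The interaction between stages relies on lock-free queues. As a generic illustration of the idea (not FastFlow's implementation or API), here is a minimal bounded single-producer/single-consumer ring buffer: only the producer writes tail_ and only the consumer writes head_, so acquire/release atomics are enough and no lock is needed. SpscQueue and its methods are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bounded SPSC queue: exactly one thread calls push(), one calls pop().
template <typename T>
class SpscQueue {
public:
    explicit SpscQueue(std::size_t capacity)
        : buf_(capacity + 1), head_(0), tail_(0) {  // one slot stays empty to tell full from empty
        assert(capacity > 0);
    }

    // Producer side: returns false when the queue is full.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % buf_.size();
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        buf_[tail] = value;
        tail_.store(next, std::memory_order_release);  // publish the element to the consumer
        return true;
    }

    // Consumer side: returns false when the queue is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[head];
        head_.store((head + 1) % buf_.size(), std::memory_order_release);  // hand the slot back
        return true;
    }

private:
    std::vector<T> buf_;
    std::atomic<std::size_t> head_;  // next slot to read, written only by the consumer
    std::atomic<std::size_t> tail_;  // next slot to write, written only by the producer
};
```

In a pipeline, one such queue would connect each pair of adjacent stages, with the upstream stage as the single producer and the downstream stage as the single consumer.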
Transformation Rules
[Figure: transformation rules]
Methodology
Experiments
- Sobel application.
- Prime number application.
Evaluation
- When possible, an equivalent version using OpenMP pragmas is also tested.
- The best execution time is reported for each test.
- Each value is the average over 40 runs.
Environment
- Dual-socket NUMA Intel multi-core Xeon E5-2695 (Ivy Bridge micro-architecture) running at 2.40 GHz, with 24 cores and 2-way Hyper-Threading.
- Each core has private 32 KB L1 and 256 KB L2 caches; the 30 MB L3 cache is shared.
- Linux 2.6.32 x86_64 (CentOS 6.5) and GNU GCC 4.9.2 with the -O3 flag.
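One plausible reading of this protocol is: run each configuration (for example, each replication degree) 40 times, average those runs, and report the lowest average as the best execution time of the test. The sketch below encodes that reading; time_once, mean, and best_mean are hypothetical helpers, not the authors' benchmarking code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <numeric>
#include <vector>

// Wall-clock time of one call to f, in seconds.
template <typename F>
double time_once(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Mean of the repeated measurements of one configuration (e.g. 40 runs).
double mean(const std::vector<double>& runs) {
    assert(!runs.empty());
    return std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
}

// Lowest mean across configurations: the reported "best execution time".
double best_mean(const std::vector<std::vector<double>>& configs) {
    assert(!configs.empty());
    double best = mean(configs.front());
    for (const std::vector<double>& runs : configs) best = std::min(best, mean(runs));
    return best;
}
```

Keeping measurement (time_once) separate from aggregation (mean, best_mean) lets the statistics be checked on recorded timings without rerunning the benchmark.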
Sobel Application (S → R(S) → S)
    using namespace spar;
    // global declaration
    int main(int argc, char *argv[]) {
        // open directory ...
        DIR *dptr = opendir(...);
        struct dirent *dfptr;
        [[ToStream(Input(dfptr, dptr, argv), Output(tot_not, tot_img))]]
        while ((dfptr = readdir(dptr)) != NULL) {
            // preprocessing
            if (file_extension == "bmp") {
                // Reads the image ...
                tot_img++;
                image = read(filename, height, width);
                [[Stage(Input(image, height, width), Output(new_image)), Replicate(workers)]] {
                    // Applies the Sobel
                    new_image = sobel(image, height, width);
                }
                [[Stage(Input(new_image, height, width))]] {
                    // Writes the image ...
                    write(new_image, height, width);
                } // end stage
            } else {
                tot_not++;
            }
        } // end of stream computing
        // end processing
        return 0;
    }
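The slides do not show the body of sobel(image, height, width). For reference, here is a standard Sobel edge detector under stated assumptions: a grayscale image stored row-major with one byte per pixel, the gradient magnitude approximated as |Gx| + |Gy| and clamped to 255, and border pixels left at 0. The signature is a hypothetical stand-in for the slide's helper, not the authors' code.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical grayscale Sobel filter (row-major, 1 byte per pixel).
std::vector<unsigned char> sobel(const std::vector<unsigned char>& image, int height, int width) {
    assert(static_cast<int>(image.size()) == height * width);
    std::vector<unsigned char> out(image.size(), 0);  // border pixels stay 0
    auto px = [&](int r, int c) { return static_cast<int>(image[r * width + c]); };
    for (int r = 1; r < height - 1; ++r) {
        for (int c = 1; c < width - 1; ++c) {
            // Horizontal gradient, kernel [-1 0 1; -2 0 2; -1 0 1]
            const int gx = -px(r - 1, c - 1) + px(r - 1, c + 1)
                           - 2 * px(r, c - 1) + 2 * px(r, c + 1)
                           - px(r + 1, c - 1) + px(r + 1, c + 1);
            // Vertical gradient, kernel [-1 -2 -1; 0 0 0; 1 2 1]
            const int gy = -px(r - 1, c - 1) - 2 * px(r - 1, c) - px(r - 1, c + 1)
                           + px(r + 1, c - 1) + 2 * px(r + 1, c) + px(r + 1, c + 1);
            const int magnitude = std::abs(gx) + std::abs(gy);  // |G| approximated as |Gx| + |Gy|
            out[r * width + c] = static_cast<unsigned char>(magnitude > 255 ? 255 : magnitude);
        }
    }
    return out;
}
```

Each output pixel depends only on the input image, so calls on different images are independent, which is what makes the replicated Stage safe.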
Sobel Application (S → R(S))
    using namespace spar;
    // global declaration
    int main(int argc, char *argv[]) {
        // open directory ...
        DIR *dptr = opendir(...);
        struct dirent *dfptr;
        [[ToStream(Input(dfptr, dptr, argv), Output(tot_not, tot_img))]]
        while ((dfptr = readdir(dptr)) != NULL) {
            // preprocessing
            if (file_extension == "bmp") {
                // Reads the image ...
                tot_img++;
                image = read(filename, height, width);
                [[Stage(Input(image, height, width)), Replicate(workers)]] {
                    // Applies the Sobel
                    new_image = sobel(image, height, width);
                    // Writes the image ...
                    write(new_image, height, width);
                } // end stage
            } else {
                tot_not++;
            }
        } // end of stream computing
        // end processing
        return 0;
    }
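To show what the S → R(S) structure means at run time, here is a hand-written emulation with std::thread: the calling thread plays the sequential ToStream part and emits the stream, and several worker threads replicate the Stage body. It is a sketch under assumptions, not the code the DSL generates: it uses a mutex-based channel instead of FastFlow's lock-free queues, and Channel and run_farm are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical blocking channel (mutex + condition variable, not lock-free).
template <typename T>
class Channel {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        ready_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until an item arrives; returns false once closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// S -> R(S): the caller emits the stream; `workers` threads run the replicated stage.
// `stage` must be safe to call concurrently.
template <typename T, typename F>
void run_farm(const std::vector<T>& stream, int workers, F stage) {
    assert(workers > 0);
    Channel<T> channel;
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&channel, &stage] {
            T item;
            while (channel.pop(item)) stage(item);  // replicated stage body
        });
    for (const T& item : stream) channel.push(item);  // emitter: sequential first stage
    channel.close();                                   // end of stream
    for (std::thread& t : pool) t.join();
}
```

In the Sobel case each item would be an image, and the stage would apply the filter and write the result; since writing happens inside the replicated stage, output order is not preserved.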
Sobel Application (Parallel Activity Graph)
[Figure: parallel activity graph of the Sobel application]
Sobel Application (Performance)
[Figure: Sobel application execution time (s) for the DSL and OMP-4.0 versions; tested versions: S, [S->R(S)->S], [S->R(S)], [S->R(S)->R(S)]. Panel "Results (Size=800x600, N=400)": sequential (S) 21.7 s, parallel versions between 3.90 s and 5.60 s. Panel "Results (Size=mixed, N=400)", logarithmic time axis: sequential (S) 108.9 s, parallel versions between 20.32 s and 25.88 s.]
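Only execution times are reported. Assuming the usual definition of speedup as sequential time over parallel time, the range of parallel times shown on this slide bounds the speedups as follows (the extracted chart does not let us attribute each time to a specific version):

```latex
\begin{aligned}
S &= \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}} \\
S_{800\times600} &\in \left[\tfrac{21.7}{5.60},\ \tfrac{21.7}{3.90}\right] \approx [3.88,\ 5.56] \\
S_{\mathrm{mixed}} &\in \left[\tfrac{108.9}{25.88},\ \tfrac{108.9}{20.32}\right] \approx [4.21,\ 5.36]
\end{aligned}
```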
Prime Number Application (S ↔ R(S))
    using namespace spar;
    // global declarations ...
    int prime_number(int n) {
        int total = 0;
        [[ToStream(Input(total, n), Output(total))]]
        for (int i = 2; i <= n; i++) {
            int prime = 1;
            [[Stage(Input(i, prime), Output(prime)), Replicate(workers)]]
            for (int j = 2; j < i; j++) {
                if (i % j == 0) {
                    prime = 0;
                    break;
                }
            }
            total = total + prime;
        }
        return total;
    }
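Reading S ↔ R(S) as a farm with feedback (the ToStream thread emits each i, the replicated Stage computes prime, and the result flows back to the ToStream thread, which runs total = total + prime), the sketch below emulates that structure with std::thread. It is an illustration under that assumption, not the code SPar generates for FastFlow; Channel, is_prime, and prime_number_feedback are hypothetical names, and for simplicity the emitter sends the whole stream before collecting the results.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical blocking channel (mutex + condition variable, not lock-free).
template <typename T>
class Channel {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        ready_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until an item arrives; returns false once closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Trial division, as in the Stage body on the slide: 1 if i is prime, else 0.
int is_prime(int i) {
    for (int j = 2; j < i; ++j)
        if (i % j == 0) return 0;
    return 1;
}

// S <-> R(S): the caller emits i = 2..n and also receives every result (feedback).
int prime_number_feedback(int n, int workers) {
    assert(workers > 0);
    Channel<int> tasks, results;
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&tasks, &results] {
            int i;
            while (tasks.pop(i)) results.push(is_prime(i));  // replicated stage
        });
    int emitted = 0;
    for (int i = 2; i <= n; ++i, ++emitted) tasks.push(i);  // emit the stream
    tasks.close();
    int total = 0;
    for (int k = 0; k < emitted; ++k) {  // feedback: one result per emitted item
        int prime = 0;
        results.pop(prime);
        total += prime;  // total = total + prime, run by the emitting thread
    }
    for (std::thread& t : pool) t.join();
    return total;
}
```

Because the emitting thread also performs the accumulation, no separate collector thread is needed, which is the difference from the S → R(S) → S variant.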
Prime Number Application (S → R(S) → S)
    using namespace spar;
    // global declarations ...
    int prime_number(int n) {
        int total = 0;
        [[ToStream(Input(total, n), Output(total))]]
        for (int i = 2; i <= n; i++) {
            int prime = 1;
            [[Stage(Input(i, prime), Output(prime)), Replicate(workers)]]
            for (int j = 2; j < i; j++) {
                if (i % j == 0) {
                    prime = 0;
                    break;
                }
            }
            [[Stage(Input(total, prime), Output(total))]] { total = total + prime; }
        }
        return total;
    }
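In this version the accumulation moves into its own sequential Stage, so the structure becomes emitter, replicated workers, and a single collector. The sketch below emulates that three-stage shape with std::thread, again under assumptions: a mutex-based channel stands in for FastFlow's lock-free queues, and Channel, is_prime, and prime_number_farm are hypothetical names rather than generated code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical blocking channel (mutex + condition variable, not lock-free).
template <typename T>
class Channel {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        ready_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until an item arrives; returns false once closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Trial division, as in the replicated Stage on the slide: 1 if i is prime, else 0.
int is_prime(int i) {
    for (int j = 2; j < i; ++j)
        if (i % j == 0) return 0;
    return 1;
}

// S -> R(S) -> S: emitter (caller), replicated primality stage, single collector.
int prime_number_farm(int n, int workers) {
    assert(workers > 0);
    Channel<int> to_workers, to_collector;
    int total = 0;
    std::thread collector([&to_collector, &total] {
        int prime;
        while (to_collector.pop(prime)) total += prime;  // last Stage: total = total + prime
    });
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&to_workers, &to_collector] {
            int i;
            while (to_workers.pop(i)) to_collector.push(is_prime(i));
        });
    for (int i = 2; i <= n; ++i) to_workers.push(i);  // first stage: emit the stream
    to_workers.close();
    for (std::thread& t : pool) t.join();  // no more results after all workers finish
    to_collector.close();
    collector.join();  // total is safe to read after the collector has joined
    return total;
}
```

Only the collector thread writes total, so the sum needs no extra synchronization; the channels are closed in stage order to propagate the end of the stream.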
Prime Number Application (Parallel Activity Graph)
[Figure: parallel activity graph of the Prime Number application]
Prime Number Application (Performance)
[Figure: performance results of the Prime Number application]
Related Work
- Standard runtimes: OpenMP [8], Cilk [9], TBB [10].
- Research runtimes:
  - Programming language: StreamIt [11].
  - Skeleton library: FastFlow [12].
  - Standard extensions: Cilk-Piper [13] and OpenStream [14].