
Toward a Core Design to Distribute an Execution on a Manycore Processor

PaCT2015, Petrozavodsk, August 31 - September 4, 2015. Toward a Core Design to Distribute an Execution on a Manycore Processor. Bernard Goossens, David Parello, Katarzyna Porada, Djallal Rahmoune, Université de Perpignan Via Domitia.


  1. PaCT’2015, Petrozavodsk, August 31 - September 4, 2015. Toward a Core Design to Distribute an Execution on a Manycore Processor. Bernard Goossens, David Parello, Katarzyna Porada, Djallal Rahmoune. Université de Perpignan Via Domitia, DALI-LIRMM.

  2. Summary: (1) Parallelization of a C Code, (2) Automatic Hardware Parallelization, (3) Determinism, (4) Conclusion.

  3. Parallelization of a C Code.

  4. Example: a sum reduction.

```c
/* Sums the n elements of t (n >= 1) by splitting the array into two halves. */
long sum(long t[], unsigned int n) {
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    return sum(t, n / 2) + sum(&(t[n / 2]), n - n / 2);
}
```

This code looks sequential. Let us parallelize it.

  8. What we do today: e.g. using pthreads.

```c
#include <pthread.h>

typedef struct {
    unsigned long *p;  /* first element of the sub-array */
    unsigned long i;   /* number of elements */
} ST;

void *sum(void *st) {
    ST str1, str2;
    unsigned long s, s1, s2;
    pthread_t tid1, tid2;
    if (((ST *)st)->i > 2) {
        str1.p = ((ST *)st)->p;
        str1.i = ((ST *)st)->i / 2;
        pthread_create(&tid1, NULL, sum, (void *)&str1);
        str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
        str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
        pthread_create(&tid2, NULL, sum, (void *)&str2);
        /* No join: s1 and s2 are read below without ever being set, and
           str1/str2 cease to exist when this thread exits, possibly
           before the children have read them. */
    } else if (((ST *)st)->i == 1) {
        s1 = ((ST *)st)->p[0];
        s2 = 0;
    } else {
        s1 = ((ST *)st)->p[0];
        s2 = ((ST *)st)->p[1];
    }
    s = s1 + s2;
    pthread_exit((void *)s);
}
```

The code is multithreaded. Thread executions are ordered nondeterministically. Too little synchronization => the result is not deterministic.

  11. Synchronized threads.

```c
#include <pthread.h>

typedef struct {
    unsigned long *p;  /* first element of the sub-array */
    unsigned long i;   /* number of elements */
} ST;

void *sum(void *st) {
    ST str1, str2;
    unsigned long s, s1, s2;
    pthread_t tid1, tid2;
    if (((ST *)st)->i > 2) {
        str1.p = ((ST *)st)->p;
        str1.i = ((ST *)st)->i / 2;
        pthread_create(&tid1, NULL, sum, (void *)&str1);
        pthread_join(tid1, (void **)&s1);  /* wait for the first half ... */
        str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
        str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
        pthread_create(&tid2, NULL, sum, (void *)&str2);  /* ... before starting the second */
        pthread_join(tid2, (void **)&s2);
    } else if (((ST *)st)->i == 1) {
        s1 = ((ST *)st)->p[0];
        s2 = 0;
    } else {
        s1 = ((ST *)st)->p[0];
        s2 = ((ST *)st)->p[1];
    }
    s = s1 + s2;
    pthread_exit((void *)s);
}
```

Among all possible run orderings, the synchronization keeps only the good ones (i.e. those computing the same result as a sequential execution). Too much synchronization => not parallel enough.

  12. Correctly synchronized threads.

```c
#include <pthread.h>

typedef struct {
    unsigned long *p;  /* first element of the sub-array */
    unsigned long i;   /* number of elements */
} ST;

void *sum(void *st) {
    ST str1, str2;
    unsigned long s, s1, s2;
    pthread_t tid1, tid2;
    if (((ST *)st)->i > 2) {
        str1.p = ((ST *)st)->p;
        str1.i = ((ST *)st)->i / 2;
        pthread_create(&tid1, NULL, sum, (void *)&str1);
        str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
        str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
        pthread_create(&tid2, NULL, sum, (void *)&str2);
        /* Both halves now run concurrently; wait for both results.
           The result is passed back through a void *, which assumes
           unsigned long and void * have the same size. */
        pthread_join(tid1, (void **)&s1);
        pthread_join(tid2, (void **)&s2);
    } else if (((ST *)st)->i == 1) {
        s1 = ((ST *)st)->p[0];
        s2 = 0;
    } else {
        s1 = ((ST *)st)->p[0];
        s2 = ((ST *)st)->p[1];
    }
    s = s1 + s2;
    pthread_exit((void *)s);
}
```

  14. What we propose to do.

```c
/* Sums the n elements of t (n >= 1) by splitting the array into two halves. */
long sum(long t[], unsigned int n) {
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    return sum(t, n / 2) + sum(&(t[n / 2]), n - n / 2);
}
```

This code is usually understood as sequential.