

  1. DiscoPoP: A Profiling Tool to Identify Parallelization Opportunities Zhen Li, Rohit Atre, Zia Ul-Huda, Ali Jannesari, and Felix Wolf 02.10.2014

  2. Outline • Background • Approach • Results 2 01.10.2014

  3. Background
  • Multicore CPUs are dominating the desktop and server markets, but writing programs that exploit the available hardware parallelism of these architectures remains a challenge.
  • Today, software development mostly means transforming programs written by someone else rather than starting from scratch. [1]
  • Parallelizing legacy sequential programs is therefore a huge economic challenge.
  • Appropriate tool support is required.
  [1] R. E. Johnson. Software development is program transformation. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, FoSER '10, pages 177-180.

  4. Related work
  • Dynamic approaches
  Kremlin – "gprof in the parallel age"
  o measures "available parallelism"
  o targets OpenMP-style loops
  o based on critical-path analysis
  Alchemist
  o ratio of the number of instructions to the number of dependences
  o works on control regions
  o counts data dependences
  - Previous dynamic approaches usually do not reveal the root causes that prevent parallelization, because recording dependences at that level of detail would make the profiling overhead too high.

  5. Related work
  • Static approaches
  Cetus
  o compiler infrastructure for source-to-source transformation
  o framework for writing automatic parallelization tools
  ParallWare, Par4All, Polly, PLUTO, …
  o loop parallelism
  o automatic parallel code generation
  o mainly for scientific computing kernels
  - Previous static approaches focus mainly on loop parallelism in scientific computing, since static dependence analysis is conservative and such kernels have more regular access patterns.

  6. Our goal
  • Discover potential parallelism in sequential programs
  • Target parallelism:
  o DOALL loops
  o Pipeline
  o Tasking
  • Reveal the specific data dependences that prevent parallelization
  • Efficient in time and space

  7. Outline • Background • Approach • Results

  8. Approach
  • Work flow (two phases):
  Phase 1 (static): conversion to IR and static control-flow analysis; instrumentation of memory accesses and control flow; extraction of control-region information.
  Phase 2 (dynamic): execution of the instrumented program yields data dependences; parallelism discovery and ranking then produce ranked parallelization opportunities for the source code.

  9. Approach • Background • Approach o Dependence profiling o Computational Unit and CU graph o Parallelism discovery • Results

  10. Dependence profiling
  • Detailed data dependences with control-flow information:
  1:60 BGN loop
  1:60 NOM {RAW 1:60|i} {WAR 1:60|i} {INIT *}
  1:63 NOM {RAW 1:59|temp1} {RAW 1:67|temp1}
  1:64 NOM {RAW 1:60|i}
  1:65 NOM {RAW 1:59|temp1} {RAW 1:67|temp1} {WAR 1:67|temp2} {INIT *}
  1:66 NOM {RAW 1:59|temp1} {RAW 1:65|temp2} {RAW 1:67|temp1} {INIT *}
  1:67 NOM {RAW 1:65|temp2} {WAR 1:66|temp1}
  1:70 NOM {RAW 1:67|temp1} {INIT *}
  1:74 NOM {RAW 1:41|block}
  1:74 END loop 1200

  11. Dependence profiling
  • Support for multithreaded programs:
  4:58|2 NOM {WAR 4:77|2|iter}
  4:59|2 NOM {WAR 4:71|2|z_real}
  4:64|3 NOM {RAW 3:75|0|maxiter} {RAW 4:58|3|iter} {RAW 4:61|3|z_norm} {RAW 4:71|3|z_norm} {RAW 4:73|3|iter}
  4:69|3 NOM {RAW 4:57|3|c_real} {RAW 4:66|3|z2_real} {WAR 4:67|3|z_real}
  4:71|2 NOM {RAW 4:69|2|z_real} {RAW 4:70|2|z_imag} {WAR 4:64|2|z_norm}
  4:80|1 NOM {WAW 4:80|1|green} {INIT *}
  - Discovers more parallelism in parallel programs
  - Supports other analyses whose necessary information can be derived from dependences

  12. Dependence profiling
  • Parallel implementation, efficient in both time and space
  • Implemented based on LLVM 1
  • Instrumentation applied to the IR
  • Instrumentation library integrated in Compiler-RT
  • Interface integrated in Clang
  1 DiscoPoP on the LLVM website: http://llvm.org/ProjectsWithLLVM/

  13. Approach • Background • Approach o Dependence profiling o Computational Unit and CU graph o Parallelism discovery • Results

  14. Computational Unit (CU)
  • A collection of instructions
  • Follows the read-compute-write pattern: a program state is first read from memory, the new state is computed, and finally written back
  • A small piece of code containing no parallelism or only ILP
  • Building blocks of parallel tasks

  15. Computational Unit (CU)
  //Region 0; Depth 0
  void netlist::get_random_pair(netlist_elem** a, netlist_elem** b, Rng* rng) {
      //get a random element
      long id_a = rng->rand(_chip_size);
      netlist_elem* elem_a = &(_elements[id_a]);
      //now do the same for b
      long id_b = rng->rand(_chip_size);
      netlist_elem* elem_b = &(_elements[id_b]);
      //Region 1; Depth 1;
      while (id_b == id_a) {
          id_b = rng->rand(_chip_size);
          elem_b = &(_elements[id_b]);
      }
      *a = elem_a;
      *b = elem_b;
      return;
  }

  16. Computational Unit (CU)
  [figure-only slide]

  17. CU graph
  • Two CUs can share common instructions  blue edges: do the two CUs refer to the same code section?
  • A CU can depend on another via a data dependence  red edges: do the two CUs tightly depend on each other? Should the two CUs be merged?
  [Figure: CU graph with nodes 53-59; blue edge weights give the number of common instructions, red edge weights the number of dependences]

  18. Approach • Background • Approach o Dependence profiling o Computational Unit and CU graph o Parallelism discovery • Results

  19. Parallelism discovery
  • DOALL loops
  o Looking for loop-carried dependences
  [Figure: loopB nested inside loopA; analysis result: loopA: no, loopB: yes]

  20. Parallelism discovery
  • DOALL loops
  o Looking for loop-carried dependences
  [Figure: loopB nested inside loopA; analysis result: loopA: yes, loopB: no]

  21. Parallelism discovery
  • DOALL loops
  o Looking for loop-carried dependences
  [Figure: loopB nested inside loopA, with a different dependence configuration; analysis result: loopA: no, loopB: yes]

  22. Parallelism discovery
  • DOALL loops
  o Looking for loop-carried dependences
  [Figure: loopB nested inside loopA, with a different dependence configuration; analysis result: loopA: no, loopB: yes]

  23. Parallelism discovery
  #pragma omp parallel for private(i, price, priceDelta)
  for (i=0; i<numOptions; i++) {
      /* Calling main function to calculate option value based on
       * Black & Scholes's equation. */
      price = BlkSchlsEqEuroNoDiv( sptprice[i], strike[i], rate[i],
                                   volatility[i], otime[i], otype[i], 0);
      prices[i] = price;
  #ifdef ERR_CHK
      priceDelta = data[i].DGrefval - price;
      if( fabs(priceDelta) >= 1e-4 ){
          printf("Error on %d. Computed=%.5f, Ref=%.5f, Delta=%.5f\n",
                 i, price, data[i].DGrefval, priceDelta);
          numError++;
      }
  #endif
  }
  The main loop of Parsec.blackscholes.

  24. Parallelism discovery
  • Tasking
  [Figure: a CU graph with nodes A-I is condensed in two steps; strongly connected components (SCCs) and chains of CUs (SCC chains) are merged into tasks]

  25. Parallelism discovery
  • Tasking
  [Figure: CU graph with nodes 53-59; from the number of common instructions and the number of dependences on each edge, an affinity is computed; a minimum cut on the affinities then partitions the graph into tasks]

  26. Parallelism discovery
  #pragma omp parallel
  {
  #pragma omp sections
  {
  #pragma omp section
  {
      for (auto iter = _elem_names.begin(); iter != _elem_names.end(); ++iter){
          netlist_elem* elem = iter->second;
          for (int i = 0; i < elem->fanin.size(); ++i){
              location_t* fanin_loc = elem->fanin[i]->present_loc.Get();
              fanin_cost += fabs(elem->present_loc.Get()->x - fanin_loc->x);
              fanin_cost += fabs(elem->present_loc.Get()->y - fanin_loc->y);
          }
      }
  }

  27. Parallelism discovery
  #pragma omp section
  {
      for (auto iter = _elem_names.begin(); iter != _elem_names.end(); ++iter){
          netlist_elem* elem = iter->second;
          for (int i = 0; i < elem->fanout.size(); ++i){
              location_t* fanout_loc = elem->fanout[i]->present_loc.Get();
              fanout_cost += fabs(elem->present_loc.Get()->x - fanout_loc->x);
              fanout_cost += fabs(elem->present_loc.Get()->y - fanout_loc->y);
          }
      }
  }
  }
  }

  28. Parallelism discovery
  • Pipeline
  o template matching
  o input: a CU graph mapped onto the execution tree of the program
  [Figure: execution tree with a root, function, loop, and leaf nodes; a solid edge from x to y means y is called from x, a dashed edge means y is data-dependent on x]

  29. Parallelism discovery
  • Pipeline
  • Cross-correlation between two vectors to determine similarity
  • The vector of a program is derived from its CU graph
  • The vector of a parallel pattern is built using specific properties of the pattern
  • CorrCoef(p, g) ∈ [0, 1]
  • CorrCoef:
  - 0: pattern not detected
  - 1: pattern detected successfully
  - (0, 1): pattern may exist, but there are obstacles in implementing it
