

DiscoPoP: A Profiling Tool to Identify Parallelization Opportunities

Zhen Li, Rohit Atre, Zia Ul-Huda, Ali Jannesari, and Felix Wolf 02.10.2014

Outline

  • Background
  • Approach
  • Results

01.10.2014 2


Background

  • Multicore CPUs dominate the desktop and server market, but writing programs that exploit the available hardware parallelism on these architectures remains a challenge.
  • Today, software development is mostly the transformation of programs written by someone else rather than starting from scratch. [1]
  • Parallelizing legacy sequential programs presents a huge economic challenge.
  • Appropriate tool support is required.

[1] R. E. Johnson. Software development is program transformation. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, FoSER '10, pages 177-180.

01.10.2014 3


Related work

  • Dynamic approaches

Kremlin – "gprof for the parallel age"

  • “available parallelism”
  • targets OpenMP-style loops
  • based on critical path analysis

Alchemist

  • number of instructions / number of dependencies
  • control regions
  • counts data dependencies
  • Previous dynamic approaches usually do not reveal the root causes that prevent parallelization, as the profiling overhead is too high.

01.10.2014 4

Our purpose: full data dependence analysis.


Related work

  • Static approaches

Cetus

  • compiler infrastructure for source-to-source transformation
  • framework for writing automatic parallelization tools

ParallWare, Par4All, Polly, PLUTO, …

  • loop parallelism
  • automatic parallel code generation
  • mainly for scientific computing kernels
  • Previous static approaches mainly focus on loop parallelism in the scientific computing area, since static dependence analysis is conservative and such kernels have more regular access patterns.

01.10.2014 5


Our goal

  • Discover potential parallelism in sequential programs
  • Target parallelism:
  • DOALL loops
  • Pipeline
  • Tasking
  • Reveal specific data dependences that prevent parallelization
  • Efficient in time and space

01.10.2014 6

DOALL loops are loops without loop-carried dependences, i.e., dependences between two iterations. Parallelizing DOALL loops is usually trivial but leads to an obvious speedup. We definitely want to cover such parallelism.
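
As a minimal illustration (invented code, not taken from the slides), the following C++ fragment contrasts a loop with a loop-carried RAW dependence with a DOALL loop; the OpenMP pragma shows how trivially the latter can be parallelized.

#include <cstdio>

#define N 1000

int main() {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0 * i; }

    // Not DOALL: iteration i reads a[i-1], which the previous iteration
    // wrote, i.e., a loop-carried RAW dependence.
    for (int i = 1; i < N; ++i)
        a[i] = a[i - 1] + b[i];

    // DOALL: every iteration touches only its own elements, so the loop
    // can be parallelized directly.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        c[i] = a[i] * b[i];

    std::printf("%f %f\n", a[N - 1], c[N - 1]);
    return 0;
}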

Outline

  • Background
  • Approach
  • Results

01.10.2014 7


Approach

  • Work flow

01.10.2014 8

[Workflow diagram. Phase 1 (static and dynamic): the source code is converted to IR, instrumented for memory accesses and control flow, and executed; together with a static control-flow analysis this yields a dependency graph and control-region information. Phase 2: parallelism discovery and ranking turn these into ranked parallel opportunities.]

The workflow of DiscoPoP is divided into two phases: In the first phase, we instrument the target program and execute it. Control-flow information and data dependences are obtained in this phase. In the second phase, we build computational units (CUs) for the target program and search for potential parallelism based on the CUs and the dependences among them. The output is a list of parallelization opportunities, consisting of several code sections that may run in parallel. These opportunities are also ranked, to let users focus on the most interesting ones.

Approach

  • Background
  • Approach
  • Dependence profiling
  • Computational Unit and CU graph
  • Parallelism discovery
  • Results

01.10.2014 9


Dependence profiling

  • Detailed data dependences with control-flow information

1:60 BGN loop
1:60 NOM {RAW 1:60|i} {WAR 1:60|i} {INIT *}
1:63 NOM {RAW 1:59|temp1} {RAW 1:67|temp1}
1:64 NOM {RAW 1:60|i}
1:65 NOM {RAW 1:59|temp1} {RAW 1:67|temp1} {WAR 1:67|temp2} {INIT *}
1:66 NOM {RAW 1:59|temp1} {RAW 1:65|temp2} {RAW 1:67|temp1} {INIT *}
1:67 NOM {RAW 1:65|temp2} {WAR 1:66|temp1}
1:70 NOM {RAW 1:67|temp1} {INIT *}
1:74 NOM {RAW 1:41|block}
1:74 END loop 1200

01.10.2014 10

A data dependence is represented as a triple <sink, type, source>. "Type" is the dependence type (RAW, WAR or WAW). Note that a special type INIT represents the first write operation to a memory address. "Sink" and "source" are the source-code locations of the latter and the former memory access, respectively. "Sink" is further represented as a pair <fileID:lineID>, while "source" is represented as a triple <fileID:lineID|variableName>. Data dependences with the same "sink" are aggregated together. The keyword NOM (short for "NORMAL") indicates that the source line specified by the aggregated "sink" has no control-flow information. Otherwise, BGN and END represent the entry and exit point of a control region, respectively. In this example, a loop starts at source line 1:60 and ends at source line 1:74. The number 1200 following "END loop" shows the actual number of iterations executed.
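
As a hedged illustration of the dependence types named above (this is invented code, not the program that produced the listing), the comments in the following fragment mark the dependences a profiler of this kind would record.

#include <cstdio>

void accumulate(int* buf, int n) {
    int temp = 0;                // INIT: first write to temp
    for (int i = 0; i < n; ++i) {
        temp = temp + buf[i];    // RAW on temp: reads the value written in the
                                 // previous iteration (or at initialization)
        buf[i] = temp;           // WAR on buf[i]: written after being read above
        buf[i] = buf[i] & 0xff;  // WAW on buf[i]: a second write to the same address
    }
}

int main() {
    int data[4] = {1, 2, 3, 4};
    accumulate(data, 4);
    std::printf("%d %d %d %d\n", data[0], data[1], data[2], data[3]);
    return 0;
}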


Dependence profiling

  • Support for multithreaded programs

4:58|2 NOM {WAR 4:77|2|iter}
4:59|2 NOM {WAR 4:71|2|z_real}
4:64|3 NOM {RAW 3:75|0|maxiter} {RAW 4:58|3|iter} {RAW 4:61|3|z_norm} {RAW 4:71|3|z_norm} {RAW 4:73|3|iter}
4:69|3 NOM {RAW 4:57|3|c_real} {RAW 4:66|3|z2_real} {WAR 4:67|3|z_real}
4:71|2 NOM {RAW 4:69|2|z_real} {RAW 4:70|2|z_imag} {WAR 4:64|2|z_norm}
4:80|1 NOM {WAW 4:80|1|green} {INIT *}

  • Discover more parallelism in parallel programs
  • Support other analyses whose necessary information can be derived from dependences

01.10.2014 11

Dependences of a code section in Mandelbrot. Thread IDs are highlighted.

Dependence profiling

  • Parallel implementation, efficient in both time and space
  • Implemented based on LLVM1
  • Instrumentation applied to IR
  • Instrumentation library integrated in Compiler-RT
  • Interface integrated in Clang

1 DiscoPoP on LLVM website: http://llvm.org/ProjectsWithLLVM/

01.10.2014 12
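
The slides state only that the instrumentation is applied to the IR. As a rough source-level sketch of what memory-access instrumentation amounts to, the following uses invented hook names (dp_read, dp_write); they are not DiscoPoP's actual runtime interface.

#include <cstdio>

// Invented stand-ins for the profiling runtime: they would record
// (address, source line) pairs from which data dependences are derived.
static void dp_read(const void* addr, int line)  { std::printf("R %p line %d\n", addr, line); }
static void dp_write(const void* addr, int line) { std::printf("W %p line %d\n", addr, line); }

int sum(const int* a, int n) {
    int s = 0;
    dp_write(&s, __LINE__);           // store to s
    for (int i = 0; i < n; ++i) {
        dp_read(&a[i], __LINE__);     // load of a[i]
        dp_read(&s, __LINE__);        // load of s
        s = s + a[i];
        dp_write(&s, __LINE__);       // store to s
    }
    return s;
}

int main() {
    int a[4] = {1, 2, 3, 4};
    std::printf("sum = %d\n", sum(a, 4));
    return 0;
}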

Approach

  • Background
  • Approach
  • Dependence profiling
  • Computational Unit and CU graph
  • Parallelism discovery
  • Results

01.10.2014 13


Computational Unit (CU)

  • A collection of instructions
  • Follows the read-compute-write pattern: a program state is first read from memory, the new state is computed, and finally written back

  • A small piece of code containing no parallelism or only ILP
  • Building blocks of parallel tasks

01.10.2014 14

Advantage: instructions in a CU do not need to be contiguous, and a chain of CUs can cross control regions. This means a potential task is not limited to a predefined construct.
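
A small invented fragment annotated with the read-compute-write phases (for illustration only, not output of a DiscoPoP analysis):

#include <cstdio>

// One CU-sized computation: read program state, compute a new state,
// and write it back to memory.
static void update_norm(const double* coords, double* norms, int idx, double scale) {
    double x = coords[2 * idx];             // read: program state is loaded
    double y = coords[2 * idx + 1];         // read
    double norm = (x * x + y * y) * scale;  // compute: the new state is derived
    norms[idx] = norm;                      // write: the new state is stored back
}

int main() {
    double coords[4] = {3.0, 4.0, 1.0, 2.0};
    double norms[2] = {0.0, 0.0};
    update_norm(coords, norms, 0, 1.0);
    update_norm(coords, norms, 1, 1.0);
    std::printf("%f %f\n", norms[0], norms[1]);
    return 0;
}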


Computational Unit (CU)

//Region 0; Depth 0
void netlist::get_random_pair(netlist_elem** a, netlist_elem** b, Rng* rng) {
    //get a random element
    long id_a = rng->rand(_chip_size);
    netlist_elem* elem_a = &(_elements[id_a]);
    //now do the same for b
    long id_b = rng->rand(_chip_size);
    netlist_elem* elem_b = &(_elements[id_b]);
    //Region 1; Depth 1;
    while (id_b == id_a) {
        id_b = rng->rand(_chip_size);
        elem_b = &(_elements[id_b]);
    }
    *a = elem_a;
    *b = elem_b;
    return;
}

01.10.2014 15

Function netlist::get_random_pair() of Canneal, one of the benchmarks from the PARSEC benchmark suite.


Computational Unit (CU)

01.10.2014 16

The two computations mentioned above follow a basic rule: a variable or a group of variables is read, and then used to perform another calculation. This is followed by the final state being written to another variable as a store operation. Hence, these two computations can be said to follow a read-compute-write pattern. Such CUs form the building blocks of the tasks that can be created for exploiting parallelism in sequential programs.

CU graph

  • Two CUs can share common instructions → blue edges
    → Do the two CUs refer to the same code section?
  • A CU can depend on another via a data dependence → red edges
    → Do the two CUs tightly depend on each other?
    → Should the two CUs be merged?

01.10.2014 17

[Figure: CU graph with nodes 53-59; blue edge labels give the number of common instructions, red edge labels the number of dependences.]
Approach

  • Background
  • Approach
  • Dependence profiling
  • Computational Unit and CU graph
  • Parallelism discovery
  • Results

01.10.2014 18

Parallelism discovery

  • DOALL loops
  • Looking for loop-carried dependences

loopA {
  ……
  ……
  loopB {
    ……
    ……
    ……
  }
  ……
  ……
}

01.10.2014 19

loopA: no, loopB: yes

Parallelism discovery

  • DOALL loops
  • Looking for loop-carried dependences

loopA {
  ……
  ……
  loopB {
    ……
    ……
    ……
  }
  ……
  ……
}

01.10.2014 20

loopA: yes, loopB: no

Parallelism discovery

  • DOALL loops
  • Looking for loop-carried dependences

loopA {
  ……
  ……
  loopB {
    ……
    ……
    ……
  }
  ……
  ……
}

01.10.2014 21

loopA: no, loopB: yes

Parallelism discovery

  • DOALL loops
  • Looking for loop-carried dependences

loopA {
  ……
  ……
  loopB {
    ……
    ……
    ……
  }
  ……
  ……
}

01.10.2014 22

loopA: no, loopB: yes

Parallelism discovery

#pragma omp parallel for private(i, price, priceDelta)
for (i = 0; i < numOptions; i++) {
    /* Calling main function to calculate option value
       based on Black & Scholes's equation. */
    price = BlkSchlsEqEuroNoDiv(sptprice[i], strike[i], rate[i],
                                volatility[i], otime[i], otype[i], 0);
    prices[i] = price;
#ifdef ERR_CHK
    priceDelta = data[i].DGrefval - price;
    if (fabs(priceDelta) >= 1e-4) {
        printf("Error on %d. Computed=%.5f, Ref=%.5f, Delta=%.5f\n",
               i, price, data[i].DGrefval, priceDelta);
        numError++;
    }
#endif
}

The main loop of Parsec.blackscholes.

01.10.2014 23


Parallelism discovery

  • Tasking

01.10.2014 24

[Figure: CUs A-I are first grouped into strongly connected components (SCCs) and then into chains of CUs.]

After the process of forming chains and SCCs, we can suggest some task parallelism between independent chains and SCCs, that is, those without RAW dependences between them. Note that a chain of CUs may start and end anywhere in the program, without the limitation of predefined constructs, and the code in a chain of CUs does not need to be contiguous.
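
A minimal sketch (invented code; OpenMP tasks are used here only for illustration) of two CU chains that touch disjoint data, have no RAW dependence between them, and can therefore run as independent tasks:

#include <cstdio>

static long chain_a(const long* a, int n) {   // first "CU chain"
    long s = 0;
    for (int i = 0; i < n; ++i) s += a[i];
    return s;
}

static long chain_b(const long* b, int n) {   // second "CU chain", disjoint data
    long x = 0;
    for (int i = 0; i < n; ++i) x ^= b[i];
    return x;
}

int main() {
    static long a[1000], b[1000];
    long sum = 0, mix = 0;
    for (int i = 0; i < 1000; ++i) { a[i] = i; b[i] = 3 * i; }

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(sum)
        sum = chain_a(a, 1000);

        #pragma omp task shared(mix)   // no RAW dependence on chain_a's data
        mix = chain_b(b, 1000);

        #pragma omp taskwait
    }
    std::printf("%ld %ld\n", sum, mix);
    return 0;
}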


Parallelism discovery

  • Tasking

01.10.2014 25

[Figure: the CU graph (nodes 53-59), labeled with the numbers of common instructions and dependences, is turned into an affinity graph with edge weights such as 0.20, 0.28, 0.43; a minimum cut of the affinity graph then separates weakly coupled groups of CUs.]

However, some task parallelism can also be exploited with a small amount of refactoring effort, that is, when dependences between potential tasks exist but are weak. We cover this parallelism by applying a minimum cut to the CU graph. In the CU graph, a high edge weight between two vertices indicates that the two CUs either share a large amount of computation or are strongly dependent on one another. Using these two metrics, we calculate a value called affinity for every pair of CU nodes in the graph. The affinity between two CU nodes indicates how tightly coupled they are; a low affinity signifies that it is reasonable to separate the two CUs when forming tasks. The next step is to calculate the minimum cut of a connected component using the Stoer-Wagner algorithm. In graph theory, a minimum cut is a set of edges with the smallest number of edges (for an unweighted graph) or the smallest possible sum of weights (for a weighted graph). Identifying the minimum cut of a graph divides it into two components that were only weakly linked.
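
The note above names the Stoer-Wagner algorithm; the following is a minimal, self-contained C++ sketch of a global minimum cut on a small weighted affinity graph. The graph and its weights are invented for illustration and are not DiscoPoP's actual data structures.

#include <algorithm>
#include <iostream>
#include <limits>
#include <numeric>
#include <vector>

// Stoer-Wagner global minimum cut for an undirected graph given as a
// symmetric weight matrix; returns the weight of the lightest cut.
double stoer_wagner_min_cut(std::vector<std::vector<double>> w) {
    std::vector<int> v(w.size());              // indices of surviving super-vertices
    std::iota(v.begin(), v.end(), 0);
    double best = std::numeric_limits<double>::infinity();

    while (v.size() > 1) {
        int m = static_cast<int>(v.size());
        std::vector<double> sum(m, 0.0);       // connectivity to the grown set
        std::vector<bool> added(m, false);
        int prev = 0, last = 0;
        for (int i = 0; i < m; ++i) {
            int sel = -1;                      // most tightly connected vertex
            for (int j = 0; j < m; ++j)
                if (!added[j] && (sel == -1 || sum[j] > sum[sel])) sel = j;
            prev = last;
            last = sel;
            added[sel] = true;
            for (int j = 0; j < m; ++j)
                if (!added[j]) sum[j] += w[v[sel]][v[j]];
        }
        best = std::min(best, sum[last]);      // "cut of the phase"
        for (int j = 0; j < m; ++j) {          // merge the last two added vertices
            w[v[prev]][v[j]] += w[v[last]][v[j]];
            w[v[j]][v[prev]] += w[v[j]][v[last]];
        }
        v.erase(v.begin() + last);
    }
    return best;
}

int main() {
    // Invented affinity values for four CUs; CUs 0-1 and 2-3 are tightly
    // coupled, and the two pairs are only weakly linked.
    std::vector<std::vector<double>> affinity = {
        {0.00, 0.35, 0.05, 0.00},
        {0.35, 0.00, 0.10, 0.05},
        {0.05, 0.10, 0.00, 0.43},
        {0.00, 0.05, 0.43, 0.00},
    };
    std::cout << "min cut weight: " << stoer_wagner_min_cut(affinity) << "\n";  // 0.2
    return 0;
}

For these invented weights, the minimum cut separates CUs {0, 1} from CUs {2, 3} at a total weight of 0.2, which is exactly the kind of "weakly linked" split the note describes.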


Parallelism discovery

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            for (auto iter = _elem_names.begin(); iter != _elem_names.end(); ++iter) {
                netlist_elem* elem = iter->second;
                for (int i = 0; i < elem->fanin.size(); ++i) {
                    location_t* fanin_loc = elem->fanin[i]->present_loc.Get();
                    fanin_cost += fabs(elem->present_loc.Get()->x - fanin_loc->x);
                    fanin_cost += fabs(elem->present_loc.Get()->y - fanin_loc->y);
                }
            }
        }

01.10.2014 26

An example of a parallelized code section in canneal, a kernel from the Parsec benchmark suite.

Parallelism discovery

        #pragma omp section
        {
            for (auto iter = _elem_names.begin(); iter != _elem_names.end(); ++iter) {
                netlist_elem* elem = iter->second;
                for (int i = 0; i < elem->fanout.size(); ++i) {
                    location_t* fanout_loc = elem->fanout[i]->present_loc.Get();
                    fanout_cost += fabs(elem->present_loc.Get()->x - fanout_loc->x);
                    fanout_cost += fabs(elem->present_loc.Get()->y - fanout_loc->y);
                }
            }
        }
    }
}

01.10.2014 27

Parallelism discovery

  • Pipeline
  • template matching
  • input: a CU graph mapped onto the execution tree of the program

01.10.2014 28

[Figure: the execution tree of a program, with the root function, nested loops, and leaf nodes; one kind of edge between nodes x and y means "y is called from x", the other means "y is data dependent on x".]

Parallelism discovery

  • Pipeline
  • Cross-correlation between two vectors to determine similarity
  • The vector of a program is derived from its CU graph
  • The vector of a parallel pattern is built using specific properties of the pattern

  • CorrCoef:
  • 0: pattern not detected
  • 1: pattern detected successfully
  • (0,1): pattern may exist but there are obstacles in implementing it

01.10.2014 29

CorrCoef(g, p) = (g · p) / (|g| |p|),  with CorrCoef ∈ [0, 1]
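
A minimal sketch of the correlation coefficient as a normalized cross-correlation (cosine similarity) between a graph vector g and a pattern vector p; the vectors below are invented, not derived from a real CU graph.

#include <cmath>
#include <cstdio>
#include <vector>

// Returns 1.0 when the two vectors point in exactly the same direction,
// and 0.0 when they are orthogonal (pattern not present).
static double corr_coef(const std::vector<double>& g, const std::vector<double>& p) {
    double dot = 0.0, gg = 0.0, pp = 0.0;
    for (std::size_t i = 0; i < g.size(); ++i) {
        dot += g[i] * p[i];
        gg  += g[i] * g[i];
        pp  += p[i] * p[i];
    }
    return (gg == 0.0 || pp == 0.0) ? 0.0 : dot / (std::sqrt(gg) * std::sqrt(pp));
}

int main() {
    std::vector<double> graph    = {1.0, 1.0, 0.5};  // invented graph vector
    std::vector<double> pipeline = {1.0, 1.0, 0.5};  // invented 3-stage pipeline vector
    std::printf("%f\n", corr_coef(graph, pipeline)); // prints 1.000000
    return 0;
}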


Parallelism discovery

  • Pipeline

for (i = 0; i < num_of_frames; ++i) {
    if (!pf.Update(i))
        return 0;
    pf.Estimate(estimate);
    WritePose(output, estimate);
    if (outputBMP)
        outputBMP(estimate);
}

01.10.2014 30

w_{j,k} = 1 − (k − j) / (#stages − 1)

The correlation coefficient between these graph and pipeline vectors is 1.

[Figure: the CU graph of the bodytrack loop (CUs for update, estimate, and output), the graph matrix and graph vector derived from it, and the 3-stage pipeline matrix and pipeline vector with weight values; mandatory chain dependences are marked 1 and forward dependences carry weights w_{j,k}.]

Pipeline Matrix
  • According to the dimensions of the graph matrix, we create a pipeline pattern matrix of size 3x3.
  • Each row or column of this matrix represents a stage in the pipeline, and the entries of the matrix represent dependences between them. The entries of the pipeline matrix have the following specific meaning:
  • An "x" means don't care; either 1 or 0 can be in its place. These dependences do not affect pipeline creation.
  • A 1 indicates a mandatory dependence. The 1 entries together represent the chain of dependences along the stages of the pipeline. We call them the chain dependences.
  • The w_{j,k} indicate forward dependences in the pipeline. A forward dependence exists if a stage Sj of pipeline iteration i depends on the result of a stage Sk of the previous iteration i − 1, with j < k. Hence, a forward dependence adversely affects the execution of the pipeline, because an earlier stage of an iteration has to wait for the results of a later stage of the previous iteration.
  • #stages represents the total number of stages in the pipeline.
  • The weight decreases as the distance between two stages with forward dependences increases.
  • The 0 in the last column of the first row …
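
For comparison, a minimal sketch of how a per-frame loop like the one above could be written as a three-stage pipeline with oneTBB's parallel_pipeline (API as in oneTBB 2021; older TBB versions spell the filter modes differently). The Frame type and stage bodies are invented placeholders, not the actual bodytrack code.

#include <tbb/parallel_pipeline.h>
#include <cstdio>

struct Frame { int id; double estimate; };    // invented placeholder type

int main() {
    const int num_of_frames = 8;
    int next = 0;

    tbb::parallel_pipeline(
        4,                                    // maximum number of frames in flight
        // Stage 1 (serial, in order): produce the next frame.
        tbb::make_filter<void, Frame>(tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> Frame {
                if (next >= num_of_frames) { fc.stop(); return Frame{}; }
                return Frame{next++, 0.0};
            }) &
        // Stage 2 (parallel): the compute-heavy update/estimate step.
        tbb::make_filter<Frame, Frame>(tbb::filter_mode::parallel,
            [](Frame f) {
                f.estimate = f.id * 0.5;      // stand-in for Update()/Estimate()
                return f;
            }) &
        // Stage 3 (serial, in order): write results in frame order.
        tbb::make_filter<Frame, void>(tbb::filter_mode::serial_in_order,
            [](Frame f) {
                std::printf("frame %d estimate %f\n", f.id, f.estimate);
            }));
    return 0;
}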
Outline

  • Background
  • Approach
  • Dependence profiling
  • Computational Unit and CU graph
  • Parallelism discovery
  • Results

01.10.2014 31

Evaluation

  • DOALL loops
  • NAS Parallel Benchmarks
  • Parsec
  • real world kernels and applications
  • Tasking
  • Starbench
  • Parsec
  • Pipeline
  • Parsec
  • libVorbis

01.10.2014 32


Results

  • DOALL loops in NAS Parallel Benchmarks

01.10.2014 33

Program | # loops | # OMP | # identified
BT      | 184     | 30    | 30
SP      | 252     | 34    | 34
LU      | 173     | 33    | 33
IS      | 25      | 11    | 8
EP      | 10      | 1     | 1
CG      | 32      | 16    | 9
MG      | 74      | 14    | 14
FT      | 37      | 8     | 7
Overall | 787     | 147   | 136

NAS parallel benchmarks.

Results

  • Precise loop parallelism detection
  • detected 92.5% of the DOALL loops from NPB
  • Result ranking
  • covered 65.3% of the parallelized loops from NPB in top 30%

01.10.2014 34

Results

Benchmark         | LOC | Input size          | # Suggestions | # Adopted | Seq. time (s) | Par. time (s) | Speedup (4T)
histogram         | 102 | 50M numbers         | 5             | 1         | 0.36          | 0.098         | 3.67
mandelbrot        | 521 | 1024 x 1024 matrix  | 2             | 2         | 46.02         | 22.73 (11.61) | 2.02 (3.96)
light propagation | 74  | 500k random points  | 1             | 1         | 5.67          | 2.33          | 2.43
ANN training      | 107 | 50 x 500 x 4 matrix | 10            | 2         | 5.11          | 1.66          | 3.07

01.10.2014 35

Speedup achieved when adopting DOALL loop suggestions

Results

Benchmark     | # Suggestions | Location parallelized in parallel implementation | Matching suggestion     | # Iter. | Size | Effort
blackscholes  | 2             | blackscholes.c: 238                              | blackscholes.c: 238     | 400     | 20   | Low
streamcluster | 16            | streamcluster.cpp: 1723                          | streamcluster.cpp: 1714 | 5       | 8    | Medium
gzip 1.3.5    | 43            | pigz.c: 1478                                     | gzip.c: 1595            | 284     | 101  | High
bzip2 1.0.2   | 62            | bzip2smp.c: 81                                   | bzip2.c: 3793           | 104     | 34   | High

01.10.2014 36

Comparison with existing parallel implementations for DOALL loops

Results

  • Tasking suggestions compared to existing parallel implementations

01.10.2014 37

Program | Function            | % exec. time | Match in parallel version | # CUs
c-ray   | render_scanlines()  | 100.0        | yes                       | 4
k-means | cluster()           | 99.6         | yes                       | 3
md5     | process()           | 93.5         | yes                       | 7
rotate  | RotateEngine::run() | 90.3         | yes                       | 6
rgbyuv  | processImage()      | 100.0        | yes                       | 7
ray-rot | render_scanlines()  | 97.2         | yes                       | 10
rot-cc  | RotateEngine::run() | 54.7         | yes                       | 13


Results

  • Speedup achieved when adopting tasking suggestions

01.10.2014 38

Program      | Function                 | Refactoring | # threads | Local speedup
Fluidanimate | RebuildGrid()            | yes         | 2         | 1.60
Fluidanimate | ProcessCollisions()      | no          | 4         | 1.81
Canneal      | routing_cost_given_loc() | yes         | 2         | 1.32
Blackscholes | CNDF()                   | no          | 2         | 0.98
FFT          | fft_unshaffle_32, etc.   | no          | 4         | 3.01

PARSEC benchmark suite.

Results

  • Pipeline discovery in Parsec benchmarks and libVorbis

01.10.2014 39

Program      | # in parallel version | Corr. coef. | Detected | Speedup
bodytrack    | 1                     | 0.96        | 1        | N.A.
dedup        | 1                     | 1.00        | 1        | N.A.
ferret       | 1                     | 1.00        | 1        | N.A.
blackscholes | -                     | 0.00        | -        | N.A.
fluidanimate | -                     | 0.94        | 1        | 1.52 (3T)
libVorbis    | N/A                   | 1.00        | 1        | 3.62 (4T)


Results

  • Overhead

01.10.2014 40

Time overhead for NAS and Starbench.

[Bar chart: slowdown (x) per benchmark for the serial, 8T lock-based, 8T lock-free, and 16T lock-free profiler configurations; the axis spans 50x-350x, with individual bars reaching 424x and 428x.]

Our serial profiler has a 190x slowdown on average for NAS benchmarks and a 191x slowdown on average for Starbench programs. The overhead is not surprising, since we perform an exhaustive profiling of the whole program. When using 8 threads, our parallel profiler gives a 97x slowdown on average for NAS benchmarks and a 101x slowdown on average for Starbench programs. After increasing the number of threads to 16, the average slowdown is only 78x for NAS benchmarks and 93x for Starbench programs. Compared to the serial profiler, our parallel profiler achieves a 2.4x and a 2.1x speedup using 16 threads on the NAS and Starbench benchmark suites, respectively.


Results

  • Overhead

01.10.2014 41

Memory overhead for NAS and Starbench.

[Bar chart: memory consumption (MB) per benchmark for the native run and the 8T and 16T lock-free profiler configurations; the axis spans 256-1536 MB, with individual bars at 1589, 7856, and 1681 MB.]

We measure memory consumption using the "maximum resident set size" value provided by /usr/bin/time with the verbose (-v) option. When using 8 threads, our profiler consumes 473 MB of memory on average for NAS benchmarks and 505 MB on average for Starbench programs. With 16 threads, the average memory consumption increases to 649 MB and 1390 MB for NAS and Starbench programs, respectively.

Conclusion

  • A general concept that allows arbitrary code sections that can run concurrently with each other to be identified.
  • Useful suggestions are given; after parallelizing sequential programs by adopting them, significant speedups can be gained.
  • Suggestions for well-known open-source software are comparable with their existing parallel implementations.
  • Practical overhead in both time and space.

01.10.2014 42

Latest Progress

  • Task parallelism detection
  • not limited to predefined language constructs
  • covers independent tasks and pipeline parallelism

01.10.2014 43

[Figure: task-parallelism detection output showing nested control regions, e.g. a function spanning lines 365-381 marked "Parallelizable: true", together with loops and if-else blocks marked parallelizable or not, CUs, INIT nodes, and RAW dependences; blue, yellow, and grey distinguish CUs and control regions.]

Latest Progress

  • Results after utilizing found task parallelism

01.10.2014 44

Program      | Function              | % of time | Para. plan   | # threads | Local speedup | Overall speedup
fluidanimate | RebuildGrid           | 9.8       | Indep. tasks | 2         | 1.69          | 1.04
IS           | main                  | 100.0     | Indep. tasks |           |               |
FFT          | fft_unshaffle_32, etc | 94.8      | Indep. tasks | 4         | 3.01          | 2.67
fluidanimate | ComputeForces         | 91.2      | Pipeline     | 3         | 1.67          | 1.52
bodytrack    | mainSingleThread      | 100.0     | Pipeline     | 3         | 1.17          | 1.17
LibVorbis    | main (encoder)        | 100.0     | Pipeline     | 4         | 3.62          | 3.62

Performance

  • Parallelize the analysis to lower the overhead further

programs from NAS, input size = W, 4 threads

01.10.2014 45

Program | Serial (s) | Parallel (s) | Speedup (x)
BT      | 4475.19    | 1138.52      | 3.93
CG      | 477.1      | 163.25       | 2.92
MG      | 423.03     | 132.68       | 3.18

Performance

  • Slowdown when profiling multithreaded programs

01.10.2014 46

[Bar chart: slowdown (×) when profiling the multithreaded Starbench programs (c-ray, kMeans, md5, ray-rot, rgbyuv, rotate, rot-cc, streamcluster, tinyjpeg, bodytrack, h264dec, average) with 8 and 16 profiler threads, each program running 4 threads; the axis spans 200-1000.]

Performance

  • Memory consumption when profiling multithreaded programs

01.10.2014 47

[Bar chart: memory consumption (MB) when profiling the same multithreaded programs natively and with 8 and 16 profiler threads, each program running 4 threads; the axis spans 512-2560 MB.]

Code transformation

  • Automatic serial-to-parallel code transformation
  • based on data dependencies
  • using TBB
  • loops first, then flow graph and pipeline

01.10.2014 48
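
As a hedged sketch of the transformation target mentioned above, here is a serial DOALL loop next to a possible tbb::parallel_for version; the loop body is invented for illustration.

#include <tbb/parallel_for.h>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1000;
    std::vector<double> a(n, 2.0), b(n, 3.0), c(n, 0.0);

    // Serial DOALL loop:
    //     for (std::size_t i = 0; i < n; ++i) c[i] = a[i] * b[i];
    // A possible TBB transformation of the same loop:
    tbb::parallel_for(std::size_t(0), n, [&](std::size_t i) {
        c[i] = a[i] * b[i];                   // every iteration is independent
    });

    std::printf("%f\n", c[n - 1]);
    return 0;
}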