DiscoPoP: A Profiling Tool to Identify Parallelization Opportunities
Zhen Li, Rohit Atre, Zia Ul-Huda, Ali Jannesari, and Felix Wolf 02.10.2014
Outline
- Background
- Approach
- Results
Background
- Multicore CPUs dominate the desktop and server markets, but writing programs that exploit the available hardware parallelism on these architectures remains a challenge.
- Today, software development is mostly the transformation of programs
written by someone else rather than starting from scratch. [1]
- Parallelizing legacy sequential programs presents a huge economic
challenge.
- Appropriate tool support is required.
[1] R. E. Johnson. Software development is program transformation. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, FoSER ’10, pages 177-180.
Related work
- Dynamic approaches
Kremlin – a “gprof for the parallel age”
- “available parallelism”
- targets OpenMP-style loops
- based on critical path analysis
Alchemist
- number of instructions / number of dependencies
- control regions
- counts data dependencies
- Previous dynamic approaches usually do not reveal the root causes that prevent parallelization, since profiling detailed data dependences would make the overhead too high.
Our purpose: full data-dependence analysis.
Related work
- Static approaches
Cetus
- compiler infrastructure for source-to-source transformation
- framework for writing automatic parallelization tools
Parallware, Par4All, Polly, PLUTO, …
- loop parallelism
- automatic parallel code generation
- mainly for scientific computing kernels
- Previous static approaches mainly focus on loop parallelism in the scientific computing area, since static dependence analysis is conservative and such kernels have more regular access patterns.
Our goal
- Discover potential parallelism in sequential programs
- Target parallelism:
- DOALL loops
- Pipeline
- Tasking
- Reveal specific data dependences that prevent parallelization
- Efficient in time and space
DOALL loops are loops without loop-carried dependences, i.e., dependences between two iterations. Parallelizing DOALL loops is usually trivial but leads to an obvious speedup. We definitely want to cover such parallelism.
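For illustration (not from the slides), a minimal pair of loops showing the difference:

#include <cstddef>
#include <vector>

// Illustrative only. The first loop is DOALL: iterations touch
// disjoint elements, so there is no loop-carried dependence. The
// second is not: iteration i reads s[i-1], a loop-carried RAW
// dependence.
void doallExample(const std::vector<int>& a, const std::vector<int>& b,
                  std::vector<int>& c, std::vector<int>& s) {
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = a[i] + b[i];
    for (std::size_t i = 1; i < s.size(); ++i)
        s[i] = s[i - 1] + a[i];
}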
Outline
- Background
- Approach
- Results
Approach
- Work flow
[Figure: DiscoPoP workflow. Phase 1 (static): the source code is converted to IR, then memory-access and control-flow instrumentation and static control-flow analysis are applied. Phase 2 (dynamic): the instrumented program is executed, producing a dependency graph and control-region information, which feed parallelism discovery and ranking and yield ranked parallel opportunities.]
The workflow of DiscoPoP is divided into two phases. In the first phase, we instrument the target program and execute it; control-flow information and data dependences are obtained in this phase. In the second phase, we build computational units (CUs) for the target program and search for potential parallelism based on the CUs and the dependences among them. The output is a list of parallelization opportunities, consisting of several code sections that may run in parallel. These opportunities are also ranked, allowing users to focus on the most interesting ones.
Approach
- Background
- Approach
- Dependence profiling
- Computational Unit and CU graph
- Parallelism discovery
- Results
Dependence profiling
- Detailed data dependences with control-flow information
1:60 BGN loop
1:60 NOM {RAW 1:60|i} {WAR 1:60|i} {INIT *}
1:63 NOM {RAW 1:59|temp1} {RAW 1:67|temp1}
1:64 NOM {RAW 1:60|i}
1:65 NOM {RAW 1:59|temp1} {RAW 1:67|temp1} {WAR 1:67|temp2} {INIT *}
1:66 NOM {RAW 1:59|temp1} {RAW 1:65|temp2} {RAW 1:67|temp1} {INIT *}
1:67 NOM {RAW 1:65|temp2} {WAR 1:66|temp1}
1:70 NOM {RAW 1:67|temp1} {INIT *}
1:74 NOM {RAW 1:41|block}
1:74 END loop 1200
A data dependence is represented as a triple <sink, type, source>. "Type" is the dependence type (RAW, WAR or WAW); a special type INIT represents the first write operation to a memory address. "Sink" and "source" are the source-code locations of the latter and the former memory access, respectively. "Sink" is further represented as a pair <fileID:lineID>, while source is represented as a triple <fileID:lineID|variableName>. Data dependences with the same sink are aggregated together. The keyword NOM (short for "NORMAL") indicates that the source line specified by the aggregated sink has no control-flow information; otherwise, BGN and END represent the entry and exit point of a control region, respectively. In this example, a loop starts at source line 1:60 and ends at source line 1:74. The number 1200 following END loop shows the actual number of iterations executed.
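To relate the format to code, here is a hypothetical two-line fragment and the entries the profiler might emit for it (the file ID 1 and the line numbers are assumed purely for illustration):

// Suppose this body sits at lines 11-12 of file 1:
void tiny() {
    int i = 0;      // line 11: first write to i -> 1:11 NOM {INIT *}
    int j = i + 1;  // line 12: read of i        -> 1:12 NOM {RAW 1:11|i}
    (void)j;        // silence the unused-variable warning
}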
Dependence profiling
- Support for multithreaded programs
4:58|2 NOM {WAR 4:77|2|iter}
4:59|2 NOM {WAR 4:71|2|z_real}
4:64|3 NOM {RAW 3:75|0|maxiter} {RAW 4:58|3|iter} {RAW 4:61|3|z_norm} {RAW 4:71|3|z_norm} {RAW 4:73|3|iter}
4:69|3 NOM {RAW 4:57|3|c_real} {RAW 4:66|3|z2_real} {WAR 4:67|3|z_real}
4:71|2 NOM {RAW 4:69|2|z_real} {RAW 4:70|2|z_imag} {WAR 4:64|2|z_norm}
4:80|1 NOM {WAW 4:80|1|green} {INIT *}
- Discover more parallelism in parallel programs
- Support other analyses, since the necessary information can be derived from the dependences
Dependences of a code section in Mandelbrot. Thread IDs are highlighted.
Dependence profiling
- Parallel implementation, efficient in both time and space
- Implemented based on LLVM1
- Instrumentation applied to IR
- Instrumentation library integrated in Compiler-RT
- Interface integrated in Clang
1 DiscoPoP on LLVM website: http://llvm.org/ProjectsWithLLVM/
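To make the instrumentation step more concrete, here is a minimal sketch of an IR-level pass that inserts calls to a runtime library before every load and store. This is not DiscoPoP's actual pass: the hook names (__dp_read, __dp_write) are invented, and the sketch assumes an older LLVM (roughly 9-14) with the legacy pass manager.

#include "llvm/IR/DebugInfoMetadata.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"

using namespace llvm;

namespace {
struct MemTracePass : public FunctionPass {
    static char ID;
    MemTracePass() : FunctionPass(ID) {}

    bool runOnFunction(Function &F) override {
        Module *M = F.getParent();
        LLVMContext &Ctx = M->getContext();
        // Runtime hooks: void __dp_read(void *addr, int line), and the
        // same signature for writes.
        FunctionCallee ReadHook = M->getOrInsertFunction(
            "__dp_read", Type::getVoidTy(Ctx), Type::getInt8PtrTy(Ctx),
            Type::getInt32Ty(Ctx));
        FunctionCallee WriteHook = M->getOrInsertFunction(
            "__dp_write", Type::getVoidTy(Ctx), Type::getInt8PtrTy(Ctx),
            Type::getInt32Ty(Ctx));

        bool Changed = false;
        for (BasicBlock &BB : F)
            for (Instruction &I : BB) {
                unsigned Line = 0; // source line, if debug info is present
                if (DILocation *Loc = I.getDebugLoc())
                    Line = Loc->getLine();
                Value *Addr = nullptr;
                FunctionCallee *Hook = nullptr;
                if (auto *LD = dyn_cast<LoadInst>(&I)) {
                    Addr = LD->getPointerOperand();
                    Hook = &ReadHook;
                } else if (auto *ST = dyn_cast<StoreInst>(&I)) {
                    Addr = ST->getPointerOperand();
                    Hook = &WriteHook;
                }
                if (Hook) {
                    IRBuilder<> B(&I); // insert the call right before the access
                    Value *Cast = B.CreatePointerCast(Addr, Type::getInt8PtrTy(Ctx));
                    B.CreateCall(*Hook, {Cast, B.getInt32(Line)});
                    Changed = true;
                }
            }
        return Changed;
    }
};
} // namespace

char MemTracePass::ID = 0;
static RegisterPass<MemTracePass> X("memtrace", "Instrument loads and stores");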
Approach
- Background
- Approach
- Dependence profiling
- Computational Unit and CU graph
- Parallelism discovery
- Results
Computational Unit (CU)
- A collection of instructions
- Follows the read-compute-write pattern: a program state is first read
from memory, the new state is computed, and finally written back
- A small piece of code containing no parallelism or only ILP
- Building blocks of parallel tasks
Advantage: instructions in a CU do not need to be contiguous, and a chain of CUs can cross control regions. This means a potential task is not limited to a predefined construct.
Computational Unit (CU)
//Region 0; Depth 0
void netlist::get_random_pair(netlist_elem** a, netlist_elem** b, Rng* rng) {
    //get a random element
    long id_a = rng->rand(_chip_size);
    netlist_elem* elem_a = &(_elements[id_a]);
    //now do the same for b
    long id_b = rng->rand(_chip_size);
    netlist_elem* elem_b = &(_elements[id_b]);
    //Region 1; Depth 1
    while (id_b == id_a) {
        id_b = rng->rand(_chip_size);
        elem_b = &(_elements[id_b]);
    }
    *a = elem_a;
    *b = elem_b;
    return;
}
Function netlist::get_random_pair() of Canneal, one of the benchmarks from the PARSEC benchmark suite.
Computational Unit (CU)
The two computations mentioned above follow a basic rule: a variable or a group of variables is read and then used to perform another calculation, and the final state is written to another variable as a store operation. Hence, these computations follow a read-compute-write pattern. Such CUs form the building blocks of the tasks that can be created for exploiting parallelism in sequential programs.
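A hypothetical CU in this sense, with the three phases marked (illustrative only):

struct State { double sum; int count; };

// Read-compute-write: the fields of s are read, the new values are
// computed locally, and the final state is written back at the end.
void addSample(State& s, double x) {
    double sum = s.sum;       // read program state
    int n = s.count;          // read program state
    double newSum = sum + x;  // compute new state
    s.sum = newSum;           // write back
    s.count = n + 1;          // write back
}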
CU graph
- Two CUs can share common instructions (blue edges): do the two CUs refer to the same code section?
- A CU can depend on another via data dependences (red edges): do the two CUs depend tightly on each other?
- Should two CUs be merged?
[Figure: CU graph with nodes 53-59; blue edge labels give the number of common instructions, red edge labels the number of dependences.]
Approach
- Background
- Approach
- Dependence profiling
- Computational Unit and CU graph
- Parallelism discovery
- Results
Parallelism discovery
- DOALL loops
- Looking for loop-carried dependences

loopA {
    ……
    loopB {
        ……
    }
    ……
}

[Animated figure, four steps: the same loop nest with different data dependences highlighted in each step; depending on which dependences are loop-carried, the verdict changes, e.g. loopA: no, loopB: yes, or loopA: yes, loopB: no.]
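As a concrete (hypothetical) instance of the scenario above, an outer loop that carries a dependence across its iterations while the inner loop does not:

#include <cstddef>
#include <vector>

// loopA is not DOALL: row i reads row i-1, a RAW dependence carried
// across loopA's iterations. loopB is DOALL: for a fixed i, its
// iterations touch disjoint columns j.
void nestExample(std::vector<std::vector<double>>& m) {
    for (std::size_t i = 1; i < m.size(); ++i)        // loopA: no
        for (std::size_t j = 0; j < m[i].size(); ++j) // loopB: yes
            m[i][j] = 0.5 * (m[i - 1][j] + m[i][j]);
}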
Parallelism discovery
#pragma omp parallel for private(i, price, priceDelta)
for (i = 0; i < numOptions; i++) {
    /* Calling main function to calculate option value based on
     * Black & Scholes's equation. */
    price = BlkSchlsEqEuroNoDiv(sptprice[i], strike[i], rate[i],
                                volatility[i], otime[i], otype[i], 0);
    prices[i] = price;
#ifdef ERR_CHK
    priceDelta = data[i].DGrefval - price;
    if (fabs(priceDelta) >= 1e-4) {
        printf("Error on %d. Computed=%.5f, Ref=%.5f, Delta=%.5f\n",
               i, price, data[i].DGrefval, priceDelta);
        numError++;
    }
#endif
}

The main loop of Parsec.blackscholes.
Parallelism discovery
- Tasking
[Figure: CUs A-I are grouped step by step into SCCs and chains, e.g. an SCC formed by C, D, E and a chain formed by F, G, H.]
After the process of forming chains and SCCs, we can suggest some task parallelism between independent chains and SCCs, that is, those without RAW dependences between them. Note that a chain of CUs may start and end anywhere in the program, without the limitation of predefined constructs, and the code in a chain of CUs does not need to be contiguous.
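The SCCs themselves can be found with a textbook algorithm. Below is a self-contained sketch using Tarjan's algorithm on an adjacency-list CU graph (illustrative only; not DiscoPoP's actual implementation):

#include <stack>
#include <vector>

// Tarjan's algorithm over a CU dependence graph given as an adjacency
// list; comp[v] receives the SCC id of CU v.
struct SCCFinder {
    const std::vector<std::vector<int>>& adj;
    std::vector<int> index, low, comp;
    std::vector<bool> onStack;
    std::stack<int> stk;
    int counter = 0, nComp = 0;

    explicit SCCFinder(const std::vector<std::vector<int>>& g)
        : adj(g), index(g.size(), -1), low(g.size(), 0),
          comp(g.size(), -1), onStack(g.size(), false) {}

    void dfs(int v) {
        index[v] = low[v] = counter++;
        stk.push(v);
        onStack[v] = true;
        for (int w : adj[v]) {
            if (index[w] == -1) {          // tree edge: recurse
                dfs(w);
                if (low[w] < low[v]) low[v] = low[w];
            } else if (onStack[w]) {       // back edge within the stack
                if (index[w] < low[v]) low[v] = index[w];
            }
        }
        if (low[v] == index[v]) {          // v is the root of an SCC
            int w;
            do {
                w = stk.top();
                stk.pop();
                onStack[w] = false;
                comp[w] = nComp;
            } while (w != v);
            ++nComp;
        }
    }

    std::vector<int> run() {
        for (int v = 0; v < (int)adj.size(); ++v)
            if (index[v] == -1)
                dfs(v);
        return comp;
    }
};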
Parallelism discovery
- Tasking
[Figure: the CU graph from before, first annotated with affinity values (0.05-0.43) on its edges, then split into two weakly linked components by a minimum cut.]
However, some task parallelism can also be exploited with a small amount of refactoring effort, that is, when dependences between potential tasks exist but are weak. We cover this parallelism by applying a minimum cut to the CU graph. In the CU graph, a high weight on the edge between two vertices indicates that the two CUs either share a large amount of computation or are strongly dependent on one another. Using these two metrics, we calculate a value called affinity for every pair of CU nodes in the graph. The affinity between two CU nodes thus indicates how tightly coupled the two CUs are. A low affinity between two CUs signifies that it is logical to separate them when forming tasks. The next step is to calculate the minimum cut of a connected component using the Stoer-Wagner algorithm. In graph theory, a minimum cut is a set of edges that has the smallest number of edges (for an unweighted graph) or the smallest possible sum of weights (for a weighted graph). Identifying the minimum cut of a graph divides the graph into two components that were weakly linked.
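A sketch of how an affinity value could be derived from the two edge weights shown earlier; the normalization and the equal weighting of the two terms are assumptions for illustration, not DiscoPoP's actual formula:

// Affinity in [0, 1] from the two CU-graph edge weights: the number
// of shared instructions and the number of data dependences between
// two CUs. A low affinity marks a good place to cut the graph.
double affinity(int commonInstr, int deps,
                int maxCommonInstr, int maxDeps) {
    double shared  = maxCommonInstr > 0 ? double(commonInstr) / maxCommonInstr : 0.0;
    double coupled = maxDeps > 0 ? double(deps) / maxDeps : 0.0;
    return 0.5 * shared + 0.5 * coupled;
}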
Parallelism discovery
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            for (auto iter = _elem_names.begin(); iter != _elem_names.end(); ++iter) {
                netlist_elem* elem = iter->second;
                for (int i = 0; i < elem->fanin.size(); ++i) {
                    location_t* fanin_loc = elem->fanin[i]->present_loc.Get();
                    fanin_cost += fabs(elem->present_loc.Get()->x - fanin_loc->x);
                    fanin_cost += fabs(elem->present_loc.Get()->y - fanin_loc->y);
                }
            }
        }
An example of a parallelized code section in canneal, a kernel from the PARSEC benchmark suite (continued on the next slide).
Parallelism discovery
        #pragma omp section
        {
            for (auto iter = _elem_names.begin(); iter != _elem_names.end(); ++iter) {
                netlist_elem* elem = iter->second;
                for (int i = 0; i < elem->fanout.size(); ++i) {
                    location_t* fanout_loc = elem->fanout[i]->present_loc.Get();
                    fanout_cost += fabs(elem->present_loc.Get()->x - fanout_loc->x);
                    fanout_cost += fabs(elem->present_loc.Get()->y - fanout_loc->y);
                }
            }
        }
    }
}
Parallelism discovery
- Pipeline
- template matching
- input: a CU graph mapped onto the execution tree of the program
[Figure: execution tree with a root function, nested loops, and leaf nodes; one edge type means "y is called from x", the other means "y is data-dependent on x".]
Parallelism discovery
- Pipeline
- Cross-correlation between two vectors to determine similarity
- The vector of a program is derived from its CU graph
- The vector of a parallel pattern is built using specific properties of the
pattern
- CorrCoef:
- 0: pattern not detected
- 1: pattern detected successfully
- (0,1): pattern may exist but there are obstacles in implementing it
CorrCoef(g, p) = (g · p) / (|g| |p|), with CorrCoef(g, p) ∈ [0, 1]
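A direct implementation of such a normalized correlation (a sketch; in DiscoPoP, g is derived from the CU graph and p from the pattern template):

#include <cmath>
#include <cstddef>
#include <vector>

// Normalized correlation of two non-negative vectors; the result
// lies in [0, 1], matching the interpretation above.
double corrCoef(const std::vector<double>& g, const std::vector<double>& p) {
    double gp = 0.0, gg = 0.0, pp = 0.0;
    for (std::size_t i = 0; i < g.size() && i < p.size(); ++i) {
        gp += g[i] * p[i];
        gg += g[i] * g[i];
        pp += p[i] * p[i];
    }
    return (gg > 0.0 && pp > 0.0) ? gp / std::sqrt(gg * pp) : 0.0;
}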
Parallelism discovery
- Pipeline
for (i = 0; i < num_of_frames; ++i) {
    if (!pf.Update(i))
        return 0;
    pf.Estimate(estimate);
    WritePose(output, estimate);
    if (outputBMP)
        outputBMP(estimate);
}
w_{j,k} = 1 - (k - j) / (#stages - 1)
The correlation coefficient between this graph vector and the pipeline vector is 1.
[Figure: the CU graph of the bodytrack loop (CUs for update, estimate, and output) is converted into a graph matrix; a 3-stage pipeline matrix with entries 1 (chain dependences), x (don't care), and w_{j,k} (forward dependences) is built from it; both matrices are then flattened into a graph vector and a pipeline vector with weight values.]
Pipeline matrix:
- According to the dimensions of the graph matrix, we create a pipeline pattern matrix of size 3x3.
- Each row or column of this matrix represents a stage in the pipeline, and the entries of the matrix represent dependences between them. The entries of the pipeline matrix have the following specific meaning:
- An "x" means don't care; either 1 or 0 can be in its place. These dependences do not affect pipeline creation.
- A 1 indicates a mandatory dependence. The 1 entries together represent the chain of dependences along the stages of the pipeline. We call them the chain dependences.
- The w_{j,k} indicate forward dependences in the pipeline. A forward dependence exists if a stage S_j of pipeline iteration i depends on the result of a stage S_k of the previous iteration i - 1, with j < k. Hence, a forward dependence adversely affects the execution of the pipeline, because an earlier stage of an iteration has to wait for the results of a later stage of the previous iteration.
- #stages represents the total number of stages in the pipeline.
- The weight decreases as the distance between two stages with forward dependences increases.
- The 0 in the last column of the first row …
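Putting these rules together, a plausible reconstruction of the 3-stage pipeline matrix (the slide's own matrix did not survive extraction, so the exact placement of the entries is an assumption based on the notes above):

\[
P =
\begin{pmatrix}
x & w_{1,2} & 0 \\
1 & x & w_{2,3} \\
x & 1 & x
\end{pmatrix},
\qquad
w_{j,k} = 1 - \frac{k - j}{\#\mathrm{stages} - 1}
\]

With three stages, w_{1,2} = w_{2,3} = 1/2, while a forward dependence spanning the whole pipeline would receive weight 0, which would explain the 0 in the last column of the first row.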
Outline
- Background
- Approach
- Dependence profiling
- Computational Unit and CU graph
- Parallelism discovery
- Results
Evaluation
- DOALL loops
- NAS Parallel Benchmarks
- Parsec
- real-world kernels and applications
- Tasking
- Starbench
- Parsec
- Pipeline
- Parsec
- libVorbis
Results
- DOALL loops in NAS Parallel Benchmarks
Program | # loops | # OMP | # identified
BT      | 184     | 30    | 30
SP      | 252     | 34    | 34
LU      | 173     | 33    | 33
IS      | 25      | 11    | 8
EP      | 10      | 1     | 1
CG      | 32      | 16    | 9
MG      | 74      | 14    | 14
FT      | 37      | 8     | 7
Overall | 787     | 147   | 136
NAS Parallel Benchmarks.
Results
- Precise loop parallelism detection
- detected 92.5% of the DOALL loops from NPB
- Result ranking
- covered 65.3% of the parallelized loops from NPB in the top 30%
Results
Benchmark         | LOC | Input Size          | # Suggestions | # Adopted | Seq. Time (s) | Par. Time (s) | Speedup (4T)
histogram         | 102 | 50M numbers         | 5             | 1         | 0.36          | 0.098         | 3.67
mandelbrot        | 521 | 1024 x 1024 matrix  | 2             | 2         | 46.02         | 22.73 (11.61) | 2.02 (3.96)
light propagation | 74  | 500k random points  | 1             | 1         | 5.67          | 2.33          | 2.43
ANN training      | 107 | 50 x 500 x 4 matrix | 10            | 2         | 5.11          | 1.66          | 3.07
Speedup achieved when adopting DOALL loop suggestions
Results
Benchmark     | # Suggestions | Location Parallelized in Parallel Implementation | Matching Suggestion    | # Iter. | Size | Effort
blackscholes  | 2             | blackscholes.c:238                               | blackscholes.c:238     | 400     | 20   | Low
streamcluster | 16            | streamcluster.cpp:1723                           | streamcluster.cpp:1714 | 5       | 8    | Medium
gzip 1.3.5    | 43            | pigz.c:1478                                      | gzip.c:1595            | 284     | 101  | High
bzip2 1.0.2   | 62            | bzip2smp.c:81                                    | bzip2.c:3793           | 104     | 34   | High
Comparison with existing parallel implementations for DOALL loops
Results
- Tasking suggestions compared to existing parallel implementations
Program | Function            | % exec. time | Match in parallel version | # CUs
c-ray   | render_scanlines()  | 100.0        | yes                       | 4
k-means | cluster()           | 99.6         | yes                       | 3
md5     | process()           | 93.5         | yes                       | 7
rotate  | RotateEngine::run() | 90.3         | yes                       | 6
rgbyuv  | processImage()      | 100.0        | yes                       | 7
ray-rot | render_scanlines()  | 97.2         | yes                       | 10
rot-cc  | RotateEngine::run() | 54.7         | yes                       | 13
Results
- Speedup achieved when adopting tasking suggestions
Program      | Function                 | Refactoring | # threads | Local speedup
Fluidanimate | RebuildGrid()            | yes         | 2         | 1.60
Fluidanimate | ProcessCollisions()      | no          | 4         | 1.81
Canneal      | routing_cost_given_loc() | yes         | 2         | 1.32
Blackscholes | CNDF()                   | no          | 2         | 0.98
FFT          | fft_unshaffle_32, etc.   | no          | 4         | 3.01
PARSEC benchmark suite.
Results
- Pipeline discovery in Parsec benchmarks and libVorbis
Program      | # in parallel version | Corr. coef. | Detected | Speedup
bodytrack    | 1                     | 0.96        | 1        | N.A.
dedup        | 1                     | 1.00        | 1        | N.A.
ferret       | 1                     | 1.00        | 1        | N.A.
blackscholes |                       | 0.00        |          | N.A.
fluidanimate |                       | 0.94        | 1        | 1.52 (3T)
libVorbis    | N/A                   | 1.00        | 1        | 3.62 (4T)
Results
- Overhead
Time overhead for NAS and Starbench.
[Chart: slowdown (x) of the serial, 8T lock-based, 8T lock-free, and 16T lock-free profilers; axis up to 350, with outliers at 424 and 428.]
Our serial profiler has a 190x slowdown on average for the NAS benchmarks and a 191x slowdown on average for the Starbench programs. The overhead is not surprising, since we perform exhaustive profiling of the whole program. When using 8 threads, our parallel profiler shows a 97x slowdown on average for the NAS benchmarks and a 101x slowdown on average for the Starbench programs. After increasing the number of threads to 16, the average slowdown is only 78x for NAS and 93x for Starbench. Compared to the serial profiler, our parallel profiler achieves a 2.4x and a 2.1x speedup using 16 threads on the NAS and Starbench benchmark suites, respectively.
Results
- Overhead
Memory overhead for NAS and Starbench.
[Chart: memory consumption (MB) of native, 8T lock-free, and 16T lock-free runs; axis up to 1536, with outliers at 1589, 7856, and 1681.]
We measure memory consumption using the "maximum resident set size" value provided by /usr/bin/time with the verbose (-v) option. When using 8 threads, our profiler consumes 473 MB of memory on average for the NAS benchmarks and 505 MB on average for the Starbench programs. With 16 threads, the average memory consumption increases to 649 MB and 1390 MB for NAS and Starbench, respectively.
Conclusion
- A general concept that allows arbitrary code sections that can run concurrently with each other to be identified.
- Useful suggestions are given; parallelizing sequential programs by adopting them can yield significant speedups.
- Suggestions for well-known open-source software are comparable to their existing parallel implementations.
- Practical overhead in both time and space.
Latest Progress
- Task parallelism detection
- not limited to predefined language constructs
- covers independent tasks and pipeline parallelism
[Figure: detected task parallelism in an annotated execution tree: function 365-381 (parallelizable: true); loop 372-38… (parallelizable: false) with CU 374-379; if-else 667-678 (parallelizable: false); loop 682-7…9 (parallelizable: true); if-else 719 (parallelizable: false); RAW edges connect CUs; blue, yellow, and grey mark CUs and control regions.]
Latest Progress
- Results after exploiting the detected task parallelism
Program      | Function               | % of time | Para. plan   | # threads | Local speedup | Overall speedup
fluidanimate | RebuildGrid            | 9.8       | Indep. tasks | 2         | 1.69          | 1.04
IS           | main                   | 100.0     | Indep. tasks |           |               |
FFT          | fft_unshaffle_32, etc. | 94.8      | Indep. tasks | 4         | 3.01          | 2.67
fluidanimate | ComputeForces          | 91.2      | Pipeline     | 3         | 1.67          | 1.52
bodytrack    | mainSingleThread       | 100.0     | Pipeline     | 3         | 1.17          | 1.17
libVorbis    | main (encoder)         | 100.0     | Pipeline     | 4         | 3.62          | 3.62
Performance
- Parallelize the analysis to lower the overhead further
programs from NAS, input size = W, 4 threads
Program | Serial (s) | Parallel (s) | Speedup (x)
BT      | 4475.19    | 1138.52      | 3.93
CG      | 477.10     | 163.25       | 2.92
MG      | 423.03     | 132.68       | 3.18
Performance
- Slowdown when profiling multithreaded programs
[Chart: slowdown (x) when profiling multithreaded programs (c-ray, kMeans, md5, ray-rot, rgbyuv, rotate, rot-cc, streamcluster, tinyjpeg, bodytrack, h264dec, average); series 8T/4Tn and 16T/4Tn; axis up to 1000.]
Performance
- Memory consumption when profiling multithreaded programs
[Chart: memory consumption (MB) when profiling the same multithreaded programs; series Native/4Tn, 8T/4Tn, and 16T/4Tn; axis up to 2560.]
Code transformation
- Automatic serial-to-parallel code transformation
- based on data dependencies
- using TBB
- first loops, then flow graphs and pipelines
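As a sketch of what the loop transformation step could emit for a DOALL loop (illustrative only; assumes oneTBB is available, and the function name is made up):

#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// A DOALL loop rewritten with TBB: the iteration range is split into
// chunks that worker threads process independently.
void scale(std::vector<double>& v, double factor) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, v.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= factor;
        });
}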