

DiscoPoP: A Profiling Tool to Identify Parallelization Opportunities

Zhen Li, Rohit Atre, Zia Ul-Huda, Ali Jannesari, and Felix Wolf 02.10.2014

Outline

  • Background
  • Approach
  • Results

01.10.2014 2


Background

  • Multicore CPUs dominate the desktop and server market, but writing programs that exploit the available hardware parallelism on these architectures remains a challenge.
  • Today, software development is mostly the transformation of programs written by someone else rather than starting from scratch. [1]
  • Parallelizing legacy sequential programs presents a huge economic challenge.
  • Appropriate tool support is required.

[1] R. E. Johnson. Software development is program transformation. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, FoSER '10, pages 177-180.

01.10.2014 3


Related work

  • Dynamic approaches

Kremlin – "gprof for the parallel age"

  • “available parallelism”
  • targets OpenMP-style loops
  • based on critical path analysis

Alchemist

  • number of instructions / number of dependencies
  • control regions
  • counts data dependencies
  • Previous dynamic approaches usually do not reveal the root causes that prevent parallelization, as the profiling overhead is too high.

01.10.2014 4

Our purpose: full data dependence analysis.


Related work

  • Static approaches

Cetus

  • compiler infrastructure for source-to-source transformation
  • framework for writing automatic parallelization tools

ParallWare, Par4All, Polly, PLUTO, …

  • loop parallelism
  • automatic parallel code generation
  • mainly for scientific computing kernels
  • Previous static approaches mainly focus on loop parallelism in the scientific computing area, since static dependence analysis is conservative and such kernels have more regular access patterns.

01.10.2014 5


Our goal

  • Discover potential parallelism in sequential programs
  • Target parallelism:
  • DOALL loops
  • Pipeline
  • Tasking
  • Reveal specific data dependences that prevent parallelization
  • Efficient in time and space

01.10.2014 6

DOALL loops are loops without loop-carried dependences, i.e., dependences between two iterations. Parallelizing DOALL loops is usually trivial but leads to an obvious speedup. We definitely want to cover such parallelism.
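
As a minimal illustration (invented code, not taken from the slides), the following C++ fragment contrasts a loop with a loop-carried RAW dependence with a DOALL loop; the OpenMP pragma shows how trivially the latter can be parallelized.

#include <cstdio>

#define N 1000

int main() {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0 * i; }

    // Not DOALL: iteration i reads a[i-1], which the previous iteration
    // wrote, i.e., a loop-carried RAW dependence.
    for (int i = 1; i < N; ++i)
        a[i] = a[i - 1] + b[i];

    // DOALL: every iteration touches only its own elements, so the loop
    // can be parallelized directly.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        c[i] = a[i] * b[i];

    std::printf("%f %f\n", a[N - 1], c[N - 1]);
    return 0;
}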

Outline

  • Background
  • Approach
  • Results

01.10.2014 7


Approach

  • Work flow

01.10.2014 8

[Workflow diagram. Phase 1 (static and dynamic): the source code is converted to IR, instrumented for memory accesses and control flow, and executed; together with a static control-flow analysis this yields a dependency graph and control-region information. Phase 2: parallelism discovery and ranking turn these into ranked parallel opportunities.]

The workflow of DiscoPoP is divided into two phases: In the first phase, we instrument the target program and execute it. Control-flow information and data dependences are obtained in this phase. In the second phase, we build computational units (CUs) for the target program and search for potential parallelism based on the CUs and the dependences among them. The output is a list of parallelization opportunities, consisting of several code sections that may run in parallel. These opportunities are also ranked, to let users focus on the most interesting ones.

Approach

  • Background
  • Approach
  • Dependence profiling
  • Computational Unit and CU graph
  • Parallelism discovery
  • Results

01.10.2014 9


Dependence profiling

  • Detailed data dependences with control-flow information

1:60 BGN loop
1:60 NOM {RAW 1:60|i} {WAR 1:60|i} {INIT *}
1:63 NOM {RAW 1:59|temp1} {RAW 1:67|temp1}
1:64 NOM {RAW 1:60|i}
1:65 NOM {RAW 1:59|temp1} {RAW 1:67|temp1} {WAR 1:67|temp2} {INIT *}
1:66 NOM {RAW 1:59|temp1} {RAW 1:65|temp2} {RAW 1:67|temp1} {INIT *}
1:67 NOM {RAW 1:65|temp2} {WAR 1:66|temp1}
1:70 NOM {RAW 1:67|temp1} {INIT *}
1:74 NOM {RAW 1:41|block}
1:74 END loop 1200

01.10.2014 10

A data dependence is represented as a triple <sink, type, source>. "Type" is the dependence type (RAW, WAR or WAW). Note that a special type INIT represents the first write operation to a memory address. "Sink" and "source" are the source-code locations of the latter and the former memory access, respectively. "Sink" is further represented as a pair <fileID:lineID>, while "source" is represented as a triple <fileID:lineID|variableName>. Data dependences with the same "sink" are aggregated together. The keyword NOM (short for "NORMAL") indicates that the source line specified by the aggregated "sink" has no control-flow information. Otherwise, BGN and END represent the entry and exit point of a control region, respectively. In this example, a loop starts at source line 1:60 and ends at source line 1:74. The number 1200 following "END loop" shows the actual number of iterations executed.
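
As a hedged illustration of the dependence types named above (this is invented code, not the program that produced the listing), the comments in the following fragment mark the dependences a profiler of this kind would record.

#include <cstdio>

void accumulate(int* buf, int n) {
    int temp = 0;                // INIT: first write to temp
    for (int i = 0; i < n; ++i) {
        temp = temp + buf[i];    // RAW on temp: reads the value written in the
                                 // previous iteration (or at initialization)
        buf[i] = temp;           // WAR on buf[i]: written after being read above
        buf[i] = buf[i] & 0xff;  // WAW on buf[i]: a second write to the same address
    }
}

int main() {
    int data[4] = {1, 2, 3, 4};
    accumulate(data, 4);
    std::printf("%d %d %d %d\n", data[0], data[1], data[2], data[3]);
    return 0;
}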


Dependence profiling

  • Support for multithreaded programs

4:58|2 NOM {WAR 4:77|2|iter}
4:59|2 NOM {WAR 4:71|2|z_real}
4:64|3 NOM {RAW 3:75|0|maxiter} {RAW 4:58|3|iter} {RAW 4:61|3|z_norm} {RAW 4:71|3|z_norm} {RAW 4:73|3|iter}
4:69|3 NOM {RAW 4:57|3|c_real} {RAW 4:66|3|z2_real} {WAR 4:67|3|z_real}
4:71|2 NOM {RAW 4:69|2|z_real} {RAW 4:70|2|z_imag} {WAR 4:64|2|z_norm}
4:80|1 NOM {WAW 4:80|1|green} {INIT *}

  • Discover more parallelism in parallel programs
  • Support other analyses whose necessary information can be derived from dependences

01.10.2014 11

Dependences of a code section in Mandelbrot. Thread IDs are highlighted.

Dependence profiling

  • Parallel implementation, efficient in both time and space
  • Implemented based on LLVM1
  • Instrumentation applied to IR
  • Instrumentation library integrated in Compiler-RT
  • Interface integrated in Clang

1 DiscoPoP on LLVM website: http://llvm.org/ProjectsWithLLVM/

01.10.2014 12
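
The slides state only that the instrumentation is applied to the IR. As a rough source-level sketch of what memory-access instrumentation amounts to, the following uses invented hook names (dp_read, dp_write); they are not DiscoPoP's actual runtime interface.

#include <cstdio>

// Invented stand-ins for the profiling runtime: they would record
// (address, source line) pairs from which data dependences are derived.
static void dp_read(const void* addr, int line)  { std::printf("R %p line %d\n", addr, line); }
static void dp_write(const void* addr, int line) { std::printf("W %p line %d\n", addr, line); }

int sum(const int* a, int n) {
    int s = 0;
    dp_write(&s, __LINE__);           // store to s
    for (int i = 0; i < n; ++i) {
        dp_read(&a[i], __LINE__);     // load of a[i]
        dp_read(&s, __LINE__);        // load of s
        s = s + a[i];
        dp_write(&s, __LINE__);       // store to s
    }
    return s;
}

int main() {
    int a[4] = {1, 2, 3, 4};
    std::printf("sum = %d\n", sum(a, 4));
    return 0;
}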

Approach

  • Background
  • Approach
  • Dependence profiling
  • Computational Unit and CU graph
  • Parallelism discovery
  • Results

01.10.2014 13


Computational Unit (CU)

  • A collection of instructions
  • Follows the read-compute-write pattern: a program state is first read from memory, the new state is computed, and finally written back

  • A small piece of code containing no parallelism or only ILP
  • Building blocks of parallel tasks

01.10.2014 14

Advantage: instructions in a CU do not need to be contiguous, and a chain of CUs can cross control regions. This means a potential task is not limited to a predefined construct.
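
A small invented fragment annotated with the read-compute-write phases (for illustration only, not output of a DiscoPoP analysis):

#include <cstdio>

// One CU-sized computation: read program state, compute a new state,
// and write it back to memory.
static void update_norm(const double* coords, double* norms, int idx, double scale) {
    double x = coords[2 * idx];             // read: program state is loaded
    double y = coords[2 * idx + 1];         // read
    double norm = (x * x + y * y) * scale;  // compute: the new state is derived
    norms[idx] = norm;                      // write: the new state is stored back
}

int main() {
    double coords[4] = {3.0, 4.0, 1.0, 2.0};
    double norms[2] = {0.0, 0.0};
    update_norm(coords, norms, 0, 1.0);
    update_norm(coords, norms, 1, 1.0);
    std::printf("%f %f\n", norms[0], norms[1]);
    return 0;
}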


Computational Unit (CU)

//Region 0; Depth 0
void netlist::get_random_pair(netlist_elem** a, netlist_elem** b, Rng* rng) {
    //get a random element
    long id_a = rng->rand(_chip_size);
    netlist_elem* elem_a = &(_elements[id_a]);
    //now do the same for b
    long id_b = rng->rand(_chip_size);
    netlist_elem* elem_b = &(_elements[id_b]);
    //Region 1; Depth 1;
    while (id_b == id_a) {
        id_b = rng->rand(_chip_size);
        elem_b = &(_elements[id_b]);
    }
    *a = elem_a;
    *b = elem_b;
    return;
}

01.10.2014 15

Function netlist::get_random_pair() of Canneal, one of the benchmarks from the PARSEC benchmark suite.


Computational Unit (CU)

01.10.2014 16

The two computations mentioned above follow a basic rule: a variable or a group of variables is read, and then used to perform another calculation. This is followed by the final state being written to another variable as a store operation. Hence, these two computations can be said to follow a read-compute-write pattern. Such CUs form the building blocks of the tasks that can be created for exploiting parallelism in sequential programs.

CU graph

  • Two CUs can share common instructions → blue edges
    → Do the two CUs refer to the same code section?
  • A CU can depend on another via a data dependence → red edges
    → Do the two CUs tightly depend on each other?
    → Should the two CUs be merged?

01.10.2014 17

[Figure: CU graph with nodes 53-59; blue edge labels give the number of common instructions, red edge labels the number of dependences.]
Approach

  • Background
  • Approach
  • Dependence profiling
  • Computational Unit and CU graph
  • Parallelism discovery
  • Results

01.10.2014 18

Parallelism discovery

  • DOALL loops
  • Looking for loop-carried dependences

loopA {
  ……
  ……
  loopB {
    ……
    ……
    ……
  }
  ……
  ……
}

01.10.2014 19

loopA: no, loopB: yes

Parallelism discovery

  • DOALL loops
  • Looking for loop-carried dependences

loopA {
  ……
  ……
  loopB {
    ……
    ……
    ……
  }
  ……
  ……
}

01.10.2014 20

loopA: yes, loopB: no

Parallelism discovery

  • DOALL loops
  • Looking for loop-carried dependences

loopA {
  ……
  ……
  loopB {
    ……
    ……
    ……
  }
  ……
  ……
}

01.10.2014 21

loopA: no, loopB: yes

Parallelism discovery

  • DOALL loops
  • Looking for loop-carried dependences

loopA {
  ……
  ……
  loopB {
    ……
    ……
    ……
  }
  ……
  ……
}

01.10.2014 22

loopA: no, loopB: yes

Parallelism discovery

#pragma omp parallel for private(i, price, priceDelta)
for (i = 0; i < numOptions; i++) {
    /* Calling main function to calculate option value
       based on Black & Scholes's equation. */
    price = BlkSchlsEqEuroNoDiv(sptprice[i], strike[i], rate[i],
                                volatility[i], otime[i], otype[i], 0);
    prices[i] = price;
#ifdef ERR_CHK
    priceDelta = data[i].DGrefval - price;
    if (fabs(priceDelta) >= 1e-4) {
        printf("Error on %d. Computed=%.5f, Ref=%.5f, Delta=%.5f\n",
               i, price, data[i].DGrefval, priceDelta);
        numError++;
    }
#endif
}

The main loop of Parsec.blackscholes.

01.10.2014 23


Parallelism discovery

  • Tasking

01.10.2014 24

[Figure: CUs A-I are first grouped into strongly connected components (SCCs) and then into chains of CUs.]

After the process of forming chains and SCCs, we can suggest some task parallelism between independent chains and SCCs, that is, those without RAW dependences between them. Note that a chain of CUs may start and end anywhere in the program, without the limitation of predefined constructs, and the code in a chain of CUs does not need to be contiguous.
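
A minimal sketch (invented code; OpenMP tasks are used here only for illustration) of two CU chains that touch disjoint data, have no RAW dependence between them, and can therefore run as independent tasks:

#include <cstdio>

static long chain_a(const long* a, int n) {   // first "CU chain"
    long s = 0;
    for (int i = 0; i < n; ++i) s += a[i];
    return s;
}

static long chain_b(const long* b, int n) {   // second "CU chain", disjoint data
    long x = 0;
    for (int i = 0; i < n; ++i) x ^= b[i];
    return x;
}

int main() {
    static long a[1000], b[1000];
    long sum = 0, mix = 0;
    for (int i = 0; i < 1000; ++i) { a[i] = i; b[i] = 3 * i; }

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(sum)
        sum = chain_a(a, 1000);

        #pragma omp task shared(mix)   // no RAW dependence on chain_a's data
        mix = chain_b(b, 1000);

        #pragma omp taskwait
    }
    std::printf("%ld %ld\n", sum, mix);
    return 0;
}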


Parallelism discovery

  • Tasking

01.10.2014 25

[Figure: the CU graph (nodes 53-59), labeled with the numbers of common instructions and dependences, is turned into an affinity graph with edge weights such as 0.20, 0.28, 0.43; a minimum cut of the affinity graph then separates weakly coupled groups of CUs.]

However, some task parallelism can also be exploited with a small amount of refactoring effort, that is, when dependences between potential tasks exist but are weak. We cover this parallelism by applying a minimum cut to the CU graph. In the CU graph, a high edge weight between two vertices indicates that the two CUs either share a large amount of computation or are strongly dependent on one another. Using these two metrics, we calculate a value called affinity for every pair of CU nodes in the graph. The affinity between two CU nodes indicates how tightly coupled they are; a low affinity signifies that it is reasonable to separate the two CUs when forming tasks. The next step is to calculate the minimum cut of a connected component using the Stoer-Wagner algorithm. In graph theory, a minimum cut is a set of edges with the smallest number of edges (for an unweighted graph) or the smallest possible sum of weights (for a weighted graph). Identifying the minimum cut of a graph divides it into two components that were only weakly linked.
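
The note above names the Stoer-Wagner algorithm; the following is a minimal, self-contained C++ sketch of a global minimum cut on a small weighted affinity graph. The graph and its weights are invented for illustration and are not DiscoPoP's actual data structures.

#include <algorithm>
#include <iostream>
#include <limits>
#include <numeric>
#include <vector>

// Stoer-Wagner global minimum cut for an undirected graph given as a
// symmetric weight matrix; returns the weight of the lightest cut.
double stoer_wagner_min_cut(std::vector<std::vector<double>> w) {
    std::vector<int> v(w.size());              // indices of surviving super-vertices
    std::iota(v.begin(), v.end(), 0);
    double best = std::numeric_limits<double>::infinity();

    while (v.size() > 1) {
        int m = static_cast<int>(v.size());
        std::vector<double> sum(m, 0.0);       // connectivity to the grown set
        std::vector<bool> added(m, false);
        int prev = 0, last = 0;
        for (int i = 0; i < m; ++i) {
            int sel = -1;                      // most tightly connected vertex
            for (int j = 0; j < m; ++j)
                if (!added[j] && (sel == -1 || sum[j] > sum[sel])) sel = j;
            prev = last;
            last = sel;
            added[sel] = true;
            for (int j = 0; j < m; ++j)
                if (!added[j]) sum[j] += w[v[sel]][v[j]];
        }
        best = std::min(best, sum[last]);      // "cut of the phase"
        for (int j = 0; j < m; ++j) {          // merge the last two added vertices
            w[v[prev]][v[j]] += w[v[last]][v[j]];
            w[v[j]][v[prev]] += w[v[j]][v[last]];
        }
        v.erase(v.begin() + last);
    }
    return best;
}

int main() {
    // Invented affinity values for four CUs; CUs 0-1 and 2-3 are tightly
    // coupled, and the two pairs are only weakly linked.
    std::vector<std::vector<double>> affinity = {
        {0.00, 0.35, 0.05, 0.00},
        {0.35, 0.00, 0.10, 0.05},
        {0.05, 0.10, 0.00, 0.43},
        {0.00, 0.05, 0.43, 0.00},
    };
    std::cout << "min cut weight: " << stoer_wagner_min_cut(affinity) << "\n";  // 0.2
    return 0;
}

For these invented weights, the minimum cut separates CUs {0, 1} from CUs {2, 3} at a total weight of 0.2, which is exactly the kind of "weakly linked" split the note describes.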


Parallelism discovery

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            for (auto iter = _elem_names.begin(); iter != _elem_names.end(); ++iter) {
                netlist_elem* elem = iter->second;
                for (int i = 0; i < elem->fanin.size(); ++i) {
                    location_t* fanin_loc = elem->fanin[i]->present_loc.Get();
                    fanin_cost += fabs(elem->present_loc.Get()->x - fanin_loc->x);
                    fanin_cost += fabs(elem->present_loc.Get()->y - fanin_loc->y);
                }
            }
        }

01.10.2014 26

An example of a parallelized code section in canneal, a kernel from the Parsec benchmark suite.

Parallelism discovery

        #pragma omp section
        {
            for (auto iter = _elem_names.begin(); iter != _elem_names.end(); ++iter) {
                netlist_elem* elem = iter->second;
                for (int i = 0; i < elem->fanout.size(); ++i) {
                    location_t* fanout_loc = elem->fanout[i]->present_loc.Get();
                    fanout_cost += fabs(elem->present_loc.Get()->x - fanout_loc->x);
                    fanout_cost += fabs(elem->present_loc.Get()->y - fanout_loc->y);
                }
            }
        }
    }
}

01.10.2014 27

Parallelism discovery

  • Pipeline
  • template matching
  • input: a CU graph mapped onto the execution tree of the program

01.10.2014 28

[Figure: the execution tree of a program, with the root function, nested loops, and leaf nodes; one kind of edge between nodes x and y means "y is called from x", the other means "y is data dependent on x".]

Parallelism discovery

  • Pipeline
  • Cross-correlation between two vectors to determine similarity
  • The vector of a program is derived from its CU graph
  • The vector of a parallel pattern is built using specific properties of the pattern

  • CorrCoef:
  • 0: pattern not detected
  • 1: pattern detected successfully
  • (0,1): pattern may exist but there are obstacles in implementing it

01.10.2014 29

CorrCoef(g, p) = (g · p) / (|g| |p|),  with CorrCoef ∈ [0, 1]
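
A minimal sketch of the correlation coefficient as a normalized cross-correlation (cosine similarity) between a graph vector g and a pattern vector p; the vectors below are invented, not derived from a real CU graph.

#include <cmath>
#include <cstdio>
#include <vector>

// Returns 1.0 when the two vectors point in exactly the same direction,
// and 0.0 when they are orthogonal (pattern not present).
static double corr_coef(const std::vector<double>& g, const std::vector<double>& p) {
    double dot = 0.0, gg = 0.0, pp = 0.0;
    for (std::size_t i = 0; i < g.size(); ++i) {
        dot += g[i] * p[i];
        gg  += g[i] * g[i];
        pp  += p[i] * p[i];
    }
    return (gg == 0.0 || pp == 0.0) ? 0.0 : dot / (std::sqrt(gg) * std::sqrt(pp));
}

int main() {
    std::vector<double> graph    = {1.0, 1.0, 0.5};  // invented graph vector
    std::vector<double> pipeline = {1.0, 1.0, 0.5};  // invented 3-stage pipeline vector
    std::printf("%f\n", corr_coef(graph, pipeline)); // prints 1.000000
    return 0;
}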


Parallelism discovery

  • Pipeline

for (i = 0; i < num_of_frames; ++i) {
    if (!pf.Update(i))
        return 0;
    pf.Estimate(estimate);
    WritePose(output, estimate);
    if (outputBMP)
        outputBMP(estimate);
}

01.10.2014 30

w_{j,k} = 1 − (k − j) / (#stages − 1)

The correlation coefficient between these graph and pipeline vectors is 1.

[Figure: the CU graph of the bodytrack loop (CUs for update, estimate, and output), the graph matrix and graph vector derived from it, and the 3-stage pipeline matrix and pipeline vector with weight values; mandatory chain dependences are marked 1 and forward dependences carry weights w_{j,k}.]

Pipeline Matrix
  • According to the dimensions of the graph matrix, we create a pipeline pattern matrix of size 3x3.
  • Each row or column of this matrix represents a stage in the pipeline, and the entries of the matrix represent dependences between them. The entries of the pipeline matrix have the following specific meaning:
  • An "x" means don't care; either 1 or 0 can be in its place. These dependences do not affect pipeline creation.
  • A 1 indicates a mandatory dependence. The 1 entries together represent the chain of dependences along the stages of the pipeline. We call them the chain dependences.
  • The w_{j,k} indicate forward dependences in the pipeline. A forward dependence exists if a stage Sj of pipeline iteration i depends on the result of a stage Sk of the previous iteration i − 1, with j < k. Hence, a forward dependence adversely affects the execution of the pipeline, because an earlier stage of an iteration has to wait for the results of a later stage of the previous iteration.
  • #stages represents the total number of stages in the pipeline.
  • The weight decreases as the distance between two stages with forward dependences increases.
  • The 0 in the last column of the first row …
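
For comparison, a minimal sketch of how a per-frame loop like the one above could be written as a three-stage pipeline with oneTBB's parallel_pipeline (API as in oneTBB 2021; older TBB versions spell the filter modes differently). The Frame type and stage bodies are invented placeholders, not the actual bodytrack code.

#include <tbb/parallel_pipeline.h>
#include <cstdio>

struct Frame { int id; double estimate; };    // invented placeholder type

int main() {
    const int num_of_frames = 8;
    int next = 0;

    tbb::parallel_pipeline(
        4,                                    // maximum number of frames in flight
        // Stage 1 (serial, in order): produce the next frame.
        tbb::make_filter<void, Frame>(tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> Frame {
                if (next >= num_of_frames) { fc.stop(); return Frame{}; }
                return Frame{next++, 0.0};
            }) &
        // Stage 2 (parallel): the compute-heavy update/estimate step.
        tbb::make_filter<Frame, Frame>(tbb::filter_mode::parallel,
            [](Frame f) {
                f.estimate = f.id * 0.5;      // stand-in for Update()/Estimate()
                return f;
            }) &
        // Stage 3 (serial, in order): write results in frame order.
        tbb::make_filter<Frame, void>(tbb::filter_mode::serial_in_order,
            [](Frame f) {
                std::printf("frame %d estimate %f\n", f.id, f.estimate);
            }));
    return 0;
}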
Outline

  • Background
  • Approach
  • Dependence profiling
  • Computational Unit and CU graph
  • Parallelism discovery
  • Results

01.10.2014 31

Evaluation

  • DOALL loops
  • NAS Parallel Benchmarks
  • Parsec
  • real world kernels and applications
  • Tasking
  • Starbench
  • Parsec
  • Pipeline
  • Parsec
  • libVorbis

01.10.2014 32


Results

  • DOALL loops in NAS Parallel Benchmarks

01.10.2014 33

Program | # loops | # OMP | # identified
BT      | 184     | 30    | 30
SP      | 252     | 34    | 34
LU      | 173     | 33    | 33
IS      | 25      | 11    | 8
EP      | 10      | 1     | 1
CG      | 32      | 16    | 9
MG      | 74      | 14    | 14
FT      | 37      | 8     | 7
Overall | 787     | 147   | 136

NAS parallel benchmarks.

Results

  • Precise loop parallelism detection
  • detected 92.5% of the DOALL loops from NPB
  • Result ranking
  • covered 65.3% of the parallelized loops from NPB in top 30%

01.10.2014 34

Results

Benchmark         | LOC | Input size          | # Suggestions | # Adopted | Seq. time (s) | Par. time (s) | Speedup (4T)
histogram         | 102 | 50M numbers         | 5             | 1         | 0.36          | 0.098         | 3.67
mandelbrot        | 521 | 1024 x 1024 matrix  | 2             | 2         | 46.02         | 22.73 (11.61) | 2.02 (3.96)
light propagation | 74  | 500k random points  | 1             | 1         | 5.67          | 2.33          | 2.43
ANN training      | 107 | 50 x 500 x 4 matrix | 10            | 2         | 5.11          | 1.66          | 3.07

01.10.2014 35

Speedup achieved when adopting DOALL loop suggestions

Results

Benchmark     | # Suggestions | Location parallelized in parallel implementation | Matching suggestion     | # Iter. | Size | Effort
blackscholes  | 2             | blackscholes.c: 238                              | blackscholes.c: 238     | 400     | 20   | Low
streamcluster | 16            | streamcluster.cpp: 1723                          | streamcluster.cpp: 1714 | 5       | 8    | Medium
gzip 1.3.5    | 43            | pigz.c: 1478                                     | gzip.c: 1595            | 284     | 101  | High
bzip2 1.0.2   | 62            | bzip2smp.c: 81                                   | bzip2.c: 3793           | 104     | 34   | High

01.10.2014 36

Comparison with existing parallel implementations for DOALL loops

Results

  • Tasking suggestions compared to existing parallel implementations

01.10.2014 37

Program | Function            | % exec. time | Match in parallel version | # CUs
c-ray   | render_scanlines()  | 100.0        | yes                       | 4
k-means | cluster()           | 99.6         | yes                       | 3
md5     | process()           | 93.5         | yes                       | 7
rotate  | RotateEngine::run() | 90.3         | yes                       | 6
rgbyuv  | processImage()      | 100.0        | yes                       | 7
ray-rot | render_scanlines()  | 97.2         | yes                       | 10
rot-cc  | RotateEngine::run() | 54.7         | yes                       | 13


Results

  • Speedup achieved when adopting tasking suggestions

01.10.2014 38

Program      | Function                 | Refactoring | # threads | Local speedup
Fluidanimate | RebuildGrid()            | yes         | 2         | 1.60
Fluidanimate | ProcessCollisions()      | no          | 4         | 1.81
Canneal      | routing_cost_given_loc() | yes         | 2         | 1.32
Blackscholes | CNDF()                   | no          | 2         | 0.98
FFT          | fft_unshaffle_32, etc.   | no          | 4         | 3.01

PARSEC benchmark suite.

Results

  • Pipeline discovery in Parsec benchmarks and libVorbis

01.10.2014 39

Program      | # in parallel version | Corr. coef. | Detected | Speedup
bodytrack    | 1                     | 0.96        | 1        | N.A.
dedup        | 1                     | 1.00        | 1        | N.A.
ferret       | 1                     | 1.00        | 1        | N.A.
blackscholes | -                     | 0.00        | -        | N.A.
fluidanimate | -                     | 0.94        | 1        | 1.52 (3T)
libVorbis    | N/A                   | 1.00        | 1        | 3.62 (4T)


Results

  • Overhead

01.10.2014 40

Time overhead for NAS and Starbench.

[Bar chart: slowdown (x) per benchmark for the serial, 8T lock-based, 8T lock-free, and 16T lock-free profiler configurations; the axis spans 50x-350x, with individual bars reaching 424x and 428x.]

Our serial profiler has a 190x slowdown on average for NAS benchmarks and a 191x slowdown on average for Starbench programs. The overhead is not surprising, since we perform an exhaustive profiling of the whole program. When using 8 threads, our parallel profiler gives a 97x slowdown on average for NAS benchmarks and a 101x slowdown on average for Starbench programs. After increasing the number of threads to 16, the average slowdown is only 78x for NAS benchmarks and 93x for Starbench programs. Compared to the serial profiler, our parallel profiler achieves a 2.4x and a 2.1x speedup using 16 threads on the NAS and Starbench benchmark suites, respectively.


Results

  • Overhead

01.10.2014 41

Memory overhead for NAS and Starbench.

[Bar chart: memory consumption (MB) per benchmark for the native run and the 8T and 16T lock-free profiler configurations; the axis spans 256-1536 MB, with individual bars at 1589, 7856, and 1681 MB.]

We measure memory consumption using the "maximum resident set size" value provided by /usr/bin/time with the verbose (-v) option. When using 8 threads, our profiler consumes 473 MB of memory on average for NAS benchmarks and 505 MB on average for Starbench programs. With 16 threads, the average memory consumption increases to 649 MB and 1390 MB for NAS and Starbench programs, respectively.

Conclusion

  • A general concept that allows arbitrary code sections that can run concurrently with each other to be identified.
  • Useful suggestions are given; after parallelizing sequential programs by adopting them, significant speedups can be gained.
  • Suggestions for well-known open-source software are comparable with their existing parallel implementations.
  • Practical overhead in both time and space.

01.10.2014 42

Latest Progress

  • Task parallelism detection
  • not limited to predefined language constructs
  • covers independent tasks and pipeline parallelism

01.10.2014 43

[Figure: task-parallelism detection output showing nested control regions, e.g. a function spanning lines 365-381 marked "Parallelizable: true", together with loops and if-else blocks marked parallelizable or not, CUs, INIT nodes, and RAW dependences; blue, yellow, and grey distinguish CUs and control regions.]

Latest Progress

  • Results after utilizing found task parallelism

01.10.2014 44

Program      | Function              | % of time | Para. plan   | # threads | Local speedup | Overall speedup
fluidanimate | RebuildGrid           | 9.8       | Indep. tasks | 2         | 1.69          | 1.04
IS           | main                  | 100.0     | Indep. tasks |           |               |
FFT          | fft_unshaffle_32, etc | 94.8      | Indep. tasks | 4         | 3.01          | 2.67
fluidanimate | ComputeForces         | 91.2      | Pipeline     | 3         | 1.67          | 1.52
bodytrack    | mainSingleThread      | 100.0     | Pipeline     | 3         | 1.17          | 1.17
LibVorbis    | main (encoder)        | 100.0     | Pipeline     | 4         | 3.62          | 3.62

Performance

  • Parallelize the analysis to lower the overhead further

programs from NAS, input size = W, 4 threads

01.10.2014 45

Program | Serial (s) | Parallel (s) | Speedup (x)
BT      | 4475.19    | 1138.52      | 3.93
CG      | 477.1      | 163.25       | 2.92
MG      | 423.03     | 132.68       | 3.18

Performance

  • Slowdown when profiling multithreaded programs

01.10.2014 46

[Bar chart: slowdown (×) when profiling the multithreaded Starbench programs (c-ray, kMeans, md5, ray-rot, rgbyuv, rotate, rot-cc, streamcluster, tinyjpeg, bodytrack, h264dec, average) with 8 and 16 profiler threads, each program running 4 threads; the axis spans 200-1000.]

Performance

  • Memory consumption when profiling multithreaded programs

01.10.2014 47

[Bar chart: memory consumption (MB) when profiling the same multithreaded programs natively and with 8 and 16 profiler threads, each program running 4 threads; the axis spans 512-2560 MB.]

Code transformation

  • Automatic serial-to-parallel code transformation
  • based on data dependencies
  • using TBB
  • loops first, then flow graph and pipeline

01.10.2014 48
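
As a hedged sketch of the transformation target mentioned above, here is a serial DOALL loop next to a possible tbb::parallel_for version; the loop body is invented for illustration.

#include <tbb/parallel_for.h>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1000;
    std::vector<double> a(n, 2.0), b(n, 3.0), c(n, 0.0);

    // Serial DOALL loop:
    //     for (std::size_t i = 0; i < n; ++i) c[i] = a[i] * b[i];
    // A possible TBB transformation of the same loop:
    tbb::parallel_for(std::size_t(0), n, [&](std::size_t i) {
        c[i] = a[i] * b[i];                   // every iteration is independent
    });

    std::printf("%f\n", c[n - 1]);
    return 0;
}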