SLIDE 1

10/12/2016 | Department of Computer Science | Laboratory for Parallel Programming | Rohit Atre | 1

DiscoPoP: A Profiling, Analysis, and Visualization Tool for Parallelism Discovery

10th International Parallel Tools Workshop

Rohit Atre, Zia Ul-Huda, Mohammad Norouzi, Arya Mazaheri, Zhen Li, Dr. Ali Jannesari, Prof. Felix Wolf
SLIDE 2

Introduction

  • A large number of legacy programs need to be parallelized
  • Transforming an existing sequential program into a parallel one is not easy
  • DiscoPoP (Discovery of Potential Parallelism) is a tool to detect parallelism in sequential applications
  • Detects hotspots in sequential applications
  • Gives hints to programmers, making the parallelization process easier
SLIDE 3

DiscoPoP Workflow

  • Phase 1: the source code is converted to LLVM IR and then subjected to control-flow analysis, computational unit (CU) analysis, and data-dependency analysis, producing the control region info and the CU graph
  • Phase 2: parallelism discovery
  • Phase 3: ranking, producing ranked parallel opportunities

SLIDE 4

Outline

  • Profiler
  • Computational Units & Program graph
  • Applications
  • Evaluation
  • Future Work
SLIDE 5

Profiling with signatures

  • A signature is usually implemented as a Bloom filter:
    − A fixed-size bit array
    − k different hash functions that together map an element to a number of array indices
  • Two signatures are kept: one for recording read operations and one for recording write operations
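As a rough illustration, a signature along these lines can be sketched as follows. This is a hypothetical miniature, not DiscoPoP's actual implementation; the array size and the two hash functions are arbitrary choices for the sketch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical signature sketch: a fixed-size bit array with k = 2 hash
// functions that together record which memory addresses have been accessed.
struct Signature {
    static constexpr std::size_t kBits = 1024;   // arbitrary size for the sketch
    std::array<bool, kBits> bits{};

    static std::size_t h1(std::uint64_t addr) { return (addr * 2654435761u) % kBits; }
    static std::size_t h2(std::uint64_t addr) { return (addr * 40503u + 7) % kBits; }

    void insert(std::uint64_t addr) { bits[h1(addr)] = bits[h2(addr)] = true; }

    // May return a false positive, but never a false negative.
    bool mayContain(std::uint64_t addr) const { return bits[h1(addr)] && bits[h2(addr)]; }
};
```

Because distinct addresses can hash to the same bits, `mayContain` may report false positives but never false negatives — the usual Bloom-filter trade-off, which for a profiler means spurious dependences are possible but no dependence is missed.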

SLIDE 6

Profiling with signatures

Example access sequence:

  a. write 3
  b. read 2
  c. read 1
  d. write 2

Accesses a–c populate the write signature (a) and the read signature (b, c). The write in d finds a previous read at address 2, recorded in line b, so the WAR dependence b|2 is reported.
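The bookkeeping behind such a report can be sketched as below. This is an illustrative model, not DiscoPoP's code: each "signature" is represented exactly, as a map from address to the source line of the last access, whereas a real signature would hash into a fixed-size bit array.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative model of dependence extraction from a read and a write signature.
struct DepProfiler {
    std::unordered_map<std::uint64_t, std::string> readSig, writeSig;
    std::vector<std::string> deps;

    void read(const std::string& line, std::uint64_t addr) {
        if (writeSig.count(addr))      // earlier write to the same address: RAW
            deps.push_back("RAW " + writeSig[addr] + "|" + std::to_string(addr));
        readSig[addr] = line;
    }
    void write(const std::string& line, std::uint64_t addr) {
        if (readSig.count(addr))       // earlier read of the same address: WAR
            deps.push_back("WAR " + readSig[addr] + "|" + std::to_string(addr));
        if (writeSig.count(addr))      // earlier write to the same address: WAW
            deps.push_back("WAW " + writeSig[addr] + "|" + std::to_string(addr));
        writeSig[addr] = line;
    }
};
```

Replaying the sequence a: write 3, b: read 2, c: read 1, d: write 2 yields exactly one dependence, "WAR b|2".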

SLIDE 7

Parallel data-dependence profiling

The main thread acts as a producer that distributes memory-access events to the worker threads ("consumers"). Each worker fetches its events, maintains its own read and write signatures, and records results in a local dependence storage; at the end, the local storages are merged into a global dependence storage.
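A minimal sketch of this producer/consumer scheme follows. The partitioning rule is an assumption made for the sketch: accesses are split by address so that every access to a given address reaches the same worker, and only WAR dependences are tracked for brevity.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Simplified sketch, not DiscoPoP's implementation: each worker profiles its
// address partition with local signatures and local dependence storage, then
// merges its results into the global storage.
struct Access { char kind; std::string line; std::uint64_t addr; };

std::set<std::string> profileParallel(const std::vector<Access>& trace, int nWorkers) {
    std::set<std::string> global;            // global dependence storage
    std::mutex m;
    std::vector<std::thread> workers;
    for (int w = 0; w < nWorkers; ++w) {
        workers.emplace_back([&, w] {
            std::map<std::uint64_t, std::string> readSig, writeSig; // local signatures
            std::set<std::string> local;                            // local dependence storage
            for (const Access& a : trace) {
                if (a.addr % nWorkers != static_cast<std::uint64_t>(w)) continue;
                if (a.kind == 'W' && readSig.count(a.addr))
                    local.insert("WAR " + readSig[a.addr]);
                (a.kind == 'R' ? readSig : writeSig)[a.addr] = a.line;
            }
            std::lock_guard<std::mutex> g(m);                       // merge step
            global.insert(local.begin(), local.end());
        });
    }
    for (std::thread& t : workers) t.join();
    return global;
}
```

Partitioning by address keeps all conflicting accesses within one worker, so no cross-worker synchronization is needed until the final merge.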

SLIDE 8

Profiling multithreaded programs

  • Allows program analyses for multithreaded applications
    − Communication pattern detection
    − Scheduling
    − Performance tuning

SLIDE 9

Outline

  • Profiler
  • Computational Units & Program graph
  • Applications
  • Evaluation
  • Future Work
SLIDE 10

Computational Unit (CU)

  • A collection of program statements
  • Follows the read-compute-write pattern
  • Logical units used to form larger tasks
  • Can be merged together
  • Assigned to threads
  • Building blocks for various patterns:
    − Tasks in a task pool
    − Stages of a pipeline

  x = 3
  y = 4
  a = x + rand() / x
  b = x - rand() / x
  x = a + b
  a = y + rand() / y
  b = y - rand() / y
  y = a + b

  CUx: x = 3; a = x + rand() / x; b = x - rand() / x; x = a + b
  CUy: y = 4; a = y + rand() / y; b = y - rand() / y; y = a + b

SLIDE 11

Computational Unit (CU) – Example

  • Every read of a variable global to the region should happen before the corresponding write to it
  • For every read instruction that violates this rule, a new CU is created

  int x = 3;
  for (int i = 0; i < MAX_ITER; ++i) {
      int a = x + rand() / x;
      int b = x - rand() / x;
      x = a + b;
  }

Region 1: identify the variables global to the region; add the global variable x to the read-set when it is read and to the write-set when it is written.
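The read-before-write rule can be sketched as a small pass over the statements of a region. This is a hypothetical miniature (the representation is an assumption): each statement is given as the sets of region-global variables it reads and writes.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of CU formation: a read of a variable that the
// current CU has already written violates read-compute-write and
// therefore starts a new CU.
using Stmt = std::pair<std::set<std::string>, std::set<std::string>>; // (reads, writes)

std::vector<std::vector<int>> buildCUs(const std::vector<Stmt>& stmts) {
    std::vector<std::vector<int>> cus(1);   // statement indices per CU
    std::set<std::string> written;          // write-set of the current CU
    for (int i = 0; i < static_cast<int>(stmts.size()); ++i) {
        for (const std::string& v : stmts[i].first) {
            if (written.count(v)) {         // read after write: violation
                cus.emplace_back();         // start a new CU
                written.clear();
                break;
            }
        }
        cus.back().push_back(i);
        written.insert(stmts[i].second.begin(), stmts[i].second.end());
    }
    return cus;
}
```

For the loop body above, x is read by the first two statements and only written by the third, so all three statements fall into one CU; a later statement reading x would open a new one.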

SLIDE 12

Program Graph (Output of phase-1)

Figure: program graph with dependence, control-flow, and child edges.

SLIDE 13

Program Graph Visualization

SLIDE 14

Outline

  • Profiler
  • Computational Units & Program graph
  • Applications
  • Evaluation
  • Future Work
SLIDE 15

Applications

  • Detection of parallel design patterns
  • The output of phase 1 is used for detecting:
    − Pipeline
    − Multi-loop pipeline
    − Task parallelism
    − Geometric decomposition
    − Do-all loops
    − Reduction
  • Different approaches are used to detect these patterns
SLIDE 16

Parallel Patterns

SLIDE 17

Pipeline

  • A pipeline exists only if its stages are executed many times
  • Candidates: loops, recursions, and functions with multiple loops
  • A graph matrix is computed from the CU graph of the hotspot
  • A pipeline matrix is created based on the number of CUs in the graph matrix
  • The pipeline matrix has specific properties, such as chain dependences and forward dependence weights

SLIDE 18

Multi-loop Pipeline

  • Iterations of one loop depend on iterations of another loop
  • Each loop can be a stage of a pipeline
  • We profile loops and gather iteration-dependence data

Example code:

  for (. . .) // Loop x
      a[i] = foo(i);
  for (. . .) // Loop y
      b[i] = bar(a[i]);

Results of the profiled run:

  Variable | Iteration # of Loop x (Ix) | Iteration # of Loop y (Iy)
  a[0]     | 0                          | 0
  a[1]     | 1                          | 1
  …        | …                          | …
  a[n]     | n                          | n
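Given such a table, a pipeline is plausible when every recorded pair maps iteration i of loop x to iteration i of loop y. The profiling behind the table can be sketched as follows (a hypothetical miniature; the interface and names are assumptions):

```cpp
#include <map>
#include <utility>
#include <vector>

// Hypothetical sketch of iteration-dependence profiling across two loops:
// remember which iteration of loop x last wrote each element, and pair it
// with the iteration of loop y that reads that element.
struct LoopDepProfiler {
    std::map<int, int> writeIter;            // element index -> iteration of loop x
    std::vector<std::pair<int, int>> pairs;  // recorded (Ix, Iy) dependence pairs

    void writeX(int iter, int elem) { writeIter[elem] = iter; }

    void readY(int iter, int elem) {
        auto it = writeIter.find(elem);
        if (it != writeIter.end()) pairs.push_back({it->second, iter});
    }
};
```

If every recorded pair is (i, i), iteration i of loop y depends only on iteration i of loop x — exactly the situation in which the two loops can run as stages of a pipeline.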

SLIDE 19

Fusion & Reduction

  • Loop fusion
    − Fusion of loops x and y can occur if:
      − Both loops x and y are do-all loops
      − There are no loop-carried dependences
  • Reduction
    − State-of-the-art compilers may miss a reduction due to pointer aliasing or array referencing
    − Dynamic analysis helps overcome the limitations of static analysis
    − Detection approach is the same as for the multi-loop pipeline: profile the iterations of a single loop
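A minimal before/after illustration of fusion (an assumed example, with arbitrary loop bodies):

```cpp
#include <cstddef>
#include <vector>

// Assumed illustration: two do-all loops without loop-carried dependences,
//   for (i ...) a[i] = 2 * i;      // loop x
//   for (i ...) b[i] = a[i] + 1;   // loop y (reads only a[i] of the same i)
// merged into a single loop after fusion.
void fused(std::vector<int>& a, std::vector<int>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = static_cast<int>(i) * 2;  // body of former loop x
        b[i] = a[i] + 1;                 // body of former loop y
    }
}
```

Fusion halves the loop overhead and improves locality; it is legal here because neither loop carries a cross-iteration dependence and loop y only reads the element that the same iteration of loop x produced.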
SLIDE 20

Applications

  • SPMD/MIMD-type task parallelism
  • Detection of independent sets of CUs that can run in parallel with each other
  • Detection of parallelism between different region levels
  • Detection of synchronization points between different parallel tasks

SLIDE 21

Task Parallelism

Figure: program graph with dependence, control-flow, and child edges.

SLIDE 22

Task Parallelism

  • Using breadth-first search, we classify the CUs into fork, worker, and barrier CUs
  • Two barriers can run in parallel if there is no directed path between them

Figure: simplified CU graph for task parallelism, showing fork, worker, barrier, and unmarked tasks connected by dependence edges.
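The "no directed path" test reduces to graph reachability. A sketch using breadth-first search (the adjacency-list representation of the CU graph is an assumption made for the sketch):

```cpp
#include <queue>
#include <vector>

// Sketch: the CU graph as an adjacency list of dependence edges.
// Two tasks can run in parallel iff neither can reach the other.
bool reaches(const std::vector<std::vector<int>>& adj, int from, int to) {
    std::vector<bool> seen(adj.size(), false);
    std::queue<int> q;
    q.push(from);
    seen[from] = true;
    while (!q.empty()) {                 // standard BFS over dependence edges
        int u = q.front();
        q.pop();
        if (u == to) return true;
        for (int v : adj[u])
            if (!seen[v]) { seen[v] = true; q.push(v); }
    }
    return false;
}

bool canRunInParallel(const std::vector<std::vector<int>>& adj, int a, int b) {
    return !reaches(adj, a, b) && !reaches(adj, b, a);
}
```

In a diamond-shaped graph 0 → {1, 2} → 3, for example, tasks 1 and 2 can run in parallel, while 0 and 3 cannot.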

SLIDE 23

CU Instantiation

void cilksort(ELM *low, ELM *tmp, long size) {
    ...
    cilksort(A, tmpA, quarter);
    cilksort(B, tmpB, quarter);
    cilksort(C, tmpC, quarter);
    cilksort(D, tmpD, size - 3 * quarter);
    cilkmerge(A, A + quarter - 1, B, B + quarter - 1, tmpA);
    cilkmerge(C, C + quarter - 1, D, low + size - 1, tmpC);
    cilkmerge(tmpA, tmpC - 1, tmpC, tmpA + size - 1, A);
    ...
}

void cilkmerge(ELM *low1, ELM *high1, ELM *low2, ELM *high2, ELM *lowdest) {
    ...
    cilkmerge(low1, split1 - 1, low2, split2, lowdest);
    cilkmerge(split1 + 1, high1, split2 + 1, high2, lowdest + lowsize + 2);
    ...
}

SLIDE 24

CU Instantiation - PET

Figure: PET with the instances of the cilksort and cilkmerge CUs.

SLIDE 25

Applications

  • Energy-efficient parallelism
    − Energy consumption per CU
    − Energy-efficient pattern detection
    − Energy-efficient task formation
SLIDE 26

Energy-Efficient Parallelism

  • Energy optimization
    − Reduce memory accesses
    − Which OpenMP constructs to use?
  • Energy-efficient task formation
    − Consider CU attributes (data size, memory access frequency, etc.) to form tasks
    − Which OpenMP constructs to use?

SLIDE 27

Applications

  • Detecting communication patterns
    − Produce the communication pattern of splash2x applications on a multicore platform based on profiled data dependences
    − Shows the communication intensity between producer and consumer threads
    − Critical for understanding the performance of parallel applications

Figure: communication patterns of splash2x.water-spatial and splash2x.lu_ncb.

SLIDE 28

Outline

  • Profiler
  • Computational Units & Program graph
  • Applications
  • Evaluation
  • Future Work
SLIDE 29

Evaluation

Pipeline

  Application         | Pipelines Detected | Pipelines Implemented | Speedup
  Parsec:Bodytrack    | 1                  | 1                     | 1.22
  Parsec:Blackscholes | NA                 |                       |
  Parsec:Dedup        | 2                  | 2                     | 2.15
  Parsec:Ferret       | 2                  | 2                     | 6.14
  libVorbis           | 2                  | 1                     | 8.4

SLIDE 30

Evaluation

Multi-loop pipeline & fusion

  Program               | LOC  | Detected Pattern    | # of inter-dependent loops | Speedup | # of Threads
  Polybench:ludcmp      | 135  | Multi-loop pipeline | 2                          | 14.06   | 32
  Polybench:reg_detect  | 137  | Multi-loop pipeline | 2                          | 2.26    | 16
  Parsec:fluidanimate   | 3987 | Multi-loop pipeline | 4                          | 1.50    | 3
  Starbench:rot-cc      | 578  | Fusion              | 2                          | 16.18   | 32
  Polybench:correlation | 137  | Fusion              | 4                          | 10.74   | 32
  Polybench:2mm         | 153  | Fusion              | 2                          | 13.50   | 32

SLIDE 31

Evaluation

Task parallelism

  • Recursion in fib and strassen is the cause of the big difference between estimated and actual speedups

  • DiscoPoP does not track the depth of recursion when profiling

  Program           | LOC | Detected Pattern          | Estimated Speedup | Speedup | # of Threads
  Bots:fib          | 32  | Task parallelism          | 3.25              | 13.25   | 32
  Bots:sort         | 305 | Task parallelism          | 2.11              | 3.67    | 32
  Bots:strassen     | 399 | Task parallelism          | 3.5               | 8.93    | 32
  Polybench:3mm     | 166 | Task parallelism + Do-all | 1.5               | 13.93   | 16
  Polybench:mvt     | 114 | Task parallelism + Do-all | 1.96              | 11.39   | 32
  Polybench:fdtd-2d | 142 | Task parallelism          | 2.17              | 5.19    | 8

SLIDE 32

Future work

  • Optimize DiscoPoP profiler
  • Add static dependence analysis
  • Generate coarser CUs
SLIDE 33

Future work

  • Support for OpenMP-based parallel applications
    − Currently the profiler only supports Pthread-based applications
    − OpenMP is the industry standard for multi-core applications
    − Add support for OpenMP-based parallel applications to the DiscoPoP profiler

SLIDE 34

Future work

  • Energy profiling
    − Needed for energy-efficient task formation and pattern detection
    − Dynamic: instrumentation methods
    − Static: energy per instruction (EPI)
SLIDE 35

Future work

  • Detect energy-efficient parallel patterns
    − Detect available parallel patterns in a code section
    − Consider energy-efficiency parameters:
      − Number of cores
      − Number of tasks
      − Energy balance of detected parallel tasks
      − Data size of tasks
      − Communication frequency
SLIDE 36

Current status

  • An efficient parallelized dependence profiler is available
  • Supports Pthread-based parallel applications
  • Several parallel design patterns are detected successfully
  • Adoption of a new version of LLVM is in progress
  • The profiler will soon be published as an open-source tool
SLIDE 37

Thank You!