Dependence-Based Automatic Parallelization using CnC


SLIDE 1

Dependence-Based Automatic Parallelization using CnC

Bo Zhao, Ali Jannesari

Technische Universität Darmstadt
bo.zhao@rwth-aachen.de, jannesari@cs.tu-darmstadt.de

September 8, 2015

Bo Zhao, Ali Jannesari (TU Darmstadt), CnC Workshop 2015, September 8, 2015

SLIDE 2

Overview

1 Introduction
  • Motivation
  • Objectives

2 Approach
  • Overview Framework
  • Program Analysis
  • Task Parallelism Extraction
  • Code Generation

3 Conclusion

SLIDE 3

Motivation

Multicore architectures have become popular as a result of stagnating single-core performance

Many software products are implemented sequentially
  • They fail to tap the potential of the parallel hardware

Problem: the gap between parallel hardware and sequential software
  • Take advantage of new hardware features
  • Preserve the current software investment
  • Save human resources

Solution: automatically (or semi-automatically) transform sequential code into parallel code

SLIDE 4

Objectives

Discover potential parallelism
  • Loop parallelism
  • Irregular task parallelism

Detect data and control dependencies

Generate parallel code using Concurrent Collections (CnC)

SLIDE 5

Overview Workflow

[Figure: workflow diagram spanning compile time and runtime. Labeled components: Sequential Source Code, Front End, Seq IR, DiscoPoP (Static Code Analysis, Dynamic Code Analysis, Code Instrumentation), CU Graph, Ctrl Info, Task Graph Generator, Task Graph, IR-to-IR Transformation, CnC-Par IR, LLVM JIT Compiler, Parallel Execution, Unit Testing, Correctness Feedback.]

Phase 1: Program Analysis
Phase 2: Coarse-Grained Task Extraction & IR-to-IR Transformation
Phase 3: Code Generation

SLIDE 6

DiscoPoP (Discovery of Potential Parallelism)

Phase 1:
  • Static and dynamic analyses
  • Instruments the target program and identifies control and data dependencies

Phase 2 & 3:
  • Post-mortem analysis for parallelism discovery
  • Builds Computational Units (CUs) for the target program
  • Ranking

[Figure: DiscoPoP workflow across Phase 1, Phase 2, and Phase 3, with static and dynamic parts. Labeled stages: Source Code; Conversion to IR; Static Control-flow Analysis; Memory Access & Control-flow Instrumentation; Execution; Dynamic Control-flow Analysis; Data Dependency Analysis; Variable Lifetime Analysis; Runtime Dependency Merging; Computational Unit Analysis; CU Graph; Control Region Information; Parallelism Discovery & Parallel Pattern Detection; Ranking; Ranked Parallel Opportunities.]

SLIDE 7

Dependence Profiling

Control dependence

    <FileID:LineID>  <Contr.ID>  <Label>  <Exec.Time>
    1:60             BGN         loop     void
    1:74             END         loop     1200

Data dependence

    <FileID:LineID>  <Contr.ID>  <Label>  <Dep.>  <FileID:LineID|VarName>
    1:63             NOM         void     RAW     1:59|temp1
    1:70             NOM         void     WAR     1:67|temp2

Data dependence (multi-threaded)

    <FileID:LineID|ThreadID>  <Contr.ID>  <Label>  <Dep.>  <FileID:LineID|VarName|ThreadID>
    4:59|2                    NOM         void     WAR     4:71|2|z real
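As a rough illustration of how the single-threaded data dependence records could be consumed, here is a minimal Python sketch. It relies only on the two sample lines shown on this slide; the class name, field names, and the reading of the first location as the dependent statement are assumptions made for the example, not part of DiscoPoP.

```python
from typing import NamedTuple


class DataDependence(NamedTuple):
    """One single-threaded data dependence record (field names are hypothetical)."""
    file_id: int
    line: int
    control_id: str
    label: str
    dep_type: str
    other_file_id: int
    other_line: int
    variable: str


def parse_data_dependence(record: str) -> DataDependence:
    """Parse a record such as '1:63 NOM void RAW 1:59|temp1'."""
    location, control_id, label, dep_type, other = record.split()
    if dep_type not in {"RAW", "WAR", "WAW"}:
        raise ValueError(f"unknown dependence type: {dep_type}")
    file_id, line = location.split(":")
    # Assumption: the last field is '<FileID:LineID>|<VarName>'.
    other_location, variable = other.split("|", 1)
    other_file_id, other_line = other_location.split(":")
    return DataDependence(int(file_id), int(line), control_id, label,
                          dep_type, int(other_file_id), int(other_line), variable)


sample = parse_data_dependence("1:63 NOM void RAW 1:59|temp1")
```

In this sketch `sample.dep_type` is `"RAW"` and `sample.variable` is `"temp1"`; the multi-threaded layout would need an extra thread field and is left out.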

SLIDE 8

Computational Unit (CU)

A collection of instructions (LLVM IR instructions)

Follows the read-compute-write pattern
  • A program state is first read from memory, the new state is computed, and finally written back

A small piece of code containing no parallelism or only ILP

Building blocks for forming parallel tasks

CU graph
  • Dependences are mapped to CUs
  • Exposes tightly-connected CUs

Example:

    x = 3
    y = 4
    a = x + rand() / x
    b = x - rand() / x
    x = a + b
    a = y + rand() / y
    b = y - rand() / y
    y = a + b

[Figure: the example split into two CUs; the figure also carries an INIT label.]

    CUx:  x = 3;  a = x + rand() / x;  b = x - rand() / x;  x = a + b
    CUy:  y = 4;  a = y + rand() / y;  b = y - rand() / y;  y = a + b
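To make the link between dependences and the two CUs concrete, the following Python sketch derives RAW, WAR, and WAW dependences for the eight-statement example. The read and write sets are transcribed by hand from the example; the function name and the tuple layout are hypothetical, and a real profiler would collect this information from instrumented memory accesses rather than from hand-written sets.

```python
from collections import defaultdict

# (statement number, variables written, variables read), transcribed by hand.
STATEMENTS = [
    (1, {"x"}, set()),        # x = 3
    (2, {"y"}, set()),        # y = 4
    (3, {"a"}, {"x"}),        # a = x + rand() / x
    (4, {"b"}, {"x"}),        # b = x - rand() / x
    (5, {"x"}, {"a", "b"}),   # x = a + b
    (6, {"a"}, {"y"}),        # a = y + rand() / y
    (7, {"b"}, {"y"}),        # b = y - rand() / y
    (8, {"y"}, {"a", "b"}),   # y = a + b
]


def find_dependences(statements):
    """Return (later, type, earlier, variable) tuples for straight-line code."""
    last_write = {}                        # variable -> statement that last wrote it
    reads_since_write = defaultdict(list)  # variable -> readers since that write
    deps = []
    for stmt, writes, reads in statements:
        for var in sorted(reads):
            if var in last_write:
                deps.append((stmt, "RAW", last_write[var], var))
        for var in sorted(writes):
            for reader in reads_since_write[var]:
                deps.append((stmt, "WAR", reader, var))
            if var in last_write:
                deps.append((stmt, "WAW", last_write[var], var))
        # Register this statement's reads before its writes take effect.
        for var in reads:
            reads_since_write[var].append(stmt)
        for var in writes:
            last_write[var] = stmt
            reads_since_write[var] = []
    return deps


DEPS = find_dependences(STATEMENTS)
CU_X, CU_Y = {1, 3, 4, 5}, {2, 6, 7, 8}
CROSS_CU = [d for d in DEPS if (d[0] in CU_Y) and (d[2] in CU_X)]
```

Under these assumptions, every dependence that crosses from CUx to CUy is a WAR or WAW on the reused temporaries `a` and `b`; no RAW dependence links the two units, which is consistent with treating them as separate building blocks.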

SLIDE 9

CU Graph

[Figure: CU graph for a colour-conversion loop. Each node lists its CU ID, the line numbers it contains, and the corresponding code. CU-21, CU-24, and CU-27 share lines 156, 157, and 158 with CU-32, CU-33, and CU-34 respectively, so the Y, U, and V expressions are shown with matching coefficients.]

    CU-19 [146,152,153,154]
        *in = args->in_img;
        for(all_pixels){ R = *in++; G = *in++; B = *in++; }

    CU-32 [152,153,154,156]
        for(all_pixels){ R = *in++; G = *in++; B = *in++; Y = round(c1*R+c2*G+c3*B); }

    CU-33 [152,153,154,157]
        for(all_pixels){ R = *in++; G = *in++; B = *in++; U = round(c4*R+c5*G+c6*B); }

    CU-34 [152,153,154,158]
        for(all_pixels){ R = *in++; G = *in++; B = *in++; V = round(c7*R+c8*G+c9*B); }

    CU-21 [147,156,160]
        *pY = args->pY;
        for(all_pixels){ Y = round(c1*R+c2*G+c3*B); *pY++ = Y; }

    CU-24 [148,157,161]
        *pU = args->pU;
        for(all_pixels){ U = round(c4*R+c5*G+c6*B); *pU++ = U; }

    CU-27 [149,158,162]
        *pV = args->pV;
        for(all_pixels){ V = round(c7*R+c8*G+c9*B); *pV++ = V; }
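The bracketed numbers attached to each CU can be used to see which CUs overlap. The Python sketch below counts shared entries for every pair of CUs on this slide. It assumes the bracketed numbers are source line numbers; the names `CU_LINES` and `common_instructions` are made up for the example.

```python
from itertools import combinations

# Line sets copied from the CU labels on this slide.
CU_LINES = {
    "CU-19": {146, 152, 153, 154},
    "CU-32": {152, 153, 154, 156},
    "CU-33": {152, 153, 154, 157},
    "CU-34": {152, 153, 154, 158},
    "CU-21": {147, 156, 160},
    "CU-24": {148, 157, 161},
    "CU-27": {149, 158, 162},
}


def common_instructions(cu_lines):
    """Map each pair of CUs that share at least one line to the shared count."""
    shared = {}
    for left, right in combinations(sorted(cu_lines), 2):
        overlap = cu_lines[left] & cu_lines[right]
        if overlap:
            shared[(left, right)] = len(overlap)
    return shared


SHARED = common_instructions(CU_LINES)
```

With these sets, the four pixel-reading CUs share three lines with each other, while each writer CU shares exactly one line with the CU that computes its channel.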

SLIDE 10

Program Execution Tree

A call tree combined with loop information and basic blocks

The CU graph is mapped onto the execution tree

[Figure: execution tree with nodes Program (1 - 377), Basic Block (11 - 19), Loop (21 - 36), and Basic Block (22 - 27). Legend: tree node, CU, data dependency.]
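One way to picture the mapping of CUs onto the execution tree is to attach each CU to the deepest tree node whose line range contains it. The Python sketch below does this for the four nodes named in the figure. The nesting of Basic Block 22 - 27 inside Loop 21 - 36 is inferred from the line ranges, and the class, function, CU names, and line numbers used for the CUs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    kind: str
    start: int
    end: int
    children: list = field(default_factory=list)
    cus: list = field(default_factory=list)

    def covers(self, line: int) -> bool:
        return self.start <= line <= self.end


def attach_cu(root: TreeNode, cu_id: str, line: int) -> TreeNode:
    """Attach a CU to the deepest node covering `line` and return that node."""
    if not root.covers(line):
        raise ValueError(f"line {line} is outside the program range")
    node = root
    while True:
        child = next((c for c in node.children if c.covers(line)), None)
        if child is None:
            break
        node = child
    node.cus.append(cu_id)
    return node


# Tree nodes taken from the figure; the nesting is an assumption.
PROGRAM = TreeNode("Program", 1, 377, [
    TreeNode("Basic Block", 11, 19),
    TreeNode("Loop", 21, 36, [TreeNode("Basic Block", 22, 27)]),
])

inner = attach_cu(PROGRAM, "CU-a", 24)    # lands in Basic Block 22 - 27
loop = attach_cu(PROGRAM, "CU-b", 30)     # lands in Loop 21 - 36
top = attach_cu(PROGRAM, "CU-c", 100)     # lands in Program 1 - 377
```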

SLIDE 11

Task Extraction

Merge CUs contained in strongly connected components (SCCs) or in chains

[Figure: a CU graph with vertices A to I. Step 1 merges the SCC F, G, H into FGH; step 2 merges the chain C, D, E into CDE, leaving A, B, I, FGH, and CDE.]

SCC FGH and chain CDE are two tasks

Hide complex dependences inside SCCs, exposing parallelization opportunities outside
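The two merging steps can be sketched in Python as follows. The edge list is hypothetical, chosen only so that F, G, H form a cycle and C, D, E form a linear chain, since the edges of the figure are not recoverable from the slide text. SCCs are found with Tarjan's algorithm; a chain link is taken here to be an edge whose source has exactly one successor and whose target has exactly one predecessor, which is an assumption about what the slide means by "chain".

```python
from collections import Counter, defaultdict
from itertools import count


def strongly_connected_components(nodes, edges):
    """Tarjan's algorithm; returns a list of SCCs, each a sorted list of nodes."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    index, low, stack, on_stack, sccs = {}, {}, [], set(), []
    counter = count()

    def visit(u):
        index[u] = low[u] = next(counter)
        stack.append(u)
        on_stack.add(u)
        for v in succ[u]:
            if v not in index:
                visit(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:          # u is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == u:
                    break
            sccs.append(sorted(component))

    for u in nodes:
        if u not in index:
            visit(u)
    return sccs


def merge_sccs(nodes, edges):
    """Step 1: collapse every SCC into one node named after its members."""
    owner = {n: "".join(c) for c in strongly_connected_components(nodes, edges) for n in c}
    merged_edges = {(owner[u], owner[v]) for u, v in edges if owner[u] != owner[v]}
    return sorted(set(owner.values())), merged_edges


def merge_chains(nodes, edges):
    """Step 2: repeatedly fuse u -> v when u has one successor and v one predecessor."""
    nodes, edges = list(nodes), set(edges)
    while True:
        out_deg = Counter(u for u, _ in edges)
        in_deg = Counter(v for _, v in edges)
        link = next(((u, v) for u, v in sorted(edges)
                     if out_deg[u] == 1 and in_deg[v] == 1), None)
        if link is None:
            return nodes, edges
        u, v = link
        fused = u + v
        nodes = [fused if n == u else n for n in nodes if n != v]
        edges = {(fused if a in link else a, fused if b in link else b)
                 for a, b in edges if (a, b) != link}


NODES = list("ABCDEFGHI")
EDGES = [("A", "B"), ("A", "C"), ("A", "F"), ("B", "F"), ("C", "D"), ("D", "E"),
         ("F", "G"), ("G", "H"), ("H", "F"), ("E", "I"), ("H", "I")]  # hypothetical

step1_nodes, step1_edges = merge_sccs(NODES, EDGES)
TASKS, TASK_EDGES = merge_chains(step1_nodes, step1_edges)
```

For this edge list the result contains the nodes A, B, CDE, FGH, and I, matching the node sets shown in the figure. The recursive traversal is fine for small graphs; a large CU graph would call for an iterative version.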

SLIDE 12

Task Extraction

Two CUs can share common instructions

Figure: Demonstration of a CU graph and graph partitioning to form tasks. The vertices are CUs 53 to 59.
  (a) CU graph with CUs as vertices and RAW dependences and common instructions as edges (edge labels: number of common instructions, number of dependences)
  (b) CU graph with affinities between the CUs
  (c) CU graph with a minimum cut
  (d) CU graph partitioned to identify tasks; the edge weights 0.17 and 0.10 shown in (b) and (c) no longer appear in (d)
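The partitioning step can be illustrated with a brute-force minimum cut over a small weighted graph. In the Python sketch below, the vertex IDs and affinity values are taken from the figure, but the assignment of values to edges is hypothetical because the topology is not recoverable from the slide text. Enumerating all bipartitions is only practical for a handful of vertices and stands in for whatever partitioning algorithm the authors actually use.

```python
from itertools import combinations


def min_cut(nodes, weights):
    """Return (cost, side_a, side_b) for the cheapest split into two non-empty sides."""
    nodes = list(nodes)
    anchor, rest = nodes[0], nodes[1:]
    best = None
    # Fix the first vertex on side_a so each bipartition is visited once;
    # taking at most len(rest) - 1 extra vertices keeps side_b non-empty.
    for k in range(len(rest)):
        for chosen in combinations(rest, k):
            side_a = {anchor, *chosen}
            cost = sum(w for (u, v), w in weights.items()
                       if (u in side_a) != (v in side_a))
            if best is None or cost < best[0]:
                best = (cost, side_a, set(nodes) - side_a)
    return best


CUS = [53, 54, 55, 56, 57, 58, 59]
# Affinity values from the figure; which edge carries which value is assumed.
AFFINITY = {
    (53, 54): 0.20, (54, 57): 0.28, (53, 57): 0.35,
    (56, 59): 0.18, (58, 59): 0.43, (55, 58): 0.20, (55, 56): 0.05,
    (56, 57): 0.17, (54, 55): 0.10,
}

CUT_COST, SIDE_A, SIDE_B = min_cut(CUS, AFFINITY)
```

With this assumed topology the cheapest cut removes the two edges weighted 0.17 and 0.10 and separates CUs 53, 54, 57 from CUs 55, 56, 58, 59.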

SLIDE 13

Task Graph

Task Extraction
  • Not limited to predefined language constructs
  • Covers independent tasks and dependent tasks (coarse-grained tasks)

[Figure: task graph. Control regions: function 365 - 381 (parallelizable: true); loop 372 - 380 (parallelizable: false); if-else 667 - 678 (parallelizable: false); loop 682 - 709 (parallelizable: true); if-else 719 (parallelizable: false). Other nodes: INIT 370, CU 374 - 379, INIT 666. Legend: RAW, CU, control region; colours blue, yellow, grey.]

SLIDE 14

Code Generation

Ongoing work

Map the task graph to a CnC graph

CnC defines two scheduling constraints in parallel execution
  • Producer/consumer relationships
  • Controller/controllee relationships

A task (coarse-grained CU) is similar to a step collection

Data dependencies among tasks are known from the task graph

Detected control information is not sufficient
  • Users specify the controller/controllee relationships
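The mapping described on this slide can be sketched as a small data transformation: tasks become steps, data dependencies between tasks become producer/consumer links through an item collection, and the user-supplied control information becomes controller/controllee links. The Python below is purely illustrative; it does not use the CnC runtime or its API, and every name in it (`CncGraph`, `map_task_graph`, the task, variable, and tag names) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CncGraph:
    steps: list = field(default_factory=list)
    # (producer step, item collection, consumer step)
    producer_consumer: list = field(default_factory=list)
    # (controlling tag collection, controlled step)
    controller_controllee: list = field(default_factory=list)


def map_task_graph(tasks, data_deps, user_control):
    """Build a CnC-style graph description from a task graph.

    tasks        -- task names, one step per task
    data_deps    -- (producer task, consumer task, variable) triples
    user_control -- task name -> tag collection name, supplied by the user
    """
    graph = CncGraph(steps=list(tasks))
    for producer, consumer, variable in data_deps:
        graph.producer_consumer.append((producer, f"{variable}_items", consumer))
    for task in tasks:
        if task not in user_control:
            # Detected control information alone is not sufficient.
            raise ValueError(f"no controller specified for task {task!r}")
        graph.controller_controllee.append((user_control[task], task))
    return graph


GRAPH = map_task_graph(
    tasks=["load", "convert", "store"],
    data_deps=[("load", "convert", "rgb"), ("convert", "store", "yuv")],
    user_control={"load": "frame_tags", "convert": "frame_tags", "store": "frame_tags"},
)
```

The sketch raises an error when a task has no user-specified controller, mirroring the point that the control relationships cannot be derived from the detected information alone.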

SLIDE 15

Code Generation

Propose a CnC-specific IR template

Transform the original IR to CnC-specific IR using the task graph and users' control information

Generate binary code from the CnC-specific IR

SLIDE 16

Code Generation

Previous code transformation results
  • Source-to-source transformation using Intel TBB flow graph (semi-automatic)
  • FaceDetection (CnC sample application)

[Figure: (a) Logic of FaceDetection; (b) Flow graph]

SLIDE 17

Code Generation

Speedups on 2x8-core Intel Xeon E5-2650 2 GHz

[Plot: speedup (y-axis, ticks 5 to 20) versus number of threads (x-axis: 1, 2, 4, 8, 16, 32). Series: official manual CnC parallelization; semi-automatic TBB parallelization.]

SLIDE 18

Conclusion

Profile data and control dependencies
  • DiscoPoP
  • Users' specification

Extract coarse-grained task parallelism
  • CU graph
  • Program execution tree
  • Task graph

Generate parallel code using CnC
  • Define CnC-specific IR
  • Code transformation at IR level
  • Employ CnC runtime library

SLIDE 19

Conclusion

Thanks! Q & A
