Automatically Tuning Task-Based Programs for Multi-core Processors (PowerPoint PPT Presentation)



SLIDE 1

Automatically Tuning Task-Based Programs for Multi-core Processors

Jin Zhou Brian Demsky

Department of Electrical Engineering and Computer Science University of California, Irvine

SLIDE 2

Motivation

  • Recent microprocessor trends

  – Number of cores increased rapidly
  – Architectures vary widely

  • Challenges for software development

  – Parallelization is now key for performance
  – Current parallel programming model: threads + locks

  • Hard to develop correct and efficient parallel software
  • Hard to adapt software to changes in architectures
SLIDE 3

Goals

  • Automatically generate parallel implementation
  • Automatically tune parallel implementation
SLIDE 4

Bamboo Compiler

Overview

Bamboo Program + Processor Specification + Profile Data
  → Implementation Generator → candidate implementations
  → Simulation-based Evaluator → leading implementations
  → Implementation Optimizer → tuned implementations (fed back to the evaluator)
  → optimized implementation → Code Generator → optimized multi-core binary
  → Multi-core Processor

SLIDE 5

Example

  • MonteCarlo Example

  – Partitions the problem into several simulations
  – Executes the simulations in parallel
  – Aggregates the results of all simulations

SLIDE 6

Bamboo Language

  • A hybrid language combining data-flow and Java

  – Programs are composed of tasks
  – Tasks compose with dataflow-like semantics
  – Tasks contain Java-like object-oriented code internally
  – Programs cannot explicitly invoke tasks
  – The runtime automatically invokes tasks

  • Supports standard object-oriented constructs, including methods and classes

SLIDE 7

Bamboo Language

  • Flags

  – Capture the current role (type state) of an object in the computation
  – Each flag captures an aspect of the object's state
  – Change as the object's role evolves in the program
  – Support orthogonal classifications of objects

SLIDE 8

task startup(StartupObject s in initialstate) {
  Aggregator aggr = new Aggregator(s.args[0]){merge:=true};
  for(int i = 0; i < 4; i++)
    Simulator sim = new Simulator(aggr){run:=true};
  taskexit(s: initialstate:=false);
}

task simulate(Simulator sim in run) {
  sim.runSimulate();
  taskexit(sim: run:=false, submit:=true);
}

task aggregate(Aggregator aggr in merge, Simulator sim in submit) {
  boolean allprocessed = aggr.aggregateResult(sim);
  if (allprocessed)
    taskexit(aggr: merge:=false, finished:=true;
             sim: submit:=false, finished:=true);
  taskexit(sim: submit:=false, finished:=true);
}

class Aggregator { flag merge; flag finished; … }
class Simulator { flag run; flag submit; flag finished; ... }
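The execution model behind this code can be mocked in plain Python (an illustrative sketch only: the real Bamboo runtime is generated native code, and `Obj`, `find`, and `run_scheduler` are invented names). The key point it demonstrates is that the scheduler, not the program, invokes a task whenever objects in the required flag states exist in the global flagged object space.

```python
# Hypothetical Python mock of the Bamboo task model above. Tasks declare
# which flag state each parameter must be in; the scheduler fires a task
# whenever matching objects exist in the global flagged object space.

class Obj:
    def __init__(self, kind, **flags):
        self.kind = kind
        self.flags = dict(flags)

space = []    # the global flagged object space
results = []  # stand-in for the Aggregator's accumulated results

def find(kind, flag):
    return next((o for o in space
                 if o.kind == kind and o.flags.get(flag)), None)

def startup(s):                      # task startup(StartupObject s in initialstate)
    space.append(Obj("Aggregator", merge=True))
    for _ in range(4):
        space.append(Obj("Simulator", run=True))
    s.flags["initialstate"] = False  # taskexit(s: initialstate:=false)

def simulate(sim):                   # task simulate(Simulator sim in run)
    sim.result = 1.0                 # stand-in for sim.runSimulate()
    sim.flags.update(run=False, submit=True)

def aggregate(aggr, sim):            # task aggregate(aggr in merge, sim in submit)
    results.append(sim.result)       # stand-in for aggr.aggregateResult(sim)
    sim.flags.update(submit=False, finished=True)
    if len(results) == 4:            # allprocessed
        aggr.flags.update(merge=False, finished=True)

TASKS = [(startup,   [("StartupObject", "initialstate")]),
         (simulate,  [("Simulator", "run")]),
         (aggregate, [("Aggregator", "merge"), ("Simulator", "submit")])]

def run_scheduler():
    space.append(Obj("StartupObject", initialstate=True))
    progress = True
    while progress:                  # fire any task whose parameters match
        progress = False
        for fn, params in TASKS:
            args = [find(kind, flag) for kind, flag in params]
            if all(args):
                fn(*args)
                progress = True

run_scheduler()
print(len(results))                  # → 4: all Simulators simulated and aggregated
```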

SLIDE 9

Bamboo Program Execution

[Diagram: at runtime initialization, a new StartupObject in the initialstate is placed into the global flagged object space.]

SLIDE 10

Bamboo Program Execution

[Diagram: the startup task executes on the StartupObject in the initialstate.]

SLIDE 11

Bamboo Program Execution

[Diagram: the startup task creates one Aggregator in the merge state and four Simulators in the run state, and clears the StartupObject's initialstate flag.]

SLIDE 12

Bamboo Program Execution

[Diagram: simulate tasks execute in parallel on the four Simulators in the run state.]

SLIDE 13

Bamboo Program Execution

[Diagram: each simulate task moves its Simulator from the run state to the submit state.]

SLIDE 14

Bamboo Program Execution

[Diagram: the aggregate task executes on the Aggregator in the merge state and a Simulator in the submit state.]

SLIDE 15

Bamboo Program Execution

[Diagram: the aggregate task moves the consumed Simulator to the finished state; the Aggregator stays in the merge state.]

SLIDE 16

Bamboo Program Execution

[Diagram: the aggregate task executes on the next Simulator in the submit state.]

SLIDE 17

Bamboo Program Execution

[Diagram: that Simulator also moves to the finished state.]

SLIDE 18

Bamboo Program Execution

[Diagram: the aggregate task executes on the third Simulator in the submit state.]

SLIDE 19

Bamboo Program Execution

[Diagram: the third Simulator moves to the finished state.]

SLIDE 20

Bamboo Program Execution

[Diagram: the aggregate task executes on the last Simulator in the submit state.]

SLIDE 21

Bamboo Program Execution

[Diagram: after the last Simulator is aggregated, both it and the Aggregator move to the finished state, and the execution completes.]

SLIDE 22

Implementation Generation

[Diagram: within the Bamboo compiler, the Implementation Generator takes the Bamboo program, the processor specification, and profile data, and emits candidate implementations.]

SLIDE 23

Implementation Generation

  • Dependence Analysis: analyzes data dependences between tasks
  • Parallelism Exploration: extracts potential parallelism
  • Mapping to Cores: maps the program to the physical processor
SLIDE 24

Flag State Transition Graph (FSTG)

Simulator's FSTG: run -> (simulate: 32 Mcyc; 100%) -> submit -> (aggregate: 2 Mcyc; 100%) -> finished

SLIDE 25

Combined Flag State Transition Graph (CFSTG)

StartupObject: initialstate -> (startup: 3 Mcyc; 100%) -> finished
Simulator: run -> (simulate: 32 Mcyc; 100%) -> submit -> (aggregate: 2 Mcyc; 100%) -> finished
Aggregator: merge -> (aggregate: 2 Mcyc; 75%) -> merge, merge -> (aggregate: 2 Mcyc; 25%) -> finished

Creation edges are labeled with the number of new objects: the startup task creates 1 Aggregator and 4 Simulators.

SLIDE 26

Core Group

Initial Mapping

[Diagram: the CFSTG with its initial mapping; each class's transition graph forms its own core group, with creation edges labeled 1 (Aggregator) and 4 (Simulators).]

SLIDE 27

Preprocessing Phase

  • Identifies strongly connected components (SCCs) and merges each SCC into a single core group
  • Converts the CFSTG into a tree of core groups by replicating core groups as necessary
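The SCC-merging step can be sketched in a few lines (an assumed detail: the slides do not say which SCC algorithm Bamboo uses; Tarjan's is a standard choice). Merging each SCC of the core-group graph into one node yields an acyclic graph that the preprocessing phase can then replicate into a tree.

```python
# Tarjan's strongly-connected-components algorithm over a toy core-group
# graph; each SCC found here would be merged into one core group.

def sccs(graph):
    """graph maps node -> list of successor nodes; returns a list of SCCs."""
    index, low, stack, on_stack = {}, {}, [], set()
    components, counter = [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:            # v is the root of an SCC: pop it
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            components.append(frozenset(comp))

    for v in graph:
        if v not in index:
            strongconnect(v)
    return components

# Toy graph: simulate and aggregate feed each other, so they form one SCC
# and would be merged into a single core group.
g = {"startup": ["simulate"],
     "simulate": ["aggregate"],
     "aggregate": ["simulate"]}
print(sorted(len(c) for c in sccs(g)))    # → [1, 2]
```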

SLIDE 28

Data Locality Rule

[Diagram: under the data locality rule, the StartupObject, Aggregator, and Simulator core groups are merged into a single core group so that producers and consumers share a core.]

  • Default rule
  • Maximizes data locality to improve performance
    – Minimizes inter-core communication
    – Improves cache behavior

SLIDE 29

Data Parallelization Rule

  • Explores potential data parallelism

[Diagram: the Simulator core group of multiplicity 4 is split into four Simulator core groups of multiplicity 1, so the four simulations can execute on different cores.]

SLIDE 30

Rate Matching Rule

  • If the producer executes multiple times in a cycle, how many consumers are required?
  • Match two rates to estimate the number of consumers:
    – Peak new-object creation rate
    – Object consumption rate

[Diagram: a Producer core group cycling between init and produce states feeds several replicated Consumer core groups in the run state.]
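The rate-matching estimate can be written out as a small calculation (the slide does not give the exact formula; a natural reading is the ceiling of the ratio of the two rates, so that the consumers collectively keep up with the producer):

```python
# Hypothetical rate-matching calculation: replicate the consumer core
# group until its aggregate consumption rate matches the producer's peak
# object-creation rate.

import math

def consumers_needed(peak_creation_rate, consumption_rate):
    """Rates in objects per unit time; returns the consumer group count."""
    return math.ceil(peak_creation_rate / consumption_rate)

# e.g. a producer emitting 4 objects per cycle of its FSTG, against a
# consumer that retires 1 object in the same time, needs 4 consumer groups
print(consumers_needed(4, 1))   # → 4
print(consumers_needed(3, 2))   # → 2
```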

SLIDE 31

Mapping to Processor

  • Constraint: a limited number of cores
  • Map CFSTG core groups to physical cores
  • Extended CFSTG

[Diagram: six core groups (StartupObject, Aggregator, and four Simulators) to be placed on Core 1 and Core 2.]

SLIDE 32

Mapping to Cores

  • One possible mapping

[Diagram: one assignment of the six core groups to Core 1 and Core 2.]

SLIDE 33

Mapping to Cores

  • Isomorphic mappings have the same performance
  • A backtracking-based search generates only non-isomorphic implementations

[Diagram: two mappings that differ only by swapping Core 1 and Core 2 are isomorphic, so only one needs to be evaluated.]

SLIDE 34

Implementation Generation

[Diagram: within the Bamboo compiler, the Simulation-based Evaluator and the Implementation Optimizer iterate: candidate implementations are evaluated, leading implementations are passed to the optimizer, and tuned implementations are fed back to the evaluator until an optimized implementation remains.]

SLIDE 35

Simulation-Based Evaluation

  • To select the best candidate implementation
  • High-level simulation

  – Does NOT actually execute the program
  – Constructs an abstract execution trace with similar statistics
  – Compares execution time (or throughput) and core usage

[Diagram: the simulator models each core executing a queue of tasks.]

SLIDE 36

Simulation-Based Evaluation

  • Markov model

  – Built from profile data
  – For each task, estimates:
    • The destination state
    • The execution time
    • A count of each type of new objects

[Diagram: the CFSTG annotated with profiled execution times and transition probabilities, e.g. the Aggregator's aggregate task takes 2 Mcyc and returns to merge 75% of the time and reaches finished 25% of the time.]
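The Markov-model simulation on these two slides can be sketched as follows (a hypothetical mock: the real evaluator also models cores, task queues, and object transfers). Instead of executing the program, each task's outcome is drawn from the profile-derived model to build an abstract trace.

```python
# High-level simulation sketch: state -> (task, cost in Mcyc,
# [(destination state, probability)]), using the Simulator's profiled
# numbers from its FSTG.

import random

MODEL = {
    "run":    ("simulate",  32, [("submit",   1.0)]),
    "submit": ("aggregate",  2, [("finished", 1.0)]),
}

def simulate_objects(n_objects, start_state="run", seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(n_objects):
        state = start_state
        while state in MODEL:              # "finished" has no outgoing edges
            _task, cycles, dests = MODEL[state]
            total += cycles                # charge the task's profiled time
            r, acc = rng.random(), 0.0
            for nxt, p in dests:           # sample the destination state
                acc += p
                if r <= acc:
                    state = nxt
                    break
    return total

print(simulate_objects(4))   # → 136, i.e. 4 * (32 + 2) Mcyc of serial work
```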

SLIDE 37

Simulated Execution Trace

[Diagram: simulated two-core execution trace (times in Mcyc): core 0 runs startup (done at 3), keeps the Aggregator and several Simulators, and drains the remaining simulate and aggregate invocations through times 35, 67, 99, 101, 103, and 105, going empty at 107; core 1 receives transferred Simulators at 4 and 36 and runs simulate on them. The annotation marks the point with 1 Aggregator in its initial state and 4 Simulators in the submit state.]

SLIDE 38

Problem of Exhaustive Searching

  • The search space expands quickly
  • Exhaustive search is not feasible for complicated applications

Number of CFSTG Core Groups | Number of Cores | Number of Candidates
32                          | 16              | > 6,000
64                          | 32              | > 14,000,000
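One way to see why the candidate count explodes (an illustrative counting model only: it treats a candidate as a partition of core groups among interchangeable cores, which is not exactly how the slide counts candidates, so the numbers differ from the table):

```python
# Count ways to place `groups` core groups on at most `cores`
# interchangeable cores, via Stirling numbers of the second kind.

from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Partitions of n labeled items into exactly k unlabeled blocks."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    if k == n or k == 1:
        return 1
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def mappings(groups, cores):
    return sum(stirling2(groups, k) for k in range(1, cores + 1))

print(mappings(8, 4))    # → 2795, already large for a tiny configuration
```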

SLIDE 39

Random Search?

  • Very low chance of finding the best implementation

[Chart: the chance of finding the best implementation by random search.]

SLIDE 40

Developer Optimization Process

  • Create an initial implementation
  • Evaluate it and identify performance bottlenecks
  • Heuristically develop new implementations to remove bottlenecks
  • Iteratively repeat evaluation and optimization
SLIDE 41

Directed Simulated Annealing (DSA)

[Diagram: the DSA loop. Randomly generated candidate implementations are evaluated by the high-level simulator; as-built critical path analysis of the leading candidate implementations identifies potential bottlenecks; the Implementation Generator turns those bottlenecks into new candidate implementations, and the loop repeats until a tuned candidate implementation emerges.]
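The annealing half of DSA can be sketched as follows (simplified: the real DSA uses the high-level simulator as its cost function and bottleneck-directed moves; here `cost` and `neighbor` are toy stand-ins). Worse neighbors are accepted with a probability that decays as the temperature cools, so the search can escape local minima.

```python
# Simulated-annealing skeleton plus a toy mapping problem: assign six
# core groups to two cores so as to minimize the makespan, with per-group
# costs loosely based on the CFSTG's Mcyc numbers.

import math, random

def anneal(cost, neighbor, start, steps=2000, t0=10.0, seed=0):
    rng = random.Random(seed)
    cur, cur_cost = start, cost(start)
    best, best_cost = cur, cur_cost
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-9       # linear cooling schedule
        cand = neighbor(cur, rng)
        c = cost(cand)
        if c < cur_cost or rng.random() < math.exp((cur_cost - c) / t):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
    return best, best_cost

work = [3, 32, 32, 32, 32, 2]                 # Mcyc per core group

def cost(assign):                             # makespan of a 2-core mapping
    loads = [0, 0]
    for w, core in zip(work, assign):
        loads[core] += w
    return max(loads)

def neighbor(assign, rng):                    # move one group to the other core
    a = list(assign)
    a[rng.randrange(len(a))] ^= 1
    return tuple(a)

best, best_cost = anneal(cost, neighbor, (0,) * len(work))
print(best_cost)                              # → 67, the optimal makespan here
```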

SLIDE 42

As-Built Critical Path (ABCP)

  • As-built critical path analysis: a post-mortem analysis technique from project management, applied here to the simulated execution trace

[Diagram: the chosen mapping and its simulated two-core execution trace (as on Slide 37).]

SLIDE 43

As-Built Critical Path Analysis

  • Compute the time at which each task invocation's data dependences are resolved

[Diagram: the simulated two-core trace from Slide 37, annotated with dependence-resolution times.]
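This computation can be illustrated in a few lines (hypothetical data: the tuples below are loosely modeled on the two-core trace, not read off the slide exactly). For each invocation, compare when its inputs were ready with when it actually started; a positive gap means the task waited on a busy core, not on data.

```python
# (invocation, ready_time, actual_start), times in Mcyc
trace = [
    ("simulate#1",  3,  3),    # started as soon as startup produced it
    ("simulate#2",  3,  4),    # started right after transfer to core 1
    ("simulate#3",  3, 35),    # data ready at 3, but its core was busy
    ("simulate#4",  3, 36),
    ("aggregate#1", 35, 99),   # first Simulator submitted at 35
]

def waiting_tasks(trace):
    """Tasks invoked later than their dependence-resolution time."""
    return [(name, start - ready)
            for name, ready, start in trace if start > ready]

for name, wait in waiting_tasks(trace):
    print(name, "waited", wait, "Mcyc")
```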

SLIDE 44

Waiting Task Optimization

  • Waiting tasks:

  – Tasks whose real invocation time is later than the time when all their data dependences are resolved
  – Delayed because of resource conflicts
  – These are bottlenecks; remove them from the ABCP

  • Optimization

  – Migrate waiting tasks to spare cores
  – Shorten the ABCP to improve performance

SLIDE 45

Critical Task Optimization

  • There may not be spare cores to move waiting tasks to
  • Identify critical tasks: tasks that produce data that is consumed immediately
  • Attempt to execute critical tasks as early as possible
  • Migrate other tasks that block some critical task to other cores

[Diagram: a fragment of the two-core trace; a task blocking a critical task on core 0 is migrated to core 1.]

SLIDE 46

Code Generator

[Diagram: the Code Generator takes the optimized implementation, emits intermediate C code, and compiles it into the optimized multi-core binary.]

SLIDE 47

Evaluation

  • MIT RAW simulator
    – Cycle-accurate simulator configured for 16 cores
    – RAW chip: tiled architecture, shared memory, on-chip network
  • Benchmarks:
    – Series: Java Grande benchmark suite
    – MonteCarlo: Java Grande benchmark suite
    – FilterBank: StreamIt benchmark suite
    – Fractal

SLIDE 48

Speedups on 16 cores

Benchmark   | Clock Cycles (10^6 cyc)        | Speedup vs.
            | 1-Core Bamboo | 16-Core Bamboo | 1-Core Bamboo
Series      | 26.4          | 1.8            | 14.7
Fractal     | 38.4          | 3.3            | 11.6
MonteCarlo  | 191.7         | 19.0           | 10.1
FilterBank  | 91.2          | 6.7            | 13.6

  • Successfully generated implementations with good performance
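The speedup column follows directly from the cycle counts (speedup = 1-core cycles / 16-core cycles, rounded to one decimal):

```python
# Recomputing the table's speedups from its clock-cycle numbers.

cycles = {                     # benchmark: (1-core Mcyc, 16-core Mcyc)
    "Series":     (26.4,  1.8),
    "Fractal":    (38.4,  3.3),
    "MonteCarlo": (191.7, 19.0),
    "FilterBank": (91.2,  6.7),
}

for name, (one_core, sixteen_core) in cycles.items():
    print(f"{name}: {one_core / sixteen_core:.1f}x")
# → Series: 14.7x, Fractal: 11.6x, MonteCarlo: 10.1x, FilterBank: 13.6x,
#   matching the table's speedup column
```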

SLIDE 49

Comparison to Hand-Written C Code

Benchmark   | Clock Cycles (10^6 cyc)                   | Speedup to | Overhead
            | 1-Core C | 1-Core Bamboo | 16-Core Bamboo | 1-Core C   | of Bamboo
Series      | 25.0     | 26.4          | 1.8            | 13.9       | 5.6%
Fractal     | 36.2     | 38.4          | 3.3            | 11.0       | 6.1%
MonteCarlo  | 138.8    | 191.7         | 19.0           | 7.3        | 38.1%
FilterBank  | 71.1     | 91.2          | 6.7            | 10.6       | 28.3%

  • Overhead of Bamboo:
    – Small for Series and Fractal
    – Larger for MonteCarlo and FilterBank:
      • GCC cannot reorder instructions to fill floating-point delay slots for the Bamboo implementations due to imprecise alias results
      • It would be easy to add alias information to enable the reordering
SLIDE 50

Comparison of Estimation and Real Execution

  • The simulation estimates are close to the real execution times

Benchmark   | 1-Core Bamboo Binary (10^6 cyc) | 16-Core Bamboo Binary (10^6 cyc)
            | Estimated | Real  | Error       | Estimated | Real | Error
Series      | 26.3      | 26.4  | 0.38%       | 1.7       | 1.8  | 5.56%
Fractal     | 38.4      | 38.4  | 0%          | 3.1       | 3.3  | 6.06%
MonteCarlo  | 191.0     | 191.7 | 0.37%       | 18.3      | 19.0 | 3.68%
FilterBank  | 91.2      | 91.2  | 0%          | 6.5       | 6.7  | 2.99%

SLIDE 51

Optimality of Directed Simulated Annealing

SLIDE 52

Fractal

SLIDE 53

MonteCarlo

SLIDE 54

FilterBank

SLIDE 55

Generality of Synthesized Implementation

  • The speedups of the two 16-core Bamboo versions are similar
  • Successfully generated a sophisticated implementation utilizing pipelining for MonteCarlo

Benchmark   | 1-Core (10^6 cyc) | Profile_original, Input_double   | Profile_double, Input_double
            |                   | 16-Core (10^6 cyc) | Speedup     | 16-Core (10^6 cyc) | Speedup
Series      | 54.2              | 3.6                | 15.1        | 3.6                | 15.1
Fractal     | 76.6              | 6.5                | 11.8        | 6.5                | 11.8
MonteCarlo  | 383.2             | 37.8               | 10.1        | 35.7               | 10.7
FilterBank  | 182.3             | 13.3               | 13.7        | 13.3               | 13.7

SLIDE 56

Related Work

  • Data-flow and streaming languages:
    – Bamboo relaxes typical restrictions in these models to permit:
      • Flexible mutation of data structures
      • Data structures of arbitrary complexity
    – Bamboo supports applications that non-deterministically access data
  • Tuple-space languages: the compiler cannot automatically create multiple instantiations to utilize multiple cores
  • Self-tuning libraries: mostly address specific computations

SLIDE 57

Conclusion

  • We developed a new approach to automatically tune task-based programs for multi-core processors
    – Automatically generates parallel implementations
    – Automatically tunes them for the specific architecture
  • The approach was evaluated on the MIT RAW simulator
    – Successfully generated implementations with good performance
    – Successfully generated a sophisticated implementation utilizing pipelining
  • The approach can be extended to the broader context of traditional programming languages

SLIDE 58

Thank you!