Latency-preserving software pipelining of predicated reservation tables for distributed hard real-time applications
SLIDE 1

14/12/11

Latency-preserving software pipelining of predicated reservation tables for distributed hard real-time applications

Thomas Carle and Dumitru Potop-Butucaru, INRIA Paris-Rocquencourt, France (Team AOSTE)

SLIDE 2

  • Throughput optimization problem
  • Previous work (software pipelining)
  • System models (pipelined and non-pipelined)
  • Pipelining algorithms
  • A complex example
  • Conclusion and future work

Outline

SLIDE 3

Complex embedded control applications:

  • Cyclic, periodic execution
  • Safety-critical applications
  • Hard real-time constraints
  • Focus on functional and temporal correctness
  • Distributed implementations

Application areas

SLIDE 4

Our work focuses on static schedules (scheduling/reservation tables):

  • Validated by industrial standards: ARINC 653, AUTOSAR, FlexRay, ...
  • A table defines one cycle of execution, repeated periodically

Static scheduling

SLIDE 5

Code:

  v2 := v2_init;
  loop
    (v1, c) = f(v2);
    if c then v2 := g(v1) else m(v1);
    h(v2);
  end

[Figure: processors P1, P2, P3 sharing a RAM holding c, v1, v2; durations f: 1, g: 1, m: 2, h: 1; Gantt chart (time vs. resource) of one computation cycle scheduling f, if(c) g, if(¬c) m, and h.]

Motivating example
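The cyclic code above can be sketched in Python; the bodies of f, g, m and h below are stubs (assumptions), since the slides only fix the control structure and the durations:

```python
# Python sketch of the motivating example's loop.  The stub bodies are
# assumptions; the slides only give the control structure.

def f(v2):
    # Produces a value and a condition from v2 (stub).
    return v2 + 1, v2 % 2 == 0

def g(v1):
    return v1 * 2   # stub

def m(v1):
    pass            # side effect only; stub

def h(v2):
    return v2       # stub

def run_cycles(v2_init, n_cycles):
    """Run n_cycles iterations of: (v1,c)=f(v2); if c: v2=g(v1) else m(v1); h(v2)."""
    v2 = v2_init
    trace = []
    for _ in range(n_cycles):
        v1, c = f(v2)
        if c:
            v2 = g(v1)
        else:
            m(v1)
        h(v2)
        trace.append(v2)
    return trace
```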

SLIDE 6

[Figure: Gantt chart of repeated non-pipelined computation cycles on P1, P2, P3; each cycle runs f, if(c) g, if(¬c) m, h, and the end-to-end latency equals throughput⁻¹.]

Latency: number of time units between the beginning and the end of the execution of a cycle.
Throughput: number of cycles executed in one time unit.

Motivating example
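With illustrative numbers (assumptions, not taken from the slides), the relationship between latency, period and throughput works out as:

```python
# Illustrative numbers (assumptions): one cycle has end-to-end latency 4.
latency = 4

# Non-pipelined execution: a cycle starts only after the previous one
# ends, so the period equals the latency and throughput^-1 = latency.
period_non_pipelined = latency
throughput_non_pipelined = 1 / period_non_pipelined

# Pipelined execution: the kernel repeats every 2 time units while the
# end-to-end latency of each cycle stays at 4.
period_pipelined = 2
throughput_pipelined = 1 / period_pipelined
```

Throughput doubles while the latency of each cycle is preserved, which is the whole point of the approach.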

SLIDE 7

[Figure: pipelined Gantt chart with a prologue followed by a steady state; successive cycles (conditions c0, c1, c2, ...) overlap, so throughput⁻¹ becomes less than or equal to the latency while the end-to-end latency itself is unchanged.]

Goal: increase throughput while keeping the system's latency, I/O function, and periodic behaviour.

Our objective

SLIDE 8

[Figure: the same pipelined Gantt chart, with the kernel highlighted inside the steady state.]

The prologue and the steady state are instances of the kernel.

Our objective

SLIDE 9

Scheduling techniques for parallelizing loop computations:

  • Developed since the 1980s,
  • First aimed at massively parallel architectures such as VLIW and superscalar machines,
  • Now a common optimization, present in most compilers,
  • Similar to hardware pipelining: out-of-order execution,
  • Reordering done in the compiler instead of in the processor.

Previous work (1): Software Pipelining

SLIDE 10

  • Low-level vs. coarse-grain code generation technique,
  • Goal: optimize average-case throughput by reordering operations to exploit parallelism vs. optimize worst-case throughput without degrading cycle latency by preserving the intra-cycle schedule,
  • No periodicity for applications with data-dependent control vs. preservation of the periodic behaviour of the application,
  • Low degree of control over operators/functional units for conditional execution vs. exploitation of conditional execution to improve the pipelining process.

Software Pipelining vs our work

SLIDE 11

  • Optimization method in which registers in a synchronous circuit are relocated in order to improve the throughput or memory consumption of an application,
  • Very similar to our techniques: e.g. no increase in latency after applying the retiming transformations, preservation of the I/O function,
  • Nevertheless: no support for conditional execution/predication.

Previous work (2): Retiming

SLIDE 12

  • Builds a pipelined schedule for the application,
  • Demonstrated on e.g. multimedia streaming applications,
  • Again, no optimization for conditional execution.

Previous work (3): Real-Time Software Pipelining

SLIDE 13

[Diagram: the algorithms transform an initial non-pipelined scheduling table into a pipelined scheduling table, guided by an architecture model.]

We design low-level implementation models that can be integrated at the end of the development cycle.

Elements of our approach

SLIDE 14

Architecture model

Bipartite undirected graph A = <P, M, C>, where:

  • P: "processors", i.e. computation and communication resources capable of independent execution (processors, DMAs, ...),
  • M: RAM blocks,
  • (P, M) ∈ C indicates that processor P has direct access to memory block M.

RAM blocks are sets of disjoint untyped memory cells.

Example: [Figure: processors P1, P2, P3, all connected to one RAM block holding the cells v1, v2, c.]
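A minimal Python encoding of this model (field and type names are illustrative, not from the slides):

```python
# Minimal encoding of the architecture model A = <P, M, C> as a
# bipartite undirected graph.
from dataclasses import dataclass, field

@dataclass
class Architecture:
    processors: set          # P: computation/communication resources
    rams: dict               # M: RAM block name -> set of memory cells
    connections: set = field(default_factory=set)  # C, pairs (P, M)

    def can_access(self, p, m):
        """True iff processor p has direct access to RAM block m."""
        return (p, m) in self.connections

# The example from the slide: P1, P2, P3 all share one RAM block
# holding the cells v1, v2 and c.
arch = Architecture(
    processors={"P1", "P2", "P3"},
    rams={"RAM": {"v1", "v2", "c"}},
    connections={("P1", "RAM"), ("P2", "RAM"), ("P3", "RAM")},
)
```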

SLIDE 15

S = <p, O, Init>, where:

  • p: activation period of the execution cycles, equal to the length of the reservation table,
  • O: set of scheduled operations,
  • Init: set of initial values of all memory cells (each can be nil or a constant).

Reservation/Scheduling table

SLIDE 16

Scheduled operation o:

  • In(o): set of memory cells whose data is used as input by o,
  • Out(o): set of memory cells written by o,
  • Guard(o): execution condition of o (a predicate over memory cells),
  • Res(o): set of "processors" used during the execution of o,
  • t(o): start date of o,
  • d(o): duration of o, a maximum time budget which can be ensured through WCET analysis.
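One way to encode such an operation record in Python (field names mirror the slide; the example values are illustrative):

```python
# A scheduled operation o, with the fields listed on the slide.
from dataclasses import dataclass

@dataclass
class Operation:
    name: str
    inputs: frozenset      # In(o): memory cells read
    outputs: frozenset     # Out(o): memory cells written
    guard: str             # Guard(o): predicate over memory cells
    resources: frozenset   # Res(o): "processors" used
    t: int                 # start date
    d: int                 # duration (WCET-backed time budget)

    def end(self):
        """Date at which o's time budget expires."""
        return self.t + self.d

# "g" from the motivating example (placement is an assumption): runs on
# P2 under guard c, reads v1 and writes v2, start date 1, duration 1.
op_g = Operation("g", frozenset({"v1"}), frozenset({"v2"}),
                 "c", frozenset({"P2"}), t=1, d=1)
```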

Reservation/Scheduling table

SLIDE 17

Reservation/Scheduling table

Well-formedness properties:

  • Exclusive resource use: for O1, O2 scheduled on the same resource, if Guard(O1) ∧ Guard(O2) ≠ false, then t(O1) ≥ t(O2) + d(O2) or t(O2) ≥ t(O1) + d(O1),
  • No data races: if O1 writes variable v1 and O2 uses (reads or writes) v1, then t(O1) ≥ t(O2) + d(O2) or t(O2) ≥ t(O1) + d(O1), or Guard(O1) ∧ Guard(O2) = false,
  • Causal correctness.

This is enough to describe non-pipelined schedules.
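The two timing-related checks can be sketched in Python; the guard representation and the `exclusive` predicate are assumptions for the sketch (only the c / ¬c pair is handled):

```python
# Sketch of the well-formedness checks, for operations given as dicts
# {"t", "d", "res", "writes", "uses", "guard"}.

def exclusive(g1, g2):
    """Whether g1 AND g2 = false.  Hard-coded c / not-c pair (assumption)."""
    return {g1, g2} == {"c", "not c"}

def disjoint_in_time(o1, o2):
    return o1["t"] >= o2["t"] + o2["d"] or o2["t"] >= o1["t"] + o1["d"]

def well_formed(ops):
    for i, o1 in enumerate(ops):
        for o2 in ops[i + 1:]:
            if exclusive(o1["guard"], o2["guard"]):
                continue  # never both active in the same cycle
            # Exclusive resource use.
            if o1["res"] & o2["res"] and not disjoint_in_time(o1, o2):
                return False
            # No data races: a write overlapping any use of the same cell.
            races = (o1["writes"] & (o2["writes"] | o2["uses"]) or
                     o2["writes"] & (o1["writes"] | o1["uses"]))
            if races and not disjoint_in_time(o1, o2):
                return False
    return True

# g and m from the motivating example, placed on the same resource (an
# assumption): they overlap in time, but guards c / not-c make them
# exclusive, so the table is still well formed.
g = {"t": 1, "d": 1, "res": {"P2"}, "writes": {"v2"}, "uses": {"v1"}, "guard": "c"}
m = {"t": 1, "d": 2, "res": {"P2"}, "writes": set(), "uses": {"v1"}, "guard": "not c"}
```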

SLIDE 18

For pipelined schedules, each scheduled operation o also has a start index fst(o). It accounts for the prologue phase, where operations progressively start to execute: if operation o has fst(o) = n, it is first executed in the pipelined cycle of index n. Due to periodicity, describing the kernel schedule with its start indexes is enough to describe the whole execution of the system.

Pipelined Reservation/Scheduling table

Memory elements can be modified to take into account the variable replication process (described later).

SLIDE 19

[Figure: pipelined Gantt chart on P1, P2, P3 over time 1–5, showing the prologue and steady state; successive iterations run f, if(Ci) g, if(¬Ci) m, h under conditions C0, C1, C2.]

Pipelined Reservation/Scheduling table

SLIDE 20

[Figure: the pipelined table on P1, P2, P3, with pipelined iterations 0, 1 and 2 unrolled; h and the operations moved past the cycle boundary are annotated fst = 1.]

Pipelined Reservation/Scheduling table

Resource Time

SLIDE 21

Pipelining algorithm

  • Constraints:
      • need to respect inter-cycle data dependencies,
      • no two operations can use a "processor" at the same time,
      • no memory cell can be written by an operation and used (written or read) by another at the same time.
  • Our algorithm:
      • enforces the fulfilment of these constraints,
      • incrementally builds the Data Dependency Graph of the application,
      • takes advantage of guards during pipelining (better than existing work),
      • uses specific memory handling.
SLIDE 22

Pipelining algorithm

  • Relies on the incremental construction of the Data Dependency Graph (DDG) of the application, i.e. the set {(o1, o2, n)} of all triples such that In(o2) ∩ Out(o1) ≠ ∅ and o1 happens n cycles before o2,
  • Uses an SSA transformation before performing a symbolic execution of the different iterations in order to construct the DDG.
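The DDG construction can be sketched by symbolically executing a few unrolled iterations with last-writer tracking (a simplification: guards and the SSA renaming are ignored here, and the operation list is an assumption):

```python
# Sketch of DDG construction by symbolic execution over unrolled
# iterations: an arc (o1, o2, n) is recorded when o2 reads a cell
# whose last writer was o1, n cycles earlier.

def build_ddg(ops, n_iters=3):
    """ops: list of (name, reads, writes) in schedule order within a cycle."""
    last_writer = {}   # memory cell -> (op name, iteration index)
    ddg = set()
    for it in range(n_iters):
        for name, reads, writes in ops:
            for cell in reads:
                if cell in last_writer:
                    src, src_it = last_writer[cell]
                    ddg.add((src, name, it - src_it))
            for cell in writes:
                last_writer[cell] = (name, it)
    return ddg

# The motivating example: g reads v1 (written by f in the same cycle),
# h reads v2, and f reads the v2 written by g one cycle earlier.
ops = [("f", {"v2"}, {"v1", "c"}),
       ("g", {"v1"}, {"v2"}),
       ("h", {"v2"}, set())]
ddg = build_ddg(ops)
```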

SLIDE 23

[Figure: initial scheduling table over time 1–6 on P1–P4: c := ¬c on P1; if(c) v2 := f1(v1) / if(¬c) w2 := g1(w1) on P2; if(c) v3 := f2(v2) / if(¬c) w3 := g2(w2) on P3; if(c) v1 := f3(v3) / if(¬c) w1 := g3(w3) on P4.]

Pipelining algorithm


SLIDE 24

[Figure: the same table with a second iteration unrolled after SSA renaming of the condition: c1 := ¬c, followed by the if(c1) / if(¬c1) versions of f1/g1, f2/g2, f3/g3.]

Pipelining algorithm


SLIDE 25

[Figure: a third iteration unrolled with condition c2 := ¬c1.] The algorithm is complete: the first repetition is fully covered, so the symbolic unrolling can stop.

Pipelining algorithm


SLIDE 26

[Figure: the three unrolled iterations again, used to extract the inter-cycle dependency arcs.]

Pipelining algorithm

Make the repetition periodic:

  new_period = max over (o1, o2, n) ∈ DDG of (t(o1) + d(o1) - t(o2)) / n
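In Python, with the DDG as a set of (o1, o2, n) arcs: only inter-cycle arcs (n ≥ 1) constrain the period, since intra-cycle arcs (n = 0) are already respected by the table itself; rounding up to an integer date is an assumption of the sketch.

```python
import math

# Period computation from the slide: each inter-cycle arc (o1, o2, n)
# requires t(o2) + n * period >= t(o1) + d(o1), hence the max below.

def new_period(ddg, t, d):
    return max(math.ceil((t[o1] + d[o1] - t[o2]) / n)
               for (o1, o2, n) in ddg if n >= 1)
```

For instance, a single arc saying that f (starting at date 0) reads, one cycle later, what g (date 1, duration 1) wrote forces a period of at least 2.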

SLIDE 27

[Figure: the unrolled iterations re-timed with the computed period: new_period := 3.]

Pipelining algorithm


SLIDE 28

[Figure: the resulting kernel on P1–P4, with operations annotated by their guards (@c / @¬c), their start indexes fst = 0 or fst = 1, and the SSA indices folded back into ci, ci-1, ci-2.]

Pipelining algorithm

Build the kernel

SLIDE 29

[Figure: the initial cycle and the pipelined table side by side on P1, P2, P3; the kernel has size 2 and h carries fst = 1.]

Pipelined Reservation/Scheduling table construction

fst = ⌊old_start_date / new_period⌋
new_start_date = old_start_date - fst * new_period
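The two formulas translate directly to Python (floor division implements the floor):

```python
# Fold an operation's old start date into the kernel: fst counts the
# whole new periods elapsed before the operation first runs, and the
# remainder is its start date inside the kernel.

def fold(old_start_date, period):
    fst = old_start_date // period
    new_start_date = old_start_date - fst * period
    return fst, new_start_date
```

With a kernel of size 2, an operation that used to start at date 3 gets fst = 1 and kernel date 1, matching the h annotation in the figure.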

SLIDE 30

Memory aspects

[Figure: live zones of a variable in two successive cycles on the time axis; fst1/lst1 and fst2/lst2 mark its first and last use dates in cycles 1 and 2, showing the overlap that forces replication.]

SLIDE 31

Memory aspects

Problem: if a variable is used at the same time in two or more pipelined iterations, that variable must be replicated. Memory management is achieved the following way:

  • The replication factor rep(v) is computed for each variable v,
  • Each memory cell v of the initial non-pipelined scheduling table is replaced by rep(v) memory cells, allocated on the same memory block as v and assigned in a cyclic fashion to the successive computation cycles.
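The cyclic replica allocation can be sketched as follows (the replication factors and replica naming scheme are illustrative assumptions):

```python
# Sketch of the variable-replication step: rep(v) copies of each cell v
# are created, and successive computation cycles pick copies cyclically.

def replicate(cells_rep):
    """cells_rep: {cell: rep(v)} -> {cell: [replica names]}"""
    return {v: [f"{v}_{k}" for k in range(r)] for v, r in cells_rep.items()}

def replica_for_cycle(replicas, v, cycle):
    """Replica of v used by computation cycle `cycle` (cyclic allocation)."""
    copies = replicas[v]
    return copies[cycle % len(copies)]

# Illustrative factors: v2 is live across two overlapping iterations,
# c is not, so rep(v2) = 2 and rep(c) = 1.
replicas = replicate({"v2": 2, "c": 1})
```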

SLIDE 32

Knock controller example

Industrial case study: a function of the engine control unit for gasoline spark-ignition engines. Objective: choose the ignition time for each cylinder at each rotation in order to maximize power output while avoiding autoignition as much as possible.

High-level description of the control loop function:

[Diagram: Acquisition → Filtering → Detection → Correction pipeline, with delay (∆) elements feeding back the samples and the acquisition window position.]

SLIDE 33

Knock controller example

  • One acquisition device,
  • One microcontroller for filtering, detection, and correction,
  • Two buffers and their independent controllers.

[Figure: the acquisition device writes into BUF1/BUF2, which feed the microcontroller µC; the RAM holds buf1, buf2, the configurations Config0, Config1, Config2, and the condition c.]

SLIDE 34

Knock controller example

[Figure: non-pipelined reservation table over rotation units 1–5 on resources AD, BUF1, BUF2, µC: book, then if(c) Acq1 / if(¬c) Acq2 on the buffers, then if(c) FDC1 / if(¬c) FDC2 on the microcontroller.]

SLIDE 35

[Figure: the same table repeated over rotation units 1–8 for successive cycles with conditions c, c1, c2, ..., before pipelining.]

SLIDE 36

Knock controller example

[Figure: pipelined kernel over 2 rotation units: book and if(cn) Acq1 / if(¬cn) Acq2 for cycle n run on AD and the buffers, while if(cn-1) FDC1 / if(¬cn-1) FDC2 (fst = 1) runs on the microcontroller.]

SLIDE 37

Experimental results

We demonstrated our algorithms on four examples:

  • Embedded control application for the CyCab electric car: 27% reduction in cycle time,
  • Adaptive equalizer: 6% reduction,
  • Knock controller: 50% reduction,
  • Simple example: 66% reduction, with 100% resource usage.

SLIDE 38

Conclusion

  • Study of the state of the art in software pipelining
  • Definition of formal models
  • Definition of algorithms
  • Implementation in a prototype
  • Evaluation on case studies
SLIDE 39

Future work

  • Enhance memory management: we can forbid memory replication when it is not necessary, but we cannot yet limit it by prescribing memory sizes,
  • Exploit execution guards over partitioned architectures, possibly by using the n-synchronous formalism to express and exploit repetition patterns during the pipelining process,
  • Integrate pipelining into the initial scheduling process to obtain better trade-offs between latency/response time, throughput, and resource usage; we would like to do this on a software radio use case we are currently working on.