


26/10/2001 Reconfigurable Systems 1

Reconfigurable Systems

Mikael Olausson Computer Engineering Department of Electrical Engineering Linköping University

Outline

Temporal Partitioning
Temporal Partitioning with Partial Reconfiguration
Embedded Reconfigurable Architectures
Conclusions

Temporal Partitioning

  • M. Kaul, R. Vemuri, "Temporal Partitioning Combined with Design Space Exploration for Latency Minimization of Run-Time Reconfigured Designs", Proc. DATE 1999

[Figure labels: Application, Temporal partitioning, Configuration]

Alternatives

Many different implementations
  • Area
  • Latency
Integrate partitioning with synthesis
Iterative process
  • Lowest latency that meets the area constraint

Partitioning levels

Behavior level
Register Transfer Level
Gate level

Design points

Different implementations of the same task
  • Time-area tradeoff
  • Serial vs. parallel
Too many design points?
  • Candidate design points
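One plausible way to reduce "too many design points" to a set of candidates is to keep only the points that are not beaten in both area and latency. The slides do not say how the candidates are chosen, so the Pareto filter, the `DesignPoint` type and all numbers below are assumptions for illustration.

```python
from typing import NamedTuple

class DesignPoint(NamedTuple):
    name: str
    area: int     # hypothetical unit, e.g. logic cells
    latency: int  # hypothetical unit, e.g. clock cycles

def pareto_candidates(points):
    """Keep only points that no other point beats in both area and latency."""
    candidates = []
    best_latency = float("inf")
    # Walk from small to large area; a point survives only if it is
    # strictly faster than every point of smaller or equal area.
    for p in sorted(points, key=lambda p: (p.area, p.latency)):
        if p.latency < best_latency:
            candidates.append(p)
            best_latency = p.latency
    return candidates

points = [
    DesignPoint("serial", 100, 40),
    DesignPoint("serial_wasteful", 120, 45),
    DesignPoint("two_way", 200, 22),
    DesignPoint("parallel", 400, 12),
    DesignPoint("parallel_slow", 400, 15),
]
candidates = pareto_candidates(points)
print([p.name for p in candidates])
```

The surviving points span the serial-to-parallel range, which is the time-area tradeoff the slide refers to.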

Many or Few Partitionings

Spatial Partitioning
  • Increase partitioning
  • Consumes more area
  • Parallel processing => Less latency

Many or Few Partitionings, cont.

Temporal Partitioning
  • Increase partitions
  • Increase the available area
Will latency decrease?
  • Heavily dependent on the reconfiguration times
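A rough latency model shows why the answer depends on the reconfiguration time. The sketch assumes each partition is configured and then executed strictly in sequence. The execution times are invented, and the two reconfiguration times are simply one very small and one very large value.

```python
def total_latency(exec_times, reconf_time):
    """Latency when every partition is first configured, then executed, one after another."""
    return sum(reconf_time + t for t in exec_times)

# Hypothetical execution times in microseconds for the same task graph.
few_partitions = [400.0, 400.0]                  # 2 partitions, small and slow design points
many_partitions = [150.0, 150.0, 150.0, 150.0]   # 4 partitions, larger and faster design points

for reconf in (0.01, 30000.0):  # fast vs. slow reconfiguration, in microseconds
    few = total_latency(few_partitions, reconf)
    many = total_latency(many_partitions, reconf)
    winner = "more partitions" if many < few else "fewer partitions"
    print(f"reconf={reconf} us: few={few}, many={many} -> {winner} win")
```

With fast reconfiguration the extra area per partition pays off; with slow reconfiguration every added partition costs more than it saves.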

Partitioning

  • 1. Map tasks to partitions
  • 2. Map each partition to several design points
  • 3. Explore multiple implementations of the design point

Input to Algorithm

Behavior specification (task graph)
  • Tasks
  • Communication between tasks
Target architecture
  • Area
  • Memory size
  • Configuration times

The Algorithm

Find the constraints
  • Minimum number of partitions, lower bound N_min^l
  • Minimum number of partitions, upper bound N_min^u
  • Worst case latency Dmax
  • Best case latency Dmin
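The slide does not say how the two partition bounds are derived. One natural reading, assumed here, is that N_min^l comes from packing every task at its smallest design point and N_min^u from packing every task at its largest one. The function name, the task data and the device area are hypothetical.

```python
import math

def partition_bounds(task_areas, device_area):
    """Return (N_min^l, N_min^u) under the assumption stated above.

    task_areas maps each task to the areas of its design points.
    """
    smallest = sum(min(areas) for areas in task_areas.values())
    largest = sum(max(areas) for areas in task_areas.values())
    n_lower = math.ceil(smallest / device_area)  # all tasks at minimum area
    n_upper = math.ceil(largest / device_area)   # all tasks at maximum area
    return n_lower, n_upper

# Hypothetical task graph: two design points per task.
task_areas = {"t1": [120, 300], "t2": [200, 380], "t3": [90, 260]}
bounds = partition_bounds(task_areas, device_area=400)
print(bounds)
```

These are area-only estimates; a real partitioner must also respect task dependencies and memory size.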

Algorithm Steps

  • 1. Find one solution for the constraints
  • 2. Tighten the latency constraints
  • 3. Increase the partition size and start over
Search limits

New Dmax = (Dmax + Dmin)/2
Stop when Dmax - Dmin < δ, or when no new solutions are found within the time limit
Start and stop parameters for the partitioning search: α and γ
When reconfiguration time is large, set α = γ = 0
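The bound update above is a bisection on the latency constraint. The sketch below is a simplified reading of it: `solve` stands in for the real partitioner, the time limit and the α and γ parameters are not modelled, and the achievable latencies in the toy solver are invented.

```python
def tighten_latency(solve, d_min, d_max, delta):
    """Bisect the latency bound.

    solve(bound) must return an achieved latency <= bound, or None if
    no solution was found. Returns the best latency found, or None.
    """
    best = solve(d_max)              # find one solution for the loose constraint
    if best is None:
        return None
    d_max = best
    while d_max - d_min >= delta:
        bound = (d_max + d_min) / 2  # new Dmax = (Dmax + Dmin) / 2
        achieved = solve(bound)
        if achieved is None:
            d_min = bound            # nothing below this bound: raise the lower limit
        else:
            best = d_max = achieved  # feasible: tighten the upper limit
    return best

# Stand-in partitioner: pretends these are the only achievable latencies.
ACHIEVABLE = [9.65, 7.06, 6.5]

def toy_solver(bound):
    feasible = [d for d in ACHIEVABLE if d <= bound]
    # Deliberately return the worst feasible solution, like a solver
    # that stops at the first answer meeting the constraint.
    return max(feasible) if feasible else None

result = tighten_latency(toy_solver, d_min=1.065, d_max=25.71, delta=0.1)
print(result)
```

Each iteration either raises Dmin to the midpoint or lowers Dmax to at most the midpoint, so the interval at least halves and the loop terminates.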

Examples

4x4 DCT with low reconfiguration time (10 ns)

       Bounds         Result
 N  I  Dmax    Dmin   Da     T
 9  1  25.710  1.065  9.650  37.40
 9  2  7.226   1.065  7.060  77.32
 9  3  4.145   1.065  Inf    300
 9  4  5.685   4.145  Inf    300
 9  5  6.455   5.685  Inf    300
 9  6  6.840   6.455  Inf    300
10  1  7.060   1.095  6.500  278.8
10  2  4.077   1.095  Inf    300
10  3  5.568   4.077  Inf    300
10  4  6.314   5.568  Inf    300
10  5  6.407   6.314  Inf    300
11  1  6.500   1.125  Inf    300
12  1  6.500   1.155  Inf    300

Examples

4x4 DCT with high reconfiguration time (30 ms)

       Bounds         Result
 N  I  Dmax    Dmin   Da     T
 8  1  25.440  795    Inf    300
 9  1  25.440  795    9.630  77.60
 9  2  6.956   795    Inf    300
 9  3  9.226   6.956  9.100  78.95
 9  4  8.111   6.956  8.100  185.73
 9  5  7.533   6.956  7.380  281.93
 9  6  7.244   6.956  Inf    300

Partial Reconfiguration

  • S. Ganesan, R. Vemuri, "An Integrated Temporal Partitioning and Partial Reconfiguration Technique for Design Latency Improvement", Proc. DATE 2000.

One part executing, one part reconfiguring

Part. Recon., cont.

[Figure: reconfiguration and execution timeline for temporal partitions TP1, TP2 and TP3]

Basic concept

For maximum overlap:
  • Exe(TP_i) comparable to Rec(TP_{i+1})
  • Or Exe(TP_i) >= Rec(TP_{i+1})
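The overlap condition can be checked with a small timing model. The sketch assumes two reconfigurable regions used alternately, so that TP_{i+1} is loaded while TP_i executes, and a first configuration that cannot be hidden. All times are invented.

```python
def latency_full(rec, exe):
    """Full reconfiguration: configure, then execute, for every partition in turn."""
    return sum(r + e for r, e in zip(rec, exe))

def latency_overlapped(rec, exe):
    """Partial reconfiguration: Rec(TP_{i+1}) runs in parallel with Exe(TP_i)."""
    total = rec[0]  # the first configuration has nothing to hide behind
    for i, e in enumerate(exe):
        next_rec = rec[i + 1] if i + 1 < len(rec) else 0
        total += max(e, next_rec)  # the longer of the two determines the step
    return total

rec = [50, 50, 50]  # hypothetical reconfiguration times
exe = [60, 40, 70]  # hypothetical execution times

print(latency_full(rec, exe))        # every reconfiguration is paid in full
print(latency_overlapped(rec, exe))  # TP2 is too short to hide Rec(TP3) completely
```

When Exe(TP_i) >= Rec(TP_{i+1}) holds for every i, the overlapped latency reduces to the first reconfiguration plus the sum of the execution times.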

Synthesis

Input behaviour specification in C or VHDL
Generate a Control Data Flow Graph (CDFG)
Partitioner + area estimator => Temporal segments
High-level synthesis => RTL

Input specification

Behaviour Block Intermediate Format (BBIF)
[Figure: blocks BLK_1, BLK_2, BLK_3, BLK_4; labels: Input Set, Output Set, Local Set, Function Graph]

Target architecture

Xilinx 6200 FPGA
[Figure labels: Host-side CTRL, RC1, RC2]
Switch between execution and reconfiguration

Loop partitioning

Entire loop in one partition. Why?
  • Easy partitioning
  • Execution time maximally overlapped
What if the loop doesn't fit?
  • Report a failure
  • Use the whole device (the adopted one)

Conditional execution

We have to wait for the outcome of the conditional executing
Conditional in one, branches in the other
If this fails
  • Host polling

Block processing

Gives high execution times
One configuration for many inputs
  • Filters
  • FFT
Works only with no dependencies between inputs
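The benefit of block processing is that each reconfiguration is paid once per block of inputs instead of once per input. A minimal sketch, assuming intermediate results can be buffered between partitions; all numbers are invented.

```python
def time_per_input(rec_times, exe_times_per_input, n_inputs):
    """Average time per input when each configuration processes n_inputs
    before the next configuration is loaded (no overlap modelled)."""
    total = sum(r + n_inputs * e for r, e in zip(rec_times, exe_times_per_input))
    return total / n_inputs

rec = [100.0, 100.0]  # hypothetical reconfiguration time per partition
exe = [1.0, 2.0]      # hypothetical execution time per input and partition

print(time_per_input(rec, exe, n_inputs=1))    # reconfiguration dominates
print(time_per_input(rec, exe, n_inputs=100))  # reconfiguration is amortized
```

The longer execution time per configuration also makes it easier to hide the next reconfiguration behind it.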

Results

Design  Method  #TP  #Inp blocks  Rec. time (us)  Exe time (us)  Throughput (ms)  % rec.  Speedup vs full
DCT     FR      22   180          1995.7          2088           4.08             51.2
DCT     PR      48   180          51.47           1991           2.04             2.52    2.0x
SEG     FR      2    1            610             385            0.995            61.2
SEG     PR      3    1            165             385            0.550            30      1.8x
1d FFT  FR      2    165          154.8           103.1          0.26             59.5
1d FFT  PR      3    165          54.03           154.7          0.21             25.7    1.52x
TLC     FR      1    1            86              0.9            0.087            98.9
TLC     PR      1    1            86              0.9            0.087            98.9    1x

Questions

  • What is required for good performance?
  • Would more partitions be better?
  • Can parallel processing increase the performance?

Reconfigurable Architectures

Yanbing Li, et al., "Hardware-Software Codesign of Embedded Reconfigurable Architectures", Proc. DAC 2000.

Speed up execution with FPGA

Target architecture

[Figure: target architecture with CPU, FPGA and Mem]

Nimble

HW/SW partitioner
From a system-level description in C
Loop and basic block level
Two-dimensional partitioning
  • Spatial
  • Temporal

Nimble, cont.

Search for candidate loops for implementation in HW
1 SW loop vs. 1 or more HW loops
Search for Instruction Level Parallelism

Key issues by partitioning

Dynamic reconfiguration costs
Compiler optimizations (SW)
HW design space
Profiling information for HW/SW tradeoffs

Retargetable System?

Yes!!
Platform described in ADL by
  • Type of processor
  • Characteristics of the FPGA
  • Memory

Why loops?

Significant portion of the execution time
Compact implementation of loops

Preprocessing

Profile target architecture
Extract loops
Synthesize HW of loops
Multiple HW structures
  • Loop unrolling
  • Procedure inlining
  • Branch trimming

Global cost function

Maximize overall performance
What to include?
  • SW execution times
  • HW execution times
  • Entry times for HW implementations
  • Exit times for HW implementations
  • Configuration times
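The listed terms can be combined into a simple additive cost. This is a hypothetical formulation, not the one from the cited paper: the `LoopProfile` fields, the way entry, exit and configuration times are charged, and all numbers are assumptions.

```python
from dataclasses import dataclass

@dataclass
class LoopProfile:
    sw_time: float      # total time if the loop runs in software
    hw_time: float      # total time if the loop runs in hardware
    entries: int        # how often the loop is entered
    entry_time: float   # cost of handing control to the FPGA, per entry
    exit_time: float    # cost of returning to the CPU, per entry
    configs: int        # how often the FPGA must be (re)configured for this loop
    config_time: float  # time of one configuration

def loop_cost(loop, in_hw):
    """Time contributed by one loop under the chosen implementation."""
    if not in_hw:
        return loop.sw_time
    return (loop.hw_time
            + loop.entries * (loop.entry_time + loop.exit_time)
            + loop.configs * loop.config_time)

def total_cost(loops, in_hw):
    """Sum over all loops; in_hw maps loop name -> True (HW) or False (SW)."""
    return sum(loop_cost(loop, in_hw[name]) for name, loop in loops.items())

loops = {
    "a": LoopProfile(1000, 200, entries=10, entry_time=5, exit_time=5, configs=1, config_time=300),
    "b": LoopProfile(400, 100, entries=50, entry_time=5, exit_time=5, configs=2, config_time=300),
}
print(total_cost(loops, {"a": True, "b": False}))  # only loop a moved to HW
print(total_cost(loops, {"a": True, "b": True}))   # loop b is not worth moving
```

Loop b has the faster hardware kernel, yet its entry, exit and configuration overheads make software the better choice, which is why all five terms belong in the cost.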

Algorithm flow

Loop Entry Profiling (LEP)
Interesting Loop Detection (ILD)
Intra Loop Selection
Inter Loop Selection

Loop entry profiling

Identify all loops
Trace all the loop entries
Compression
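The slide does not name the compression scheme. As an illustration only, the sketch below run-length encodes a loop-entry trace, which keeps the order of entries while collapsing repeats; the loop names and the trace are invented.

```python
from collections import Counter
from itertools import groupby

def compress_trace(trace):
    """Run-length encode a loop-entry trace.

    Consecutive entries of the same loop collapse to (loop, count).
    """
    return [(loop_id, len(list(run))) for loop_id, run in groupby(trace)]

# Hypothetical trace: one item per loop entry, in execution order.
trace = ["L1", "L1", "L1", "L2", "L1", "L1", "L3", "L3"]

compressed = compress_trace(trace)
entry_counts = Counter(trace)  # total entries per loop

print(compressed)
print(entry_counts)
```

Keeping the order matters because a switch from one hardware loop to another may require a reconfiguration.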

Interesting Loop Selection

Percentage contribution to application time

[Figure: loops versus total execution time for the benchmarks (wavelet image compression, encoding, decoding, encoder, Skipjack encryption)]
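Selecting loops by their share of application time can be sketched as a threshold test. The threshold and the profile numbers are hypothetical; the slides give no concrete cut-off.

```python
def interesting_loops(loop_times, total_time, threshold_pct):
    """Return loops whose share of total time is at least threshold_pct,
    largest contributor first."""
    shares = {name: 100.0 * t / total_time for name, t in loop_times.items()}
    selected = [name for name, share in shares.items() if share >= threshold_pct]
    return sorted(selected, key=lambda name: shares[name], reverse=True)

# Hypothetical profile: time spent in each loop, out of 1000 time units in total.
loop_times = {"L1": 600, "L2": 250, "L3": 20}
selected = interesting_loops(loop_times, total_time=1000, threshold_pct=5.0)
print(selected)
```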

Intra Loop Selection

Choose the best HW solution that fits the FPGA, judged by execution times
Place infrequent branches in SW for too big HW solutions
Keep the pure SW solution
Leave configuration times out
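For a single loop, the selection step amounts to picking the fastest hardware version that fits the device. A minimal sketch with invented versions and areas; configuration time is left out, as on the slide.

```python
def select_hw_version(versions, fpga_area):
    """Pick the fastest HW version that fits the FPGA, or None if none fits.

    The pure SW solution is kept separately as a fallback.
    """
    fitting = [v for v in versions if v["area"] <= fpga_area]
    if not fitting:
        return None
    return min(fitting, key=lambda v: v["exec_time"])

# Hypothetical HW versions of one loop, e.g. from different unrolling factors.
versions = [
    {"name": "unrolled_x4", "area": 900, "exec_time": 100},
    {"name": "unrolled_x2", "area": 500, "exec_time": 180},
    {"name": "plain", "area": 300, "exec_time": 320},
]
choice = select_hw_version(versions, fpga_area=600)
print(choice["name"])
```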

ILS, cont.

Why do we leave out configuration time?
The number of configurations for the loops is not available yet.
We haven't made the HW/SW partitioning yet.

Inter Loop Selection

One SW and one HW solution per loop
Interactions between loops
Too big search area (2^n)
Group loops into clusters
  • Competing loops
  • Nested loops, branches
Exhaustive search within the clusters
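Clustering replaces one search over 2^n assignments with several small ones. The sketch enumerates every SW/HW choice inside each cluster. The cost model, with a flat penalty for each additional competing HW loop, and all numbers are assumptions for illustration.

```python
from itertools import product

def best_assignment(cluster, cost):
    """Try all 2^k SW/HW choices for the k loops of one cluster."""
    best = None
    for choice in product(("SW", "HW"), repeat=len(cluster)):
        assignment = dict(zip(cluster, choice))
        c = cost(assignment)
        if best is None or c < best[0]:
            best = (c, assignment)
    return best

# Hypothetical per-loop times and reconfiguration penalty.
SW_TIME = {"L1": 900, "L2": 700, "L3": 500}
HW_TIME = {"L1": 300, "L2": 400, "L3": 450}
RECONFIG_PENALTY = 500

def toy_cost(assignment):
    time = sum(HW_TIME[l] if impl == "HW" else SW_TIME[l]
               for l, impl in assignment.items())
    hw_loops = sum(1 for impl in assignment.values() if impl == "HW")
    # Competing loops: every HW loop beyond the first forces reconfiguration.
    return time + RECONFIG_PENALTY * max(0, hw_loops - 1)

clusters = [["L1", "L2"], ["L3"]]  # L1 and L2 compete; L3 is independent
results = [best_assignment(cluster, toy_cost) for cluster in clusters]
for cost_value, assignment in results:
    print(cost_value, assignment)
```

Here the search visits 4 + 2 assignments instead of 2^3 = 8; the saving grows quickly with the number of loops.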

How good is the algorithm?

                             The algorithm                       Local optimization                  Upper bound
Benchmark            #Loops  CPU time (s)  Performance (cycles)  CPU time (s)  Performance (cycles)  performance (cycles)
Wavelet compression  25      0.17          1.74e+5               0.05          5.10e+5               1.74e+5
MPEG-2 encoder       165     1.92          7.47e+8               0.49          1.58e+9               7.17e+8
ADPCM                16      0.08          7.09e+4               0.04          8.00e+4               7.00e+4
Skipjack encoding    6       0.04          8.00e+4               0.01          1.10e+5               8.00e+5
UNEPIC decoding      62      1.53          8.57e+6               0.28          1.47e+7               8.42e+6

Questions

  • What is the difference between spatial and temporal partitioning?
  • How can we speed up execution even more?

Conclusions

Partitioning in reconfigurable logic (FPGA)
Spatial and temporal partitioning
Config-Exe-Recon-Exe ...
Execute and reconfigure in parallel
Speed up embedded CPU with HW in FPGA