
Temporal Partitioning, Temporal Partitioning with Partial Reconfiguration, Embedded Reconfigurable Architectures
Mikael Olausson, Computer Engineering, Department of Electrical Engineering, Linköping University


Reconfigurable Systems, 26/10/2001

Outline
- Temporal Partitioning
- Temporal Partitioning with Partial Reconfiguration
- Embedded Reconfigurable Architectures
- Conclusions

Temporal Partitioning
M. Kaul, R. Vemuri, "Temporal Partitioning Combined with Design Space Exploration for Latency Minimization of Run-Time Reconfigured Designs", Proc. DATE 1999.
- Integrate partitioning with synthesis
- Iterative process
  - Lowest latency that meets the area constraint
[Figure: Application, Temporal partitioning, Configuration]

Alternatives
- Many different implementations
  - Area
  - Latency

Partitioning levels
- Behavior level
- Register Transfer Level
- Gate level

Design points
- Different implementations of the same task
  - Time-area tradeoff
  - Serial vs. parallel
- Too many design points?
  - Candidate design points
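The slides do not say how the candidate design points are picked when there are too many. One plausible filter, sketched here as an assumption rather than the paper's method, keeps only the Pareto-optimal points of the time-area tradeoff: a point is dropped if another point is no larger and no slower. `DesignPoint`, `candidate_design_points` and all numbers are hypothetical.

```python
from typing import NamedTuple


class DesignPoint(NamedTuple):
    """One implementation of a task (hypothetical units)."""
    name: str
    area: int       # e.g. logic cells
    latency: float  # e.g. ns


def dominates(q: DesignPoint, p: DesignPoint) -> bool:
    """True if q is no larger and no slower than p, and strictly better in one."""
    return (q.area <= p.area and q.latency <= p.latency
            and (q.area < p.area or q.latency < p.latency))


def candidate_design_points(points: list[DesignPoint]) -> list[DesignPoint]:
    """Keep only the Pareto-optimal points of the time-area tradeoff."""
    keep = [p for p in points if not any(dominates(q, p) for q in points)]
    # Order from smallest to largest area (e.g. serial first, parallel last).
    return sorted(keep, key=lambda p: p.area)


points = [
    DesignPoint("serial", 40, 80.0),
    DesignPoint("2-way", 70, 45.0),
    DesignPoint("2-way-alt", 75, 50.0),  # dominated by "2-way"
    DesignPoint("parallel", 130, 20.0),
]
print([p.name for p in candidate_design_points(points)])
# ['serial', '2-way', 'parallel']
```

Every surviving point is a genuine tradeoff: buying lower latency always costs area, so the partitioner only has to choose among these.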

Many or Few Partitions
Spatial partitioning
- Increase the number of partitions
  - Consumes more area
  - Parallel processing => less latency

Many or Few Partitions, cont.
Temporal partitioning
- Increase the number of partitions
  - Increases the available area
- Will latency decrease?
  - Heavily dependent on the reconfiguration times

Partitioning
1. Map tasks to partitions
2. Map each partition to several design points
3. Explore multiple implementations of the design point

Input to the algorithm
- Behavior specification (task graph)
  - Tasks
  - Communication between tasks
- Target architecture
  - Area
  - Memory size
  - Configuration times

The algorithm
Find the constraints:
- Minimum number of partitions, lower bound $N^{l}_{\min}$
- Minimum number of partitions, upper bound $N^{u}_{\min}$
- Worst-case latency $D_{\max}$
- Best-case latency $D_{\min}$

Algorithm steps
1. Find one solution for the constraints
2. Tighten the latency constraints
3. Increase the partition size and start over
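The slides name a lower bound $N^{l}_{\min}$ on the number of partitions without giving its formula. A minimal area-based sketch, assuming each task goes wholly into one partition and no partition may exceed the device area: the partitions together must hold the smallest implementation of every task, so fewer than $\lceil \sum_t A_t^{\min} / A_{\text{device}} \rceil$ partitions cannot work. The function name and values are hypothetical, and the paper may compute the bound differently.

```python
def partitions_lower_bound(min_task_areas: list[int], device_area: int) -> int:
    """Hypothetical area-based lower bound on the number of temporal partitions.

    Each task needs at least the area of its smallest design point and no
    partition can exceed the device area, so at least
    ceil(sum(min_task_areas) / device_area) partitions are needed.
    """
    if any(a > device_area for a in min_task_areas):
        raise ValueError("a task does not fit on the device even in its smallest form")
    total = sum(min_task_areas)
    return -(-total // device_area)  # integer ceiling division


print(partitions_lower_bound([40, 55, 30, 70, 25], device_area=100))  # 3
```

Because tasks cannot be split, packing them into partitions may need more than this bound, which is why it only serves as the starting point of the search.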

Search limits
- New $D_{\max} = (D_{\max} + D_{\min})/2$
- Stop when $D_{\max} - D_{\min} < \delta$ or when no new solutions are found
- Start and stop parameters for the partitioning search: $\alpha$ and $\gamma$
- When the reconfiguration time is large, set $\alpha = \gamma = 0$

Examples
4x4 DCT with low reconfiguration time (10 ns)
(Bounds: N, I, D_max, D_min; Result: Da, T. Inf: no solution found; 300: time limit.)
N    I    D_max    D_min    Da       T
9    1    25.710   1.065    9.650    37.40
     2    7.226    1.065    7.060    77.32
     3    4.145    1.065    Inf      300
     4    5.685    4.145    Inf      300
     5    6.455    5.685    Inf      300
     6    6.840    6.455    Inf      300
10   1    7.060    1.095    6.500    278.8
     2    4.077    1.095    Inf      300
     3    5.568    4.077    Inf      300
     4    6.314    5.568    Inf      300
     5    6.407    6.314    Inf      300
11   1    6.500    1.125    Inf      300
12   1    6.500    1.155    Inf      300

4x4 DCT with high reconfiguration time (30 ms)
N    I    D_max    D_min    Da       T
8    1    25.440   0.795    Inf      300
9    1    25.440   0.795    9.630    77.60
     2    6.956    0.795    Inf      300
     3    9.226    6.956    9.100    78.95
     4    8.111    6.956    8.100    185.73
     5    7.533    6.956    7.380    281.93
     6    7.244    6.956    Inf      300

Partial Reconfiguration
S. Ganesan, R. Vemuri, "An Integrated Temporal Partitioning and Partial Reconfiguration Technique for Design Latency Improvement", Proc. DATE 2000.
- One part executing, one part reconfiguring

Partial Reconfiguration, cont.
For maximum overlap:
- $\mathrm{Exe}(TP_i)$ comparable to $\mathrm{Rec}(TP_{i+1})$
- or $\mathrm{Exe}(TP_i) \ge \mathrm{Rec}(TP_{i+1})$

Basic concept
[Figure: temporal partitions TP1, TP2, TP3]
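The search limits and the example tables suggest a bisection on the latency bound for each partition count, followed by a retry with more partitions that starts from the best latency found so far. The sketch below assumes a hypothetical `solve(n, bound)` oracle standing in for the paper's partitioning and synthesis step; it returns an achieved latency or `None` (the "Inf" rows). On failure the tried bound becomes the new $D_{\min}$, as in the tables; on success the upper limit is tightened. The $\alpha$ and $\gamma$ parameters are not modelled, and `toy_solver` is made up.

```python
from typing import Callable, Optional

# Hypothetical solver interface: given a partition count n and a latency
# bound, return the latency of a solution meeting the bound, or None.
Solver = Callable[[int, float], Optional[float]]


def tighten_latency(n: int, d_max: float, d_min: float,
                    solve: Solver, delta: float) -> Optional[float]:
    """Bisection on the latency bound for a fixed number of partitions n."""
    best: Optional[float] = None
    bound = d_max  # first try the loosest bound
    while True:
        achieved = solve(n, bound)
        if achieved is not None:
            best = achieved if best is None else min(best, achieved)
            d_max = min(bound, achieved)  # feasible: move the upper limit down
        else:
            d_min = bound  # infeasible: the failed bound becomes the lower limit
        if d_max - d_min < delta:
            return best
        bound = (d_max + d_min) / 2  # New D_max = (D_max + D_min) / 2


def explore(n_lower: int, n_upper: int, d_max: float,
            d_min_of: Callable[[int], float], solve: Solver,
            delta: float) -> Optional[float]:
    """Increase the number of partitions and start over, reusing the best
    latency found so far as the next starting D_max."""
    best: Optional[float] = None
    for n in range(n_lower, n_upper + 1):
        start = d_max if best is None else best
        result = tighten_latency(n, start, d_min_of(n), solve, delta)
        if result is not None and (best is None or result < best):
            best = result
    return best


def toy_solver(n: int, bound: float) -> Optional[float]:
    # Hypothetical: the lowest latency reachable with n partitions.
    achievable = {9: 7.06, 10: 6.5}.get(n, 6.6)
    return achievable if bound >= achievable else None


print(explore(9, 12, d_max=25.71, d_min_of=lambda n: 1.065 + 0.03 * (n - 9),
              solve=toy_solver, delta=0.1))  # 6.5
```

With this toy solver, 11 and 12 partitions fail on their first bound and the search keeps 6.5; each bisection step halves the interval, so the loop ends once it is narrower than $\delta$.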

Synthesis
- Input behaviour specification in C or VHDL
- Generate a Control Data Flow Graph (CDFG)
- Partitioner + area estimator => temporal segments
- High-level synthesis => RTL

Input specification
Behaviour Block Intermediate Format (BBIF)
[Figure: function graph of blocks BLK_1 to BLK_4 with an input set, a local set and an output set]

Target architecture
- Xilinx 6200 FPGA
[Figure: host-side CTRL, RC1, RC2]
- Switch between execution and reconfiguration

Loop partitioning
- Entire loop in one partition. Why?
  - Easy partitioning
  - Execution time maximally overlapped
- What if the loop doesn't fit?
  - Report a failure
  - Use the whole device (the adopted approach)

Conditional execution
- We have to wait for the outcome of the conditional
- The conditional in one part, the branches in the other
- If this fails:
  - Host polling

Block processing
- Gives long execution times per configuration
- One configuration for many inputs
  - Filters
  - FFT
- Works only when there are no dependencies between inputs
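The overlap rule on the partial-reconfiguration slides can be made concrete with a simple latency model. As a sketch, assume two regions (like RC1 and RC2) that alternate: while $TP_i$ executes in one, $TP_{i+1}$ is loaded into the other, and only one reconfiguration runs at a time. The total latency is then $\mathrm{Rec}(TP_1) + \sum_{i=1}^{n-1} \max(\mathrm{Exe}(TP_i), \mathrm{Rec}(TP_{i+1})) + \mathrm{Exe}(TP_n)$, so a reconfiguration is fully hidden exactly when $\mathrm{Exe}(TP_i) \ge \mathrm{Rec}(TP_{i+1})$. Function names and numbers are hypothetical; this is not the cost model of the cited papers.

```python
def latency_without_overlap(rec: list[float], exe: list[float]) -> float:
    """Single region: every reconfiguration stalls execution."""
    return sum(rec) + sum(exe)


def latency_with_overlap(rec: list[float], exe: list[float]) -> float:
    """Two alternating regions: Rec(TP_{i+1}) runs while TP_i executes."""
    if len(rec) != len(exe) or not rec:
        raise ValueError("need one Rec and one Exe time per temporal partition")
    total = rec[0]  # the first configuration cannot be hidden
    for i in range(len(exe) - 1):
        # Fully hidden when exe[i] >= rec[i + 1]; otherwise the rest stalls.
        total += max(exe[i], rec[i + 1])
    return total + exe[-1]


rec = [30.0, 30.0, 30.0]  # Rec(TP_1..TP_3), hypothetical
exe = [40.0, 20.0, 50.0]  # Exe(TP_1..TP_3), hypothetical
print(latency_without_overlap(rec, exe))  # 200.0
print(latency_with_overlap(rec, exe))     # 150.0
```

Here $\mathrm{Rec}(TP_3)$ is only partly hidden, since $\mathrm{Exe}(TP_2) = 20 < 30 = \mathrm{Rec}(TP_3)$, which leaves a stall of 10 time units; block processing helps by lengthening each $\mathrm{Exe}(TP_i)$.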

Results
Partial reconfiguration (PR) vs. full reconfiguration (FR) for the designs TLC, DCT, SEG and FFT.
Columns: Design, Method, #TP, #Inp blocks, Rec. time (us), Exe time (us), Throughput (ms), % rec., Speedup vs. full.
Reported speedups of PR over FR: 1.52x, 2.0x, 1x and 1.8x.

Questions
- What is required for good performance?
- Would more partitions be better?
- Can parallel processing increase the performance?

Embedded Reconfigurable Architectures
Yanbing Li, et al., "Hardware-Software Codesign of Embedded Reconfigurable Architectures", Proc. DAC 2000.
- Speed up execution with an FPGA

Target architecture
[Figure: CPU, memory and FPGA]

Nimble
- HW/SW partitioner
- From a system-level description in C
- Loop and basic-block level
- Two-dimensional partitioning
  - Spatial
  - Temporal

Nimble, cont.
- Search for candidate loops for implementation in HW
- 1 SW loop vs. 1 or more HW loops
- Search for instruction-level parallelism
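The Nimble slides contrast one SW version of a loop with one or more HW versions. A minimal sketch of that per-loop comparison, assuming each HW version is charged its execution time plus the entry and exit times for moving data between CPU and FPGA; configuration time is left out because it depends on which loop ran before. `HwVersion`, `best_implementation` and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HwVersion:
    """One hypothetical HW implementation of a loop (e.g. an unrolling factor)."""
    name: str
    exe_time: float    # time spent in the FPGA datapath
    entry_time: float  # moving live variables from CPU to FPGA
    exit_time: float   # moving results back to the CPU


def best_implementation(sw_time: float,
                        hw_versions: list[HwVersion]) -> tuple[str, float]:
    """Pick the fastest of the SW loop and its HW versions (no config time)."""
    best_name, best_time = "SW", sw_time
    for v in hw_versions:
        t = v.entry_time + v.exe_time + v.exit_time
        if t < best_time:
            best_name, best_time = v.name, t
    return best_name, best_time


versions = [HwVersion("unroll1", 300.0, 20.0, 10.0),
            HwVersion("unroll4", 120.0, 20.0, 10.0)]
print(best_implementation(500.0, versions))  # ('unroll4', 150.0)
```

If the SW loop is already fast (say 100.0), the same call returns `('SW', 100.0)`, because the entry and exit overhead alone eats the HW gain.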

Key issues in partitioning
- Dynamic reconfiguration costs
- Compiler optimizations (SW)
- HW design space
- Profiling information for HW/SW tradeoffs

Retargetable system?
Yes!! The platform is described in an ADL (architecture description language) by:
- Type of processor
- Characteristics of the FPGA
- Memory

Why loops?
- Significant portion of the execution time
- Compact implementation of loops
- Multiple HW structures
  - Loop unrolling
  - Procedure inlining
  - Branch trimming

Preprocessing
- Profile target architecture
- Extract loops
- Synthesize HW of loops

Global cost function
- Maximize overall performance
- What to include?
  - SW execution times
  - HW execution times
  - Entry times for HW implementations
  - Exit times for HW implementations
  - Configuration times

Algorithm flow
- Loop Entry Profiling (LEP)
- Interesting Loop Detection (ILD)
- Intra-loop selection
- Inter-loop selection
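The global cost function lists SW and HW execution times, entry and exit times, and configuration times. A toy version, assuming a single configuration is resident at a time so a HW loop pays its configuration time whenever its configuration is not the one loaded, evaluates a HW/SW assignment over a profiled loop trace. `LoopCosts`, `global_cost` and the numbers are hypothetical; the exhaustive search over assignments only works for this tiny example and stands in for the intra- and inter-loop selection steps.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import AbstractSet, Optional


@dataclass(frozen=True)
class LoopCosts:
    """Hypothetical per-entry costs of one loop."""
    sw_time: float      # loop executed in software
    hw_time: float      # loop executed on the FPGA
    entry_time: float   # CPU -> FPGA transfer when entering the HW loop
    exit_time: float    # FPGA -> CPU transfer when leaving the HW loop
    config_time: float  # loading this loop's configuration


def global_cost(trace: list[str], costs: dict[str, LoopCosts],
                in_hw: AbstractSet[str]) -> float:
    """Total time of a loop trace for one HW/SW assignment.

    Assumes one configuration is resident at a time: a HW loop pays
    config_time whenever its configuration is not the one currently loaded.
    """
    total = 0.0
    loaded: Optional[str] = None  # loop whose configuration is on the FPGA
    for loop in trace:
        c = costs[loop]
        if loop not in in_hw:
            total += c.sw_time
            continue
        if loaded != loop:
            total += c.config_time
            loaded = loop
        total += c.entry_time + c.hw_time + c.exit_time
    return total


costs = {"A": LoopCosts(100, 20, 5, 5, 50), "B": LoopCosts(80, 30, 5, 5, 50)}
trace = ["A", "A", "B", "A"]  # hypothetical loop entry order from profiling
loops = sorted(costs)
assignments = [frozenset(s) for r in range(len(loops) + 1)
               for s in combinations(loops, r)]
best = min(assignments, key=lambda s: global_cost(trace, costs, s))
print(sorted(best), global_cost(trace, costs, best))  # ['A'] 220.0
```

Moving B to hardware as well costs more here (280.0), because A and B then evict each other's configuration; this is the effect the configuration-time term is meant to capture.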
