 
HPEC 2008, September 23-25, 2008
• Background
• RC Taxonomy
• Reconfigurability Factors
• Computational Density Metrics
• Internal Memory Bandwidth Metric
• Results & Analysis
• Future Work
• Conclusions
• Moore’s law continues to hold true, with transistor counts doubling every 18 months
o But we can no longer rely upon increasing clock rates (f_clk) and instruction-level parallelism (ILP) to meet computing performance demands
• How to best exploit ever-increasing on-chip transistor counts?
o Architecture Reformation: multi- & many-core (MC) devices are the new technology wave
o Application Reformation: focus on exploiting explicit parallelism in these new devices
• What MC architecture options are available?
o Fixed MC (FMC): fixed hardware structure, cannot be changed post-fab
o Reconfigurable MC (RMC): can be adapted post-fab to changing problem requirements
• How to compare disparate device technologies?
o Need for taxonomy & device analysis early in development cycle
o Challenging due to vast design space of FMC and RMC devices
o We are developing a suite of metrics; two are the focus of this study:
o Computational Density per Watt captures computational performance and power consumption, and is more relevant for HPEC than pure performance metrics
o Internal Memory Bandwidth describes a device’s on-chip memory access capabilities
[Figure: RC taxonomy. Callout: devices with segregated RMC & FMC resources can use either in stand-alone mode. Each class spans a spectrum of granularity.]
[Figure: reconfigurability factors of an RC device: datapath, device memory, PE/block, precision, mode, power, interconnect, and interface. Panel examples include PE (processing element) datapaths built from registers, adders, and multipliers; precisions from 8 × 8 multiply and 8 × 8 MAC up to 24 × 24 and 64 × 64 multiply; memory organized as 64 KB × 32 or 64 KB × 64; PEs running programs Prg-A through Prg-D; RLDRAM and DDR2 SDRAM memory controllers; and performance versus power modes.]
Devices Studied (18)
o 130 nm FMC: Ambric Am2045 [1], ClearSpeed CSX600, Freescale MPC7447
o 90 nm RMC: Altera Stratix-II EP2S180, ElementCXI ECA-64, Mathstar Arrix FPOA, Raytheon MONARCH, Tilera TILE64, Xilinx Virtex-4 LX200, Xilinx Virtex-4 SX55
o 90 nm FMC: Freescale MPC8640D, IBM Cell BE
o 65 nm RMC: Altera Stratix-III EP3SL340, Altera Stratix-III EP3SE260, Xilinx Virtex-5 LX330T, Xilinx Virtex-5 SX95T
o 45 nm FMC: Intel Atom N270 [2]
o 40 nm RMC: Altera Stratix-IV EP4SE530
• Metric Description
o Computational Density (CD)
▪ Measure of computational performance across range of parallelism, grouped by process technology
o Computational Density per Watt (CDW)
▪ CD normalized by power consumption
o Internal Memory Bandwidth (IMB)
▪ Describes device’s memory-access capabilities with on-chip memories
• CD & CDW Precisions (5 in all)
o Bit-Level, 16-bit Integer, 32-bit Integer, Single-Precision Floating-Point (SPFP), and Double-Precision Floating-Point (DPFP)
• IMB
o Block-based vs. Cache-based systems
[1] Preliminary results based on limited vendor data (Ambric)
[2] Limited Atom cache data, not included in IMB results
• CD for FPGAs
o Bit-level
$$CD_{bit} = f_{max} \times \left[ N_{LUT} + \sum_i W_i \times N_i \right]$$
▪ $f_{max}$ is max device frequency, $N_{LUT}$ is number of look-up tables, $W_i$ & $N_i$ are width & number of fixed resources
o Integer & floating-point
$$CD_{int/FP} = (Ops_{DSP} + Ops_{LOGIC}) \times f_{achievable}$$
▪ Integer: use the Integer & Floating-Point Analysis method with integer cores
▪ Floating-point: use the Integer & Floating-Point Analysis method with FP cores
Integer & Floating-Point Analysis
o Overhead: reserve 15% of logic resources for steering logic and memory or I/O interfacing
o Memory-sustainable CD: limit CD based on # of parallel paths to on-chip memory; each operation requires 2 memory locations
o Parallel Operations: scales up to max. # of adds and mults (# of adds = # of mults)
o Achievable Frequency: lowest frequency after PAR of DSP & logic-only implementations of add & mult computational cores
o IP Cores: use IP cores provided by vendor for better productivity
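The FPGA formulas and analysis rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function names, the integer rounding of the 15% reserve, the reading of "2 memory locations per operation" as `memory_paths // 2`, and every device figure in the example are assumptions made here for illustration.

```python
def usable_logic(total_logic, overhead_percent=15):
    """Logic left after reserving a share for steering logic and interfacing."""
    return total_logic * (100 - overhead_percent) // 100


def fpga_bit_level_cd(f_max_hz, n_lut, fixed_resources):
    """CD_bit = f_max * (N_LUT + sum_i W_i * N_i).

    fixed_resources is an iterable of (width_bits, count) pairs.
    """
    fixed_bits = sum(width * count for width, count in fixed_resources)
    return f_max_hz * (n_lut + fixed_bits)


def memory_sustainable_ops(parallel_ops, memory_paths, operands_per_op=2):
    """Cap parallel operations by how many can be fed from on-chip memory."""
    return min(parallel_ops, memory_paths // operands_per_op)


def fpga_int_fp_cd(ops_dsp, ops_logic, f_achievable_hz):
    """CD_int/FP = (Ops_DSP + Ops_LOGIC) * f_achievable."""
    return (ops_dsp + ops_logic) * f_achievable_hz


# Hypothetical device figures, for illustration only.
logic_for_cores = usable_logic(100_000)
cd_bit = fpga_bit_level_cd(500_000_000, 100_000, [(18, 96)])
ops = memory_sustainable_ops(parallel_ops=100, memory_paths=120)
cd_int = fpga_int_fp_cd(ops_dsp=40, ops_logic=ops - 40, f_achievable_hz=200_000_000)
print(logic_for_cores, "logic elements available for cores")
print(cd_bit / 1e9, "bit-level GOPs")
print(cd_int / 1e9, "memory-sustainable integer GOPs")
```

In this sketch the raw operation count (100) is cut to 60 because only 120 memory paths are assumed, which mirrors how a memory-sustainable figure can fall below the raw one.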
• CD for FMC and coarse-grained RMC devices
o Bit-level
$$CD_{bit} = f \times \left[ \sum_i W_i \times N_i \right]$$
o Integer & floating-point
$$CD_{int/FP} = f \times \sum_i \frac{N_i}{CPI_i}$$
o Symbol definitions
▪ $W_i$: width of element type $i$
▪ $N_i$: # of elements of type $i$, or # of instructions that can be issued simultaneously
▪ $f$: clock frequency
▪ $CPI_i$: cycles per instruction for element $i$
• CDW for all devices
o Calculated using CD for each level of parallelism and dividing by power consumption at that level of parallelism
o CDW is critical metric for HPEC systems
• For all RMC
o Power scales linearly with resource utilization
• For FPGAs
o Vendor tools (PowerPlay, XPower) used to estimate power for maximum LUT, FF, block memory, and DSP utilization at maximum frequency
o Maximum power is scaled by ratio of achievable frequency to maximum frequency
• For all FMC
o Use fixed, maximum power from vendor documentation
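The fixed-device formulas and the CDW normalization can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are invented here, and the device figures (clock, unit counts, CPI values, power) are hypothetical rather than taken from any device in the study.

```python
def fmc_bit_level_cd(f_hz, elements):
    """CD_bit = f * sum_i W_i * N_i, with elements as (width_bits, count) pairs."""
    return f_hz * sum(width * count for width, count in elements)


def fmc_int_fp_cd(f_hz, units):
    """CD_int/FP = f * sum_i N_i / CPI_i, with units as (count, cpi) pairs."""
    return f_hz * sum(count / cpi for count, cpi in units)


def scaled_fpga_power(max_power_w, f_achievable_hz, f_max_hz):
    """Scale a vendor-tool maximum power estimate by f_achievable / f_max."""
    return max_power_w * f_achievable_hz / f_max_hz


def cdw(cd_ops_per_s, power_w):
    """Computational density per watt at one level of parallelism."""
    return cd_ops_per_s / power_w


# Hypothetical fixed device: two single-cycle units and one two-cycle unit at 1 GHz.
cd = fmc_int_fp_cd(1e9, [(2, 1), (1, 2)])
print(cdw(cd, 10.0) / 1e9, "GOPs per watt at a fixed 10 W")

# Hypothetical FPGA: 20 W estimated at f_max, cores achieve half of f_max.
print(scaled_fpga_power(20.0, 250e6, 500e6), "W after frequency scaling")
```

For fixed devices the power term stays at the vendor maximum, so CDW tracks CD directly; for FPGAs both CD and the power estimate move with the achievable frequency.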
• Internal Memory Bandwidth (IMB)
o Overall application performance may be limited by memory system
o Cache-based systems (CBS)
$$IMB_{cache} = \sum_i \%\mathit{hitrate} \times \frac{N_i \times P_i \times W_i \times f_i}{8 \times CPA_i}$$
▪ Separate metrics for each level of cache
▪ Calculate bandwidth over range of hit rates
o Block-based systems (BBS)
$$IMB_{block} = \sum_i \frac{N_i \times P_i \times W_i \times f_i}{8 \times CPA_i}$$
▪ Calculate bandwidth over a range of achievable frequencies
▪ For fixed-frequency devices, IMB is constant
▪ Assume most parallel configuration (wide & shallow configuration of blocks)
▪ Use dual-port configuration when available
o Symbol definitions
▪ $\%\mathit{hitrate}$: hit-rate scale factor
▪ $N_i$: # of blocks of element $i$
▪ $P_i$: # of ports or simultaneous accesses supported by element $i$
▪ $W_i$: width of datapath
▪ $f_i$: memory operating frequency, variable for FPGAs
▪ $CPA_i$: # of clock cycles per memory access
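A short Python sketch of the two IMB formulas follows. It assumes that widths are given in bits, so the division by 8 yields bytes per second, and it applies a single hit rate to all elements of one cache level; the `MemoryElement` type, the function names, and the example memory configuration are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MemoryElement:
    blocks: int             # N_i: number of blocks of this element type
    ports: int              # P_i: ports or simultaneous accesses per block
    width_bits: int         # W_i: datapath width (assumed to be in bits)
    freq_hz: float          # f_i: memory operating frequency
    cycles_per_access: int  # CPA_i: clock cycles per memory access


def imb_block(elements):
    """IMB_block = sum_i N_i * P_i * W_i * f_i / (8 * CPA_i)."""
    return sum(
        e.blocks * e.ports * e.width_bits * e.freq_hz / (8 * e.cycles_per_access)
        for e in elements
    )


def imb_cache(elements, hit_rate):
    """IMB_cache: the same sum scaled by a hit-rate factor between 0 and 1."""
    return hit_rate * imb_block(elements)


# Hypothetical block RAM: 4 dual-port blocks, 32 bits wide, 100 MHz, 1 cycle per access.
bram = [MemoryElement(blocks=4, ports=2, width_bits=32, freq_hz=100e6, cycles_per_access=1)]
print(imb_block(bram) / 1e9, "GB/s for the block-based case")

# Sweep hit rates, as the cache-based metric is evaluated over a range of them.
for hit_rate in (0.5, 0.75, 1.0):
    print(hit_rate, imb_cache(bram, hit_rate) / 1e9, "GB/s")
```

Sweeping `freq_hz` instead of `hit_rate` gives the block-based counterpart, where bandwidth is evaluated over a range of achievable frequencies.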
Raw and memory-sustainable CD (in GOPs); "-" marks precisions with no reported value

                                        Bit-level       16-bit Int.       32-bit Int.              SPFP              DPFP
Device            Class Process      Raw Sustain.      Raw Sustain.      Raw Sustain.      Raw Sustain.      Raw Sustain.
Arrix FPOA        RMC   90 nm       6144     6144      384      384      192      192        -        -        -        -
ECA-64            RMC   90 nm       2176     2176       13       13        6        6        -        -        -        -
MONARCH           RMC   90 nm       2048     2048       65       65       65       65       65       65        -        -
Stratix-II S180   RMC   90 nm      63181    63181      442      442      123      123       53       53       11       11
Stratix-III SL340 RMC   65 nm     154422   154422      933      918      213      213       96       96       26       26
Stratix-III SE260 RMC   65 nm     119539   119539      817      778      204      204       73       73       22       22
Stratix-IV SE530  RMC   40 nm     243866   243866      990      766      312      312      171      171       88       88
TILE64            RMC   90 nm       4608     4608      240      240      144      144        -        -        -        -
Virtex-4 LX200    RMC   90 nm      89952    89952      357      116       66       42       68       46       16       16
Virtex-4 SX55     RMC   90 nm      29184    29184      365      110       71       40       31       31        7        7
Virtex-5 LX330T   RMC   65 nm     150163   150163      606      300      131      122      119      116       26       26
Virtex-5 SX95T    RMC   65 nm      48435    48435      599      226      221       92       82       82       15       15
Am2045            FMC   130 nm      8064     8064      504      504      252      252        -        -        -        -
Atom N270         FMC   45 nm        307      307       14       14        8        8        8        8        5        5
Cell BE           FMC   90 nm       4096     4096      205      205      115      115      205      205       19       19
CSX600            FMC   130 nm      1536     1536       24       24       24       24       24       24       24       24
MPC7447           FMC   130 nm       352      352       11       11       11       11       11       11       11       11
MPC8640D          FMC   90 nm        576      576       34       34       18       18       12       12        6        6

• Maximum memory-sustainable CD is shown above (in GOPs)
• CD scales with parallel operations
• Various devices may have their highest CDs at different levels of parallelism
• Top CD performers are highlighted on the slide
• RMC devices perform best for bit-level & integer ops, FMC for floating-point
• Memory-sustainability issues seen when many, small registers are needed
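One way to read the raw versus sustainable columns is as a ratio: the share of raw CD that the on-chip memory paths can keep fed. The sketch below computes that ratio for a few 16-bit integer entries copied from the results above; the ratio itself and the name `sustained_fraction` are illustrative conveniences introduced here, not a metric defined in the presentation.

```python
# (raw, sustainable) 16-bit integer CD in GOPs, copied from the results above.
CD_16BIT = {
    "Stratix-IV SE530": (990, 766),
    "Virtex-4 LX200": (357, 116),
    "Virtex-5 SX95T": (599, 226),
    "Cell BE": (205, 205),
}


def sustained_fraction(raw, sustainable):
    """Share of raw CD that remains once memory paths limit the operations."""
    return sustainable / raw


fractions = {
    name: sustained_fraction(raw, sustainable)
    for name, (raw, sustainable) in CD_16BIT.items()
}

# List devices from most to least memory-limited.
for name, fraction in sorted(fractions.items(), key=lambda item: item[1]):
    print(f"{name:18s} {fraction:.2f}")
```

Devices whose fraction falls well below 1 are the ones where the memory-sustainability limit, rather than raw compute resources, bounds the reported CD.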
[Figure: results chart with a legend by process node (130 nm, 90 nm, 65 nm, 45 nm, 40 nm) and groups labeled 40 nm FPGA, 65 nm FPGAs, 90 nm FPGAs, and non-FPGAs]
• RMC devices (specifically FPGAs) far outperform FMC devices
o High bit-level CD due to fine-grained, LUT-based architecture
o Low power
o Power scaling with parallelism (area)
• EP4SE530 (Stratix-IV) is best overall
• 65 nm FPGAs are all strong performers
• V4 LX200 is top performer of 90 nm devices
• Coarse-grained devices (both RMC & FMC) show poor performance
[Figure: results chart with a legend by process node (130 nm, 90 nm, 65 nm, 45 nm, 40 nm)]
• RMC devices outperform FMC
o Low power
o Power scaling with parallelism (area)
o Requires algorithms that can take advantage of numerous parallel operations
• Ambric (130 nm) shows promising preliminary results despite older process
• Virtex-4 SX55 is best performer in 90 nm class
• Strong performance from ECA-64 due to extremely low power consumption (one watt at full utilization), despite low CD
• FPOA gives good, moderate performance due to high CD, but with higher power requirements
• Virtex-5 SX95T (65 nm) is best overall, with Stratix-IV EP4SE530 (40 nm) a close second