 
FPGA Multipliers
Bogdan PASCA
Arénaire project, ENS-Lyon/INRIA/CNRS/Université de Lyon, France
RAIM'11, February 7-10, 2011
Outline
- Background & Context
- Algorithmic techniques for reducing DSP count of large multipliers
  - Karatsuba-Ofman algorithm
  - Non-standard tilings
  - Squarers
  - Truncated multipliers
- Conclusions
Bogdan PASCA FPGA Multipliers 1
What's an FPGA?
- Field Programmable Gate Array
- integrated circuit
- has a regular architecture (hence "array")
- logic elements can be programmed to perform various functions
Modern FPGA Architecture
- a set of configurable logic elements
- on-chip memory blocks
- digital signal processing (DSP) blocks (including multipliers)
- connected by a configurable wire network
- all connected to the outside world by I/O pins
[Figure: FPGA layout showing LUT, RAM and DSP blocks; the DSP detail is labeled with 18-bit operands and a 17-bit shift]
What can we compute?
A 3-bit by 2-bit multiplication, $X = x_2 x_1 x_0$ times $Y = y_1 y_0$, producing $P = p_4 p_3 p_2 p_1 p_0$.
Partial products, computed in LUTs:
$l_0 = y_0 \wedge x_0$, $l_1 = y_0 \wedge x_1$, $l_2 = y_0 \wedge x_2$
$u_0 = y_1 \wedge x_0$, $u_1 = y_1 \wedge x_1$, $u_2 = y_1 \wedge x_2$
The row $u_2 u_1 u_0$ is shifted left by one position and added to $l_2 l_1 l_0$; the addition uses full adders (FA).
[Figure: schoolbook layout of the multiplication and its mapping onto LUTs and full adders]
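The slide's 3-bit by 2-bit example can be sketched in software as follows. This is only an illustrative bit-level model, not the author's implementation: the function names `and_gate`, `full_adder` and `multiply_3x2` are hypothetical, and the adder is modeled as a simple ripple-carry chain, which is an assumption about how the full adders are connected.

```python
def and_gate(a, b):
    """One partial-product bit, as computed by a LUT on the slide."""
    return a & b

def full_adder(a, b, cin):
    """Return (sum, carry_out) for three input bits."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def multiply_3x2(x, y):
    """Multiply a 3-bit x by a 2-bit y using only AND gates and full adders."""
    xb = [(x >> i) & 1 for i in range(3)]   # x0, x1, x2
    yb = [(y >> i) & 1 for i in range(2)]   # y0, y1
    l = [and_gate(yb[0], xi) for xi in xb]  # l_i = y0 AND x_i
    u = [and_gate(yb[1], xi) for xi in xb]  # u_i = y1 AND x_i
    p0 = l[0]                               # lowest bit needs no addition
    p1, c = full_adder(l[1], u[0], 0)       # u row is shifted left by one
    p2, c = full_adder(l[2], u[1], c)
    p3, c = full_adder(0, u[2], c)
    p4 = c
    return p0 | (p1 << 1) | (p2 << 2) | (p3 << 3) | (p4 << 4)
```

Checking the model against ordinary integer multiplication for all 8 x 4 input pairs is a quick way to confirm the bit equations above.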
The need for DSP blocks
Multiplication in logic is expensive:
$n \times n \text{ bit} \approx \underbrace{n^2}_{\text{partial products}} + \underbrace{n(n-1)}_{\text{adder tree}} \text{ LUTs}$
$18 \times 18 \text{ bit} \approx 324 \text{ LUTs} + 306 \text{ LUTs} = 630 \text{ LUTs}$
1 DSP block = 8 LEs (size on FPGA layout)
DSP blocks are a necessity in modern FPGAs.
[Figure: DSP block diagram with ports labeled A, B, C and P, bus widths of 18 and 48 bits, and two 17-bit shifts]
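The LUT estimate above is easy to evaluate for other operand sizes. The helper below is a minimal sketch of the slide's formula only; the name `lut_cost` is hypothetical, and the formula is the slide's approximation, not a synthesis result.

```python
def lut_cost(n):
    """Approximate LUT cost of an n x n bit multiplier built in logic.

    Returns (partial_products, adder_tree, total), following the slide's
    estimate of n^2 LUTs for partial products and n(n-1) for the adder tree.
    """
    partial_products = n * n
    adder_tree = n * (n - 1)
    return partial_products, adder_tree, partial_products + adder_tree

# The 18 x 18 bit case from the slide:
print(lut_cost(18))  # (324, 306, 630)
```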
DSP-Hungry Applications
- "FPGA floating point performance – a pencil and paper evaluation" [1]
  → DSP blocks are a scarce resource for accelerating DP apps.
- "Efficient reconfigurable design for pricing Asian options" [2]
  → LUTs 46%, RAM 4%, DSP 100% (192)
- "Implementation and evaluation of an arithmetic pipeline on FLOPS-2D: multi-FPGA system" [3]
  → a) LE 30%, DSP 86%; b) LE 52%, DSP 88%; c) LE 63%, DSP 100%
- "A temporal coding hardware implementation for spiking neural networks" [4]
  → 16 PE: LE 22%, RAM 3%, DSP 74% (100/136)
Four recipes for saving DSPs
[1] D. Strenski (HPCWire, 2007)
[2] Anson H.T. Tse, David B. Thomas, K. H. Tsoi, Wayne Luk (HEART'10)
[3] H. Morisita, K. Inakagata, Y. Osana, N. Fujita, H. Amano (HEART'10)
[4] Marco Nuno-Maganda, Cesar Torres-Huitzil (HEART'10)
Perceiving Multiplications Visually
- classical binary multiplication
- all sub-products can be properly located inside the diamond
- rotate the diamond so as to obtain a rectangle
[Figure: the product of X and Y drawn as a diamond, with sub-products identified by operand slices such as X 2:0, Y 2:0 and X 5:3, Y 5:3]
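The idea that every sub-product has a fixed place in the overall product can be sketched numerically. The example below assumes 6-bit operands split into 3-bit slices, matching the slice labels X 5:3 and X 2:0 in the figure; the function name `tiled_multiply` and its parameters are hypothetical and only illustrate the decomposition, not any particular hardware tiling.

```python
def tiled_multiply(x, y, width=6, chunk=3):
    """Multiply x by y by summing sub-products of chunk-bit slices.

    The slice pair (i, j) covers bits [i*chunk, (i+1)*chunk) of x and
    [j*chunk, (j+1)*chunk) of y; its sub-product is placed at weight
    2**((i + j) * chunk), which is its position in the rectangle.
    """
    mask = (1 << chunk) - 1
    n = -(-width // chunk)  # number of slices per operand (ceiling division)
    total = 0
    for i in range(n):
        xi = (x >> (i * chunk)) & mask
        for j in range(n):
            yj = (y >> (j * chunk)) & mask
            total += (xi * yj) << ((i + j) * chunk)
    return total
```

With the default parameters there are four tiles, one per pair of slices, and their shifted sum equals the full 6-bit by 6-bit product.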