Peak Performance Model for a Custom Precision Floating-Point Dot Product on FPGAs

  1. Peak Performance Model for a Custom Precision Floating-Point Dot Product on FPGAs. UCHPC (UnConventional High Performance Computing Workshop), Euro-Par 2010, August 30th, 2010. Manfred Mücke, Bernd Lesser, Wilfried N. Gansterer, {manfred.muecke | bernd.lesser | wilfried.gansterer}@univie.ac.at. Research Lab Computational Technologies and Applications (RLCTA), University of Vienna, http://rlcta.univie.ac.at. Slide footer: Bernd Lesser, RLCTA.

  2. Outline: 1 Motivation, 2 Architecture, 3 Experiments, 4 Dot-product performance model, 5 Conclusions, 6 Future work.

  3. Motivation. Accelerating scientific applications, for instance linear solvers and matrix operations. The dot-product operation is a central part of many scientific computing applications. Our work deals with the performance analysis of custom-precision dot-product architectures on FPGAs.

  4. Why on FPGAs? Some applications do not require double-precision data types: they can keep the double-precision range (11-bit exponent) while reducing the mantissa (mantissa bit width ≤ 52). On CPUs and GPUs, a speedup can only be achieved if the mantissa bit width is 23 bits (single precision) or 10 bits (half precision). FPGAs are the only hardware platform that can benefit from bit width reduction at a fine-grained level. Lower precision translates directly into increased parallelism, hence higher throughput and speedup. Larger FPGAs likewise translate into increased parallelism, throughput and speedup.

  5. Dot-product. Our observation: the maximum size of a parallel floating-point dot-product on FPGAs scales superlinearly with decreasing mantissa bit width. Question: how much more performance can we gain? Goal: give a quantitative model for the performance improvement as a function of the mantissa bit width.

  6. Architecture. Canonical dot-product for real-valued input vectors $a, b$ of length $n$: $\langle a, b \rangle = a^T b = \sum_{i=1}^{n} a_i \cdot b_i$. There are different possibilities to implement a dot-product in hardware. Our choice: a binary-tree based dot-product architecture. Thus: $m$ parallel multipliers and $m - 1$ adders. [Figure: binary tree of $m$ multipliers (∗) feeding $m - 1$ adders (+) that produce the result.]
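The multiplier and adder counts on this slide can be checked with a small software sketch. The following Python function is only an illustration of the binary-tree structure, not the authors' hardware description; the name `tree_dot` and the pass-through handling of odd-sized levels are assumptions made for this example.

```python
def tree_dot(a, b):
    """Dot product via one multiplier layer followed by a binary adder tree.

    Returns the result and the number of additions performed.
    Illustrative software model only; in hardware each level runs in parallel.
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    # Multiplier layer: m products.
    level = [x * y for x, y in zip(a, b)]
    adds = 0
    # Adder tree: combine neighbouring values until a single value remains.
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
            adds += 1
        if len(level) % 2:  # an unpaired value passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0], adds


result, adds = tree_dot([1, 2, 3, 4], [5, 6, 7, 8])
print(result, adds)
```

For $m = 4$ operand pairs the sketch performs 4 multiplications and 3 additions, matching the $m$ multipliers and $m - 1$ adders stated above.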

  7. Splitting (arbitrarily long) vectors: $\langle a, b \rangle = \sum_{i=1}^{n} a_i \cdot b_i = \sum_{j=0}^{\lfloor n/m \rfloor - 1} \sum_{i=1}^{m} a_{i + j \cdot m} \cdot b_{i + j \cdot m} + \sum_{i = \lfloor n/m \rfloor \cdot m + 1}^{n} a_i \cdot b_i$. We investigate a custom dot-product operator accepting a maximum input vector length $m$, for different floating-point mantissa bit widths. [Figure: input vectors $a$ and $b$ of length $n$ split into blocks of $m$ elements that feed the multiplier and adder tree; the $m$-input dot-product operator is marked "our focus".]
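The splitting formula can be made concrete with a short Python sketch. It mirrors the double sum over full blocks of length $m$ plus the remainder sum, using 0-based indices; the function name `blocked_dot` is hypothetical and the code is a plain software model, not the FPGA design.

```python
def blocked_dot(a, b, m):
    """Dot product of length-n vectors computed in blocks of m element pairs."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    if m < 1:
        raise ValueError("block size m must be at least 1")
    n = len(a)
    full_blocks = n // m  # floor(n / m)
    total = 0
    # Full blocks: each inner sum corresponds to one pass through the m-input operator.
    for j in range(full_blocks):
        total += sum(a[i + j * m] * b[i + j * m] for i in range(m))
    # Remainder: the last n - floor(n / m) * m element pairs.
    total += sum(a[i] * b[i] for i in range(full_blocks * m, n))
    return total


a = list(range(1, 11))   # n = 10
b = [2] * 10
print(blocked_dot(a, b, 4))  # two full blocks of 4, remainder of 2
```

The result is independent of the block size $m$ for exact arithmetic; with floating-point inputs the summation order, and therefore the rounding, may differ.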

  8. Given an FPGA of a certain size, we want to know the peak performance as a function of the mantissa bit width used. For the dot-product architecture, peak performance depends on the number of parallel multipliers $m_{max}$ and the maximum frequency $f_{max}$. Thus, we need an implementation for each mantissa bit width and a measurement of its hardware resource usage. [Figure: tree of multipliers (∗) and adders (+).]

  9. Experiments. Implementation issues: we implemented a generic dot-product architecture for arbitrary vector lengths, using the standard IEEE 754 floating-point format and arbitrary-precision floating-point modules. Chosen library: FPLibrary (Arenaire project, ENS Lyon), http://www.ens-lyon.fr/LIP/Arenaire/Ware/FPLibrary/; combinatorial operators are used. Measurement issues: the synthesis tool is Quartus II (Altera). Measurements are automated using the TCL scripting language: set generics, synthesize the implementation, record the hardware resource usage.

  10. Our implementation accepts generic parameters: mantissa bit width, exponent bit width, and $m$. It has $m$ parallel multipliers (accepting $2m$ input operands) and a binary adder tree of depth $\lceil \log_2 m \rceil$. Stages are pipelined (registers). Total latency: $\lceil \log_2 m \rceil + 3$. Peak performance: $P = (2m - 1) \cdot f_{max}$ [Flop/s]. [Figure: four multipliers with inputs $a_0, b_0$ to $a_3, b_3$ feeding a tree of adders that produces the result.]
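The latency and peak-performance expressions on this slide are easy to evaluate numerically. The Python sketch below does so under stated assumptions: the function names are hypothetical, and the example values ($m = 64$, $f_{max} = 100$ MHz) are made up for illustration and are not measurements from the talk.

```python
def adder_tree_depth(m):
    """ceil(log2(m)) for m >= 1, computed with integers to avoid float rounding."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return (m - 1).bit_length()


def total_latency(m):
    """Total latency as given on the slide: ceil(log2(m)) + 3."""
    return adder_tree_depth(m) + 3


def peak_flops(m, f_max_hz):
    """Peak performance P = (2m - 1) * f_max in Flop/s.

    Each result needs m multiplications and m - 1 additions.
    """
    return (2 * m - 1) * f_max_hz


# Hypothetical example values, not taken from the measurements.
m, f_max_hz = 64, 100e6
print(total_latency(m), peak_flops(m, f_max_hz) / 1e9, "GFlop/s")
```

The factor $2m - 1$ counts the $m$ multiplications and $m - 1$ additions completed per clock cycle once the pipeline is full.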

  11. Methodology. First, we perform measurements on the largest Cyclone II FPGA device (EP2C70). Then we develop a model that best approximates these original measurements. Finally, we verify the model class with the measurements obtained from two more recent devices.

      Table: Hardware resources of used FPGAs.

      FPGA device   FPGA family   Logic elements   DSP blocks [9x9-bit blocks]   Emb. memory [kbits]
      EP2C70        Cyclone II    68,416           300                           1,125
      EP3C80        Cyclone III   81,264           488                           2,745
      EP3SL70       Stratix III   67,500           576                           2,214

  12. [Figure: FPGA peak performance vs. mantissa bit width. Panel "Maximum Dot Product Size": EP2C70 maximum dot product size (maximum input operand pairs) and EP2C70 $f_{max}$ (maximum clock frequency [MHz]) over mantissa bit width [bits]. Panel "Dot Product Peak Performance": EP2C70 peak performance [GFlop/s] over mantissa bit width [bits].] We measure the maximum dot-product size $m_{max}$ and the maximum frequency $f_{max}$ for mantissa sizes from 52 down to 4 bits, and calculate the peak performance $P = (2m - 1) \cdot f_{max}$ [Flop/s]. Observation: peak performance grows exponentially as the mantissa bit width decreases.

  13. Dot-product performance model. Model fit: a fractional polynomial of the form $P(p) = c_1 + c_2 \cdot p^{c_3}$ with $c_1, c_2, c_3 \in \mathbb{Q}$, where $p$ is the mantissa bit width. EP2C70: $P(p) = -7.37 + 32.16 \cdot p^{-0.35}$ [GFlop/s]. Here $P = (2m - 1) \cdot f_{max}$ [Flop/s] is the measured value and $\hat{P}$ the modelled value. Relative error: $Error_{rel} = \frac{P - \hat{P}}{P} \cdot 100$ [%]. Absolute error: $Error_{abs} = P - \hat{P}$ [Flop/s]. [Figure "Dot Product Peak Performance Model": EP2C70 measured peak and fit (peak performance [GFlop/s]), with relative error [%] and absolute error [GFlop/s], over mantissa bit width [bits].]
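To see what the fitted model predicts, it can be evaluated directly. The Python sketch below uses the EP2C70 constants quoted on the slide; the function names are hypothetical, and the measured value passed to the error functions is a made-up number chosen only to show how the two error definitions are applied.

```python
def model_peak(p, c1, c2, c3):
    """Fractional-polynomial model P(p) = c1 + c2 * p**c3, in GFlop/s."""
    return c1 + c2 * p ** c3


def relative_error_percent(measured, modelled):
    """Error_rel = (P - P_hat) / P * 100, in percent."""
    return (measured - modelled) / measured * 100.0


def absolute_error(measured, modelled):
    """Error_abs = P - P_hat, in the same unit as the inputs."""
    return measured - modelled


# EP2C70 fit constants as quoted on the slide.
EP2C70 = (-7.37, 32.16, -0.35)

for p in (4, 23, 52):
    print(p, round(model_peak(p, *EP2C70), 2))

# Hypothetical measured value (GFlop/s), for illustration only.
measured = 10.0
modelled = 9.5
print(relative_error_percent(measured, modelled), absolute_error(measured, modelled))
```

Under this fit the modelled peak is roughly 12.4 GFlop/s at a 4-bit mantissa and roughly 0.7 GFlop/s at 52 bits, consistent with the 0 to 14 GFlop/s axis range of the plot.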

  14. Verifying the observations on more recent FPGA devices (families). Fits: $P_{EP3SL70}(p) = -19.68 + 60.90 \cdot p^{-0.26}$, $P_{EP3C80}(p) = -10.31 + 43.29 \cdot p^{-0.33}$, $P_{EP2C70}(p) = -7.37 + 32.16 \cdot p^{-0.35}$ [GFlop/s]. Given appropriate constants, peak performance as a function of mantissa bit width can be modeled quite accurately. Maximum absolute error: 1 GFlop/s. Average relative error: ≈ 5 to 7 %. [Figure "Dot Product Peak Performance Model": measured peak and fit for EP3SL70, EP3C80 and EP2C70 (peak performance [GFlop/s]), with relative error [%] and absolute error [GFlop/s], over mantissa bit width [bits].]
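The three fits can be compared side by side to estimate how much modelled peak performance is gained by shrinking the mantissa. The Python sketch below uses only the constants quoted on the slide; the dictionary layout, the function names, and the choice of 52 and 23 bits as comparison points are assumptions made for this example, and the printed ratios are model predictions, not measured speedups.

```python
# Fit constants (c1, c2, c3) as quoted on the slide, model output in GFlop/s.
FITS = {
    "EP2C70": (-7.37, 32.16, -0.35),
    "EP3C80": (-10.31, 43.29, -0.33),
    "EP3SL70": (-19.68, 60.90, -0.26),
}


def model_peak(device, p):
    """Modelled peak performance P(p) = c1 + c2 * p**c3 for a device."""
    c1, c2, c3 = FITS[device]
    return c1 + c2 * p ** c3


def modelled_gain(device, p_from, p_to):
    """Ratio of modelled peak performance when moving from p_from to p_to bits."""
    return model_peak(device, p_to) / model_peak(device, p_from)


for device in FITS:
    print(device,
          round(model_peak(device, 52), 2),
          round(model_peak(device, 23), 2),
          round(modelled_gain(device, 52, 23), 2))
```

Because the fitted curves approach small values near 52 bits, ratios taken relative to that end are sensitive to the fit error reported above and should be read as rough indications only.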
