Peak Performance Model for a Custom Precision Floating-Point Dot Product on FPGAs
UCHPC - UnConventional High Performance Computing Workshop, Europar 2010
Manfred Mücke, Bernd Lesser, Wilfried N. Gansterer
{manfred.muecke | bernd.lesser | wilfried.gansterer}@univie.ac.at
Research Lab Computational Technologies and Applications, University of Vienna
http://rlcta.univie.ac.at
August 30th, 2010
Bernd Lesser, RLCTA

Outline
1 Motivation
2 Architecture
3 Experiments
4 Dot-product performance model
5 Conclusions
6 Future work

Motivation
Accelerating scientific applications, for instance:
- accelerating linear solvers
- accelerating matrix operations
- ...
A central part of many scientific computing applications: the dot-product operation
Our work deals with: performance analysis of custom-precision dot-product architectures on FPGAs

Why on FPGAs?
There are applications that do not require double-precision data types:
- Keep double-precision range (11-bit exponent)
- Reduce mantissa (mantissa bit width ≤ 52)
On CPUs / GPUs:
- Speedup can only be achieved if mantissa bit width = 23 bit (single precision) or = 10 bit (half precision)
On FPGAs:
- FPGAs are the only hardware platform that can benefit from bit width reduction at a fine-grained level
- Lower precision translates directly into increased parallelism → throughput → speedup
- Larger FPGAs translate into increased parallelism → throughput → speedup
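To get a feel for what "keep the 11-bit exponent, reduce the mantissa" means numerically, the following minimal Python sketch emulates a reduced mantissa on an ordinary double. It is an illustration only: the function name `truncate_mantissa` is hypothetical, and it truncates the dropped bits, which is an assumption and need not match the rounding behaviour of the floating-point operators used on the FPGA.

```python
import struct


def truncate_mantissa(x: float, mantissa_bits: int) -> float:
    """Keep only the top `mantissa_bits` of the 52-bit double mantissa.

    Sign and the 11-bit exponent are left untouched. The dropped bits
    are simply cleared (truncation), which is an assumption of this sketch.
    """
    if not 1 <= mantissa_bits <= 52:
        raise ValueError("mantissa_bits must be in 1..52")
    # Reinterpret the double as a 64-bit unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    drop = 52 - mantissa_bits
    # Mask that clears the `drop` least significant mantissa bits.
    mask = 0xFFFFFFFFFFFFFFFF & ~((1 << drop) - 1)
    return struct.unpack("<d", struct.pack("<Q", bits & mask))[0]
```

With 52 mantissa bits the value is returned unchanged; with fewer bits the value loses precision but keeps its exponent range.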

Dot-product:
Our observation: The maximum size of a parallel floating-point dot-product on FPGAs scales superlinearly with decreasing mantissa bit width
Question: How much more performance can we gain?
Goal: Give a quantitative model for the performance improvement as a function of the mantissa bit width

Architecture
Canonical dot-product for real-valued input vectors a, b:
$\langle a, b \rangle = a^T b = \sum_{i=1}^{n} a_i \cdot b_i$
Different possibilities to implement a dot-product in hardware
Our choice: binary-tree based dot-product architecture
Thus:
- m parallel multipliers
- m − 1 adders
[Figure: m multipliers (∗) feeding a binary tree of m − 1 adders (+) that produces the result.]
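The operator counts of the binary-tree architecture can be checked with a small software sketch. The Python function below is a hypothetical illustration (the name `tree_dot` and its return convention are assumptions, not part of the presented implementation): it multiplies all operand pairs, then adds neighbouring partial results level by level, and counts the operations.

```python
def tree_dot(a, b):
    """Dot product via a binary adder tree.

    Returns (result, multiplications, additions) so the operator
    counts of the tree can be inspected.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if len(a) == 0:
        raise ValueError("vectors must not be empty")
    # Stage 0: one multiplier per operand pair (m multipliers).
    level = [x * y for x, y in zip(a, b)]
    mults = len(level)
    adds = 0
    # Adder tree: combine neighbours until one value remains.
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
            adds += 1
        if len(level) % 2 == 1:
            # An odd element is passed on to the next level unchanged.
            nxt.append(level[-1])
        level = nxt
    return level[0], mults, adds
```

For m operand pairs the sketch always performs m multiplications and m − 1 additions, matching the counts on the slide.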

Splitting (arbitrarily long) vectors:
$\langle a, b \rangle = \sum_{i=1}^{n} a_i \cdot b_i = \sum_{j=0}^{\lfloor n/m \rfloor - 1} \sum_{i=1}^{m} a_{i+j \cdot m} \cdot b_{i+j \cdot m} + \sum_{i=\lfloor n/m \rfloor \cdot m + 1}^{n} a_i \cdot b_i$
We investigate: custom dot-product operator accepting a maximum input vector length m, for different floating-point mantissa bit widths
[Figure: blocks of the input vectors a and b feeding the m-input dot-product operator (multipliers and adder tree); the operator is annotated "our focus".]
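The splitting formula can be mirrored in software as follows. This is a minimal Python sketch under stated assumptions: `operator_dot` is a hypothetical stand-in for the m-input hardware operator, and the remainder is handled by the same stand-in on a shorter block.

```python
def operator_dot(a_block, b_block):
    """Stand-in for the hardware dot-product operator on one block."""
    return sum(x * y for x, y in zip(a_block, b_block))


def split_dot(a, b, m):
    """Dot product of arbitrarily long vectors using blocks of length m."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if m < 1:
        raise ValueError("m must be at least 1")
    n = len(a)
    full_blocks = n // m  # floor(n / m)
    total = 0
    # Double sum: j runs over the full blocks, the operator covers i = 1..m.
    for j in range(full_blocks):
        lo = j * m
        total += operator_dot(a[lo:lo + m], b[lo:lo + m])
    # Remainder term: elements floor(n / m) * m + 1 .. n (1-based).
    tail = full_blocks * m
    total += operator_dot(a[tail:], b[tail:])
    return total
```

When n is a multiple of m the remainder block is empty and contributes zero.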

Given an FPGA of a certain size, we want to know:
- Peak performance as a function of the used mantissa bit width
Dot-product architecture: peak performance depends on
- Number of parallel multipliers $m_{\max}$
- Maximum frequency $f_{\max}$
Thus, we need to:
- Implement the architecture for each mantissa bit width
- Measure its hardware resource usage

Experiments
Implementation issues:
- We implemented a generic dot-product architecture for arbitrary vector lengths
- Standard IEEE 754 floating-point format
- Arbitrary-precision floating-point modules:
  - chosen library: FPLibrary (Arénaire project, at ENS Lyon)
    http://www.ens-lyon.fr/LIP/Arenaire/Ware/FPLibrary/
  - combinatorial operators used
Measurement issues:
- Synthesis tool used: Quartus II (Altera)
- Automated measurements using the Tcl scripting language:
  - Set generics
  - Synthesize implementation
  - Record hardware resource usage

Our implementation:
- Accepts generic parameters: mantissa bit width, exponent bit width, m
- m parallel multipliers (accepts 2m input operands)
- Binary adder tree of depth $\lceil \log_2 m \rceil$
- Stages pipelined (registers)
- Total latency: $\lceil \log_2 m \rceil + 3$
[Figure: example with four multipliers (inputs a0..a3, b0..b3) feeding three adders that produce the result.]
Peak performance: $P = (2m - 1) \cdot f_{\max}$ [Flop/s]
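The three formulas on this slide are easy to evaluate directly. The Python sketch below does so; the function names are hypothetical and any numbers passed in are illustrative inputs, not measurements from the presentation.

```python
def adder_tree_depth(m: int) -> int:
    """Depth of the binary adder tree, ceil(log2(m)), in exact integer arithmetic."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return (m - 1).bit_length()


def total_latency(m: int) -> int:
    """Total latency as stated on the slide: ceil(log2(m)) + 3."""
    return adder_tree_depth(m) + 3


def peak_performance(m: int, f_max_hz: float) -> float:
    """Peak performance P = (2m - 1) * f_max in Flop/s.

    Each cycle the pipelined operator completes m multiplications
    and m - 1 additions, i.e. 2m - 1 floating-point operations.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    return (2 * m - 1) * f_max_hz
```

For example, `peak_performance(64, 100e6)` evaluates the formula for a hypothetical operator with 64 multipliers clocked at 100 MHz.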

Methodology:
- First: we perform measurements on the largest Cyclone II FPGA device (EP2C70)
- Then: develop a model that best approximates these original measurements
- Finally: verify the model class with the measurements obtained from two more recent devices

FPGA Device   FPGA Family   Logic elements   DSP blocks [9x9-bit blocks]   Emb. Memory [kbits]
EP2C70        Cyclone II    68,416           300                           1,125
EP3C80        Cyclone III   81,264           488                           2,745
EP3SL70       Stratix III   67,500           576                           2,214

Table: Hardware resources of used FPGAs.

FPGA Peak Perf vs. Mantissa bit width
[Figure, panels "Maximum Dot Product Size" and "Dot Product Peak Performance": maximum input operand pairs, maximum clock frequency [MHz] and peak performance [GFlop/s] vs. mantissa bit width [Bits], 4 to 52; series: EP2C70 max. dot product size, EP2C70 f max, EP2C70 Peak.]
- Measure maximum dot-product size $m_{\max}$ and maximum frequency $f_{\max}$
- Mantissa sizes: 52 down to 4
- Calculate peak performance $P = (2m - 1) \cdot f_{\max}$ [Flop/s]
- Observation: peak performance grows exponentially

Dot-product performance model
Model fit: fractional polynomial of the form
$P(p) = c_1 + c_2 \cdot p^{c_3}, \quad c_1, c_2, c_3 \in \mathbb{Q}$
EP2C70: $P(p) = -7.37 + 32.16 \cdot p^{-0.35}$
Definitions:
- $P$ := measured value, $P = (2m - 1) \cdot f_{\max}$ [Flop/s]
- $\hat{P}$ := modelled value
- Relative error: $\mathrm{Error}_{rel} = \frac{P - \hat{P}}{P} \cdot 100$ [%]
- Absolute error: $\mathrm{Error}_{abs} = P - \hat{P}$ [Flop/s]
[Figure "Dot Product Peak Performance Model": EP2C70 Peak and the fit $P_{EP2C70}(p) = -7.37 + 32.16 \cdot p^{-0.35}$; panels: peak performance [GFlop/s], relative error [%], absolute error [GFlop/s] vs. mantissa bit width [Bits], 4 to 52.]
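The fitted model and the two error measures can be written down in a few lines of Python. This is a sketch under assumptions: p is taken to be the mantissa bit width, the model output is read as GFlop/s (the unit of the plotted fit), and the function names are hypothetical. No measured values are included; `measured` must come from the caller.

```python
def model_peak_gflops(p, c1=-7.37, c2=32.16, c3=-0.35):
    """Fractional-polynomial model P(p) = c1 + c2 * p**c3.

    Defaults are the EP2C70 constants from the slide; the result is
    assumed to be in GFlop/s.
    """
    if p <= 0:
        raise ValueError("mantissa bit width must be positive")
    return c1 + c2 * p ** c3


def relative_error_percent(measured, modelled):
    """Error_rel = (P - P_hat) / P * 100 [%]."""
    return (measured - modelled) / measured * 100.0


def absolute_error(measured, modelled):
    """Error_abs = P - P_hat, in the unit of the inputs."""
    return measured - modelled
```

Evaluating `model_peak_gflops` over p = 4..52 reproduces the shape of the fitted curve: the modelled peak rises as the mantissa bit width shrinks.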

Verify observations on more recent FPGA devices (families):
- EP3SL70: $P(p) = -19.68 + 60.90 \cdot p^{-0.26}$
- EP3C80: $P(p) = -10.31 + 43.29 \cdot p^{-0.33}$
- EP2C70: $P(p) = -7.37 + 32.16 \cdot p^{-0.35}$
[Figure "Dot Product Peak Performance Model": measured peaks and fits for EP3SL70, EP3C80 and EP2C70; panels: peak performance [GFlop/s], relative error [%], absolute error [GFlop/s] vs. mantissa bit width [Bits], 4 to 52.]
Given appropriate constants, peak performance as a function of mantissa bit width can be modeled quite accurately:
- Maximum absolute error: 1 GFlop/s
- Average relative error: ≈ 5 to 7 %
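With the three sets of constants, the fits can be compared across devices. The Python sketch below tabulates the model constants and derives a modelled gain between two mantissa bit widths. The names `FITS`, `modelled_peak_gflops` and `modelled_gain` are hypothetical, the GFlop/s unit is assumed from the plot axis, and the gain is a model prediction, not a measured result.

```python
# Fit constants (c1, c2, c3) as given on the slides.
FITS = {
    "EP2C70": (-7.37, 32.16, -0.35),
    "EP3C80": (-10.31, 43.29, -0.33),
    "EP3SL70": (-19.68, 60.90, -0.26),
}


def modelled_peak_gflops(device, p):
    """Evaluate P(p) = c1 + c2 * p**c3 for one device (assumed GFlop/s)."""
    if not 4 <= p <= 52:
        # The fits were obtained for mantissa bit widths 4..52 only.
        raise ValueError("p outside the fitted range 4..52")
    c1, c2, c3 = FITS[device]
    return c1 + c2 * p ** c3


def modelled_gain(device, p_low, p_high=52):
    """Ratio of modelled peak at p_low to modelled peak at p_high."""
    return modelled_peak_gflops(device, p_low) / modelled_peak_gflops(device, p_high)
```

For instance, `modelled_gain("EP2C70", 23)` gives the model's estimate of how much peak performance is gained by reducing the mantissa from 52 to 23 bits on that device.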
