 
              SONIA GONZALEZ-NAVARRO AND JAVIER HORMIGO Dept. Computer Architecture Universidad de Málaga (Spain) fjhormigo@uma.es
• New embedded applications increasingly demand FP computation
• The IEEE-754 FP standard was designed for general-purpose processors (GPP)
• Problems of using the FP standard:
  ▪ Lack of flexibility (e.g. word sizes)
  ▪ Compulsory requirements that are costly and not always useful (different rounding modes, special cases, subnormals…)
• The problem exists:
  ▪ FPGA tools use almost-compliant formats, but with:
    ▪ Variable sizes, subnormals, special-case flags…
    ▪ Special internal formats (Intel fused FP datapath)
  ▪ Synopsys Flexible Floating-Point format:
    ▪ Two's complement, flags, no normalization, truncation…
• Consequences:
  ▪ Multiple variations of the standard are in use => incompatibility and irreproducibility
  ▪ Less efficient hardware implementations
Should a new extension of the FP standard be defined for embedded applications?
• Multiple choices could be re-studied for these new applications: normalization, rounding, significand representation, special cases, etc.
• Here we focus on normalization (and rounding)
  ▪ How normalization affects accuracy
  ▪ Improvement in implementation results
• Non-Normalized FP format
• Proposed arithmetic circuits
  ▪ Adders
  ▪ Multipliers
• Error measurement in DSP applications
• Implementation results
• Conclusions
• Similar to binary32
• Normalization is not compulsory
• No special cases
• Zero and subnormals are not special cases
• Rounding is simplified by using truncation:
  ▪ Round toward zero
  ▪ Round to nearest by using the HUB approach [1]

[1] J. Hormigo and J. Villalba, "New formats for computing with real numbers under round-to-nearest", IEEE Trans. on Computers, vol. 65, no. 7, pp. 2158–2168, 2016
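The effect of the HUB approach on rounding can be illustrated with a small sketch. It assumes, following [1], that a HUB significand carries an implicit least significant bit equal to 1, so plain truncation lands on the midpoint of the truncation interval. The function names, the test grid, and the choice of f = 4 fraction bits are illustrative assumptions, not part of the proposed circuits.

```python
from fractions import Fraction

def truncate(x: Fraction, f: int) -> Fraction:
    """Round toward zero: keep f fraction bits of a non-negative x."""
    scale = 2 ** f
    return Fraction(int(x * scale), scale)   # int() truncates toward zero

def hub_truncate(x: Fraction, f: int) -> Fraction:
    """HUB-style conversion (assumed model): truncate, then append the implicit LSB = 1."""
    return truncate(x, f) + Fraction(1, 2 ** (f + 1))

def max_error(convert, f: int, steps: int = 1024) -> Fraction:
    """Largest absolute conversion error over a uniform grid of [0, 1)."""
    grid = (Fraction(k, steps) for k in range(steps))
    return max(abs(x - convert(x, f)) for x in grid)

f = 4
err_trunc = max_error(truncate, f)     # approaches one unit in the last place, 2**-f
err_hub = max_error(hub_truncate, f)   # bounded by half a unit, 2**-(f+1)
print(err_trunc, err_hub)
```

In this model the HUB conversion costs the same as truncation (no rounding adder), yet its worst-case error matches that of round-to-nearest.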
• If normalization is not compulsory, the following are lost:
  - The implicit bit => 1 bit of precision
  - Leading zeros => accuracy
  - Comparison operation
  - Reproducibility
  => Approximate computing (HW-accuracy trade-off)
• But the following are improved:
  + Area reduction
  + Power and energy reduction
  + Increase in speed
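Why comparison and precision suffer without compulsory normalization can be seen in a short sketch. It models a number as a hypothetical (significand, exponent) pair with value significand × 2^exponent; this encoding, the 8-bit width, and the helper names are assumptions for illustration, not the bit layout of the proposed format.

```python
from fractions import Fraction
from typing import NamedTuple

class NonNormFP(NamedTuple):
    """Hypothetical non-normalized number: value = sig * 2**exp."""
    sig: int   # significand held in an N-bit field
    exp: int

    def value(self) -> Fraction:
        return Fraction(self.sig) * Fraction(2) ** self.exp

def leading_zeros(sig: int, n: int) -> int:
    """Leading zeros of sig seen as an n-bit field."""
    return n - sig.bit_length()

N = 8
a = NonNormFP(sig=0b10110000, exp=-7)   # no leading zeros
b = NonNormFP(sig=0b00101100, exp=-5)   # same value, two leading zeros

print(a == b)                    # field-by-field comparison sees two different numbers
print(a.value() == b.value())    # comparing values requires aligning first
print(N - leading_zeros(b.sig, N), "significant bits available in b")
```

The same value has several encodings, so a bitwise comparison is no longer enough, and every leading zero removes one bit of usable precision.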
Basic FP adder with no normalization (A1)
• No normalization or rounding logic
• Only significand overflow is normalized
• Gray boxes => HUB version
• Round-to-nearest
FP adder with limited normalization (A2)
• Detection and shifting of up to two leading zeros
• Significand overflow is also normalized
• Gray boxes => HUB version
• Round-to-nearest
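The post-addition step of the two adders can be sketched as follows. This is a behavioral model under stated assumptions: the raw sum is an unsigned integer that may use one bit more than the n-bit significand field, exponent range checks are ignored, and the function name and the `max_shift` parameter are hypothetical. Setting `max_shift=0` mimics A1 (only overflow is normalized); the default of 2 mimics A2.

```python
def limited_normalize(sig: int, exp: int, n: int, max_shift: int = 2) -> tuple[int, int]:
    """Fit a raw adder result into an n-bit significand with limited normalization."""
    if sig >> n:                     # significand overflow: result needs n + 1 bits
        return sig >> 1, exp + 1     # shift right (truncation) and bump the exponent
    if sig == 0:
        return 0, exp                # zero is not a special case; nothing to shift
    lz = n - sig.bit_length()        # leading zeros in the n-bit field
    shift = min(lz, max_shift)       # remove at most max_shift of them
    return sig << shift, exp - shift

N = 8
print(limited_normalize(0b1_0110_0101, 0, N))             # overflow case
print(limited_normalize(0b0001_0110, 0, N))               # three leading zeros, two removed
print(limited_normalize(0b0001_0110, 0, N, max_shift=0))  # A1-like: left untouched
```

Because the shift amount is bounded, the shifter and leading-zero detector stay small, at the price of results that may keep some leading zeros.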
[Figure: two multiplier architectures, WITHOUT SIGNIFICAND OVERFLOW DETECTION (M) and WITH SIGNIFICAND OVERFLOW DETECTION (M2)]
• Leading zero detection at the input
• LZz = LZx + LZy
• Significand overflow is always assumed
• Two versions:
  ▪ Limited (MLx)
  ▪ High radix (MRx)
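The relation LZz = LZx + LZy can be checked numerically with a short sketch. It treats significands as unsigned n-bit integers and the product as a 2n-bit integer, which is an assumed model rather than the proposed datapath; the helper names are hypothetical. Under this model the true count of leading zeros of the product is either the estimate or one more, and the estimate is exact when the product reaches the top bit, which is one reading of "significand overflow is always assumed".

```python
import random

def leading_zeros(sig: int, n: int) -> int:
    """Leading zeros of sig seen as an n-bit field."""
    return n - sig.bit_length()

def product_leading_zeros(x: int, y: int, n: int) -> tuple[int, int]:
    """Return (estimated, actual) leading zeros of the 2n-bit product."""
    estimate = leading_zeros(x, n) + leading_zeros(y, n)   # LZz = LZx + LZy
    actual = leading_zeros(x * y, 2 * n)
    return estimate, actual

N = 8
random.seed(0)
for _ in range(1000):
    x = random.randrange(1, 2 ** N)
    y = random.randrange(1, 2 ** N)
    est, act = product_leading_zeros(x, y, N)
    assert act - est in (0, 1)    # the estimate is off by at most one position
```

Computing the shift amount from the inputs lets the normalization shift be decided in parallel with the multiplication instead of after it.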
• Using non-normalized numbers implies a loss of accuracy
  ▪ Loss of the implicit leading one (example: .0101011 vs. 1.0101011)
  ▪ Unaligned addition (example: 0.0001011 + 1.1101101)
  ▪ Multiplications increase the number of leading zeros (example: 0.0101011 x 0.1100111 => 00.010001011…)
• Experiment with several DSP algorithms
[Figure: experimental setup. Labels: Reference (FP64); Tested (FP32, NoN architectures, e.g. A1MH); ARM A9; FPGA non-normalized unit; Error; SNR]

$SNR_{dB} = 10 \cdot \log_{10}\left(\frac{E_y}{E_{error}}\right)$
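The error metric can be computed with a few lines of code. The sketch below assumes that E denotes energy (sum of squared samples), that the reference output plays the role of y, and that the error is the sample-wise difference between reference and tested outputs; the function name is hypothetical.

```python
import math
from typing import Sequence

def snr_db(reference: Sequence[float], tested: Sequence[float]) -> float:
    """SNR in dB: 10*log10(E_y / E_error), with energies taken as sums of squares."""
    if len(reference) != len(tested):
        raise ValueError("sequences must have the same length")
    e_signal = sum(y * y for y in reference)
    e_error = sum((y - t) ** 2 for y, t in zip(reference, tested))
    if e_error == 0.0:
        return math.inf              # outputs identical: no measurable error
    return 10.0 * math.log10(e_signal / e_error)

reference = [1.0, 2.0, 2.0]          # e.g. FP64 output of a DSP kernel
tested = [1.0, 2.0, 2.3]             # e.g. output of the unit under test
print(snr_db(reference, tested))
```

A higher value means the tested arithmetic tracks the reference more closely; each lost bit of precision costs roughly 6 dB.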
Legend: A1: basic adder; A2: adder with limited normalization; M: multiplier without overflow detection; M2: multiplier with overflow detection; MRx: radix-x normalization; MLx: limited x-bit normalization; H: HUB
IEEE: 146.2

        HUB             no HUB
        A1      A2      A1
M       0       135.5   0
M2      133.8   135.9   124
MR1     133.9   135.5   123
MR4     132.0   135.5   123
MR8     1.3     135.5   1.3
ML4     133.9   135.5   123.4
ML6     133.9   135.5   123.2
• Round-to-nearest is essential
• A2 is the best adder
• A2M2H is the best combination
• Limited normalization in adders gives better accuracy than normalizing multipliers
• Conditions:
  ▪ 32-bit FP architectures
  ▪ Fully combinational architectures
  ▪ Synopsys Design Compiler Ultra H-2013.03-SP2
  ▪ TSMC 65nm library, typical case
  ▪ Area and power measured when targeting the same frequency
[Figure panels: AREA; POWER CONSUMPTION]
• Very significant reduction for all versions (around 40%-75%)
• Higher speed
• The HUB version uses slightly less area and power
• Partial normalization has a significant cost
[Figure panels: AREA; POWER CONSUMPTION]
• Much less reduction than for adders
• The improvement comes from eliminating the rounding logic
• The HUB version uses slightly more area and power
[Figure panels: AREA; POWER CONSUMPTION, with upper and lower limits marked]
• About 25%-50% area and power reduction
• Removing the normalization condition allows a hardware-cost vs. accuracy trade-off
• Different adders and multipliers are proposed for dealing with this trade-off
• Round-to-nearest and a few bits of normalization are enough to limit the accuracy loss
• With a reasonable loss of accuracy (10 dB), area and power can be reduced by up to 50%
• The results obtained encourage us to continue by seeking new non-normalized architectures and testing more applications
• Other characteristics of the FP standard are also questionable in embedded applications
• We aim to open a debate about the need to define a new FP standard extension for new embedded applications
Questions?