SONIA GONZALEZ-NAVARRO AND JAVIER HORMIGO Dept. Computer Architecture Universidad de Málaga (Spain) fjhormigo@uma.es
New embedded applications increasingly demand FP computation
The IEEE-754 FP standard was designed for general-purpose processors (GPP)
Problems of using the FP standard:
  ▪ Lack of flexibility (e.g., word sizes)
  ▪ Compulsory requirements: costly and not always useful (different rounding modes, special cases, subnormals…)
The problem exists:
  ▪ FPGA tools use almost-compliant formats, but with:
    ▪ Variable sizes, subnormals, special-case flags…
    ▪ A special internal format (Intel fused FP datapath)
  ▪ Synopsys Flexible Floating-Point format
    ▪ Two's complement, flags, no normalization, truncation…
Consequences:
  ▪ Multiple variations of the standard are in use => incompatibility and irreproducibility
  ▪ Hardware implementations are less efficient
Should a new extension of the FP standard be defined for embedded applications?
Multiple choices could be re-studied for these new applications: normalization, rounding, significand representation, special cases, etc.
Here we focus on normalization (and rounding):
  ▪ How normalization affects accuracy
  ▪ Improvement in implementation results
Non-Normalized FP format
Proposed arithmetic circuits
  ▪ Adders
  ▪ Multipliers
Error measurement in DSP applications
Implementation results
Conclusions
Similar to binary32
Normalization is not compulsory
No special cases: zero and subnormals are not special cases
Rounding is simplified by using truncation:
  ▪ Round toward zero
  ▪ Round to nearest by using the HUB approach [1]
[1] J. Hormigo and J. Villalba, "New formats for computing with real numbers under round-to-nearest", IEEE Trans. on Computers, vol. 65, no. 7, pp. 2158-2168, 2016
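As a rough illustration of why truncation can deliver round-to-nearest under the HUB approach, the sketch below models a HUB fraction as a stored field plus an implicit least-significant bit equal to one, which is our reading of the format in [1]. The function names, the use of `Fraction`, and the restriction to non-negative values in [0, 1) are assumptions made for this example and do not come from the slides.

```python
from fractions import Fraction

def truncate(x, frac_bits):
    """Conventional truncation: keep frac_bits fractional bits, drop the rest."""
    scale = 2 ** frac_bits
    return Fraction(int(x * scale), scale)

def hub_truncate(x, frac_bits):
    """HUB-style truncation (assumed model): the stored field is the truncated
    value, but it is interpreted with an implicit extra LSB equal to 1."""
    scale = 2 ** frac_bits
    stored = int(x * scale)                # same truncation step as above
    return Fraction(2 * stored + 1, 2 * scale)

x = Fraction(7, 16)                        # 0.0111b
print(abs(x - truncate(x, 2)))             # 3/16
print(abs(x - hub_truncate(x, 2)))         # 1/16
```

In this model the HUB error never exceeds half a unit of the stored LSB, while plain truncation can approach a full unit, and both use the same "drop the low bits" operation.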
If normalization is not compulsory, the following are lost:
  - The implicit bit => 1 bit of precision
  - Leading zeros => accuracy
  - Comparison operation
  - Reproducibility
But the following improve:
  + Area reduction
  + Power and energy reduction
  + Increased speed
Approximate computing (HW-accuracy trade-off)
Basic FP adder with no normalization (A1)
  ▪ No normalization or rounding logic
  ▪ Only significand overflow is normalized
  ▪ Gray boxes => HUB version (round-to-nearest)
FP adder with limited normalization (A2)
  ▪ Detection and shifting of up to two leading zeros
  ▪ Significand overflow is also normalized
  ▪ Gray boxes => HUB version (round-to-nearest)
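The behavior of the two adders can be sketched with a small software model. This is only an assumed behavioral model, not the circuit from the slides: it handles non-negative operands only, uses truncation, and represents a number as a hypothetical pair `(exp, sig)` with value `sig * 2**(exp - p)`. Setting `max_lz_shift=0` mimics the basic adder (A1), and `max_lz_shift=2` mimics the limited normalization of A2.

```python
P = 8  # significand width in bits (hypothetical; chosen small for readability)

def add_nonnorm(a, b, max_lz_shift=0, p=P):
    """Add two non-negative non-normalized values given as (exp, sig)."""
    (ea, sa), (eb, sb) = a, b
    # Align: the operand with the smaller exponent is shifted right (truncating).
    if ea < eb:
        (ea, sa), (eb, sb) = (eb, sb), (ea, sa)
    sb >>= (ea - eb)
    e, s = ea, sa + sb
    # Significand overflow is the only case the basic adder normalizes.
    if s >> p:
        s >>= 1
        e += 1
    # Limited normalization: remove at most max_lz_shift leading zeros.
    shift = 0
    while shift < max_lz_shift and s != 0 and not (s >> (p - 1)) & 1:
        s <<= 1
        e -= 1
        shift += 1
    return e, s

x = (0, 0b00101011)
y = (0, 0b00001011)
print(add_nonnorm(x, y))                  # A1-like: leading zeros are kept
print(add_nonnorm(x, y, max_lz_shift=2))  # A2-like: up to two zeros removed
```

Both calls return the same numerical value here. The difference is where the significant bits sit in the field, which determines how many bits survive the next truncating operation.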
WITHOUT SIGNIFICAND OVERFLOW DETECTION (M)
WITH SIGNIFICAND OVERFLOW DETECTION (M2)
Leading zero detection at the input: LZz = LZx + LZy
Significand overflow is always assumed
Two versions:
  ▪ Limited (MLx)
  ▪ High radix (MRx)
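The estimate LZz = LZx + LZy can be checked with a short exhaustive test. The sketch below treats significands as plain `p`-bit unsigned integers, which is an assumption about the representation, and the helper names are hypothetical. It confirms a general arithmetic fact: the `2p`-bit product has either LZx + LZy or LZx + LZy + 1 leading zeros, so an estimate taken from the inputs is off by at most one position.

```python
def leading_zeros(x, width):
    """Number of leading zero bits of x when written in `width` bits."""
    return width - x.bit_length()

def product_lz_estimate(x, y, p):
    """Leading-zero estimate for x*y computed from the inputs only."""
    return leading_zeros(x, p) + leading_zeros(y, p)

def worst_underestimate(p):
    """Largest gap between actual and estimated leading zeros over all
    nonzero p-bit operand pairs."""
    worst = 0
    for x in range(1, 1 << p):
        for y in range(1, 1 << p):
            actual = leading_zeros(x * y, 2 * p)
            gap = actual - product_lz_estimate(x, y, p)
            assert 0 <= gap <= 1
            worst = max(worst, gap)
    return worst

print(worst_underestimate(6))  # 1
```

Computing the count at the inputs lets the shift amount be known before the product is available, at the price of possibly leaving one leading zero in the result.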
Using non-normalized numbers implies a loss of accuracy
  ▪ Loss of the implicit leading one: .0101011 vs. 1.0101011
  ▪ Unaligned addition: 0.0001011 + 1.1101101
  ▪ Multiplications increase the number of leading zeros: 0.0101011 x 0.1100111 = 00.010001011…
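To give a feel for how leading zeros cost accuracy, the sketch below truncates one value to the bits that remain in a fixed-width field when some leading positions are wasted on zeros. The 8-bit field width, the function name, and the choice of test value are assumptions for illustration only. The model ignores the exponent and looks purely at how many significant bits survive.

```python
from fractions import Fraction

def keep_bits(v, p, leading_zeros):
    """Truncate v in [1, 2) to the p - leading_zeros significant bits that
    remain in a p-bit field when leading_zeros positions hold zeros."""
    significant = p - leading_zeros
    scale = 2 ** (significant - 1)         # v has one integer bit
    return Fraction(int(v * scale), scale)

v = Fraction(0b11101101, 128)              # 1.1101101b
for lz in range(5):
    err = v - keep_bits(v, 8, lz)
    print(lz, float(err / v))              # relative truncation error
```

Each extra leading zero roughly doubles the worst-case relative error, which is why a few bits of normalization recover most of the accuracy.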
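Experiment with several DSP algorithms
[Figure: experimental setup. Labels: reference FP64; tested FP32 and NoN architectures (e.g. A1MH); FPGA with ARM A9 and non-normalized unit; error]
Error measured as SNR: $\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10}\left(E_y / E_{\mathrm{error}}\right)$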
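The SNR measure can be computed in a few lines. The sketch assumes that E_y is the energy (sum of squares) of the reference output and that E_error is the energy of the difference between reference and tested outputs. The slides do not spell out these definitions, so treat them, and the function name, as assumptions.

```python
import math

def snr_db(reference, tested):
    """SNR in dB between a reference signal and a tested signal
    (assumed definition: 10*log10(E_y / E_error))."""
    e_y = sum(r * r for r in reference)
    e_error = sum((r - t) ** 2 for r, t in zip(reference, tested))
    if e_error == 0:
        return math.inf                    # identical outputs
    return 10 * math.log10(e_y / e_error)

reference = [1.0, 0.0]
tested = [0.9, 0.0]
print(round(snr_db(reference, tested), 3))  # 20.0
```

In an experiment like the one described, `reference` would hold the FP64 output and `tested` the output of the architecture under evaluation.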
Legend:
  ▪ Adders: A1: basic; A2: limited normalization
  ▪ Multipliers: M: no overflow detection; M2: overflow detection; MRx: radix-x normalization; MLx: limited x-bit normalization
  ▪ H: HUB
[Figures not recovered; panel labels: A2, A1]
IEEE: 146.2

        HUB      HUB      no HUB
        A1       A2       A1
M       0        135.5    0
M2      133.8    135.9    124
MR1     133.9    135.5    123
MR4     132.0    135.5    123
MR8     1.3      135.5    1.3
ML4     133.9    135.5    123.4
ML6     133.9    135.5    123.2
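The gap between each configuration and the IEEE reference can be read off the table with a small script. It assumes the table entries are SNR values in dB, as suggested by the error measure used in the experiment, and the variable and function names are hypothetical. The numbers themselves are copied from the table.

```python
IEEE_REFERENCE = 146.2

# Columns: (HUB A1, HUB A2, no-HUB A1), one row per multiplier.
RESULTS = {
    "M":   (0.0,   135.5, 0.0),
    "M2":  (133.8, 135.9, 124.0),
    "MR1": (133.9, 135.5, 123.0),
    "MR4": (132.0, 135.5, 123.0),
    "MR8": (1.3,   135.5, 1.3),
    "ML4": (133.9, 135.5, 123.4),
    "ML6": (133.9, 135.5, 123.2),
}
HUB_A1, HUB_A2, NOHUB_A1 = 0, 1, 2

def loss_vs_ieee(multiplier, column):
    """Difference between the IEEE reference and one table entry."""
    return IEEE_REFERENCE - RESULTS[multiplier][column]

for name in RESULTS:
    print(name, [round(loss_vs_ieee(name, c), 1) for c in (HUB_A1, HUB_A2, NOHUB_A1)])
```

Under that assumption, the A2 adder with the M2 multiplier and HUB loses about 10.3 relative to the reference, while the non-HUB A1 columns lose more than 22.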
Round-to-nearest is essential
A2 is the best adder
A2M2H is the best combination
Limited normalization in adders gives better accuracy than normalizing multipliers
Conditions:
  ▪ 32-bit FP architectures
  ▪ Fully combinational architectures
  ▪ Synopsys Design Compiler Ultra H-2013.03-SP2
  ▪ TSMC 65 nm library, typical case
  ▪ Area and power measured when targeting the same frequency
Adders: area and power consumption
  • Very significant reduction for all versions (around 40%-75%)
  • Higher speed
  • The HUB version uses slightly less area and power
  • Partial normalization has a significant cost
Multipliers: area and power consumption
  • Much less reduction than for adders
  • The improvement comes from eliminating the rounding logic
  • The HUB version uses slightly more area and power
Area and power consumption
[Figure not recovered; upper and lower limits are marked]
About 25%-50% area and power reduction
Removing the normalization condition allows a hardware-cost vs. accuracy trade-off
Different adders and multipliers are proposed for dealing with this trade-off
Round-to-nearest and a few bits of normalization are enough to limit the accuracy loss
With a reasonable loss of accuracy (10 dB), area and power could be reduced by up to 50%
The results obtained encourage us to continue by seeking new non-normalized architectures and testing more applications
Other FP standard characteristics are also questionable in embedded applications
We aim to open a debate about the need to define a new FP standard extension for new embedded applications
Questions?