www.kalrayinc.com
ARITH24, Nicolas Brunie (nbrunie@kalray.eu)
Modified FMA for exact accumulation of low precision products
July 25th, 2017
– Introduced in IEEE754-2008
– As a storage format, not intended for computation
– But more and more used in computation

– Optimize accuracy
– Optimize speed (latency and throughput)
– Suggest a generic processor operator

– Is there a way to exploit smaller precisions?
– Is there a way to easily extend FMA precision?
– How to implement low-latency accumulation?
Goal: assuming xi, yj binary16 and S binary32 or larger, optimize
S = [x0, x1, x2, x3, …] · [y0, y1, y2, y3, …] = x0·y0 + x1·y1 + x2·y2 + x3·y3 + …
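To make the stake concrete, a small Python sketch (mine, not from the slides) contrasting a fully rounded binary16 accumulation with an exact rational accumulation of the same dot product; the array size and RNG seed are arbitrary:

```python
# Sketch: accuracy gap between accumulating binary16 products in
# binary16 and accumulating them exactly (rationals, one final rounding).
from fractions import Fraction
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000).astype(np.float16)
y = rng.standard_normal(1000).astype(np.float16)

# Naive: every product and every partial sum rounded to binary16.
s16 = np.float16(0)
for a, b in zip(x, y):
    s16 = np.float16(s16 + np.float16(a * b))

# Exact: rational arithmetic, no rounding until the final conversion.
exact = sum(Fraction(float(a)) * Fraction(float(b)) for a, b in zip(x, y))

print(float(s16), float(exact))
```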
– Focusing on binary32 and binary64
– Architecture optimized for latency
– Several cycles for dependent accumulation
– A few works on throughput optimization [2]

CPU         | ARM A72 | AMD Bulldozer | Intel Skylake
FMA latency | 6/3     | 5             | 4

[2] David Lutz, 2011
binary16 . binary16 + binary32 → binary32
– Saving conversion instructions
– IEEE754-compliant (formatOf)
– Compromise between a large and a small FMA
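A property worth noting (easy to check in software): the product of two binary16 values always fits exactly in binary32 — 2×11 = 22 significand bits ≤ 24, and the product exponent range stays inside binary32's — so this mixed FMA rounds only once, at the addition. A hedged sketch:

```python
# Sketch: a binary16 x binary16 product is always exact in binary32,
# so a binary16.binary16 + binary32 FMA incurs a single rounding.
from fractions import Fraction
import numpy as np

rng = np.random.default_rng(1)
mismatches = 0
for _ in range(10000):
    a = np.float16(rng.standard_normal())
    b = np.float16(rng.standard_normal())
    prod32 = np.float32(a) * np.float32(b)  # binary32 multiply
    if Fraction(float(prod32)) != Fraction(float(a)) * Fraction(float(b)):
        mismatches += 1
print(mismatches)
```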
– Cancellation requirements
– Datapath design
– X with precision p and anchor at Px
– Y with precision q and anchor at Py
– Arbitrary number of leading zeros
– Output precision o (normalized)
– Assuming single path
– Assuming up to LX leading zero(s) in X
– Assuming up to LY leading zero(s) in Y
– Determines the Leading Zero Count range
– Determines the close path topology

−(LY + 1) ≤ δ = eX − eY ≤ LX + 1
max(LX + 1 + q, LY + 1 + p)
– Determines the datapath width
– Exhibits the effect of non-normalization
– Two sub-cases to be considered

max(o+LX, p) + max(o+LY, q) + 4 + min(p, q)
max(o+LX, p) + max(o+LY, q) + 5
– Evaluate datapath size
– Evaluate feasibility
– Better accuracy than FMA
– Comparable latency

Operator               | Datapath width
FMA16 (p=o=q=11)       | 49
FMA32 (p=o=q=24)       | 101
MPFMA16 (p=11, q=o=24) | 99
Operator        | Cell Area (μm²) | Acc. latency
MPFMA fp16/fp32 | 2690            | 3
FMA fp16        | 1840            | 3
FMA fp32        | 4790            | 3
– 554 bits for binary32
– 4196 bits for binary64
– Full integration in the Arithmetic Unit
– But quite a large memory footprint
– Not scalable (e.g. vectorization)
– Requires heavy CPU architectural modification
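The 554- and 4196-bit figures quoted above follow from counting the bit positions a product can occupy; a small sketch (my reconstruction of the count, not code from the talk):

```python
# Sketch: fixed-point width needed to hold any product of two floats
# exactly (Kulisch-style accumulator, before extra carry bits).
def kulisch_width(p, emin, emax):
    # Product weights run from 2^(2*(emin - p + 1)) (two minimal
    # subnormals) up to just below 2^(2*emax + 2): count the positions.
    return (2 * emax + 2) - (2 * (emin - p + 1))

print(kulisch_width(24, -126, 127))    # binary32 accumulator width
print(kulisch_width(53, -1022, 1023))  # binary64 accumulator width
```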
format   | p  | exp range
binary16 | 11 | [-14, 15]
binary32 | 24 | [-126, 127]
binary64 | 53 | [-1022, 1023]

– Has a very limited exponent range
– Only 80 bits required to store the full product dynamic range
– Makes it suitable for an in-register implementation of Kulisch's [1] accumulator

[1] Numbers and Products in Fixed-Point Arithmetic, Ulrich Kulisch, 1997
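The 80-bit figure can be checked by counting bit weights: the smallest nonzero binary16 product is (2^-24)² = 2^-48 (two minimal subnormals), and the largest, 65504², sits just below 2^32. A quick sketch:

```python
# Sketch: the full dynamic range of binary16 products fits in 80 bits.
lsb_weight = -48                            # (2^-24)^2, smallest product
msb_weight = (65504 ** 2).bit_length() - 1  # largest product's top bit
width = msb_weight - lsb_weight + 1
print(msb_weight, width)
```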
– Using an (80+ε)-bit register for accumulation
– Fixed-point aligned to (31+ε)
– Exact (no rounding)
– Sign-magnitude accumulator
– Balancing accumulator / product paths
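A software model of such an exact accumulator (my sketch: the 48 fractional bits mirror the 80 = 32 + 48 split implied by the (31+ε) alignment, and FRAC_BITS is my name):

```python
# Sketch: an exact fixed-point accumulator for binary16 products,
# modeled as a Python int whose unit is 2^-48. Every binary16 value is
# a multiple of 2^-24, so every product lands on this grid exactly.
from fractions import Fraction
import numpy as np

FRAC_BITS = 48  # assumed alignment, chosen to cover (2^-24)^2

def accumulate(pairs):
    acc = 0  # fixed-point accumulator, unit 2^-48
    for a, b in pairs:
        p = Fraction(float(np.float16(a))) * Fraction(float(np.float16(b)))
        step = p * 2 ** FRAC_BITS
        assert step.denominator == 1  # exact on the 2^-48 grid
        acc += int(step)
    return acc

acc = accumulate([(0.1, 0.1)] * 100)
print(acc * 2.0 ** -FRAC_BITS)  # ~1.0, reflecting fp16 rounding of 0.1
```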
– 2's complement accumulator
– Fast accumulator path
– Pushed back to the final conversion
– Product less sensitive to delay
– Bypass easy to fit in 1 cycle
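The latency argument can be illustrated in miniature: with 2's complement, every accumulation step is one plain signed add, and sign/magnitude is recovered only once, at the final conversion (toy sketch with arbitrary stand-in terms):

```python
# Sketch: why a 2's-complement accumulator shortens the loop-carried
# path. Each step is a single signed add, with no per-step sign logic;
# sign and magnitude are recovered once, at conversion time.
terms = [5, -12, 7, -1]  # stand-ins for scaled fixed-point products

acc = 0
for t in terms:
    acc += t             # one add per step, regardless of signs

# sign/magnitude recovered only at the final conversion
sign, mag = (1, -acc) if acc < 0 else (0, acc)
print(sign, mag)
```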
– Introduced at ARITH22
– Generates C, OpenCL-C
– Several backends (generic x86 SSE, AVX, Kalray's K1)
– Description extension
– IR extension
– New VHDL backend
– From a parametric description
– With associated testbench
Operator                     | Cell Area (μm²) | Acc. latency
FMA fp16                     | 1840            | 3
FMA fp32                     | 4790            | 3
MPFMA fp16/fp32              | 2690            | 3
Fixed MPFMA (sign-magnitude) | 2195            | 1
Fixed MPFMA (2's complement) | 1950            | 1

Compared to FMA:
– Larger shifter and adder
– Fixed MPFMA is exact
– MPFMA applied to binary16
– Fixed-point MPFMA
– Gets rid of a troubling architectural state
– Fast conversion to binary32
– Useful for larger-dimensional dot products
– Push forward the 3-operand ADD

– Around 80-bit Leading Zero Count
– Around 80-bit Shifter
– 24-bit Incrementer for rounding
– exp > 14 implies overflow
– exp < −24 implies dump into sticky
– Straightforward subnormal output (when detected)
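One possible software model of this final conversion chain (leading-zero count, shift to a 24-bit significand, round-to-nearest-even increment); binary32 overflow and subnormal clamping are omitted for brevity, and the function name is mine:

```python
# Sketch: converting a fixed-point accumulator (unit 2^-48) to a
# binary32-like value: find the leading bit, shift down to 24
# significand bits, and apply a round-to-nearest-even increment.
def fixed_to_binary32(acc):
    if acc == 0:
        return 0.0
    sign = acc < 0
    mag = abs(acc)
    exp = mag.bit_length() - 1 - 48   # exponent of the leading bit
    shift = mag.bit_length() - 24     # keep a 24-bit significand
    if shift > 0:
        frac = mag >> shift
        rem = mag & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if rem > half or (rem == half and frac & 1):
            frac += 1                 # round to nearest, ties to even
            if frac == 1 << 24:       # rounding overflowed the significand
                frac >>= 1
                exp += 1
    else:
        frac = mag << -shift
    value = frac * 2.0 ** (exp - 23)
    return -value if sign else value

print(fixed_to_binary32(3 << 47))  # 1.5
```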
Extended bibliography
[1] …Point Numbers and Products in Fixed-Point Arithmetic, Ulrich Kulisch, 1997
[2] …Pipelines, David Lutz, 2011 (ARITH20)
– product subnormal no longer holds