Extracting INT8 Multipliers from INT18 Multipliers
Bogdan Pasca, Martin Langhammer, Gregg Baeckler, Sergey Gribok
Intel Corporation
2 Intel Corporation INTEL PUBLIC September 9, 2019
Context
- Machine learning → increase density of small-precision arithmetic
- INT8 - commonly used for inferencing
- INT8-based block FP can also be used for training
- Logic-based multiplier for Intel FPGAs investigated in [1]
This work
Extracting INT8 multipliers from commonly available INT18 multipliers
[1] High Density and Performance Multiplication for FPGA - Martin Langhammer, Gregg Baeckler - ARITH25 (2018)
General Idea - partial product separation
[Figure: bit-weight diagram of the 18-bit inputs P and Q and the output O = P×Q. P carries the 6-bit fields c5..c0 and b5..b0, Q carries a5..a0, and the output contains the two 12-bit products y11..y0 and z11..z0.]
What happens for inputs beyond 6 bits?
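The question can be explored numerically. The following Python sketch packs two operands into one multiplier input and reads the two product fields back. The helper names and the 12-bit field spacing for the 6-bit case are assumptions made for illustration; the 10-bit spacing for the 8-bit case follows the {o25,...,o10} range used later in the deck.

```python
def packed_product(a, b, c, width, shift):
    """Multiply a by the packed word (b << shift) | c; all operands unsigned."""
    mask = (1 << width) - 1
    assert 0 <= a <= mask and 0 <= b <= mask and 0 <= c <= mask
    return a * ((b << shift) | c)


def split(o, shift):
    """Read the field at bit weight 0 and the field starting at bit weight `shift`."""
    return o & ((1 << shift) - 1), o >> shift


# 6-bit operands, fields 12 bits apart (assumed spacing): the products do not overlap.
a, b, c = 63, 63, 63
low, high = split(packed_product(a, b, c, width=6, shift=12), shift=12)
print(low == a * c, high == a * b)   # True True

# 8-bit operands, fields 10 bits apart: the 16-bit products overlap in 6 bit positions.
a, b, c = 255, 255, 255
low, high = split(packed_product(a, b, c, width=8, shift=10), shift=10)
print(low == a * c, high == a * b)   # False False
```

With 8-bit operands the packed word still fits in 18 bits, but the upper bits of one product are added into the lower bits of the other, so neither field can be read directly.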
Unsigned Int8, shared input
- compute Y = A·C and Z = A·B using an 18x18 multiplier
- A, B and C 8-bit unsigned numbers
- the 18x18 multiplier is configured as an unsigned multiplier
- map A, B and C to the Int18 inputs:
P = {b7,...,b0, 0, 0, c7,...,c0} (B at bit weights 17..10, C at bit weights 7..0)
Q = {a7,...,a0} (bit weights 7..0)
O = P×Q = {o25,...,o0}
Y = {y15,...,y0} is aligned at bit weight 0 and Z = {z15,...,z0} at bit weight 10, so {y9,...,y0} = {o9,...,o0} and the two products overlap in bit weights 15..10.
How to obtain the rest of the bits of Y and Z?
- Observe:
{o25,...,o10} = {y15,...,y10}+{z15,...,z0} = {z15,...,z6,y15,...,y10}+{z5,...,z0}
- Therefore:
{z15,...,z6,y15,...,y10} = {o25,...,o10} − {z5,...,z0}
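The subtraction above can be checked with a short Python model. This is a minimal sketch, not the hardware implementation: the function name `extract_unsigned` is hypothetical, and the packing P = (B << 10) | C is inferred from the {o25,...,o10} alignment in the observation.

```python
def extract_unsigned(a, b, c):
    """Recover Y = a*c and Z = a*b from one wide unsigned multiplication."""
    for v in (a, b, c):
        assert 0 <= v <= 0xFF
    p = (b << 10) | c                         # B at bit weight 10, C at bit weight 0
    o = a * p                                 # 26-bit product {o25,...,o0}
    y_lo = o & 0x3FF                          # {y9,...,y0} read directly
    o_hi = o >> 10                            # {o25,...,o10}
    z_lo = ((a & 0x3F) * (b & 0x3F)) & 0x3F   # {z5,...,z0} from a 6x6 LSB multiplier
    t = o_hi - z_lo                           # {z15,...,z6,y15,...,y10}
    y = ((t & 0x3F) << 10) | y_lo
    z = ((t >> 6) << 6) | z_lo
    return y, z


print(extract_unsigned(200, 100, 50))   # (10000, 20000)
```

The subtraction never borrows out of the 16-bit field, because {o25,...,o10} equals Z plus a 6-bit value and {z5,...,z0} is at most the low part of Z.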
Unsigned Int8, shared input - architecture
[Figure: architecture. The 18x18 multiplier takes P and Q and produces {o25,...,o10} and {y9,...,y0}. A 6x6 LSB multiplier takes {a5,...,a0} and {b5,...,b0} and produces {z5,...,z0}. A subtractor computes {z15,...,z6,y15,...,y10} from {o25,...,o10} and {z5,...,z0}.]
- {z5,...,z0} = ({a5,...,a0}·{b5,...,b0})[5:0]
- Z5:0 is obtained using a truncated (LSB) multiplier
- the technique also extends to other multiplier sizes
- the wider the overlap between Y and Z, the larger the area
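To illustrate why a truncated multiplier is enough for the low bits, the sketch below builds the six LSBs of a product from only those partial-product bits whose weight is below 2^6. The function name `lsb_mult6` is hypothetical, and this bit-level loop is a behavioural model rather than the ALM mapping used in the work.

```python
def lsb_mult6(a, b):
    """Low 6 bits of a*b, using only partial products a_i*b_j with i + j < 6."""
    acc = 0
    for i in range(6):
        for j in range(6 - i):                   # 6+5+4+3+2+1 = 21 partial products
            bit = ((a >> i) & 1) & ((b >> j) & 1)
            acc += bit << (i + j)
    return acc & 0x3F                            # discard carries above bit 5


print(lsb_mult6(200, 100) == (200 * 100) & 0x3F)   # True
```

Only 21 of the 36 partial-product bits of a full 6x6 multiplier contribute, since every term with i + j >= 6 is a multiple of 64.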
Signed Int8, shared input
- compute Y = A·C and Z = A·B with A, B and C 8-bit signed numbers
- 18x18 multiplier is a signed multiplier with pre-adder
- map A, B and C to the multiplier inputs:
[Figure: bit-weight diagrams for two input configurations. Configuration 1 uses the operation (P+Q)R and Configuration 2 uses the operation (P−Q)R. In both, the fields b7..b0 and c7..c0 (with c7 replicated as sign extension) are distributed over P and Q, and R carries a7..a0 sign-extended with a7. In the output {o25,...,o0}, Y = {y15,...,y0} appears from bit weight 0 and is sign-extended with y15, and Z = {z15,...,z0} is aligned at bit weight 10.]
How to obtain the rest of the bits of Y and Z?
There are two possible output subtract/add types:
- Type 1: Subtract
{z15,...,z6,y15,...,y10} = {o25,...,o10} − {10'y15,z5,...,z0}
[Figure: chain of 16 full adders (FA) over bit weights 10 to 25. The six low positions take o10..o15 and z0..z5 and produce y10..y15; the ten upper positions take o16..o25 and y15 and produce z6..z15.]
- Type 2: Add
{cOut,y15,...,y10} = {0,o15,...,o10} + {0,z5,...,z0}
{z15,...,z6} = {o25,...,o16} + {y15,...,y15} + cOut
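The Type 1 subtraction can be modelled in Python as follows. This is a sketch under stated assumptions: the pre-adder and multiplier are taken to compute O = A·(B·2^10 + C) in two's complement, and the names `to_signed` and `extract_signed_type1` are hypothetical. The Type 2 variant is not modelled here.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def extract_signed_type1(a, b, c):
    """Recover Y = a*c and Z = a*b (signed 8-bit inputs) with the Type 1 subtract."""
    for v in (a, b, c):
        assert -128 <= v <= 127
    o = (a * ((b << 10) + c)) & 0x3FFFFFF      # assumed packing, 26-bit pattern
    y_lo = o & 0x3FF                           # {y9,...,y0}
    o_hi = (o >> 10) & 0xFFFF                  # {o25,...,o10}
    z_lo = ((a & 0x3F) * (b & 0x3F)) & 0x3F    # {z5,...,z0}
    # y15 is produced by the six low positions of the subtractor chain
    y15 = (((o_hi & 0x3F) - z_lo) >> 5) & 1
    sub = ((0x3FF * y15) << 6) | z_lo          # {10'y15, z5,...,z0}
    t = (o_hi - sub) & 0xFFFF                  # {z15,...,z6,y15,...,y10}
    y = ((t & 0x3F) << 10) | y_lo
    z = (t & 0xFFC0) | z_lo
    return to_signed(y, 16), to_signed(z, 16)


print(extract_signed_type1(-128, -128, 127))   # (-16256, 16384)
```

The ten copies of y15 cancel the sign extension of Y that the signed multiplier adds into the bit positions occupied by Z.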
Signed Int8, shared input - architectures
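[Figure: Type 1 architecture. A half DSP (add/sub pre-adder on P and Q followed by an 18x18 multiplier with R) outputs {y9,...,y0} and {o25,...,o10}. A 6x6 LSB multiplier on {a5,...,a0} and {b5,...,b0} produces {z5,...,z0}. A subtractor takes {o25,...,o10}, {z5,...,z0} and 10 copies of {y15} and produces {z15,...,z6,y15,...,y10}.]
[Figure: Type 2 architecture. The same half DSP and 6x6 LSB multiplier are used. One adder combines {o15,...,o10} with {z5,...,z0} to produce {y15,...,y10}; a second adder combines {o25,...,o11} with {!y15,...,!y15} to produce {z15,...,z6}.]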
Resource Utilization
[Figure: the three architectures side by side: (A) unsigned with subtractor, (B) signed Type 1 with subtractor, (C) signed Type 2 with two adders.]
Ours:
Case  Type        18x18 (two INT8)  DSP (four INT8)  ALMs/INT8
A     Standalone  16 ALMs           32 ALMs          8
B     Standalone  16 ALMs           32 ALMs          8
C     Standalone  17 ALMs           34 ALMs          8.5
- 16-bit adder/subtracter requires 8 ALMs
- 6-bit LSB multiplier requires 8 ALMs
What about the area in a dot-product unit?
Resource Utilization - dot-product
Reduce carry-propagation cost: accumulate Z in 2 components
[Figure: four 18x18 multipliers, each with inputs P = {bi, ci} and Q = ai for i = 1..4. Each output is split into the fields [9:0], [15:10] and [25:16]; per multiplier the figure labels Yi, Ei, a borrow (brw) and an uncorrected Z̃i. Adder trees combine these terms into a1b1+a2b2+a3b3+a4b4 and a1c1+a2c2+a3c3+a4c4.]
Pay the cost of fixing Z once
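One possible reading of this scheme, for the unsigned case, is sketched below in Python. It is an interpretation rather than a transcription of the figure: the exact partition of terms in the slide is not recoverable from the text, and the function name `dot_pair` and the accumulator names are hypothetical. Each multiplier needs only a 6-bit subtraction with borrow; the high and low parts of Z are summed separately and joined once at the end.

```python
def dot_pair(a_vec, b_vec, c_vec):
    """Return (sum a_i*c_i, sum a_i*b_i) for unsigned 8-bit vectors."""
    sum_y = 0       # accumulates complete Y_i values
    sum_z_hi = 0    # accumulates o[25:16] minus the borrow bit
    sum_z_lo = 0    # accumulates the 6-bit LSB products z[5:0]
    for a, b, c in zip(a_vec, b_vec, c_vec):
        o = a * ((b << 10) | c)
        z_lo = ((a & 0x3F) * (b & 0x3F)) & 0x3F
        mid = (o >> 10) & 0x3F                 # o[15:10]
        brw = 1 if mid < z_lo else 0           # borrow of the 6-bit subtraction
        y_hi = (mid - z_lo) & 0x3F             # y[15:10]
        sum_y += (y_hi << 10) | (o & 0x3FF)
        sum_z_hi += (o >> 16) - brw
        sum_z_lo += z_lo
    # Z is fixed once: join the two accumulated components
    return sum_y, (sum_z_hi << 6) + sum_z_lo


a = [200, 17, 255, 3]
b = [100, 250, 255, 9]
c = [50, 128, 255, 77]
print(dot_pair(a, b, c) == (sum(x * y for x, y in zip(a, c)),
                            sum(x * y for x, y in zip(a, b))))   # True
```

The per-multiplier 16-bit subtractor is replaced by a 6-bit one, and the wide carry propagation happens only in the final combination.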
Scaling at system-level
Push-button approach:
- 500 DOT32 cores into the Stratix 10 1SG280LN2F43E1VG
- Quartus 18.1 with Fractal Synthesis
- clock frequency: 457.9 MHz
- 4000 of the 5760 available DSP Blocks (70%): 16000 INT8 multipliers
- 300K ALMs (w.o. virtual pins) or 32% of the available logic
Scaling at system-level
Push-button approach:
- 700 DOT32 cores into the Stratix 10 1SG280LN2F43E1VG
- Quartus 18.1 with Fractal Synthesis
- clock frequency: 416 MHz
- 5600 of the 5760 available DSP Blocks (97%): 22400 INT8 multipliers
- 452K ALMs (less than half of the available logic)
Conclusions
1. extract INT8 multipliers from INT18 multipliers using minimal logic
2. techniques presented for both signed and unsigned multipliers
3. technique extensible to other multiplier sizes
4. further area savings in dot-product units
5. high system-level performance → 700 DOT32 in S10