slide-1
SLIDE 1

Extracting INT8 Multipliers from INT18 Multipliers

Bogdan Pasca, Martin Langhammer, Gregg Baeckler, Sergey Gribok

Intel Corporation

Intel Corporation, INTEL PUBLIC, September 9, 2019

slide-4
SLIDE 4

Context

  • Machine learning → increase density of small-precision arithmetic
  • INT8 - commonly used for inferencing
  • INT8-based block FP can also be used for training
  • Logic-based multipliers for Intel FPGAs were investigated in [1]

This work

Extracting INT8 multipliers from commonly available INT18 multipliers

[1] Martin Langhammer, Gregg Baeckler, "High Density and Performance Multiplication for FPGA", ARITH25 (2018)

slide-6
SLIDE 6

General Idea - partial product separation

[Figure: the 6-bit inputs A, B and C are packed into the 18-bit multiplier inputs P and Q so that the two 12-bit products Y = {y11,...,y0} and Z = {z11,...,z0} land in separate bit fields of the output O = P×Q = {o25,...,o0}.]

What happens for inputs beyond 6 bits?
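
The 6-bit case can be sketched in a few lines of Python. This is only an illustration: the function name and the exact field placement (C in the upper field, B in the lower field, 12 bits apart) are assumptions, since the slide figure does not survive in this transcript. Two 12-bit products fit side by side without touching, which is exactly what stops working once the inputs grow beyond 6 bits.

```python
def pack_and_multiply_6bit(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (a*c, a*b) for 6-bit unsigned inputs using one wide multiplication.

    Hypothetical layout: c sits 12 bits above b in the packed operand.
    """
    assert all(0 <= v < 64 for v in (a, b, c))
    p = (c << 12) | b        # packed 18-bit operand
    o = p * a                # single product: a*c * 2**12 + a*b
    z = o & 0xFFF            # a*b <= 63*63 < 2**12, so it stays in the low field
    y = (o >> 12) & 0xFFF    # a*c occupies the next 12 bits
    return y, z


if __name__ == "__main__":
    print(pack_and_multiply_6bit(63, 63, 63))
```

With 8-bit inputs each product needs 16 bits, so two of them no longer fit in disjoint fields of a 26-bit output.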


slide-12
SLIDE 12

Unsigned Int8, shared input

  • compute Y = A·C and Z = A·B using an 18x18 multiplier
  • A, B and C 8-bit unsigned numbers
  • the 18x18 multiplier is configured as an unsigned multiplier
  • map A, B and C to the Int18 inputs:

[Figure: input mapping and output alignment. One multiplier input holds {b7,...,b0,0,0,c7,...,c0}; the other holds A = {a7,...,a0} zero-extended to 18 bits. In the 26-bit output O = P×Q = {o25,...,o0}, Y = {y15,...,y0} is aligned to {o15,...,o0} and Z = {z15,...,z0} to {o25,...,o10}. The two products overlap in {o15,...,o10}, so only {y9,...,y0} = {o9,...,o0} can be read directly.]

How to obtain the rest of the bits of Y and Z?


slide-13
SLIDE 13

Unsigned Int8, shared input


  • Observe:

{o25,...,o10} = {y15,...,y10} + {z15,...,z0} = {z15,...,z6,y15,...,y10} + {z5,...,z0}

  • Therefore:

{z15,...,z6,y15,...,y10} = {o25,...,o10} − {z5,...,z0}
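
The identity above can be checked with a small Python model. This is a behavioural sketch, not the hardware description from the talk: the function name is made up, Python integers stand in for the 18x18 multiplier, and the low six bits of Z are computed from the low six bits of A and B as a stand-in for the small LSB multiplier.

```python
def int8x2_unsigned(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (Y, Z) = (a*c, a*b) for 8-bit unsigned a, b, c."""
    assert all(0 <= v < 256 for v in (a, b, c))
    p = (b << 10) | c                          # {b7..b0, 0, 0, c7..c0}
    o = p * a                                  # O = Y + Z * 2**10, 26 bits
    y_low = o & 0x3FF                          # {y9..y0} = {o9..o0}
    z_low = ((a & 0x3F) * (b & 0x3F)) & 0x3F   # {z5..z0} from a 6x6 LSB multiply
    merged = (o >> 10) - z_low                 # {z15..z6, y15..y10}
    y = ((merged & 0x3F) << 10) | y_low        # low 6 bits of merged are y15..y10
    z = ((merged >> 6) << 6) | z_low           # upper 10 bits of merged are z15..z6
    return y, z


if __name__ == "__main__":
    print(int8x2_unsigned(255, 255, 255))
```

The subtraction never underflows, because {o25,...,o10} equals Z plus the top six bits of Y and z_low is at most Z.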


slide-15
SLIDE 15

Unsigned Int8, shared input - architecture

[Figure: architecture. The 18x18 multiplier takes P and Q, outputs {y9,...,y0} directly, and feeds {o25,...,o10} to a subtractor. A 6x6 LSB multiplier on {a5,...,a0} and {b5,...,b0} produces {z5,...,z0}. The subtractor outputs {z15,...,z6,y15,...,y10}.]

  • {z5,...,z0} = ({a5,...,a0}·{b5,...,b0})[5:0]
  • Z[5:0] is obtained using a truncated (LSB) multiplier
  • technique also extends to other multiplier sizes
  • the wider the overlap between Y and Z, the larger the area
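
The reason a small LSB multiplier is enough is that the low six bits of a product depend only on the low six bits of each operand, and only on the partial-product bits that land in columns 5..0. The Python sketch below illustrates that property; the function name and the row-by-row loop are illustrative assumptions and say nothing about how the ALM-level implementation is actually built.

```python
def lsb_mult_6x6(a: int, b: int) -> int:
    """Low 6 bits of a*b, touching only partial-product bits in columns 5..0."""
    acc = 0
    for i in range(6):                        # rows above bit 5 of b cannot reach columns 5..0
        if (b >> i) & 1:
            kept = a & ((1 << (6 - i)) - 1)   # only a[5-i:0] stays below column 6 after the shift
            acc += kept << i
    return acc & 0x3F


if __name__ == "__main__":
    print(lsb_mult_6x6(0xAB, 0xCD), (0xAB * 0xCD) & 0x3F)
```

Row i keeps 6 - i bits, so only 6 + 5 + 4 + 3 + 2 + 1 = 21 partial-product bits are needed instead of the 36 of a full 6x6 multiplier.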


slide-21
SLIDE 21

Signed Int8, shared input

  • compute Y = A·C and Z = A·B with A, B and C 8-bit signed numbers
  • 18x18 multiplier is a signed multiplier with pre-adder
  • map A, B and C to the multiplier inputs:

[Figure: two input mappings for the signed multiplier with pre-adder. Configuration 1 uses the operation (P+Q)·R and Configuration 2 uses (P−Q)·R. In both, B occupies the upper 8 bits of the pre-adder result, C is sign-extended with copies of c7 across the pre-adder inputs, and R holds A sign-extended to 18 bits with copies of a7. The output O = {o25,...,o0} is the sum of Y sign-extended to 26 bits ({y15,...,y15,y14,...,y0}) and Z = {z15,...,z0} aligned to {o25,...,o10}.]

How to obtain the rest of the bits of Y and Z?

slide-23
SLIDE 23

Signed Int8, shared input


There are two possible output subtract/add types:

  • Type 1: Subtract

{z15,...,z6,y15,...,y10} = {o25,...,o10} − {10′y15,z5,...,z0}

Here 10′y15 denotes y15 replicated ten times, which removes the sign extension of Y from the upper bits.

[Figure: chain of 16 full adders (FA) over {o25,...,o10}. The six low positions combine {o15,...,o10} with {z5,...,z0} and produce {y15,...,y10}; the ten upper positions combine {o25,...,o16} with y15 and produce {z15,...,z6}.]

  • Type 2: Add

{cOut,y15,...,y10} = {0,o15,...,o10} + {0,z5,...,z0}
{z15,...,z6} = {o25,...,o16} + {y15,...,y15} + cOut
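
The Type 1 subtraction can be modelled in Python as follows. This is a behavioural sketch under stated assumptions: the function names are invented, the pre-adder is modelled simply as forming B·2^10 + C, and y15 is obtained from the low six result bits first, mimicking the order in which a ripple-carry chain would resolve it. Only Type 1 is modelled here.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def int8x2_signed_type1(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (Y, Z) = (a*c, a*b) for 8-bit signed a, b, c (Type 1: subtract)."""
    assert all(-128 <= v < 128 for v in (a, b, c))
    o = (a * ((b << 10) + c)) & ((1 << 26) - 1)   # {o25..o0} = sext(Y) + Z * 2**10
    y_low = o & 0x3FF                             # {y9..y0}
    z_low = ((a & 0x3F) * (b & 0x3F)) & 0x3F      # {z5..z0} from the 6x6 LSB multiplier
    o_hi = o >> 10                                # {o25..o10}
    y15 = ((o_hi - z_low) >> 5) & 1               # bit 5 of the low six result bits
    sub = (0xFFC0 if y15 else 0) | z_low          # {10'y15, z5..z0}
    merged = (o_hi - sub) & 0xFFFF                # {z15..z6, y15..y10}
    y = to_signed(((merged & 0x3F) << 10) | y_low, 16)
    z = to_signed(((merged >> 6) << 6) | z_low, 16)
    return y, z


if __name__ == "__main__":
    print(int8x2_signed_type1(-128, -128, 127))
```

Subtracting the ten copies of y15 is equivalent to adding y15·2^6, which cancels the sign extension that Y contributes to {o25,...,o16}.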


slide-24
SLIDE 24

Signed Int8, shared input - architectures

[Figure: two architectures, each built from a half DSP (add/sub pre-adder on P and Q followed by the 18x18 multiplier with R) and a 6x6 LSB multiplier on {a5,...,a0} and {b5,...,b0} that produces {z5,...,z0}. Both output {y9,...,y0} directly. Type 1: a single subtractor combines {o25,...,o10} with {z5,...,z0} and ten copies of y15 to produce {z15,...,z6,y15,...,y10}. Type 2: one adder produces {y15,...,y10} from {o15,...,o10} and {z5,...,z0}, and a second adder produces {z15,...,z6} from the upper output bits and {!y15,...,!y15}.]

slide-26
SLIDE 26

Resource Utilization

[Figure: the three architectures compared: (A) unsigned with subtractor, (B) signed Type 1 with subtractor, (C) signed Type 2 with two adders.]

Ours:

Case  Type        18x18 (two INT8)  DSP (four INT8)  ALMs/INT8
A     Standalone  16 ALMs           32 ALMs          8
B     Standalone  16 ALMs           32 ALMs          8
C     Standalone  17 ALMs           34 ALMs          8.5

  • 16-bit adder/subtracter requires 8 ALMs
  • 6-bit LSB multiplier requires 8 ALMs
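
The per-INT8 figures follow from simple arithmetic, sketched below in Python. The constant and function names are illustrative. The assumption that cases A and B consist of exactly one 16-bit adder/subtracter plus one LSB multiplier is read off the architecture figures; the 17-ALM total for case C is taken from the table as given and is not derived here.

```python
ADDER_16BIT_ALMS = 8      # 16-bit adder/subtracter, from the slide
LSB_MULT_6BIT_ALMS = 8    # 6-bit LSB multiplier, from the slide


def alms_per_int8(alms_per_18x18: float, int8_per_18x18: int = 2) -> float:
    """Soft-logic cost per extracted INT8 multiplier."""
    return alms_per_18x18 / int8_per_18x18


if __name__ == "__main__":
    case_ab = ADDER_16BIT_ALMS + LSB_MULT_6BIT_ALMS   # assumed composition of cases A and B
    print(case_ab, alms_per_int8(case_ab))            # per 18x18 and per INT8
    print(alms_per_int8(17))                          # case C, total taken from the table
```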

What about the area in a dot-product unit?


slide-31
SLIDE 31

Resource Utilization - dot-product

Reduce carry-propagation cost: accumulate Z in 2 components

[Figure: four-term dot-product unit. Each 18x18 multiplier i takes (ai, bi, ci) and its output is split into the fields [9:0], [15:10] and [25:16]. The per-multiplier values Yi, the 6-bit corrections Ei with their borrows (brw), and the uncorrected upper parts Z̃i are accumulated in separate adder trees. A single final addition combines the Z̃ and E sums, producing a1b1+a2b2+a3b3+a4b4 alongside a1c1+a2c2+a3c3+a4c4.]

Pay the cost of fixing Z once
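
One way to read this idea, for the unsigned mapping, is sketched below in Python. It is an interpretation rather than a transcription of the figure: the names `zt` (for Z̃), `e` and `brw` follow the figure labels, but the exact datapath, bit widths and the signed variant are not modelled. Each multiplier only performs a 6-bit subtraction; the wide correction of Z happens once on the accumulated sums.

```python
def dot_pair_unsigned(a_vec, b_vec, c_vec):
    """Return (sum(a*c), sum(a*b)) with the Z correction deferred to the end."""
    y_sum = zt_sum = e_sum = brw_sum = 0
    for a, b, c in zip(a_vec, b_vec, c_vec):
        assert all(0 <= v < 256 for v in (a, b, c))
        o = a * ((b << 10) | c)                   # packed 18x18 product
        e = ((a & 0x3F) * (b & 0x3F)) & 0x3F      # E_i = {z5..z0} from the LSB multiplier
        d = ((o >> 10) & 0x3F) - e                # 6-bit subtraction on o[15:10]
        brw_sum += 1 if d < 0 else 0              # borrow out of the 6-bit field
        y_sum += ((d & 0x3F) << 10) | (o & 0x3FF) # Y_i is exact after the short subtract
        zt_sum += o >> 16                         # uncorrected upper part o[25:16]
        e_sum += e
    z_sum = ((zt_sum - brw_sum) << 6) + e_sum     # fix Z once, after accumulation
    return y_sum, z_sum


if __name__ == "__main__":
    print(dot_pair_unsigned([1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]))
```

Per term, Z = (o[25:16] − brw)·2^6 + E, so the three sums can be accumulated independently and combined at the end.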


slide-32
SLIDE 32

Scaling at system-level

Push-button approach:

  • 500 DOT32 cores into the Stratix 10 1SG280LN2F43E1VG
  • Quartus 18.1 with Fractal Synthesis
  • clock frequency: 457.9 MHz
  • 4000/5760 DSP Blocks available (70%) - 16000 INT8 multipliers
  • 300K ALMs (without virtual pins), or 32% of the available logic

slide-34
SLIDE 34

Scaling at system-level

Push-button approach:

  • 700 DOT32 cores into the Stratix 10 1SG280LN2F43E1VG
  • Quartus 18.1 with Fractal Synthesis
  • clock frequency: 416 MHz
  • 5600/5760 DSP Blocks available (97%) - 22400 INT8 multipliers
  • 452K ALMs (less than half of the available logic)
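
The multiplier and DSP counts on these slides are consistent with a simple calculation, sketched here in Python. The helper names are illustrative, and two readings are assumed: that a DOT32 core contains 32 INT8 multipliers, and that each DSP block yields four INT8 multipliers (two per 18x18).

```python
def int8_multipliers(dot32_cores: int, mults_per_core: int = 32) -> int:
    """INT8 multipliers in a design, assuming 32 per DOT32 core."""
    return dot32_cores * mults_per_core


def dsp_blocks_needed(int8_mults: int, int8_per_dsp: int = 4) -> int:
    """DSP blocks needed, assuming four INT8 multipliers per DSP block."""
    return -(-int8_mults // int8_per_dsp)   # ceiling division


if __name__ == "__main__":
    for cores in (500, 700):
        mults = int8_multipliers(cores)
        print(cores, mults, dsp_blocks_needed(mults))
```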


slide-40
SLIDE 40

Conclusions

  1. extract Int8 multipliers from Int18 using minimal logic
  2. techniques presented for both signed and unsigned multipliers
  3. technique extensible to other multiplier sizes
  4. further area savings in dot-product units
  5. high system-level performance → 700 DOT32 in S10