SLIDE 1

SP FP Tangent Architecture for FPGAs

Bogdan Pasca, Martin Langhammer (Intel PSG)

Arithmetic in DSP (Special Session), Tuesday, July 25th, 2017

SLIDE 2

This talk is about

  • a single-precision tangent implementation
  • input range restricted to [−π/2, π/2]
  • accuracy can be tuned to meet OpenCL conformance
  • targets contemporary FPGAs with hard floating-point DSPs
  • available in the DSP Builder Advanced backend
  • this work builds on:

Faithful Single-Precision Floating-Point Tangent for FPGAs, Field Programmable Gate Arrays 2013 (FPGA’13)

2 Programmable Solutions Group Intel Public

SLIDE 3

Context

  • many trigonometric function implementations use CORDIC [8, 3]
    efficient for iterative implementations
    unrolled implementations are stressful on the FPGA fabric: multiple, deep arithmetic structures with wide adders
  • piecewise polynomial approximation can also be used [6]
    better mapping to the FPGA architecture
    can use operator assembly, but that can be wasteful [1, 5]
  • implementation as a fused operator
    with careful error analysis, a faithfully rounded SP tangent can be obtained

SLIDE 6

Technique

[Figure: y = tan(x) plotted over x ∈ (−π/2, π/2); for |x| < 2^−12, tan(x) ≈ x]

  • compute tan(x) with x ∈ [−π/2, +π/2]
  • the range-reduction step is not presented here [2, 7]
  • tangent is odd-symmetric: tan(−x) = −tan(x) → compute range = [0, π/2)
  • when x is very small (x < 2^−12 in SP), a good approximation is tan(x) ≈ x, since
    tan(x) = x + (1/3)x^3 + (2/15)x^5 + ...
  • a 24 + 12 = 36-bit fixed-point format is used for the compute-range input
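The small-x cutoff can be sanity-checked numerically. A minimal Python sketch (an illustration, not part of the original design) confirming that below 2^−12 the dropped series terms are far smaller than one SP ulp:

```python
import math

# For x < 2^-12, the first dropped term of tan(x) = x + x^3/3 + 2x^5/15 + ...
# contributes a relative error of about x^2/3 < 2^-25.5, safely below the
# 2^-24 relative weight of one single-precision ulp.
for x in [2.0**-13, 2.0**-20, 1e-7]:
    rel_err = abs(math.tan(x) - x) / math.tan(x)
    assert rel_err < 2.0**-24
```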

SLIDE 10

Technique

  • the following identity holds:

tan(a + b) = (tan(a) + tan(b)) / (1 − tan(a)·tan(b))

  • this can be expanded recursively to:

tan(a + b + c) = [ (tan(a) + tan(b)) / (1 − tan(a)·tan(b)) + tan(c) ] / [ 1 − ( (tan(a) + tan(b)) / (1 − tan(a)·tan(b)) ) · tan(c) ]

  • decompose the input as x = c + a + b, where the weight of the MSB of c is 2^0

x = c (9 bits) | a (9 bits) | b (18 bits)
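The decomposition and the recursive identity can be checked with a short Python model (the helper name `split_cab` is hypothetical; real-valued `math.tan` stands in for the hardware tables):

```python
import math

def split_cab(x):
    """Split x in [0, pi/2) on a 36-bit fixed-point grid (1 integer bit,
    35 fraction bits) into c (top 9 bits, MSB weight 2^0), a (next 9 bits)
    and b (low 18 bits)."""
    xf = round(x * 2**35)
    c = (xf >> 27) * 2.0**-8
    a = ((xf >> 18) & 0x1FF) * 2.0**-17
    b = (xf & 0x3FFFF) * 2.0**-35
    return c, a, b

x = 1.234
c, a, b = split_cab(x)
assert abs((c + a + b) - x) <= 2.0**-35   # split is exact up to input rounding

# recursive tangent-addition identity from the slide
t_ab = (math.tan(a) + math.tan(b)) / (1 - math.tan(a) * math.tan(b))
t = (t_ab + math.tan(c)) / (1 - t_ab * math.tan(c))
assert abs(t - math.tan(x)) < 1e-8
```

The split is exact on the 36-bit grid, so the identity reproduces tan(x) up to the 2^−36 input-rounding error.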

SLIDE 12

Approximating the numerator

n = (tan(a) + tan(b)) / (1 − tan(a)·tan(b)) + tan(c)

  • tan(b) ∈ [0, 2^−17], tan(a) ∈ [0, 2^−8]
  • tan(a)·tan(b) ∈ [0, 2^−25] → 1 − tan(a)·tan(b) ≈ 1
  • since b < 2^−17, a good approximation is tan(b) ≈ b
  • the numerator is approximated by:

n = tan(a) + b + tan(c)

SLIDE 15

Approximating the denominator

d = 1 − ( (tan(a) + tan(b)) / (1 − tan(a)·tan(b)) ) · tan(c)

  • use the same technique: 1 − tan(a)·tan(b) ≈ 1 and tan(b) ≈ b
  • possible cancellation can amplify an existing error
  • cancellation occurs for x close to π/2
  • use an additional table for the final 512 ulp before π/2
  • the denominator is approximated by:

d = 1 − (tan(a) + b)·tan(c)
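A numeric spot check of both approximations (arbitrary in-range sample values; real-valued arithmetic, so this validates the algebraic error budget rather than the hardware rounding):

```python
import math

# sample values respecting the bit-split ranges: c < pi/2, a < 2^-8, b < 2^-17
c, a, b = 1.5632, 2.0**-9 * 1.37, 2.0**-18 * 1.91

# exact (real-arithmetic) numerator and denominator
ab = (math.tan(a) + math.tan(b)) / (1 - math.tan(a) * math.tan(b))
n_exact = ab + math.tan(c)
d_exact = 1 - ab * math.tan(c)

# the approximations used by the datapath
n_approx = math.tan(a) + b + math.tan(c)
d_approx = 1 - (math.tan(a) + b) * math.tan(c)

assert abs(n_approx - n_exact) / abs(n_exact) < 2.0**-24
assert abs(n_approx / d_approx - math.tan(c + a + b)) / math.tan(c + a + b) < 2.0**-20
```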

SLIDE 16

Implementation using Hard FP DSP Blocks

  • Alignment:
    c and a are obtained from x_fxp
    reduced barrel-shifter size: use the top 18 bits of M_x
    maximum shift size is 13 (tan(x) = x if the required shift is > 13)
  • b in floating-point:
    use a mask to extract the FP information

shift | e_biased  | mask
    0 | 0111 1111 | 00000000000000000 111111
    1 | 0111 1110 | 00000000000000001 111111
    2 | 0111 1101 | 00000000000000011 111111
    3 | 0111 1100 | 00000000000000111 111111
    4 | 0111 1011 | 00000000000001111 111111
    5 | 0111 1010 | 00000000000011111 111111
    6 | 0111 1001 | 00000000000111111 111111
    7 | 0111 1000 | 00000000001111111 111111
    8 | 0111 0111 | 00000000011111111 111111
    9 | 0111 0110 | 00000000111111111 111111
   10 | 0111 0101 | 00000001111111111 111111
   11 | 0111 0100 | 00000011111111111 111111
   12 | 0111 0011 | 00000111111111111 111111
   13 | 0111 0010 | 00000111111111111 111111
  ... | ...       | ...

use FP subtraction to get b in FP.
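The table rows follow a simple pattern that a few lines of Python can regenerate (a reconstruction of the slide's table, on the reading that the mask selects the shift + 6 low significand bits of weight below 2^−17, capped at b's 18-bit width, which would explain why row 13 repeats row 12):

```python
def b_mask_row(s):
    """Row of the mask table for shift amount s (0 <= s <= 13): the biased
    exponent 127 - s and a 23-bit fraction mask selecting the low s + 6
    significand bits that belong to b, capped at b's 18-bit width."""
    e_biased = 127 - s
    ones = min(s + 6, 18)       # the cap makes row 13 identical to row 12
    mask = (1 << ones) - 1
    return f"{e_biased:08b}", f"{mask:023b}"

for s in range(14):
    print(s, *b_mask_row(s))
```

Row 7 reproduces the exponent 01111000 and the 13-one mask used in the worked example on the next slide.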

SLIDE 20

Obtaining b in FP: example

  • let x = 0 01111000 10011110001101001111001
  • the mask is 00000000001111111 & 111111 right padding
  • apply the mask to the fraction:
    10011110001101001111001 AND 00000000001111111111111 = 00000000001101001111001
  • create FP value vl = 0 01111000 00000000001101001111001
  • create FP value vr = 0 01111000 00000000000000000000000
  • obtain b = vl − vr = 2^−18 × 1.10100111100100000000000₂
  • b = 0 01101101 10100111100100000000000
  • this is the same as:
    c (9 bits) = 000000011, a (9 bits) = 001111000, b (18 bits) = 110100111100100000
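The worked example can be replayed on the host with `struct`-based bit manipulation (a software illustration, not the hardware datapath):

```python
import struct

def f32(u):
    """Interpret a 32-bit pattern as an IEEE-754 single and return it as a float."""
    return struct.unpack(">f", struct.pack(">I", u))[0]

x_bits = 0b0_01111000_10011110001101001111001  # the slide's example input
mask   = 0b00000000001111111111111             # 17-bit table mask & 6-bit right padding

exp_part = x_bits & ~((1 << 23) - 1)           # sign and exponent, fraction zeroed
vl = f32(exp_part | (x_bits & mask))           # masked fraction under x's exponent
vr = f32(exp_part)                             # same exponent, fraction = 0
b = vl - vr                                    # this FP subtraction is exact

# matches the slide: b = 2^-18 * 1.101001111001_2
assert b == f32(0b0_01101101_10100111100100000000000)
```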

SLIDE 26

Implementation using Hard FP DSP Blocks

  • Common subexpression
    tan(a) + b appears in both the numerator and the denominator
    obtain tan(a) by tabulation directly in FP
    perform the sum in SP FP
  • Numerator
    obtain tan(c) by tabulation directly in FP
    use an FP adder to add tan(c) to the value of the common subexpression
  • Denominator
    multiply tan(c) by the value of the common subexpression
    use an FP subtracter to perform 1 − (tan(a) + b)·tan(c)
  • Reciprocal
    can use a piecewise-polynomial-based inverse
  • Assembly
    use a floating-point multiplier

SLIDE 31

Enhanced reciprocal exponent handling

  • compute the reciprocal of the denominator:
    mantissa M_d = 1 + y, y ∈ [0, 1)
    biased exponent of the denominator: e_d^biased = e_d + bias
  • the reciprocal:
    mantissa 1/M_d ∈ (0.5, 1]
    pre-normalization biased reciprocal exponent: e_r^biased = −e_d + bias = 2·bias − e_d^biased
  • normalization → exponent adjustment
    let n be true if normalization is needed, i.e. 1/M_d ∈ (0.5, 1)
    post-normalization: e_r^biased = 2·bias − e_d^biased − n
    use bit manipulations to directly get 2·bias − n: 2·bias − n = 11111110 − n = 111111 n̄ n
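The bit trick for 2·bias − n (the overline appears lost in the extracted slide text: the pattern is 111111 followed by n̄ then n) checks out in a couple of lines of Python:

```python
BIAS = 127  # single-precision exponent bias

# n = 1 when the reciprocal mantissa needs normalization, i.e. 1/M_d in (0.5, 1)
for n in (0, 1):
    direct = 2 * BIAS - n                         # arithmetic form
    trick = (0b111111 << 2) | ((1 - n) << 1) | n  # "111111" ++ not(n) ++ n
    assert direct == trick
```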

SLIDE 33

Architecture

[Figure: datapath block diagram, built up over slides 33–45. The input X is converted to fixed point (x_fxp); c and a are taken as the bit fields [17:9] and [8:0] of ’1’&fracX[22:6]; a mask table indexed by expX[3:0] produces 2^expX(1+b) and 2^expX, whose FP subtraction yields b; tan(c) and tan(a) LUTs feed FP adders forming tan(a)+b and the numerator; an FP multiplier and an FP subtracter form 1−(tan(a)+b)tan(c); a piecewise polynomial approximation (PPA) produces the 24-bit fraction of 1/x, normalized with exponent 111111 n̄ n; a final FP multiplier assembles R; a separate LUT, indexed by fracX[8:0], covers tan(x) over the final 512 ulp before π/2.]
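Putting the pieces together, a behavioral Python model of the datapath can exercise the whole flow (the name `tan_sp` is hypothetical; tables are emulated with real-valued `math.tan`, and an exact reciprocal stands in for the PPA, so this validates the algebra rather than the final rounding):

```python
import math

def tan_sp(x):
    """Behavioral model of the tangent datapath for x in [0, pi/2 - 512 ulp)."""
    if x < 2.0**-12:
        return x                            # small-input path: tan(x) = x
    xf = round(x * 2**35)                   # 36-bit fixed-point input
    c = (xf >> 27) * 2.0**-8                # top 9 bits, MSB weight 2^0
    a = ((xf >> 18) & 0x1FF) * 2.0**-17     # next 9 bits
    b = (xf & 0x3FFFF) * 2.0**-35           # low 18 bits
    s = math.tan(a) + b                     # common subexpression tan(a) + b
    n = s + math.tan(c)                     # numerator
    d = 1.0 - s * math.tan(c)               # denominator
    return n * (1.0 / d)                    # reciprocal, then final multiply

for x in [0.001, 0.7, 1.0, 1.5, 1.57]:
    assert abs(tan_sp(x) - math.tan(x)) / math.tan(x) < 2.0**-18
```

Even at x = 1.57, inside the cancellation-prone region but outside the final 512 ulp handled by the dedicated LUT, the model stays well within the asserted bound.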

SLIDE 46

Side-by-side comparison against the non-HFP architecture

[Figure: the FPGA’13 fixed-point architecture (LZCs, normalization stages, fixed-point tan tables, rounding and exception handling, final 256 ulp LUT before π/2) shown next to the proposed hard-FP architecture (FP tan(a)/tan(c) tables, FP adders and multipliers, PPA-based reciprocal, final 512 ulp LUT before π/2).]

SLIDE 47

Results

Table: Synthesis results for recent FPGA architectures

Target     | Architecture         | Latency @ Freq. | Resources
Arria 10   | Proposed             | 36 @ 520 MHz    | 8 DSPs, 6 M20K, 990 ALMs
Stratix 10 | Proposed w/o HIPI    | 53 @ 532 MHz    | 8 DSPs, 7 M20K, 1573 ALMs
Stratix 10 | Proposed HIPI*       | 53 @ 655 MHz    | 8 DSPs, 7 M20K, 1632 ALMs
Arria 10   | Proposed FP          | 32 @ 487 MHz    | 8 DSPs, 6 M20K, 393 ALMs
Stratix 10 | Proposed FP w/o HIPI | 39 @ 573 MHz    | 8 DSPs, 6 M20K, 566 ALMs
Stratix 10 | Proposed FP HIPI*    | 39 @ 644 MHz    | 8 DSPs, 6 M20K, 572 ALMs

  • results for both proposed architectures
  • throughput of 1 operation / clock cycle
  • the largest Arria 10 has 472200 ALMs: utilization of 0.23% for Proposed, 0.09% for Proposed FP
  • the same Arria 10 has 1518 DSPs: utilization of 0.5% for Proposed and Proposed FP
  • the same Arria 10 has 2713 M20Ks: utilization of 0.2% for Proposed and Proposed FP
  • the HyperFlex architecture enhances Stratix 10 performance
  • Stratix 10: ALMs × 2, DSPs × 3.8, M20Ks × 4.3

SLIDE 49

Conclusion

  • proposed a fused operator for single-precision tangent
  • the error analysis allows meeting OpenCL accuracy requirements
  • the proposed cores are fully pipelined, with a throughput of 1 operation/cycle
  • the proposed cores require very little area
  • DSP- and memory-centric architectures allow reaching high frequencies
  • the Stratix 10 and Arria 10 hard FP DSP block allows for an efficient implementation
  • lower latency in high-frequency scenarios

SLIDE 50

References

[1] de Dinechin, F., and Pasca, B. Designing custom arithmetic data paths with FloPoCo. IEEE Design and Test (2011).
[2] Detrey, J., and de Dinechin, F. Floating-point trigonometric functions for FPGAs. In International Conference on Field Programmable Logic and Applications (Amsterdam, Netherlands, Aug. 2007), IEEE, pp. 29–34.
[3] Garcia, E., Cumplido, R., and Arias, M. Pipelined CORDIC design on FPGA for a digital sine and cosine waves generator. In Electrical and Electronics Engineering, 2006 3rd International Conference on (Sept. 2006), pp. 1–4.
[4] Langhammer, M., and Pasca, B. Efficient floating-point polynomial evaluation on FPGAs. In 23rd International Conference on Field Programmable Logic and Applications (FPL’13) (Porto, Portugal, Aug. 2013), IEEE.
[5] Langhammer, M., and VanCourt, T. FPGA floating point datapath compiler. Field-Programmable Custom Computing Machines, Annual IEEE Symposium on 17 (2009), 259–262.
[6] Pasca, B. Correctly rounded floating-point division for DSP-enabled FPGAs. In 22nd International Conference on Field Programmable Logic and Applications (FPL’12) (Oslo, Norway, Aug. 2012), IEEE.
[7] Payne, M. H., and Hanek, R. N. Radian reduction for trigonometric functions. ACM SIGNUM Newsletter 18, 1 (Jan. 1983), 19–24.
[8] Shang, Y. Implementation of IP core of fast sine and cosine operation through FPGA. Energy Procedia 16, Part B (2012), 1253–1258. 2012 ICFEEM.

SLIDE 51

Figure: ”Registers Everywhere” HyperFlex Architecture

SLIDE 52

Figure: Pipelining in Conventional FPGA Architectures. Figure: Hyper-Pipelining in the HyperFlex Architecture