Black-box approximation of bivariate functions for FPGAs David B. - - PowerPoint PPT Presentation

black box approximation of bivariate functions for fpgas
SMART_READER_LITE
LIVE PREVIEW

Black-box approximation of bivariate functions for FPGAs David B. - - PowerPoint PPT Presentation

Black-box approximation of bivariate functions for FPGAs David B. Thomas, Imperial College London Chosen approximation problem Input xy = x x y ( 2 fx Z ) x ( 2 fy Z ) Fixed-point domain: Fixed-point range: r = 2


slide-1
SLIDE 1

Black-box approximation

  • f bivariate functions

David B. Thomas, Imperial College London

…for FPGAs

slide-2
SLIDE 2

Chosen approximation problem

  • Input
  • Fixed-point domain:

Ωxy = Ωx x Ωy  ( 2fxZ ) x ( 2fyZ )

  • Fixed-point range:

Ωr = 2frZ

  • A function:

f : Ωxy → Ωr

  • An error budget:

ε ( ε >=2fr )

  • Output
  • An approximation: a : Ωxy → Ωr
  • A certificate:

ꓯ (x,y)  Ωxy: | f(x,y) – a(x,y) | < ε

  • Synthesisable C

Does not need to be differentiable, analytic, or smooth

slide-3
SLIDE 3

Classic example : atan2

  • Input domain :

Ωxy = [-1,1) x [-1,1) fx=-8, fy=-8

  • Output domain : fr = -16
slide-4
SLIDE 4

Classic example : atan2

  • Output domain : fr = -1
slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8
slide-9
SLIDE 9

1d function approximation

  • Function-specific range-reduction
  • Split the remaining problem up into easier parts
  • Come up with small primitives that locally fit
  • Can use Remez algorithm to find best polynomial

Range Reduction Domain Splitting Primitive Evaluation x x* x* i Range Reconst. r* r

slide-10
SLIDE 10

2d function approximation

Range Reduction Domain Splitting Primitive Evaluation x y x* y* x* y* i Range Reconst. r* r Uniform grid Binary split Quadtree Hash-table Polynomial Spline RBF Chebfun2 Kernels Relative -> Absolute Symmetries Odd-even function

slide-11
SLIDE 11

Why this is hard?

  • Remez approach does not work
  • Degree m polynomials get expensive in 2D
  • Memory:

O(m) -> O(m2) coefficients

  • Area :

O(m) -> O(m2) multipliers

  • Delay:

O(m) -> O(2m) stages

  • Uniform domain splitting requires too much RAM
  • Non-uniform domain splitting get expensive
  • Large number of splitting coefficients
  • Potential for very deep trees
slide-12
SLIDE 12

Why make it harder?

  • Why not require partial derivatives?
  • Doesn’t actually buy us much in practise
  • Makes it more difficult for end-users
  • Why not smooth/analytic/continuous functions?
  • Going to have to prove correctness empirically anyway
  • Why not exploit function-specific range reductions?
  • Many interesting 2D functions do not have any
  • Often we just get symmetry, so “only” a factor of 2 saving
slide-13
SLIDE 13

Making it easier

  • Target domains of up to about 224 x 224 bits
  • 1 CPU : Verify by exhaustion up to ~ 220 x 220 in a day
  • 9 EC2 c4.large: verify up to 224 x 224 in a day and $108
  • Though arguably a bit pointless
  • Target range of up to ~224 bits
  • e.g. [0,1) with lsb=-24, or [-4,+4) with lsb=-16
  • Expect user to try to manage range a bit
  • Though the method will work with larger ranges
  • Treat double as “real”
  • Target function given as: double f(double x, double y);
  • Reasonable given the above assumptions
  • We are unlikely to have mpfr versions of functions
slide-14
SLIDE 14

Restricting the solution space

  • Influenced by experience with FloatApprox
  • No function-specific range reduction
  • Only use binary trees to split the domain
  • Hash-tables are very promising: not covered here
  • Only use polynomial patches as primitives
  • Chebfun2-like patches quite promising

Domain Splitting Primitive Evaluation x y x* y* i r Binary split Polynomial

slide-15
SLIDE 15

High-level Algorithm

int a(int x, int y) { // Constant table of polynomials static const poly_t polynomials[...] = { ... }; // Select index of polynomial segment int i = select_segment(x,y); // Get all the coefficients poly_t poly=polynomials[i]; // Evaluate that patch at input co-ordinates return eval(poly,x,y); }

slide-16
SLIDE 16

Segmentation by binary split

struct node_t { bool dir; int split; int left, right; }; int select_segment(int x, int y) { // Constant table of splits static const node_t n0[1] = { ... }; static const node_t n1[2] = { ... }; static const node_t n2[4] = { ... }; static const node_t n3[7] = { ... }; // Select index of polynomial segment using pipeline int i=0; i=(n0[i].dir?x:y) < n0[i].split ? n0[i].left : n0[i].right; i=(n1[i].dir?x:y) < n1[i].split ? n1[i].left : n1[i].right; i=(n2[i].dir?x:y) < n2[i].split ? n2[i].left : n2[i].right; i=(n3[i].dir?x:y) < n3[i].split ? n3[i].left : n3[i].right; return i; }

slide-17
SLIDE 17

Residual of atan2

Degree=2, fracBits=-10, maxError=0.001

slide-18
SLIDE 18

Residual of atan2

degree 3, polys=154 degree 4, polys=95

slide-19
SLIDE 19

Residual of atan2

maxError=0.001, polys=154, maxError=0.0001, polys=676

slide-20
SLIDE 20

Poly count versus input precision

Quadratic, maxError=0.001

64 256 1024 4096 16384 65536 10 12 14 16 18

Number of polynomials Domain Fractional Bits

func_atan2_whole func_chebfun2_ex_f1 func_gamma_p func_gamma_p_inv func_gamma_p_inva func_lbeta func_normpdf_offset func_owens_t

slide-21
SLIDE 21

Poly count versus input precision

Cubic, maxError=0.001

64 256 1024 4096 16384 65536 10 12 14 16 18

Number of polynomials Domain Fractional Bits

func_atan2_whole func_chebfun2_ex_f1 func_gamma_p func_gamma_p_inv func_gamma_p_inva func_lbeta func_normpdf_offset func_owens_t

slide-22
SLIDE 22

Controlling the split

  • Which dimension to split: x or y?
  • Alternating dimensions: uniformly sub-divide to squares
  • Repeat one dimension: allows long thin patches
  • Where should we split along the dimension?
  • Split in the middle: narrow binary constants in tree
  • Split adaptatively: manage the depth of the tree
slide-23
SLIDE 23
slide-24
SLIDE 24
slide-25
SLIDE 25
slide-26
SLIDE 26
slide-27
SLIDE 27
slide-28
SLIDE 28
slide-29
SLIDE 29
slide-30
SLIDE 30

Curvature based splitting

  • We need to estimate curvature in some way
  • Currently use number of Chebyshev coefficients
  • Fit Chebyshev curves to f(x,ymid) and f(xmid,y)
  • Use highest degree non-zero coeff as “curvature”
  • Split along the “hardest” direction
  • Also played around with other methods
  • E.g. varying split point to balance curvature
  • Can sometimes reduce tree depth
slide-31
SLIDE 31
slide-32
SLIDE 32

Naïve versus curvature split

10 100 1000 10000

Number of polynomials Function

curvature_split naive

slide-33
SLIDE 33

Evaluating a 2D polynomial

  • Standard 1D polynomial
  • p(x) = a0 + a1 x + a2 x2 + a3 x3 … am xm
  • Extending to a 2D polynomial
  • p(x,y) = a0,0

+ a1,0 x + a2,0 x2 + a3,0 x3 + … + a0,1 y + a1,1 x y + a2,1 x2 y + a3,1 x3 y + … + a0,2 y2 + a1,2 x y2 + a2,2 x2 y2 + a3,2 x3 y2 + … + a0,3 y3 + a1,3 x y3 + a2,3 x2 y3 + a3,3 x3 y3 + … +

  • Need m2 coefficients
  • Naïve form
  • Requires m2 multipliers
  • Horrible numeric properties: requires many guard bits
slide-34
SLIDE 34

2D Horner form

+ * + c20 c22 * c23 + * + c10 c12 * c13 + * + c30 c32 * c33 + * + * x y + * + c00 c02 * c03 * + c21 * + c11 * + c31 * + c01 + *

slide-35
SLIDE 35

Advantages of Horner form

  • Only requires (m-1)2 multipliers
  • Latency is 2(m+1) multipliers
  • Evaluation guard bits are roughly O(log m) bits
  • Requires -1 <= x <= +1 and -1 <= y <= +1
  • Pre-scale all polynomials
  • Start with native polynomial range [xL,xH) x [yL,yH)
  • Use offsets xO = -(xL+xH)/2 and yL = -(yL+hH)/2
  • Find integers xS and yS such that:
  • 1 <= 2xS (xL+xO) <0.5 and 0.5 < 2xS (xH+xO) <= +1
  • Also drastically reduces range of intermediates
  • Means multiplier input widths are much smaller
slide-36
SLIDE 36

Fitting polynomials

  • We don’t have the Remez property
  • Use least squares to fit the coefficients

p(x,y) = a0,0 + a1,0 x + a2,0 x2 + a3,0 x3 + … + a0,1 y + a1,1 x y + a2,1 x2 y + a3,1 x3 y + … + a0,2 y2 + a1,2 x y2 + a2,2 x2 y2 + a3,2 x3 y2 + … + a0,3 y3 + a1,3 x y3 + a2,3 x2 y3 + a3,3 x3 y3 + …

  • Find a set of n points p spread over interval
  • Regress to find best coefficients in least squares

sense

slide-37
SLIDE 37

Optimising polynomials

  • Least squares fit is not a minimax approximation…
  • Attempt 1: successive over-constrained fit
  • Use residual in previous fit to nudge next fit
  • Attempt 2 : downhill simplex to optimise coefficients
  • Tiny improvements in maximum error
  • Attempt 3 : use Chebyshev nodes during fit
  • No noticeable improvement over uniform nodes
  • Attempt 4 : express the problem as discrete LP
  • Far too slow to be useful
slide-38
SLIDE 38
slide-39
SLIDE 39
slide-40
SLIDE 40

Overall: total BRAMs, error=10-3

5 10 15 20 25 30 10 12 14 16 18

Number of Block RAMs

Domain fractional bits

func_atan2_whole func_chebfun2_ex_f1 func_gamma_p func_gamma_p_inv func_gamma_p_inva func_lbeta func_normpdf_offset func_owens_t

slide-41
SLIDE 41

Overall: total BRAMs, error=10-5

100 200 300 400 500 10 12 14 16 18

Number of Block RAMs

Domain fractional bits

func_atan2_whole func_chebfun2_ex_f1 func_gamma_p func_gamma_p_inv func_gamma_p_inva func_lbeta func_normpdf_offset func_owens_t

slide-42
SLIDE 42

Conclusion

  • Black-box 2D function approximation is feasible
  • At least up to around 224 in each dimension
  • Major constraint is block RAM usage
  • Most functions only take up a fraction of RAM though
  • Route to hardware is via HLS
  • Generate C code describing tables
  • Ongoing work
  • Optimising the splitting process
  • Improving monomial basis selection
  • Lower-level hardware optimisations
slide-43
SLIDE 43

Depopulated polynomials

+ c20 + * c10 c12 c30 + * + * x y + * + c00 c02 * c03 * c21 * + c11 * + c01 + *