  1. Black-box approximation of bivariate functions …for FPGAs
  David B. Thomas, Imperial College London

  2. Chosen approximation problem
  • Input
    • Fixed-point domain: Ω_xy = Ω_x × Ω_y ⊆ (2^fx · Z) × (2^fy · Z)
    • Fixed-point range: Ω_r = 2^fr · Z
    • A function: f : Ω_xy → Ω_r (does not need to be differentiable, analytic, or smooth)
    • An error budget: ε (with ε ≥ 2^fr)
  • Output
    • An approximation: a : Ω_xy → Ω_r
    • A certificate: ∀ (x,y) ∈ Ω_xy : |f(x,y) − a(x,y)| < ε
    • Synthesisable C
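
A minimal sketch of how this problem statement maps onto code, assuming the illustrative bit positions fx = fy = -8 and fr = -16 from the atan2 example on the next slide; the names FX, FY, FR, to_real and meets_budget are not from the presentation.

  #include <math.h>
  #include <stdbool.h>
  #include <stdint.h>

  // Illustrative fixed-point parameters: domain lsb 2^-8, range lsb 2^-16
  #define FX (-8)
  #define FY (-8)
  #define FR (-16)

  // A domain/range integer v encodes the real value v * 2^frac_bits
  static double to_real(int32_t v, int frac_bits) { return ldexp((double)v, frac_bits); }

  // The reference function is supplied as a black box: double f(double x, double y);
  // The certificate predicate at one point: |f(x,y) - a(x,y)| < eps,
  // where a() returns a fixed-point result with lsb 2^FR
  static bool meets_budget(double f_xy, int32_t a_xy, double eps) {
      return fabs(f_xy - to_real(a_xy, FR)) < eps;
  }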

  3. Classic example: atan2
  • Input domain: Ω_xy = [-1,1) × [-1,1), fx = -8, fy = -8
  • Output domain: fr = -16

  4. Classic example: atan2
  • Output domain: fr = -1

  5. 1d function approximation
  • Function-specific range-reduction
  • Split the remaining problem up into easier parts
  • Come up with small primitives that locally fit
  • Can use Remez algorithm to find best polynomial
  • (Figure: pipeline x → Range Reduction → x* → Domain Splitting → i → Primitive Evaluation → r* → Range Reconstruction → r)

  6. 2d function approximation
  • (Figure: pipeline (x, y) → Range Reduction → (x*, y*) → Domain Splitting → i → Primitive Evaluation → r* → Range Reconstruction → r)
  • Range Reduction options: symmetries, relative -> absolute, odd-even function
  • Domain Splitting options: uniform grid, binary split, quadtree, hash-table
  • Primitive Evaluation options: polynomial, spline, RBF, Chebfun2, kernels

  7. Why is this hard?
  • Remez approach does not work
  • Degree-m polynomials get expensive in 2D
    • Memory: O(m) -> O(m^2) coefficients
    • Area: O(m) -> O(m^2) multipliers
    • Delay: O(m) -> O(2m) stages
  • Uniform domain splitting requires too much RAM
  • Non-uniform domain splitting gets expensive
    • Large number of splitting coefficients
    • Potential for very deep trees

  8. Why make it harder?
  • Why not require partial derivatives?
    • Doesn't actually buy us much in practice
    • Makes it more difficult for end-users
  • Why not smooth/analytic/continuous functions?
    • Going to have to prove correctness empirically anyway
  • Why not exploit function-specific range reductions?
    • Many interesting 2D functions do not have any
    • Often we just get symmetry, so "only" a factor-of-2 saving

  9. Making it easier
  • Target domains of up to about 2^24 x 2^24 bits
    • 1 CPU: verify by exhaustion up to ~2^20 x 2^20 in a day (see the sketch below)
    • 9 EC2 c4.large: verify up to 2^24 x 2^24 in a day for $108
    • Though arguably a bit pointless
  • Target range of up to ~2^24 bits
    • e.g. [0,1) with lsb=-24, or [-4,+4) with lsb=-16
    • Expect user to try to manage range a bit
    • Though the method will work with larger ranges
  • Treat double as "real"
    • Target function given as: double f(double x, double y);
    • Reasonable given the above assumptions
    • We are unlikely to have mpfr versions of functions
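
A minimal sketch of the verify-by-exhaustion step, assuming the atan2 example domain (fx = fy = -8 over [-1,1), fr = -16) and the int a(int, int) interface shown on the following slides; using atan2 as the reference here is purely illustrative.

  #include <math.h>
  #include <stdio.h>

  // Reference black-box function, supplied as: double f(double x, double y);
  static double f(double x, double y) { return atan2(y, x); }

  // The generated fixed-point approximation (tables shown on the next slides)
  extern int a(int x, int y);

  // Exhaustively check |f(x,y) - a(x,y)| < eps over the whole fixed-point domain
  static int verify_exhaustive(double eps) {
      const int fx = -8, fy = -8, fr = -16;
      for (int xi = -(1 << -fx); xi < (1 << -fx); ++xi) {
          for (int yi = -(1 << -fy); yi < (1 << -fy); ++yi) {
              double ref = f(ldexp((double)xi, fx), ldexp((double)yi, fy));
              double got = ldexp((double)a(xi, yi), fr);
              if (fabs(ref - got) >= eps) {
                  fprintf(stderr, "fail at x=%d y=%d\n", xi, yi);
                  return 0;
              }
          }
      }
      return 1;  // certificate holds at every domain point
  }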

  10. Restricting the solution space
  • Influenced by experience with FloatApprox
  • No function-specific range reduction
  • Only use binary trees to split the domain
    • Hash-tables are very promising: not covered here
  • Only use polynomial patches as primitives
    • Chebfun2-like patches quite promising
  • (Figure: pipeline (x, y) → Domain Splitting (binary split) → i → Primitive Evaluation (polynomial) → r)

  11. High-level Algorithm

  int a(int x, int y) {
      // Constant table of polynomials
      static const poly_t polynomials[...] = { ... };
      // Select index of polynomial segment
      int i = select_segment(x, y);
      // Get all the coefficients
      poly_t poly = polynomials[i];
      // Evaluate that patch at input co-ordinates
      return eval(poly, x, y);
  }

  12. Segmentation by binary split

  struct node_t { bool dir; int split; int left, right; };

  int select_segment(int x, int y) {
      // Constant table of splits
      static const node_t n0[1] = { ... };
      static const node_t n1[2] = { ... };
      static const node_t n2[4] = { ... };
      static const node_t n3[7] = { ... };
      // Select index of polynomial segment using pipeline
      int i = 0;
      i = (n0[i].dir ? x : y) < n0[i].split ? n0[i].left : n0[i].right;
      i = (n1[i].dir ? x : y) < n1[i].split ? n1[i].left : n1[i].right;
      i = (n2[i].dir ? x : y) < n2[i].split ? n2[i].left : n2[i].right;
      i = (n3[i].dir ? x : y) < n3[i].split ? n3[i].left : n3[i].right;
      return i;
  }

  13. Residual of atan2 (figure: degree=2, fracBits=-10, maxError=0.001)

  14. Residual of atan2 (figures: degree 3, polys=154; degree 4, polys=95)

  15. Residual of atan2 (figures: maxError=0.001, polys=154; maxError=0.0001, polys=676)

  16. Poly count versus input precision (plot: quadratic, maxError=0.001; number of polynomials, 64 to 65536, versus domain fractional bits, 10 to 18; functions: func_atan2_whole, func_chebfun2_ex_f1, func_gamma_p, func_gamma_p_inv, func_gamma_p_inva, func_lbeta, func_normpdf_offset, func_owens_t)

  17. Poly count versus input precision (plot: cubic, maxError=0.001; number of polynomials, 64 to 65536, versus domain fractional bits, 10 to 18; same set of functions as the previous plot)

  18. Controlling the split
  • Which dimension to split: x or y?
    • Alternating dimensions: uniformly sub-divide to squares
    • Repeat one dimension: allows long thin patches
  • Where should we split along the dimension?
    • Split in the middle: narrow binary constants in the tree
    • Split adaptively: manage the depth of the tree

  19. Curvature-based splitting
  • We need to estimate curvature in some way
  • Currently use number of Chebyshev coefficients
    • Fit Chebyshev curves to f(x, y_mid) and f(x_mid, y)
    • Use highest-degree non-zero coeff as "curvature" (see the sketch below)
  • Split along the "hardest" direction
  • Also played around with other methods
    • E.g. varying split point to balance curvature
    • Can sometimes reduce tree depth
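
A minimal sketch of the curvature proxy described above, assuming the 1D slice of the target function is passed in as a callback together with a context pointer (for the fixed coordinate); the node count NCHEB, the tolerance and the name cheb_curvature are illustrative choices, not taken from the presentation.

  #include <math.h>

  #define NCHEB 16                  // Chebyshev sample points per 1D slice
  #define CHEB_PI 3.14159265358979323846

  // Estimate the "curvature" of a 1D slice g(t, ctx) over [lo,hi] as the
  // highest degree whose Chebyshev coefficient is above a small tolerance.
  static int cheb_curvature(double (*g)(double, void *), void *ctx,
                            double lo, double hi, double tol) {
      double samples[NCHEB], coeffs[NCHEB];
      // Sample at Chebyshev points of the first kind, mapped into [lo,hi]
      for (int j = 0; j < NCHEB; ++j) {
          double t = cos(CHEB_PI * (j + 0.5) / NCHEB);   // in [-1,1]
          samples[j] = g(0.5 * (lo + hi) + 0.5 * (hi - lo) * t, ctx);
      }
      // Discrete cosine sum gives the Chebyshev interpolant coefficients
      for (int k = 0; k < NCHEB; ++k) {
          double s = 0.0;
          for (int j = 0; j < NCHEB; ++j)
              s += samples[j] * cos(CHEB_PI * k * (j + 0.5) / NCHEB);
          coeffs[k] = (2.0 / NCHEB) * s;
      }
      // Highest degree with a non-negligible coefficient is the curvature proxy
      for (int k = NCHEB - 1; k > 0; --k)
          if (fabs(coeffs[k]) > tol) return k;
      return 0;
  }

The splitter would call this along x (with y fixed at y_mid) and along y (with x fixed at x_mid) and split the dimension that reports the higher value.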

  20. Naïve versus curvature split (plot: number of polynomials, 10 to 10000, per function, comparing curvature_split against naive)

  21. Evaluating a 2D polynomial
  • Standard 1D polynomial
    • p(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + … + a_m x^m
  • Extending to a 2D polynomial
    • p(x,y) = a_{0,0} + a_{1,0} x + a_{2,0} x^2 + a_{3,0} x^3 + …
             + a_{0,1} y + a_{1,1} x y + a_{2,1} x^2 y + a_{3,1} x^3 y + …
             + a_{0,2} y^2 + a_{1,2} x y^2 + a_{2,2} x^2 y^2 + a_{3,2} x^3 y^2 + …
             + a_{0,3} y^3 + a_{1,3} x y^3 + a_{2,3} x^2 y^3 + a_{3,3} x^3 y^3 + …
  • Need m^2 coefficients
  • Naïve form (see the sketch below)
    • Requires m^2 multipliers
    • Horrible numeric properties: requires many guard bits
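
A minimal sketch of the naïve tensor-product evaluation, assuming a degree bound M in each variable and a row-major coefficient array; the explicitly formed powers are where the m^2 multipliers and the poor numerical behaviour come from.

  #define M 3  // illustrative degree bound in each variable

  // Naive evaluation of p(x,y) = sum_{i,j} a[j][i] * x^i * y^j
  static double eval_naive(const double a[M + 1][M + 1], double x, double y) {
      double result = 0.0;
      double ypow = 1.0;                        // y^j
      for (int j = 0; j <= M; ++j) {
          double xpow = 1.0;                    // x^i
          for (int i = 0; i <= M; ++i) {
              result += a[j][i] * xpow * ypow;  // one multiplier per term
              xpow *= x;
          }
          ypow *= y;
      }
      return result;
  }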

  22. 2D Horner form (figure: dataflow of multipliers and adders; each row evaluates q_j(x) = ((c_{3,j} x + c_{2,j}) x + c_{1,j}) x + c_{0,j} by Horner in x, and the rows are combined by Horner in y as p(x,y) = ((q_3(x) y + q_2(x)) y + q_1(x)) y + q_0(x))
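
A minimal sketch of the 2D Horner evaluation in the diagram, written in plain double arithmetic with the same degree bound M as the naïve sketch; the fixed-point pipelining, rounding and guard-bit handling of the real datapath are not shown.

  #define M 3  // degree bound in each variable, matching the c00..c33 diagram

  // 2D Horner: inner Horner in x along each row, outer Horner in y across rows,
  // with c[j][i] the coefficient of x^i * y^j
  static double eval_horner2d(const double c[M + 1][M + 1], double x, double y) {
      double result = 0.0;
      for (int j = M; j >= 0; --j) {
          // q_j(x) = ((c[j][M]*x + c[j][M-1])*x + ...)*x + c[j][0]
          double row = c[j][M];
          for (int i = M - 1; i >= 0; --i)
              row = row * x + c[j][i];
          // p = ((q_M*y + q_{M-1})*y + ...)*y + q_0
          result = result * y + row;
      }
      return result;
  }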

  23. Advantages of Horner form
  • Only requires (m-1)^2 multipliers
  • Latency is 2(m+1) multipliers
  • Evaluation guard bits are roughly O(log m) bits
  • Requires -1 <= x <= +1 and -1 <= y <= +1
  • Pre-scale all polynomials (see the sketch below)
    • Start with the native polynomial range [xL,xH) x [yL,yH)
    • Use offsets xO = -(xL+xH)/2 and yO = -(yL+yH)/2
    • Find integers xS and yS such that: -1 <= 2^xS (xL+xO) < -0.5 and 0.5 < 2^xS (xH+xO) <= +1
  • Also drastically reduces range of intermediates
    • Means multiplier input widths are much smaller
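
A minimal sketch of the pre-scaling of one dimension, assuming double-precision patch bounds; frexp is used to pick the power-of-two scale, and the struct and function names are illustrative.

  #include <math.h>

  typedef struct { double offset; int scale; } prescale_t;

  // For a patch dimension [lo,hi), choose offset O = -(lo+hi)/2 and an integer
  // scale S so that 2^S * (hi + O) lies in (0.5, 1]; the shifted, scaled input
  // then falls inside [-1,+1] and uses at least half of that range.
  static prescale_t prescale_dim(double lo, double hi) {
      prescale_t p;
      p.offset = -0.5 * (lo + hi);
      double half_width = hi + p.offset;        // (hi - lo) / 2 > 0
      int e;
      double m = frexp(half_width, &e);         // half_width = m * 2^e, m in [0.5,1)
      p.scale = -e;                             // 2^scale * half_width now in [0.5,1)
      if (m == 0.5) p.scale += 1;               // push an exact 0.5 up into (0.5,1]
      return p;
  }

Each patch's polynomial would then be fitted (or have its coefficients transformed) in the scaled variable x' = 2^S * (x + O).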

  24. Fitting polynomials
  • We don't have the Remez property
  • Use least squares to fit the coefficients of
    • p(x,y) = a_{0,0} + a_{1,0} x + a_{2,0} x^2 + a_{3,0} x^3 + …
             + a_{0,1} y + a_{1,1} x y + a_{2,1} x^2 y + a_{3,1} x^3 y + …
             + a_{0,2} y^2 + a_{1,2} x y^2 + a_{2,2} x^2 y^2 + a_{3,2} x^3 y^2 + …
             + a_{0,3} y^3 + a_{1,3} x y^3 + a_{2,3} x^2 y^3 + a_{3,3} x^3 y^3 + …
  • Find a set of n points spread over the interval (see the sketch below)
  • Regress to find best coefficients in least squares sense
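
A minimal sketch of the least-squares fit over the full tensor-product monomial basis above, solving the normal equations with Gaussian elimination; a production fit would more likely use a QR or SVD solver for conditioning, and the name lsq_fit is illustrative.

  #include <math.h>

  #define M  3                        // degree bound in each variable
  #define NC ((M + 1) * (M + 1))      // number of coefficients

  // Fit a[j*(M+1)+i], the coefficient of x^i * y^j, to n samples
  // (xs[k], ys[k], fs[k]) by solving the normal equations (A^T A) a = A^T f
  static int lsq_fit(const double *xs, const double *ys, const double *fs,
                     int n, double a[NC]) {
      double ata[NC][NC] = {{0}}, atf[NC] = {0};
      // Accumulate A^T A and A^T f; row k of A holds the monomials x^i * y^j
      for (int k = 0; k < n; ++k) {
          double phi[NC];
          for (int j = 0; j <= M; ++j)
              for (int i = 0; i <= M; ++i)
                  phi[j * (M + 1) + i] = pow(xs[k], i) * pow(ys[k], j);
          for (int r = 0; r < NC; ++r) {
              for (int c = 0; c < NC; ++c) ata[r][c] += phi[r] * phi[c];
              atf[r] += phi[r] * fs[k];
          }
      }
      // Gaussian elimination with partial pivoting
      for (int p = 0; p < NC; ++p) {
          int best = p;
          for (int r = p + 1; r < NC; ++r)
              if (fabs(ata[r][p]) > fabs(ata[best][p])) best = r;
          for (int c = 0; c < NC; ++c) {
              double t = ata[p][c]; ata[p][c] = ata[best][c]; ata[best][c] = t;
          }
          double t = atf[p]; atf[p] = atf[best]; atf[best] = t;
          if (fabs(ata[p][p]) < 1e-12) return 0;     // singular: too few points
          for (int r = p + 1; r < NC; ++r) {
              double m = ata[r][p] / ata[p][p];
              for (int c = p; c < NC; ++c) ata[r][c] -= m * ata[p][c];
              atf[r] -= m * atf[p];
          }
      }
      // Back substitution
      for (int r = NC - 1; r >= 0; --r) {
          double s = atf[r];
          for (int c = r + 1; c < NC; ++c) s -= ata[r][c] * a[c];
          a[r] = s / ata[r][r];
      }
      return 1;
  }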

  25. Optimising polynomials
  • Least squares fit is not a minimax approximation…
  • Attempt 1: successive over-constrained fit
    • Use residual in previous fit to nudge next fit
  • Attempt 2: downhill simplex to optimise coefficients
    • Tiny improvements in maximum error
  • Attempt 3: use Chebyshev nodes during fit
    • No noticeable improvement over uniform nodes
  • Attempt 4: express the problem as discrete LP
    • Far too slow to be useful

  26. Overall: total BRAMs, error=10^-3 (plot: number of Block RAMs, 0 to 30, versus domain fractional bits, 10 to 18; functions: func_atan2_whole, func_chebfun2_ex_f1, func_gamma_p, func_gamma_p_inv, func_gamma_p_inva, func_lbeta, func_normpdf_offset, func_owens_t)

  27. Overall: total BRAMs, error=10^-5 (plot: number of Block RAMs, 0 to 500, versus domain fractional bits, 10 to 18; same set of functions as the previous plot)

  28. Conclusion
  • Black-box 2D function approximation is feasible
    • At least up to around 2^24 in each dimension
  • Major constraint is block RAM usage
    • Most functions only take up a fraction of RAM though
  • Route to hardware is via HLS
    • Generate C code describing tables
  • Ongoing work
    • Optimising the splitting process
    • Improving monomial basis selection
    • Lower-level hardware optimisations

  29. Depopulated polynomials (figure: the Horner dataflow with the upper coefficients removed, keeping only terms with total degree at most 3: c00, c10, c20, c30, c01, c11, c21, c02, c12, c03)
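
A minimal sketch of evaluating such a depopulated patch, keeping only the coefficients visible in the diagram (total degree at most 3); that this is the exact depopulation scheme used in the talk is an assumption based on those labels.

  // Depopulated 2D polynomial: only terms with i + j <= 3, so rows in x get
  // shorter as the power of y grows, saving multipliers versus the full grid
  static double eval_depopulated(double c00, double c10, double c20, double c30,
                                 double c01, double c11, double c21,
                                 double c02, double c12,
                                 double c03,
                                 double x, double y) {
      double q0 = ((c30 * x + c20) * x + c10) * x + c00;  // degree 3 in x
      double q1 = (c21 * x + c11) * x + c01;              // degree 2 in x
      double q2 = c12 * x + c02;                          // degree 1 in x
      double q3 = c03;                                    // constant
      return ((q3 * y + q2) * y + q1) * y + q0;           // Horner in y
  }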
