efficient implementation of polynomial arithmetic in a
play

Efficient implementation of polynomial arithmetic in a - PowerPoint PPT Presentation

Efficient implementation of polynomial arithmetic in a multiple-level programming environment Xin Li & Marc Moreno Maza September 2, 2006 ICMS2006, Castro Urdiales, SPAIN. 1 Motivations (I) Generic Code Need for generic code of


  1. Efficient implementation of polynomial arithmetic in a multiple-level programming environment Xin Li & Marc Moreno Maza September 2, 2006 ICMS’2006, Castro Urdiales, SPAIN. 1

  2. Motivations (I) Generic Code Need for generic code of polynomial arithmetic. • Suited to generic programming. • Code reuse, maintainability. • Reasonable speed after compiler optimization. We use AXIOM and A LDOR . • Parametric polymorphism. • Dependent type. 2

  3. Motivations (II) Specialized Code • “Particular” domains deserve their own specialized implementations. • More efficient. 0.5 Generic Specialized 0.45 0.4 0.35 0.3 Time [sec] 0.25 0.2 0.15 0.1 0.05 0 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Polynomial Degrees Univariate Generic F.E.E.A vs. Specialized F.E.E.A Modulo a 27-bit Prime. 3

  4. Motivations (III) Fast Algorithms • We are interested in fast polynomial arithmetic. 4 Standard Euclidean Fast Extended Euclidean 3.5 3 2.5 Time [sec] 2 1.5 1 0.5 0 0 500 1000 1500 2000 2500 3000 Polynomial Degree Univariate Fast E.E.A. vs. Quadratic E.E.A. (both generic code.) Modulo a 27-bit Prime. 4

  5. Motivations (IV) Implementation Issue • Focus on implementation issues in AXIOM . • Open AXIOM has a multiple-language level construction. Spad Lisp C Assembly code • Mix code at each level for high performance. 5

  6. The SPAD level In AXIOM at SPAD level: • A two-level type system – categories and domains . E.g. Ring is the AXIOM category. • The domain SUP(R) , where R has Ring type, implements UnivariatePolynomialCategory(R) with sparse data representation. • The domain SMP(R,V) , where R has Ring type, V has VariableSet type, implements RecursivePolynomialCategory(R, V) with recursive sparse data representation. Our implementation for better support dense algorithms: • The domain DUP(R) implements the same category as SUP(R) does. The data representation is dense. • The domain DRMP(R,V) implements the same category as SMP(R,V) does. The data representation is recursive dense. • The benefit of dense domains for dense algorithms in later slides. 6

  7. The GNU Common Lisp ( GCL ) level • Lisp is a symbolic-expression-oriented language. • To speed up the performance of GCL code. • Using statically typed feature. • Suppressing run-time type checking. • Choosing adapted data structures. 7

  8. Example-1: Array access in GCL SPAD code: array1.i := array2.i (SETELT | array1 | | i | (ELT | array2 | | i | )) Compiled Lisp code: Hand-written Lisp code: (setf ( aref array1 i ) ( aref array2 i )) case t_vector: if (index >= seq->v.v_fillp) goto E; return(aref(seq, index)); Without array bound checking, “aref” is slightly faster. 8

  9. Example-2: Vector-based recursive dense polynomials • Implement Z/pZ [ x 1 , . . . , x n ] domain using the vector-based representation proposed by R. Fateman. • Each polynomial encoded by a vector, each slot is pointer to another polynomial or a number. • At SPAD level using Union type for this disjunction. • At Lisp level using the property of dynamic typing. Spad Union type is a “cons”. “cdr” keeps the type info. “car” keeps the value. struct cons { FIRSTWORD; object c_cdr; object c_car; }; 9

  10. Benchmark (I) • MMA – Multivariate Modular Arithmetic domain. • MMA(p,V) implements the same operations as DRMP(PF(p),V) . • Benchmark: van Hoeij and Monagan Modular GCD algorithm [1]. Input: f 1 , f 2 ∈ Q ( a 1 , . . . , a e )[ y ] Output: GCD ( f 1 , f 2) Modulo several primes f 1 , f 2 ∈ Q ( a 1 , . . . , a e )[ y ] ... f 1 , f 2 ∈ Z /p i Z ( a 1 , ..., a e )[ y ] p 1 , . . . , p m Euclidean ... algorithm mod p Direct method Chinese remaindering GCD ( f 1 , f 2) g i ∈ ( Z /p i Z )( a 1 , . . . , a e )[ y ] + Rational reconstruction [1] A Modular GCD algorithm over Number Fields presented with Multiple Extensions.van Hoeij, Michael Monagan. ISSAC’02 proceedings, 109-116, (2002). 10

  11. Benchmarks (II) In AXIOM we implemented/used: Q ( a 1 , a 2 , . . . , a e ) K [ y ] NSMP in SPAD SUP in SPAD DMPR in SPAD DUP in SPAD MMA in L ISP SUP in SPAD MMA in L ISP DUP in SPAD 11

  12. Benchmarks (III) Timing: 140 UP(NSMP(Q,[v,w,x,y,z])) DUP(DMP(Q,[v,w,x,y,z])) UP(MMA(Q,[v,w,x,y,z])) 120 DUP(MMA(Q,[v,w,x,y,z])) 100 80 Time [sec] 60 40 20 0 0 5 10 15 20 25 30 Coefficients bound 12

  13. Benchmarks (IV) Memory Consumption: 300000 400000 UP(NSMP) UP(NSMP) DUP(DMP) DUP(DMP) UP(MMA) UP(MMA) 350000 DUP(MMA) DUP(MMA) 250000 300000 Resident Set Size [KB] 200000 VM Size [KB] 250000 200000 150000 150000 100000 100000 50000 50000 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Coefficient Bound Coefficient Bound 13

  14. The C level We go to C level: • Implementing efficiency-critical operations requires direct access to machine arithmetic and no much symbolic expression manipulation. • GCL is written in C , we can extend the GCL kernel at C level. • Direct use of C \ C++ or Assembly based libraries: NTL , GMP , .. .. 14

  15. Example-1: Modular integer reduction on special Fourier primes. • p = c · 2 k + 1 , c, k ∈ Z + , ( c, 2) = 1 , p is a prime; • Input: Z , p ∈ Z , where Z ≤ ( p − 1) 2 . Output: r ′ := ( c · Z rem p ) − δ , where δ < ( c + 1) · p . – Consider c · Z = q · p + r . – Define q ′ := ⌊ Z 2 k ⌋ . – Define r ′ := c · ( Z − q ′ · 2 k ) − q ′ . – We have r ′ = r − δ . • This algorithm only needs “shifts” and “adds/subs”. output = c * ((long)Z)&MASK_k - (long)(Z >> k); 15

  16. Example-2: FFT on special Fourier primes. • Special Fourier Prime modular reduction used in FFT . – Output: r ′ = ( c · Z rem p ) − δ . – Cancel c by its inverse. – Control δ s in middle stages. – Remove δ s at the end. • It’s more efficient than Montgomery trick and floating point trick. • Joint work with prof. ´ Eric schost. 16

  17. Benchmark of FFT 5 NTL-5.4 1-prime FFT Our 1-prime FFT 4.5 4 3.5 3 Time [sec] 2.5 2 1.5 1 0.5 0 0 200000 400000 600000 800000 1e+06 1.2e+06 1.4e+06 1.6e+06 1.8e+06 2e+06 Degree FFT -based uni-poly-mul over Z /p Z , p is a 28-bit prime. • In above figure, both our and NTL’s FFT are 1-prime FFT. (Using zz_p ::FFTInit( i ) in NTL.) • Our 3-prime FFT vs. NTL’s 3-prime FFT has the same speed-up-ratio as in the Figure. (Using zz_p ::init( p ) in NTL.) 17

  18. Benchmark of FFT NTL-FFT OUR-FFT Cache setting: • desc: I1 cache: 16384 B, 32 B, 8-way associative. • desc: D1 cache: 65536 B, 64 B, 2-way associative. • desc: L2 cache: 1048576 B, 64 B, 8-way associative. 18

  19. The A SSEMBLY level • For using specific hardware features or existing code. 90 0.1 SUP FPU UMA SSE2 80 0.08 70 60 0.06 Time [sec] Time [sec] 50 40 0.04 30 0.02 20 10 0 0 1,000 2,000 4,000 8,000 16,000 32,000 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Degree Degree 19

  20. Conclusion and future-work. • By a few examples, we show that properly mixing code at each AXIOM language level may achieve better performance. • Familiar with strength/weakness of language tools and their optimizer techniques are essential for high performance. • We hope to construct automatic performance-tuning packages for fast polynomial arithmetic in near future. * All code can be downloaded from http://www.csd.uwo.ca/ ∼ xli96 20

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend