SLIDE 1

Efficient implementation of polynomial arithmetic in a multiple-level programming environment

Xin Li & Marc Moreno Maza. September 2, 2006. ICMS 2006, Castro Urdiales, Spain.

SLIDE 2

Motivations (I): Generic Code

Need for generic code for polynomial arithmetic.

  • Suited to generic programming.
  • Code reuse, maintainability.
  • Reasonable speed after compiler optimization.

We use AXIOM and ALDOR.

  • Parametric polymorphism.
  • Dependent types.

SLIDE 3

Motivations (II): Specialized Code

  • “Particular” domains deserve their own specialized implementations.
  • More efficient.

[Figure: running time (seconds) vs. polynomial degree, generic vs. specialized code]

Univariate generic F.E.E.A. vs. specialized F.E.E.A. modulo a 27-bit prime.

SLIDE 4

Motivations (III): Fast Algorithms

  • We are interested in fast polynomial arithmetic.

[Figure: running time (seconds) vs. polynomial degree, standard Euclidean vs. fast extended Euclidean]

Univariate fast E.E.A. vs. quadratic E.E.A. (both generic code), modulo a 27-bit prime.

SLIDE 5

Motivations (IV): Implementation Issues

  • Focus on implementation issues in AXIOM.
  • AXIOM is open and has a multiple-level language construction:

    Spad → Lisp → C → assembly code

  • Mix code at each level for high performance.

SLIDE 6

The SPAD level

In AXIOM at the SPAD level:

  • A two-level type system: categories and domains. E.g. Ring is an AXIOM category.
  • The domain SUP(R), where R has Ring type, implements UnivariatePolynomialCategory(R) with a sparse data representation.
  • The domain SMP(R,V), where R has Ring type and V has VariableSet type, implements RecursivePolynomialCategory(R, V) with a recursive sparse data representation.

Our implementations, for better support of dense algorithms:

  • The domain DUP(R) implements the same category as SUP(R) does. The data representation is dense.
  • The domain DRMP(R,V) implements the same category as SMP(R,V) does. The data representation is recursive dense.
  • The benefits of dense domains for dense algorithms appear in later slides.

SLIDE 7

The GNU Common Lisp (GCL) level

  • Lisp is a symbolic-expression-oriented language.
  • To speed up GCL code we rely on:
      – static type declarations,
      – suppressed run-time type checking,
      – adapted data structures.

SLIDE 8

Example-1: Array access in GCL

SPAD code:

    array1.i := array2.i

Compiled Lisp code:

    (SETELT |array1| |i| (ELT |array2| |i|))

Hand-written Lisp code:

    (setf (aref array1 i) (aref array2 i))

In the GCL kernel, the checked access goes through:

    case t_vector:
        if (index >= seq->v.v_fillp)
            goto E;
        return(aref(seq, index));

Without array bound checking, “aref” is slightly faster.

SLIDE 9

Example-2: Vector-based recursive dense polynomials

  • Implement the Z/pZ[x1, . . . , xn] domain using the vector-based representation proposed by R. Fateman.
  • Each polynomial is encoded by a vector; each slot is a pointer to another polynomial or a number.
  • At the SPAD level, a Union type expresses this disjunction.
  • At the Lisp level, dynamic typing gives it for free.

A SPAD Union value is a “cons”: “cdr” keeps the type info, “car” keeps the value.

    struct cons {
        FIRSTWORD;
        object c_cdr;
        object c_car;
    };

SLIDE 10

Benchmarks (I)

  • MMA – Multivariate Modular Arithmetic domain.
  • MMA(p,V) implements the same operations as DRMP(PF(p),V).
  • Benchmark: the modular GCD algorithm of van Hoeij and Monagan [1].

Modulo several primes:

    Input:  f1, f2 ∈ Q(a1, . . . , ae)[y]
    Output: GCD(f1, f2)

  • Reduce f1, f2 modulo primes p1, . . . , pm, giving images in (Z/piZ)(a1, . . . , ae)[y].
  • Run the Euclidean algorithm mod each pi, obtaining gi ∈ (Z/piZ)(a1, . . . , ae)[y].
  • Recover GCD(f1, f2) by Chinese remaindering + rational reconstruction.
  • Compare against the direct method over Q.

[1] M. van Hoeij and M. Monagan. A Modular GCD Algorithm over Number Fields Presented with Multiple Extensions. In Proc. ISSAC ’02, pages 109–116, 2002.

SLIDE 11

Benchmarks (II)

In AXIOM we implemented/used the following combinations:

    Q(a1, a2, . . . , ae)    K[y]
    NSMP in SPAD             SUP in SPAD
    DRMP in SPAD             DUP in SPAD
    MMA in Lisp              SUP in SPAD
    MMA in Lisp              DUP in SPAD

SLIDE 12

Benchmarks (III)

Timing:

[Figure: running time (seconds) vs. coefficient bound, for UP(NSMP(Q,[v,w,x,y,z])), DUP(DMP(Q,[v,w,x,y,z])), UP(MMA(Q,[v,w,x,y,z])), DUP(MMA(Q,[v,w,x,y,z]))]

SLIDE 13

Benchmarks (IV)

Memory Consumption:

[Figures: resident set size (KB) and virtual memory size (KB) vs. coefficient bound, for UP(NSMP), DUP(DMP), UP(MMA), DUP(MMA)]

SLIDE 14

The C level

We go to the C level because:

  • Implementing efficiency-critical operations requires direct access to machine arithmetic and involves little symbolic-expression manipulation.
  • GCL is written in C, so we can extend the GCL kernel at the C level.
  • It allows direct use of C/C++ or assembly-based libraries: NTL, GMP, ...

SLIDE 15

Example-1: Modular integer reduction on special Fourier primes

  • p = c·2^k + 1 is prime, with c, k ∈ Z+ and (c, 2) = 1.
  • Input: Z, p ∈ Z, where Z ≤ (p − 1)^2.
  • Output: r′ := (c·Z rem p) − δ, where δ is a multiple of p with δ < (c + 1)·p:
      – Consider c·Z = q·p + r.
      – Define q′ := ⌊Z / 2^k⌋.
      – Define r′ := c·(Z − q′·2^k) − q′.
      – Then r′ ≡ c·Z (mod p), i.e. r′ = r − δ.
  • This algorithm only needs “shifts” and “adds/subs” (multiplication by the small constant c reduces to shifts and adds).
  • In C: output = c * (Z & MASK_k) - (Z >> k);

SLIDE 16

Example-2: FFT on special Fourier primes

  • The special-Fourier-prime modular reduction is used inside the FFT:
      – Each output has the form r′ = (c·Z rem p) − δ.
      – Cancel the factor c by its inverse.
      – Control the δ terms in the middle stages.
      – Remove the δ terms at the end.
  • It is more efficient than the Montgomery trick and the floating-point trick.
  • Joint work with Prof. Éric Schost.

SLIDE 17

Benchmark of FFT

[Figure: running time (seconds) vs. degree, NTL-5.4 1-prime FFT vs. our 1-prime FFT]

FFT-based univariate polynomial multiplication over Z/pZ, where p is a 28-bit prime.

  • In the figure above, both our FFT and NTL’s are 1-prime FFTs (using zz_p::FFTInit(i) in NTL).
  • Our 3-prime FFT vs. NTL’s 3-prime FFT shows the same speed-up ratio as in the figure (using zz_p::init(p) in NTL).

SLIDE 18

Benchmark of FFT

[Figures: cache-profiling results for NTL-FFT and our FFT]

Cache setting:

  • I1 cache: 16384 B, 32 B lines, 8-way associative.
  • D1 cache: 65536 B, 64 B lines, 2-way associative.
  • L2 cache: 1048576 B, 64 B lines, 8-way associative.

SLIDE 19

The ASSEMBLY level

  • To use specific hardware features or existing code.

[Figure: running time (seconds) vs. degree, FPU vs. SSE2]

[Figure: running time (seconds) vs. degree, SUP vs. UMA]

SLIDE 20

Conclusion and future work

  • Through a few examples, we have shown that properly mixing code at the different AXIOM language levels can achieve better performance.
  • Familiarity with the strengths/weaknesses of the language tools and their optimizers is essential for high performance.
  • We hope to construct automatic performance-tuning packages for fast polynomial arithmetic in the near future.

* All code can be downloaded from http://www.csd.uwo.ca/∼xli96
