Efficient implementation of polynomial arithmetic in a multiple-level programming environment
Xin Li & Marc Moreno Maza
September 2, 2006, ICMS 2006, Castro Urdiales, Spain
Motivations (I) Generic Code
Need for generic code for polynomial arithmetic:
- Suited to generic programming.
- Code reuse, maintainability.
- Reasonable speed after compiler optimization.
We use AXIOM and ALDOR:
- Parametric polymorphism.
- Dependent types.
Motivations (II) Specialized Code
- “Particular” domains deserve their own specialized implementations.
- More efficient.
[Figure: time (s) vs. polynomial degree, generic vs. specialized.]
Univariate generic F.E.E.A. vs. specialized F.E.E.A. modulo a 27-bit prime.
Motivations (III) Fast Algorithms
- We are interested in fast polynomial arithmetic.
[Figure: time (s) vs. polynomial degree, standard Euclidean vs. fast extended Euclidean.]
Univariate fast E.E.A. vs. quadratic E.E.A. (both generic code) modulo a 27-bit prime.
Motivations (IV) Implementation Issue
- Focus on implementation issues in AXIOM.
- Open AXIOM has a multiple-level language construction:
  SPAD → Lisp → C → assembly code.
- Mixing code at each level yields high performance.
The SPAD level
In AXIOM at the SPAD level:
- A two-level type system: categories and domains. E.g. Ring is an AXIOM category.
- The domain SUP(R), where R has type Ring, implements UnivariatePolynomialCategory(R) with a sparse data representation.
- The domain SMP(R,V), where R has type Ring and V has type VariableSet, implements RecursivePolynomialCategory(R, V) with a recursive sparse data representation.
Our implementation, to better support dense algorithms:
- The domain DUP(R) implements the same category as SUP(R) does. The data representation is dense.
- The domain DRMP(R,V) implements the same category as SMP(R,V) does. The data representation is recursive dense.
- The benefits of dense domains for dense algorithms are shown in later slides.
The GNU Common Lisp (GCL) level
- Lisp is a symbolic-expression-oriented language.
- To speed up GCL code:
  - use its statically typed features,
  - suppress run-time type checking,
  - choose well-adapted data structures.
Example-1: Array access in GCL
SPAD code: array1.i := array2.i
Compiled Lisp code: (SETELT |array1| |i| (ELT |array2| |i|))
Hand-written Lisp code: (setf (aref array1 i) (aref array2 i))
GCL kernel C code:
case t_vector:
    if (index >= seq->v.v_fillp) goto E;
    return(aref(seq, index));
Without array-bound checking, “aref” is slightly faster.
Example-2: Vector-based recursive dense polynomials
- Implement the Z/pZ[x1, . . . , xn] domain using the vector-based representation proposed by R. Fateman.
- Each polynomial is encoded by a vector; each slot is a pointer to another polynomial or to a number.
- At the SPAD level, a Union type expresses this disjunction.
- At the Lisp level, dynamic typing handles it directly.
The SPAD Union type is a “cons”: the “cdr” keeps the type info, the “car” keeps the value.
struct cons {
    FIRSTWORD;
    object c_cdr;
    object c_car;
};
Benchmark (I)
- MMA – Multivariate Modular Arithmetic domain.
- MMA(p,V) implements the same operations as DRMP(PF(p),V).
- Benchmark: van Hoeij and Monagan Modular GCD algorithm [1].
[Diagram: modular GCD scheme.]
Input: f1, f2 ∈ Q(a1, . . . , ae)[y]. Output: GCD(f1, f2).
For primes p1, . . . , pm, reduce to f1, f2 ∈ Z/piZ(a1, . . . , ae)[y]; compute gi ∈ (Z/piZ)(a1, . . . , ae)[y] by the Euclidean algorithm mod pi; then recover GCD(f1, f2) by Chinese remaindering plus rational reconstruction (compared against the direct method).
[1] M. van Hoeij and M. Monagan. A Modular GCD Algorithm over Number Fields Presented with Multiple Extensions. Proceedings of ISSAC ’02, 109–116, 2002.
Benchmarks (II)
In AXIOM we implemented/used the following pairs of domains, for Q(a1, a2, . . . , ae) and K[y] respectively:
- NSMP in SPAD with SUP in SPAD
- DRMP in SPAD with DUP in SPAD
- MMA in Lisp with SUP in SPAD
- MMA in Lisp with DUP in SPAD
Benchmarks (III)
Timing:
[Figure: time (s) vs. coefficient bound for UP(NSMP(Q,[v,w,x,y,z])), DUP(DMP(Q,[v,w,x,y,z])), UP(MMA(Q,[v,w,x,y,z])), DUP(MMA(Q,[v,w,x,y,z])).]
Benchmarks (IV)
Memory consumption:
[Figures: resident set size (KB) and VM size (KB) vs. coefficient bound for UP(NSMP), DUP(DMP), UP(MMA), DUP(MMA).]
The C level
We go to C level:
- Implementing efficiency-critical operations requires direct access to machine arithmetic and little symbolic-expression manipulation.
- GCL is written in C, so we can extend the GCL kernel at the C level.
- Direct use of C/C++ or assembly-based libraries: NTL, GMP, etc.
Example-1: Modular integer reduction on special Fourier primes.
- p = c·2^k + 1, with c, k ∈ Z+, c odd, p prime.
- Input: Z ∈ Z, where Z ≤ (p − 1)^2.
- Output: r′ := (c·Z rem p) − δ, where δ < (c + 1)·p.
  - Consider c·Z = q·p + r.
  - Define q′ := ⌊Z / 2^k⌋.
  - Define r′ := c·(Z − q′·2^k) − q′.
  - We have r′ = r − δ.
- This algorithm only needs shifts and adds/subs:
  output = c * ((long)Z & MASK_k) - (long)(Z >> k);
Example-2: FFT on special Fourier primes.
- Special Fourier Prime modular reduction used in FFT.
- Output: r′ = (c·Z rem p) − δ.
  - Cancel c by its inverse.
  - Control the δs in the middle stages.
  - Remove the δs at the end.
- It is more efficient than the Montgomery trick and the floating-point trick.
- Joint work with Prof. Éric Schost.
Benchmark of FFT
[Figure: time (s) vs. degree, NTL-5.4 1-prime FFT vs. our 1-prime FFT.]
FFT-based univariate polynomial multiplication over Z/pZ, where p is a 28-bit prime.
- In the figure above, both our FFT and NTL's are 1-prime FFTs (using zz_p::FFTInit(i) in NTL).
- Our 3-prime FFT vs. NTL's 3-prime FFT shows the same speed-up ratio as in the figure (using zz_p::init(p) in NTL).
Benchmark of FFT
[Table: cache profile of NTL-FFT vs. our FFT.]
Cache setting:
- desc: I1 cache: 16384 B, 32 B, 8-way associative.
- desc: D1 cache: 65536 B, 64 B, 2-way associative.
- desc: L2 cache: 1048576 B, 64 B, 8-way associative.
The ASSEMBLY level
- To exploit specific hardware features or reuse existing code.
[Figures: time (s) vs. degree, FPU vs. SSE2; time (s) vs. degree, SUP vs. UMA.]
Conclusion and future work
- Through a few examples, we showed that properly mixing code at the different AXIOM language levels can achieve better performance.
- Familiarity with the strengths/weaknesses of the language tools and their optimization techniques is essential for high performance.
- We hope to construct automatic performance-tuning packages for fast polynomial arithmetic.