Straightforward parallelization of polynomial multiplication (PowerPoint presentation)


SLIDE 1

1/18 Straightforward parallelization of polynomial multiplication using parallel collections in Scala Raphaël Jolly Databeans EOOPS 2013 Barcelona

SLIDE 2

2/18 Parallelization of symbolic computations

* Numeric computations
  - Several arithmetic operations executed in parallel
  - Linear algebra
  - CPU-intensive
* Symbolic computations: polynomials
  - Same as above, and: the arithmetic operation itself is parallelized
  - Multiplication, division, gcd
  - Reduction, Gröbner bases (multivariate)
  - CPU- and memory-intensive (cache issues)

SLIDE 3

3/18 Polynomial multiplication

* Multivariate polynomials
* Distributive representation
* Product
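The slide's formulas did not survive as text; a standard reconstruction of the distributive representation and its product, consistent with the code on later slides (exponent vectors add via `s * m`, coefficients multiply via `a * c`):

```latex
x = \sum_{i=1}^{n} a_i \, X^{\alpha_i}, \qquad
y = \sum_{j=1}^{m} b_j \, X^{\beta_j}, \qquad
x \cdot y = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j \, X^{\alpha_i + \beta_j}
```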

SLIDE 4

4/18 Polynomial multiplication: sequential

[Diagram: x = x_1 + ... + x_n is multiplied by each term of y = y_1 + ... + y_m in turn, and the partial products x * y_j are accumulated one after another]

SLIDE 5

5/18 Polynomial multiplication: parallel

[Diagram: the partial products x * y_1, ..., x * y_m are computed in parallel and their results combined by addition]

SLIDE 6

6/18 Polynomial multiplication: sequential

type T = List[(Array[N], C)]

def times(x: T, y: T) = (zero /: y) { (l, r) =>
  val (a, b) = r
  l + multiply(x, a, b)
}

SLIDE 7

6/18 Polynomial multiplication: sequential

type T = List[(Array[N], C)]

def times(x: T, y: T) = y.foldLeft(zero)({ (l, r) =>
  val (a, b) = r
  l + multiply(x, a, b)
})

SLIDE 8

6/18 Polynomial multiplication: sequential

type T = List[(Array[N], C)]

def times(x: T, y: T) = y.foldLeft(zero)({ (l, r) =>
  val (a, b) = r
  l + multiply(x, a, b)
})

def multiply(x: T, m: Array[N], c: C) = x.map { r =>
  val (s, a) = r
  (s * m, a * c)
} filter { r =>
  val (_, a) = r
  !a.isZero
}
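The slide's code relies on scas's term and coefficient types. A self-contained sketch of the same fold-over-terms product, with hypothetical stand-ins: `List[Int]` exponent vectors for `Array[N]`, `BigInt` for `C`, and an explicit `plus` for the `+` on polynomials (the merge logic here is illustrative, not scas's):

```scala
object SeqMul {
  type T = List[(List[Int], BigInt)]  // distributive representation: (exponent vector, coefficient)

  val zero: T = Nil

  // Add two polynomials, merging terms with equal exponent vectors
  // and dropping terms whose coefficient cancels to zero.
  def plus(x: T, y: T): T =
    (x ++ y).groupBy(_._1).toList
      .map { case (m, ts) => (m, ts.map(_._2).sum) }
      .filter { case (_, a) => a != BigInt(0) }

  // Multiply every term of x by the single term (m, c):
  // exponent vectors add, coefficients multiply.
  def multiply(x: T, m: List[Int], c: BigInt): T =
    x.map { case (s, a) => (s.zip(m).map { case (i, j) => i + j }, a * c) }
      .filter { case (_, a) => a != BigInt(0) }

  // Fold the terms of y over x, as on the slide.
  def times(x: T, y: T): T =
    y.foldLeft(zero) { (l, r) =>
      val (a, b) = r
      plus(l, multiply(x, a, b))
    }
}
```

For example, squaring p = 1 + x (terms `(List(0), 1)` and `(List(1), 1)`) yields 1 + 2x + x^2.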

SLIDE 9

7/18 Polynomial multiplication: parallel

type T = List[(Array[N], C)]

def times(x: T, y: T) = y.par.aggregate(zero)({ (l, r) =>
  val (a, b) = r
  l + multiply(x, a, b)
}, _ + _)

def multiply(x: T, m: Array[N], c: C) = x.map { r =>
  val (s, a) = r
  (s * m, a * c)
} filter { r =>
  val (_, a) = r
  !a.isZero
}

SLIDE 10

7/18 Polynomial multiplication: parallel

type T = List[(Array[N], C)]

def times(x: T, y: T) = y.par.aggregate(zero)({ (l, r) =>
  val (a, b) = r
  l + multiply(x, a, b)
}, _ + _)

def multiply(x: T, m: Array[N], c: C) = x.par.map { r =>
  val (s, a) = r
  (s * m, a * c)
} filter { r =>
  val (_, a) = r
  !a.isZero
}
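`aggregate` takes two functions because a parallel fold is really split-fold-combine: chunks of y are folded with the first function and the partial results merged with `_ + _`. A sequential stdlib sketch of that contract (explicit chunking is a stand-in for the framework's scheduling; on Scala 2.13+ the real `.par` lives in the separate scala-parallel-collections module):

```scala
object AggregateShape {
  // Mirrors aggregate(z)(seqop, combop): split the input into chunks,
  // fold each chunk with seqop (one "task"), then merge with combop.
  def aggregateLike[A, B](xs: List[A], chunkSize: Int)(z: B)(
      seqop: (B, A) => B, combop: (B, B) => B): B =
    xs.grouped(chunkSize).toList
      .map(chunk => chunk.foldLeft(z)(seqop)) // each chunk folded independently
      .reduceOption(combop)
      .getOrElse(z)
}
```

For this to equal the sequential fold, `combop` must be associative and `z` its neutral element, which polynomial addition with zero satisfies.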

SLIDE 11

8/18 Experimental setup

* Intel Atom D410 at 1.66 GHz with ((32K, 24K), 512K) cache
* Single core, hyper-threading
* Parallel timings should not be worse than sequential
* They may eventually be better (20%)
* Further experiments need to be done on multicore hardware

SLIDE 12

9/18 Experimental setup

[Diagram: hyper-threading vs. dual-processor. With hyper-threading, two logical processors share one set of caches and ALUs but keep separate architectural states (registers); in a dual-processor system, each physical processor has its own caches, ALUs and architectural state. Both connect to main memory over the system bus.]

(Chen et al. Media Applications on Hyper-Threading Technology - Intel Technology Journal, Q1, 2002)

SLIDE 13

10/18 Test case

Squaring a sparse polynomial q = p^n, with p = 1 + x + y + z and n sufficiently large: (Fateman, R. J. DRAFT: Comparing the speed of programs for sparse polynomial multiplication, 2002)

SLIDE 14

11/18 Test case: implementation

import scas._
import Implicits.ZZ

implicit val r = Polynomial(ZZ, 'x, 'y, 'z)
val Array(x, y, z) = r.generators
val p = 1 + x + y + z
val q = pow(p, 20)
val q1 = 1 + q
val q2 = q * q1

SLIDE 15

12/18 Timings

n  | seq | par(2) | speedup
---|-----|--------|--------
20 |  10 |      7 |    1.38
24 |  27 |     19 |    1.37
28 |  63 |     48 |    1.32
32 | 139 |    109 |    1.27

[Chart: timings in seconds vs. n, for seq and par(2)]

SLIDE 16

13/18 Fine-grained and exponential task splitting

"stolen tasks are divided into exponentially smaller tasks until a threshold is reached and then handled sequentially starting from the smallest one, while tasks that came from the processor's own queue are handled sequentially straight away" (Prokopec, A.; Bagwell, P.; Rompf, T. & Odersky, M. On a Generic Parallel Collection Framework, 2011)
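A toy illustration of the quoted strategy (not the framework's actual code): a stolen range of size n is cut into exponentially smaller pieces until the threshold is reached, and the worker then processes them starting from the smallest:

```scala
object ExpSplit {
  // Sizes of the pieces a stolen task of size n is divided into:
  // roughly n/2, n/4, ... down to the threshold; the pieces sum back to n.
  def split(n: Int, threshold: Int): List[Int] =
    if (n <= threshold) List(n)
    else (n - n / 2) :: split(n / 2, threshold)

  // The worker handles the smallest piece first.
  def processingOrder(n: Int, threshold: Int): List[Int] =
    split(n, threshold).reverse
}
```

Handling the smallest piece first keeps the largest pieces available for further stealing by idle workers.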

SLIDE 17

14/18 Collection base classes hierarchy

Traversable
  Iterable
    Set
    Map
    Seq

SLIDE 18

14/18 Collection base classes hierarchy

Traversable
  Iterable
    Set
    Map
    Seq

Collection
  Set
  List
Map

SLIDE 19

15/18 Traversable[A]

def map[B, That](f: A => B): That
def flatMap[B, That](f: A => GenTraversableOnce[B]): That
def filter(p: A => Boolean): Traversable[A]
def foreach[U](f: A => U): Unit
def forall(p: A => Boolean): Boolean
def exists(p: A => Boolean): Boolean
def count(p: A => Boolean): Int
def reduce[A1 >: A](op: (A1, A1) => A1): A1
def aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B
def sum[B >: A](implicit num: Numeric[B]): B
def product[B >: A](implicit num: Numeric[B]): B
def min[B >: A](implicit cmp: Ordering[B]): A
def max[B >: A](implicit cmp: Ordering[B]): A
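A few of these operations in action on an ordinary List, the same map/filter/fold vocabulary the polynomial code uses:

```scala
val xs = List(1, 2, 3, 4, 5)

val squares = xs.map(n => n * n)      // transform each element
val evens   = xs.filter(_ % 2 == 0)   // keep elements matching a predicate
val total   = xs.reduce(_ + _)        // combine all elements pairwise
val allPos  = xs.forall(_ > 0)        // do all elements satisfy the predicate?
val nOdd    = xs.count(_ % 2 == 1)    // how many elements satisfy it?
```

Because the same methods exist on parallel collections, a pipeline like this parallelizes by swapping `xs` for `xs.par`.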

SLIDE 20

16/18 Other data structures (n = 20)

[Table and chart: timings in seconds at n = 20 for the structures tree, tree.mutable, list, array and stream, comparing seq, par(2) and par(1); the reported values are 17, 9, 8, 10, 7, 17, 12, 19, 40 and 48, but their column alignment is not recoverable]

SLIDE 21

17/18 Data parallelism

[Diagram, as on slide 5: the partial products x * y_1, ..., x * y_m are computed in parallel and their results combined by addition]

SLIDE 22

18/18 Task parallelism

[Diagram: each product y_j * x (for y_1, ..., y_m) is computed as a separate task, and the resulting partial products are added together]
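The task-parallel variant spawns one task per term of y. A sketch of that shape with Scala Futures, using plain integers as stand-ins for the polynomial terms (by distributivity, the summed partial products equal (y_1 + ... + y_m) * x):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object TaskPar {
  // One task per term y_j computes the partial product y_j * x;
  // the partial results are then collected and summed.
  def times(x: Int, ys: List[Int]): Int = {
    val tasks = ys.map(yj => Future(yj * x))
    Await.result(Future.sequence(tasks), Duration.Inf).sum
  }
}
```

Unlike the data-parallel version, the split here follows the terms of y rather than the underlying collection's splitter.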

SLIDE 23

18/18 Task parallelism

Thank you! http://github.com/rjolly/scas