
The Three Beaars

Basically, Every Array Allocation Reduces Speed

Robert Bernecky

Snake Island Research Inc
18 Fifth Street, Ward's Island
Toronto, Canada
tel: +1 416 203 0854
bernecky@snakeisland.com

October 10, 2012


Abstract

Functional array language compiler and interpreter designers try to reduce the number of arrays created during application execution, because the negative impact of arrays on performance is so dramatic. Just as The Three Bears had different requirements for their own satisfaction, so do differing array shapes have different requirements for their elimination. The problem itself is a bear: scalar operations are the baby bear, typified here by dynamic programming and the Floyd-Warshall algorithm; operations on small arrays, such as numerically intense computations on complex arrays, are the mama bear; operations on large arrays, typified by acoustic signal processing, are the papa bear. We compare interpreted to compiled APL performance for several applications with different array shapes, and give an overview of the various optimizations that enable those speedups, in both serial and parallel contexts.

The APEX-SaC tool chain

◮ APEX: APL-to-SaC compiler (R. Bernecky)
◮ SaC: SaC-to-C compiler (S.B. Scholz)
◮ APEX & SaC preserve arrays throughout compilation
◮ SaC is a purely functional compiler
◮ SaC represents control structures as functions
◮ These characteristics are a mixed blessing...

Why are arrays expensive?

◮ Consider the cost to perform Z←X+Y:
◮ (Interpreter) Parse code to find expression: 200 ops
◮ Increment reference counts on X,Y: 50 ops
◮ Conformance checks (type, rank, shape) for addition: 200 ops
◮ Allocate temp for result from heap: 200 ops
◮ Initialize temp: 100 ops
◮ Perform actual additions: 200 ops
◮ Decrement reference counts on X,Y: 50 ops
◮ Deallocate old Z, if any: 100 ops
◮ Assign Z←temp: 50 ops
◮ TOTAL: 1150 ops
◮ vs. compiled scalar code: 10 ops (see the sketch below)
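For contrast, a minimal C sketch of what the compiled scalar case reduces to (illustrative only, not actual generated code): a couple of loads, an add, and a store, with no parsing, reference counting, conformance checking, or heap traffic.

/* Z = X + Y on scalars, as a compiler sees it: roughly a load,
 * an add, and a store -- on the order of 10 machine ops total. */
double scalar_add(double x, double y)
{
    return x + y;
}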

Baby Bear - Eliminating Scalar Arrays in a Compiler

◮ Use classical static data flow analysis to find scalars
◮ Traditional optimization methods: CSE, VP, CP, etc.
◮ Allocate scalars on stack, instead of heap (see the sketch below)
◮ Generate scalar-specific code
◮ This is challenging to do in an interpreter
◮ Experimental platform: AMD 1075T 6-core CPU, 3.2GHz (cheap ASUS M4A88T-M desktop machine)
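A hypothetical before/after sketch of the stack-allocation point (not APEX or SaC internals; malloc/free stand in for an interpreter's array allocator):

#include <stdlib.h>

/* Before: even a one-element temp lives on the heap, as an
 * interpreter's array would. */
double add_heap_temp(double x, double y)
{
    double *t = malloc(sizeof *t);   /* heap-allocated 1-element "array" */
    *t = x + y;
    double r = *t;
    free(t);                         /* deallocate the temp */
    return r;
}

/* After: static data flow analysis has proven the temp is a scalar,
 * so it becomes a stack/register value with no heap traffic. */
double add_stack_temp(double x, double y)
{
    double t = x + y;
    return t;
}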

Baby Bear Problem: Floyd-Warshall Algorithm

z←floyd D;i;j;k
siz←⍳(⍴D)[0]
:For k :In siz
    :For i :In siz
        :For j :In siz
            D[i;j]←D[i;j]⌊D[i;k]+D[k;j]
        :EndFor
    :EndFor
:EndFor

◮ Problem size: 3000x3000 graph
◮ Dyalog APL, J interpreters: one week-ish; APEX/SAC: 103sec
◮ Dynamic programming (string shuffle): >1000X speedup
◮ Lesson: Interpreters dislike scalar-dominated algorithms
◮ Lesson: Compilers are not fussy; Baby bear problem solved! (C sketch below)
◮ But, no parallelism: Adding threads just makes it slower!
◮ What about array-based solutions? It's papa bear time!
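For reference, a C transliteration of the scalar kernel above (a sketch assuming a row-major n x n matrix, not the actual APEX/SaC output); the inner statement compiles to a handful of machine ops, which is why the compiled code wins so decisively:

static double min2(double a, double b) { return a < b ? a : b; }

/* Floyd-Warshall on a row-major n x n distance matrix D,
 * mirroring the APL :For nest above. */
void floyd(double *D, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                D[i*n + j] = min2(D[i*n + j], D[i*n + k] + D[k*n + j]);
}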

Array-based Floyd-Warshall Algorithm

◮ j64-602, from J Essays (CDC STAR APL algorithm variant)

floyd=: 3 : ('for_j. i.#y';'do. y=. y <. j ({"1 +/ {) y';'end.';'y')

◮ SAC: Scholz & Bernecky (Classic matmul variant)

inline int[.,.] floydSbs1(int[.,.] D)
{
    DT  = transpose(D);
    res = with (. <= [i,j] <= .)
            : min( D[i,j], minval( D[i] + DT[j]));
          : modarray(D);
    return( res);
}

◮ A "with-loop" is a nested data-parallel FORALL loop

Array-based Floyd-Warshall Algorithm Speedup

Lesson: Array-based code and optimizers are good for you

Figure: Floyd-Warshall performance. APEX/SAC (18091) vs. J, 2012-07-20;
Floyd-Warshall shortest-path benchmark, 3000x3000 (J Essays algorithm for J).
Elapsed times (J interpreter baseline: 306s; best speedup roughly 40X):

Algorithm       -mt 1   -mt 2   -mt 3   -mt 4   -mt 5   -mt 6
FOR-loop        103s    -       -       -       -       -
sbs/rbe-WLF     50.3s   26.9s   17.9s   14.8s   12.5s   11.5s
sbs/rbe-AWLF    22.2s   14.2s   9.7s    8.9s    8.5s    7.6s

Loop Fusion

◮ Z = V1 + (V2 * V3)

for( i=0; i<n; i++) { tmp[i] = V2[i] * V3[i]; }
for( j=0; j<n; j++) { Z[j]   = V1[j] + tmp[j]; }

◮ Loop fusion transforms this into:

for( j=0; j<n; j++) { Z[j] = V1[j] + (V2[j] * V3[j]); }

◮ Benefit: Array-valued tmp removed (DCR)
◮ Benefit: Reduced memory subsystem traffic
◮ Benefit: Reduced loop overhead
◮ Benefit: Improved parallelism, in some compilers

With-Loop Folding (WLF) and Algebraic With-Loop Folding (AWLF)

◮ WLF (S.B. Scholz) - a generalization of loop fusion
◮ Handles Arrays of Known Shape (AKS) only
◮ AWLF (R. Bernecky)
◮ Handles AKS arrays & Arrays of Known Dimension (AKD)
◮ Acoustic signal processing (delta modulation):

logd←{¯50⌈50⌊50×(DIFF 0,⍵)÷0.01+⍵}
DIFF←{¯1↓⍵-¯1⊖⍵}
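In C terms, WLF/AWLF do to this pipeline what the loop-fusion example above did to a single pair of loops, but across the whole chain: every primitive in logd would otherwise materialize a vector-sized temp. A rough sketch of the fused result (illustrative only; the DIFF boundary handling is simplified):

/* One fused pass over an n-element signal: the catenate, DIFF,
 * scale, divide, and both clamps happen per element, in registers,
 * with no intermediate vectors at all. */
void logd(double *z, const double *x, long n)
{
    for (long i = 0; i < n; i++) {
        double prev = (i == 0) ? 0.0 : x[i-1];  /* the "0," catenate */
        double d    = x[i] - prev;              /* DIFF, no temp vector */
        double v    = 50.0 * d / (0.01 + x[i]);
        if (v >  50.0) v =  50.0;               /* the 50-minimum clamp */
        if (v < -50.0) v = -50.0;               /* the -50-maximum clamp */
        z[i] = v;
    }
}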

WLF/AWLF example: Acoustic Signal Processing

◮ logd on 200E6-element double-precision vector
◮ Sixteen with-loops are folded into two WLs!
◮ WLF/AWLF increase available parallelism

sac2c options   Serial elapsed   Parallel (-mt 6) elapsed   Speedup
APL             7.8s             n/a                        n/a
-nowlf -O3      10.7s            5.5s                       1.9X
-doawlf -O3     3.2s             0.7s                       4.5X
Speedup         3.3X             7.8X                       15X
(bottom row: -doawlf vs. -nowlf; rightmost cell: best parallel vs. -nowlf serial)

With-Loop Scalarization (WLS)

◮ With-Loop Scalarization (C. Grelck, S.B. Scholz, K. Trojahner)
◮ Operates on nested WLs in which the inner loop creates non-scalar cells
◮ WLS merges loop-nest pairs, forming a single WL

A = with ([0] <= iv < [4]) {
      B = with ([0] <= jv < [4])
            genarray( [4], iv[0] + 2 * jv[0]);
    } genarray( [4], B);

◮ WLS transforms this into:

A = with ([0,0] <= iv < [4,4])
      genarray( [4,4], iv[0] + 2 * iv[1]);

◮ Mandatory for good performance: array-valued temps removed

WLF/AWLF/WLS example: Poisson 2-D Relaxation Kernel

From Sven-Bodo Scholz: With-Loop-Folding in SAC.
A good argument for Ken Iverson's mask verb!

z←relax A;m;n
...
m←(⍴A)[0]
n←(⍴A)[1]
B←((1⌽A)+(¯1⌽A)+(1⊖A)+(¯1⊖A))÷4
upperA←(1,n)↑A
lowerA←((m-1),0)↓A
leftA←1 0↓((m-1),1)↑A
rightA←((m-2),1)↑(1,n-1)↓A
innerB←((m-2),n-2)↑1 1↓B
middle←leftA,innerB,rightA
z←upperA⍪middle⍪lowerA
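After AWLF and WLS, relax reduces to a single stencil pass. A C sketch of that fused loop (illustrative, assuming row-major storage; not compiler output): interior points get the four-neighbour average, while the border rows and columns are copied through from A unchanged.

/* One fused pass over an m x n grid: none of the temps B, upperA,
 * lowerA, leftA, rightA, innerB, or middle is ever materialized. */
void relax(double *z, const double *a, int m, int n)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            if (i == 0 || i == m-1 || j == 0 || j == n-1)
                z[i*n + j] = a[i*n + j];            /* border: copy A */
            else
                z[i*n + j] = (a[(i-1)*n + j] + a[(i+1)*n + j] +
                              a[i*n + (j-1)] + a[i*n + (j+1)]) / 4.0;
}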

Poisson 2-D Relaxation: Multi-thread, Various Grid Sizes

◮ AWLF, aided by WLS, folds relax function into 1 loop!
◮ 20K iterations, 250x250 grid: Dyalog APL: CPU time = 47.4s
◮ APEX/SAC 18091, single-thread: CPU time = 3.65s
◮ APEX/SAC 18091: multi-threaded (no source code changes!)


Figure: APEX vs. APL CPU time performance. APEX/SAC (18091) vs. Dyalog APL 13.0,
2012-07-19, 6-core AMD Phenom II X6 1075T, 3.2GHz. Elapsed times:

Problem size, iterations   APL     -mt 1   -mt 2   -mt 3   -mt 4   -mt 5   -mt 6
250x250, 20K               47.4s   3.65s   1.85s   1.29s   1.08s   0.89s   0.83s
500x500, 20K               204s    14.7s   7.5s    5.2s    4.26s   3.48s   3.5s
10Kx10K, 100               418s    42.1s   29.9s   25.9s   23.6s   22.7s   23.8s

Poisson 2-D Relaxation: Memory footprint

◮ Why poor speedup on 10Kx10K test?
◮ Dyalog APL 13.0, 10Kx10K grid: 8.5GB footprint
◮ APEX/SAC 18091, 10Kx10K grid: 3.4GB footprint
◮ Memory subsystem bandwidth: 4464MB/s
◮ Grid is 800MB → 5 writes of grid to/from memory/s (arithmetic below)
◮ Therefore, speedup is eventually memory-limited on cheapo system
◮ Scholz sees linear speedup on 48-core system
◮ Lesson: High memory bandwidth is good for you.
◮ Lesson: Array optimizations are VERY good for you.
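The arithmetic behind the bandwidth bullet, using the quoted figures: 4464 MB/s ÷ 800 MB per grid pass ≈ 5.6, so the memory subsystem can move the whole grid to or from memory only about five times per second. Once the threads together reach that ceiling, adding more cores cannot speed up a computation that must stream the full grid on every iteration.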

Mama Bear Motivation

Why is interpreted APL faster than compiled code for some tests?

[Figure: APL vs. APEX CPU time performance (2012-09-15), higher is better for APEX. Bar chart of speedup (APL/APEX with AWLF), log scale 0.1X-1000X, over roughly fifty AKS benchmarks (buildv, compiota, csbench, fd, floyd, histop, ipdd, logd, poisson, primes, ulam, unirand, the upgrade/downgrade family, waver, and others). APL: Dyalog APL 13.0; SAC: 18221:MODIFIED. On some benchmarks the ratio falls below 1X, i.e., the interpreter wins.]

Figure: APEX vs. APL CPU time performance

Mama Bear Motivation

Some reasons for poor performance of compiled SAC code:

◮ Index vector generation for indexed assign
◮ Shape vector generation for variable result shapes
◮ Generation of small arrays, e.g., complex scalars
◮ No SaC FOR-loop analog to with-loop

Mama Bear - Small Array Scalarization

◮ Replace small arrays by their scalarized form
◮ Optimization: Primitive Function Unrolling (Classic)
◮ Optimization: Index Vector Elimination (IVE) (sacdev) (sketch below)
◮ 2–16X speedup observed
◮ Optimizations: LS, LACSI, LACSO (S.B. Scholz, R. Bernecky)
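A rough C illustration of what Index Vector Elimination buys (a sketch with assumed names, not SaC internals): without IVE, every indexed access materializes a small index-vector array; with IVE, the compiler computes a scalar row-major offset and the index vector never exists.

/* Hypothetical selection helper, standing in for indexing through
 * a materialized index vector. */
static double sel2(const int iv[2], const double *A, int cols)
{
    return A[iv[0] * cols + iv[1]];
}

double demo(const double *A, int cols, int i, int j)
{
    /* Without IVE: a 2-element index vector is built per access. */
    int iv[2] = { i, j };
    double v1 = sel2(iv, A, cols);

    /* With IVE: scalar offset arithmetic; no index vector at all. */
    double v2 = A[i * cols + j];

    return v1 + v2;   /* same values; the second form is much cheaper */
}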

Mama Bear - Small Array Scalarization

◮ Mandelbrot set computation performance
◮ mandelbrot: Uses complex numbers

  int calc( complex z, int maxdepth)
  { ... while( real(z)*real(z) + imag(z)*imag(z) <= 4.0) ...

◮ Complex scalars, under the covers:

  complex z  ↔  double(2) z
  real(z)    ↔  z[0]
  imag(z)    ↔  z[1]

◮ mandelbrot opt: Hand-scalarized - pair of scalars

  int calc( double zr, double zi, int maxdepth)
  { ... while( zr * zr + zi * zi <= 4.0) ...

Mama Bear - Small Array Scalarization

◮ Execution times, with LS, LACSI, LACSO opts enabled/disabled

  Test            Opts   -mt 1     -mt 2    -mt 3    -mt 4    -mt 5    -mt 6
  mandelbrot      off    1508.9s   956.0s   828.7s   676.8s   655.7s   635.2s
  mandelbrot opt  off    71.8s     48.4s    35.2s    28.1s    23.0s    19.8s
  mandelbrot      on     69.9s     46.1s    34.6s    28.1s    23.0s    21.9s
  mandelbrot opt  on     70.7s     46.7s    34.7s    28.2s    22.9s    19.6s

◮ Lesson: No more suffering for being elegant
◮ Well, less suffering for being elegant...

GPU (CUDA) Support Without Suffering

◮ SaC generates CUDA code automatically: -target cuda
◮ Physics experiment

[Figure: LatticeBoltzmann CUDA vs. SaC speedups (8800GT). Speedup vs. problem size (256-1536) for 10, 25, 50, 100, and 200 steps; speedups range roughly 10X-70X.]

Goldilocks - Nested Arrays in APEX/SAC

◮ Nested arrays are alive and living in SAC! (R. Douma)
◮ APL convolution kernel using EACH:

  convn←{fi←⍺ ⋄ (⍳⍴⍵)con¨⊂⍵}
  con←{fi+.×(⍴fi)↑⍺↓⍵}

◮ SAC convolution kernel using EACH:

  nested double[.] NDV;
  nested double NDS;
  pt = trace ++ (filter * 0.0);   NB. No overtake in SAC
  z  = convn(iota(shape(tr)[0]), fi, enclose NDV(pt));
  convn: z = with { ( . <= iv <= .) : con(dc[iv], fi, disclose NDV(tr)); }
             : genarray(shape(dc), 0.0);
  con:   matmul(fi, take(shape(fi), drop([dc], tr)))

◮ Performance is so-so: Optimistic optimizations required

Summary and Future Work

◮ Status:

  Bear   Array size   Optimizers    Serial speedup   Parallel speedup
  Baby   scalars      mature        up to 1300X      none
  Mama   small        developing    up to 20X        enables other opts
  Papa   large        nearly done   up to 10X        2X-50X

◮ All optimizations are critical for getting excellent performance
◮ Array-based algorithms will win, and scale well
◮ Nested arrays: APEX, SAC both require work
◮ Small arrays: Needs scalarized index-vector-to-offset primitive
◮ Small arrays: Perhaps (likely!), additional work will be needed
◮ And, they lived more or less happily ever after! Thank you!
