
The Three Beaars

Basically, Every Array Allocation Reduces Speed

Robert Bernecky

Snake Island Research Inc
18 Fifth Street, Ward's Island
Toronto, Canada
tel: +1 416 203 0854
bernecky@snakeisland.com

October 10, 2012


Abstract

Functional array language compiler and interpreter designers try to reduce the number of arrays created during application execution, because the negative impact of arrays on performance is so dramatic. Just as The Three Bears had different requirements for their own satisfaction, so do differing array shapes have different requirements for their elimination. The problem itself is a bear: scalar operations are the baby bear, typified here by dynamic programming and the Floyd-Warshall algorithm; operations on small arrays, such as numerically intense computations on complex arrays, are the mama bear; operations on large arrays, typified by acoustic signal processing, are the papa bear. We compare interpreted to compiled APL performance for several applications with different array shapes, and give an overview of the various optimizations that enable those speedups, in both serial and parallel contexts.

The APEX-SaC tool chain

◮ APEX: APL-to-SaC compiler (R. Bernecky)
◮ SaC: SaC-to-C compiler (S.B. Scholz)
◮ APEX & SaC preserve arrays throughout compilation
◮ SaC is a purely functional compiler
◮ SaC represents control structures as functions
◮ These characteristics are a mixed blessing...

Why are arrays expensive?

◮ Consider the cost to perform Z←X+Y:
◮ (Interpreter) Parse code to find expression: 200 ops
◮ Increment reference counts on X,Y: 50 ops
◮ Conformance checks (type, rank, shape) for addition: 200 ops
◮ Allocate temp for result from heap: 200 ops
◮ Initialize temp: 100 ops
◮ Perform actual additions: 200 ops
◮ Decrement reference counts on X,Y: 50 ops
◮ Deallocate old Z, if any: 100 ops
◮ Assign Z←temp: 50 ops
◮ TOTAL: 1150 ops
◮ vs. compiled scalar code: 10 ops (see the sketch below)
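For contrast, a minimal C sketch of what the compiled scalar case reduces to (illustrative only, not actual generated code): a couple of loads, an add, and a store, with no parsing, reference counting, conformance checking, or heap traffic.

/* Z = X + Y on scalars, as a compiler sees it: roughly a load,
 * an add, and a store -- on the order of 10 machine ops total. */
double scalar_add(double x, double y)
{
    return x + y;
}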

Baby Bear - Eliminating Scalar Arrays in a Compiler

◮ Use classical static data flow analysis to find scalars
◮ Traditional optimization methods: CSE, VP, CP, etc.
◮ Allocate scalars on stack, instead of heap (see the sketch below)
◮ Generate scalar-specific code
◮ This is challenging to do in an interpreter
◮ Experimental platform: AMD 1075T 6-core CPU, 3.2GHz (cheap ASUS M4A88T-M desktop machine)
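A hypothetical before/after sketch of the stack-allocation point (not APEX or SaC internals; malloc/free stand in for an interpreter's array allocator):

#include <stdlib.h>

/* Before: even a one-element temp lives on the heap, as an
 * interpreter's array would. */
double add_heap_temp(double x, double y)
{
    double *t = malloc(sizeof *t);   /* heap-allocated 1-element "array" */
    *t = x + y;
    double r = *t;
    free(t);                         /* deallocate the temp */
    return r;
}

/* After: static data flow analysis has proven the temp is a scalar,
 * so it becomes a stack/register value with no heap traffic. */
double add_stack_temp(double x, double y)
{
    double t = x + y;
    return t;
}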

Baby Bear Problem: Floyd-Warshall Algorithm

z←floyd D;i;j;k
siz←⍳(⍴D)[0]
:For k :In siz
    :For i :In siz
        :For j :In siz
            D[i;j]←D[i;j]⌊D[i;k]+D[k;j]
        :EndFor
    :EndFor
:EndFor

◮ Problem size: 3000x3000 graph
◮ Dyalog APL, J interpreters: one week-ish; APEX/SAC: 103sec
◮ Dynamic programming (string shuffle): >1000X speedup
◮ Lesson: Interpreters dislike scalar-dominated algorithms
◮ Lesson: Compilers are not fussy; Baby bear problem solved! (C sketch below)
◮ But, no parallelism: Adding threads just makes it slower!
◮ What about array-based solutions? It's papa bear time!
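For reference, a C transliteration of the scalar kernel above (a sketch assuming a row-major n x n matrix, not the actual APEX/SaC output); the inner statement compiles to a handful of machine ops, which is why the compiled code wins so decisively:

static double min2(double a, double b) { return a < b ? a : b; }

/* Floyd-Warshall on a row-major n x n distance matrix D,
 * mirroring the APL :For nest above. */
void floyd(double *D, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                D[i*n + j] = min2(D[i*n + j], D[i*n + k] + D[k*n + j]);
}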

Array-based Floyd-Warshall Algorithm

◮ j64-602, from J Essays (CDC STAR APL algorithm variant)

floyd=: 3 : ('for_j. i.#y';'do. y=. y <. j ({"1 +/ {) y';'end.';'y')

◮ SAC: Scholz & Bernecky (Classic matmul variant)

inline int[.,.] floydSbs1(int[.,.] D)
{
    DT  = transpose(D);
    res = with (. <= [i,j] <= .)
            : min( D[i,j], minval( D[i] + DT[j]));
          : modarray(D);
    return( res);
}

◮ A "with-loop" is a nested data-parallel FORALL loop

Array-based Floyd-Warshall Algorithm Speedup

Lesson: Array-based code and optimizers are good for you

Figure: Floyd-Warshall performance. APEX/SAC (18091) vs. J, 2012-07-20;
Floyd-Warshall shortest-path benchmark, 3000x3000 (J Essays algorithm for J).
Elapsed times (J interpreter baseline: 306s; best speedup roughly 40X):

Algorithm       -mt 1   -mt 2   -mt 3   -mt 4   -mt 5   -mt 6
FOR-loop        103s    -       -       -       -       -
sbs/rbe-WLF     50.3s   26.9s   17.9s   14.8s   12.5s   11.5s
sbs/rbe-AWLF    22.2s   14.2s   9.7s    8.9s    8.5s    7.6s

Loop Fusion

◮ Z = V1 + (V2 * V3)

for( i=0; i<n; i++) { tmp[i] = V2[i] * V3[i]; }
for( j=0; j<n; j++) { Z[j]   = V1[j] + tmp[j]; }

◮ Loop fusion transforms this into:

for( j=0; j<n; j++) { Z[j] = V1[j] + (V2[j] * V3[j]); }

◮ Benefit: Array-valued tmp removed (DCR)
◮ Benefit: Reduced memory subsystem traffic
◮ Benefit: Reduced loop overhead
◮ Benefit: Improved parallelism, in some compilers

With-Loop Folding (WLF) and Algebraic With-Loop Folding (AWLF)

◮ WLF (S.B. Scholz) - a generalization of loop fusion
◮ Handles Arrays of Known Shape (AKS) only
◮ AWLF (R. Bernecky)
◮ Handles AKS arrays & Arrays of Known Dimension (AKD)
◮ Acoustic signal processing (delta modulation):

logd←{¯50⌈50⌊50×(DIFF 0,⍵)÷0.01+⍵}
DIFF←{¯1↓⍵-¯1⊖⍵}
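In C terms, WLF/AWLF do to this pipeline what the loop-fusion example above did to a single pair of loops, but across the whole chain: every primitive in logd would otherwise materialize a vector-sized temp. A rough sketch of the fused result (illustrative only; the DIFF boundary handling is simplified):

/* One fused pass over an n-element signal: the catenate, DIFF,
 * scale, divide, and both clamps happen per element, in registers,
 * with no intermediate vectors at all. */
void logd(double *z, const double *x, long n)
{
    for (long i = 0; i < n; i++) {
        double prev = (i == 0) ? 0.0 : x[i-1];  /* the "0," catenate */
        double d    = x[i] - prev;              /* DIFF, no temp vector */
        double v    = 50.0 * d / (0.01 + x[i]);
        if (v >  50.0) v =  50.0;               /* the 50-minimum clamp */
        if (v < -50.0) v = -50.0;               /* the -50-maximum clamp */
        z[i] = v;
    }
}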

WLF/AWLF example: Acoustic Signal Processing

◮ logd on 200E6-element double-precision vector
◮ Sixteen with-loops are folded into two WLs!
◮ WLF/AWLF increase available parallelism

sac2c options   Serial elapsed   Parallel (-mt 6) elapsed   Speedup
APL             7.8s             n/a                        n/a
-nowlf -O3      10.7s            5.5s                       1.9X
-doawlf -O3     3.2s             0.7s                       4.5X
Speedup         3.3X             7.8X                       15X
(bottom row: -doawlf vs. -nowlf; rightmost cell: best parallel vs. -nowlf serial)

With-Loop Scalarization (WLS)

◮ With-Loop Scalarization (C. Grelck, S.B. Scholz, K. Trojahner)
◮ Operates on nested WLs in which the inner loop creates non-scalar cells
◮ WLS merges loop-nest pairs, forming a single WL

A = with ([0] <= iv < [4]) {
      B = with ([0] <= jv < [4])
            genarray( [4], iv[0] + 2 * jv[0]);
    } genarray( [4], B);

◮ WLS transforms this into:

A = with ([0,0] <= iv < [4,4])
      genarray( [4,4], iv[0] + 2 * iv[1]);

◮ Mandatory for good performance: array-valued temps removed

WLF/AWLF/WLS example: Poisson 2-D Relaxation Kernel

From Sven-Bodo Scholz: With-Loop-Folding in SAC.
A good argument for Ken Iverson's mask verb!

z←relax A;m;n
...
m←(⍴A)[0]
n←(⍴A)[1]
B←((1⌽A)+(¯1⌽A)+(1⊖A)+(¯1⊖A))÷4
upperA←(1,n)↑A
lowerA←((m-1),0)↓A
leftA←1 0↓((m-1),1)↑A
rightA←((m-2),1)↑(1,n-1)↓A
innerB←((m-2),n-2)↑1 1↓B
middle←leftA,innerB,rightA
z←upperA⍪middle⍪lowerA
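After AWLF and WLS, relax reduces to a single stencil pass. A C sketch of that fused loop (illustrative, assuming row-major storage; not compiler output): interior points get the four-neighbour average, while the border rows and columns are copied through from A unchanged.

/* One fused pass over an m x n grid: none of the temps B, upperA,
 * lowerA, leftA, rightA, innerB, or middle is ever materialized. */
void relax(double *z, const double *a, int m, int n)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            if (i == 0 || i == m-1 || j == 0 || j == n-1)
                z[i*n + j] = a[i*n + j];            /* border: copy A */
            else
                z[i*n + j] = (a[(i-1)*n + j] + a[(i+1)*n + j] +
                              a[i*n + (j-1)] + a[i*n + (j+1)]) / 4.0;
}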

Poisson 2-D Relaxation: Multi-thread, Various Grid Sizes

◮ AWLF, aided by WLS, folds relax function into 1 loop!
◮ 20K iterations, 250x250 grid: Dyalog APL: CPU time = 47.4s
◮ APEX/SAC 18091, single-thread: CPU time = 3.65s
◮ APEX/SAC 18091: multi-threaded (no source code changes!)


Figure: APEX vs. APL CPU time performance. APEX/SAC (18091) vs. Dyalog APL 13.0,
2012-07-19, 6-core AMD Phenom II X6 1075T, 3.2GHz. Elapsed times:

Problem size, iterations   APL     -mt 1   -mt 2   -mt 3   -mt 4   -mt 5   -mt 6
250x250, 20K               47.4s   3.65s   1.85s   1.29s   1.08s   0.89s   0.83s
500x500, 20K               204s    14.7s   7.5s    5.2s    4.26s   3.48s   3.5s
10Kx10K, 100               418s    42.1s   29.9s   25.9s   23.6s   22.7s   23.8s

Poisson 2-D Relaxation: Memory footprint

◮ Why poor speedup on 10Kx10K test?
◮ Dyalog APL 13.0, 10Kx10K grid: 8.5GB footprint
◮ APEX/SAC 18091, 10Kx10K grid: 3.4GB footprint
◮ Memory subsystem bandwidth: 4464MB/s
◮ Grid is 800MB → 5 writes of grid to/from memory/s (arithmetic below)
◮ Therefore, speedup is eventually memory-limited on cheapo system
◮ Scholz sees linear speedup on 48-core system
◮ Lesson: High memory bandwidth is good for you.
◮ Lesson: Array optimizations are VERY good for you.
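The arithmetic behind the bandwidth bullet, using the quoted figures: 4464 MB/s ÷ 800 MB per grid pass ≈ 5.6, so the memory subsystem can move the whole grid to or from memory only about five times per second. Once the threads together reach that ceiling, adding more cores cannot speed up a computation that must stream the full grid on every iteration.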

Mama Bear Motivation

Why is interpreted APL faster than compiled code for some tests?

[Figure: APL vs. APEX CPU time performance (2012-09-15), higher is better for APEX. Bar chart of speedup (APL/APEX with AWLF), log scale 0.1X-1000X, over roughly fifty AKS benchmarks (buildv, compiota, csbench, fd, floyd, histop, ipdd, logd, poisson, primes, ulam, unirand, the upgrade/downgrade family, waver, and others). APL: Dyalog APL 13.0; SAC: 18221:MODIFIED. On some benchmarks the ratio falls below 1X, i.e., the interpreter wins.]

Figure: APEX vs. APL CPU time performance

Mama Bear Motivation

Some reasons for poor performance of compiled SAC code:

◮ Index vector generation for indexed assign
◮ Shape vector generation for variable result shapes
◮ Generation of small arrays, e.g., complex scalars
◮ No SaC FOR-loop analog to with-loop

Mama Bear - Small Array Scalarization

◮ Replace small arrays by their scalarized form
◮ Optimization: Primitive Function Unrolling (Classic)
◮ Optimization: Index Vector Elimination (IVE) (sacdev) (sketch below)
◮ 2–16X speedup observed
◮ Optimizations: LS, LACSI, LACSO (S.B. Scholz, R. Bernecky)
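A rough C illustration of what Index Vector Elimination buys (a sketch with assumed names, not SaC internals): without IVE, every indexed access materializes a small index-vector array; with IVE, the compiler computes a scalar row-major offset and the index vector never exists.

/* Hypothetical selection helper, standing in for indexing through
 * a materialized index vector. */
static double sel2(const int iv[2], const double *A, int cols)
{
    return A[iv[0] * cols + iv[1]];
}

double demo(const double *A, int cols, int i, int j)
{
    /* Without IVE: a 2-element index vector is built per access. */
    int iv[2] = { i, j };
    double v1 = sel2(iv, A, cols);

    /* With IVE: scalar offset arithmetic; no index vector at all. */
    double v2 = A[i * cols + j];

    return v1 + v2;   /* same values; the second form is much cheaper */
}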

Mama Bear - Small Array Scalarization

◮ Mandelbrot set computation performance
◮ mandelbrot: Uses complex numbers

  int calc( complex z, int maxdepth)
  { ... while( real(z)*real(z) + imag(z)*imag(z) <= 4.0) ...

◮ Complex scalars, under the covers:

  complex z  ↔  double(2) z
  real(z)    ↔  z[0]
  imag(z)    ↔  z[1]

◮ mandelbrot opt: Hand-scalarized - pair of scalars

  int calc( double zr, double zi, int maxdepth)
  { ... while( zr * zr + zi * zi <= 4.0) ...

Mama Bear - Small Array Scalarization

◮ Execution times, with LS, LACSI, LACSO opts enabled/disabled

  Test            Opts   -mt 1     -mt 2    -mt 3    -mt 4    -mt 5    -mt 6
  mandelbrot      off    1508.9s   956.0s   828.7s   676.8s   655.7s   635.2s
  mandelbrot opt  off    71.8s     48.4s    35.2s    28.1s    23.0s    19.8s
  mandelbrot      on     69.9s     46.1s    34.6s    28.1s    23.0s    21.9s
  mandelbrot opt  on     70.7s     46.7s    34.7s    28.2s    22.9s    19.6s

◮ Lesson: No more suffering for being elegant
◮ Well, less suffering for being elegant...

GPU (CUDA) Support Without Suffering

◮ SaC generates CUDA code automatically: -target cuda
◮ Physics experiment

[Figure: LatticeBoltzmann CUDA vs. SaC speedups (8800GT). Speedup vs. problem size (256-1536) for 10, 25, 50, 100, and 200 steps; speedups range roughly 10X-70X.]

Goldilocks - Nested Arrays in APEX/SAC

◮ Nested arrays are alive and living in SAC! (R. Douma)
◮ APL convolution kernel using EACH:

  convn←{fi←⍺ ⋄ (⍳⍴⍵)con¨⊂⍵}
  con←{fi+.×(⍴fi)↑⍺↓⍵}

◮ SAC convolution kernel using EACH:

  nested double[.] NDV;
  nested double NDS;
  pt = trace ++ (filter * 0.0);   NB. No overtake in SAC
  z  = convn(iota(shape(tr)[0]), fi, enclose NDV(pt));
  convn: z = with { ( . <= iv <= .) : con(dc[iv], fi, disclose NDV(tr)); }
             : genarray(shape(dc), 0.0);
  con:   matmul(fi, take(shape(fi), drop([dc], tr)))

◮ Performance is so-so: Optimistic optimizations required

Summary and Future Work

◮ Status:

  Bear   Array size   Optimizers    Serial speedup   Parallel speedup
  Baby   scalars      mature        up to 1300X      none
  Mama   small        developing    up to 20X        enables other opts
  Papa   large        nearly done   up to 10X        2X-50X

◮ All optimizations are critical for getting excellent performance
◮ Array-based algorithms will win, and scale well
◮ Nested arrays: APEX, SAC both require work
◮ Small arrays: Needs scalarized index-vector-to-offset primitive
◮ Small arrays: Perhaps (likely!), additional work will be needed
◮ And, they lived more or less happily ever after! Thank you!
