 
Automated Empirical Optimizations of Software and the ATLAS project*
Software Engineering Seminar
Pascal Spörri
*R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.
Is Search Really Necessary to Generate High-Performance BLAS? Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali and Paul Stodghill. Proceedings of the IEEE, Vol. 93, No. 2, February 2005.
INTRODUCTION
BLAS (Basic Linear Algebra Subprograms)
• Level 1: vector operations, $y \leftarrow \alpha x + y$
• Level 2: matrix-vector operations, $y \leftarrow \alpha A x + y$
• Level 3: matrix-matrix operations, $C \leftarrow \alpha A B + \beta C$
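The three BLAS levels can be illustrated with plain reference loops. This is only a sketch: the names `ref_axpy`, `ref_gemv` and `ref_gemm` and the row-major storage are assumptions made for this example, not the BLAS interface itself.

```c
#include <assert.h>
#include <stddef.h>

/* Level 1: y <- alpha * x + y, for vectors of length n. */
void ref_axpy(size_t n, double alpha, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 2: y <- alpha * A * x + y, with A an m x n row-major matrix. */
void ref_gemv(size_t m, size_t n, double alpha,
              const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] += alpha * acc;
    }
}

/* Level 3: C <- alpha * A * B + beta * C,
 * with A (m x k), B (k x n) and C (m x n), all row-major. */
void ref_gemm(size_t m, size_t n, size_t k, double alpha,
              const double *A, const double *B, double beta, double *C)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

The work per data element grows from level to level: Level 3 performs O(n^3) operations on O(n^2) data, which is why its tuning pays off most.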
ATLAS (Automatically Tuned Linear Algebra Software)
• Implements BLAS
• Applies empirical optimization techniques to source code to generate an optimized library
• Fully automatic
• Produces ANSI C code
ATLAS Matrix-Matrix Multiplication
[Figure: three panels, each labeled MMM]
Source: "Automated Empirical Optimizations of Software and the ATLAS project" by R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.
ARCHITECTURE
ATLAS Architecture
[Diagram] Detect Hardware Parameters → (L1 cache size, CPU parameters) → ATLAS Search Engine → (parameters) → ATLAS Code Generator → (source code, multiple versions) → Execute and Measure → (MFLOPS) → back to the ATLAS Search Engine
Source: Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali and Paul Stodghill. Proceedings of the IEEE, Vol. 93, No. 2, February 2005.
ATLAS CODE GENERATOR
ATLAS Optimizations
• Case: matrix-matrix multiplication $C = C + A \times B$, where $A$ is $N \times K$, $B$ is $K \times M$ and $C$ is $N \times M$

for $i \in [0 : 1 : N-1]$
  for $j \in [0 : 1 : M-1]$
    for $k \in [0 : 1 : K-1]$
      $C_{i,j} = C_{i,j} + A_{i,k} \times B_{k,j}$
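As a baseline, the triple loop above could be written in C roughly as follows. The function name `mmm_naive` and the row-major layout with $A$ as $N \times K$, $B$ as $K \times M$ and $C$ as $N \times M$ are assumptions for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* C (N x M) += A (N x K) * B (K x M); all matrices are row-major. */
void mmm_naive(size_t N, size_t M, size_t K,
               const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < M; j++)
            for (size_t k = 0; k < K; k++)
                C[i * M + j] += A[i * K + k] * B[k * M + j];
}
```

Every optimization on the following slides is a transformation of this loop nest that leaves its result unchanged.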
Loop Ordering
Two orders of the loop nest $C_{i,j} = C_{i,j} + A_{i,k} \times B_{k,j}$ are considered; $k$ is the innermost loop in both.
• $j$ outermost, then $i$: every $j$ iteration sweeps all of $A$. Store $A$ in cache.
• $i$ outermost, then $j$: every $i$ iteration sweeps all of $B$. Store $B$ in cache.
1st Level Blocking
[Figure: $A$, $B$ and $C$ partitioned into $N_B \times N_B$ blocks]
$N_B$ is chosen such that the working set fits into the L1 cache.

for $j \in [0 : N_B : M-1]$
  for $i \in [0 : N_B : N-1]$
    for $k \in [0 : N_B : K-1]$
      for $j' \in [j : 1 : j+N_B-1]$
        for $i' \in [i : 1 : i+N_B-1]$
          for $k' \in [k : 1 : k+N_B-1]$
            $C_{i',j'} = C_{i',j'} + A_{i',k'} \times B_{k',j'}$
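A minimal C sketch of this first level of blocking is shown below. The name `mmm_blocked`, the row-major layout and the clamping of edge blocks (for sizes that are not multiples of the block size) are assumptions of this example; the block loop order `j, i, k` is one possible choice.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C (N x M) += A (N x K) * B (K x M), row-major,
 * computed block by block with block size nb. */
void mmm_blocked(size_t N, size_t M, size_t K, size_t nb,
                 const double *A, const double *B, double *C)
{
    assert(nb > 0);
    for (size_t j = 0; j < M; j += nb)
        for (size_t i = 0; i < N; i += nb)
            for (size_t k = 0; k < K; k += nb) {
                /* Clamp the block bounds at the matrix edges. */
                size_t j_end = min_size(j + nb, M);
                size_t i_end = min_size(i + nb, N);
                size_t k_end = min_size(k + nb, K);
                /* Mini-MMM on one triple of blocks. */
                for (size_t jj = j; jj < j_end; jj++)
                    for (size_t ii = i; ii < i_end; ii++)
                        for (size_t kk = k; kk < k_end; kk++)
                            C[ii * M + jj] += A[ii * K + kk] * B[kk * M + jj];
            }
}
```

The three inner loops touch at most three `nb x nb` blocks at a time, which is the working set that has to fit into the L1 cache.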
2nd Level Blocking
for $j \in [0 : N_B : M-1]$
  for $i \in [0 : N_B : N-1]$
    for $k \in [0 : N_B : K-1]$
      for $j' \in [j : N_U : j+N_B-1]$
        for $i' \in [i : M_U : i+N_B-1]$
          for $k' \in [k : K_U : k+N_B-1]$
            for $k'' \in [k' : 1 : k'+K_U-1]$
              for $j'' \in [j' : 1 : j'+N_U-1]$
                for $i'' \in [i' : 1 : i'+M_U-1]$
                  $C_{i'',j''} = C_{i'',j''} + A_{i'',k''} \times B_{k'',j''}$

Register constraint: $M_U + N_U + M_U \times N_U \le N_R$, where $N_R$ is the number of available registers.
Unroll loop: the innermost $k''$, $j''$ and $i''$ loops are unrolled.
[Figure: $M_U \times N_U$ register tile of the mini-MMM]
Graphic from "How To Write Fast Numerical Code: A Small Introduction" by Srinivas Chellappa, Franz Franchetti, and Markus Püschel.
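The register-level blocking can be sketched as a mini-MMM on one square block with a fixed 2 x 2 register tile. The tile size (both unroll factors equal to 2, with an unroll of 1 in the k direction), the function name `mini_mmm_2x2` and the requirement that the block size is even are assumptions chosen to keep the example short; ATLAS determines the actual values by search.

```c
#include <assert.h>
#include <stddef.h>

/* Mini-MMM: C (nb x nb) += A (nb x nb) * B (nb x nb), row-major.
 * Uses a 2 x 2 register tile, so nb must be even. */
void mini_mmm_2x2(size_t nb, const double *A, const double *B, double *C)
{
    assert(nb % 2 == 0);
    for (size_t j = 0; j < nb; j += 2)
        for (size_t i = 0; i < nb; i += 2) {
            /* 2 x 2 = 4 accumulators hold the current tile of C. */
            double c00 = C[i * nb + j];
            double c01 = C[i * nb + j + 1];
            double c10 = C[(i + 1) * nb + j];
            double c11 = C[(i + 1) * nb + j + 1];
            for (size_t k = 0; k < nb; k++) {
                /* 2 values of A and 2 values of B per k step. */
                double a0 = A[i * nb + k];
                double a1 = A[(i + 1) * nb + k];
                double b0 = B[k * nb + j];
                double b1 = B[k * nb + j + 1];
                /* The two tile loops are fully unrolled. */
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            C[i * nb + j] = c00;
            C[i * nb + j + 1] = c01;
            C[(i + 1) * nb + j] = c10;
            C[(i + 1) * nb + j + 1] = c11;
        }
}
```

With a 2 x 2 tile the kernel needs 2 + 2 + 2 x 2 = 8 scalar values at once, which is the quantity the register constraint bounds.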
Scalar Replacement
• Replace array accesses with scalars

Before (t is stored in memory):
    double t[2];
    for (i = 0; i < 8; i++) {
        t[0] = x[2*i] + x[2*i+1];
        t[1] = x[2*i] - x[2*i+1];
        y[2*i]   = t[0] * D[2*i];
        y[2*i+1] = t[1] * D[2*i];
    }

After (intermediate results are stored in registers):
    double t0, t1, x0, x1, D0;
    for (i = 0; i < 8; i++) {
        /* load each array element once and store it for reuse */
        x0 = x[2*i]; x1 = x[2*i+1]; D0 = D[2*i];
        t0 = x0 + x1;
        t1 = x0 - x1;
        y[2*i]   = t0 * D0;
        y[2*i+1] = t1 * D0;
    }

Source: "How To Write Fast Numerical Code: A Small Introduction" by Srinivas Chellappa, Franz Franchetti, and Markus Püschel.
Scalar Replacement
Load the operands into scalars:
    a11 = A[1][1]    b11 = B[1][1]
    a12 = A[1][2]    b12 = B[1][2]
    a13 = A[1][3]    b13 = B[1][3]
    a14 = A[1][4]    b14 = B[1][4]
    …                …
Compute on scalars only:
    c11  = a11*b11
    c11 += a12*b21
    c11 += a13*b31
    …
    c12  = a11*b12
    c12 += a12*b22
    c12 += a13*b32
    …
Store the results:
    C[1][1] = c11
    C[1][2] = c12
    C[1][3] = c13
Data Hazards
    LD   R1, 0(R2)       IF  ID  EX  MEM WB
    DSUB R4, R1, R5          IF  ID  EX  MEM WB
    AND  R6, R1, R7              IF  ID  EX  MEM WB
    OR   R8, R1, R9                  IF  ID  EX  MEM WB
    XOR  R10, R1, R11                    IF  ID  EX  MEM WB
All four instructions after the load read R1, which the load only delivers at the end of its MEM stage.
Skewing factor
Source: Jens Teubner · Data Processing on Modern Hardware · Fall 2010
Pipeline Scheduling
Interleave the mul and add sequences with skewing factor $L_s$:
    mul_1, mul_2, …, mul_{L_s}, add_1, mul_{L_s+1}, add_2, mul_{L_s+2}, add_3, …
Pipeline Scheduling
Load the operands into scalars:
    a11 = A[1][1]    b11 = B[1][1]
    a12 = A[1][2]    b12 = B[1][2]
    a13 = A[1][3]    b13 = B[1][3]
    a14 = A[1][4]    b14 = B[1][4]
    …                …
Interleave the computations of independent results:
    c11  = a11*b11
    c12  = a11*b12
    …
    c11 += a12*b21
    c12 += a12*b22
    …
    c11 += a13*b31
    c12 += a13*b32
    …
Store the results:
    C[1][1] = c11
    C[1][2] = c12
    C[1][3] = c13
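The interleaving of multiplies and adds can be sketched on a simpler kernel than the generated MMM code, here a dot product. The skew of 2, the names `SKEW` and `dot_skewed`, and the small ring buffer that stands in for registers are all assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>

#define SKEW 2 /* assumed skewing factor for this sketch */

/* Dot product in which each add consumes a product that was
 * issued SKEW multiplies earlier. */
double dot_skewed(size_t n, const double *a, const double *b)
{
    double prod[SKEW]; /* products still waiting for their add */
    double sum = 0.0;
    size_t i;

    assert(n == 0 || (a != NULL && b != NULL));

    if (n < (size_t)SKEW) { /* too short to pipeline */
        for (i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }
    /* Prologue: issue the first SKEW multiplies. */
    for (i = 0; i < (size_t)SKEW; i++)
        prod[i] = a[i] * b[i];
    /* Steady state: one add of an old product, then one new multiply. */
    for (i = SKEW; i < n; i++) {
        sum += prod[i % SKEW];
        prod[i % SKEW] = a[i] * b[i];
    }
    /* Epilogue: add the products that are still pending. */
    for (i = n; i < n + SKEW; i++)
        sum += prod[i % SKEW];
    return sum;
}
```

The products are added in their original order, so the result equals that of the plain loop. An optimizing compiler or an out-of-order CPU may reorder these operations again, which is one reason the best skew is found by measurement.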
EMPIRICAL OPTIMIZATION IN ATLAS
ATLAS Architecture
[Diagram] Detect Hardware Parameters → (L1 cache size, CPU parameters) → ATLAS Search Engine → (parameters) → ATLAS Code Generator → (source code, multiple versions) → Execute and Measure → (MFLOPS) → back to the ATLAS Search Engine
The search engine optimizes $f(x_1, x_2, x_3, \ldots, x_n)$.
Source: Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali and Paul Stodghill. Proceedings of the IEEE, Vol. 93, No. 2, February 2005.
Optimization Order
1. Find best block size for outer loop ($N_B$)
2. Find best block sizes for inner loop ($M_U$, $N_U$, $K_U$)
3. Find best skewing factor ($L_s$)
4. Find best parameters for scheduling of loads
5. Additional parameters
Search for best Outer Loop Size
Restrict the search space: $16 \le N_B \le \min(80, \sqrt{\text{L1 size}})$
• $N_B$ must be a multiple of 4
• Use the fastest version
Try with and without unrolling the inner loop.
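The restricted search can be sketched as a loop over all admissible block sizes. Everything named here is hypothetical: `search_nb`, `nb_choice` and the callback type are inventions for this example, the L1 size is assumed to be given in matrix elements, and `toy_measure` is a synthetic stand-in for generating, compiling and timing a real mini-MMM.

```c
#include <assert.h>
#include <stddef.h>

/* Callback that reports the measured MFLOPS of one variant. */
typedef double (*measure_fn)(int nb, int unrolled);

typedef struct {
    int nb;        /* chosen block size, 0 if no candidate exists */
    int unrolled;  /* 1 if the unrolled variant was fastest */
    double mflops; /* performance of the chosen variant */
} nb_choice;

nb_choice search_nb(size_t l1_elems, measure_fn measure)
{
    nb_choice best = {0, 0, -1.0};
    int upper = 0;

    assert(measure != NULL);

    /* upper = min(80, floor(sqrt(l1_elems))), computed with integers. */
    while (upper < 80 &&
           (size_t)(upper + 1) * (size_t)(upper + 1) <= l1_elems)
        upper++;

    /* Try every multiple of 4 in [16, upper], with and without unrolling. */
    for (int nb = 16; nb <= upper; nb += 4)
        for (int unrolled = 0; unrolled <= 1; unrolled++) {
            double mflops = measure(nb, unrolled);
            if (mflops > best.mflops) {
                best.nb = nb;
                best.unrolled = unrolled;
                best.mflops = mflops;
            }
        }
    return best;
}

/* Synthetic performance model, used only to exercise the search. */
double toy_measure(int nb, int unrolled)
{
    double d = (double)(nb - 44);
    return 1000.0 - d * d + (unrolled ? 10.0 : 0.0);
}
```

In a real setting the callback would run the generated kernel several times and report the measured rate; the search itself only compares the reported numbers and keeps the fastest variant.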
DISCUSSION
Comparison to PhiPAC
PhiPAC
• Coding methodology to write fast code
• Precursor for ATLAS
• Specialized code generator for BLAS matrix-matrix multiplication
• Optimizes parameters for inner and outer loop
ATLAS
• Library generator
• Automatic generation of optimized BLAS
• Support for hand-coded routines
ATLAS Matrix-Matrix Multiplication
[Figure: three panels, each labeled MMM]
Source: "Automated Empirical Optimizations of Software and the ATLAS project" by R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.
Comparison to Eigen
Benchmark: http://eigen.tuxfamily.org/index.php?title=Benchmark
Machine: Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz (x86_64)
Conclusion
Pro
• Fast method to generate an optimized library for a new platform
• Supports hand-optimized code
• Implements BLAS
Contra
• Needs constant adjustment to support new architectures
• Outdated
Further Information
• ATLAS Project: http://math-atlas.sourceforge.net/
• BLAS: http://netlib.org/blas/