SLIDE 1

Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

Jeff Bilmes, Krste Asanović, Chee-Whye Chin and Jim Demmel
CS Division, University of California at Berkeley, Berkeley CA
International Computer Science Institute, Berkeley CA
Proceedings of the 11th International Conference on Supercomputing (1997)
Presented by Stefan Dietiker, October 5th 2011

SLIDE 2

Matrix Multiplications

They are important & interesting

  • Linear Algebra
  • LA kernels, such as LAPACK, heavily use matrix multiplication
  • There are numerous vendor-optimized BLAS libraries
  • Computational viewpoint: a lot of potential for code optimization
SLIDE 3

Traditional Approach

Hand-optimized libraries


SLIDE 5

Traditional Approach

Hand-optimized libraries

  • In general, (micro-)architecture-specific code is unportable.
  • Assembly code is difficult to write and maintain. => High effort
  • We prefer to write code in a high-level, standardized language that can be compiled on many different platforms.

SLIDE 6

PHiPAC Approach

Generate optimized source code

SLIDE 7

PHiPAC Approach

Parameters are architecture specific

SLIDE 8

PHiPAC Approach

Look ahead

Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

SLIDE 9

Coding Guidelines

Remove false dependencies

What if &a[i] == &b[i+1]? The compiler must assume the store to a[i] may change b[i+1], so it cannot reorder or overlap the two statements:

a[i]   = b[i] + c;
a[i+1] = b[i+1] * d;

SLIDE 10

Coding Guidelines

Remove false dependencies

float f1, f2;
f1 = b[i];
f2 = b[i+1];
a[i]   = f1 + c;
a[i+1] = f2 * d;

Both loads are now hoisted above both stores, so the compiler can schedule the statements as if &a[i] != &b[i+1].
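As a sketch of this guideline (the function names and signatures are ours, not from the slides), the two variants can be compared directly; when the arrays overlap, only the scalar-replaced version has the "loads first" behavior the optimizer can exploit:

```c
#include <assert.h>

/* Potentially aliased version: the store to a[i] may change b[i+1],
 * so the compiler cannot reorder the two statements. */
void update_aliased(float *a, float *b, int i, float c, float d) {
    a[i]   = b[i] + c;
    a[i+1] = b[i+1] * d;
}

/* Scalar replacement: both loads happen before either store, so the
 * result no longer depends on whether a and b overlap, and the two
 * computations can be scheduled independently. */
void update_scalars(float *a, float *b, int i, float c, float d) {
    float f1 = b[i];
    float f2 = b[i+1];
    a[i]   = f1 + c;
    a[i+1] = f2 * d;
}
```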

SLIDE 11

Coding Guidelines

Scalar Replacement: Exploit Register File

while(…) {
    *res++ = f[0]*sig[0] + f[1]*sig[1] + f[2]*sig[2];
    sig++;
}

becomes

float f0, f1, f2;
f0 = f[0]; f1 = f[1]; f2 = f[2];
while(…) {
    *res++ = f0*sig[0] + f1*sig[1] + f2*sig[2];
    sig++;
}
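A runnable version of the scalar-replaced filter above (the wrapper function and its signature are our own framing, not from the slides):

```c
#include <assert.h>

/* 3-tap filter with the coefficients lifted into scalars once, outside
 * the loop, so the compiler can keep f0..f2 in registers and each
 * iteration loads only from sig. */
void fir3(float *res, const float *f, const float *sig, int n) {
    float f0 = f[0], f1 = f[1], f2 = f[2];
    for (int i = 0; i < n; i++) {
        *res++ = f0 * sig[0] + f1 * sig[1] + f2 * sig[2];
        sig++;
    }
}
```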

SLIDE 12

Coding Guidelines

Minimize pointer updates

Updating the pointer after every load:

f0 = *r8; r8 += 4;
f1 = *r8; r8 += 4;
f2 = *r8; r8 += 4;

(IA32 Assembler)

movl (%ecx), %eax
addl $16, %ecx
movl (%ecx), %ebx
addl $16, %ecx
movl (%ecx), %edx
addl $16, %ecx
movl (%ecx), %esi
addl $16, %ecx

Using constant offsets with a single pointer update:

f0 = r8[0]; f1 = r8[4]; f2 = r8[8]; r8 += 12;

(IA32 Assembler)

movl (%ecx), %eax
movl 16(%ecx), %ebx
movl 32(%ecx), %edx
movl 48(%ecx), %esi
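The same idea as a self-contained C sketch (the function and its interface are illustrative, not PHiPAC code):

```c
#include <assert.h>

/* Stride-4 gather written with constant offsets and one pointer bump,
 * so the compiler can use offset addressing (movl 16(%ecx), ...)
 * instead of an add instruction after every load. */
void gather3(float *out, const float **pp) {
    const float *r8 = *pp;
    out[0] = r8[0];
    out[1] = r8[4];
    out[2] = r8[8];
    *pp = r8 + 12;   /* single pointer update for the whole group */
}
```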

SLIDE 13

Coding Guidelines

Improve temporal and spatial locality

  • Temporal locality: The delay between two consecutive memory accesses to the same memory location should be as short as possible.
  • Spatial locality: Consecutive operations should access the same memory area.
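A small illustration of spatial locality (the array size and function names are our own): both traversals compute the same sum, but the row-major inner loop touches consecutive addresses, while the column-major one strides N doubles per access and wastes most of each cache line:

```c
#include <assert.h>
#define N 64

/* Inner loop walks consecutive addresses: each cache line is fully
 * consumed before the next one is fetched. */
double sum_rowmajor(double a[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Same result, but the inner loop jumps N doubles between accesses,
 * touching a different cache line every iteration. */
double sum_colmajor(double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```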

SLIDE 14

Coding Guidelines

Summary

Guideline                                            Effect                                                   Parameterizable
---------------------------------------------------  -------------------------------------------------------  ---------------
Use scalar replacement to remove false dependencies  Parallel execution of independent operations
Use scalar replacement to exploit the register file  Decreased memory bandwidth                               yes
Use scalar replacement to minimize pointer updates   Compressed instruction sequence
Hide multiple-instruction FPU latency                Independent execution of instructions in pipelined CPUs
Balance the instruction mix                          Increased instruction throughput
Increase locality                                    Increased cache performance                              yes
Minimize branches                                    Decreased number of pipeline flushes
Loop unrolling                                       Compressed instruction sequence                          yes
Convert integer multiplies to adds                   Decreased instruction latency

SLIDE 15

Matrix Multiplications

Simplest Approach: Three nested loops

for (i = 0; i < M; i++)
  for (j = 0; j < N; j++)
    for (l = 0; l < K; l++)
      c[i][j] += a[i][l] * b[l][j];
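The slide's three nested loops, made runnable over flat row-major arrays (the flat indexing and the requirement that the caller zero-initialize C are our adaptation):

```c
#include <assert.h>

/* Naive M x K times K x N multiply: the three nested loops from the
 * slide, over flat row-major arrays. c must be zeroed by the caller. */
void matmul(int M, int N, int K,
            const double *a, const double *b, double *c) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int l = 0; l < K; l++)
                c[i*N + j] += a[i*K + l] * b[l*N + j];
}
```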

SLIDE 16

Block Matrix Multiplication

General Approach

for (i = 0; i < M; i += MBlock)
  for (j = 0; j < N; j += NBlock)
    for (l = 0; l < K; l += KBlock)
      for (r = i; r < i + MBlock; r++)
        for (s = j; s < j + NBlock; s++)
          for (t = l; t < l + KBlock; t++)
            c[r][s] += a[r][t] * b[t][s];

SLIDE 17

Matrix Multiplications

Choose appropriate block sizes

for (i = 0; i < M; i += M0)
  for (j = 0; j < N; j += N0)
    for (l = 0; l < K; l += K0)
      for (r = i; r < i + M0; r++)
        for (s = j; s < j + N0; s++)
          for (t = l; t < l + K0; t++)
            c[r][s] += a[r][t] * b[t][s];
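A runnable sketch of the blocked loop nest (the flat row-major arrays and the assumption that M, N, K are multiples of the block sizes are ours; with block sizes of 1 it degenerates to the naive version):

```c
#include <assert.h>

/* Blocked multiply: the outer loops walk block corners, the inner
 * loops multiply one M0 x K0 block of A by one K0 x N0 block of B.
 * Assumes M, N, K are multiples of M0, N0, K0 and c is zeroed. */
void matmul_blocked(int M, int N, int K, int M0, int N0, int K0,
                    const double *a, const double *b, double *c) {
    for (int i = 0; i < M; i += M0)
        for (int j = 0; j < N; j += N0)
            for (int l = 0; l < K; l += K0)
                for (int r = i; r < i + M0; r++)
                    for (int s = j; s < j + N0; s++)
                        for (int t = l; t < l + K0; t++)
                            c[r*N + s] += a[r*K + t] * b[t*N + s];
}
```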

SLIDE 18

Parameterized Generator

Choose appropriate block sizes

$ mm_gen -l0 <M0> <K0> <N0> [ -l1 <M1> <K1> <N1> ]

SLIDE 19

Matrix Multiplications

Blocking Example: innermost 2x2 Blocks

$ mm_cgen -l0 2 2 2 -l1 4 4 4

do { /* ... */
  do { /* ... */
    do { /* ... */
      _b0 = bp[0]; _b1 = bp[1]; bp += Bstride;
      _a0 = ap_0[0]; c0_0 += _a0*_b0; c0_1 += _a0*_b1;
      _a1 = ap_1[0]; c1_0 += _a1*_b0; c1_1 += _a1*_b1;
      _b0 = bp[0]; _b1 = bp[1]; bp += Bstride;
      _a0 = ap_0[1]; c0_0 += _a0*_b0; c0_1 += _a0*_b1;
      _a1 = ap_1[1]; c1_0 += _a1*_b0; c1_1 += _a1*_b1;
      ap_0 += 2; ap_1 += 2;
    } while (/* ... */);
  /* ... */
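A cleaned-up, runnable rendering of the idea behind the generated kernel: the variable names follow the slide, but the surrounding function and its interface are our reconstruction, not actual mm_cgen output:

```c
#include <assert.h>

/* One 2x2 register block of C += A*B. The four accumulators
 * c0_0..c1_1 stay in registers for the whole K loop; each iteration
 * loads two elements of B and two of A for eight multiply-adds. */
void mm_kernel_2x2(int K, const double *ap_0, const double *ap_1,
                   const double *bp, int Bstride,
                   double *c0, double *c1) {
    double c0_0 = c0[0], c0_1 = c0[1], c1_0 = c1[0], c1_1 = c1[1];
    for (int k = 0; k < K; k++) {
        double _b0 = bp[0], _b1 = bp[1];
        bp += Bstride;
        double _a0 = ap_0[k];
        c0_0 += _a0 * _b0; c0_1 += _a0 * _b1;
        double _a1 = ap_1[k];
        c1_0 += _a1 * _b0; c1_1 += _a1 * _b1;
    }
    c0[0] = c0_0; c0[1] = c0_1;
    c1[0] = c1_0; c1[1] = c1_1;
}
```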

SLIDE 20

Finding Optimal Block Sizes

Using a Search Script

SLIDE 21

Finding Optimal Block Sizes

Example: Finding the L1 Parameters

  • We have to limit the parameter space.
  • For the square case D×D: choose D so that 3D² = L1 cache size, i.e. all three blocks fit in the L1 cache.
  • We search the neighborhood centered at D.
  • We set M1, K1, N1 to the values ϕD/M0, ϕD/K0, ϕD/N0, where ϕ ∈ {0.25, 0.5, 1.0, 1.5, 2.0}.
  • => 5³ = 125 combinations
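The bounded search space above can be sketched as a simple enumeration (the concrete D, M0, K0, N0 values in the test are made up for illustration; real values come from the register-block search and the machine's L1 size):

```c
#include <assert.h>

/* Enumerate the 5 x 5 x 5 = 125 candidate L1 blocking triples
 * (M1, K1, N1) = (phi*D/M0, phi*D/K0, phi*D/N0) with an independent
 * phi from {0.25, 0.5, 1.0, 1.5, 2.0} for each dimension. */
int enumerate_l1_candidates(int D, int M0, int K0, int N0,
                            int out[][3]) {
    const double phi[5] = {0.25, 0.5, 1.0, 1.5, 2.0};
    int n = 0;
    for (int m = 0; m < 5; m++)
        for (int k = 0; k < 5; k++)
            for (int j = 0; j < 5; j++) {
                out[n][0] = (int)(phi[m] * D / M0);
                out[n][1] = (int)(phi[k] * D / K0);
                out[n][2] = (int)(phi[j] * D / N0);
                n++;
            }
    return n;   /* number of candidate triples */
}
```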

SLIDE 22

Results

Example (Single Precision Matrix Mult. on a 100MHz SGI Indigo R4K)

Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

SLIDE 23

Results

Example (Double Precision Matrix Mult. on a SGI R8K Power Challenge)

Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

SLIDE 24

Strengths & Limitations

There's no golden hammer

  • Strengths:
    • Automatic search for optimal parameters
    • Produces portable ANSI C code
  • Limitations:
    • Focus on uniprocessor machines
    • No support for vector-based CPUs
    • No control over instruction scheduling

SLIDE 25

Further Information

Try it yourself…

  • Website:

http://www.icsi.berkeley.edu/~bilmes/phipac/