
Carnegie Mellon
An Algorithmic Specific Code Generator for GEMM-Like Operations
Richard Michael Veras
[Figure labels: Performance, Applications, Platforms]

Want
High Performance GEMM-Like Operations
Automatic Model Driven approach for generating DGEMM:
High Perf. Compiler Techniques:
Richard Veras (rveras@cmu.edu)

GEMM Like Operations
Clustering (triangle): ~ $A A^T A$, semiring $(\mathbb{Z}, \wedge, +, 1, 0)$
Centrality (betweenness): ~ $A^T A$, semiring $(\mathbb{Z}, +, \mathrm{MIN}, 0, 1)$
Community Detection: ~ $A^K$
Check out GraphBLAS
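To make the clustering (triangle) pattern concrete, here is a minimal sketch of triangle counting written as a GEMM-like triple loop, with logical AND in the multiply position and integer addition in the add position. The function name `count_triangles`, the dense row-major 0/1 adjacency layout, and the assumption of an undirected graph without self-loops are illustrative choices, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): counts triangles in an
   undirected graph given a dense n x n 0/1 adjacency matrix A in
   row-major order, assuming A is symmetric with a zero diagonal.
   It sums (A A^T)[i][j] over all edges (i, j). For symmetric A that sum
   equals trace(A^3), which counts every triangle 6 times. */
long count_triangles(const int *A, size_t n)
{
    assert(A != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (!A[i * n + j])
                continue;                       /* keep only entries on an edge */
            long common = 0;                    /* (A A^T)[i][j] */
            for (size_t p = 0; p < n; p++)
                common += (A[i * n + p] && A[j * n + p]); /* AND, then + */
            total += common;
        }
    }
    return total / 6;
}
```

The inner loop over `p` has the same shape as the GEMM inner product; only the scalar operators differ, which is the property a generator for GEMM-like kernels can exploit.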

High Performance Micro-Kernels
Cast micro-kernel as outer product:
Enumerate all possible tilings given ISA
Use Models to Select from Design Space:
Aggressively Schedule and Optimize:
Veras, R., Smith, T., Low, T.M., Franchetti, F., van de Geijn, R. [CGO 2017 Submitted]

High Performance Micro-Kernels
[Figure: comparison of OpenBLAS, Our Generated, and ATLAS]
Veras, R., Smith, T., Low, T.M., Franchetti, F., van de Geijn, R. [CGO 2017 Submitted]

Automating with Compiler Techniques
Subtle Semiring Changes Impact Performance: [image: extremetech.com]
Cast as ILP Problem Minimizing Stalls [images: cs.duke.edu, massey.ac.nz]

A Generator for GEMM-Like Kernels
Input: Algorithm from our design space:
    GEMM Like: $C = AB + C$
    Initialize: $C_{acc} = 0$
    Compute: $C_{acc} = \sum_p a_p b_p^T$
    Accumulate: $C = f(C_{acc})$
    Semiring: $(\mathbb{Z}, +, \mathrm{MIN}, 0, 1)$
Output: Kernel
    Betweenness Centrality (Floyd-Warshall)
    Kernel Sustains High Throughput:
    Tuned to the Target Architecture

Our GEMM Generator Pipeline
Find Efficient Instructions Mix (Algo)
    1. User defines block size, ISA and semiring
    2. Enumerate algorithm space
    3. Top Candidate selected
Turn Mix into Efficient Code (Implementation)
    4. Template Created for Candidate
    5. Template transformed into optimized and scheduled code

From Math to Tiling
Identify Small Outer Products from ISA
    $C = AB + C$
    Initialize: $C_{acc} = 0$
    Compute: $C_{acc} = \sum_p a_p b_p^T$
    Accumulate: $C = f(C_{acc})$
    1. Start with ISA
    2. Enumerate Space of Outer Products
    3. Select Best Mix with Queueing Model
    4. A High Throughput Mix

From Tiling to Template
Selected Outer Product:

Initialize, $C_{acc} = 0$:

    for( i = 0; i < m_r; i++ )
      for( j = 0; j < n_r; j++ )
        init(c_reg, i, j );

Compute, $C_{acc} = \sum_p a_p b_p^T$:

    for( pp = 0; pp < k_b; pp++ )
      /* perform the outer products */
      for( i = 0; i < m_r; i+=m_s )
        for( j = 0; j < n_r; j+=n_s )
          for( ii = i; ii < i+m_s; ii++ ) {
            get_a_elem(a_reg, ii,j );
            for( jj = j; jj < j+n_s; jj++ ) {
              get_b_elem(b_reg, ii,jj );
              apply(c_reg,a_reg,b_reg,ii,jj,pp);
            }
          }

Accumulate, $C = f(C_{acc})$:

    for( i = 0; i < m_r; i++ )
      for( j = 0; j < n_r; j++ )
        accumulate(C, c_reg, i, j );

Embedding Function (get_b):

    def get_b_element ( var array b_reg[][], ptr B, ii, jj )
      opts = {
        0: assign( b_reg[jj], vload(B,jj )),
        1: assign( b_reg[jj], shuffle(b_reg[jj-1]) ),
        2: assign( b_reg[jj], permute(b_reg[jj-1]) ),
        3: assign( b_reg[jj], shuffle(b_reg[jj-1]) )}
      if ii mod v = 0
        return opts[jj]
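The embedding function builds each further copy of `b_reg` from the previous one with a shuffle or permute instead of a new load. A scalar sketch of why that is enough to cover a full outer product is shown here, under simplifying assumptions: the vector width `V` is fixed at 4, and a plain lane rotation stands in for the actual shuffle/permute sequence, which orders lanes differently. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define V 4  /* assumed SIMD width in doubles */

/* Scalar stand-in for one rank-1 update built from lane permutations.
   c_reg[s][lane] accumulates a[lane] * b[(lane + s) % V], so the V
   accumulator "registers" together hold every a[i]*b[j] product exactly
   once. A real kernel would obtain each permuted copy of b with a shuffle
   or permute instruction rather than the modular index used here. */
void outer_product_by_rotation(const double a[V], const double b[V],
                               double c_reg[V][V])
{
    for (size_t s = 0; s < V; s++)              /* one permuted copy of b per s */
        for (size_t lane = 0; lane < V; lane++) /* one vector multiply-add */
            c_reg[s][lane] += a[lane] * b[(lane + s) % V];
}

/* Undo the permuted layout: element (i, j) of the outer product lives in
   accumulator (j - i) mod V, lane i. */
double outer_product_elem(double c_reg[V][V], size_t i, size_t j)
{
    assert(i < V && j < V);
    return c_reg[(j + V - i) % V][i];
}
```

One load of `b` plus `V - 1` in-register permutations yields all `V * V` products, which is why the choice of permutation sequence is part of the design space the generator enumerates.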

Floyd-Warshall Embedded in GEMM
Semiring:
    DGEMM: $(\mathbb{R}, *, +, 1, 0)$
    Floyd-Warshall: $(\mathbb{Z}, +, \mathrm{MIN}, 0, \infty)$
Initialize ($C_{acc} = 0$):
    DGEMM: assign( c_reg[ii,jj], 0 )
    Floyd-Warshall: assign( c_reg[ii,jj], (ii==jj)? 0 : INFINITY )
Compute ($C_{acc} = \sum_p a_p b_p^T$):
    DGEMM: assign( c_reg[ii,jj], a_reg[ii,pp]*b_reg[pp,jj] + c_reg[ii,jj] )
    Floyd-Warshall: assign( c_reg[ii,jj], MIN(a_reg[ii,pp]+b_reg[pp,jj], c_reg[ii,jj]) )
Accumulate ($C = f(C_{acc})$):
    DGEMM: assign( C[(ii,jj)], c_reg[ii,jj] + C[(ii,jj)] )
    Floyd-Warshall: assign( C[(ii,jj)], MIN(c_reg[ii,jj], C[(ii,jj)]) )
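The two columns differ only in their initialize, compute, and accumulate steps, so a single loop nest can serve both. The following is a minimal, unoptimized sketch of that idea using function pointers. The struct `gemm_like_ops`, the function `gemm_like`, and the row-major layout are assumptions made for illustration; the generator described in the slides emits specialized, scheduled SIMD code rather than indirect calls.

```c
#include <assert.h>
#include <math.h>    /* INFINITY macro only */
#include <stddef.h>

/* Hypothetical description of a GEMM-like operation: the three
   user-supplied steps named on the slide. */
typedef struct {
    double (*init)(size_t i, size_t j);              /* C_acc start value */
    double (*compute)(double a, double b, double c); /* one update for index p */
    double (*accumulate)(double c_acc, double c);    /* fold C_acc into C */
} gemm_like_ops;

static double min2(double x, double y) { return x < y ? x : y; }

static double dgemm_init(size_t i, size_t j) { (void)i; (void)j; return 0.0; }
static double dgemm_compute(double a, double b, double c) { return a * b + c; }
static double dgemm_accumulate(double c_acc, double c) { return c_acc + c; }

static double fw_init(size_t i, size_t j) { return i == j ? 0.0 : INFINITY; }
static double fw_compute(double a, double b, double c) { return min2(a + b, c); }
static double fw_accumulate(double c_acc, double c) { return min2(c_acc, c); }

static const gemm_like_ops DGEMM_OPS   = { dgemm_init, dgemm_compute, dgemm_accumulate };
static const gemm_like_ops MINPLUS_OPS = { fw_init, fw_compute, fw_accumulate };

/* Row-major A (m x k), B (k x n), C (m x n). C must not alias A or B. */
void gemm_like(const gemm_like_ops *ops, size_t m, size_t n, size_t k,
               const double *A, const double *B, double *C)
{
    assert(ops && A && B && C);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double c_acc = ops->init(i, j);
            for (size_t p = 0; p < k; p++)
                c_acc = ops->compute(A[i * k + p], B[p * n + j], c_acc);
            C[i * n + j] = ops->accumulate(c_acc, C[i * n + j]);
        }
}
```

Called with `MINPLUS_OPS` on a distance matrix, one invocation performs a single min-plus product (one path-relaxation step); with `DGEMM_OPS` the same loops compute `C = AB + C`.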

Scheduling the Problem
We have built the kernel code. Now we need to schedule:
Static Scheduling still matters on OOO Processors:
Pipeline for Scheduling:
    1. Built Kernel + Embedding Func.
    2. Express as Decision Vars
    3. Formulate Constraints over Vars
    4. Minimize with ILP Solver
    5. Scheduled Kernel Code

OASIC approach for ILP Scheduling
Representing Design Space as Polytope:
    Decision Variable: $x^k_{nt} = 1$ if instruction $n$ is executed on functional unit $k$ at time step $t$ ($n$: instruction label, $k$: functional unit, $t$: cycle)
Expressing Constraints in terms of X:
    Every instruction $n$ is executed once: $\sum_k \sum_t x^k_{nt} = 1$
    At any timestep $t$, functional unit $k$ is used no more than its capacity $R_k$ allows: $\sum_n x^k_{nt} \le R_k$
    If instruction $m$ depends on instruction $n$, then $m$ will not execute until $l$ cycles after $n$: $\sum_k \sum_{t' \ge t} x^k_{nt'} + \sum_k \sum_{t' \le t+l-1} x^k_{mt'} \le 1$ for every $t$
But wait, there's more!
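The three constraints (each instruction issues once, no functional unit exceeds its capacity, a dependent instruction waits for the latency) can be checked directly against a candidate schedule. The sketch below does that for a dense 0/1 decision array. The array layout, the function names, and the pairwise "tight" form used for the dependency check are assumptions made for illustration; an ILP solver would search over the variables rather than verify a fixed assignment.

```c
#include <assert.h>
#include <stddef.h>

/* x is a dense 0/1 array: x[(n*K + k)*T + t] == 1 means instruction n
   issues on functional unit k at cycle t. Layout and names are assumed. */
static int x_at(const int *x, size_t K, size_t T, size_t n, size_t k, size_t t)
{
    return x[(n * K + k) * T + t];
}

/* Every instruction is executed exactly once. */
int each_instruction_once(const int *x, size_t N, size_t K, size_t T)
{
    for (size_t n = 0; n < N; n++) {
        int count = 0;
        for (size_t k = 0; k < K; k++)
            for (size_t t = 0; t < T; t++)
                count += x_at(x, K, T, n, k, t);
        if (count != 1) return 0;
    }
    return 1;
}

/* At every cycle, unit k issues at most R[k] instructions. */
int within_capacity(const int *x, size_t N, size_t K, size_t T, const int *R)
{
    for (size_t k = 0; k < K; k++)
        for (size_t t = 0; t < T; t++) {
            int used = 0;
            for (size_t n = 0; n < N; n++)
                used += x_at(x, K, T, n, k, t);
            if (used > R[k]) return 0;
        }
    return 1;
}

/* Instruction m depends on n with latency l: m may not issue before the
   issue cycle of n plus l. Checked as: for every t,
   (n issues at or after t) + (m issues before t + l) <= 1. */
int dependency_ok(const int *x, size_t N, size_t K, size_t T,
                  size_t n, size_t m, size_t l)
{
    assert(n < N && m < N);
    for (size_t t = 0; t < T; t++) {
        int lhs = 0;
        for (size_t k = 0; k < K; k++) {
            for (size_t tp = t; tp < T; tp++)
                lhs += x_at(x, K, T, n, k, tp);
            for (size_t tp = 0; tp < T && tp < t + l; tp++)
                lhs += x_at(x, K, T, m, k, tp);
        }
        if (lhs > 1) return 0;
    }
    return 1;
}
```

For example, with one functional unit of capacity 1, issuing instruction 0 at cycle 0 and a dependent instruction 1 at cycle 2 satisfies a latency of 2 but violates a latency of 3.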

Emitting The Code
Code is emitted:
Code is now Scheduled

    for( pp = 0; pp < k_b; pp+=KUNR )
    {
      /* STEADY STATE CODE */
      VLOAD_IA(GET_A_ADDR(0),GET_A_REG(0))
      VLOAD_IA(GET_A_ADDR(1),GET_A_REG(1))
      VLOAD_IA(GET_B_ADDR(0),GET_B_REG(0))
      VSHUFFLE_IA(GET_B_REG(0),GET_B_REG(1))
      VFMA(GET_A_REG(0),GET_B_REG(0),GET_C_REG(0,0))
      VFMA(GET_A_REG(0),GET_B_REG(1),GET_C_REG(0,1))
      VPERM2F128_IA(0x01,GET_B_REG(1),GET_B_REG(2))
      VSHUFFLE_IA(0x05,GET_B_REG(2),GET_B_REG(3))
      VFMA(GET_A_REG(1),GET_B_REG(0),GET_C_REG(0,0))
      VFMA(GET_A_REG(1),GET_B_REG(1),GET_C_REG(0,1))
    }

Is this necessary?
Need Custom ANSI C compliant SIMD wrappers to schedule in Compiler:

    /* Wraps one AVX add in inline assembly so the compiler keeps it as a
       single instruction in the emitted order. */
    #define VADD(srca,srcb,dest) \
      asm volatile( "vaddpd %[vsrca],%[vsrcb], %[vdest]" \
                    : [vdest] "=x"(dest) \
                    : [vsrca] "x"(srca), [vsrcb] "x"(srcb));

Putting it All Together
Have Operation that we can express like GEMM:
    $C = AB + C$
    Initialize: $C_{acc} = 0$
    Compute: $C_{acc} = \sum_p a_p b_p^T$
    Accumulate: $C = f(C_{acc})$
Run through our Pipeline:

Moving Forward
Clustering (triangle): ~ $A A^T A$
Centrality (betweenness): ~ $A^T A$
Community Detection: ~ $A^K$

Summary
- There exists a large class of GEMM-like operations
- Obtaining DGEMM-level performance for each of these operations requires automation
- We have a systematic approach for automatically generating DGEMM
- We are extending it by allowing the user to define a semiring with an initialize and accumulate function
