An Algorithmic Specific Code Generator for GEMM-Like Operations, Richard Michael Veras (PowerPoint presentation)


SLIDE 1

Carnegie Mellon

An Algorithmic Specific Code Generator for GEMM-Like Operations

Richard Michael Veras

Applications, Platforms, Performance

SLIDE 2

Want Automatic High Performance

Richard Veras (rveras@cmu.edu)

• GEMM-like operations
• Model-driven approach for generating DGEMM
• Compiler techniques
• High performance

SLIDE 3

GEMM Like Operations

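Each semiring is written as (set, multiply, add, identity of multiply, identity of add).

Centrality: ~$A^{T} A$ (betweenness), semiring $(\mathbb{Z}, +, \mathrm{MIN}, 0, \infty)$
Clustering: ~$A A^{T} A$ (triangle), semiring $(\mathbb{Z}, \wedge, +, 1, 0)$
Community Detection: ~$A^{K}$

Check out GraphBLAS.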
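To make the semiring idea on this slide concrete, here is a minimal sketch of a matrix product in which the multiply and add operators are passed in. It is not the generator from the talk, only a reference loop nest. The names `binop_t`, `semiring_matmul`, the `op_*` helpers, and the sentinel `SR_INF` (a large finite stand-in for infinity) are all assumptions made for illustration.

```c
#include <stddef.h>

/* Assumed sentinel standing in for infinity; small enough that
 * SR_INF + SR_INF still fits in a 32-bit long. */
#define SR_INF 1000000000L

/* Hypothetical helper type: a binary operator over long integers. */
typedef long (*binop_t)(long, long);

long op_add(long x, long y) { return x + y; }
long op_mul(long x, long y) { return x * y; }
long op_min(long x, long y) { return x < y ? x : y; }
long op_and(long x, long y) { return (x != 0) && (y != 0); }

/* C[i][j] = add over p of mul(A[i][p], B[p][j]) for n x n row-major
 * matrices. `zero` must be the identity of `add`
 * (0 for +, SR_INF for MIN). */
void semiring_matmul(size_t n, const long *A, const long *B, long *C,
                     binop_t mul, binop_t add, long zero)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            long acc = zero;
            for (size_t p = 0; p < n; p++)
                acc = add(acc, mul(A[i * n + p], B[p * n + j]));
            C[i * n + j] = acc;
        }
    }
}
```

Calling it with `(op_mul, op_add, 0)` gives ordinary integer matrix multiplication, `(op_add, op_min, SR_INF)` gives one min-plus step of a shortest-path computation, and `(op_and, op_add, 0)` counts common neighbours in a 0/1 adjacency matrix. The function-pointer indirection is only for readability; a generated kernel would inline the operators.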

SLIDE 4

High Performance Micro-Kernels

• Cast micro-kernel as outer product
• Enumerate all possible tilings given ISA
• Use models to select from design space
• Aggressively schedule and optimize

Veras, R., Smith, T., Low, T. M., Franchetti, F., van de Geijn, R. [CGO 2017, submitted]
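As a rough illustration of what "micro-kernel as outer product" means, the sketch below updates an `M_R` x `N_R` block of C with a sequence of rank-1 updates, one per step of the shared dimension. The block sizes, the packed panel layout, and the name `microkernel_outer` are assumptions for this example, and the scalar loops stand in for the vector instructions a real kernel would use.

```c
#include <stddef.h>

/* Register-block dimensions; illustrative values only. */
#define M_R 4
#define N_R 4

/* C (M_R x N_R, row-major, leading dimension ldc) += A_panel * B_panel.
 * a_panel is packed as k_b consecutive columns of M_R values,
 * b_panel is packed as k_b consecutive rows of N_R values. */
void microkernel_outer(size_t k_b, const double *a_panel,
                       const double *b_panel, double *C, size_t ldc)
{
    double c_acc[M_R][N_R] = {{0.0}};

    for (size_t p = 0; p < k_b; p++) {
        const double *a = a_panel + p * M_R; /* p-th column of A */
        const double *b = b_panel + p * N_R; /* p-th row of B */

        /* Rank-1 update: c_acc += a * b^T */
        for (size_t i = 0; i < M_R; i++)
            for (size_t j = 0; j < N_R; j++)
                c_acc[i][j] += a[i] * b[j];
    }

    /* Write the accumulated block back into C. */
    for (size_t i = 0; i < M_R; i++)
        for (size_t j = 0; j < N_R; j++)
            C[i * ldc + j] += c_acc[i][j];
}
```

Keeping `c_acc` local lets a compiler hold the whole block in registers for the duration of the `p` loop, which is the property the outer-product formulation is after.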

SLIDE 5

High Performance Micro-Kernels

[Figure legend: OpenBLAS, Our Generated, ATLAS]

Veras, R., Smith, T., Low, T. M., Franchetti, F., van de Geijn, R. [CGO 2017, submitted]

SLIDE 6

Automating with Compiler Techniques

Subtle semiring changes impact performance.
Minimizing stalls is cast as an ILP problem.

Image credits: extremetech.com, massey.ac.nz, cs.duke.edu

SLIDE 7

A Generator for GEMM-Like Kernels

Input:
Betweenness Centrality (Floyd-Warshall)
Semiring: $(\mathbb{Z}, +, \mathrm{MIN}, 0, \infty)$
GEMM-like: $C \leftarrow AB + C$
  Initialize: $C_{acc} \leftarrow 0$
  Compute: $C_{acc} \leftarrow \sum_{p} a b^{T}$
  Accumulate: $C \leftarrow f(C_{acc})$

Kernel algorithm from our design space:

Output:
Kernel sustains high throughput: tuned to the target architecture

SLIDE 8

Our GEMM Generator Pipeline

Find efficient instruction mix (algorithm):
• User defines block size, ISA and semiring
• Enumerate algorithm space
• Top candidate selected

Turn mix into efficient code (implementation):
• Template created for candidate
• Template transformed into optimized and scheduled code

SLIDE 9

From Math to Tiling

GEMM-like: $C \leftarrow AB + C$
  Initialize: $C_{acc} \leftarrow 0$
  Compute: $C_{acc} \leftarrow \sum_{p} a b^{T}$
  Accumulate: $C \leftarrow f(C_{acc})$

• Start with ISA
• Identify small outer products from ISA
• Enumerate space of outer products
• Select best mix with queueing model
• Result: a high-throughput mix
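The enumerate-then-select step can be sketched as follows. This is only a toy: the talk selects candidates with a queueing model, whereas the score used here (multiply-adds per element loaded, `m_r * n_r / (m_r + n_r)`) is a much simpler stand-in chosen for illustration. The register-count formula, the type `tiling_t`, and the functions `regs_needed` and `select_tiling` are likewise assumptions.

```c
/* One candidate register tiling of the C block (hypothetical type). */
typedef struct {
    int m_r;      /* rows of the C block, a multiple of the vector width */
    int n_r;      /* columns of the C block */
    double score; /* stand-in cost model: higher is better */
} tiling_t;

/* Assumed register budget: the C block, one column of A,
 * and one register holding the current element of B. */
int regs_needed(int m_r, int n_r, int v)
{
    int m_vecs = m_r / v;
    return m_vecs * n_r + m_vecs + 1;
}

/* Enumerate every (m_r, n_r) that fits in num_regs vector registers of
 * width v and keep the one with the best stand-in score. */
tiling_t select_tiling(int v, int num_regs)
{
    tiling_t best = {0, 0, 0.0};

    for (int m_r = v; m_r <= v * num_regs; m_r += v) {
        for (int n_r = 1; regs_needed(m_r, n_r, v) <= num_regs; n_r++) {
            double score = (double)(m_r * n_r) / (double)(m_r + n_r);
            if (score > best.score)
                best = (tiling_t){m_r, n_r, score};
        }
    }
    return best;
}
```

For a width of 4 and 16 vector registers, this toy model settles on an 8 x 6 block. A real model would also account for instruction latencies and port pressure, which is where the queueing model comes in.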

SLIDE 10

From Tiling to Template

Initialize: $C_{acc} \leftarrow 0$

    for( ii = 0; ii < m_r; ii++ )
      for( jj = 0; jj < n_r; jj++ )
        init( c_reg, ii, jj );

Compute: $C_{acc} \leftarrow \sum_{p} a b^{T}$

    for( pp = 0; pp < k_b; pp++ )          /* perform the outer products */
      for( i = 0; i < m_r; i+=m_s )
        for( j = 0; j < n_r; j+=n_s )
          for( ii = i; ii < i+m_s; ii++ ) {
            get_a_elem( a_reg, ii, j );
            for( jj = j; jj < j+n_s; jj++ ) {
              get_b_elem( b_reg, ii, jj );
              apply( c_reg, a_reg, b_reg, ii, jj, pp );
            }
          }

Accumulate: $C \leftarrow f(C_{acc})$

    for( ii = 0; ii < m_r; ii++ )
      for( jj = 0; jj < n_r; jj++ )
        accumulate( C, c_reg, ii, jj );

Selected outer product: (figure)

Embedding function (get_b):

    def get_b_elem( var array b_reg[][], ptr B, ii, jj )
      opts = {
        0: assign( b_reg[jj], vload(B,jj) ),
        1: assign( b_reg[jj], shuffle(b_reg[jj-1]) ),
        2: assign( b_reg[jj], permute(b_reg[jj-1]) ),
        3: assign( b_reg[jj], shuffle(b_reg[jj-1]) ) }
      if ii mod v = 0
        return opts[jj]
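The load, shuffle, permute, shuffle sequence in the embedding function can be checked with a scalar emulation. The sketch below assumes a vector width of 4 and assumes that "shuffle" swaps neighbouring lanes (as `vshufpd` with immediate 0x5 does) and that "permute" swaps the two halves (as `vperm2f128` with immediate 0x01 does). The type `vec_t` and all function names are hypothetical.

```c
#define VLEN 4 /* assumed vector width: 4 doubles, as in 256-bit AVX */

typedef struct { double lane[VLEN]; } vec_t;

/* Swap neighbours within each pair of lanes. */
vec_t shuffle_pairs(vec_t x)
{
    vec_t r = {{ x.lane[1], x.lane[0], x.lane[3], x.lane[2] }};
    return r;
}

/* Swap the lower and upper halves of the vector. */
vec_t permute_halves(vec_t x)
{
    vec_t r = {{ x.lane[2], x.lane[3], x.lane[0], x.lane[1] }};
    return r;
}

/* Build four arrangements of B from a single load:
 * load, shuffle, permute, shuffle. */
void get_b_regs(const double *B, vec_t b_reg[VLEN])
{
    for (int l = 0; l < VLEN; l++)
        b_reg[0].lane[l] = B[l];
    b_reg[1] = shuffle_pairs(b_reg[0]);
    b_reg[2] = permute_halves(b_reg[1]);
    b_reg[3] = shuffle_pairs(b_reg[2]);
}

/* c_reg[jj] += A * b_reg[jj], lane by lane. The four vector multiply-adds
 * cover all 16 products A[i] * B[j] exactly once, in a permuted layout. */
void outer_product_step(const double *A, const double *B, vec_t c_reg[VLEN])
{
    vec_t b_reg[VLEN];
    get_b_regs(B, b_reg);
    for (int jj = 0; jj < VLEN; jj++)
        for (int l = 0; l < VLEN; l++)
            c_reg[jj].lane[l] += A[l] * b_reg[jj].lane[l];
}
```

Because every pair (i, j) appears exactly once, the sum of all lanes after one step equals (sum of A) times (sum of B). The permuted layout has to be undone, or accounted for, when the block is written back to C.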

SLIDE 11

Floyd-Warshall Embedded in GEMM

Semiring:
  DGEMM:          $(\mathbb{R}, \ast, +, 1, 0)$
  Floyd-Warshall: $(\mathbb{Z}, +, \mathrm{MIN}, 0, \infty)$

Initialize: $C_{acc} \leftarrow 0$
  DGEMM:          assign( c_reg[ii,jj], 0 )
  Floyd-Warshall: assign( c_reg[ii,jj], (ii==jj) ? 0 : INFINITY )

Compute: $C_{acc} \leftarrow \sum_{p} a b^{T}$
  DGEMM:          assign( c_reg[ii,jj], a_reg[ii,pp] * b_reg[pp,jj] + c_reg[ii,jj] )
  Floyd-Warshall: assign( c_reg[ii,jj], MIN( a_reg[ii,pp] + b_reg[pp,jj], c_reg[ii,jj] ) )

Accumulate: $C \leftarrow f(C_{acc})$
  DGEMM:          assign( C[(ii,jj)], c_reg[ii,jj] + C[(ii,jj)] )
  Floyd-Warshall: assign( C[(ii,jj)], MIN( c_reg[ii,jj], C[(ii,jj)] ) )
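Putting the three phases together for the Floyd-Warshall column gives a kernel like the sketch below. It is a scalar approximation under several assumptions: the block sizes, the packed panel layout, and the name `minplus_kernel` are made up for the example, and the accumulator is initialized to INFINITY everywhere (the identity of MIN) rather than to 0 on the diagonal, so that the same code works for tiles that are not on the diagonal of C.

```c
#include <math.h>   /* INFINITY */
#include <stddef.h>

#define M_R 4
#define N_R 4
#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* C (M_R x N_R, row-major, leading dimension ldc) is updated with one
 * min-plus product of the packed panels:
 * a_panel holds k_b columns of M_R values, b_panel holds k_b rows of
 * N_R values. */
void minplus_kernel(size_t k_b, const double *a_panel,
                    const double *b_panel, double *C, size_t ldc)
{
    double c_reg[M_R][N_R];

    /* Initialize: fill the accumulator with the identity of MIN. */
    for (size_t ii = 0; ii < M_R; ii++)
        for (size_t jj = 0; jj < N_R; jj++)
            c_reg[ii][jj] = INFINITY;

    /* Compute: "multiply" is +, "add" is MIN. */
    for (size_t pp = 0; pp < k_b; pp++)
        for (size_t ii = 0; ii < M_R; ii++)
            for (size_t jj = 0; jj < N_R; jj++)
                c_reg[ii][jj] = MIN(a_panel[pp * M_R + ii] +
                                    b_panel[pp * N_R + jj],
                                    c_reg[ii][jj]);

    /* Accumulate: fold the block into C with MIN. */
    for (size_t ii = 0; ii < M_R; ii++)
        for (size_t jj = 0; jj < N_R; jj++)
            C[ii * ldc + jj] = MIN(c_reg[ii][jj], C[ii * ldc + jj]);
}
```

Swapping the three marked statements for `0`, multiply-add, and `+=` turns the same loop structure back into a DGEMM block update, which is the point of treating the semiring as a parameter.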

SLIDE 12

Scheduling the Problem

We have built the kernel code; now we need to schedule it.

Static scheduling still matters on out-of-order (OOO) processors.

Pipeline for scheduling:
• Input: built kernel + embedding function
• Express as decision variables
• Formulate constraints over the variables
• Minimize with ILP solver
• Output: scheduled kernel code

SLIDE 13

OASIC approach for ILP Scheduling

Representing the design space as a polytope:

Decision variable: $x_{nt}^{k} = 1$ when instruction n is executed on functional unit k at time step t (n: instruction label, t: cycle, k: functional unit).

Expressing constraints in terms of x:

Every instruction n is executed once:
  $\sum_{k} \sum_{t} x_{nt}^{k} = 1$

At any time step t, functional unit k is used no more than its capacity $R_k$ allows:
  $\sum_{n} x_{nt}^{k} \le R_k$

If instruction m depends on instruction n, then m will not execute until l cycles after n. For every cycle $\tau$:
  $\sum_{k} \sum_{t \le \tau} x_{mt}^{k} + \sum_{k} \sum_{t > \tau - l} x_{nt}^{k} \le 1$

But wait, there's more!
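One way to read the three constraints is as a feasibility test on a candidate schedule. The sketch below is not an ILP solver; it only checks a schedule that has already been chosen. It assumes each instruction is described by a single (unit, cycle) pair, so "executed once" holds by construction and reduces to a range check. The types `slot_t` and `dep_t` and the function `schedule_is_feasible` are hypothetical names.

```c
/* Where and when one instruction runs (hypothetical type). */
typedef struct { int unit; int cycle; } slot_t;

/* consumer may not start until `latency` cycles after producer. */
typedef struct { int producer; int consumer; int latency; } dep_t;

/* Returns 1 if the schedule satisfies all three constraints, else 0.
 * capacity[k] is the number of instructions unit k accepts per cycle. */
int schedule_is_feasible(const slot_t *slot, int num_instr,
                         const int *capacity, int num_units, int num_cycles,
                         const dep_t *dep, int num_deps)
{
    /* Constraint 1: every instruction has one valid (unit, cycle) slot. */
    for (int n = 0; n < num_instr; n++) {
        if (slot[n].unit < 0 || slot[n].unit >= num_units) return 0;
        if (slot[n].cycle < 0 || slot[n].cycle >= num_cycles) return 0;
    }

    /* Constraint 2: no unit is oversubscribed in any cycle. */
    for (int t = 0; t < num_cycles; t++) {
        for (int k = 0; k < num_units; k++) {
            int used = 0;
            for (int n = 0; n < num_instr; n++)
                if (slot[n].unit == k && slot[n].cycle == t)
                    used++;
            if (used > capacity[k]) return 0;
        }
    }

    /* Constraint 3: dependent instructions respect the latency. */
    for (int d = 0; d < num_deps; d++) {
        int produced_at = slot[dep[d].producer].cycle;
        int consumed_at = slot[dep[d].consumer].cycle;
        if (consumed_at < produced_at + dep[d].latency) return 0;
    }
    return 1;
}
```

An ILP solver searches over all assignments that pass these checks and picks the one that minimizes the objective, for example the total number of cycles.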

SLIDE 14

Emitting The Code

Code is now scheduled.

Need custom ANSI C compliant SIMD wrappers to schedule in the compiler (is this necessary?):

    #define VADD(srca,srcb,dest) \
      asm volatile( "vaddpd %[vsrca],%[vsrcb], %[vdest]" \
                    : [vdest] "=x"(dest) \
                    : [vsrca] "x"(srca), [vsrcb] "x"(srcb));

Code is emitted:

    for( pp = 0; pp < k_b; pp+=KUNR ) {
      /* STEADY STATE CODE */
      VLOAD_IA(GET_A_ADDR(0),GET_A_REG(0))
      VLOAD_IA(GET_A_ADDR(1),GET_A_REG(1))
      VLOAD_IA(GET_B_ADDR(0),GET_B_REG(0))
      VSHUFFLE_IA(0x05,GET_B_REG(0),GET_B_REG(1))
      VFMA(GET_A_REG(0),GET_B_REG(0),GET_C_REG(0,0))
      VFMA(GET_A_REG(0),GET_B_REG(1),GET_C_REG(0,1))
      VPERM2F128_IA(0x01,GET_B_REG(1),GET_B_REG(2))
      VSHUFFLE_IA(0x05,GET_B_REG(2),GET_B_REG(3))
      VFMA(GET_A_REG(1),GET_B_REG(0),GET_C_REG(0,0))
      VFMA(GET_A_REG(1),GET_B_REG(1),GET_C_REG(0,1))
    }

SLIDE 15

Putting it All Together

We have an operation that we can express like GEMM: $C \leftarrow AB + C$
  Initialize: $C_{acc} \leftarrow 0$
  Compute: $C_{acc} \leftarrow \sum_{p} a b^{T}$
  Accumulate: $C \leftarrow f(C_{acc})$

Run it through our pipeline.

SLIDE 16

Moving Forward

Centrality: ~$A^{T} A$ (betweenness)
Clustering: ~$A A^{T} A$ (triangle)
Community Detection: ~$A^{K}$

SLIDE 17

Summary

• There exists a large class of GEMM-like operations
• Obtaining DGEMM-level performance for each of these operations requires automation
• We have a systematic approach for automatically generating DGEMM
• We are extending it by allowing the user to define a semiring with an initialize and accumulate function