

slide-1
SLIDE 1

ERLANGEN REGIONAL COMPUTING CENTER [RRZE]

From Tool Supported Performance Modeling of Regular Algorithms to Modeling of Irregular Algorithms

Julian Hammer Georg Hager Jan Eitzinger Gerhard Wellein

slide-2
SLIDE 2

Overview

  • 1. Loop Kernels
  • 2. Roofline and ECM
  • 3. Kerncraft
      • 3.1 Overview and Structure
      • 3.2 Output and Results
  • 4. 3D-long-range Example
  • 5. Outlook
      • 5.1 Irregular Algorithms

Scalable Tools Workshop 2016 | Kerncraft | Julian Hammer

slide-3
SLIDE 3

LOOP KERNELS

Automatic Loop Kernel Analysis and Performance Modeling with Kerncraft

slide-4
SLIDE 4

Loop Kernels

§ Many inner-loop iterations
§ No branching
§ Access fully determined by loop counters (i.e., no irregularities)

double a[5000], b[5000];
double s;
for(i=0; i<5000; ++i)
    a[i] = s * b[i];

double a[5000][5000];
double b[5000][5000];
double s;
for(j=1; j<5000-1; ++j)
    for(i=1; i<5000-1; ++i)
        b[j][i] = ( a[j][i-1] + a[j][i+1]
                  + a[j-1][i] + a[j+1][i] ) * s;

slide-5
SLIDE 5

Loop Kernels

Streaming Kernel
§ Simple structure
§ No data-reuse

double a[5000], b[5000];
double s;
for(i=0; i<5000; ++i)
    a[i] = s * b[i];

Stencil Code
§ Complex structure
§ Heavy data-reuse

double a[5000][5000];
double b[5000][5000];
double s;
for(j=1; j<5000-1; ++j)
    for(i=1; i<5000-1; ++i)
        b[j][i] = ( a[j][i-1] + a[j][i+1]
                  + a[j-1][i] + a[j+1][i] ) * s;

slide-6
SLIDE 6

Loop Kernels

How to predict performance on complex architectures? Two major contributions/bottlenecks:

§ In-core execution / arithmetic operations
§ Memory and cache transfers

[Figure: block diagram of a CPU core with control flow and data flow. Components shown: L1 instruction cache, four decoders, reorder buffer / register renaming, scheduler, ports 0 to 5 with execution units (ALU, ADD, MULT, DIV, JMP, MOV/MASK, AGU, LOAD, STORE), register file, L1 Dcache, memory control. Potential bottlenecks are marked.]

slide-7
SLIDE 7

ROOFLINE AND ECM

Automatic Loop Kernel Analysis and Performance Modeling with Kerncraft

slide-8
SLIDE 8

Roofline

§ All memory levels are separate bottlenecks

$P = \min(P_\mathrm{comp.}, I \cdot b_s)$

  P_comp.  Peak performance [FLOP/s]
  I        Operational intensity [FLOP/B]
  b_s      Peak bandwidth [B/s]

§ Bandwidths are measured by suitable benchmarks

[Figure: Roofline diagram with data contributions in FLOP/s; the predicted performance is marked.]
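The Roofline formula can be sketched in a few lines of Python. This is only an illustration of P = min(P_comp., I · b_s) applied per memory level; the function name `roofline` and all numbers are hypothetical and not taken from Kerncraft or from a measurement.

```python
def roofline(p_comp, levels):
    """Roofline prediction: P = min(P_comp, I * b_s) over all memory levels.

    p_comp -- peak in-core performance [FLOP/s]
    levels -- dict: level name -> (operational intensity [FLOP/B],
                                   peak bandwidth [B/s])
    Returns (predicted performance [FLOP/s], name of the limiting level).
    """
    best, bottleneck = p_comp, "core"
    for name, (intensity, bandwidth) in levels.items():
        limit = intensity * bandwidth
        if limit < best:
            best, bottleneck = limit, name
    return best, bottleneck


# Hypothetical inputs, chosen only to show the mechanics.
perf, limit = roofline(10e9, {"L1": (0.5, 100e9), "MEM": (0.25, 16e9)})
print(perf, limit)
```

Each memory level is treated as an independent ceiling, so the slowest one alone determines the prediction.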

slide-9
SLIDE 9

Roofline: Performance vs Time

§ CPU frequency? → Cycles!
§ Basic memory units? → Cache Lines! (64 Byte)

[Figure: Roofline diagram with data contributions in FLOP/s; the predicted performance is marked.]

slide-10
SLIDE 10

Roofline: Performance vs Time

§ CPU frequency? → Cycles!
§ Basic memory units? → Cache Lines! (1 CL = 64 B)
  → 1 unit of work = 1 CL
  → Cycles per cache line (cy/CL)
§ Lower is better

[Figure: Roofline diagram with data contributions in cy/CL; the predicted runtime is marked.]
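Switching from FLOP/s to cy/CL is plain unit arithmetic. The sketch below assumes a fixed clock frequency and a known number of FLOP per cache line of work; the helper name and the example performance value are made up for illustration.

```python
def cycles_per_cache_line(perf_flops, clock_hz, flop_per_cl):
    """Convert a performance figure [FLOP/s] into runtime per unit of work.

    One unit of work is one cache line (1 CL = 64 B, i.e. 8 doubles),
    so the result is in cy/CL. Lower is better.
    """
    seconds_per_cl = flop_per_cl / perf_flops
    return seconds_per_cl * clock_hz


# Hypothetical: 4 GFLOP/s at 2.7 GHz with 32 FLOP per cache line of work.
print(cycles_per_cache_line(4e9, 2.7e9, 32))
```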

slide-11
SLIDE 11

Execution-Cache-Memory (ECM) Model

§ Memory and cache levels contribute to runtime

  T_OL      computation & stores
  T_nOL     loads from L1
  T_L1-L2   loads from L2 into L1
  T_L2-L3   loads from L3 into L2
  T_L3-MEM  loads from main memory into L3

{ T_OL || T_nOL | T_L1-L2 | T_L2-L3 | T_L3-MEM }

§ One measured input: full-socket mem. bandwidth

[Figure: Roofline diagram (data, cy/CL) next to ECM diagram (STORE & comp. and data, cy/CL); the predicted runtime is marked.]
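A minimal sketch of how the ECM contributions combine into one runtime, assuming the usual reading of the notation: the overlapping part T_OL runs in parallel (||) with the sum of the non-overlapping contributions (|). The function name is hypothetical; the input numbers are the ones from the Kerncraft ECM output for 2d-5pt.c shown later in the talk.

```python
def ecm_runtime(t_ol, t_nol, transfers):
    """Single-core ECM prediction in cy/CL.

    t_ol      -- overlapping in-core time (computation & stores)
    t_nol     -- non-overlapping in-core time (loads from L1)
    transfers -- [T_L1-L2, T_L2-L3, T_L3-MEM]
    Assumes all data contributions add up and only T_OL overlaps them.
    """
    return max(t_ol, t_nol + sum(transfers))


# { 9.0 || 8.0 | 10 | 6 | 12.74 } from the 2d-5pt.c example output.
print(ecm_runtime(9.0, 8.0, [10, 6, 12.74]))
```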

slide-12
SLIDE 12

Roofline and ECM

[Figure: Roofline diagram (data, cy/CL) next to ECM diagram (STORE & comp. and data, cy/CL).]

slide-13
SLIDE 13

Performance Modeling

STREAM Scale
§ For each cache line (8 it.):
  › 1 CL is stored
  › 8 FLOP
  › 1 CL is loaded

double a[5000], b[5000];
double s;
for(i=0; i<5000; ++i)
    a[i] = s * b[i];

2D 5-point Stencil
§ For each cache line (8 it.):
  › 1 CL is stored
  › 32 FLOP
  › Up to 3 CL are loaded

Up to 3?

double a[5000][5000];
double b[5000][5000];
double s;
for(j=1; j<5000-1; ++j)
    for(i=1; i<5000-1; ++i)
        b[j][i] = ( a[j][i-1] + a[j][i+1]
                  + a[j-1][i] + a[j+1][i] ) * s;
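The per-cache-line counts above translate directly into an operational intensity. The sketch below is a rough illustration: the function name is hypothetical, and the extra write-allocate transfer per stored line is an assumption that the slide does not state. With that assumption and one loaded line, the stencil comes out at about 0.17 FLOP/B, the value reported in the Roofline output later in the talk.

```python
CL_BYTES = 64          # cache line size from the talk
ITERATIONS_PER_CL = 8  # 8 doubles per cache line


def intensity(flop_per_iteration, cl_loaded, cl_stored, write_allocate=True):
    """Operational intensity [FLOP/B] for one cache line of work.

    write_allocate=True adds one extra load per stored line; this is an
    assumption, not something taken from the slide.
    """
    flop = flop_per_iteration * ITERATIONS_PER_CL
    lines = cl_loaded + cl_stored + (cl_stored if write_allocate else 0)
    return flop / (lines * CL_BYTES)


print(intensity(1, 1, 1))  # STREAM Scale: 8 FLOP per cache line
print(intensity(4, 1, 1))  # 2D 5-point stencil, best case: 1 CL loaded
print(intensity(4, 3, 1))  # 2D 5-point stencil, worst case: 3 CL loaded
```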

slide-14
SLIDE 14

Layer Conditions

( a[j][i-1] + a[j][i+1] + a[j-1][i] + a[j+1][i] )

1D layer condition: stencil-width * stencil-height < cache-size
2D layer condition: stencil-height * matrix-width < cache-size
nD layer condition:

$C_\mathrm{req.} = \left( \sum L_\mathrm{rel.offsets} + \max(L_\mathrm{rel.offsets}) \cdot n_\mathrm{slices} \right) \cdot s$

[Figure: stencil access pattern laid over the workload, with cache hits and misses marked.]
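One possible reading of the nD formula, sketched in Python. The interpretation is an assumption: L_rel.offsets is taken as the gaps between consecutive linearized access offsets of the stencil, n_slices as 2, and s as the element size in bytes. It is chosen because it reproduces the 2D thresholds printed by the LC model later in the talk (N <= 1024, 8192, and 655360 for 32 kB, 256 kB, and 20 MB); it is not taken from the Kerncraft sources.

```python
def required_cache_bytes(offsets, n_slices, element_size):
    """C_req = (sum(L) + max(L) * n_slices) * s, with L taken as the gaps
    between consecutive sorted offsets (an assumed interpretation)."""
    ordered = sorted(offsets)
    gaps = [hi - lo for lo, hi in zip(ordered, ordered[1:])]
    return (sum(gaps) + max(gaps) * n_slices) * element_size


def layer_condition_2d(n, cache_bytes):
    """True if the 2D layer condition of the 5-point stencil holds for an
    inner dimension of n doubles in a cache of cache_bytes."""
    # Linearized offsets of a[j-1][i], a[j][i-1], a[j][i+1], a[j+1][i].
    offsets = [-n, -1, 1, n]
    return required_cache_bytes(offsets, n_slices=2, element_size=8) <= cache_bytes


for name, size in (("L1", 32 * 1024), ("L2", 256 * 1024), ("L3", 20 * 1024**2)):
    print(name, layer_condition_2d(1024, size), layer_condition_2d(10000, size))
```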

slide-15
SLIDE 15

Performance Modeling

double U[M][N][N];
double V[M][N][N];
double ROC[M][N][N];
double c0, c1, c2, c3, c4, lap;

for(int k=4; k < M-4; k++) {
    for(int j=4; j < N-4; j++) {
        for(int i=4; i < N-4; i++) {
            lap = c0 * V[k][j][i]
                + c1 * ( V[ k ][ j ][i+1] + V[ k ][ j ][i-1])
                + c1 * ( V[ k ][j+1][ i ] + V[ k ][j-1][ i ])
                + c1 * ( V[k+1][ j ][ i ] + V[k-1][ j ][ i ])
                + c2 * ( V[ k ][ j ][i+2] + V[ k ][ j ][i-2])
                + c2 * ( V[ k ][j+2][ i ] + V[ k ][j-2][ i ])
                + c2 * ( V[k+2][ j ][ i ] + V[k-2][ j ][ i ])
                + c3 * ( V[ k ][ j ][i+3] + V[ k ][ j ][i-3])
                + c3 * ( V[ k ][j+3][ i ] + V[ k ][j-3][ i ])
                + c3 * ( V[k+3][ j ][ i ] + V[k-3][ j ][ i ])
                + c4 * ( V[ k ][ j ][i+4] + V[ k ][ j ][i-4])
                + c4 * ( V[ k ][j+4][ i ] + V[ k ][j-4][ i ])
                + c4 * ( V[k+4][ j ][ i ] + V[k-4][ j ][ i ]);
            U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i]
                       + ROC[k][j][i] * lap;
}}}

slide-16
SLIDE 16

KERNCRAFT

Automatic Loop Kernel Analysis and Performance Modeling with Kerncraft

slide-17
SLIDE 17

Kerncraft

[Figure: Kerncraft workflow diagram with the columns Input, Intermediate, and Output.]

Workflow elements:
  • user-input: kernel code, constants
  • machine file (likwid-bench, documentation)
  • pycparser → abstract syntax tree (AST) → data pattern
  • compiler → binary marked for IACA → IACA throughput analysis (in-core)
  • cache usage prediction with pycachesim, or symbolic application of LC formulation → data transfers
  • T_OL, T_nOL (in-core) and T_L1L2, T_L2L3, T_L3MEM (data transfers)
  • ECM/Roofline model, Layer Condition model

kernel code:

#define N 1000
#define M 2000
for(j=1; j < N-1; ++j)
    for(i=1; i < M-1; ++i)
        b[j][i] = (a[ j ][i-1] + a[ j ][i+1]
                 + a[j-1][ i ] + a[j+1][ i ] ) * s;

data pattern:

name | offsets ...
-----+------------...
   a | ('rel', 'j', 0), ('rel', 'i', -1)
     | ('rel', 'j', 0), ('rel', 'i', 1)
     | ('rel', 'j', -1), ('rel', 'i', 0)
     | ('rel', 'j', 1), ('rel', 'i', 0)
   s | ('dir',)

binary:

vmovsd (%rsi,%rbx,8), %xmm1
vaddsd 16(%rsi,%rbx,8), %xmm1, %xmm2
vaddsd 8(%rdx,%rbx,8), %xmm2, %xmm3
vaddsd 8(%rcx,%rbx,8), %xmm3, %xmm4
vaddsd 8(%r8,%rbx,8), %xmm4, %xmm5
vaddsd 8(%r9,%rbx,8), %xmm5, %xmm6
vmulsd %xmm6, %xmm0, %xmm7

machine file:

clock: 2.7 GHz
cacheline size: 64 B
memory hierarchy:
  - {cores per group: 1, cycles per cacheline: 2,
     level: L1, size per group: 32 kB}
  - {cores per group: 1, cycles per cacheline: 2,
     level: L2, size per group: 256 kB}
  - {cores per group: 8, bandwidth: 40 GB/s,
     level: L3, size per group: 20 MB}
  [...]
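The machine file mixes two kinds of transfer cost: cycles per cache line (L1, L2) and a bandwidth (L3 entry). A bandwidth can be turned into cy/CL using the clock and cache line size from the same file. The helper below is a hypothetical illustration of that arithmetic and does not show how Kerncraft itself evaluates the file.

```python
def bandwidth_to_cy_per_cl(bandwidth_bytes_per_s, clock_hz, cl_bytes=64):
    """Cycles needed to move one cache line at the given bandwidth."""
    return cl_bytes * clock_hz / bandwidth_bytes_per_s


# Values from the machine file excerpt: 40 GB/s at 2.7 GHz, 64 B lines.
print(bandwidth_to_cy_per_cl(40e9, 2.7e9))
```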

slide-18
SLIDE 18

Kerncraft – Output

$ kerncraft --machine snb.yaml 2d-5pt.c --pmodel ECM -D N 5000 -D M 500
===============================================================================
kernels/2d-5pt.c
===============================================================================
{ 9.0 || 8.0 | 10 | 6 | 12.74 } = 36.74 cy/CL
{ 9.0 \ 18.00 \ 24.00 \ 36.74 } cy/CL
saturating at 3 cores
$
$ kerncraft --machine snb.yaml 2d-5pt.c --pmodel Roofline --unit cy/CL -D N 5000 -D M 500
===============================================================================
kernels/2d-5pt.c
===============================================================================
Cache or mem bound with 1 core(s)
29.79 cy/CL due to L3-MEM transfer bottleneck (bw from copy benchmark)
Arithmetic Intensity: 0.17 FLOP/b
$

ECM model: { T_OL || T_nOL | T_L1-L2 | T_L2-L3 | T_L3-MEM }

double a[M][N];
double b[M][N];
double s;
for(j=1; j<M-1; ++j)
    for(i=1; i<N-1; ++i)
        b[j][i] = ( a[j][i-1] + a[j][i+1]
                  + a[j-1][i] + a[j+1][i] ) * s;
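The second and third output lines of the ECM run can be recomputed from the first. The sketch below assumes the standard ECM reading: each entry of the `\`-separated list is the prediction for data coming from L1, L2, L3, and memory, and the saturation point is the runtime from memory divided by T_L3-MEM, rounded up. Function names are hypothetical.

```python
import math


def ecm_cumulative(t_ol, t_nol, transfers):
    """Predictions [cy/CL] for data residing in L1, L2, L3, MEM."""
    predictions = []
    t_data = t_nol
    predictions.append(max(t_ol, t_data))
    for t in transfers:
        t_data += t
        predictions.append(max(t_ol, t_data))
    return predictions


def saturation_cores(t_ecm_mem, t_l3_mem):
    """Cores needed until the memory interface becomes the bottleneck."""
    return math.ceil(t_ecm_mem / t_l3_mem)


# { 9.0 || 8.0 | 10 | 6 | 12.74 } from the output above.
preds = ecm_cumulative(9.0, 8.0, [10, 6, 12.74])
print(preds)
print(saturation_cores(preds[-1], 12.74))
```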

slide-19
SLIDE 19

Kerncraft – Results

[Figure, built up step by step: 2d-5pt.c in memory on single Intel Xeon E5-2680 (SandyBridge) core. Predicted cy/CL over N (of N*M matrix), x-axis marks at 10, 1000, 8111, 705480; annotated values 32.7, 36.7, 40.7, 49.9 cy/CL. Series added in turn: Roofline, ECM, OL, nOL, L1-L2, L2-L3, L3-MEM.]

slide-20
SLIDE 20

Kerncraft – Results

[Figure: 2d-5pt.c in memory on single Intel Xeon E5-2680 (SandyBridge) core. Predicted cy/CL over N (of N*M matrix) with the series L3-MEM, L2-L3, L1-L2, nOL, OL, Roofline, ECM. Annotations: spatial blocking, temporal blocking.]

slide-21
SLIDE 21

Kerncraft – Spatial Blocking

[Figure: 2d-5pt.c in memory on single Intel Xeon E5-2680 (SandyBridge) core. Predicted cy/CL over N (of N*M matrix) for Roofline and ECM. Inner-dim. block: 768, 20032, 2048.]

slide-22
SLIDE 22

Kerncraft – Verbose Output

$ kerncraft --machine snb.yaml 2d-5pt.c --pmodel LC
[…]
1D Layer-Condition:
L1: unconditionally fulfilled
L2: unconditionally fulfilled
L3: unconditionally fulfilled
2D Layer-Condition:
L1: N <= 1024
L2: N <= 8192
L3: N <= 655360
$

Layer condition analysis is also available as a web-based calculator: https://rrze-hpc.github.io/layer-condition/#calculator

slide-23
SLIDE 23

Kerncraft – In-Socket Scaling

[Figure: 2D-5pt in memory on Intel Xeon E5-2680 core with OpenMP schedule(static, 1024). Runtime per full matrix sweep [s] over 1 to 8 cores for inner loop lengths N = 1024, 2048, 4096, 8192, 20000, 23000, 26000, with one predicted scaling curve for 1024, 2048, 4096 and 8192 and one for 20000, 23000 and 26000.]

slide-24
SLIDE 24

3D-LONG-RANGE EXAMPLE

double U[M][N][N];
double V[M][N][N];
double ROC[M][N][N];
double c0, c1, c2, c3, c4, lap;

for(int k=4; k < M-4; k++) {
    for(int j=4; j < N-4; j++) {
        for(int i=4; i < N-4; i++) {
            lap = c0 * V[k][j][i]
                + c1 * ( V[ k ][ j ][i+1] + V[ k ][ j ][i-1])
                + c1 * ( V[ k ][j+1][ i ] + V[ k ][j-1][ i ])
                + c1 * ( V[k+1][ j ][ i ] + V[k-1][ j ][ i ])
                + c2 * ( V[ k ][ j ][i+2] + V[ k ][ j ][i-2])
                + c2 * ( V[ k ][j+2][ i ] + V[ k ][j-2][ i ])
                + c2 * ( V[k+2][ j ][ i ] + V[k-2][ j ][ i ])
                + c3 * ( V[ k ][ j ][i+3] + V[ k ][ j ][i-3])
                + c3 * ( V[ k ][j+3][ i ] + V[ k ][j-3][ i ])
                + c3 * ( V[k+3][ j ][ i ] + V[k-3][ j ][ i ])
                + c4 * ( V[ k ][ j ][i+4] + V[ k ][ j ][i-4])
                + c4 * ( V[ k ][j+4][ i ] + V[ k ][j-4][ i ])
                + c4 * ( V[k+4][ j ][ i ] + V[k-4][ j ][ i ]);
            U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i]
                       + ROC[k][j][i] * lap;
}}}

slide-25
SLIDE 25

3D-long-range Example

[Figure: 3d-long-range-stencil.c in memory on single Intel Xeon E5-2680 (SandyBridge) core. Predicted cy/CL over N (of N*N*M matrix), x-axis marks at 10, 21, 57, 215, 497, 1747; annotated values 86.0, 102.0, 118.0, 134.0, 169.6, 185.6 cy/CL. Series: L3-MEM, L2-L3, L1-L2, nOL, OL, Roofline. Annotated steps: L1-L2 +16cy, L2-L3 +16cy, L3-MEM +34.6cy. For each cache level (L1 32kB, L2 256kB, L3 20MB) the plot marks which layer condition (1D, 2D, 3D) holds in each range of N.]

slide-26
SLIDE 26

Benefits

§ Guiding optimizations
§ Hardware-software co-design
§ Energy optimized computing
§ Deeper understanding of code and hardware interactions

Kerncraft...
§ is a white-box utility
§ takes some of the pain out of performance modeling
§ is free (as in free beer and freedom)
§ is NOT for inexperienced programmers
§ is NOT a fully-automated jack-of-all-trades yielding better performance

slide-27
SLIDE 27


Open Source

https://github.com/RRZE-HPC/kerncraft

Licensed under AGPLv3

slide-28
SLIDE 28

Outlook

§ Replacement for IACA (under investigation)
§ Support for non-Intel Architectures (AMD and POWER8)
  § Depends on:
    › Support for non-inclusive cache-architectures and (work in progress)
    › ECM model support
    › Replacement for IACA
§ Phenomenological performance modeling with LIKWID
§ LLVM integration with polyhedral model
  § Import of kernels embedded in large code bases
  § Automatic tiling during compilation
§ Irregular Performance Modeling (e.g., graph algorithms)

slide-29
SLIDE 29

Outlook – Breadth-First-Search

§ Assumptions:
  § More work leads to longer execution time
  § A difference in time can be modeled by additional work
§ Basic Model for BFS-TD:

$t_\mathrm{NT}(\#\mathrm{nodes}) + t_\mathrm{ET}(\#\mathrm{edges\ traversed}) + t_\mathrm{UP}(\#\mathrm{nodes\ updated}) = t$

int64_t Step(const Graph &g, int64_t lvl,
             pvector<int64_t> &levels,
             pvector<NodeID> &parent) {
    int64_t changed = 0;
    for(NodeID u = 0; u < g.num_nodes(); u++) {
        if (levels[u] != lvl) continue;
        // Node Traversal (NT) until here
        for(NodeID v : g.in_neigh(u)) {
            if(levels[v] < 0) {
                // Edge Traversal (ET) until here
                levels[v] = lvl + 1;
                changed += 1;
                // Update (UP) until here
            }
        }
    }
    return changed;
}
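To make the three work terms concrete, here is a small Python sketch that runs a top-down BFS on an adjacency list and counts NT, ET, and UP events in the same places as the comments in the C++ step function. The linear cost model in `predict_time` and its per-event costs are assumptions for illustration; the slide only states that each term is a function of its work count.

```python
def bfs_top_down(adjacency, source):
    """Level-synchronous top-down BFS that counts the modeled work items.

    adjacency -- list of neighbor lists (a stand-in for the Graph class)
    Returns (levels, counts) with counts for NT, ET, and UP.
    """
    levels = [-1] * len(adjacency)
    levels[source] = 0
    counts = {"NT": 0, "ET": 0, "UP": 0}
    lvl, changed = 0, 1
    while changed:
        changed = 0
        for u in range(len(adjacency)):
            counts["NT"] += 1          # node traversal (includes the filter)
            if levels[u] != lvl:
                continue
            for v in adjacency[u]:
                counts["ET"] += 1      # edge traversal (includes the check)
                if levels[v] < 0:
                    levels[v] = lvl + 1
                    counts["UP"] += 1  # node update
                    changed += 1
        lvl += 1
    return levels, counts


def predict_time(counts, t_nt, t_et, t_up):
    """Hypothetical linear model: t = t_NT*#NT + t_ET*#ET + t_UP*#UP."""
    return t_nt * counts["NT"] + t_et * counts["ET"] + t_up * counts["UP"]


levels, counts = bfs_top_down([[1, 2], [3], [3], []], source=0)
print(levels, counts, predict_time(counts, 1.0, 2.0, 3.0))
```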

slide-30
SLIDE 30

Outlook – Breadth-First-Search

[Figure: processing stages Node Traversal, Node Filter, Edge Traversal, Edge Destination Filter, Node Update, grouped into NT, ET, and UP.]

slide-31
SLIDE 31

Outlook – Breadth-First-Search

§ Naïve prediction model: [figure]

slide-32
SLIDE 32

Outlook – Breadth-First-Search

§ Naïve prediction model, stepwise: [figure]

slide-33
SLIDE 33

Outlook

§ Base model on queue network theory
§ Major challenge: Which graph properties to take into account?
§ Many common, but unsupported, building blocks:
  § Indirect accesses (graph in CSR format)
  § Branches
  § Pointer chasing (Shiloach-Vishkin for Connected-Components)
  § …

slide-34
SLIDE 34


slide-35
SLIDE 35

ERLANGEN REGIONAL COMPUTING CENTER [RRZE]

Thank You for Your Attention!

Julian Hammer <julian.hammer@fau.de> RRZE High Performance Computing Group http://www.rrze.fau.de/hpc

slide-36
SLIDE 36

Kerncraft – In-core Prediction

  • 1. Transform
  • 2. Compile to assembly
  • 3. Mark inner loop
  • 4. Extract unrolling factor
  • 5. Compile to binary
  • 6. Analyze with IACA

double a[N][N];
double b[N][N];
double s;
for(j=1; j<N-1; ++j)
    for(i=1; i<N-1; ++i)
        b[j][i] = ( a[j][i-1] + a[j][i+1]
                  + a[j-1][i] + a[j+1][i] ) * s;

slide-37
SLIDE 37

Kerncraft – In-core Prediction

  • 1. Transform
      § Linearized arrays and malloc
  • 2. Compile to assembly
  • 3. Mark inner loop
  • 4. Extract unrolling factor
  • 5. Compile to binary
  • 6. Analyze with IACA

#include <stdlib.h>

void dummy(double *);
extern int var_false;

int main(int argc, char **argv)
{
    const int N = atoi(argv[2]);
    const int M = atoi(argv[1]);
    double *a = _mm_malloc((sizeof(double)) * (M * N), 32);
    for (int i = 0; i < (M * N); ++i)
        a[i] = 0.23;
    if (var_false)
        dummy(a);
    double *b = _mm_malloc((sizeof(double)) * (M * N), 32);
    for (int i = 0; i < (M * N); ++i)
        b[i] = 0.23;
    if (var_false)
        dummy(b);
    double s = 0.23;
    if (var_false)
        dummy(&s);
    for (int j = 1; j < (M - 1); ++j)
        for (int i = 1; i < (N - 1); ++i)
            b[i + (j * N)] = (((a[(i - 1) + (j * N)] + a[(i + 1) + (j * N)])
                              + a[i + ((j - 1) * N)])
                              + a[i + ((j + 1) * N)]) * s;
}

slide-38
SLIDE 38

Kerncraft – In-core Prediction

  • 1. Transform
  • 2. Compile to assembly
  • 3. Mark inner loop
      § Detection heuristic
  • 4. Extract unrolling factor
  • 5. Compile to binary
  • 6. Analyze with IACA

[...]
..B1.25:
        vmovddup    %xmm1, %xmm0
        movslq      %r9d, %rdi
        vinsertf128 $1, %xmm0, %ymm0, %ymm0
        movl        $111, %ebx              # INSERTED BY KERNCRAFT
        .byte       100                     # INSERTED BY KERNCRAFT
        .byte       103                     # INSERTED BY KERNCRAFT
        .byte       144                     # INSERTED BY KERNCRAFT
..B1.26:
        vmovupd     (%rbx,%rsi,8), %xmm2
        vmovupd     16(%rbx,%rsi,8), %xmm3
        vmovupd     32(%rbx,%rsi,8), %xmm14
        [...]
        vmulpd      %ymm7, %ymm0, %ymm8
        vmovupd     %ymm8, 104(%r8,%rsi,8)
        addq        $16, %rsi
        cmpq        %rdi, %rsi
        jb          ..B1.26
        movl        $222, %ebx              # INSERTED BY KERNCRAFT
        .byte       100                     # INSERTED BY KERNCRAFT
        .byte       103                     # INSERTED BY KERNCRAFT
        .byte       144                     # INSERTED BY KERNCRAFT
..B1.28:
[...]
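The marker insertion in the listing can be sketched as a small text transformation. This is an illustration only: the function `mark_loop` is hypothetical, the loop label is passed in by hand instead of being found by a detection heuristic, and the marker lines are copied from the listing on this slide.

```python
START_MARKER = [
    "movl $111, %ebx # INSERTED BY KERNCRAFT",
    ".byte 100 # INSERTED BY KERNCRAFT",
    ".byte 103 # INSERTED BY KERNCRAFT",
    ".byte 144 # INSERTED BY KERNCRAFT",
]
END_MARKER = ["movl $222, %ebx # INSERTED BY KERNCRAFT"] + START_MARKER[1:]


def mark_loop(asm_lines, label):
    """Place the start marker before `label:` and the end marker after the
    jump back to `label`. Assumes a single-block loop body."""
    marked = []
    for line in asm_lines:
        tokens = line.split()
        if tokens == [label + ":"]:
            marked.extend(START_MARKER)
            marked.append(line)
        elif len(tokens) == 2 and tokens[0].startswith("j") and tokens[1] == label:
            marked.append(line)
            marked.extend(END_MARKER)
        else:
            marked.append(line)
    return marked


asm = ["..B1.25:", "movslq %r9d, %rdi", "..B1.26:",
       "addq $16, %rsi", "jb ..B1.26", "..B1.28:"]
print("\n".join(mark_loop(asm, "..B1.26")))
```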


slide-39
SLIDE 39

Kerncraft – In-core Prediction

  • 1. Transform
  • 2. Compile to assembly
  • 3. Mark inner loop
  • 4. Extract unrolling factor
      § From mem. ref. increments
  • 5. Compile to binary
  • 6. Analyze with IACA

[...]
..B1.25:
        vmovddup    %xmm1, %xmm0
        movslq      %r9d, %rdi
        vinsertf128 $1, %xmm0, %ymm0, %ymm0
        movl        $111, %ebx              # INSERTED BY KERNCRAFT
        .byte       100                     # INSERTED BY KERNCRAFT
        .byte       103                     # INSERTED BY KERNCRAFT
        .byte       144                     # INSERTED BY KERNCRAFT
..B1.26:
        vmovupd     (%rbx,%rsi,8), %xmm2
        vmovupd     16(%rbx,%rsi,8), %xmm3
        vmovupd     32(%rbx,%rsi,8), %xmm14
        [...]
        vmulpd      %ymm7, %ymm0, %ymm8
        vmovupd     %ymm8, 104(%r8,%rsi,8)
        addq        $16, %rsi
        cmpq        %rdi, %rsi
        jb          ..B1.26
        movl        $222, %ebx              # INSERTED BY KERNCRAFT
        .byte       100                     # INSERTED BY KERNCRAFT
        .byte       103                     # INSERTED BY KERNCRAFT
        .byte       144                     # INSERTED BY KERNCRAFT
..B1.28:
[...]
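A rough sketch of how an unrolling factor could be read off the memory reference increments in such a listing. This is not Kerncraft's actual heuristic: the regular expressions and the function name are assumptions, and only the simple `add $imm, %reg` pattern on a scaled index register is handled.

```python
import re

# (base, index, scale) memory operand, e.g. 16(%rbx,%rsi,8)
MEM_REF = re.compile(r"\((%\w+),(%\w+),(\d+)\)")
# Immediate increment of a register, e.g. addq $16, %rsi
INCREMENT = re.compile(r"add[lq]?\s+\$(\d+),\s*(%\w+)")


def unrolling_factor(asm, element_size=8):
    """Source-loop iterations covered by one pass of the assembly loop."""
    scales = {}
    for _base, index, scale in MEM_REF.findall(asm):
        scales[index] = int(scale)
    for line in asm.splitlines():
        match = INCREMENT.match(line.strip())
        if match and match.group(2) in scales:
            step, register = int(match.group(1)), match.group(2)
            return step * scales[register] // element_size
    return None


LOOP = """\
..B1.26:
vmovupd (%rbx,%rsi,8), %xmm2
vmovupd 16(%rbx,%rsi,8), %xmm3
vmulpd %ymm7, %ymm0, %ymm8
vmovupd %ymm8, 104(%r8,%rsi,8)
addq $16, %rsi
cmpq %rdi, %rsi
jb ..B1.26
"""
print(unrolling_factor(LOOP))
```

For the listing above, the index register %rsi advances by 16 elements of 8 bytes per pass, so one pass of the assembly loop covers 16 iterations of the source loop, or two cache lines of doubles.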


slide-40
SLIDE 40

Kerncraft – In-core Prediction

  • 1. Transform
  • 2. Compile to assembly
  • 3. Mark inner loop
  • 4. Extract unrolling factor
  • 5. Compile to binary
  • 6. Analyze with IACA
      → 2D and 3D are LOADs

Throughput Analysis Report
Block Throughput: 18.90 Cycles
Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
| Port   | 0  -  DV   | 1    | 2  -  D     | 3  -  D     |
| Cycles | 10.1   0.0 | 12.0 | 18.0   16.0 | 18.0   16.0 |
| Port   | 4   | 5    |
| Cycles | 8.0 | 11.9 |
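A sketch of how the in-core ECM inputs might be derived from this port table. The split is an assumed reading: T_nOL is the maximum over the load data columns (2D, 3D), T_OL the maximum over all other port columns, and both are scaled from one assembly-loop pass to one cache line of work. With 16 iterations per pass (from `addq $16, %rsi` in the listing) this reproduces the 9.0 and 8.0 cy/CL of the ECM output shown earlier, but it is not taken from the Kerncraft sources.

```python
def in_core_times(port_cycles, load_data_ports, iterations_per_block,
                  iterations_per_cl=8):
    """Return (T_OL, T_nOL) in cy/CL from per-port cycle counts.

    port_cycles          -- dict: port column -> cycles per block
    load_data_ports      -- columns that represent load data transfers
    iterations_per_block -- source iterations per assembly-loop pass
    """
    scale = iterations_per_cl / iterations_per_block
    t_nol = max(port_cycles[p] for p in load_data_ports) * scale
    t_ol = max(cycles for port, cycles in port_cycles.items()
               if port not in load_data_ports) * scale
    return t_ol, t_nol


# Values from the throughput report above.
ports = {"0": 10.1, "0DV": 0.0, "1": 12.0, "2": 18.0, "2D": 16.0,
         "3": 18.0, "3D": 16.0, "4": 8.0, "5": 11.9}
print(in_core_times(ports, {"2D", "3D"}, iterations_per_block=16))
```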
slide-41
SLIDE 41

Kerncraft – Cache Prediction

  • 1. Parse kernel code
  • 2. Enforce restrictions
  • 3. Extract data accesses
  • 4. Calculate cache accesses
      1. Compile offsets to fill all cache levels
      2. Reset cache simulator stats
      3. Execute next cache line accesses
      4. Check for hits/misses

double a[N][N];
double b[N][N];
double s;
for(j=1; j<N-1; ++j)
    for(i=1; i<N-1; ++i)
        b[j][i] = ( a[j][i-1] + a[j][i+1]
                  + a[j-1][i] + a[j+1][i] ) * s;
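The four sub-steps can be imitated with a toy cache simulator. The sketch below uses a single fully associative LRU cache written from scratch; it does not use pycachesim, and the class and function names are hypothetical. It warms the cache with a few rows of the 2D 5-point stencil, then counts load misses on `a` for one further row and reports them per cache line of work.

```python
from collections import OrderedDict

CL_BYTES = 64
ELEMENT_BYTES = 8


class LRUCache:
    """Fully associative LRU cache holding n_lines cache lines."""

    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.lines = OrderedDict()

    def access(self, address):
        """Touch an address; return True on a miss."""
        line = address // CL_BYTES
        if line in self.lines:
            self.lines.move_to_end(line)
            return False
        self.lines[line] = True
        if len(self.lines) > self.n_lines:
            self.lines.popitem(last=False)  # evict least recently used
        return True


def loads_per_cache_line(n, cache_bytes, warmup_rows=3):
    """Load misses on a[][] per cache line of work; n should be a multiple of 8."""
    cache = LRUCache(cache_bytes // CL_BYTES)
    b_base = (warmup_rows + 3) * n * ELEMENT_BYTES  # b is placed behind a

    def sweep(j):
        misses = 0
        for i in range(1, n - 1):
            for jj, ii in ((j, i - 1), (j, i + 1), (j - 1, i), (j + 1, i)):
                misses += cache.access((jj * n + ii) * ELEMENT_BYTES)
            cache.access(b_base + (j * n + i) * ELEMENT_BYTES)  # store to b
        return misses

    for j in range(1, warmup_rows + 1):
        sweep(j)                        # 1. fill the cache
    misses = sweep(warmup_rows + 1)     # 2.-4. fresh count for the next row
    return misses / (n * ELEMENT_BYTES / CL_BYTES)


print(loads_per_cache_line(512, 32 * 1024))   # inner dimension fits: 1.0
print(loads_per_cache_line(4096, 32 * 1024))  # inner dimension too large: 3.0
```

With a 32 kB cache, the short inner dimension keeps the neighboring rows of `a` resident, so only one new line is loaded per cache line of work; with the long inner dimension all three rows miss, which corresponds to the "up to 3 CL are loaded" case.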

slide-42
SLIDE 42

Kerncraft – Cache Prediction

  • 1. Parse kernel code
  • 2. Enforce restrictions
  • 3. Extract data accesses
  • 4. Calculate cache accesses
      1. Compile offsets to fill all cache levels
      2. Reset cache simulator stats
      3. Execute next cache line accesses
      4. Check for hits/misses

variables:
    name |  type    size
 --------+-------------------------
       a |  double  [M, N]
       s |  double  None
       b |  double  [M, N]

loop stack:
     idx |  min  max    step
 --------+---------------------------------
       j |  1    M - 1  1
       i |  1    N - 1  1

data sources:
    name |  offsets ...
 --------+------------...
       a |  [j, i - 1]
         |  [j, i + 1]
         |  [j - 1, i]
         |  [j + 1, i]
       s |  None

data destinations:
    name |  offsets ...
 --------+------------...
       b |  [j, i]

constants:
    name |  value
 --------+-----------
       N |  511
       M |  511