From Tool Supported Performance Modeling of Regular Algorithms to - PowerPoint PPT Presentation

ERLANGEN REGIONAL COMPUTING CENTER [RRZE] From Tool Supported Performance Modeling of Regular Algorithms to Modeling of Irregular Algorithms Julian Hammer Georg Hager Jan Eitzinger Gerhard Wellein

Overview
1. Loop Kernels
2. Roofline and ECM
3. Kerncraft
   1. Overview and Structure
   2. Output and Results
4. 3D-long-range Example
5. Outlook
   1. Irregular Algorithms
Scalable Tools Workshop 2016 | Kerncraft | Julian Hammer

LOOP KERNELS Automatic Loop Kernel Analysis and Performance Modeling with Kerncraft

Loop Kernels
Kernel 1 (1D):
    double a[5000], b[5000];
    double s;
    for(i=0; i<5000; ++i)
        a[i] = s * b[i];
Kernel 2 (2D):
    double a[5000][5000];
    double b[5000][5000];
    double s;
    for(j=1; j<5000-1; ++j)
        for(i=1; i<5000-1; ++i)
            b[j][i] = ( a[j][i-1] + a[j][i+1]
                      + a[j-1][i] + a[j+1][i] ) * s;
- Many inner-loop iterations
- No branching
- Access fully determined by loop counters (i.e., no irregularities)

Loop Kernels
Streaming kernel (a[i] = s * b[i]):
- Simple structure
- No data reuse
Stencil code (b[j][i] = ( a[j][i-1] + a[j][i+1] + a[j-1][i] + a[j+1][i] ) * s):
- Complex structure
- Heavy data reuse

Loop Kernels
How to predict performance on complex architectures?
Two major contributions/bottlenecks:
- in-core execution / arithmetic operations
- memory and cache transfers
[Figure: core block diagram with L1 instruction cache, four decoders, reorder buffer / register renaming, register file, scheduler, execution units on Port 0 (ALU, MULT, DIV), Port 1 (ALU, ADD), Port 2 (LOAD, AGU), Port 3 (LOAD, AGU), Port 4 (STORE), Port 5 (ALU, MOV/MASK, JMP), L1 Dcache and memory control; arrows for data flow and control flow; potential bottlenecks marked]

ROOFLINE AND ECM Automatic Loop Kernel Analysis and Performance Modeling with Kerncraft

Roofline
- All memory levels are separate bottlenecks
- Predicted performance: $P = \min(P_\mathrm{comp.}, I \cdot b_s)$
    P_comp.  Peak performance [FLOP/s]
    I        Operational Intensity [FLOP/B]
    b_s      Peak bandwidth [B/s]
- Bandwidths are measured by suitable benchmarks
[Figure: Roofline diagram, predicted performance in FLOP/s limited by data transfers]
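The Roofline formula can be sketched in a few lines of Python. This is only an illustration: the function name `roofline` and the machine numbers in the usage lines are assumptions made up for the example, not measurements from the talk.

```python
def roofline(p_comp, intensity, bandwidth):
    """Predicted performance in FLOP/s: P = min(P_comp, I * b_s).

    p_comp:    peak performance [FLOP/s]
    intensity: operational intensity [FLOP/B]
    bandwidth: peak bandwidth [B/s]
    """
    return min(p_comp, intensity * bandwidth)


# Hypothetical machine: 20 GFLOP/s peak, 40 GB/s bandwidth (illustrative only).
low = roofline(20e9, 0.125, 40e9)   # low intensity: limited by bandwidth
high = roofline(20e9, 2.0, 40e9)    # high intensity: limited by peak performance
print(low, high)
```

With a low operational intensity the bandwidth term is the minimum; with a high one the peak performance caps the result.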

Roofline: Performance vs Time
- CPU frequency? → Cycles!
- Basic memory units? → Cache Lines! (64 Byte)
[Figure: Roofline diagram, predicted performance in FLOP/s limited by data transfers]

Roofline: Performance vs Time
- CPU frequency? → Cycles!
- Basic memory units? → Cache Lines! (1 CL = 64 B)
  → 1 unit of work = 1 CL
  → Cycles per Cache Line (cy/CL)
- Lower is better
[Figure: Roofline diagram recast as predicted runtime in cy/CL]
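A short sketch of the change of units from FLOP/s to cy/CL, assuming that one unit of work is the set of iterations that fills one cache line and that the kernel performs a known number of FLOP in those iterations. The function names are hypothetical; the 2.7 GHz clock, the 32 FLOP per cache line and the 36.74 cy/CL value are taken from other slides of this deck.

```python
def flops_to_cy_per_cl(perf_flops, clock_hz, flop_per_cl):
    """Convert a performance in FLOP/s into a runtime in cy/CL."""
    return clock_hz * flop_per_cl / perf_flops


def cy_per_cl_to_flops(cy_per_cl, clock_hz, flop_per_cl):
    """Convert a runtime in cy/CL back into a performance in FLOP/s."""
    return clock_hz * flop_per_cl / cy_per_cl


clock = 2.7e9        # clock from the machine file slide [cy/s]
flop_per_cl = 32     # 2D 5-point stencil: 4 FLOP per iteration, 8 iterations per CL

perf = cy_per_cl_to_flops(36.74, clock, flop_per_cl)
print(f"{perf / 1e9:.2f} GFLOP/s")
```

Because the runtime is the reciprocal of the performance, a lower cy/CL value corresponds to a higher FLOP/s value.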

Execution-Cache-Memory (ECM) Model
- Memory and cache levels contribute to runtime
    T_OL      computation & stores
    T_nOL     loads from L1
    T_L1-L2   loads from L2 into L1
    T_L2-L3   loads from L3 into L2
    T_L3-MEM  loads from main memory into L3
- Notation: { T_OL || T_nOL | T_L1-L2 | T_L2-L3 | T_L3-MEM }
- One measured input: full-socket mem. bandwidth
[Figure: predicted runtime in cy/CL for Roofline and ECM; bars labeled Data and STORE & Comp.]
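A minimal sketch of how the ECM contributions combine, reading the notation as "T_OL overlaps with the sum of T_nOL and all data transfers up to the level that holds the data". That reading is an assumption here, although it reproduces the cumulative line of the kerncraft output shown elsewhere in this deck. The function name `ecm_predictions` is hypothetical.

```python
def ecm_predictions(t_ol, t_nol, transfers):
    """Cumulative ECM runtime predictions in cy/CL.

    transfers is [T_L1-L2, T_L2-L3, T_L3-MEM]; the result holds one
    prediction each for data residing in L1, L2, L3 and main memory.
    """
    predictions = []
    t_data = t_nol
    predictions.append(max(t_ol, t_data))      # data in L1
    for t in transfers:
        t_data += t                            # add the next transfer level
        predictions.append(max(t_ol, t_data))  # overlap with T_OL
    return predictions


# Contributions from the 2D 5-point example: { 9.0 || 8.0 | 10 | 6 | 12.74 }
preds = ecm_predictions(9.0, 8.0, [10, 6, 12.74])
print([round(p, 2) for p in preds])  # [9.0, 18.0, 24.0, 36.74]
```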

Roofline and ECM
[Figure: side-by-side comparison of the Roofline and ECM runtime predictions in cy/CL; bars labeled Data and STORE & Comp.]

Performance Modeling
STREAM Scale (a[i] = s * b[i]), for each cache line (8 it.):
- 1 CL is stored
- 8 FLOP
- 1 CL is loaded
2D 5-point Stencil (b[j][i] = ( a[j][i-1] + a[j][i+1] + a[j-1][i] + a[j+1][i] ) * s), for each cache line (8 it.):
- 1 CL is stored
- 32 FLOP
- Up to 3 CL are loaded
Up to 3?
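The per-cache-line bookkeeping can be written down as a small sketch. It counts only the loads and stores listed on the slide and assumes 64-byte cache lines and 8-byte doubles; the function name and the dictionary keys are hypothetical.

```python
CACHELINE_BYTES = 64
DOUBLE_BYTES = 8


def per_cacheline(flop_per_iteration, cl_loaded, cl_stored):
    """Work and listed data traffic for one cache line worth of iterations."""
    iterations = CACHELINE_BYTES // DOUBLE_BYTES   # 8 doubles per cache line
    return {
        "iterations": iterations,
        "flop": flop_per_iteration * iterations,
        "bytes": (cl_loaded + cl_stored) * CACHELINE_BYTES,
    }


# STREAM Scale: 1 multiplication per iteration, 1 CL loaded, 1 CL stored.
scale = per_cacheline(1, cl_loaded=1, cl_stored=1)
# 2D 5-point stencil: 3 additions + 1 multiplication, worst case 3 CL loaded.
stencil = per_cacheline(4, cl_loaded=3, cl_stored=1)
print(scale)
print(stencil)
```

The stencil's load count is the worst case; how many of the three cache lines really have to come from the next memory level depends on what is still cached from earlier iterations.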

Layer Conditions
    ( a[j][i-1] + a[j][i+1] + a[j-1][i] + a[j+1][i] )
[Figure: code pattern / workload of the stencil mapped to cache hits and misses]
1D layer condition: stencil-width * stencil-height < cache-size
2D layer condition: stencil-height * matrix-width < cache-size
nD layer condition: $C_\mathrm{req.} = \left( \sum L_\mathrm{rel.\,offsets} + \max(L_\mathrm{rel.\,offsets}) \cdot n_\mathrm{slices} \right) \cdot s$
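As a hedged illustration of the 2D layer condition, the sketch below checks it for the 5-point stencil with an inner dimension of 5000 doubles against the cache sizes from the machine file slide. It assumes a stencil height of 3 rows (j-1, j, j+1), 8-byte elements and 1 kB = 1024 B, and it applies the simplified rule exactly as written on the slide, which is not necessarily what Kerncraft evaluates internally. The function name is hypothetical.

```python
def layer_condition_2d(stencil_height, matrix_width, cache_bytes, element_bytes=8):
    """True if stencil-height * matrix-width (in bytes) fits into the cache."""
    return stencil_height * matrix_width * element_bytes < cache_bytes


KB = 1024          # assumption: binary kilobytes
MB = 1024 * KB

caches = {"L1": 32 * KB, "L2": 256 * KB, "L3": 20 * MB}   # sizes from the machine file slide

# Three rows of 5000 doubles need 120000 bytes.
fits = {level: layer_condition_2d(3, 5000, size) for level, size in caches.items()}
print(fits)
```

Where the condition holds, the three rows of `a` touched by the stencil stay in that cache level, so fewer than three cache lines of `a` have to be fetched from the level below per unit of work. This is one way to read the "up to 3" on the preceding slide.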

Performance Modeling
    double U[M][N][N];
    double V[M][N][N];
    double ROC[M][N][N];
    double c0, c1, c2, c3, c4, lap;
    for(int k=4; k < M-4; k++) {
      for(int j=4; j < N-4; j++) {
        for(int i=4; i < N-4; i++) {
          lap = c0 * V[k][j][i]
              + c1 * ( V[ k ][ j ][i+1] + V[ k ][ j ][i-1])
              + c1 * ( V[ k ][j+1][ i ] + V[ k ][j-1][ i ])
              + c1 * ( V[k+1][ j ][ i ] + V[k-1][ j ][ i ])
              + c2 * ( V[ k ][ j ][i+2] + V[ k ][ j ][i-2])
              + c2 * ( V[ k ][j+2][ i ] + V[ k ][j-2][ i ])
              + c2 * ( V[k+2][ j ][ i ] + V[k-2][ j ][ i ])
              + c3 * ( V[ k ][ j ][i+3] + V[ k ][ j ][i-3])
              + c3 * ( V[ k ][j+3][ i ] + V[ k ][j-3][ i ])
              + c3 * ( V[k+3][ j ][ i ] + V[k-3][ j ][ i ])
              + c4 * ( V[ k ][ j ][i+4] + V[ k ][ j ][i-4])
              + c4 * ( V[ k ][j+4][ i ] + V[ k ][j-4][ i ])
              + c4 * ( V[k+4][ j ][ i ] + V[k-4][ j ][ i ]);
          U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i]
                     + ROC[k][j][i] * lap;
    }}}

KERNCRAFT Automatic Loop Kernel Analysis and Performance Modeling with Kerncraft

Kerncraft
[Figure: Kerncraft workflow, from user input through intermediate results to the model output]
Input (user):
- kernel code and constants:
    #define N 1000
    #define M 2000
    for(j=1; j < N-1; ++j)
        for(i=1; i < M-1; ++i)
            b[j][i] = (a[ j ][i-1] + a[ j ][i+1]
                     + a[j-1][ i ] + a[j+1][ i ] ) * s;
- machine file (from documentation and likwid-bench):
    clock: 2.7 GHz
    cacheline size: 64 B
    memory hierarchy:
      - {cores per group: 1, cycles per cacheline: 2,
         level: L1, size per group: 32 kB}
      - {cores per group: 1, cycles per cacheline: 2,
         level: L2, size per group: 256 kB}
      - {cores per group: 8, bandwidth: 40 GB/s,
         level: L3, size per group: 20 MB}
    [...]
Intermediate, in-core analysis (yields T_OL, T_nOL):
- compiler: binary marked for IACA
- IACA throughput analysis:
    vmovsd (%rsi,%rbx,8), %xmm1
    vaddsd 16(%rsi,%rbx,8), %xmm1, %xmm2
    vaddsd 8(%rdx,%rbx,8), %xmm2, %xmm3
    vaddsd 8(%rcx,%rbx,8), %xmm3, %xmm4
    vaddsd 8(%r8,%rbx,8), %xmm4, %xmm5
    vaddsd 8(%r9,%rbx,8), %xmm5, %xmm6
    vmulsd %xmm6, %xmm0, %xmm7
Intermediate, data transfers (yields T_L1L2, T_L2L3, T_L3MEM):
- pycparser: AST (abstract syntax tree)
- data pattern:
    name | offsets ...
    ------+------------...
    a | ('rel', 'j', 0), ('rel', 'i', -1)
      | ('rel', 'j', 0), ('rel', 'i', 1)
      | ('rel', 'j', -1), ('rel', 'i', 0)
      | ('rel', 'j', 1), ('rel', 'i', 0)
    s | ('dir',)
- cache usage prediction: with pycachesim
- Layer Condition: symbolic application of LC formulation
Output:
- ECM/Roofline model
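To illustrate how machine-file entries of this kind could turn into transfer times, here is a rough sketch. It is not Kerncraft's implementation: the function names are hypothetical, the cache-line counts per unit of work (5, 3, 3) are assumptions chosen for the 2D 5-point stencil, and the machine file on the slide is abbreviated, so the numbers need not agree with the tool's output digit for digit.

```python
CLOCK_HZ = 2.7e9         # "clock: 2.7 GHz"
CACHELINE_BYTES = 64     # "cacheline size: 64 B"


def cycles_from_bandwidth(bandwidth_bytes_per_s):
    """cy/CL for a level that is described by a bandwidth instead of cycles."""
    return CACHELINE_BYTES * CLOCK_HZ / bandwidth_bytes_per_s


def transfer_time(cachelines, cy_per_cl):
    """Transfer time in cycles for a number of cache lines per unit of work."""
    return cachelines * cy_per_cl


# Assumed cache-line traffic per unit of work (illustrative, not from the slide).
t_l1l2 = transfer_time(5, 2)                             # 2 cy/CL from the L1 entry
t_l2l3 = transfer_time(3, 2)                             # 2 cy/CL from the L2 entry
t_l3mem = transfer_time(3, cycles_from_bandwidth(40e9))  # 40 GB/s from the L3 entry
print(t_l1l2, t_l2l3, round(t_l3mem, 2))
```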

Kerncraft – Output
ECM model: { T_OL || T_nOL | T_L1-L2 | T_L2-L3 | T_L3-MEM }
Kernel (kernels/2d-5pt.c):
    double a[M][N];
    double b[M][N];
    double s;
    for(j=1; j<M-1; ++j)
        for(i=1; i<N-1; ++i)
            b[j][i] = ( a[j][i-1] + a[j][i+1]
                      + a[j-1][i] + a[j+1][i] )
                      * s;
Terminal session:
    $ kerncraft --machine snb.yaml 2d-5pt.c --pmodel ECM -D N 5000 -D M 500
    ===============================================================================
    kernels/2d-5pt.c
    ===============================================================================
    { 9.0 || 8.0 | 10 | 6 | 12.74 } = 36.74 cy/CL
    { 9.0 \ 18.00 \ 24.00 \ 36.74 } cy/CL
    saturating at 3 cores
    $
    $ kerncraft --machine snb.yaml 2d-5pt.c --pmodel Roofline --unit cy/CL -D N 5000 -D M 500
    ===============================================================================
    kernels/2d-5pt.c
    ===============================================================================
    Cache or mem bound with 1 core(s)
    29.79 cy/CL due to L3-MEM transfer bottleneck (bw from copy benchmark)
    Arithmetic Intensity: 0.17 FLOP/b
    $
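The "saturating at 3 cores" line is consistent with a common ECM rule of thumb: performance scales with the core count until the shared memory interface is the bottleneck, which happens at roughly the single-core in-memory prediction divided by the L3-MEM contribution, rounded up. The slide does not spell out how the tool arrives at the figure, so the sketch below is an assumption, and the function name is hypothetical.

```python
import math


def saturation_cores(t_ecm_mem, t_l3_mem):
    """Smallest core count at which the memory transfer time becomes the limit."""
    return math.ceil(t_ecm_mem / t_l3_mem)


# Values from the ECM output above: 36.74 cy/CL in total, 12.74 cy/CL for L3-MEM.
print(saturation_cores(36.74, 12.74))  # 3
```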

Kerncraft – Results
[Figure: 2d-5pt.c in memory on a single Intel Xeon E5-2680 (SandyBridge) core. Stacked ECM contributions (OL, nOL, L1-L2, L2-L3, L3-MEM) and the Roofline prediction in cy/CL (y-axis, ticks from 0.0 to 30.0) versus N of the N*M matrix (x-axis, ticks at 10, 1000, 8111, 705480); labeled values: 32.7, 36.7, 40.7 and 49.9 cy/CL.]
