
From Tool Supported Performance Modeling of Regular Algorithms to Modeling of Irregular Algorithms



  1. ERLANGEN REGIONAL COMPUTING CENTER [RRZE]: From Tool Supported Performance Modeling of Regular Algorithms to Modeling of Irregular Algorithms. Julian Hammer, Georg Hager, Jan Eitzinger, Gerhard Wellein

  2. Overview
     1. Loop Kernels
     2. Roofline and ECM
     3. Kerncraft
        1. Overview and Structure
        2. Output and Results
     4. 3D-long-range Example
     5. Outlook
        1. Irregular Algorithms
     Scalable Tools Workshop 2016 | Kerncraft | Julian Hammer

  3. LOOP KERNELS Automatic Loop Kernel Analysis and Performance Modeling with Kerncraft

  4. Loop Kernels

         double a[5000], b[5000];
         double s;
         for(i=0; i<5000; ++i)
             a[i] = s * b[i];

         double a[5000][5000];
         double b[5000][5000];
         double s;
         for(j=1; j<5000-1; ++j)
             for(i=1; i<5000-1; ++i)
                 b[j][i] = ( a[j][i-1] + a[j][i+1]
                           + a[j-1][i] + a[j+1][i] ) * s;

     - Many inner-loop iterations
     - No branching
     - Access fully determined by loop counters (i.e., no irregularities)

  5. Loop Kernels (the same two kernels as on slide 4)
     - Streaming kernel (the 1D loop a[i] = s * b[i]): simple structure, no data reuse
     - Stencil code (the 2D loop nest writing b[j][i]): complex structure, heavy data reuse

  6. Loop Kernels
     How to predict performance on complex architectures? Two major contributions/bottlenecks:
     - In-core execution (arithmetic operations)
     - Memory and cache transfers
     [Figure: core block diagram with L1 instruction cache, four decoders, reorder buffer / register renaming, register file, scheduler, ports 0 to 5 feeding the execution units (ALU, MULT, DIV, ADD, JMP, LOAD, AGU, STORE, MOV/MASK), L1 Dcache and memory control; legend: data flow, control flow, potential bottleneck]

  7. ROOFLINE AND ECM Automatic Loop Kernel Analysis and Performance Modeling with Kerncraft

  8. Roofline
     - All memory levels are separate bottlenecks
     - Predicted performance: $P = \min(P_\mathrm{comp}, I \cdot b_s)$
       - $P_\mathrm{comp}$: peak performance [FLOP/s]
       - $I$: operational intensity [FLOP/B]
       - $b_s$: peak bandwidth [B/s]
     - Bandwidths are measured by suitable benchmarks
     [Figure: Roofline bar in FLOP/s, limited by Data]
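
The Roofline formula on this slide can be sketched in a few lines. This is a minimal illustration for a single memory level; the function name `roofline` and the machine numbers (20 GFLOP/s peak, 40 GB/s bandwidth) are hypothetical placeholders, not values from the talk.

```python
def roofline(p_comp, intensity, bandwidth):
    """Predicted performance P = min(P_comp, I * b_s) in FLOP/s."""
    return min(p_comp, intensity * bandwidth)

# Hypothetical machine (assumption, not from the slides).
P_COMP = 20e9  # peak performance [FLOP/s]
B_S = 40e9     # peak bandwidth [B/s]

# Low intensity: 1 FLOP per 16 B moved, so the bandwidth term is the limit.
low = roofline(P_COMP, 1 / 16, B_S)
# High intensity: 4 FLOP per byte, so the compute peak is the limit.
high = roofline(P_COMP, 4.0, B_S)

print(low, high)
```

In a multi-level hierarchy the same minimum would be taken over every memory level, each with its own intensity and measured bandwidth.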

  9. Roofline: Performance vs Time
     - CPU frequency? → Cycles!
     - Basic memory units? → Cache lines! (64 Byte)
     [Figure: Roofline bar showing predicted performance in FLOP/s, limited by Data]

  10. Roofline: Performance vs Time
      - CPU frequency? → Cycles!
      - Basic memory units? → Cache lines! (1 CL = 64 B)
      - 1 unit of work = 1 CL → cycles per cache line (cy/CL)
      - Lower is better
      [Figure: Roofline bar showing predicted runtime in cy/CL, limited by Data]
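
The change of units on this slide is plain arithmetic, sketched below. The helper names are hypothetical; the clock (2.7 GHz), cache line size (64 B) and bandwidth (40 GB/s) are taken from the machine file shown on slide 17, but combining them this way is only an illustration.

```python
CACHE_LINE_BYTES = 64  # 1 CL = 64 B, as stated on the slide

def bandwidth_to_cy_per_cl(bandwidth_bytes_per_s, clock_hz,
                           cl_bytes=CACHE_LINE_BYTES):
    """Cycles needed to transfer one cache line at the given bandwidth."""
    return clock_hz * cl_bytes / bandwidth_bytes_per_s

def cy_per_cl_to_flops(cy_per_cl, flop_per_cl, clock_hz):
    """Convert a runtime in cy/CL back to a performance in FLOP/s."""
    return flop_per_cl * clock_hz / cy_per_cl

# 40 GB/s at 2.7 GHz: 2.7e9 * 64 / 40e9 cycles per cache line.
cy = bandwidth_to_cy_per_cl(40e9, 2.7e9)
print(f"{cy:.2f} cy/CL")
```

Working in cy/CL keeps all contributions in the same unit, so they can be added or compared directly.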

  11. Execution-Cache-Memory (ECM) Model
      - Memory and cache levels contribute to runtime:
        - T_OL: computation & stores
        - T_nOL: loads from L1
        - T_L1-L2: loads from L2 into L1
        - T_L2-L3: loads from L3 into L2
        - T_L3-MEM: loads from main memory into L3
      - Notation: { T_OL || T_nOL | T_L1-L2 | T_L2-L3 | T_L3-MEM }
      - One measured input: full-socket memory bandwidth
      [Figure: predicted runtime in cy/CL; Roofline bar (Data) next to ECM bar (Data stacked on STORE & Comp.)]
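
The slide gives the ECM notation but not the rule for combining the terms. The sketch below assumes the usual reading of the notation: T_OL overlaps with everything to the right of `||`, while the terms separated by `|` add up. The function name is hypothetical; the input numbers are the ones printed in the Kerncraft output on slide 18, and the assumed rule reproduces the 36.74 cy/CL reported there.

```python
def ecm_time(t_ol, t_nol, transfers):
    """Single-core ECM prediction in cy/CL.

    Assumption: T_OL overlaps with all data-related terms,
    which are summed: max(T_OL, T_nOL + sum of transfer times).
    """
    return max(t_ol, t_nol + sum(transfers))

# { 9.0 || 8.0 | 10 | 6 | 12.74 } from the slide 18 output.
t_ecm = ecm_time(9.0, 8.0, [10, 6, 12.74])
print(f"{t_ecm:.2f} cy/CL")
```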

  12. Roofline and ECM
      [Figure: predicted runtime in cy/CL; Roofline bar (Data) next to ECM bar (Data stacked on STORE & Comp.)]

  13. Performance Modeling

      STREAM Scale:

          double a[5000], b[5000];
          double s;
          for(i=0; i<5000; ++i)
              a[i] = s * b[i];

      For each cache line (8 it.):
      - 1 CL is stored
      - 8 FLOP
      - 1 CL is loaded

      2D 5-point Stencil:

          double a[5000][5000];
          double b[5000][5000];
          double s;
          for(j=1; j<5000-1; ++j)
              for(i=1; i<5000-1; ++i)
                  b[j][i] = ( a[j][i-1] + a[j][i+1]
                            + a[j-1][i] + a[j+1][i] ) * s;

      For each cache line (8 it.):
      - 1 CL is stored
      - 32 FLOP
      - Up to 3 CL are loaded

      Why "up to 3"?
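
The per-cache-line counts above can be turned into arithmetic intensities with a small helper. This is a sketch under stated assumptions: the function name is hypothetical, and the traffic counts only the loaded and stored cache lines listed on the slide (write-allocate transfers are not on the slide and are ignored here).

```python
CL_BYTES = 64
DOUBLE_BYTES = 8
ITER_PER_CL = CL_BYTES // DOUBLE_BYTES  # 8 iterations per cache line

def per_cache_line(flop_per_iter, cl_loaded, cl_stored):
    """Return (FLOP per CL of work, arithmetic intensity in FLOP/B)."""
    flop = flop_per_iter * ITER_PER_CL
    traffic_bytes = (cl_loaded + cl_stored) * CL_BYTES
    return flop, flop / traffic_bytes

# STREAM Scale: 1 multiplication per iteration, 1 CL loaded, 1 CL stored.
stream = per_cache_line(1, 1, 1)
# 2D 5-point stencil: 3 additions + 1 multiplication per iteration.
stencil_best = per_cache_line(4, 1, 1)   # only 1 CL of a[] has to be loaded
stencil_worst = per_cache_line(4, 3, 1)  # all 3 rows of a[] are loaded

print(stream, stencil_best, stencil_worst)
```

The gap between the best and worst stencil case is the reason the number of loaded cache lines has to be predicted per cache level.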

  14. Layer Conditions
      Stencil: ( a[j][i-1] + a[j][i+1] + a[j-1][i] + a[j+1][i] )
      [Figure: code pattern/workload and stencil shape determine cache hit/miss]
      - 1D layer condition: stencil-width * stencil-height < cache-size
      - 2D layer condition: stencil-height * matrix-width < cache-size
      - nD layer condition: $C_\mathrm{req} = \left( \sum L_\mathrm{rel.offsets} + \max(L_\mathrm{rel.offsets}) \cdot n_\mathrm{slices} \right) \cdot s$
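
A minimal sketch of the 2D layer condition for the 5-point stencil follows. The function name and the interpretation of "kB"/"MB" as powers of 1024 are assumptions; the cache sizes are those listed in the machine file on slide 17, and N = 5000 is the inner dimension used in the run on slide 18. Any safety factor a real tool might apply to the cache size is not modelled.

```python
def layer_condition_2d(stencil_height, matrix_width, cache_bytes, elem_bytes=8):
    """2D layer condition: stencil-height * matrix-width < cache-size (in bytes)."""
    return stencil_height * matrix_width * elem_bytes < cache_bytes

# Cache sizes from the machine file on slide 17 (binary prefixes assumed).
CACHES = {"L1": 32 * 1024, "L2": 256 * 1024, "L3": 20 * 1024 * 1024}

N = 5000            # inner (contiguous) dimension, doubles
STENCIL_HEIGHT = 3  # rows j-1, j, j+1 of a[][]

results = {level: layer_condition_2d(STENCIL_HEIGHT, N, size)
           for level, size in CACHES.items()}
for level, ok in results.items():
    print(level, "layer condition fulfilled" if ok else "layer condition violated")
```

Where the condition holds, the neighbouring rows are still cached and only one new cache line of `a` is needed per cache line of work; where it fails, up to three are.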

  15. Performance Modeling

          double U[M][N][N];
          double V[M][N][N];
          double ROC[M][N][N];
          double c0, c1, c2, c3, c4, lap;

          for(int k=4; k < M-4; k++) {
            for(int j=4; j < N-4; j++) {
              for(int i=4; i < N-4; i++) {
                lap = c0 * V[k][j][i]
                    + c1 * ( V[ k ][ j ][i+1] + V[ k ][ j ][i-1])
                    + c1 * ( V[ k ][j+1][ i ] + V[ k ][j-1][ i ])
                    + c1 * ( V[k+1][ j ][ i ] + V[k-1][ j ][ i ])
                    + c2 * ( V[ k ][ j ][i+2] + V[ k ][ j ][i-2])
                    + c2 * ( V[ k ][j+2][ i ] + V[ k ][j-2][ i ])
                    + c2 * ( V[k+2][ j ][ i ] + V[k-2][ j ][ i ])
                    + c3 * ( V[ k ][ j ][i+3] + V[ k ][ j ][i-3])
                    + c3 * ( V[ k ][j+3][ i ] + V[ k ][j-3][ i ])
                    + c3 * ( V[k+3][ j ][ i ] + V[k-3][ j ][ i ])
                    + c4 * ( V[ k ][ j ][i+4] + V[ k ][ j ][i-4])
                    + c4 * ( V[ k ][j+4][ i ] + V[ k ][j-4][ i ])
                    + c4 * ( V[k+4][ j ][ i ] + V[k-4][ j ][ i ]);
                U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i]
                           + ROC[k][j][i] * lap;
          }}}

  16. KERNCRAFT Automatic Loop Kernel Analysis and Performance Modeling with Kerncraft

  17. Kerncraft
      [Figure: Kerncraft workflow, with boxes marked as Input, Intermediate or Output]

      User input: kernel code, constants, machine file.

      Kernel code and constants:

          #define N 1000
          #define M 2000
          for(j=1; j < N-1; ++j)
            for(i=1; i < M-1; ++i)
              b[j][i] = (a[ j ][i-1] + a[ j ][i+1]
                       + a[j-1][ i ] + a[j+1][ i ] ) * s;

      In-core path: compiler → binary marked for IACA → IACA throughput analysis → in-core results T_OL, T_nOL

          vmovsd (%rsi,%rbx,8), %xmm1
          vaddsd 16(%rsi,%rbx,8), %xmm1, %xmm2
          vaddsd 8(%rdx,%rbx,8), %xmm2, %xmm3
          vaddsd 8(%rcx,%rbx,8), %xmm3, %xmm4
          vaddsd 8(%r8,%rbx,8), %xmm4, %xmm5
          vaddsd 8(%r9,%rbx,8), %xmm5, %xmm6
          vmulsd %xmm6, %xmm0, %xmm7

      Data path: pycparser → AST (abstract syntax tree) → data pattern → cache usage prediction with pycachesim, or Layer Condition (symbolic application of LC formulation) → data transfers T_L1L2, T_L2L3, T_L3MEM

          name | offsets ...
          -----+------------...
             a | ('rel', 'j', 0), ('rel', 'i', -1)
               | ('rel', 'j', 0), ('rel', 'i', 1)
               | ('rel', 'j', -1), ('rel', 'i', 0)
               | ('rel', 'j', 1), ('rel', 'i', 0)
             s | ('dir',)

      Machine file (filled from documentation and likwid-bench):

          clock: 2.7 GHz
          cacheline size: 64 B
          memory hierarchy:
            - {cores per group: 1, cycles per cacheline: 2,
               level: L1, size per group: 32 kB}
            - {cores per group: 1, cycles per cacheline: 2,
               level: L2, size per group: 256 kB}
            - {cores per group: 8, bandwidth: 40 GB/s,
               level: L3, size per group: 20 MB}
          [...]

      Output: ECM/Roofline model

  18. Kerncraft – Output
      ECM model: { T_OL || T_nOL | T_L1-L2 | T_L2-L3 | T_L3-MEM }

      Kernel (kernels/2d-5pt.c):

          double a[M][N];
          double b[M][N];
          double s;

          for(j=1; j<M-1; ++j)
            for(i=1; i<N-1; ++i)
              b[j][i] = ( a[j][i-1] + a[j][i+1]
                        + a[j-1][i] + a[j+1][i] )
                        * s;

      ECM run:

          $ kerncraft --machine snb.yaml 2d-5pt.c --pmodel ECM -D N 5000 -D M 500
          ===============================================================================
          kernels/2d-5pt.c
          ===============================================================================
          { 9.0 || 8.0 | 10 | 6 | 12.74 } = 36.74 cy/CL
          { 9.0 \ 18.00 \ 24.00 \ 36.74 } cy/CL
          saturating at 3 cores
          $

      Roofline run:

          $ kerncraft --machine snb.yaml 2d-5pt.c --pmodel Roofline --unit cy/CL -D N 5000 -D M 500
          ===============================================================================
          kernels/2d-5pt.c
          ===============================================================================
          Cache or mem bound with 1 core(s)
          29.79 cy/CL due to L3-MEM transfer bottleneck (bw from copy benchmark)
          Arithmetic Intensity: 0.17 FLOP/b
          $
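
The second line of the ECM output and the saturation point can be reproduced from the first line with a short sketch. Two assumptions are made here that the slide does not spell out: the backslash-separated values are read as cumulative predictions for data residing in L1, L2, L3 and memory, and the saturation point is taken as the in-memory prediction divided by the L3-MEM term, rounded up. Both function names are hypothetical, and this is not Kerncraft's own code.

```python
import math

def ecm_cumulative(t_ol, t_nol, transfers):
    """Predictions in cy/CL for data in L1, L2, L3, memory (assumed reading)."""
    data = t_nol
    predictions = [max(t_ol, data)]
    for t in transfers:
        data += t  # each further level adds its transfer time
        predictions.append(max(t_ol, data))
    return predictions

def saturation_cores(t_ecm_mem, t_l3_mem):
    """Cores needed until the memory transfer term becomes the bottleneck."""
    return math.ceil(t_ecm_mem / t_l3_mem)

# { 9.0 || 8.0 | 10 | 6 | 12.74 } from the output above.
preds = ecm_cumulative(9.0, 8.0, [10, 6, 12.74])
cores = saturation_cores(preds[-1], 12.74)

print(" \\ ".join(f"{p:.2f}" for p in preds), "cy/CL")
print("saturating at", cores, "cores")
```

With the slide's numbers this gives 9.00, 18.00, 24.00 and 36.74 cy/CL and a saturation point of 3 cores, matching the printed output.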

  19. Kerncraft – Results
      [Figure: 2d-5pt.c in memory on a single Intel Xeon E5-2680 (SandyBridge) core. y-axis: cy/CL, 0.0 to 49.9. x-axis: N (of N*M matrix), with ticks at 10, 1000, 8111 and 705480. Shown: Roofline prediction and ECM prediction, the latter split into the contributions OL, nOL, L1-L2, L2-L3 and L3-MEM.]
