erlangen regional
play

ERLANGEN REGIONAL COMPUTING CENTER Analytical Tool-Supported - PowerPoint PPT Presentation

ERLANGEN REGIONAL COMPUTING CENTER Analytical Tool-Supported Modeling of Streaming and Stencil Loops Georg Hager, Julian Hammer Erlangen Regional Computing Center (RRZE) Scalable Tools Workshop August 3-6, 2015, Lake Tahoe, CA RRZE


  1. ERLANGEN REGIONAL COMPUTING CENTER Analytical Tool-Supported Modeling of Streaming and Stencil Loops Georg Hager, Julian Hammer Erlangen Regional Computing Center (RRZE) Scalable Tools Workshop August 3-6, 2015, Lake Tahoe, CA

  2. RRZE  LIKWID tiny.cc/LIKWID  GHOST tiny.cc/GHOST  Performance Engineering http://blogs.fau.de/... hager/talks/nlpe Automated loop performance model construction | G. Hager 2

  3. Motivation DAXPY on Sandy Bridge core 2D-5pt stencil on Sandy Bridge core Loop length Inner dimension w/ in- memory data Automated loop performance model construction | G. Hager 3

  4. THE ECM MODEL Registers L1 L2 L3 MEM

  5. ECM model – the rules 4 cy 8 cy 3 cy 43 cy 1. LOADs in the L1 cache do not STORE MULT … 6 cy LOAD ADD overlap with any other data L2-L1 transfer in the memory hierarchy 9 cy L3-L2 9 cy 2. Everything else in the core overlaps perfectly with data transfers (STOREs show some Mem-L3 19 cy non-overlap) time [cy] Example: 3. The scaling limit is set by the ratio of Single-core (data in L1): 8 cy (ADD) Single-core (data in memory): # cycles per CL overall 6+9+9+19 cy = 43 cy # cycles per CL at the bottleneck Scaling limit: 43 / 19 = 2.3 cores Automated loop performance model construction | G. Hager 5

  6. ECM model – composition ECM predicted time 𝑈 𝐹𝐷𝑁 = maximum of overlapping time and sum of all other contributions 𝑈 𝑃𝑀 𝑈 𝑑𝑝𝑠𝑓 = max(𝑈 𝑜𝑃𝑀 , 𝑈 𝑃𝑀 ) 𝑈 𝑜𝑃𝑀 LOAD 𝑈 ADD 𝑈 𝐹𝐷𝑁 = max(𝑈 𝑜𝑃𝑀 + 𝑈 𝑒𝑏𝑢𝑏 , 𝑈 𝑃𝑀 ) 𝑑𝑝𝑠𝑓 L2-L1 𝑈 𝑀1𝑀2 Shorthand notation for time contributions: L3-L2 𝑈 𝑀2𝑀3 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 𝑈 𝑈 𝑈 𝑀1𝑀2 𝑈 𝑀2𝑀3 | 𝑈 𝑃𝑀 𝑜𝑃𝑀 𝑀3𝑁𝑓𝑛 𝑈 𝑒𝑏𝑢𝑏 Mem-L3 # cy invariant to # cy changes w/ 𝑈 𝑀3𝑁𝑓𝑛 clock speed clock speed Example from previous slide: 8 6 9 9 | 19 cy Automated loop performance model construction | G. Hager 6

  7. ECM model – prediction Notation for cycle predictions in different memory hierarchy levels: 𝑈 𝑃𝑀 𝑀1 𝑀2 𝑀3 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 𝑈 𝐹𝐷𝑁 𝑈 𝐹𝐷𝑁 𝑈 𝐹𝐷𝑁 𝑈 𝑜𝑃𝑀 LOAD 𝑀1 𝑈 𝐹𝐷𝑁 ADD 𝑀2 𝑈 𝐹𝐷𝑁 L2-L1 𝑀1 𝑈 𝐹𝐷𝑁 = 𝑈 𝑑𝑝𝑠𝑓 = max 𝑈 𝑜𝑃𝑀 , 𝑈 𝑃𝑀 𝑀3 𝑈 𝐹𝐷𝑁 𝑀2 𝑈 𝐹𝐷𝑁 = max 𝑈 𝑜𝑃𝑀 + 𝑈 𝑀1𝑀2 , 𝑈 𝑃𝑀 L3-L2 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 𝑀3 𝑈 𝐹𝐷𝑁 = max 𝑈 𝑜𝑃𝑀 + 𝑈 𝑀1𝑀2 + 𝑈 𝑀2𝑀3 , 𝑈 𝑃𝑀 𝑁𝑓𝑛 = max 𝑈 𝑜𝑃𝑀 + 𝑈 𝑀1𝑀2 + 𝑈 𝑀2𝑀3 + 𝑈 𝑀3𝑁𝑓𝑛 , 𝑈 𝑃𝑀 Mem-L3 𝑈 𝐹𝐷𝑁 Substitute by Example: 8 15 24 43 cy commas  Roofline 8.6 16.2 26 47 cy Experimental data (measured) notation: Automated loop performance model construction | G. Hager 7

  8. ECM model – saturation Main assumption: Performance scaling is linear until a bandwidth bottleneck ( 𝑐 𝑇 ) is hit LOAD ADD Performance vs. cores (Memory BN): L2-L1 𝑁𝑓𝑛 𝑁𝑓𝑛 , 𝑐 𝑇 𝑄 𝐹𝐷𝑁 𝑜 = min 𝑜𝑄 𝐹𝐷𝑁 𝑁𝑓𝑛 𝐶 𝐷 L3-L2 𝑁𝑓𝑛 𝑈 𝐹𝐷𝑁 Number of cores at saturation: Mem-L3 𝑁𝑓𝑛 𝑐 𝑇 𝐶 𝐷 𝑈 𝐹𝐷𝑁 𝑈 𝑀3𝑁𝑓𝑛 𝑜 𝑇 = = 𝑁𝑓𝑛 𝑈 𝑄 𝑀3𝑁𝑓𝑛 𝐹𝐷𝑁 Example: cy ⟹ 𝑜 𝑇 = 43 8 6 9 9 | 19 cy, 8 15 24 43 19 = 3 Automated loop performance model construction | G. Hager 9

  9. How do we get the numbers? Basic information about hardware Registers capabilities: 𝑈 𝑑𝑝𝑠𝑓 : Code analysis, Intel IACA  In-core limitations L1  Throughput limits:µops, LD/ST, ADD/MULT per cycle  Pipeline depths L2  Cache hierarchy 𝑗 : 𝑈 𝑀1𝑀2 , 𝑈 𝑀2𝑀3 , 𝑈 𝑀3𝑁𝑓𝑛 , 𝐶 𝐷  ECM : Cycles per CL transfer Data flow analysis  RL : measured max bandwidths for all L3 cache levels, core counts  Memory interface  ECM : measured saturated BW MEM  RL : measured max bandwidths for all core counts Automated loop performance model construction | G. Hager 11

  10. 2D 5-PT JACOBI STENCIL (DOUBLE PRECISION) for(j=1; j < Nj-1; ++j) for(i=1; i < Ni-1; ++i) b[j][i] = (a[ j ][i-1] + a[ j ][i+1] + a[j-1][ i ] + a[j+1][ i ] ) * s; Unit of work (1 CL): 8 LUPs Data transfer per unit:  5 CL if layer condition violated in higher cache level  3 CL if layer condition satisfied

  11. ECM Model for 2D Jacobi (AVX) on SNB 2.7 GHz Radius- 𝑠 stencil  (2 𝑠 +1) layers have to fit Cache 𝑙 has size 𝐷 𝑙 Layer condition: for(j=1; j < Nj-1; ++j) (2𝑠 + 1) ∙ 𝑂 𝑗 ∙ 8 B < 𝐷 𝑙 for(i=1; i < Ni-1; ++i) b[j][i] = (a[ j ][i-1] + a[ j ][i+1] 2 + a[j-1][ i ] + a[j+1][ i ] ) * s; 2D 5-pt: 𝑠 = 1 LC = layer condition satisfied in … Automated loop performance model construction | G. Hager 18

  12. 2D 5-pt serial in-memory performance and layer conditions SNB 2.7 GHz Automated loop performance model construction | G. Hager 19

  13. 3D LONG-RANGE STENCIL (SINGLE PRECISION) #pragma omp parallel for for(int k=4; k < N-4; k++) { for(int j=4; j < N-4; j++) { for(int i=4; i < N-4; i++) { float lap = c0 * %V%[k][j][i] + c1 * ( V[ k ][ j ][i+1]+ V[ k ][ j ][i-1]) + c1 * ( V[ k ][j+1][ i ]+ V[ k ][j-1][ i ]) + c1 * ( V[k+1][ j ][ i ]+ V[k-1][ j ][ i ]) ... Source: + c4 * ( V[ k ][ j ][i+4]+ V[ k ][ j ][i-4]) http://goo.gl/dqOlnI + c4 * ( V[ k ][j+4][ i ]+ V[ k ][j-4][ i ]) + c4 * ( V[k+4][ j ][ i ]+ V[k-4][ j ][ i ]); U[k][j][i] = 2.f * V[k][j][i] - U[k][j][i] + ROC[k][j][i] * lap; }}}

  14. 3D long-range SP stencil ECM model Layer condition in L3 at problem size 𝑂 𝑗 × 𝑂 𝑘 × 𝑂 𝑙 : 𝑘 ∙ 𝑜 𝑢ℎ𝑠𝑓𝑏𝑒𝑡 ∙ 4 B < 𝐷 3 9 ∙ 𝑂 𝑗 ∙ 𝑐 2  | | | 68 62 24 24 17 cy 68 86 110 127 cy ECM Model: 𝑈 𝑀3𝑁𝑓𝑛 plays minor part 127 Saturation at 𝑜 𝑡 = = 8 cores. 17 Consequences:  Temporal blocking will not yield substantial speedup  Improve low-level code first (semi-stencil …?) Automated loop performance model construction | G. Hager 29

  15. 3D long-range SP stencil results (SNB) Roofline too optimistic due to overlapping assumption Automated loop performance model construction | G. Hager 30

  16. KERNCRAFT First steps towards automated model construction

  17. kerncraft: ECM/Roofline modeling toolkit Automated loop performance model construction | G. Hager 32

  18. Towards automated model generation Manual Automated Registers Code inspection IACA or and/or IACA direct analysis L1 Reuse distance Traffic analysis w/ analysis, cache layer conditions simulation L2 L3 HW limits: micro- HW limits: benchmarking likwid-bench & docs & docs MEM Automated loop performance model construction | G. Hager 33

  19. kerncraft vmovsd (%rsi,%rbx,8), %xmm1 #define N 1000 #define M 2000 vaddsd 16(%rsi,%rbx,8), %xmm1, %xmm2 vaddsd 8(%rdx,%rbx,8), %xmm2, %xmm3 for(j=1; j < N-1; ++j) Compiler vaddsd 8(%rcx,%rbx,8), %xmm3, %xmm4 for(i=1; i < M-1; ++i) vaddsd 8(%r8,%rbx,8), %xmm4, %xmm5 b[j][i] = (a[ j ][i-1] + a[ j ][i+1] vaddsd 8(%r9,%rbx,8), %xmm5, %xmm6 + a[j-1][ i ] + a[j+1][ i ] ) * s; vmulsd %xmm6, %xmm0, %xmm7 IACA pycparser TP/CP 𝑈 𝑃𝑀 , 𝑈 𝑜𝑃𝑀 AST Registers LOAD Cache simulator/ ADD L2- L1 L1 reuse distance Roofline / ECM model L3- L2 L2 𝑈 𝑁𝑓𝑛 𝐹𝐷𝑁 analysis Mem-L3 L3 𝑈 𝑀3𝑁𝑓𝑛 MEM 𝑊 𝑐 𝑈 𝑀1𝑀2 , … , 𝑈 𝑀3𝑁𝑓𝑛 Traffic volumes 𝑈 = Machine description docs likwid-bench (yaml file) Automated loop performance model construction | G. Hager 34

  20. Restrictions on code input (selection)  Only doubles and ints supported  Array declarations may use fixed sizes or constants, with an optional offset (e.g., double u1[M+3][N-2][23], but not double u[M*N])  Only the innermost loop may contain assignment statements  Array references must either use index variables from for-loops, with optional addition or subtraction, constant or fixed values  All for-loops must use a declaration as initial statement and an increment or a decrement assignment operation as the next statement (e.g., i++, i -= 2)  Function calls and the use of pointers is not allowed anywhere in the kernel code  Write access to any data is assumed to use “normal” STORE instructions (e.g., no non-temporal stores) Automated loop performance model construction | G. Hager 35

  21. Operating modes  ECM  Full ECM model including in-core analysis  ECMData  Data traffic analysis only (works on any system)  ECMCPU  In-core part of ECM model (IACA)  Roofline  Full Roofline model using CPU peak performance as in-core limit  RooflineIACA  Full Roofline model using IACA analysis for in-core  Benchmark  Run the actual benchmark for model validation Automated loop performance model construction | G. Hager 36

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend