MAP-Driven Performance Analysis for Local Memory Usage


  1. The 2nd International Workshop on OpenCL (IWOCL'14)  MAP-Driven Performance Analysis for Local Memory Usage  Jianbin Fang+, Henk Sips+, Ana Lucia Varbanescu++  +Delft University of Technology, the Netherlands  ++University of Amsterdam, the Netherlands  May 12-13, 2014, Bristol, England 1

  2. Outline  Introducing local memory (background)  Our research question (why)  Our approach (how)  Our key findings (results) 2

  3. OpenCL and Local Memory  Like-a-cache: on-chip and faster  Not-a-Cache: user-managed  Data elements are shared by work-items of a work-group 3
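
As a concrete illustration of these points (not taken from the slides), a minimal OpenCL C kernel that stages data into local memory could look like the sketch below; the kernel name and the assumed work-group size of 256 are illustrative.

// Minimal sketch: local memory is declared explicitly, lives on-chip, and is
// shared by all work-items of one work-group. Unlike a cache, nothing is
// fetched or evicted automatically -- the kernel stages the data itself.
// Assumes a work-group size of 256.
__kernel void share_in_workgroup(__global const float *in, __global float *out)
{
    __local float tile[256];          // user-managed, on-chip buffer
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = in[gid];              // each work-item stages one element
    barrier(CLK_LOCAL_MEM_FENCE);     // make the tile visible to the whole work-group

    // Any work-item may now read elements staged by its neighbors.
    out[gid] = tile[lid] + tile[(lid + 1) % 256];
}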

  4. Rules of Thumb  Using local memory on GPUs is preferred (e.g., data reuse)  Using local memory on CPUs is not recommended 4

  5. The Reality is ...  A counter-intuitive example  3x3 convolution  On Intel Xeon E5620 (6 cores)  [Bar chart: bandwidth (GB/s), w/o vs. w/ local memory, for datasets 128x128 to 2048x2048; higher is better] 5

  6. The Reality is ...  A counter-intuitive example  3x3 convolution  On Intel Xeon E5620 (6 cores)  LM on CPUs ≠ performance loss  [Same bar chart as slide 5: bandwidth (GB/s), w/o vs. w/ local memory] 6
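
To make concrete what the chart compares, here is a hedged sketch of the two convolution variants (illustrative code; kernel names, filter layout, and boundary handling are assumptions, not the authors' benchmark).

// Variant 1: w/o local memory -- every work-item reads its 3x3 neighborhood
// straight from global memory, so neighboring work-items re-read the same data.
__kernel void conv3x3_global(__global const float *in, __global float *out,
                             __constant float *w, int width, int height)
{
    int x = get_global_id(0), y = get_global_id(1);
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

    float acc = 0.0f;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            acc += w[(dy + 1) * 3 + (dx + 1)] * in[(y + dy) * width + (x + dx)];
    out[y * width + x] = acc;
}

// Variant 2: w/ local memory -- each work-group first stages a
// (WG_X + 2) x (WG_Y + 2) tile (data plus halo) into __local memory,
// calls barrier(CLK_LOCAL_MEM_FENCE), and then runs the same 3x3 loop on
// the tile, so the up-to-9x data reuse is served from on-chip storage.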

  7. When to Use Local Memory? 7

  8. Our Approach 8

  9. MAP Description  OpenCL organizes parallelism at two levels  We describe a memory access pattern (MAP) at the same two levels  eMAP: work-group level  iMAP: work-item level 9

  10. 33 MAP Citizens  eMAP: M00, M01, M10, M11 ∈ {0,1}  iMAP: Single, Row, Column, Block, Neighbor 10
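
One possible way to encode such a MAP in C, purely for illustration (the type and field names are assumptions, not Aristotle's internal representation):

/* Illustrative encoding of a MAP as described on slides 9-10. */
typedef enum { IMAP_SINGLE, IMAP_ROW, IMAP_COLUMN, IMAP_BLOCK, IMAP_NEIGHBOR } imap_t;

typedef struct {
    /* eMAP: 2x2 binary matrix (M00, M01, M10, M11), describing the
       work-group-level mapping onto the data space. */
    int emap[2][2];
    /* iMAP: which elements one work-item touches inside the region
       owned by its work-group. */
    imap_t imap;
} map_t;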

  11. Micro-Benchmarks  We generate 2 micro-benchmarks (in OpenCL) for each MAP  The kernel code:  Local space allocation  Local data staging  Local data access (specified by the MAP)  We provide a tool (Aristotle*) to facilitate this process (a skeleton is sketched below) *https://github.com/haibo031031/aristotle 11
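
A hedged skeleton of what one generated micro-benchmark might look like, with the three stages from the slide marked (illustrative; Aristotle's actual generated code, tile sizes, and kernel names may differ). The access stage here implements a 'Row' iMAP.

#define TX 16
#define TY 16

__kernel void map_row_bench(__global const float *in, __global float *out)
{
    // (1) Local space allocation
    __local float lm[TY][TX];

    int gx = get_global_id(0), gy = get_global_id(1);
    int lx = get_local_id(0),  ly = get_local_id(1);
    int width = get_global_size(0);

    // (2) Local data staging: one element per work-item
    lm[ly][lx] = in[gy * width + gx];
    barrier(CLK_LOCAL_MEM_FENCE);

    // (3) Local data access, as specified by the MAP
    //     ('Row' iMAP: each work-item reads its whole row of the tile)
    float acc = 0.0f;
    for (int i = 0; i < TX; i++)
        acc += lm[ly][i];
    out[gy * width + gx] = acc;
}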

  12. Experimental Platforms  Devices  SPM-only: NVIDIA C1060  SPM+Cache: AMD HD7970, NVIDIA C2050, K20  Cache-only: Intel Xeon X5650, E5-2620, Intel Xeon Phi 5110P  Software environments  AMD APP v2.8  Intel OpenCL SDK v3.0  NVIDIA CUDA v5.5 (not updated for a long time) 12

  13. Experimental Setup  Metric: bandwidth  Datasets: 128, 256, 512, 1024, 2048, 4096 (NxN)  Block MAPs: r=3  Run each measurement for 21 iterations: 1 warm-up iteration, then 20 measured iterations  Flush caches between iterations (a host-side sketch follows) 13
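
A hedged host-side sketch of this methodology in C (illustrative; the helper names, the OpenCL setup, and the cache-flush trick of streaming through a large buffer are assumptions, not the authors' harness).

#include <CL/cl.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* One warm-up run, then 20 timed runs; report the mean bandwidth in GB/s.
   'bytes_moved' is the number of bytes the kernel reads plus writes. */
double measure_bandwidth(cl_command_queue q, cl_kernel k,
                         size_t gsize, size_t lsize, size_t bytes_moved,
                         cl_kernel flush_kernel, size_t flush_gsize)
{
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
    clFinish(q);                               /* warm-up iteration */

    double total = 0.0;
    for (int it = 0; it < 20; it++) {
        /* Flush caches between iterations by streaming through a buffer
           much larger than the last-level cache (one common trick). */
        clEnqueueNDRangeKernel(q, flush_kernel, 1, NULL, &flush_gsize, NULL, 0, NULL, NULL);
        clFinish(q);

        double t0 = now_sec();
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
        clFinish(q);
        total += now_sec() - t0;
    }
    return (bytes_moved * 20.0) / total / 1e9;  /* GB/s */
}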

  14. A Bird's-eye View  Performance gain/loss distribution  [Stacked bar chart, 0-100% per device: share of MAPs with performance gain / loss / similar; devices: C1060, C2050, K20 (NVIDIA GPUs), HD7970 (AMD GPU), Phi-5110p, E5-2620, X5650 (Intel CPUs & Phi)]  Dataset: 4096x4096 14

  18. SPM Processors  [Chart: gain/loss distribution per device, as on slide 14]  Memory bandwidth increase factors  Data reuse (A)  Changed memory access orders (B) 18

  19. SPM Processors: w/o Caches vs. w/ Caches  The performance gain is canceled  Disabling local memory is better  [Bar chart: MAP-514 bandwidth, w/ LM vs. w/o LM, on C1060, C2050, K20; higher is better] 19

  20. Cache-only Processors  [Chart: gain/loss distribution per device, as on slide 14]  Emulating local memory on global memory  Using local memory might utilize caches better  [Bar chart: MAP-302 bandwidth, w/ LM vs. w/o LM, on Phi-5110P, E5-2620, X5650; higher is better] 20

  21. Cache-only Processors  Using local memory on MAP-302 leads to a bandwidth increase  Profile the number of cache-line replacements on E5-2620  [Chart: L1 and L2 cache-line replacements; fewer is better] 21

  22. Performance Database Use-Scenario 22
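
The slides do not show the database schema. Purely as a hypothetical illustration of the use scenario (a query-based decision on local memory usage, see slide 23), a lookup keyed by (device, MAP) could look like the C sketch below; the types, MAP IDs, and fallback policy are all assumptions, not the authors' implementation.

#include <string.h>

typedef enum { LM_GAIN, LM_LOSS, LM_SIMILAR } lm_effect_t;

typedef struct {
    const char *device;    /* e.g. "E5-2620", "K20" */
    int         map_id;    /* e.g. 302, 514 */
    lm_effect_t effect;    /* measured by the micro-benchmarks */
} db_entry_t;

/* Decide whether to enable local memory for a kernel whose dominant MAP is
   'map_id' when running on 'device'; unknown entries fall back to the
   global-memory version. */
int use_local_memory(const db_entry_t *db, int n, const char *device, int map_id)
{
    for (int i = 0; i < n; i++)
        if (map_id == db[i].map_id && strcmp(device, db[i].device) == 0)
            return db[i].effect == LM_GAIN;
    return 0;
}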

  23. Summary  Data reuse and access-order changes are the positive factors  Unpredictable local memory performance is due to caches  A query-based approach to decide on local memory usage  Architecture design indications: SPM and caches co-exist!? 23

  24. Follow-up Questions  Evaluations on more platforms, e.g., tiny GPUs  How to identify MAPs for a given kernel?  Visual inspection  Automatic tools  How to 'predict' the performance impact of using local memory in the presence of multiple MAPs?  How to enable/disable local memory usage?  Enabler*: w/o → w/  Disabler**: remove the use of local memory *J. Fang, et al., "ELMO: A User-Friendly API to enable local memory in OpenCL kernels," in PDP2013. **J. Fang, et al., "Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels," ICPP2014 (in submission). 24

  25. Questions  Jianbin Fang, PhD student at TU Delft  j.fang@tudelft.nl  Ana Lucia Varbanescu  Henk Sips 25
