

SLIDE 1

MAP-Driven Performance Analysis for Local Memory Usage

Jianbin Fang+, Henk Sips+, Ana Lucia Varbanescu++

+ Delft University of Technology, the Netherlands
++ University of Amsterdam, the Netherlands

May 12-13, 2014, Bristol, England
The 2nd International Workshop on OpenCL (IWOCL'14)

SLIDE 2

Outline

 Introducing local memory (background)
 Our research question (why)
 Our approach (how)
 Our key findings (results)

SLIDE 3

OpenCL and Local Memory

 Like a cache: on-chip and faster
 Not a cache: user-managed
 Data elements are shared by the work-items of a work-group
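"User-managed" is the key property: the programmer, not the hardware, moves data into the fast buffer. A plain-C sketch of the idea (names like `sum_staged_tile` are illustrative, not from the paper):

```c
#include <string.h>

/* Plain-C sketch of user-managed staging: a tile of "global" data is
 * explicitly copied into a small fast buffer, which every work-item of
 * the work-group then shares. */
#define GROUP_SIZE 8
#define GLOBAL_SIZE 64

static float global_mem[GLOBAL_SIZE]; /* "global memory": large, far away */
static float local_mem[GROUP_SIZE];   /* "local memory": small, on-chip   */

void init_global(void) {
    for (int i = 0; i < GLOBAL_SIZE; i++) global_mem[i] = (float)i;
}

/* Stage one tile, then let every "work-item" (here: loop iteration) read
 * the shared copy; on a GPU this staging is explicit code in the kernel. */
float sum_staged_tile(int tile_start) {
    memcpy(local_mem, &global_mem[tile_start], sizeof local_mem);
    float sum = 0.0f;
    for (int wi = 0; wi < GROUP_SIZE; wi++)
        sum += local_mem[wi]; /* all work-items share these elements */
    return sum;
}
```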

SLIDE 4

Rules of Thumb

 Using local memory on GPUs is preferred (e.g., when there is data reuse)
 Using local memory on CPUs is not recommended


SLIDE 6

The Reality is ...

A counter-intuitive example:

 3x3 convolution
 On an Intel Xeon E5620 (6 cores)

[Figure: bandwidth (GB/s) vs. dataset size (128x128 to 2048x2048), with (w/) and without (w/o) local memory; higher is better]

Local memory (LM) on CPUs ≠ performance loss
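The benchmarked kernel is, in scalar C form, roughly the following (border handling and coefficients here are illustrative assumptions, not the paper's exact benchmark code):

```c
/* Scalar sketch of a 3x3 convolution over a w x h image (interior only). */
void conv3x3(const float *in, float *out, const float k[9], int w, int h) {
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; dy++)      /* 3x3 neighborhood */
                for (int dx = -1; dx <= 1; dx++)
                    acc += k[(dy + 1) * 3 + (dx + 1)]
                         * in[(y + dy) * w + (x + dx)];
            out[y * w + x] = acc;
        }
}
```

Each output reads 9 inputs, so each input is reused up to 9 times: exactly the reuse that staging into local memory targets, and that a CPU's caches may capture anyway.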

SLIDE 7

When to Use Local Memory?

SLIDE 8

Our Approach

SLIDE 9

MAP Description

OpenCL organizes parallelism at two levels, so we describe a MAP (memory access pattern) at two levels:

 eMAP: work-group level
 iMAP: work-item level

SLIDE 10

33 MAP Citizens

eMAP

 M00, M01, M10, M11 ∈ {0,1}

iMAP

 Single, Row, Column, Block, Neighbor
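One possible encoding of a MAP citizen as a data structure (the field names and packing are hypothetical, not the paper's; note that 16 eMAPs x 5 iMAPs gives 80 raw combinations, so the count of 33 citizens implies many combinations are invalid or redundant):

```c
/* Hypothetical encoding of one MAP citizen:
 * eMAP = four binary flags at the work-group level,
 * iMAP = one of five access shapes at the work-item level. */
typedef enum { SINGLE, ROW, COLUMN, BLOCK, NEIGHBOR } imap_t;

typedef struct {
    unsigned m00 : 1, m01 : 1, m10 : 1, m11 : 1; /* eMAP: each in {0,1} */
    imap_t imap;                                 /* iMAP access shape   */
} map_t;

/* Pack a MAP into a small integer, e.g. as a performance-database key. */
int map_id(map_t m) {
    int emap = (m.m00 << 3) | (m.m01 << 2) | (m.m10 << 1) | m.m11;
    return emap * 5 + (int)m.imap;
}
```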

SLIDE 11

Micro-Benchmarks

We generate 2 micro-benchmarks (in OpenCL) for each MAP

The kernel code:

 Local space allocation
 Local data staging
 Local data access (specified by the MAP)

We provide a tool (Aristotle*) to facilitate this process

*https://github.com/haibo031031/aristotle
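A generated kernel might look roughly like this (an illustrative OpenCL C skeleton, not Aristotle's actual output; GROUP_SIZE would be supplied at build time, e.g. via -DGROUP_SIZE=64):

```c
// Illustrative OpenCL C skeleton of a generated micro-benchmark
// (names and the Row-style access are assumptions, not Aristotle's output).
__kernel void map_bench(__global const float *in, __global float *out) {
    // 1. Local space allocation
    __local float tile[GROUP_SIZE];

    int lid = get_local_id(0);
    int gid = get_global_id(0);

    // 2. Local data staging: each work-item copies one element
    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // 3. Local data access, as specified by the MAP
    //    (here a Row iMAP: every work-item walks the whole row)
    float acc = 0.0f;
    for (int i = 0; i < GROUP_SIZE; i++)
        acc += tile[i];
    out[gid] = acc;
}
```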

SLIDE 12

Experimental Platforms

Devices

 SPM-only: NVIDIA C1060
 SPM+Cache: AMD HD7970, NVIDIA C2050, K20
 Cache-only: Intel Xeon X5650, E5-2620, Intel Phi 5110P

Software environments

 AMD APP v2.8
 Intel OpenCL SDK v3.0
 NVIDIA CUDA v5.5 (not updated for a long time)

SLIDE 13

Experimental Setup

Metric: bandwidth

Datasets

 128, 256, 512, 1024, 2048, 4096
 Block MAPs: r=3

Run each measurement for 21 iterations

 1 iteration to warm up
 Measure 20 iterations
 Flush caches between iterations

SLIDE 14

A Bird's-eye View

[Figure: distribution of performance gain / loss / similar per device (C1060, C2050, K20, HD7970, Phi-5110P, E5-2620, X5650), grouped as NVIDIA GPUs, AMD GPU, and Intel CPUs & Phi; dataset 4096x4096]


SLIDE 18

SPM Processors

Memory bandwidth increase factors:

 Data reuse (A)
 Changed memory access order (B)
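Factor (A) can be made concrete with a count: in a 3x3 stencil, every interior input is read up to 9 times, so staging it once into local memory cuts global loads by nearly 9x (a counting sketch under a single-tile best case, not the paper's model; factor (B), the changed access order, is not captured by a pure load count):

```c
/* Global-memory loads for the n x n interior of a 3x3 stencil:
 * naive:  each of the n*n outputs re-reads its 9 inputs from global memory;
 * staged: each input (tile plus a 1-element halo on each side) is loaded
 *         exactly once into local memory and reused from there. */
long naive_loads(long n)  { return n * n * 9; }
long staged_loads(long n) { return (n + 2) * (n + 2); }
```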

SLIDE 19

SPM Processors: w/o Caches vs. w/ Caches

The performance gain is canceled out

Disabling local memory is better

 e.g., MAP-514

[Figure: bandwidth w/ and w/o local memory on C1060, C2050, K20; higher is better]

SLIDE 20


Cache-only Processors

Emulating local memory in global memory

Using local memory might utilize caches better

 e.g., MAP-302

[Figure: bandwidth w/ and w/o local memory on Phi-5110P, E5-2620, X5650; higher is better]
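One way the emulated path can still win, consistent with the MAP-302 (column-style) result: the staging copy turns a strided walk into a contiguous one, so later accesses use whole cache lines (an illustrative C sketch, not the paper's MAP-302 code):

```c
#define W 8  /* row length = stride of a column walk */
#define H 8

static float matrix[W * H];  /* ordinary (cacheable) global memory      */
static float column_buf[H];  /* "local memory" emulated in plain memory */

void init_matrix(void) {
    for (int i = 0; i < W * H; i++) matrix[i] = (float)i;
}

/* Staging gathers one strided column into a contiguous buffer; every
 * later pass over column_buf is a linear walk that uses all of each
 * fetched cache line, instead of one float per line. */
void stage_column(int col) {
    for (int row = 0; row < H; row++)
        column_buf[row] = matrix[row * W + col];
}
```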

SLIDE 21

Cache-only Processors

Using local memory on MAP-302 leads to a bandwidth (BW) increase

We profile the number of cache-line replacements on the E5-2620

[Figure: L1 and L2 cache-line replacements w/ and w/o local memory on E5-2620; lower is better]

SLIDE 22

Performance Database Use-Scenario

SLIDE 23

Summary

Data reuse and access-order changes are positive factors

Unpredictable local memory performance is due to caches

We propose a query-based approach to decide on local memory usage

Architecture design indications

 Should SPM and caches co-exist!?

SLIDE 24

Follow-up Questions

Evaluations on more platforms, e.g., tiny GPUs

How to identify MAPs for a given kernel?

 Visual inspection
 Automatic tools

How to 'predict' the performance impact of using local memory in the presence of multiple MAPs?

How to enable/disable local memory usage?

 Enabler*: w/o → w/
 Disabler**: remove the use of local memory (w/ → w/o)

*J. Fang, et al., "ELMO: A User-Friendly API to Enable Local Memory in OpenCL Kernels," PDP 2013.
**J. Fang, et al., "Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels," ICPP 2014 (in submission).

SLIDE 25

Questions

Jianbin Fang, PhD student at TU Delft (j.fang@tudelft.nl)
