

SLIDE 1

MAP-Driven Performance Analysis for Local Memory Usage

Jianbin Fang+, Henk Sips+, Ana Lucia Varbanescu++

+ Delft University of Technology, the Netherlands
++ University of Amsterdam, the Netherlands

May 12-13, 2014, Bristol, England
The 2nd International Workshop on OpenCL (IWOCL'14)

SLIDE 2

Outline

 Introducing local memory (background)
 Our research question (why)
 Our approach (how)
 Our key findings (results)

SLIDE 3

OpenCL and Local Memory

 Like a cache: on-chip and faster
 Not a cache: user-managed
 Data elements are shared by the work-items of a work-group
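"User-managed" is the key property: the programmer, not the hardware, moves data into the fast buffer. A plain-C sketch of the idea (names like `sum_staged_tile` are illustrative, not from the paper):

```c
#include <string.h>

/* Plain-C sketch of user-managed staging: a tile of "global" data is
 * explicitly copied into a small fast buffer, which every work-item of
 * the work-group then shares. */
#define GROUP_SIZE 8
#define GLOBAL_SIZE 64

static float global_mem[GLOBAL_SIZE]; /* "global memory": large, far away */
static float local_mem[GROUP_SIZE];   /* "local memory": small, on-chip   */

void init_global(void) {
    for (int i = 0; i < GLOBAL_SIZE; i++) global_mem[i] = (float)i;
}

/* Stage one tile, then let every "work-item" (here: loop iteration) read
 * the shared copy; on a GPU this staging is explicit code in the kernel. */
float sum_staged_tile(int tile_start) {
    memcpy(local_mem, &global_mem[tile_start], sizeof local_mem);
    float sum = 0.0f;
    for (int wi = 0; wi < GROUP_SIZE; wi++)
        sum += local_mem[wi]; /* all work-items share these elements */
    return sum;
}
```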

SLIDE 4

Rules of Thumb

 Using local memory on GPUs is preferred (e.g., when there is data reuse)
 Using local memory on CPUs is not recommended


SLIDE 6

The Reality is ...

A counter-intuitive example:

 3x3 convolution
 On an Intel Xeon E5620 (6 cores)

[Figure: bandwidth (GB/s) vs. dataset size (128x128 to 2048x2048), with (w/) and without (w/o) local memory; higher is better]

Local memory (LM) on CPUs ≠ performance loss
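The benchmarked kernel is, in scalar C form, roughly the following (border handling and coefficients here are illustrative assumptions, not the paper's exact benchmark code):

```c
/* Scalar sketch of a 3x3 convolution over a w x h image (interior only). */
void conv3x3(const float *in, float *out, const float k[9], int w, int h) {
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; dy++)      /* 3x3 neighborhood */
                for (int dx = -1; dx <= 1; dx++)
                    acc += k[(dy + 1) * 3 + (dx + 1)]
                         * in[(y + dy) * w + (x + dx)];
            out[y * w + x] = acc;
        }
}
```

Each output reads 9 inputs, so each input is reused up to 9 times: exactly the reuse that staging into local memory targets, and that a CPU's caches may capture anyway.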

SLIDE 7

When to Use Local Memory?

SLIDE 8

Our Approach

SLIDE 9

MAP Description

OpenCL organizes parallelism at two levels, so we describe a MAP (memory access pattern) at two levels:

 eMAP: work-group level
 iMAP: work-item level

SLIDE 10

33 MAP Citizens

eMAP

 M00, M01, M10, M11 ∈ {0,1}

iMAP

 Single, Row, Column, Block, Neighbor
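One possible encoding of a MAP citizen as a data structure (the field names and packing are hypothetical, not the paper's; note that 16 eMAPs x 5 iMAPs gives 80 raw combinations, so the count of 33 citizens implies many combinations are invalid or redundant):

```c
/* Hypothetical encoding of one MAP citizen:
 * eMAP = four binary flags at the work-group level,
 * iMAP = one of five access shapes at the work-item level. */
typedef enum { SINGLE, ROW, COLUMN, BLOCK, NEIGHBOR } imap_t;

typedef struct {
    unsigned m00 : 1, m01 : 1, m10 : 1, m11 : 1; /* eMAP: each in {0,1} */
    imap_t imap;                                 /* iMAP access shape   */
} map_t;

/* Pack a MAP into a small integer, e.g. as a performance-database key. */
int map_id(map_t m) {
    int emap = (m.m00 << 3) | (m.m01 << 2) | (m.m10 << 1) | m.m11;
    return emap * 5 + (int)m.imap;
}
```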

SLIDE 11

Micro-Benchmarks

We generate 2 micro-benchmarks (in OpenCL) for each MAP

The kernel code:

 Local space allocation
 Local data staging
 Local data access (specified by the MAP)

We provide a tool (Aristotle*) to facilitate this process

*https://github.com/haibo031031/aristotle
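A generated kernel might look roughly like this (an illustrative OpenCL C skeleton, not Aristotle's actual output; GROUP_SIZE would be supplied at build time, e.g. via -DGROUP_SIZE=64):

```c
// Illustrative OpenCL C skeleton of a generated micro-benchmark
// (names and the Row-style access are assumptions, not Aristotle's output).
__kernel void map_bench(__global const float *in, __global float *out) {
    // 1. Local space allocation
    __local float tile[GROUP_SIZE];

    int lid = get_local_id(0);
    int gid = get_global_id(0);

    // 2. Local data staging: each work-item copies one element
    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // 3. Local data access, as specified by the MAP
    //    (here a Row iMAP: every work-item walks the whole row)
    float acc = 0.0f;
    for (int i = 0; i < GROUP_SIZE; i++)
        acc += tile[i];
    out[gid] = acc;
}
```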

SLIDE 12

Experimental Platforms

Devices

 SPM-only: NVIDIA C1060
 SPM+Cache: AMD HD7970, NVIDIA C2050, K20
 Cache-only: Intel Xeon X5650, E5-2620, Intel Phi 5110P

Software environments

 AMD APP v2.8
 Intel OpenCL SDK v3.0
 NVIDIA CUDA v5.5 (not updated for a long time)

SLIDE 13

Experimental Setup

Metric: bandwidth

Datasets

 128, 256, 512, 1024, 2048, 4096
 Block MAPs: r=3

Run each measurement for 21 iterations

 1 iteration to warm up
 Measure 20 iterations
 Flush caches between iterations

SLIDE 14

A Bird's-eye View

[Figure: distribution of performance gain / loss / similar per device (C1060, C2050, K20, HD7970, Phi-5110P, E5-2620, X5650), grouped as NVIDIA GPUs, AMD GPU, and Intel CPUs & Phi; dataset 4096x4096]


SLIDE 18

SPM Processors

Memory bandwidth increase factors:

 Data reuse (A)
 Changed memory access order (B)
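Factor (A) can be made concrete with a count: in a 3x3 stencil, every interior input is read up to 9 times, so staging it once into local memory cuts global loads by nearly 9x (a counting sketch under a single-tile best case, not the paper's model; factor (B), the changed access order, is not captured by a pure load count):

```c
/* Global-memory loads for the n x n interior of a 3x3 stencil:
 * naive:  each of the n*n outputs re-reads its 9 inputs from global memory;
 * staged: each input (tile plus a 1-element halo on each side) is loaded
 *         exactly once into local memory and reused from there. */
long naive_loads(long n)  { return n * n * 9; }
long staged_loads(long n) { return (n + 2) * (n + 2); }
```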

SLIDE 19

SPM Processors: w/o Caches vs. w/ Caches

The performance gain is canceled out

Disabling local memory is better

 e.g., MAP-514

[Figure: bandwidth w/ and w/o local memory on C1060, C2050, K20; higher is better]

SLIDE 20


Cache-only Processors

Emulating local memory in global memory

Using local memory might utilize caches better

 e.g., MAP-302

[Figure: bandwidth w/ and w/o local memory on Phi-5110P, E5-2620, X5650; higher is better]
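One way the emulated path can still win, consistent with the MAP-302 (column-style) result: the staging copy turns a strided walk into a contiguous one, so later accesses use whole cache lines (an illustrative C sketch, not the paper's MAP-302 code):

```c
#define W 8  /* row length = stride of a column walk */
#define H 8

static float matrix[W * H];  /* ordinary (cacheable) global memory      */
static float column_buf[H];  /* "local memory" emulated in plain memory */

void init_matrix(void) {
    for (int i = 0; i < W * H; i++) matrix[i] = (float)i;
}

/* Staging gathers one strided column into a contiguous buffer; every
 * later pass over column_buf is a linear walk that uses all of each
 * fetched cache line, instead of one float per line. */
void stage_column(int col) {
    for (int row = 0; row < H; row++)
        column_buf[row] = matrix[row * W + col];
}
```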

SLIDE 21

Cache-only Processors

Using local memory on MAP-302 leads to a bandwidth (BW) increase

We profile the number of cache-line replacements on the E5-2620

[Figure: L1 and L2 cache-line replacements w/ and w/o local memory on E5-2620; lower is better]

SLIDE 22

Performance Database Use-Scenario

SLIDE 23

Summary

Data reuse and access-order changes are positive factors

Unpredictable local memory performance is due to caches

We propose a query-based approach to decide on local memory usage

Architecture design indications

 Should SPM and caches co-exist!?

SLIDE 24

Follow-up Questions

Evaluations on more platforms, e.g., tiny GPUs

How to identify MAPs for a given kernel?

 Visual inspection
 Automatic tools

How to 'predict' the performance impact of using local memory in the presence of multiple MAPs?

How to enable/disable local memory usage?

 Enabler*: w/o → w/
 Disabler**: remove the use of local memory (w/ → w/o)

*J. Fang, et al., "ELMO: A User-Friendly API to Enable Local Memory in OpenCL Kernels," PDP 2013.
**J. Fang, et al., "Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels," ICPP 2014 (in submission).

SLIDE 25

Questions

Jianbin Fang, PhD student at TU Delft (j.fang@tudelft.nl)
