

SLIDE 1

MuCoCoS'13: Quantifying the Performance Impacts of Using Local Memory

Quantifying the Performance Impacts of Using Local Memory for Many-Core Processors

Jianbin Fang (1), Ana Lucia Varbanescu (2), Henk Sips (1)

(1) Delft University of Technology
(2) University of Amsterdam
The Netherlands

SLIDE 2

Looking Back on OpenCL

• OpenCL: Open Computing Language
  • An open programming framework by the Khronos Group
  • Heterogeneous platforms: CPUs, GPUs, MIC, FPGAs, DSPs, …
• OpenCL platform model
• An OpenCL program
  • Kernel: a language based on C99
  • Host: a set of APIs
• Adopted by many vendors
• Current version: v2.0 (July 2013)

SLIDE 3

OpenCL and Local Memory

• Local memory is a key performance factor
  • FAST: On-chip
  • Not-a-Cache: User-managed
• Current status: using local memory is a trial-and-error process
  • Work hard to enable it …
  • and hope for performance gain.

SLIDE 4

Our Idea

• Performance impact estimation
  • How can we estimate the benefits of using local memory?
• Assess the necessity of using local memory
• Facilitate performance modeling of OpenCL platforms

SLIDE 5

Local Memory “Myths”

• Local memory (LM) assumptions for performance gain:
  • Data sharing is mandatory
  • Using LM on GPUs is mandatory
  • Using LM on CPUs must be avoided
• We contradict these myths!
  • Data reuse is not equivalent to LM performance gain
  • Enabling LM on GPUs can be skipped
  • Enabling LM on CPUs can be beneficial

SLIDE 6

Data reuse ≠ Performance gain

• NBody on GTX580
  • Threads share exactly the same data set
• Results (in GB/s)
• Conclusion
  • Using local memory performs worse by 18% on average

SLIDE 7

No data reuse ≠ Performance loss

• Describe analysis
• Conclusion
  • Besides data reuse (D), access order change (W) matters!
  • Matrix transpose is a good example.

[Figure: diagram labelled Wg, Wg’, Wl and D, D’]
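The matrix transpose remark can be made concrete with a small counting model. The sketch below is an illustration only, not the authors' benchmark: it assumes row-major storage, 4-byte elements, batches of 32 adjacent work-items that access memory together, and 128-byte aligned memory segments. None of these parameters come from the slides. It counts how many segments each batch touches when the transpose writes directly to global memory versus when a tile is first reordered in local memory so that the writes become contiguous.

```python
def segments_touched(byte_addresses, segment_bytes=128):
    """Count the distinct aligned memory segments hit by one batch of accesses."""
    return len({addr // segment_bytes for addr in byte_addresses})


def transpose_transactions(n, use_local, batch=32, elem_bytes=4):
    """Count global-memory transactions for an n x n transpose (n a multiple of batch).

    Assumed model: `batch` work-items with consecutive x issue their accesses together.
    """
    total = 0
    for y in range(n):
        for x0 in range(0, n, batch):
            xs = range(x0, x0 + batch)
            # Reads of in[y][x] are contiguous in both variants.
            reads = [(y * n + x) * elem_bytes for x in xs]
            if use_local:
                # The tile is reordered in local memory, so each batch writes
                # one contiguous row piece of the output. Only the count matters
                # here, so the same contiguous address pattern is reused.
                writes = [(y * n + x) * elem_bytes for x in xs]
            else:
                # Direct write to out[x][y]: neighbours are n elements apart.
                writes = [(x * n + y) * elem_bytes for x in xs]
            total += segments_touched(reads) + segments_touched(writes)
    return total
```

Under these assumptions, `transpose_transactions(64, use_local=False)` gives 4224 while `transpose_transactions(64, use_local=True)` gives 256, even though no element is read more than once. The model counts transactions only; it does not predict bandwidth on any particular device.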

SLIDE 8

LM on CPUs ≠ Performance loss

• Image convolution on CPU
  • Intel Xeon E5620 (6 cores)
  • Filter radius is 3
• Results (in GB/s)
• Conclusion
  • Using local memory delivers 2x better performance
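To see how much data reuse a convolution with filter radius 3 offers, the following sketch counts the element loads issued to global memory by one square work-group, with and without staging the tile and its halo in local memory. The 16 x 16 tile size is an assumption chosen for illustration, and the model counts issued loads only: hardware caches, which matter a great deal on a CPU, are ignored, so the ratio is an upper bound on reuse and not a prediction of the measured speedup.

```python
def global_loads_per_tile(tile, radius, use_local):
    """Element loads from global memory issued by one tile x tile work-group."""
    window = 2 * radius + 1
    if use_local:
        # The tile and its surrounding halo are staged once in local memory.
        return (tile + 2 * radius) ** 2
    # Every work-item reads its full filter window from global memory.
    return tile * tile * window * window


def reuse_factor(tile, radius):
    """Ratio of issued loads without and with local-memory staging."""
    return (global_loads_per_tile(tile, radius, False)
            / global_loads_per_tile(tile, radius, True))
```

For `tile=16` and `radius=3` this gives 12544 loads without staging and 484 with staging, a ratio of roughly 25.9.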

SLIDE 9

Performance Impact Estimation

• Not an easy job
  • No assumptions hold for all cases
  • Application-dependent
  • Platform-dependent
• Our approach:
  1. Enumerate and analyze all feasible memory access patterns (MAPs)
  2. Quantify and log local memory impacts for each MAP on each platform (in terms of bandwidth)
  3. Model applications as (compositions of) MAPs
  4. Quantify an application’s gain by search and compose
SLIDE 10

Our Approach

SLIDE 11

Stage I: Quantification

MAP = eMAP + iMAP

• eMAP: 16 cases (shown: eMAP-14)

MAP Description:

SLIDE 12

Stage I: Quantification

MAP = eMAP + iMAP

• eMAP: 16 cases (shown: eMAP-14)

s = { dy = ty, dx = ty + tx }

SLIDE 13

Stage I: Quantification

MAP = eMAP + iMAP

• eMAP: 16 cases (shown: eMAP-14)
• iMAP: 5 cases: Single, Row, Column, Block, Neighbor (shown: Block (4))
• 34 patterns

MAP Description:

SLIDE 14

Stage I: Quantification

Generating Benchmarks (MAP-407)

• eMAP-07: s = { dy = tx, dx = tx }
• iMAP: Block (4), r = 1

Max vs. Min:
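The label MAP-407 appears next to eMAP-07 and Block (4), which suggests that a MAP identifier packs the iMAP index in the hundreds digit and the eMAP index in the last two digits. The sketch below decodes identifiers under that reading. This encoding is an inference from the slide labels, not something the slides state, and only Block = 4 is given in the source; the other iMAP indices are assumed from the order in which the five cases are listed.

```python
# Only "Block": 4 appears in the slides; the other indices are assumed
# from the listing order Single, Row, Column, Block, Neighbor.
IMAP_NAMES = {1: "Single", 2: "Row", 3: "Column", 4: "Block", 5: "Neighbor"}


def decode_map_id(map_id):
    """Split a MAP identifier into (eMAP label, iMAP name) under the assumed encoding."""
    imap_index, emap_index = divmod(map_id, 100)
    if imap_index not in IMAP_NAMES:
        raise ValueError(f"unknown iMAP index in MAP-{map_id}")
    return f"eMAP-{emap_index:02d}", IMAP_NAMES[imap_index]
```

With this reading, `decode_map_id(407)` returns `("eMAP-07", "Block")`.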

SLIDE 15

Stage I: Quantification

Min/Max Comparison (MAP-407)

mbr = B/b

[Figure: min/max comparison chart with a "better" direction marker]

SLIDE 16

Stage I: Quantification

Performance Database Overview

SLIDE 17

Performance Database (MAP-407)

Platforms: GTX580, GTX280, HD6970, E5620

SLIDE 18

Stage I: Quantification

Performance Database Summary

• Data reuse
• Access order change

SLIDE 19

Our Approach

SLIDE 20

Stage II: A Query-based Performance Prediction

• Kernel performance gain due to LM = ratio of the memory bandwidth after (B) to the memory bandwidth before (b) using LM
• Predicting bandwidth when using LM
  • Identify MAPs (manually)
  • Query bandwidth information (B, b) from DB
  • Compose the bandwidths of individual MAPs
• IC, MM, MT, SOR on GTX580
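The query-and-compose step can be sketched as a lookup followed by a combination rule. The slides do not say how bandwidths are composed, so the rule below is an assumption: accesses belonging to different MAPs are treated as serialized, so their transfer times add up. The database layout, the function names, the platform name and the numbers are all hypothetical placeholders, not measurements from the presentation.

```python
def mbr(entry):
    """Memory bandwidth ratio B/b for one database entry (b, B)."""
    b, B = entry
    return B / b


def predict_gain(db, platform, accesses):
    """Predict the gain from local memory for a kernel made of several MAPs.

    accesses: list of (map_id, bytes_moved) pairs identified by hand.
    Assumed composition rule: per-MAP transfer times are summed.
    """
    time_before = 0.0
    time_after = 0.0
    for map_id, nbytes in accesses:
        b, B = db[(platform, map_id)]
        time_before += nbytes / b  # without local memory
        time_after += nbytes / B   # with local memory
    return time_before / time_after


# Made-up placeholder bandwidths in GB/s as (b, B); NOT measured values.
EXAMPLE_DB = {
    ("example-gpu", 407): (50.0, 100.0),
    ("example-gpu", 408): (80.0, 40.0),
}
```

For a kernel with a single MAP the prediction reduces to that MAP's mbr. With several MAPs, a loss on one pattern can cancel a gain on another, which is why the composition rule matters.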

SLIDE 21

Stage II: A Query-based Performance Prediction

• Case I: MT, SOR
  • The kernel has one input matrix (and MAP)
  • Use the corresponding mbr in DB
• Case II: MM
• Case III: IC
  • Assume the filter is small and allocated in on-chip memory
  • Use mbr of MAP-408

SLIDE 22

Stage II: A Query-based Performance Prediction

SLIDE 23

Conclusion

• Quantifying the performance impact of using local memory on many-cores is possible
  • Not as easy as expected => well-known assumptions don’t always hold
  • MAP-based => application-agnostic
  • Query-based => prediction-friendly
  • Database-based => easy to extend
  • Composition-based => applicable to fairly complex kernels

SLIDE 24

On-going Work

• More MAPs and tests (on more diverse platforms, e.g. MIC)
• Further investigate the performance interference between MAPs
• An auto-tuner to automatically enable local memory

SLIDE 25

Questions

Jianbin Fang
PhD student at TU Delft
Email: j.fang@tudelft.nl
WWW: http://www.pds.ewi.tudelft.nl/fang/