
Quantifying the Performance Impacts of Using Local Memory for Many-Core Processors
Jianbin Fang (1), Ana Lucia Varbanescu (2), Henk Sips (1)
(1) Delft University of Technology, (2) University of Amsterdam, The Netherlands
MuCoCoS'13

Looking Back on OpenCL
- OpenCL: Open Computing Language
  - An open programming framework by the Khronos Group
  - Heterogeneous platforms: CPUs, GPUs, MIC, FPGAs, DSPs, …
- OpenCL platform model
- An OpenCL program
  - Kernel: a language based on C99
  - Host: a set of APIs
- Adopted by many vendors
- Current version: v2.0 (provisional specification, July 2013)

OpenCL and Local Memory
- Local memory is a key performance factor
  - Fast: on-chip
  - Not a cache: user-managed
- Current status: using local memory is a trial-and-error process
  - Work hard to enable it …
  - … and hope for a performance gain.

Our Idea
- Performance impact estimation
  - How can we estimate the benefits of using local memory?
- Assess the necessity of using local memory
- Facilitate performance modeling of OpenCL platforms

Local Memory “Myths”
- Common assumptions about local memory (LM) and performance gain:
  - Data sharing is mandatory
  - Using LM on GPUs is mandatory
  - Using LM on CPUs must be avoided
- We contradict these myths!
  - Data reuse is not equivalent to an LM performance gain
  - Enabling LM on GPUs can be skipped
  - Enabling LM on CPUs can be beneficial

Data reuse ≠ Performance gain
- NBody on GTX580
  - Threads share exactly the same data set
- Results (in GB/s) [figure]
- Conclusion
  - Using local memory performs worse by 18% on average

No data reuse ≠ Performance loss
- Describe analysis
  [Figure: diagram labeled D, D', W_l, W_g, W_g']
- Conclusion
  - Besides data reuse (D), access order change (W) matters!
  - Matrix transpose is a good example.
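The access-order point can be illustrated with a small Python model of a matrix transpose. This is only a sketch under stated assumptions: the matrix width `N`, the tile size `T`, and the function names are hypothetical, and the tiled variant assumes the common scheme in which a tile is staged in local memory so that both the global read and the global write walk along rows. It reports the address stride between two adjacent work-items; no element is reused in either variant.

```python
# Sketch: address stride between adjacent work-items in a matrix transpose.
# N, T and all function names are hypothetical, chosen for illustration.

N = 8  # the matrix is N x N, stored row-major


def naive_addresses(ty, tx):
    """Work-item (ty, tx) reads in[ty][tx] and writes out[tx][ty]."""
    read = ty * N + tx
    write = tx * N + ty
    return read, write


def tiled_addresses(ty, tx, T=4):
    """With a T x T tile staged in local memory, the work-group at tile
    (by, bx) reads tile (by, bx) row-wise and writes tile (bx, by) row-wise."""
    by, bx = ty // T, tx // T
    ly, lx = ty % T, tx % T
    read = (by * T + ly) * N + (bx * T + lx)
    write = (bx * T + ly) * N + (by * T + lx)
    return read, write


def strides(addr_fn, ty=0):
    """Stride between work-items tx=0 and tx=1 in the same row."""
    r0, w0 = addr_fn(ty, 0)
    r1, w1 = addr_fn(ty, 1)
    return r1 - r0, w1 - w0


naive_strides = strides(naive_addresses)
tiled_strides = strides(tiled_addresses)
print("naive (read, write) stride:", naive_strides)
print("tiled (read, write) stride:", tiled_strides)
```

In this model the naive write stride equals the matrix width, while the tiled version keeps both strides at one element, which is the kind of access order change (W) the slide refers to.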

LM on CPUs ≠ Performance loss
- Image convolution on CPU
  - Intel Xeon E5620 (6 cores)
  - Filter radius is 3
- Results (in GB/s) [figure]
- Conclusion
  - Using local memory delivers better performance (2x)
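As a rough illustration of how much data reuse a filter of radius 3 implies, the following sketch counts global-memory load requests for one work-group with and without staging the input tile in local memory. The work-group size `T = 16` and the function names are assumptions, not values from the slides. The counts describe requests only and are not a performance prediction; caches can absorb part of the reuse, and the measured gain on the slide is 2x.

```python
# Sketch: global-memory load requests for one work-group of an image convolution.
# The work-group size T is an assumption; the filter radius r = 3 is from the slide.

def loads_without_local_memory(T, r):
    # Each of the T*T work-items reads its full (2r+1) x (2r+1) neighborhood.
    return T * T * (2 * r + 1) ** 2


def loads_with_local_memory(T, r):
    # The work-group loads its tile plus a halo of width r into local memory once.
    return (T + 2 * r) ** 2


T, r = 16, 3
direct = loads_without_local_memory(T, r)
staged = loads_with_local_memory(T, r)
print("without LM:", direct)
print("with LM:   ", staged)
print("request ratio:", round(direct / staged, 1))
```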

Performance Impact Estimation
- Not an easy job
  - No assumptions hold for all cases
  - Application-dependent
  - Platform-dependent
- Our approach:
  1. Enumerate and analyze all feasible memory access patterns (MAPs)
  2. Quantify and log local memory impacts for each MAP on each platform (in terms of bandwidth)
  3. Model applications as (compositions of) MAPs
  4. Quantify an application's gain by search and composition

Our Approach
[Figure]

Stage I: Quantification
- MAP description: MAP = eMAP + iMAP
  - eMAP: 16 cases
- Example: eMAP-14 [figure]

Stage I: Quantification
- MAP = eMAP + iMAP
  - eMAP: 16 cases
- eMAP-14: s = { d_y = t_y, d_x = t_y + t_x }
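To see what the eMAP-14 equations describe, the sketch below enumerates the data coordinate touched by each work-item in a small group. It assumes that (t_x, t_y) is the work-item index and (d_x, d_y) the coordinate of the accessed data element; the slides do not spell this out, and the function names and the 4 x 2 group size are hypothetical.

```python
# Sketch: enumerate the data element each work-item touches under eMAP-14.
# Assumes (t_x, t_y) is the work-item index and (d_x, d_y) the data coordinate.

def emap14(tx, ty):
    """eMAP-14 as given on the slide: d_y = t_y, d_x = t_y + t_x."""
    return ty + tx, ty  # (d_x, d_y)


def access_table(emap, width, height):
    """Rows are indexed by t_y, columns by t_x; each entry is (d_x, d_y)."""
    return [[emap(tx, ty) for tx in range(width)] for ty in range(height)]


table = access_table(emap14, 4, 2)
for ty, row in enumerate(table):
    print(f"t_y={ty}:", row)
```

Under this reading, each row of work-items touches a contiguous run of elements, and the run shifts by one element per row.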

Stage I: Quantification
- MAP description: MAP = eMAP + iMAP
  - eMAP: 16 cases
  - iMAP: 5 cases (Single, Row, Column, Block, Neighbor)
- 34 patterns
- Example: MAP-14, Block (4)

Stage I: Quantification
Generating Benchmarks (MAP-407)
- Block (4), r = 1
- eMAP-07: s = { d_y = t_x, d_x = t_x }
- Max vs. Min: [figure]

Stage I: Quantification
Min/Max Comparison (MAP-407)
- mbr = B/b
- [Figure: min/max comparison plot]
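The ratio mbr = B/b can be computed and interpreted as in the following sketch, where B is the bandwidth with local memory and b the bandwidth without it. The bandwidth values and the 5% tolerance band are made-up placeholders for illustration, not measurements or thresholds from the slides, and the function names are hypothetical.

```python
# Sketch: memory bandwidth ratio (mbr) and a simple reading of it.
# All numbers below are made-up placeholders, not measured results.

def mbr(bandwidth_with_lm, bandwidth_without_lm):
    """mbr = B / b, with B measured with local memory and b without."""
    if bandwidth_without_lm <= 0:
        raise ValueError("bandwidth without local memory must be positive")
    return bandwidth_with_lm / bandwidth_without_lm


def verdict(ratio, tolerance=0.05):
    """Classify an mbr value; the tolerance band is an assumed choice."""
    if ratio > 1 + tolerance:
        return "gain"
    if ratio < 1 - tolerance:
        return "loss"
    return "similar"


example_gain = mbr(120.0, 80.0)   # hypothetical B and b in GB/s
example_loss = mbr(60.0, 80.0)    # hypothetical B and b in GB/s
print(example_gain, verdict(example_gain))
print(example_loss, verdict(example_loss))
```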

Stage I: Quantification
Performance Database Overview
[Figure]

Performance Database (MAP-407)
[Figure panels: GTX280, HD6970, GTX580, E5620]

Stage I: Quantification
Performance Database Summary
[Figure labels: Access order change, Data reuse]

Our Approach
[Figure]

Stage II: A Query-based Performance Prediction
- Kernel performance gain due to LM = ratio of the memory bandwidth after (B) to the bandwidth before (b) using LM
- Predicting bandwidth when using LM
  - Identify MAPs (manually)
  - Query bandwidth information (B, b) from the database (DB)
  - Compose the bandwidths of individual MAPs
- IC, MM, MT, SOR on GTX580
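The query step for a kernel with a single MAP can be sketched as a dictionary lookup. Everything concrete here is an assumption: the database layout, the platform names `platform-A` and `platform-B`, and all bandwidth numbers are made-up placeholders, and only the MAP identifier comes from the slides. The composition of several MAPs is left out because the slides do not state the composition rule.

```python
# Sketch: query-based prediction for a single-MAP kernel.
# The database layout, platform names and all numbers are hypothetical.
from typing import NamedTuple


class Entry(NamedTuple):
    b: float  # bandwidth without local memory (GB/s)
    B: float  # bandwidth with local memory (GB/s)


PERF_DB = {
    ("platform-A", "MAP-407"): Entry(b=50.0, B=100.0),
    ("platform-B", "MAP-407"): Entry(b=10.0, B=8.0),
}


def predict_gain(platform, map_id, db=PERF_DB):
    """Return the predicted gain mbr = B / b for one MAP on one platform."""
    entry = db.get((platform, map_id))
    if entry is None:
        raise KeyError(f"no measurement for {map_id} on {platform}")
    return entry.B / entry.b


gain_a = predict_gain("platform-A", "MAP-407")
gain_b = predict_gain("platform-B", "MAP-407")
print("platform-A:", gain_a)
print("platform-B:", gain_b)
```

Keying the table by (platform, MAP) reflects the slide's point that the impact is both platform-dependent and pattern-dependent; a value below 1 predicts that enabling local memory would slow the kernel down.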

Stage II: A Query-based Performance Prediction
- Case I: MT, SOR
  - The kernel has one input matrix (and one MAP)
  - Use the corresponding mbr in the DB
- Case II: MM
- Case III: IC
  - Assume the filter is small and allocated in on-chip memory
  - Use the mbr of MAP-408

Stage II: A Query-based Performance Prediction
[Figure]

Conclusion
- Quantifying the performance impact of using local memory on many-cores is possible
  - Not as easy as expected => well-known assumptions don't always hold
  - MAP-based => application-agnostic
  - Query-based => prediction-friendly
  - Database-based => easy to extend
  - Composition-based => applicable to fairly complex kernels

On-going Work
- More MAPs and tests (on more diverse platforms, e.g. MIC)
- Further investigate the performance interference between MAPs
- An auto-tuner to automatically enable local memory

Questions
Jianbin Fang, PhD student at TU Delft
Email: j.fang@tudelft.nl
WWW: http://www.pds.ewi.tudelft.nl/fang/
