Performance Gaps between OpenMP and OpenCL for Multi-core CPUs

Jie Shen, Jianbin Fang, Henk Sips, and Ana Lucia Varbanescu
Parallel and Distributed Systems Group, Delft University of Technology, The Netherlands
P2S2 2012


SLIDE 1


Performance Gaps between OpenMP and OpenCL for Multi-core CPUs

Jie Shen, Jianbin Fang, Henk Sips, and Ana Lucia Varbanescu

Parallel and Distributed Systems Group Delft University of Technology, The Netherlands

SLIDE 2


Introduction

  • Multi-core CPU and GPU programming keeps gaining popularity for parallel computing
  • OpenCL has been proposed to tackle multi-/many-core diversity in a unified way
  • OpenCL (Open Computing Language), Khronos Group
  • The first open standard for cross-platform parallel programming
SLIDE 3


Introduction

  • OpenCL programming model: compute kernels (device code) plus a host program

SLIDE 4


Introduction

  • OpenCL programming model: compute kernels (device code) plus a host program
SLIDE 5


Motivation

  • OpenCL shares core parallelism approach with CUDA
  • A research hotspot in GPGPU -> A large amount of free OpenCL code
  • E.g., Parboil, SHOC, Rodinia benchmarks
  • Major CPU vendors’ support

  • Dec 2008: OpenCL 1.0
  • Dec 2009: AMD/ATI SDK 2.0
  • Jun 2011: Intel SDK 1.1
  • Nov 2011: OpenCL 1.2
  • Feb 2012: ARM 1st SDK
  • Apr 2012: Intel SDK 2012
  • May 2012: AMD SDK 2.7

SLIDE 6


Motivation

  • OpenCL cross-platform portability
  • When porting OpenCL code from GPUs to CPUs:
  • Functional correctness?
  • Parallelized performance, compared with sequential code?
  • Similar/better performance, compared with a regular CPU parallel programming model (e.g., OpenMP)?
SLIDE 7


Motivation

  • Reference: Regular parallel OpenMP code
  • Not aggressively optimized
  • Comparison: OpenCL and OpenMP performance on CPUs
  • Target: Where do the performance gaps come from?

  • Host-device data transfers
  • Memory access patterns and cache utilization
  • Floating-point operations
  • Implicit and explicit vectorization

SLIDE 8


Experimental Setup

  • Benchmark
  • Rodinia benchmark suite
  • Equivalent implementations in OpenMP, CUDA and OpenCL
  • Hardware platforms
  • OpenCL SDKs
  • Intel OpenCL SDK 1.1
  • AMD APP SDK 2.5
  • We have updated the compilers to Intel OpenCL SDK 2012 / AMD APP SDK 2.7 in the extended version of P2S2

Name  Processor                                      # Cores          # HW Threads
N8    2.40GHz Intel Xeon E5620 (2x hyper-threaded)   2x quad-core     16
D6    2.67GHz Intel Xeon X5650 (2x hyper-threaded)   2x six-core      24
MC    2.10GHz AMD Opteron 6172 (Magny-Cours)         4x twelve-core   48

SLIDE 9


  • Wall clock time = Initialization (INIT) + H2D (host-to-device data transfer) + kernel execution + D2H (device-to-host data transfer)
  • OpenCL: INIT, H2D, kernel execution, and D2H are all explicit phases
  • OpenMP: a one-time warm-up plays the role of INIT; H2D and D2H exist only implicitly around the parallel section
  • Compare the parallel-part wall clock time?

SLIDE 10


Initial Results

  • H2D + kernel execution + D2H

(Figure: initial per-benchmark results; OpenMP performs better on some benchmarks, OpenCL on others.)

SLIDE 11


H2D and D2H on CPUs

  • GPUs (CPU host, GPU device): explicit H2D and D2H are required
  • CPUs (the CPU is both host and device): H2D and D2H are not necessary
  • Use zero copy
  • Zero copy memory objects: accessible to both the host and the device
  • H2D: (1) CL_MEM_ALLOC_HOST_PTR; (2) CL_MEM_USE_HOST_PTR
  • D2H: CL_MEM_ALLOC_HOST_PTR
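As a sketch (helper names are hypothetical; this assumes the usual OpenCL host-side setup and is not the paper's code), zero-copy buffers on a CPU device can be created with the flags listed above:

```c
/* Sketch: zero-copy buffer creation on a CPU device. */
#include <stddef.h>
#include <CL/cl.h>

/* "H2D": wrap the existing host array (CL_MEM_USE_HOST_PTR) so the CPU
 * device can read it in place; no physical host-to-device copy happens. */
cl_mem wrap_input(cl_context ctx, float *host_data, size_t n, cl_int *err)
{
    return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                          n * sizeof(float), host_data, err);
}

/* "D2H": let the runtime allocate host-accessible memory
 * (CL_MEM_ALLOC_HOST_PTR); results are read back by mapping the buffer
 * with clEnqueueMapBuffer instead of copying with clEnqueueReadBuffer. */
cl_mem alloc_output(cl_context ctx, size_t n, cl_int *err)
{
    return clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                          n * sizeof(float), NULL, err);
}
```

After the kernel finishes, clEnqueueMapBuffer/clEnqueueUnmapMemObject give the host a pointer into the same memory, which is why the H2D/D2H phases largely disappear on CPU devices.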
SLIDE 12


H2D and D2H on CPUs

  • Use zero copy

Fig.1 Execution time (ms) comparison with/without zero copy (before vs. after): (a) Intel OpenCL SDK 1.1, (b) AMD APP SDK 2.5.

SLIDE 13


  • Data transfers use zero copy
  • INIT and kernel execution remain; H2D and D2H are done via zero copy

Compare kernel execution time!

SLIDE 14


K-means Results

  • OpenCL: a swap kernel remaps the data array from row-major to column-major
  • OpenMP: no data layout swapping

Dataset   Intel SDK   AMD SDK
200K      52.1%       52.2%
482K      76.1%       80.4%
800K      79.6%       81.6%

Fig.2 K-means OpenCL execution time (ms) with/without the swap kernel (before vs. after).
Table 1 K-means performance differences.

SLIDE 15


K-means Results

  • Process a 2D dataset element by element
  • Column-major: GPU-friendly (memory coalescing)
  • Row-major: CPU-friendly (cache locality)
  • Tune the memory access patterns according to the target platforms

Fig.3 Execution time (ms) comparison of K-means after removing the swap kernel in OpenCL: (a) N8, (b) D6, (c) MC.

SLIDE 16


CFD Results

  • CFD also changes row-major to column-major
  • Change back to row-major
  • Improve performance only slightly (within 10%)
  • Apply -cl-fast-relaxed-math compiler option

Fig.4 CFD OpenCL execution time (ms) with/without -cl-fast-relaxed-math (before vs. after).

  • Intel and AMD have different implementations of -cl-fast-relaxed-math
  • Performance improvements
  • OpenCL (Intel): 11%~47.7%
  • OpenMP (similar options): 20%~40%
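For reference, the option is passed at kernel build time (a minimal sketch, assuming a valid program object and device from the usual OpenCL setup; it trades strict IEEE-754 conformance for speed, which is why the two vendors' results can differ):

```c
#include <CL/cl.h>

/* Build an OpenCL program with fast relaxed math enabled (sketch). */
cl_int build_with_relaxed_math(cl_program program, cl_device_id device)
{
    return clBuildProgram(program, 1, &device,
                          "-cl-fast-relaxed-math", NULL, NULL);
}
```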
SLIDE 17


CFD Results

  • Effect of branching
  • OpenCL: Intel implicit vectorization module
  • Makes N work-items execute in parallel in the SIMD unit -> speedup: 1.6x~1.8x
  • Kernels with divergent data-dependent branches -> all branch paths are executed
  • OpenMP: dedicated branch prediction (in hardware)

                 fvcorr.domn.193K (aircraft wings)   missile.domn.0.2M (missile)   Performance ratio
OpenCL (Intel)   42438.00 ms                         80339.00 ms                   1.89
OpenMP           45065.88 ms                         62589.57 ms                   1.38

Table 2 OpenCL and OpenMP have different performance ratios between two datasets of similar size.

SLIDE 18


PathFinder Results

  • OpenMP: Coarse-grained parallelization
  • Each thread processes consecutive data elements
  • OpenCL: Fine-grained parallelization
  • One work-item processes one data element

Fig.4 PathFinder OpenMP/OpenCL performance ratio and OpenMP execution time (ms) with different dataset sizes.

SLIDE 19


PathFinder Results

  • Improve cache utilization explicitly
  • MergeN: Merge N work-items into one
  • VectorN: Explicit vectorization (using the vector type)

Fig.5 PathFinder OpenCL with the MergeN optimization, and execution time comparison with OpenMP on N8 (before vs. after).

SLIDE 20


Conclusion

  • Where do the performance gaps come from?
  • Incorrect usage of multi-core CPUs (user negligence)
  • Explicit H2D and D2H data transfers
  • Column-major memory accesses
  • Parallelism granularity (OpenCL is not properly mapped onto CPUs)
  • A fine-grained parallelism approach can lead to poor CPU cache utilization
  • OpenCL compilers are not fully mature
  • Intel's implicit vectorization module with branches
  • Intel and AMD have different fast floating-point optimizations
  • OpenCL code can be tuned to match OpenMP's regular performance
  • In more than 80% of the test cases
  • OpenCL is, performance-wise, a good alternative for multi-core CPUs
SLIDE 21


Conclusion

  • OpenCL and OpenMP can act as performance indicators
  • OpenMP: locality-friendly coarse-grained parallelism
  • OpenCL: fine-grained parallelism, vectorization
  • This paper: OpenMP is an indicator, and OpenCL is tuned
  • Future work
  • Tune OpenMP to match the performance indicated by OpenCL
  • Develop user-friendly performance (semi-)auto-tuning tools for OpenCL
SLIDE 22


Contacts: J.Shen@tudelft.nl
http://www.pds.ewi.tudelft.nl/
Parallel and Distributed Systems Group, Delft University of Technology, The Netherlands