
Studying Energy Consumption of an OpenCL Application on Mobile GPU: A Case Study
Elena Barreras, Juan M. Jimenez, Arian Maghazeh, Unmesh Bordoloi
12-13 May, 2014

eGPU Compute Applications
[Figure: conventional and potential application domains for embedded GPU compute. Labels shown: ADAS, Media Codecs, Augmented Reality, Security, Graphics, Gaming, Radar Systems: Pattern Detection, Computational Photography]

Choice of Application
• Aho-Corasick: a pattern matching algorithm
  – Is utilized in the security domain, among others
  – Relevant for embedded systems: intrusion detection in vehicular systems, mobile devices
  – Not studied for embedded GPUs
  – State-of-the-art parallel implementation available for high-end GPUs

Goal of Our Study
• Study the energy consumption of OpenCL components
• Optimize for embedded GPUs
  – Energy
  – Running times
• Compare with multi-core implementations
  – Tradeoffs

Aho-Corasick (AC)
• It locates patterns of strings in an input text
Patterns: {she, he, his, hers}
Input text: ushers
Output: {she, he, hers}

Aho-Corasick (AC)
• How it works:
  – Combines all input patterns (the dictionary) and generates a finite state machine
  – Uses the finite state machine to find all the matches in the input text in a single pass
• Open-source implementation available
[Figure: finite state machine with states 0 to 9 built from the patterns {she, he, his, hers}; state 0 loops on ¬{h,s}]
Patterns: {she, he, his, hers}
Input text: ushers
Output: {she, he, hers}
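The two steps above (build a state machine from the dictionary, then scan the text once) can be sketched as follows. This is a minimal illustration of the textbook algorithm, not the open-source implementation the slide refers to; the function names `build_automaton` and `search` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for the given patterns."""
    goto = [{}]        # goto[state][char] -> next state
    output = [set()]   # patterns recognized on entering each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pattern)

    # Breadth-first pass to compute the failure links.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output


def search(text, patterns):
    """Return (start_index, pattern) pairs found in a single pass over text."""
    goto, fail, output = build_automaton(patterns)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]   # follow failure links on a mismatch
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return sorted(matches)


if __name__ == "__main__":
    print(search("ushers", ["she", "he", "his", "hers"]))
```

For the slide's example, the reported patterns are she, he, and hers.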

Parallel Failureless AC (PFAC)
• One thread for every input character
  – 10 M threads for a 10 MB input
• Each thread identifies the pattern that begins on that character
• No failure node
Patterns: {she, he, his, hers}
Input text: ushers
Output: {she, he, hers}
6 threads are launched
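The failureless idea can be simulated sequentially: each loop iteration below stands in for one GPU thread. This is a sketch under that assumption, with hypothetical helper names (`build_trie`, `pfac_thread`, `pfac`); it is not the authors' kernel.

```python
def build_trie(patterns):
    """Build a plain trie (no failure links) and a map of final states."""
    trie = [{}]    # trie[state][char] -> next state
    final = {}     # final[state] -> pattern that ends in that state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in trie[state]:
                trie.append({})
                trie[state][ch] = len(trie) - 1
            state = trie[state][ch]
        final[state] = pattern
    return trie, final


def pfac_thread(text, start, trie, final):
    """Work of one logical thread: report the patterns that begin at `start`."""
    found = []
    state = 0
    for pos in range(start, len(text)):
        state = trie[state].get(text[pos])
        if state is None:
            break   # no failure link to follow: the thread simply terminates
        if state in final:
            found.append(final[state])
    return found


def pfac(text, patterns):
    """Run one logical thread per input character and collect the hits."""
    trie, final = build_trie(patterns)
    hits = {}
    for start in range(len(text)):   # on a GPU these would run in parallel
        found = pfac_thread(text, start, trie, final)
        if found:
            hits[start] = found
    return hits


if __name__ == "__main__":
    print(pfac("ushers", ["she", "he", "his", "hers"]))
```

With the input ushers, six logical threads run; only the ones starting at s and at h report matches.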

PFAC GPU Implementation
• Optimization on high-end GPUs
  – Load input text partially from global to local memory
  – Uses transition table
  – Load first row of the table into local memory
  – Convert transition table into an array
  – Store transition table in texture memory (cache optimized)
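Converting the transition table into an array can be pictured as storing the rows back to back and indexing with `state * ALPHABET + byte`. The sketch below assumes a 256-entry row per state and a hypothetical `NO_MATCH` sentinel; the actual layout used by the implementation is not given in the slides.

```python
ALPHABET = 256    # assumption: one table column per possible byte value
NO_MATCH = -1     # hypothetical sentinel meaning "no transition"


def build_flat_table(patterns):
    """Build a 2D transition table and flatten it row by row into one list."""
    rows = [[NO_MATCH] * ALPHABET]
    final = {}    # final[state] -> pattern ending in that state
    for pattern in patterns:
        state = 0
        for byte in pattern.encode("ascii"):
            if rows[state][byte] == NO_MATCH:
                rows.append([NO_MATCH] * ALPHABET)
                rows[state][byte] = len(rows) - 1
            state = rows[state][byte]
        final[state] = pattern
    flat = [entry for row in rows for entry in row]
    return flat, final


def next_state(flat, state, byte):
    """Look up a transition in the flattened table."""
    return flat[state * ALPHABET + byte]


if __name__ == "__main__":
    flat, final = build_flat_table(["she", "he", "his", "hers"])
    state = 0
    for byte in b"he":
        state = next_state(flat, state, byte)
    print(len(flat) // ALPHABET, final.get(state))
```

A flat array needs only one index computation per lookup, which is convenient when the table is handed to a kernel as a single buffer.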

Optimizations on Embedded GPU
Local memory usage
• Local memory is emulated in global memory
• Using local memory adds an extra overhead
• Exception
  – Adreno 330 has 8 KB local memory
  – May benefit only limited applications

Optimizations on Embedded GPU
Reduce data communication time
• High-end GPU: CPU memory and GPU memory are separate; data is copied between them (clEnqueueWriteBuffer)
• Embedded GPU: CPU and GPU share a unified memory
  – clEnqueueWriteBuffer: data is still copied within the unified memory
  – clEnqueueMapBuffer: no copying is required

Optimizations on Embedded GPU
Thread granularity
[Figure: warp execution timelines for 1 char per thread and 2 chars per thread, showing the total time and the implicit synchronization point]
• Reduce load imbalance between threads/warps
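One way to picture a coarser thread granularity is to give each work-item several starting characters instead of one. The sketch below assumes each work-item takes a contiguous block of positions; the slides do not say whether the actual kernel uses contiguous or strided assignment, and `work_item_ranges` is a hypothetical helper.

```python
import math


def work_item_ranges(text_length, chars_per_thread):
    """Return the [start, end) range of start positions each work-item covers."""
    count = math.ceil(text_length / chars_per_thread)
    return [
        (tid * chars_per_thread, min((tid + 1) * chars_per_thread, text_length))
        for tid in range(count)
    ]


if __name__ == "__main__":
    # The six-character example text "ushers" at two granularities.
    print(work_item_ranges(6, 1))
    print(work_item_ranges(6, 2))
```

Fewer, longer-running work-items mean fewer launches to schedule, at the cost of more serial work inside each one.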

Implementation Remarks
• Scalar variables used in the kernel
• Appropriate work-group sizes chosen
• Kernel included integer and memory operations
• Memory bound kernel

Experimental Platforms
• Samsung Arndale Board
  – SoC: Exynos 5250
  – CPU: 1.7 GHz dual-core ARM Cortex-A15
  – GPU: ARM Mali-T604, 4 cores at 533 MHz, 68 GFLOPS
• Sony Xperia Z Ultra
  – SoC: Snapdragon 800
  – CPU: up to 2.26 GHz quad-core ARM Cortex-A15
  – GPU: Adreno 330, 4 cores at 450/578 MHz, 115 to 148 GFLOPS

Experimental Setup
[Figure: measurement setup; the only surviving label is "Amplifier"]

Experimental Input
• Input parameters:
  – 1000 test patterns with a maximum size of 128 characters, and an input text of size 10 MB
  – Extracted from Snort V2.8
  – FSM included 27570 nodes
  – GPU consumed 44 MB of memory

Energy Measurement
• Sample snapshot from the oscilloscope
[Figure: current (amp) over time (sec, 0 to 14). Labeled phases include: Preparation (CPU), AC on CPU, Initialization (OpenCL), Data writing (OpenCL), Kernel execution (OpenCL), Data reading (OpenCL)]
Experiment was performed on the Arndale board

ARNDALE BOARD
[Figure: two plots of current (amp) over time (sec, 0 to 14), one before optimization and one after optimization. Labeled phases in the first plot include: Preparation (CPU), AC on CPU, Initialization (OpenCL), Data writing (OpenCL), Kernel execution (OpenCL), Data reading (OpenCL)]

SONY XPERIA Z ULTRA
[Figure: two plots of current (amp), one before optimization and one after optimization. Labeled phases in the first plot include: Preparation (CPU), AC on CPU, Initialization (OpenCL), Data writing (OpenCL), Kernel execution (OpenCL), Data reading (OpenCL)]

Experimental Results: SONY XPERIA Z ULTRA
Time units are in milliseconds. Columns DATA_TX to THR_GRAN are the OPTIMIZATIONS; WRDEV to SPEED UP are the RESULTS.

DATA_TX  USE_LOCAL  WG_SIZE  THR_GRAN  WRDEV  KERNEL_EXE  RDDEV  TX_OVH  GPU_TOT  SPEED UP  ENG IMPROV.
no map   yes        128      1         113    208         171    58%     492      2.1       2.7
map      yes        128      1         34     208         0      14%     242      4.3       7.2
map      yes        128      1         34     208         0      14%     242      4.3       7.2
map      no         128      1         34     150         0      18%     184      5.7       10.7
map      no         64       1         34     168         0      17%     202      5.2       10.0
map      no         128      1         34     150         0      18%     184      5.7       10.7
map      no         256      1         34     140         0      20%     174      6.0       11.3
map      no         256      4         34     99          0      26%     133      7.9       13.3
map      no         256      8         34     80          0      30%     114      9.2       15.1
map      no         256      12        34     202         0      14%     236      4.4       10.6
map      no         256      16        34     198         0      15%     232      4.5       10.7

Experimental Results: ARNDALE BOARD
Time units are in milliseconds. Columns DATA_TX to THR_GRAN are the OPTIMIZATIONS; WRDEV to SPEED UP are the RESULTS.

DATA_TX  USE_LOCAL  WG_SIZE  THR_GRAN  WRDEV  KERNEL_EXE  RDDEV  TX_OVH  GPU_TOT  SPEED UP  ENG IMPROV.
no map   yes        128      1         91     295         60     34%     446      4.7       3.3
map      yes        128      1         91     295         6      25%     392      5.4       3.6
map      yes        128      1         91     295         6      25%     392      5.4       3.6
map      no         128      1         91     150         6      39%     247      8.5       8.2
map      no         64       1         91     155         6      38%     252      8.3       8.2
map      no         128      1         91     150         6      39%     247      8.5       8.2
map      no         256      1         91     143         6      40%     240      8.8       8.3
map      no         256      4         91     114         6      46%     211      10.0      9.0
map      no         256      8         91     104         6      48%     201      10.4      9.3
map      no         256      12        91     101         6      49%     198      10.6      9.3
map      no         256      16        91     97          6      50%     194      10.8      9.5
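The reported numbers appear consistent with GPU_TOT being the sum of WRDEV, KERNEL_EXE, and RDDEV, and with TX_OVH being the share of GPU_TOT spent in the two transfer phases. The slides do not state these formulas, so the sketch below is an inference from the data; `transfer_overhead` is a hypothetical helper.

```python
def transfer_overhead(wrdev_ms, kernel_exe_ms, rddev_ms):
    """Return (total time in ms, transfer share of the total as a fraction)."""
    total = wrdev_ms + kernel_exe_ms + rddev_ms
    return total, (wrdev_ms + rddev_ms) / total


if __name__ == "__main__":
    # (WRDEV, KERNEL_EXE, RDDEV) rows taken from the results tables.
    rows = {
        "Sony, no map": (113, 208, 171),
        "Sony, map": (34, 208, 0),
        "Arndale, no map": (91, 295, 60),
        "Arndale, map": (91, 295, 6),
    }
    for label, row in rows.items():
        total, share = transfer_overhead(*row)
        print(f"{label}: GPU_TOT = {total} ms, TX_OVH = {share:.0%}")
```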

GPU vs. Multi-core
• PFAC implemented on multi-core with OpenMP

SONY Z ULTRA: PFAC OpenMP vs. most optimized on GPU
                  1 CORE  2 CORE  3 CORE  4 CORE
TIME (ms)         348     175     118     89
KERNEL SPEED UP   4.4     2.2     1.5     1.1      (GPU_KERNEL TIME = 80 ms)
OVERALL SPEED UP  3.0     1.5     1.0     0.8      (GPU_OVERALL TIME = 114 ms)
ENERGY IMPROV.    5       4       4       4

ARNDALE: PFAC OpenMP vs. most optimized on GPU
                  1 CORE  2 CORE
TIME (ms)         680     620
KERNEL SPEED UP   7.0     6.4      (GPU_KERNEL TIME = 97 ms)
OVERALL SPEED UP  3.5     3.1      (GPU_OVERALL TIME = 194 ms)
ENERGY IMPROV.    3.3     4.9
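The speed-up rows appear to be the OpenMP time divided by the corresponding GPU time (kernel only, or overall including transfers). That reading is an assumption based on the numbers, and the slide's rounding may differ slightly from the one used here. A short sketch for the Sony figures:

```python
def speedup(cpu_ms, gpu_ms):
    """Ratio of CPU time to GPU time; values below 1 mean the CPU was faster."""
    return cpu_ms / gpu_ms


if __name__ == "__main__":
    openmp_ms = {1: 348, 2: 175, 3: 118, 4: 89}   # Sony Z Ultra, by core count
    gpu_kernel_ms = 80
    gpu_overall_ms = 114
    for cores, cpu_ms in openmp_ms.items():
        kernel = speedup(cpu_ms, gpu_kernel_ms)
        overall = speedup(cpu_ms, gpu_overall_ms)
        print(f"{cores} core(s): kernel {kernel:.2f}x, overall {overall:.2f}x")
```

With four cores the overall ratio falls below 1, which matches the 0.8 in the table: once transfer time is included, the four-core OpenMP run is faster than the GPU, although the table still reports an energy improvement for the GPU.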

Takeaways
• Embedded GPUs: an alternative to save energy
• Nonconventional applications may benefit from GPU computing in embedded systems
• Micro-architecture specific optimizations are required to get efficient performance

Acknowledgments
• Sony Mobile Lund
• Adrian Horga

Questions?
