SLIDE 1

Studying Energy Consumption of an OpenCL Application on Mobile GPU: A Case Study

Elena Barreras, Juan M. Jimenez, Arian Maghazeh, Unmesh Bordoloi

12-13 MAY, 2014

SLIDE 2

eGPU Compute Applications

  • Conventional domains: Augmented Reality, Media Codecs, Computational Photography, Gaming, Graphics
  • Potential domains: Radar Systems: Pattern Detection, ADAS, Security

SLIDE 3

Choice of Application

  • Aho-Corasick – a pattern matching algorithm
    – Is utilized in the security domain, among others
    – Relevant for embedded systems: intrusion detection in vehicular systems, mobile devices
    – Not studied for embedded GPUs
    – State-of-the-art parallel implementation available for high-end GPUs

SLIDE 4

Goal of Our Study

  • Study the energy consumption of OpenCL components
  • Optimize for embedded GPUs
    – Energy
    – Running times
  • Compare with multi-core implementations
    – Tradeoffs

SLIDE 5

Aho-Corasick (AC)

  • It locates patterns of strings in an input text

Patterns: {she, he, his, hers} Input text: ushers Output: {she, he, hers}

SLIDE 6

Aho-Corasick (AC)

  • How it works:
    – Combines all input patterns (the dictionary) and generates a finite state machine
    – Uses the finite state machine to find all the matches in the input text in a single traversal
  • Open-source implementation available

Patterns: {she, he, his, hers} Input text: ushers Output: {she, he, hers}

[Figure: state machine for the patterns above, with states numbered 1 to 9 and transitions labeled h, e, r, s, i, and ¬{h,s}]
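
The two steps above (build a state machine from the dictionary, then scan the text once) can be sketched in a few lines of Python. This is a minimal textbook-style sketch, not the open-source implementation the slide refers to; the function names `build_automaton` and `search` and the dictionary-based tables are assumptions made for illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables from the pattern dictionary."""
    goto = [{}]        # goto[state][char] -> next state
    output = [set()]   # output[state] -> patterns recognized at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pattern)

    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output


def search(text, patterns):
    """Return (start_index, pattern) for every match, in one pass over text."""
    goto, fail, output = build_automaton(patterns)
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]          # follow failure links on a mismatch
        state = goto[state].get(ch, 0)
        for pattern in sorted(output[state]):
            matches.append((i - len(pattern) + 1, pattern))
    return matches
```

Called as `search("ushers", ["she", "he", "his", "hers"])`, the sketch reports she at index 1 and he and hers at index 2, which agrees with the output set on the slide.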

SLIDE 7

Parallel Failureless AC (PFAC)

  • One thread for every input character
    – 10 M threads for a 10 MB input
  • Each thread identifies the pattern that begins on that character
  • No failure node

Patterns: {she, he, his, hers} Input text: ushers Output: {she, he, hers}
6 threads are launched
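
A sequential Python sketch can make the PFAC idea concrete: each "thread" starts at one character and walks a trie that has no failure links, stopping at the first missing transition. The loop over start positions stands in for the GPU thread launch; the names `build_trie`, `pfac_thread`, and `pfac` are hypothetical and the data layout is an assumption, not the layout used in the study.

```python
def build_trie(patterns):
    """Build a trie with no failure transitions."""
    goto = [{}]        # goto[state][char] -> next state
    accepts = [None]   # accepts[state] -> pattern ending at this state, if any
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                accepts.append(None)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        accepts[state] = pattern
    return goto, accepts


def pfac_thread(text, start, goto, accepts):
    """Work of one thread: report patterns that begin exactly at text[start]."""
    state, found = 0, []
    for ch in text[start:]:
        state = goto[state].get(ch)
        if state is None:      # no transition: the thread terminates
            break
        if accepts[state] is not None:
            found.append(accepts[state])
    return found


def pfac(text, patterns):
    """Run one simulated thread per input character."""
    goto, accepts = build_trie(patterns)
    results = {}
    for start in range(len(text)):   # stands in for len(text) GPU threads
        found = pfac_thread(text, start, goto, accepts)
        if found:
            results[start] = found
    return results
```

For the slide's example, `pfac("ushers", ["she", "he", "his", "hers"])` runs six simulated threads; only the ones starting at index 1 (she) and index 2 (he, hers) report matches, and the other four stop after reading a single character.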

SLIDE 8

PFAC GPU Implementation

  • Optimization on high-end GPUs
    – Load input text partially from global to local memory

Patterns: {she, he, his, hers} Input text: ushers Output: {she, he, hers}

SLIDE 9

PFAC GPU Implementation

  • Optimization on high-end GPUs
    – Load input text partially from global to local memory
    – Uses transition table
    – Load first row of the table into local memory

Patterns: {she, he, his, hers} Input text: ushers Output: {she, he, hers}

SLIDE 10

PFAC GPU Implementation

  • Optimization on high-end GPUs
    – Load input text partially from global to local memory
    – Uses transition table
    – Load first row of the table into local memory
    – Convert transition table into an array
    – Store transition table in texture memory (cache optimized)
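
Converting the transition table into an array can be illustrated with a short Python sketch: a two-dimensional table indexed by (state, byte) becomes one flat list indexed by `state * ALPHABET_SIZE + byte`. The 256-column alphabet, the `-1` sentinel, and all function names are assumptions chosen for the example; the slides do not specify the actual layout.

```python
ALPHABET_SIZE = 256   # assumption: one column per possible byte value
NO_TRANSITION = -1    # hypothetical sentinel for "no transition"


def build_table(patterns):
    """Build a 2-D transition table: one row per state, one column per byte."""
    table = [[NO_TRANSITION] * ALPHABET_SIZE]
    for pattern in patterns:
        state = 0
        for byte in pattern.encode("ascii"):
            if table[state][byte] == NO_TRANSITION:
                table.append([NO_TRANSITION] * ALPHABET_SIZE)
                table[state][byte] = len(table) - 1
            state = table[state][byte]
    return table


def flatten(table):
    """Lay the rows out back to back in a single array."""
    return [entry for row in table for entry in row]


def next_state(flat, state, byte):
    """Look up a transition in the flattened array."""
    return flat[state * ALPHABET_SIZE + byte]
```

A flat array of this kind is a single contiguous buffer, which is the form a device buffer or texture would need; whether the study used exactly this indexing is not stated on the slide.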

SLIDE 11

Optimizations on Embedded GPU

Local memory usage

  • Local memory is emulated in global memory
  • Using local memory adds an extra overhead
  • Exception
    – Adreno 330 has 8 KB of local memory
    – May benefit only a limited set of applications

SLIDE 12

Optimizations on Embedded GPU

Reduce data communication time

  • High-end GPU: separate CPU memory and GPU memory; clEnqueueWriteBuffer copies the data from CPU memory to GPU memory
  • Embedded GPU: CPU and GPU share a unified memory
    – clEnqueueWriteBuffer still copies the data inside the unified memory
    – clEnqueueMapBuffer: no copying is required

SLIDE 13

Optimizations on Embedded GPU

Thread granularity

  • Reduce load imbalance between threads/warps

[Figure: warp execution timelines and total time for 1 char per thread and for 2 chars per thread, with an implicit synch. point marked]
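
One way to see why coarser threads can reduce imbalance is a toy cost model, sketched here in Python. It assumes that threads in a warp run in lockstep, so a warp costs as many steps as its slowest thread, and that a thread with granularity k handles k consecutive start positions one after another. Both the model and the names `steps_from` and `lockstep_cost` are illustrative assumptions, not measurements or code from the study.

```python
def build_trie(patterns):
    """Failureless trie: goto[state][char] -> next state."""
    goto = [{}]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
    return goto


def steps_from(text, start, goto):
    """Count the characters one PFAC walk reads when it starts at text[start]."""
    state, steps = 0, 0
    for ch in text[start:]:
        steps += 1
        state = goto[state].get(ch)
        if state is None:
            break
    return steps


def lockstep_cost(text, patterns, chars_per_thread, warp_size):
    """Toy model: sum over warps of the step count of the slowest thread."""
    goto = build_trie(patterns)
    per_thread = []
    for first in range(0, len(text), chars_per_thread):
        starts = range(first, min(first + chars_per_thread, len(text)))
        per_thread.append(sum(steps_from(text, s, goto) for s in starts))
    return sum(
        max(per_thread[w:w + warp_size])
        for w in range(0, len(per_thread), warp_size)
    )
```

On the text ushers with the slide's four patterns and a hypothetical warp size of 2, the model gives a cost of 9 with one character per thread and 7 with two characters per thread. Real hardware adds effects the model ignores, such as memory latency and occupancy, which may explain why very coarse granularities do not always help.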

SLIDE 14

Implementation Remarks

  • Scalar variables used in the kernel
  • Appropriate work-group sizes chosen
  • Kernel included integer and memory operations
  • Memory-bound kernel

SLIDE 15

Experimental Platforms

  • Sony Xperia Z Ultra
    – SoC: Snapdragon 800
    – CPU: quad-core Krait 400, up to 2.26 GHz
    – GPU: Adreno 330, 4 cores at 450/578 MHz, 115 to 148 GFLOPS
  • Samsung Arndale Board
    – SoC: Exynos 5250
    – CPU: 1.7 GHz dual-core ARM Cortex-A15
    – GPU: ARM Mali-T604, 4 cores at 533 MHz, 68 GFLOPS

SLIDE 16

Experimental Setup

[Figure: measurement setup; the only label that survives is "Amplifier"]

SLIDE 17

Experimental Input

  • Input parameters:
    – 1000 test patterns with a maximum size of 128 characters, and an input text of size 10 MB
    – Extracted from Snort V2.8
    – FSM included 27570 nodes
    – GPU consumed 44 MB of memory

SLIDE 18

Energy Measurement

  • Sample snapshot from the oscilloscope
  • Experiment was performed on the Arndale board

[Figure: current (amp) over time (sec), time axis from 0 to 14 s. Marked phases: Initialization (OpenCL), Preparation (CPU), Data writing (OpenCL), Kernel execution (OpenCL), Data reading (OpenCL), Preparation (CPU), AC on CPU]
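
Turning a current trace like this into energy is a matter of integrating power over time, E = V · ∫ I dt. The sketch below applies the trapezoidal rule to sampled current values. The constant supply voltage and the sample values in the usage note are assumptions for illustration; the slides give neither the voltage nor the raw samples.

```python
def energy_joules(times_s, currents_a, supply_voltage_v):
    """Integrate V * I over time with the trapezoidal rule.

    times_s: sample timestamps in seconds, increasing.
    currents_a: measured current in amperes at each timestamp.
    supply_voltage_v: supply voltage, assumed constant over the trace.
    """
    if len(times_s) != len(currents_a):
        raise ValueError("times and currents must have the same length")
    energy = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        mean_current = 0.5 * (currents_a[i] + currents_a[i - 1])
        energy += supply_voltage_v * mean_current * dt
    return energy


def phase_energy(times_s, currents_a, supply_voltage_v, t_start, t_end):
    """Energy of one phase, using only samples inside [t_start, t_end]."""
    selected = [(t, c) for t, c in zip(times_s, currents_a) if t_start <= t <= t_end]
    return energy_joules(
        [t for t, _ in selected], [c for _, c in selected], supply_voltage_v
    )
```

With hypothetical samples, a constant 1 A drawn for 2 s at an assumed 5 V gives `energy_joules([0.0, 1.0, 2.0], [1.0, 1.0, 1.0], 5.0)`, that is 10 J. Applying `phase_energy` to the time window of a single phase, such as kernel execution, gives that phase's share of the total.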

SLIDE 19

ARNDALE BOARD

[Figure: two panels, before optimization and after optimization. Each shows current (amp) over time (sec), time axis from 0 to 14 s. Marked phases: Initialization (OpenCL), Preparation (CPU), Data writing (OpenCL), Kernel execution (OpenCL), Data reading (OpenCL), Preparation (CPU), AC on CPU]

SLIDE 20

SONY XPERIA Z ULTRA

[Figure: two panels, before optimization and after optimization, each showing current (amp). Marked phases: Initialization (OpenCL), Preparation (CPU), Data writing (OpenCL), Kernel execution (OpenCL), Data reading (OpenCL), Preparation (CPU), AC on CPU]

SLIDE 21

Experimental Results

SONY XPERIA Z ULTRA

OPTIMIZATIONS                           RESULTS
DATA_TX  USE_LOCAL  WG_SIZE  THR_GRAN   WRDEV  KERNEL_EXE  RDDEV  TX_OVH  GPU_TOT  SPEED UP  ENG IMPROV.
no map   yes        128      1          113    208         171    58%     492      2.1       2.7
map      yes        128      1          34     208         -      14%     242      4.3       7.2

time units are in milliseconds

SLIDE 22

Experimental Results

SONY XPERIA Z ULTRA

OPTIMIZATIONS                           RESULTS
DATA_TX  USE_LOCAL  WG_SIZE  THR_GRAN   WRDEV  KERNEL_EXE  RDDEV  TX_OVH  GPU_TOT  SPEED UP  ENG IMPROV.
no map   yes        128      1          113    208         171    58%     492      2.1       2.7
map      yes        128      1          34     208         -      14%     242      4.3       7.2

map      yes        128      1          34     208         -      14%     242      4.3       7.2
map      no         128      1          34     150         -      18%     184      5.7       10.7

time units are in milliseconds

SLIDE 23

Experimental Results

SONY XPERIA Z ULTRA

OPTIMIZATIONS                           RESULTS
DATA_TX  USE_LOCAL  WG_SIZE  THR_GRAN   WRDEV  KERNEL_EXE  RDDEV  TX_OVH  GPU_TOT  SPEED UP  ENG IMPROV.
no map   yes        128      1          113    208         171    58%     492      2.1       2.7
map      yes        128      1          34     208         -      14%     242      4.3       7.2

map      yes        128      1          34     208         -      14%     242      4.3       7.2
map      no         128      1          34     150         -      18%     184      5.7       10.7

map      no         64       1          34     168         -      17%     202      5.2       10.0
map      no         128      1          34     150         -      18%     184      5.7       10.7
map      no         256      1          34     140         -      20%     174      6.0       11.3

time units are in milliseconds

SLIDE 24

Experimental Results

SONY XPERIA Z ULTRA

OPTIMIZATIONS                           RESULTS
DATA_TX  USE_LOCAL  WG_SIZE  THR_GRAN   WRDEV  KERNEL_EXE  RDDEV  TX_OVH  GPU_TOT  SPEED UP  ENG IMPROV.
no map   yes        128      1          113    208         171    58%     492      2.1       2.7
map      yes        128      1          34     208         -      14%     242      4.3       7.2

map      yes        128      1          34     208         -      14%     242      4.3       7.2
map      no         128      1          34     150         -      18%     184      5.7       10.7

map      no         64       1          34     168         -      17%     202      5.2       10.0
map      no         128      1          34     150         -      18%     184      5.7       10.7
map      no         256      1          34     140         -      20%     174      6.0       11.3

map      no         256      4          34     99          -      26%     133      7.9       13.3
map      no         256      8          34     80          -      30%     114      9.2       15.1
map      no         256      12         34     202         -      14%     236      4.4       10.6
map      no         256      16         34     198         -      15%     232      4.5       10.7

time units are in milliseconds

SLIDE 25

Experimental Results

SONY XPERIA Z ULTRA

OPTIMIZATIONS                           RESULTS
DATA_TX  USE_LOCAL  WG_SIZE  THR_GRAN   WRDEV  KERNEL_EXE  RDDEV  TX_OVH  GPU_TOT  SPEED UP  ENG IMPROV.
no map   yes        128      1          113    208         171    58%     492      2.1       2.7
map      yes        128      1          34     208         -      14%     242      4.3       7.2

map      yes        128      1          34     208         -      14%     242      4.3       7.2
map      no         128      1          34     150         -      18%     184      5.7       10.7

map      no         64       1          34     168         -      17%     202      5.2       10.0
map      no         128      1          34     150         -      18%     184      5.7       10.7
map      no         256      1          34     140         -      20%     174      6.0       11.3

map      no         256      4          34     99          -      26%     133      7.9       13.3
map      no         256      8          34     80          -      30%     114      9.2       15.1
map      no         256      12         34     202         -      14%     236      4.4       10.6
map      no         256      16         34     198         -      15%     232      4.5       10.7

ARNDALE BOARD

OPTIMIZATIONS                           RESULTS
DATA_TX  USE_LOCAL  WG_SIZE  THR_GRAN   WRDEV  KERNEL_EXE  RDDEV  TX_OVH  GPU_TOT  SPEED UP  ENG IMPROV.
no map   yes        128      1          91     295         60     34%     446      4.7       3.3
map      yes        128      1          91     295         6      25%     392      5.4       3.6

map      yes        128      1          91     295         6      25%     392      5.4       3.6
map      no         128      1          91     150         6      39%     247      8.5       8.2

map      no         64       1          91     155         6      38%     252      8.3       8.2
map      no         128      1          91     150         6      39%     247      8.5       8.2
map      no         256      1          91     143         6      40%     240      8.8       8.3

map      no         256      4          91     114         6      46%     211      10.0      9.0
map      no         256      8          91     104         6      48%     201      10.4      9.3
map      no         256      12         91     101         6      49%     198      10.6      9.3
map      no         256      16         91     97          6      50%     194      10.8      9.5
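
The slides do not define GPU_TOT or TX_OVH, but the numbers are consistent with two simple relations: GPU_TOT = WRDEV + KERNEL_EXE + RDDEV, and TX_OVH = (WRDEV + RDDEV) / GPU_TOT. The Python sketch below checks these inferred relations against a few rows. Treating the missing RDDEV entries on the Xperia as 0 ms is an assumption, and the function names are hypothetical.

```python
def gpu_total_ms(wrdev, kernel_exe, rddev):
    """Inferred relation: total GPU time is write + kernel + read-back."""
    return wrdev + kernel_exe + rddev


def tx_overhead_percent(wrdev, kernel_exe, rddev):
    """Inferred relation: share of the total spent on data transfers."""
    total = gpu_total_ms(wrdev, kernel_exe, rddev)
    return 100.0 * (wrdev + rddev) / total


# (platform, WRDEV, KERNEL_EXE, RDDEV, reported TX_OVH %, reported GPU_TOT)
# RDDEV = 0 for the mapped Xperia row is an assumption (the cell is empty).
ROWS = [
    ("Xperia, no map, local, WG 128, gran 1", 113, 208, 171, 58, 492),
    ("Xperia, map, no local, WG 256, gran 8", 34, 80, 0, 30, 114),
    ("Arndale, no map, local, WG 128, gran 1", 91, 295, 60, 34, 446),
    ("Arndale, map, no local, WG 256, gran 16", 91, 97, 6, 50, 194),
]

if __name__ == "__main__":
    for name, wr, kern, rd, tx_reported, total_reported in ROWS:
        total = gpu_total_ms(wr, kern, rd)
        tx = round(tx_overhead_percent(wr, kern, rd))
        print(f"{name}: total {total} ms (table {total_reported}), "
              f"TX_OVH {tx}% (table {tx_reported}%)")
```

Read this way, the tables show that on the Arndale board the write time stays at 91 ms in every configuration, so as the kernel gets faster the transfer share grows from 34% to 50% of the total.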

SLIDE 26

GPU vs. Multi-core

SONY Z ULTRA

                  PFAC OPENMP                       MOST OPTIMIZED on GPU
                  1 CORE  2 CORE  3 CORE  4 CORE
TIME (ms)         348     175     118     89
KERNEL SPEED UP   4.4     2.2     1.5     1.1       GPU_KERNEL TIME = 80 (ms)
OVERALL SPEED UP  3.0     1.5     1.0     0.8       GPU_OVERALL TIME = 114 (ms)
ENERGY IMPROV.    5       4       4       4

ARNDALE

                  PFAC OPENMP       MOST OPTIMIZED on GPU
                  1 CORE  2 CORE
TIME (ms)         680     620
KERNEL SPEED UP   7.0     6.4       GPU_KERNEL TIME = 97 (ms)
OVERALL SPEED UP  3.5     3.1       GPU_OVERALL TIME = 194 (ms)
ENERGY IMPROV.    3.3     4.9

  • PFAC implemented on multi-core with OpenMP
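
The speed-up rows appear to be the OpenMP time divided by the GPU time, using the GPU kernel time for KERNEL SPEED UP and the GPU overall time for OVERALL SPEED UP. The slide does not state this, so the Python sketch below treats it as an inferred relation and recomputes the Sony Z Ultra rows from the reported times; the function name is hypothetical.

```python
def speed_up(cpu_time_ms, gpu_time_ms):
    """Inferred relation: how many times faster the GPU run is."""
    return cpu_time_ms / gpu_time_ms


# Times reported on the slide for the Sony Z Ultra.
OPENMP_TIMES_MS = {1: 348, 2: 175, 3: 118, 4: 89}   # cores -> PFAC OpenMP time
GPU_KERNEL_MS = 80
GPU_OVERALL_MS = 114

if __name__ == "__main__":
    for cores, cpu_ms in OPENMP_TIMES_MS.items():
        kernel = speed_up(cpu_ms, GPU_KERNEL_MS)
        overall = speed_up(cpu_ms, GPU_OVERALL_MS)
        print(f"{cores} core(s): kernel {kernel:.2f}x, overall {overall:.2f}x")
```

The recomputed values agree with the slide to within rounding. The four-core overall figure comes out below 1, which matches the 0.8 on the slide: once transfer time is included, four CPU cores finish slightly sooner than the GPU, even though the energy row still favors the GPU.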

SLIDE 27

Takeaways

  • Embedded GPUs – an alternative to save energy
  • Nonconventional applications may benefit from GPU computing in embedded systems
  • Micro-architecture-specific optimizations are required to get efficient performance

SLIDE 28

Acknowledgments

  • Sony Mobile Lund
  • Adrian Horga

SLIDE 29

Questions?