SLIDE 1

GPU research in the ES-group

Henk Corporaal (professor) Gert-Jan van den Braak (postdoc) Roel Jordans (postdoc) Erkan Diken (PhD) Rik Jongerius (PhD) Ang Li (PhD) Maurice Peemen (PhD) Luc Waeijen (PhD) Mark Wijtvliet (PhD)

SLIDE 2

PARsE research – http://parse.ele.tue.nl/

/ Department of Electrical Engineering

3 December 2015

SLIDE 3

PARsE Parallel Architecture Research Eindhoven

  • Using advanced heterogeneous platforms
    − Multi-core CPUs
    − GPUs
    − DSPs
    − FPGAs
  • Efficient code generation
    − Code transformation & generation
    − Compilers
  • Even more efficient: new architectures
    − SIMD, CGRA, R-GPU
    − Accelerators, e.g. for neural networks (CNNs)


SLIDE 4

GPU research – overview (selection)

  • Application mapping
    − Histogram, CNN
  • Understanding GPUs
    − Modeling of GPU L1 cache
    − Cache bypassing
  • Architecture modification
    − Hash functions in scratchpad memory
  • Code generation
    − Bones source-to-source tools


SLIDE 5

Application mapping

  • Histogram
  • Convolutional Neural Networks (CNNs)


SLIDE 6

Application mapping: histogram

  • Load pixel
  • Update votes


[1] High Performance Predictable Histogramming on GPUs: Exploring and Evaluating Algorithm Trade-offs

SLIDE 7

Histogram – replication

  • Load pixel
  • Update votes


[2] GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU

SLIDE 8

Scratchpad memory layout

  • Scratchpad memory
  • Divided in 32 banks
  • Each bank has 32 lock-bits, 1024 in total

[Figure: SM layout — warp schedulers, dispatch units, register file, cores, LD/ST units, SFUs, and the scratchpad memory / L1 cache; the scratchpad is divided into banks 0–31, each with its own lock-bits.]

SLIDE 9

Application mapping: CNN

  • Convolutional Neural Network (CNN)
  • GTX 460: 35fps
  • Tegra X1: ~20fps

[Figure: CNN pipeline — input 720x1280 → 6x6 conv. with 2x2 subsample → Layer 1: 6x358x638 → 6x6 conv. with 2x2 subsample → Layer 2: 16x177x317 → 5x5 conv. → Layer 3: 80x173x313 → 1x1 conv. → Layer 4: 8x173x313 → object category + position at (x,y).]

[4] Speed Sign Detection and Recognition by Convolutional Neural Networks

SLIDE 10

Understanding GPUs

  • Modeling of GPU L1 cache
  • Cache bypassing
  • Transit model


SLIDE 11

Understanding GPUs: L1 cache modeling

  • GPU cache model, accounting for:
    − Execution model (threads, thread blocks)
    − Memory latencies
    − MSHRs (pending memory requests)
    − Cache associativity


[5] A Detailed GPU Cache Model Based on Reuse Distance Theory
SLIDE 12

L1 cache model – results


Mean absolute error of 6.4%

SLIDE 13

Understanding GPUs: Cache bypassing


[6] Adaptive and Transparent Cache Bypassing for GPUs

SLIDE 14

Cache bypassing – results


[6] Adaptive and Transparent Cache Bypassing for GPUs

SLIDE 15

Understanding GPUs: Transit model

  • Transit model: computation and memory sub-systems


[7] Transit: A Visual Analytical Model for Multithreaded Machines

SLIDE 16

Architecture modifications

  • Scratchpad memory hash functions
  • R-GPU


SLIDE 17

GPU modifications: bank & lock conflicts

[Figure: 16 cores accessing 32 scratchpad banks. With addr = 32 * id the bank bits of every address are identical (00000), so all accesses map to bank 0.]

SLIDE 18

Resolving bank conflicts: hash functions

[Figure: the same access pattern (addr = 32 * id) routed through a hash function that remaps the bank bits, spreading the accesses over different banks.]

[8] Simulation and Architecture Improvements of Atomic Operations on GPU Scratchpad Memory

SLIDE 19

Resolving bank conflicts: hash functions

[Figure: a configurable hash function, here applied to the access pattern addr = 33 * id.]

[9] Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs

SLIDE 20

Architecture modifications: R-GPU

[Figure: R-GPU — the SM's cores and LD/ST units connected into a spatial network through communication registers (CR), alongside the standard SM datapath with shared memory / L1 cache.]

SLIDE 21

How to generate efficient code for all these devices?

Code generation: ASET & Bones

[Figure: sequential C code mapped to many targets — CPU-OpenMP, CPU-OpenCL-AMD, CPU-OpenCL-Intel, GPU-CUDA, GPU-OpenCL-AMD, XeonPhi-OpenCL, Multi-GPU (CUDA / OpenCL), FPGA.]

[10] Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification

SLIDE 22

Code generation: ASET & Bones

[Figure: toolflow — sequential C code → ‘ASET’ (Algorithmic Species Extraction Tool, built on PET / llvm) → species-annotated C code → ‘Bones’ (skeleton-based compiler) → CPU-OpenMP, CPU-OpenCL-AMD, CPU-OpenCL-Intel, GPU-CUDA, GPU-OpenCL-AMD, XeonPhi-OpenCL, Multi-GPU (CUDA / OpenCL), FPGA.]

[10] Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification

SLIDE 23

Example C to CUDA transformation

Example 1: Sum

int sum = 0;
for (int i = 0; i < N; i++) {
  sum = sum + in[i];
}


template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sm, unsigned int tid) {
  if (blockSize >= 64) sm[tid] += sm[tid + 32];
  if (blockSize >= 32) sm[tid] += sm[tid + 16];
  if (blockSize >= 16) sm[tid] += sm[tid + 8];
  if (blockSize >= 8)  sm[tid] += sm[tid + 4];
  if (blockSize >= 4)  sm[tid] += sm[tid + 2];
  if (blockSize >= 2)  sm[tid] += sm[tid + 1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
  extern __shared__ int sm[];
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x * (blockSize * 2) + tid;
  unsigned int gridSize = blockSize * 2 * gridDim.x;
  sm[tid] = 0;
  while (i < n) {
    sm[tid] += g_idata[i];
    sm[tid] += g_idata[i + blockSize];
    i += gridSize;
  }
  __syncthreads();
  if (blockSize >= 512) { if (tid < 256) { sm[tid] += sm[tid + 256]; } __syncthreads(); }
  if (blockSize >= 256) { if (tid < 128) { sm[tid] += sm[tid + 128]; } __syncthreads(); }
  if (blockSize >= 128) { if (tid <  64) { sm[tid] += sm[tid +  64]; } __syncthreads(); }
  if (tid < 32) { warpReduce<blockSize>(sm, tid); }
  if (tid == 0) { g_odata[blockIdx.x] = sm[0]; }
}


SLIDE 24

Second example: maximum

Example 1: Sum

int sum = 0;
for (int i = 0; i < N; i++) {
  sum = sum + in[i];
}

Example 2: Max

int max = 0;
for (int i = 0; i < N; i++) {
  max = (max > in[i]) ? max : in[i];
}


template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sm, unsigned int tid) {
  if (blockSize >= 64) sm[tid] = (sm[tid] > sm[tid + 32]) ? sm[tid] : sm[tid + 32];
  if (blockSize >= 32) sm[tid] = (sm[tid] > sm[tid + 16]) ? sm[tid] : sm[tid + 16];
  if (blockSize >= 16) sm[tid] = (sm[tid] > sm[tid +  8]) ? sm[tid] : sm[tid +  8];
  if (blockSize >= 8)  sm[tid] = (sm[tid] > sm[tid +  4]) ? sm[tid] : sm[tid +  4];
  if (blockSize >= 4)  sm[tid] = (sm[tid] > sm[tid +  2]) ? sm[tid] : sm[tid +  2];
  if (blockSize >= 2)  sm[tid] = (sm[tid] > sm[tid +  1]) ? sm[tid] : sm[tid +  1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
  extern __shared__ int sm[];
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x * (blockSize * 2) + tid;
  unsigned int gridSize = blockSize * 2 * gridDim.x;
  sm[tid] = 0;
  while (i < n) {
    sm[tid] = (sm[tid] > g_idata[i]) ? sm[tid] : g_idata[i];
    sm[tid] = (sm[tid] > g_idata[i + blockSize]) ? sm[tid] : g_idata[i + blockSize];
    i += gridSize;
  }
  __syncthreads();
  if (blockSize >= 512) { if (tid < 256) { sm[tid] = (sm[tid] > sm[tid + 256]) ? sm[tid] : sm[tid + 256]; } __syncthreads(); }
  if (blockSize >= 256) { if (tid < 128) { sm[tid] = (sm[tid] > sm[tid + 128]) ? sm[tid] : sm[tid + 128]; } __syncthreads(); }
  if (blockSize >= 128) { if (tid <  64) { sm[tid] = (sm[tid] > sm[tid +  64]) ? sm[tid] : sm[tid +  64]; } __syncthreads(); }
  if (tid < 32) { warpReduce<blockSize>(sm, tid); }
  if (tid == 0) { g_odata[blockIdx.x] = sm[0]; }
}


SLIDE 25

Algorithmic species

  • Matrix-vector multiplication:

for (i = 0; i < 64; i++) {
  r[i] = 0;
  for (j = 0; j < 128; j++) {
    r[i] += M[i][j] * v[j];
  }
}

  • Stencil computation:

for (i = 1; i < 128 - 1; i++) {
  m[i] = 0.33 * (a[i-1] + a[i] + a[i+1]);
}


  Matrix-vector: 0:63,0:127|chunk(0:0,0:127) ^ 0:127|full → 0:63|element
  Stencil: 1:126|neighbourhood(-1:1) → 1:126|element


[11] Algorithmic Species: An Algorithm Classification of Affine Loop Nests for Parallel Programming
SLIDE 26

What do we gain in performance?


[12] Bones: An Automatic Skeleton-Based C-to-CUDA Compiler for GPUs

SLIDE 27

Education – mapping assignments

  • Master course:
    − contrast enhancement
    − Viola-Jones face detection
    − SIFT object recognition
    − convolutional neural network (CNN)
    − ‘bitcoin’ mining
  • Post-master (PDEng) course:
    − GPU & cluster computing


SLIDE 28

Student projects

  • Accelerating AURORA on Multi-Core and Many-Core Processor Architectures – VITO, Belgium
  • Advanced ultrasound beam forming using GPGPU technology – Esaote, Maastricht
  • Domain Transform Acceleration for the GPU-Based Real-Time Planar Near-Field Acoustic Holography
  • Analysis and Modeling of the Timing Behavior of GPU Architectures – TU/e


SLIDE 29

Summary

  • Research topics:
    − Application mapping
    − Understanding GPUs
    − Architecture modifications
    − Code generation
  • MSc students, PDEngs & PhDs
  • More on the website:
    − http://parse.ele.tue.nl/
