SLIDE 1

GPU research in the ES-group

Henk Corporaal (professor) Gert-Jan van den Braak (postdoc) Roel Jordans (postdoc) Erkan Diken (PhD) Rik Jongerius (PhD) Ang Li (PhD) Maurice Peemen (PhD) Luc Waeijen (PhD) Mark Wijtvliet (PhD)

SLIDE 2

PARsE research – http://parse.ele.tue.nl/

/ Department of Electrical Engineering

3 December 2015

SLIDE 3

PARsE Parallel Architecture Research Eindhoven

  • Using advanced heterogeneous platforms
    − Multi-core CPUs
    − GPUs
    − DSPs
    − FPGAs
  • Efficient code generation
    − Code transformation & generation
    − Compilers
  • Even more efficient: new architectures
    − SIMD, CGRA, R-GPU
    − Accelerators, e.g. for neural networks (CNNs)


SLIDE 4

GPU research – overview (selection)

  • Application mapping
    − Histogram, CNN
  • Understanding GPUs
    − Modeling of GPU L1 cache
    − Cache bypassing
  • Architecture modification
    − Hash functions in scratchpad memory
  • Code generation
    − Bones source-to-source tools


SLIDE 5

Application mapping

  • Histogram
  • Convolutional Neural Networks (CNNs)


SLIDE 6

Application mapping: histogram

  • Load pixel
  • Update votes


[1] High Performance Predictable Histogramming on GPUs: Exploring and Evaluating Algorithm Trade-offs

SLIDE 7

Histogram – replication

  • Load pixel
  • Update votes


[2] GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU

SLIDE 8

Scratchpad memory layout

  • Scratchpad memory
  • Divided in 32 banks
  • Each bank has 32 lock-bits, 1024 in total

[Figure: SM layout — warp schedulers, dispatch units, register file, cores, LD/ST units, SFUs, and the scratchpad memory / L1 cache; the scratchpad is divided into banks 0–31, each with its own lock-bits.]

SLIDE 9

Application mapping: CNN

  • Convolutional Neural Network (CNN)
  • GTX 460: 35fps
  • Tegra X1: ~20fps

[Figure: CNN pipeline — input 720x1280 → 6x6 conv. with 2x2 subsample → Layer 1: 6x358x638 → 6x6 conv. with 2x2 subsample → Layer 2: 16x177x317 → 5x5 conv. → Layer 3: 80x173x313 → 1x1 conv. → Layer 4: 8x173x313 → object category + position at (x,y).]

[4] Speed Sign Detection and Recognition by Convolutional Neural Networks

SLIDE 10

Understanding GPUs

  • Modeling of GPU L1 cache
  • Cache bypassing
  • Transit model


SLIDE 11

Understanding GPUs: L1 cache modeling

  • GPU cache model, accounting for:
    − Execution model (threads, thread blocks)
    − Memory latencies
    − MSHRs (pending memory requests)
    − Cache associativity


[5] A Detailed GPU Cache Model Based on Reuse Distance Theory
SLIDE 12

L1 cache model – results


Mean absolute error of 6.4%

SLIDE 13

Understanding GPUs: Cache bypassing


[6] Adaptive and Transparent Cache Bypassing for GPUs

SLIDE 14

Cache bypassing – results


[6] Adaptive and Transparent Cache Bypassing for GPUs

SLIDE 15

Understanding GPUs: Transit model

  • Transit model: computation and memory sub-systems


[7] Transit: A Visual Analytical Model for Multithreaded Machines

SLIDE 16

Architecture modifications

  • Scratchpad memory hash functions
  • R-GPU


SLIDE 17

GPU modifications: bank & lock conflicts

[Figure: 16 cores accessing 32 scratchpad banks. With addr = 32 * id the bank bits of every address are identical (00000), so all accesses map to bank 0.]

SLIDE 18

Resolving bank conflicts: hash functions

[Figure: the same access pattern (addr = 32 * id) routed through a hash function that remaps the bank bits, spreading the accesses over different banks.]

[8] Simulation and Architecture Improvements of Atomic Operations on GPU Scratchpad Memory

SLIDE 19

Resolving bank conflicts: hash functions

[Figure: a configurable hash function, here applied to the access pattern addr = 33 * id.]

[9] Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs

SLIDE 20

Architecture modifications: R-GPU

[Figure: R-GPU — the SM's cores and LD/ST units connected into a spatial network through communication registers (CR), alongside the standard SM datapath with shared memory / L1 cache.]

SLIDE 21

How to generate efficient code for all these devices?

Code generation: ASET & Bones

[Figure: sequential C code mapped to many targets — CPU-OpenMP, CPU-OpenCL-AMD, CPU-OpenCL-Intel, GPU-CUDA, GPU-OpenCL-AMD, XeonPhi-OpenCL, Multi-GPU (CUDA / OpenCL), FPGA.]

[10] Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification

SLIDE 22

Code generation: ASET & Bones

[Figure: toolflow — sequential C code → ‘ASET’ (Algorithmic Species Extraction Tool, built on PET / llvm) → species-annotated C code → ‘Bones’ (skeleton-based compiler) → CPU-OpenMP, CPU-OpenCL-AMD, CPU-OpenCL-Intel, GPU-CUDA, GPU-OpenCL-AMD, XeonPhi-OpenCL, Multi-GPU (CUDA / OpenCL), FPGA.]

[10] Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification

SLIDE 23

Example C to CUDA transformation

Example 1: Sum

int sum = 0;
for (int i = 0; i < N; i++) {
  sum = sum + in[i];
}


template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sm, unsigned int tid) {
  if (blockSize >= 64) sm[tid] += sm[tid + 32];
  if (blockSize >= 32) sm[tid] += sm[tid + 16];
  if (blockSize >= 16) sm[tid] += sm[tid + 8];
  if (blockSize >= 8)  sm[tid] += sm[tid + 4];
  if (blockSize >= 4)  sm[tid] += sm[tid + 2];
  if (blockSize >= 2)  sm[tid] += sm[tid + 1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
  extern __shared__ int sm[];
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x * (blockSize * 2) + tid;
  unsigned int gridSize = blockSize * 2 * gridDim.x;
  sm[tid] = 0;
  while (i < n) {
    sm[tid] += g_idata[i];
    sm[tid] += g_idata[i + blockSize];
    i += gridSize;
  }
  __syncthreads();
  if (blockSize >= 512) { if (tid < 256) { sm[tid] += sm[tid + 256]; } __syncthreads(); }
  if (blockSize >= 256) { if (tid < 128) { sm[tid] += sm[tid + 128]; } __syncthreads(); }
  if (blockSize >= 128) { if (tid <  64) { sm[tid] += sm[tid +  64]; } __syncthreads(); }
  if (tid < 32) { warpReduce<blockSize>(sm, tid); }
  if (tid == 0) { g_odata[blockIdx.x] = sm[0]; }
}


SLIDE 24

Second example: maximum

Example 1: Sum

int sum = 0;
for (int i = 0; i < N; i++) {
  sum = sum + in[i];
}

Example 2: Max

int max = 0;
for (int i = 0; i < N; i++) {
  max = (max > in[i]) ? max : in[i];
}


template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sm, unsigned int tid) {
  if (blockSize >= 64) sm[tid] = (sm[tid] > sm[tid + 32]) ? sm[tid] : sm[tid + 32];
  if (blockSize >= 32) sm[tid] = (sm[tid] > sm[tid + 16]) ? sm[tid] : sm[tid + 16];
  if (blockSize >= 16) sm[tid] = (sm[tid] > sm[tid +  8]) ? sm[tid] : sm[tid +  8];
  if (blockSize >= 8)  sm[tid] = (sm[tid] > sm[tid +  4]) ? sm[tid] : sm[tid +  4];
  if (blockSize >= 4)  sm[tid] = (sm[tid] > sm[tid +  2]) ? sm[tid] : sm[tid +  2];
  if (blockSize >= 2)  sm[tid] = (sm[tid] > sm[tid +  1]) ? sm[tid] : sm[tid +  1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
  extern __shared__ int sm[];
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x * (blockSize * 2) + tid;
  unsigned int gridSize = blockSize * 2 * gridDim.x;
  sm[tid] = 0;
  while (i < n) {
    sm[tid] = (sm[tid] > g_idata[i]) ? sm[tid] : g_idata[i];
    sm[tid] = (sm[tid] > g_idata[i + blockSize]) ? sm[tid] : g_idata[i + blockSize];
    i += gridSize;
  }
  __syncthreads();
  if (blockSize >= 512) { if (tid < 256) { sm[tid] = (sm[tid] > sm[tid + 256]) ? sm[tid] : sm[tid + 256]; } __syncthreads(); }
  if (blockSize >= 256) { if (tid < 128) { sm[tid] = (sm[tid] > sm[tid + 128]) ? sm[tid] : sm[tid + 128]; } __syncthreads(); }
  if (blockSize >= 128) { if (tid <  64) { sm[tid] = (sm[tid] > sm[tid +  64]) ? sm[tid] : sm[tid +  64]; } __syncthreads(); }
  if (tid < 32) { warpReduce<blockSize>(sm, tid); }
  if (tid == 0) { g_odata[blockIdx.x] = sm[0]; }
}


SLIDE 25

Algorithmic species

  • Matrix-vector multiplication:

for (i = 0; i < 64; i++) {
  r[i] = 0;
  for (j = 0; j < 128; j++) {
    r[i] += M[i][j] * v[j];
  }
}

  • Stencil computation:

for (i = 1; i < 128 - 1; i++) {
  m[i] = 0.33 * (a[i-1] + a[i] + a[i+1]);
}


  Matrix-vector: 0:63,0:127|chunk(0:0,0:127) ^ 0:127|full → 0:63|element
  Stencil: 1:126|neighbourhood(-1:1) → 1:126|element


[11] Algorithmic Species: An Algorithm Classification of Affine Loop Nests for Parallel Programming
SLIDE 26

What do we gain in performance?


[12] Bones: An Automatic Skeleton-Based C-to-CUDA Compiler for GPUs

SLIDE 27

Education – mapping assignments

  • Master course:
    − contrast enhancement
    − Viola-Jones face detection
    − SIFT object recognition
    − convolutional neural network (CNN)
    − ‘bitcoin’ mining
  • Post-master (PDEng) course:
    − GPU & cluster computing


SLIDE 28

Student projects

  • Accelerating AURORA on Multi-Core and Many-Core Processor Architectures – VITO, Belgium
  • Advanced ultrasound beam forming using GPGPU technology – Esaote, Maastricht
  • Domain Transform Acceleration for the GPU-Based Real-Time Planar Near-Field Acoustic Holography
  • Analysis and Modeling of the Timing Behavior of GPU Architectures – TU/e


SLIDE 29

Summary

  • Research topics:
    − Application mapping
    − Understanding GPUs
    − Architecture modifications
    − Code generation
  • MSc students, PDEngs & PhDs
  • More on the website:
    − http://parse.ele.tue.nl/
