

SLIDE 1

Neural Acceleration for GPU Throughput Processors

Amir Yazdanbakhsh, Jongse Park, Hardik Sharma, Pejman Lotfi-Kamran*, Hadi Esmaeilzadeh
Alternative Computing Technologies (ACT) Lab, Georgia Institute of Technology

*The Institute for Research in Fundamental Sciences

NGPU: Neurally Accelerated GPU

[Figure: GPU die with an array of streaming multiprocessors (SMs)]

SLIDE 2

Approximate computing

Embracing imprecision

Relax the abstraction of "near-perfect" accuracy in data processing, storage, and communication; accept imprecision to improve performance, energy dissipation, and resource utilization.

SLIDE 3

Opportunity

Many GPU applications are amenable to approximation

Augmented Reality, Computer Vision, Robotics, Machine Learning, Sensor Processing, Multimedia

SLIDE 4

Opportunity

More than 55% of application runtime and energy is in neurally approximable regions

[Figure: fraction of runtime that is approximable vs. non-approximable for binarization, blackscholes, convolution, inversek2j, jmeint, laplacian, meanfilter, newton-raph, sobel, srad, and the geometric mean (gmean)]
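To put the 55% figure in perspective, Amdahl's law bounds the gains from accelerating only the approximable region. If a fraction $f$ of runtime is neurally approximable and that region is sped up by a factor $s$, the overall speedup is

$$ S = \frac{1}{(1 - f) + f/s}, \qquad \lim_{s \to \infty} S = \frac{1}{1 - f}. $$

With $f = 0.55$, even an infinitely fast accelerator yields at most about $2.2\times$ overall; applications whose approximable fraction is higher see proportionally larger gains, which is why the per-application results later in the talk range well above the average.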

SLIDE 5

Neural Transformation for GPUs

An approximable region of GPU code is replaced by a neural network trained offline to mimic the region's input-output behavior.

[Figure: code region before and after the neural transformation]
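As a concrete, hypothetical illustration of the transformation: the sketch below shows a small CUDA device function standing in for an approximable region, and the 2-input, 2-hidden, 1-output multilayer perceptron that replaces it. The pragma name, the function, and the weight values are ours for illustration; in the actual flow the network topology and weights come from offline training.

    // Hypothetical before/after view of the neural transformation.
    // The region below is marked approximable; offline training produces a
    // small MLP that mimics its input-output behavior, and the compiler
    // substitutes the network for the original code.

    #pragma approx_region  // illustrative annotation, not the paper's syntax
    __device__ float region(float a, float b) {
        return sqrtf(a * a + b * b);  // original precise computation
    }

    __device__ float sigmoid_act(float v) { return 1.0f / (1.0f + expf(-v)); }

    // Neural substitute: 2 inputs -> 2 hidden neurons -> 1 output.
    // Weights w[0..5] are placeholders for offline-trained values.
    __device__ float region_nn(float a, float b) {
        const float w[6] = {0.9f, -1.1f, 0.4f, 1.3f, 0.7f, -0.5f};
        float n0 = sigmoid_act(w[0] * a + w[2] * b);
        float n1 = sigmoid_act(w[1] * a + w[3] * b);
        return sigmoid_act(w[4] * n0 + w[5] * n1);
    }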

SLIDE 6

Challenges

GPUs are many-core processors: a large grid of simple, in-order cores that exploit data-level parallelism.

[Figure: GPU as a grid of many simple cores]


SLIDE 9

Challenges

Augmenting each SIMD lane with a CPU-style neural processing unit (NPU) imposes a 31.2% area overhead.


SLIDE 10

NGPU

Neurally-Accelerated GPU Architecture

[Figure: baseline GPU organization. Each streaming multiprocessor (SM) has a pipeline of Fetch (I-$), Decode, Issue with the SIMT stack and active mask, Operand Collection (src_reg1-src_reg3, dst_reg), the SIMD lanes and the LSU (D-$), and Writeback; the SMs connect through an interconnection network to memory partitions holding the L2 cache and off-chip DRAM]

SLIDE 11

Neural Network Operations


[Figure: a single neuron j with inputs x_{j,0}, …, x_{j,i}, …, x_{j,n}, weights w_{j,0}, …, w_{j,i}, …, w_{j,n}, and output y_j]

$$ y_j = \mathrm{sigmoid}\left( w_{j,0} \times x_{j,0} + \dots + w_{j,i} \times x_{j,i} + \dots + w_{j,n} \times x_{j,n} \right) $$
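In code, a neuron is just a multiply-accumulate loop followed by the sigmoid activation. A minimal CUDA sketch (the function name is ours):

    // One neuron: weighted sum of inputs x[0..n] with weights w[0..n],
    // followed by a sigmoid activation.
    __device__ float neuron(const float* w, const float* x, int n) {
        float acc = 0.0f;
        for (int i = 0; i <= n; ++i)
            acc += w[i] * x[i];                // multiply-accumulate
        return 1.0f / (1.0f + expf(-acc));     // sigmoid
    }

This multiply-accumulate-then-activate pattern is exactly what the hardware described next implements in each SIMD lane.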

SLIDE 12

NGPU

Neurally-Accelerated GPU Architecture

[Figure: the SIMD lane augmented for neural mode: an accumulator register (Acc Reg) and a sigmoid unit (Sig. Unit) are added to each lane, and a controller sequences data (steps 1-6) through per-lane input and output FIFOs and a weight FIFO]

SLIDE 13

NGPU

Neurally-Accelerated GPU Architecture

NGPU reuses the existing ALU in each SIMD lane


SLIDE 14

NGPU

Neurally-Accelerated GPU Architecture

Weight FIFO is shared among all the SIMD lanes

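The effect of the shared weight FIFO can be pictured in plain CUDA: one copy of the weights is staged per thread block and every lane reads the same weight stream while operating on its own private inputs. A sketch under that analogy (not the hardware itself; the kernel and its layout are ours):

    // Software analogy for the shared weight FIFO: weights are staged once
    // per block in shared memory and broadcast to all lanes; each lane
    // applies the same weight w[i] to its own private input vector.
    // Launch: neuron_layer<<<blocks, threads, (n + 1) * sizeof(float)>>>(...)
    __global__ void neuron_layer(const float* w, const float* x,
                                 float* y, int n) {
        extern __shared__ float ws[];              // one shared copy of weights
        for (int i = threadIdx.x; i <= n; i += blockDim.x)
            ws[i] = w[i];                          // cooperative staging
        __syncthreads();

        int lane = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        for (int i = 0; i <= n; ++i)
            acc += ws[i] * x[lane * (n + 1) + i];  // same weights, private inputs
        y[lane] = 1.0f / (1.0f + expf(-acc));      // sigmoid activation
    }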

SLIDE 15

NGPU

Neurally-Accelerated GPU Architecture


Overall, NGPU has ≤ 1% area overhead

SLIDE 16

NGPU Execution Model

[Figure: a 2-2-1 neural network with inputs in0 (%r0) and in1 (%r1), weights w0-w5, hidden neurons n0 and n1, and output neuron n2 producing out0 (%r2)]

    ld.global   %r0, [addr0];
    ld.global   %r1, [addr1];
    send.n_data %r0;
    send.n_data %r1;
    recv.n_data %r2;
    st.global   [addr2], %r2;

Neurally Accelerated GPU Application
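For reference, here is the same accelerated region written out in plain CUDA, with comments mapping each step to the proposed send.n_data/recv.n_data instructions. NGPU performs the middle section in hardware in neural mode; the kernel and the weight array are our illustrative stand-ins, with weights coming from offline training.

    // Plain-CUDA reference for what the neural mode computes between
    // send.n_data and recv.n_data for the slide's 2-2-1 network.
    __device__ float sig(float v) { return 1.0f / (1.0f + expf(-v)); }

    __global__ void accelerated_region(const float* addr0, const float* addr1,
                                       float* addr2,
                                       const float* w /* w0..w5, trained */) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float in0 = addr0[tid];                        // ld.global %r0, [addr0]
        float in1 = addr1[tid];                        // ld.global %r1, [addr1]
        // send.n_data %r0 / %r1: lanes would enter neural mode here
        float n0   = sig(w[0] * in0 + w[2] * in1);     // hidden neuron n0
        float n1   = sig(w[1] * in0 + w[3] * in1);     // hidden neuron n1
        float out0 = sig(w[4] * n0 + w[5] * n1);       // output neuron n2
        // recv.n_data %r2: lanes would exit neural mode here
        addr2[tid] = out0;                             // st.global [addr2], %r2
    }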

SLIDES 17-31

NGPU Execution Model

[Figure: animation frames stepping through the accelerated code; each register holds a vector of per-lane values, e.g. %r0 = (in0, in0, …, in0) and %r1 = (in1, in1, …, in1) across the SIMD lanes]

    ld.global   %r0, [addr0];
    ld.global   %r1, [addr1];
    send.n_data %r0;
    send.n_data %r1;
    recv.n_data %r2;
    st.global   [addr2], %r2;

Step by step:

1. During the two ld.global instructions, the SIMD lanes are in normal mode and perform precise computation.
2. The send.n_data instructions push the inputs into the input FIFOs, and the SIMD lanes enter neural mode.
3. The lanes start the calculation of the neural network: each lane computes the hidden neurons n0 = sigmoid(w0 × in0 + w2 × in1) and n1 = sigmoid(w1 × in0 + w3 × in1), then the output out0 = sigmoid(w4 × n0 + w5 × n1).
4. The neurally accelerated SIMD lanes autonomously calculate the neural outputs in lock-step, with no instruction fetch or decode in between.
5. recv.n_data returns the outputs (out0, out0, …, out0) into %r2, and the SIMD lanes exit neural mode.
6. Back in normal mode, st.global writes the results to memory.

SLIDE 32

Experimental Setup

Power Model

  • Technology Node 40 nm
  • GPUWattch
  • McPAT, CACTI, and Verilog models

GPU Simulator

  • GPGPU-Sim cycle-level simulator
  • Fermi-based GTX 480, shader core frequency 1.4 GHz
  • nvcc compiler with -O3

Benchmark domains: Machine Learning, Finance, Vision, 3D Gaming, Medical Imaging, Numerical Analysis, Image Processing

SLIDE 33

Speedup

NGPU Speedup

Most applications see speedup with NGPU

[Figure: NGPU speedup per benchmark on a 1.0-3.0× axis, with callouts of 9.8× and 14.3× for the two largest speedups; geometric mean 2.4×]

SLIDE 34

Speedup

NGPU Speedup

The speedup for bandwidth-sensitive applications is limited

[Figure: same speedup chart as the previous slide]

SLIDE 35

Speedup

NGPU Speedup with 2× Bandwidth

Bandwidth-sensitive applications see speedup with 2× bandwidth

[Figure: NGPU speedup per benchmark with doubled off-chip bandwidth, with callouts of 14.6× and 15.3× for the largest speedups; geometric mean 3.0×]

SLIDE 36

Energy Reduction

NGPU Energy Savings with Baseline Bandwidth

NGPU eliminates much of the von Neumann overhead (instruction fetch, decode, and operand handling), which results in energy reduction

[Figure: NGPU energy reduction per benchmark, with callouts of 14.8× and 18.9× for the largest savings; geometric mean 2.8×]

SLIDE 37

Energy Reduction

NGPU Energy Savings with Baseline Bandwidth

Even bandwidth-sensitive applications see energy savings

[Figure: same energy-reduction chart as the previous slide]

SLIDE 38

Quality Loss

Application Quality Loss

Quality loss is below 10% in all cases

[Figure: per-application quality loss on a 0-100% axis]

SLIDE 39

NGPU is a Fair Bargain

Benefits:
  • Quality ≥ 90.0%: 2.4× speedup, 2.8× energy reduction
  • Quality ≥ 97.5%: 1.9× speedup, 2.1× energy reduction

Overhead:
  • Area overhead ≤ 1.0%