Neural Acceleration for GPU Throughput Processors


  1. Neural Acceleration for GPU Throughput Processors. Hardik Sharma, Jongse Park, Amir Yazdanbakhsh, Pejman Lotfi-Kamran*, Hadi Esmaeilzadeh. Alternative Computing Technologies (ACT) Lab, Georgia Institute of Technology. *Institute for Research in Fundamental Sciences. [Title graphic: an array of SMs labeled "NGPU: Neurally Accelerated GPU".]

  2. Approximate computing: embracing imprecision. Relax the abstraction of "near perfect" accuracy in data processing, storage, and communication; accept imprecision to improve performance, energy dissipation, and resource-utilization efficiency.

  3. Opportunity. Many GPU applications are amenable to approximation: augmented reality, machine learning, computer vision, sensor processing, robotics, and multimedia.

  4. Opportunity. [Chart: fraction of runtime in approximable vs. non-approximable regions for binarization, blackscholes, convolution, inversek2j, jmeint, laplacian, meanfilter, newton-raph, sobel, srad, and their geometric mean.] More than 55% of application runtime and energy is spent in neurally approximable regions.

  5. Neural Transformation for GPUs. [Diagram: an approximable code region is replaced by a trained neural network with the same inputs and outputs.]
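The transformation is easiest to see in code. Below is a minimal sketch of the idea, not taken from the deck: a hot, approximable per-element function is replaced by a small MLP trained offline on its input-output pairs. The function body, the 2-4-1 topology, and the weight names are illustrative assumptions.

    #include <math.h>

    // Original approximable region: an exact per-element computation
    // (this particular function is a made-up stand-in).
    __device__ float exact_region(float a, float b) {
        return sqrtf(a * a + b * b) * expf(-fabsf(a * b));
    }

    __device__ float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

    // Neural transformation: the same region replaced by a tiny
    // 2-input / 4-hidden / 1-output MLP trained offline on pairs
    // logged from exact_region. Topology and weights are assumptions,
    // not trained values.
    __constant__ float W1[4][2], B1[4];   // input -> hidden
    __constant__ float W2[4], B2;         // hidden -> output

    __device__ float neural_region(float a, float b) {
        float acc = B2;
        for (int j = 0; j < 4; ++j) {
            float h = sigmoidf(W1[j][0] * a + W1[j][1] * b + B1[j]);
            acc += W2[j] * h;
        }
        return sigmoidf(acc);
    }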

  6–8. Challenges. [Diagram, built up over three slides: a GPU is a many-core processor; each core is simple and in-order, and performance comes from data-level parallelism.]

  9. Challenges. Augmenting each SIMD lane with a CPU-style neural processing unit imposes a 31.2% area overhead.

  10. NGPU: Neurally Accelerated GPU Architecture. [Diagram: a streaming multiprocessor (SM) pipeline with fetch, decode, issue, operand collection, SIMD lanes (src_reg1, src_reg2, src_reg3, dst_reg), and write-back, plus the active mask, SIMT stack, LSU, and I-$/D-$, connected through the interconnection network to the L2 cache, memory partitions, and off-chip DRAM.]

  11. Neural Network Operations. Each neuron j takes inputs x_j,0 … x_j,n and computes y_j = sigmoid( w_j,0 × x_j,0 + … + w_j,i × x_j,i + … + w_j,n × x_j,n ), i.e., the sigmoid of the weighted sum of its inputs.
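In code, the per-neuron operation on the slide is a multiply-accumulate loop followed by the sigmoid. A minimal sketch (function and parameter names are mine, not the paper's):

    #include <math.h>

    __device__ float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

    // y_j = sigmoid( w_{j,0} * x_{j,0} + ... + w_{j,n} * x_{j,n} ),
    // exactly the per-neuron operation on the slide.
    __device__ float neuron(const float *w_j, const float *x_j, int n) {
        float acc = 0.0f;               // running weighted sum
        for (int i = 0; i <= n; ++i)    // inputs x_{j,0} .. x_{j,n}
            acc += w_j[i] * x_j[i];     // multiply-accumulate
        return sigmoidf(acc);           // sigmoid activation
    }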

  12. NGPU: Neurally Accelerated GPU Architecture. [Diagram: each SIMD lane is augmented with an input FIFO, an output FIFO, a weight FIFO, an accumulation register (Acc Reg), a sigmoid unit (Sig. Unit), and a controller; the components are numbered 1–6.]

  13. NGPU: Neurally Accelerated GPU Architecture. [Same lane diagram.] NGPU reuses the existing ALU in each SIMD lane.

  14. NGPU: Neurally Accelerated GPU Architecture. [Same lane diagram.] The weight FIFO is shared among all the SIMD lanes.

  15. NGPU: Neurally Accelerated GPU Architecture. [Same lane diagram.] Overall, NGPU has ≤1% area overhead.
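A behavioral model of the augmented lane described on slides 12–15 might look like the sketch below. This is one reading of the figure, not the RTL; FIFO handling is simplified to plain arrays.

    #include <math.h>

    static float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

    // One neuron on one lane: operands arrive through the per-lane
    // input FIFO, weights stream from the weight FIFO shared by all
    // lanes, the lane's existing ALU does the multiply-add into the
    // accumulation register, and the sigmoid unit pushes the result
    // to the output FIFO.
    void lane_neuron(const float *input_fifo,   // per-lane inputs
                     const float *weight_fifo,  // shared among lanes
                     int n_inputs,
                     float *output_fifo)        // per-lane output
    {
        float acc_reg = 0.0f;                          // Acc Reg
        for (int i = 0; i < n_inputs; ++i)
            acc_reg += weight_fifo[i] * input_fifo[i]; // reuses lane ALU
        *output_fifo = sigmoid(acc_reg);               // Sig. Unit
    }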

  16. NGPU Execution Model. The approximable region becomes the trained network (inputs in0 in %r0 and in1 in %r1; hidden neurons n0, n1 with weights w0–w3; output neuron n2 with weights w4, w5, producing out0 in %r2) plus this instruction sequence in the GPU application:
        ld.global %r0, [addr0];    // in0
        ld.global %r1, [addr1];    // in1
        send.n_data %r0;
        send.n_data %r1;
        recv.n_data %r2;
        st.global [addr2], %r2;    // out0
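At the source level, the transformed region would read roughly like the sketch below. The intrinsics __send_n_data and __recv_n_data are invented names standing in for the send.n_data / recv.n_data ISA extensions; they do not exist in CUDA and are declared here only to make the sketch read like code.

    // Hypothetical source-level view of the slide's instruction sequence.
    __device__ void  __send_n_data(float v);   // stands in for send.n_data
    __device__ float __recv_n_data(void);      // stands in for recv.n_data

    __global__ void transformed_kernel(const float *a, const float *b,
                                       float *out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        float in0 = a[tid];              // ld.global %r0, [addr0]
        float in1 = b[tid];              // ld.global %r1, [addr1]
        __send_n_data(in0);              // send.n_data %r0
        __send_n_data(in1);              // send.n_data %r1
        float out0 = __recv_n_data();    // recv.n_data %r2
        out[tid] = out0;                 // st.global [addr2], %r2
    }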

  17. NGPU Execution Model. send.n_data %r0 queues (in0, in0, …, in0) across the lanes, and send.n_data %r1 queues (in1, in1, …, in1). The lanes then compute, element-wise:
        n0 = sigmoid( w0 × (in0, …, in0) + w2 × (in1, …, in1) )
        n1 = sigmoid( w1 × (in0, …, in0) + w3 × (in1, …, in1) )
        out0 = sigmoid( w4 × (n0, …, n0) + w5 × (n1, …, n1) )
recv.n_data %r2 returns (out0, out0, …, out0), which st.global [addr2], %r2 writes back.
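For reference, here is the per-thread arithmetic of this slide's 2-2-1 network written as ordinary CUDA, with the weights w0–w5 as labeled above; the kernel shape and buffer layout are my assumptions.

    #include <math.h>

    __constant__ float w[6];   // w0..w5 from the slide's network

    __device__ float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

    // Each thread evaluates the network for its own (in0, in1); across
    // a warp this is exactly the lane-parallel computation shown above.
    __global__ void eval_2_2_1(const float *in0, const float *in1,
                               float *out0, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        float n0 = sigmoidf(w[0] * in0[tid] + w[2] * in1[tid]);
        float n1 = sigmoidf(w[1] * in0[tid] + w[3] * in1[tid]);
        out0[tid] = sigmoidf(w[4] * n0 + w[5] * n1);
    }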

  18. NGPU Execution Model. During the ld.global instructions, the SIMD lanes are in normal mode and perform precise computation.

  19. NGPU Execution Model. The send.n_data instructions switch the SIMD lanes into neural mode.

  20. NGPU Execution Model. The SIMD lanes start the calculation of the neural network.

  21–24. NGPU Execution Model. The accelerated SIMD lanes autonomously calculate the neural outputs in lock-step, until recv.n_data returns the results and st.global writes them back.
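The lock-step schedule can be sketched as a controller loop that walks the network one layer at a time, with every lane computing the same neuron for its own thread's data at each step. Layer sizes, buffer layout, and loop structure are my assumptions about the figure.

    #include <math.h>

    // Behavioral sketch of one lock-step layer: at every step, all
    // lanes evaluate the same neuron j of the current layer, so a
    // single weight streamed from the shared weight FIFO feeds every
    // lane and no per-lane control flow is needed. The sequential
    // lane loop models what the hardware does simultaneously.
    void lockstep_layer(int num_lanes, int neurons, int inputs,
                        const float weights[],        // shared weight FIFO
                        const float in[],             // num_lanes * inputs
                        float out[])                  // num_lanes * neurons
    {
        for (int j = 0; j < neurons; ++j)             // one neuron at a time
            for (int lane = 0; lane < num_lanes; ++lane) {
                float acc = 0.0f;
                for (int i = 0; i < inputs; ++i)
                    acc += weights[j * inputs + i] * in[lane * inputs + i];
                out[lane * neurons + j] = 1.0f / (1.0f + expf(-acc));
            }
    }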
