
DRAM Access Reduction by Node Fusion with TVM

Chia-Wei Chang, Jing-Jia Liou, Chih-Tsun Huang, Wei-Chung Hsu & Juin-Ming Lu. National Tsing Hua University & Industrial Technology Research Institute. Dec 5th, 2019.


  1. DRAM Access Reduction by Node Fusion with TVM
     Chia-Wei Chang, Jing-Jia Liou, Chih-Tsun Huang, Wei-Chung Hsu & Juin-Ming Lu
     National Tsing Hua University & Industrial Technology Research Institute
     Dec 5th, 2019

  2. DRAM Access Consumes More Energy
     • Energy efficiency is the key to DNN computation
     • Hardware accelerators
     • DRAM consumes 50-100x more energy per byte than SRAM
     • Node fusion is used to save DRAM accesses

               DRAM   SRAM   Register
     Energy    250x   4x     1x
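
To make the energy table concrete, here is a minimal sketch that compares the relative energy of moving an intermediate tensor through DRAM versus SRAM. It uses only the ratios listed on the slide (250x, 4x, 1x). The function name, the one-write-plus-one-read access model, and the 1 MiB tensor size are illustrative assumptions, not taken from the presentation.

```python
# Relative energy per access, normalized to a register access (slide 2 table).
RELATIVE_ENERGY = {"DRAM": 250, "SRAM": 4, "Register": 1}


def transfer_energy(num_bytes, memory):
    """Relative energy to write a tensor once and read it back once.

    Assumption: cost scales linearly with the number of bytes moved.
    """
    return 2 * num_bytes * RELATIVE_ENERGY[memory]


tensor_bytes = 1 << 20  # hypothetical 1 MiB intermediate tensor
via_dram = transfer_energy(tensor_bytes, "DRAM")
via_sram = transfer_energy(tensor_bytes, "SRAM")
ratio = via_dram / via_sram
print(f"DRAM/SRAM energy ratio: {ratio}")
```

Under this simple model the ratio is 250/4 = 62.5, which falls inside the 50-100x range quoted on the slide.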

  3. TVM Only Fuses Elementwise OPs
     • Currently, TVM only supports fusion of elementwise OPs into Conv
     • Each OP has an attribute that indicates whether it can be fused
     • Generates a TVMOP, which groups nodes so that they share data in SRAM
     [Figure: a TopLevel graph in which Conv, BatchNorm, and Relu are grouped into one TVMOP. Node labels in the diagram: Elementwise, Elementwise, OutElementwiseFusable.]
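
The attribute-driven grouping described on this slide can be illustrated with a toy pass over a linear chain of operators. This is a sketch only: the `Node` class, the `fuse_linear_chain` function, and the pattern strings are hypothetical names for illustration and are not TVM's actual API or fusion algorithm.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    pattern: str  # "OutElementwiseFusable" (e.g. Conv) or "Elementwise"


def fuse_linear_chain(nodes):
    """Greedily attach each elementwise node to the group that precedes it.

    Any non-elementwise node (such as a Conv) starts a new group, so two
    Convs never end up in the same group under this rule.
    """
    groups = []
    for node in nodes:
        if node.pattern == "Elementwise" and groups:
            groups[-1].append(node.name)
        else:
            groups.append([node.name])
    return groups


chain = [
    Node("conv1", "OutElementwiseFusable"),
    Node("bn1", "Elementwise"),
    Node("relu1", "Elementwise"),
    Node("conv2", "OutElementwiseFusable"),
    Node("relu2", "Elementwise"),
]
groups = fuse_linear_chain(chain)
print(groups)
```

In this toy model `conv1` and `conv2` land in separate groups, so the tensor between them would still travel through DRAM.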

  4. Our Node Fusion Merges Multiple Convs
     [Figure: without fusion, the 1st and 2nd DNN operators pass tensor data through DRAM (DRAM, 1st, DRAM, 2nd, DRAM); with fusion, the intermediate tensor is kept in SRAM (DRAM, 1st, SRAM, 2nd, DRAM).]

     Note: as loop bounds, W1 and W2 are output widths; W1[...] and W2[...] are the weight tensors.

     Without fusion:

         # 1st Conv
         for (n = 0; n < N; n++)
           for (k = 0; k < C1; k++)
             for (y = 0; y < H1; y++)
               for (x = 0; x < W1; x++)
                 for (c = 0; c < C0; c++)
                   for (r = 0; r < R1; r++)
                     for (s = 0; s < S1; s++)
                       O1[n][k][y][x] += W1[k][c][r][s] * I[n][c][y + r][x + s]

         # 2nd Conv
         for (n = 0; n < N; n++)
           for (k = 0; k < C2; k++)
             for (y = 0; y < H2; y++)
               for (x = 0; x < W2; x++)
                 for (c = 0; c < C1; c++)
                   for (r = 0; r < R2; r++)
                     for (s = 0; s < S2; s++)
                       O2[n][k][y][x] += W2[k][c][r][s] * O1[n][c][y + r][x + s]

     With fusion:

         for (n = 0; n < N; n++)
           for (k = 0; k < C2; k++)
             for (y = 0; y < H2; y++)
               for (x = 0; x < W2; x++)
                 # Internal SRAM buffer, zeroed for each output element
                 int sram[C1][R2][S2]
                 # 1st Conv: compute only the window the 2nd Conv needs
                 for (c = 0; c < C1; c++)
                   for (r = 0; r < R2; r++)
                     for (s = 0; s < S2; s++)
                       for (c2 = 0; c2 < C0; c2++)
                         for (r2 = 0; r2 < R1; r2++)
                           for (s2 = 0; s2 < S1; s2++)
                             sram[c][r][s] += W1[c][c2][r2][s2] * I[n][c2][y + r + r2][x + s + s2]
                 # 2nd Conv: consume the buffer directly from SRAM
                 for (c = 0; c < C1; c++)
                   for (r = 0; r < R2; r++)
                     for (s = 0; s < S2; s++)
                       O[n][k][y][x] += W2[k][c][r][s] * sram[c][r][s]
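
The fused loop nest on this slide can be checked against the two-step version with a small pure-Python sketch. The function names, the tensor sizes, and the random integer data are assumptions chosen for illustration. The batch dimension N is dropped for brevity, and stride 1 with no padding is assumed, as in the slide's indexing.

```python
import random


def rand_tensor(shape, rng):
    """Nested lists of small random integers with the given shape."""
    if len(shape) == 1:
        return [rng.randint(-3, 3) for _ in range(shape[0])]
    return [rand_tensor(shape[1:], rng) for _ in range(shape[0])]


def conv2d(inp, wgt):
    """Stride-1, no-padding convolution: inp[c][y][x], wgt[k][c][r][s]."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K, R, S = len(wgt), len(wgt[0][0]), len(wgt[0][0][0])
    return [[[sum(wgt[k][c][r][s] * inp[c][y + r][x + s]
                  for c in range(C) for r in range(R) for s in range(S))
              for x in range(W - S + 1)]
             for y in range(H - R + 1)]
            for k in range(K)]


def fused_conv2d(inp, w1, w2):
    """Two convolutions fused; the intermediate lives only in a small buffer."""
    C0, H0, W0 = len(inp), len(inp[0]), len(inp[0][0])
    C1, R1, S1 = len(w1), len(w1[0][0]), len(w1[0][0][0])
    C2, R2, S2 = len(w2), len(w2[0][0]), len(w2[0][0][0])
    H2, W2 = H0 - R1 - R2 + 2, W0 - S1 - S2 + 2
    out = [[[0] * W2 for _ in range(H2)] for _ in range(C2)]
    for k in range(C2):
        for y in range(H2):
            for x in range(W2):
                # Stand-in for the on-chip buffer sram[C1][R2][S2].
                sram = [[[0] * S2 for _ in range(R2)] for _ in range(C1)]
                for c in range(C1):
                    for r in range(R2):
                        for s in range(S2):
                            for c2 in range(C0):
                                for r2 in range(R1):
                                    for s2 in range(S1):
                                        sram[c][r][s] += (
                                            w1[c][c2][r2][s2]
                                            * inp[c2][y + r + r2][x + s + s2])
                for c in range(C1):
                    for r in range(R2):
                        for s in range(S2):
                            out[k][y][x] += w2[k][c][r][s] * sram[c][r][s]
    return out


rng = random.Random(0)
inp = rand_tensor((2, 5, 5), rng)    # C0 = 2, 5x5 input
w1 = rand_tensor((3, 2, 2, 2), rng)  # C1 = 3, R1 = S1 = 2
w2 = rand_tensor((2, 3, 3, 3), rng)  # C2 = 2, R2 = S2 = 3
unfused = conv2d(conv2d(inp, w1), w2)
fused = fused_conv2d(inp, w1, w2)
print(fused == unfused)
```

As written, the buffer is recomputed for every output element, so the fused form trades extra multiply-accumulates for not storing the intermediate tensor off chip.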

  5. Experiment Settings: Hardware
     • Eyeriss-like architecture
     • 256 MB DRAM
     • 108 KB SRAM
     • 12x14 PE array
     • Runs AlexNet
     • Due to hardware limitations, only Conv is evaluated
     [Figure: block diagram showing DRAM, a controller, a buffer, and the PE array, with ifmap, weights, ipsum, and opsum data paths.]
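
One quick sanity check these settings suggest is whether the fused intermediate buffer `sram[C1][R2][S2]` from slide 4 fits in the 108 KB SRAM. The sketch below assumes 2 bytes per element and 1 KB = 1024 bytes; the slides state neither. The layer shape used is hypothetical and the helper names are made up for this example.

```python
SRAM_BYTES = 108 * 1024  # 108 KB SRAM from the slide (assuming 1 KB = 1024 B)


def fused_buffer_bytes(c1, r2, s2, bytes_per_element=2):
    """Size in bytes of the fused intermediate buffer sram[C1][R2][S2].

    Assumption: 2 bytes per element; the slides do not give the data width.
    """
    return c1 * r2 * s2 * bytes_per_element


def fits_in_sram(c1, r2, s2, bytes_per_element=2):
    """True if the intermediate buffer alone is no larger than the SRAM."""
    return fused_buffer_bytes(c1, r2, s2, bytes_per_element) <= SRAM_BYTES


# Hypothetical layer pair: 256 intermediate channels feeding a 3x3 convolution.
size = fused_buffer_bytes(256, 3, 3)
print(size, fits_in_sram(256, 3, 3))
```

This is a necessary condition only, since the same SRAM also has to hold input feature maps, weights, and partial sums.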

  6. Experimental Results
     [Figure: three bar charts comparing "w/o Fusion" and "Fusion": Energy (mJ), MCycle, and Energy-Delay (KCycle.J). Percentages annotated on the charts: 16%, 23%, and 40%.]
