 
InS-DLA: An In-SSD Deep Learning Accelerator for Near-Data Processing
Shengwen Liang 1,2, Ying Wang 1,2, Cheng Liu 1,2, Huawei Li 1,2,3, Xiaowei Li 1,2
1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
2 University of Chinese Academy of Sciences
3 Peng Cheng Laboratory, Shenzhen, China
1/11
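Outline – InS-DLA
[Figure: conventional system. The SSD connects over SATA/PCIe to the CPU and DRAM, and over PCIe (peer to peer) to a GPU/FPGA.]
• Long data movement path
• High energy consumption
• Multiple memory hierarchies
[Figure: InS-DLA. A deep learning accelerator sits beside the SSD controller, with DRAM and NAND flash, for in-SSD data analysis.]
• Shorten data movement path
• Energy efficiency
2/11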
Opportunities & Challenges
[Figure: bandwidth (MB/s, 0 to 10000) by year (2006, 2011, 2016) for NAND flash bandwidth inside an SSD (8 channel), SATA, SAS (2 port), and PCIe (x4).]
• The internal bandwidth of an SSD can be 16x higher than its external bandwidth.
• In-SSD data access is more efficient than access through the external interface.
• Goal: fully utilize the high internal bandwidth of the SSD.
[Figure: conventional SSD vs. Open-Channel SSD. In a conventional SSD, the host system uses logical addresses and the device handles block metadata, write buffering, wear-leveling, error handling, and the media controller in front of the non-volatile media. In an Open-Channel SSD, the host system handles block metadata, write buffering, and wear-leveling and uses physical addresses, while the device keeps error handling and the media controller in front of the non-volatile media.]
• The Open-Channel SSD exposes physical addresses to the host so that the host can specify the data layout.
3/11
Opportunities & Challenges
NAND flash compared with DRAM:
• Coarse-grained data operation: NAND flash is accessed by page (4~16KB), DRAM by bytes
• Slower than DRAM
1. How to rearrange the data layout in the NAND flash to balance the flash bandwidth and the DLA throughput?
2. How to configure the Error Correction Engine to provide more hardware resources for the deep learning accelerator in the area- and power-limited SSD controller chip?
[Figure: SSD controller with NAND flash PHY and Error Correction Engine. Label on the Error Correction Engine: 75.3%, considerable hardware resource.]
4/11
Overview of InS-DLA
[Figure: SSD controller block diagram. An ARM core runs user space and kernel space firmware (scheduling, error handling, wear leveling, garbage collection, bad block management, data placement, ...). The I/O interface is SATA or PCIe. Each flash channel has a D-ECC unit, a PHY, and an FMC in front of the FLASH. InS-DLA contains an array of PEs with a weight buffer (weight path) and an in/out buffer (in/out path).]
• InS-DLA accesses data directly from NAND flash instead of DRAM.
• InS-DLA uses the output-stationary dataflow, as introduced in Eyeriss.
5/11
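The output-stationary dataflow named on this slide can be illustrated with a small sketch. This is only a software model under stated assumptions, not the InS-DLA hardware: the function name `output_stationary_matmul` is hypothetical, and each accumulator entry stands in for one PE that keeps its partial sum locally while inputs and weights stream past.

```python
def output_stationary_matmul(inputs, weights):
    """Multiply two matrices (lists of lists) with an output-stationary loop order.

    Hypothetical model: acc[r][c] plays the role of the PE at position (r, c).
    Its partial sum stays in place for the whole reduction and is read out
    only once, after the last step.
    """
    rows, depth, cols = len(inputs), len(weights), len(weights[0])
    acc = [[0] * cols for _ in range(rows)]  # one stationary partial sum per PE
    for k in range(depth):                   # one reduction step per iteration
        for r in range(rows):                # row r receives inputs[r][k]
            for c in range(cols):            # column c receives weights[k][c]
                acc[r][c] += inputs[r][k] * weights[k][c]
    return acc


result = output_stationary_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The reduction index `k` is the outermost loop, so every output is updated in place at each step and nothing is written back until the sum is complete.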
Flash-aware data layout
1. How to rearrange the data layout in the NAND flash to balance the flash bandwidth and the DLA throughput?
Physical Page Address command fields: Channel | LUN | BLOCK | PAGE | SECTOR
• Maximize data parallelism by using the physical page address commands provided by the Open-Channel SSD.
[Figure: data for PEs (0,0), (1,0), (2,0), (0,1), (1,1), (2,1) is distributed across Channel-0, Channel-1, and Channel-2.]
• Improve single-channel throughput with the read page cache command.
[Figure: read page command: Flash Block -> Page Register -> Controller. Read page cache command: Flash Block -> Page Register -> Cache Register -> Controller.]
6/11
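One way to picture a channel-first layout built on physical page addresses is the sketch below. It is an illustration under assumptions, not the layout used by the authors: `PPA` and `striped_ppa` are hypothetical names, the sector field is left out for brevity, the default of 8 channels follows the 8-channel OpenSSD board from the experiment setup, and `luns` and `pages_per_block` are made-up defaults.

```python
from typing import NamedTuple


class PPA(NamedTuple):
    """Hypothetical physical page address (sector field omitted)."""
    channel: int
    lun: int
    block: int
    page: int


def striped_ppa(index, channels=8, luns=1, pages_per_block=256):
    """Map the index-th logical page of a tensor to a physical page address.

    The channel is the fastest-changing field, so consecutive pages land on
    different channels and can be read in parallel.
    """
    channel = index % channels
    index //= channels
    lun = index % luns
    index //= luns
    page = index % pages_per_block
    block = index // pages_per_block
    return PPA(channel, lun, block, page)


first_ten_channels = [striped_ppa(i).channel for i in range(10)]
```

With these defaults, pages 0 to 7 go to channels 0 to 7, and page 8 wraps around to channel 0 at the next page offset.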
Fault-tolerant aware strategy
2. How to configure the Error Correction Engine to provide more hardware resources for the deep learning accelerator in the area- and power-limited SSD controller chip?
Bit positions of a data byte: MSB 7 6 5 4 3 2 1 0 LSB
• Deep learning is fault-tolerant: there is no accuracy loss when bit errors occur only in the low-order bit positions.
[Figure: D-ECC encoder and decoder around the flash, shown with example 8-bit data words. On each side, a BCH (1101,1024,15) code protects part of the data, while a low bit buffer holds the low-order bits.]
• Fault-tolerant aware strategy: change the protection region.
7/11
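The idea of changing the protection region can be sketched by splitting each byte into a protected high part and an unprotected low part. This is a simplified model under assumptions: the function names are hypothetical, the 4/4 split is an arbitrary default (the slide does not state where the boundary lies), and the BCH encoding of the high part is not implemented here.

```python
def split_byte(value, protected_bits=4):
    """Split an 8-bit value into (high, low) parts.

    Assumption: the high part would go through the BCH code and the low part
    would go to the low bit buffer without protection.
    """
    low_bits = 8 - protected_bits
    high = value >> low_bits
    low = value & ((1 << low_bits) - 1)
    return high, low


def merge_byte(high, low, protected_bits=4):
    """Rebuild the 8-bit value from its two parts."""
    return (high << (8 - protected_bits)) | low


def worst_case_error(protected_bits=4):
    """Largest absolute error possible when only unprotected bits flip."""
    return (1 << (8 - protected_bits)) - 1


high, low = split_byte(0b11100001)
corrupted = merge_byte(high, low ^ 0b1111)  # flip every unprotected bit
error = abs(corrupted - 0b11100001)
```

Because errors are confined to the low part, the deviation of any stored value is bounded by `worst_case_error`, which is the property the slide relies on when it shrinks the ECC-protected region.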
Experiment Setup
Baselines    CPU                               GPU                  FPGA         NAND FLASH
CPU          Intel Xeon E5-2630 v4 @ 2.20GHz   -                    -            Flash Controller
CPU+FPGA-1   Intel Xeon E5-2630 v4 @ 2.20GHz   -                    Zynq ZC706
CPU+FPGA-2   Intel Xeon E5-2630 v4 @ 2.20GHz   -                    Zynq ZC706   InS-DLA
CPU+GPU      Intel Xeon E5-2630 v4 @ 2.20GHz   NVIDIA GTX 1080 Ti   -            PCIe
OpenSSD platform:
• Zynq FPGA chip: InS-DLA and flash controller
• Dual Cortex-A9: firmware
• 1GB DRAM
• 8-channel NAND flash
• PCIe Gen 2 (maximum lane = 8)
FPGA resource usage:
Module           LUT               FF
FMC+ECC          102010            76712
FMC+D-ECC        47858 (-53%)      66880 (-12%)
NVMe Interface   8585              11456
DLA              93232             21929
Total            149675 (68.46%)   100265 (22.93%)
8/11
Experiment Result
[Figure: energy (J, logarithmic axis from 1 to 100) and energy efficiency (GOPS/W, 0 to 3) on AlexNet, SqueezeNet, ResNet-18, and GoogLeNet. Series: CPU, CGPU, C-FPGA-1, C-FPGA-2, InS-DLA + ECC (sim), InS-DLA + ECC (FPGA).]
• InS-DLA, on both the simulator and the FPGA prototype, outperforms the four baselines in energy efficiency.
9/11
Experiment Result
Command                                  Latency (cycle)   Improvement
General Read Command                     60330240          1 (baseline)
Read Cache Command                       31809810          47.27%
General Read + Flash-aware data layout   8541280           85.84%
Read Cache + Flash-aware data layout     3976226           93.41%
[Figure: performance (GOPS/W, 0 to 4) and energy (J, 0 to 4) on AN, SQ, RN, and GN. Series: InS-DLA + ECC (sim), InS-DLA + ECC (FPGA), InS-DLA + D-ECC (sim), InS-DLA + D-ECC (FPGA).]
• The flash-aware data layout with the read cache command improves throughput.
• D-ECC reduces the energy cost by 34% on the simulator and 30% on the FPGA prototype compared to the ECC hardware.
10/11
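The improvement percentages on this slide follow from the latency figures as the relative reduction against the General Read Command baseline. The short check below recomputes them; the dictionary keys and the helper name `improvement_percent` are illustrative, and only the cycle counts come from the slide.

```python
# Latency in cycles, copied from the slide.
latency_cycles = {
    "General Read": 60330240,
    "Read Cache": 31809810,
    "General Read + flash-aware layout": 8541280,
    "Read Cache + flash-aware layout": 3976226,
}


def improvement_percent(baseline, latency):
    """Relative latency reduction against the baseline, in percent."""
    return round((1 - latency / baseline) * 100, 2)


baseline = latency_cycles["General Read"]
improvements = {
    name: improvement_percent(baseline, cycles)
    for name, cycles in latency_cycles.items()
}
```

The same numbers can be read as a speedup: dividing the baseline latency by the latency of the last configuration gives roughly a 15x reduction in cycles.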
More details are shown in the poster. Thank you for your attention.
11/11