InS-DLA: An In-SSD Deep Learning Accelerator for Near-Data Processing (PowerPoint PPT Presentation)



SLIDE 1

InS-DLA: An In-SSD Deep Learning Accelerator for Near-Data Processing

Shengwen Liang1,2, Ying Wang1,2, Cheng Liu1,2 Huawei Li1,2,3, Xiaowei Li1,2

1State Key Laboratory of Computer Architecture,

Institute of Computing Technology, Chinese Academy of Sciences, Beijing

2University of Chinese Academy of Sciences 3Peng Cheng Laboratory, Shenzhen, China


SLIDE 2

Outline – InS-DLA


[Diagram: a conventional system connects CPU, DRAM, SSD, and GPU/FPGA over SATA/PCIe, with PCIe peer-to-peer transfers between devices.]

Problems with the conventional deep learning pipeline:
  • Multiple memory hierarchies
  • Long data movement path
  • High energy consumption

Placing InS-DLA inside the SSD controller, next to the DRAM and NAND flash, shortens the data movement path and improves the energy efficiency of data analysis.

SLIDE 3

Opportunities & Challenges


[Chart: Bandwidth (MB/s) versus year (2006, 2011, 2016) for an 8-channel NAND flash SSD and for the SATA, SAS (2-port), and PCIe (x4) external interfaces.]

The internal bandwidth of an SSD can be 16x higher than its external bandwidth, so in-SSD data access is more efficient than access through the external interface. The opportunity is to fully utilize this high internal bandwidth.

In a conventional SSD, the media controller keeps block metadata, write buffering, wear-leveling, and error handling inside the device, hiding the non-volatile media from the host system. An Open-Channel SSD moves these functions to the host and exposes physical addresses (not just logical addresses), so the host can specify the data layout.

SLIDE 4

Opportunities & Challenges


Unlike byte-addressable DRAM, NAND flash is accessed at page granularity (4~16 KB):
  • Coarse-grained data operation
  • Slower than DRAM

Challenge 1: How to rearrange the data layout in the NAND flash to balance flash bandwidth and DLA throughput?

In the SSD controller, the NAND flash PHY and the error correction engine occupy considerable hardware resources (75.3%).

Challenge 2: How to configure the error correction engine to provide more hardware resources for the deep learning accelerator in the area- and power-limited SSD controller chip?

SLIDE 5

Overview of InS-DLA


[Architecture diagram: inside the SSD, an ARM core runs the firmware (scheduling, error handling, bad block management, wear leveling, garbage collection, data placement, spanning kernel and user space) alongside the I/O interface (SATA, PCIe). InS-DLA contains a PE array fed by a weight buffer and an in/out buffer over separate weight and in/out paths, and connects to the flash dies through per-channel FMC + D-ECC + PHY blocks.]

  • InS-DLA directly accesses data from the NAND flash instead of DRAM.
  • It adopts the output-stationary dataflow introduced in Eyeriss.
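The output-stationary dataflow can be illustrated with a small, hypothetical Python model: each output element is pinned to one PE, which accumulates its partial sum in place while inputs and weights stream past. The function name and the matrix sizes are illustrative, not taken from the talk.

```python
# Illustrative output-stationary dataflow: each (m, n) output element stays
# pinned to one PE, which accumulates its partial sum locally while one
# input/weight slice streams through the array per step.

def output_stationary_matmul(inputs, weights):
    """inputs: M x K, weights: K x N -> outputs: M x N."""
    M, K, N = len(inputs), len(weights), len(weights[0])
    # One local accumulator per PE: the output element never moves.
    psum = [[0] * N for _ in range(M)]
    for k in range(K):            # stream dimension: one slice per cycle
        for m in range(M):        # PE row
            for n in range(N):    # PE column
                psum[m][n] += inputs[m][k] * weights[k][n]
    return psum

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(output_stationary_matmul(A, B))  # [[19, 22], [43, 50]]
```

The design choice matters here because keeping partial sums stationary minimizes writes of intermediate results, which is exactly what an accelerator fed directly from page-granularity flash wants to avoid.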

SLIDE 6

Flash-aware data layout


Challenge 1: How to rearrange the data layout in the NAND flash to balance flash bandwidth and DLA throughput?

[Diagram: a physical page address command names a Channel / LUN / Block / Page / Sector. Inside each flash die, a page register and a cache register sit between the block and the flash controller; the read page cache command pipelines consecutive page reads through the cache register, improving single-channel throughput over the plain read command. Data destined for the PE array (PE (0,0) ... PE (2,1)) is striped across Channel-0, Channel-1, Channel-2, ...]

Flash-aware data layout: maximize data parallelism by using the physical page address commands provided by the Open-Channel SSD.
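A minimal sketch of the channel-striping idea, under assumed constants (`PAGE_BYTES`, `NUM_CHANNELS`) that are illustrative rather than taken from the slides: consecutive pages of a tensor are placed round-robin across channels so that all channels can be read in parallel with physical page address commands.

```python
# Hypothetical flash-aware layout sketch: stripe a tensor's pages
# round-robin across flash channels to maximize channel-level parallelism.
PAGE_BYTES = 4096     # flash page size (4~16 KB in real devices)
NUM_CHANNELS = 8      # matches the 8-channel SSD prototype

def flash_aware_layout(num_bytes):
    """Map a tensor of num_bytes onto (channel, page-within-channel) slots."""
    num_pages = -(-num_bytes // PAGE_BYTES)   # ceiling division
    layout = []
    for page in range(num_pages):
        channel = page % NUM_CHANNELS         # round-robin channel pick
        local_page = page // NUM_CHANNELS     # page index inside that channel
        layout.append((channel, local_page))
    return layout

# 10 pages spread over 8 channels: the first 8 pages land on channels 0..7,
# and the remaining 2 wrap back to channels 0 and 1.
print(flash_aware_layout(10 * PAGE_BYTES))
```

With this placement, a burst read of consecutive tensor pages touches every channel once before revisiting any channel, which is what balances flash bandwidth against DLA throughput.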

SLIDE 7

Fault-tolerant aware strategy


Challenge 2: How to configure the error correction engine to provide more hardware resources for the deep learning accelerator in the area- and power-limited SSD controller chip?

Deep learning is fault tolerant: there is no accuracy loss when bit errors occur only at the lower-order bit positions (toward the LSB) of a value.

[Diagram: on the write path, the D-ECC encoder routes the low-order bits of each value into a low-bit buffer unprotected, while the high-order bits are protected with a BCH(1101, 1024, 15) code before being written to flash; the D-ECC decoder reverses the process on reads.]

Fault-tolerant aware strategy: change the protection region so that ECC covers only the high-order bits.
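The bit-splitting step of this strategy can be sketched as follows. `PROTECTED_BITS` is a hypothetical parameter and the BCH(1101, 1024, 15) code itself is not modeled; the sketch only shows how each 8-bit weight would be partitioned between the protected region and the unprotected low-bit buffer.

```python
# Hypothetical sketch of the fault-tolerant aware split: only the MSBs of
# each 8-bit weight are routed to the ECC-protected region, while the LSBs
# go to an unprotected low-bit buffer. The actual BCH code is not modeled.
PROTECTED_BITS = 4              # how many MSBs to protect (illustrative)

def split_weight(w):
    """Split an 8-bit weight into (protected MSBs, unprotected LSBs)."""
    high = w >> (8 - PROTECTED_BITS)              # MSBs -> ECC-protected
    low = w & ((1 << (8 - PROTECTED_BITS)) - 1)   # LSBs -> low-bit buffer
    return high, low

def merge_weight(high, low):
    """Reassemble the original 8-bit weight after a read."""
    return (high << (8 - PROTECTED_BITS)) | low

w = 0b10100101
hi, lo = split_weight(w)
assert merge_weight(hi, lo) == w   # lossless round trip
print(bin(hi), bin(lo))            # 0b1010 0b101
```

Because only the protected MSB stream needs parity bits, the ECC engine can shrink, freeing area and power for the accelerator, while an uncorrected flip in the LSB stream perturbs a weight by at most a small magnitude.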

SLIDE 8

Experiment Setup


Prototype: the OpenSSD platform (NAND flash, flash controller, InS-DLA, PCIe).
  • Zynq FPGA chip: hosts InS-DLA and the flash controller
  • Dual Cortex-A9: runs the firmware
  • 1 GB DRAM
  • 8-channel NAND flash
  • PCIe Gen 2 (maximum of 8 lanes)

Baselines:
  • CPU: Intel Xeon E5-2630 v4 @ 2.20 GHz
  • CPU+FPGA-1: Intel Xeon E5-2630 v4 @ 2.20 GHz + Zynq ZC706
  • CPU+FPGA-2: Intel Xeon E5-2630 v4 @ 2.20 GHz + Zynq ZC706
  • CPU+GPU: Intel Xeon E5-2630 v4 @ 2.20 GHz + NVIDIA GTX 1080 Ti

FPGA resource utilization:

  Module          LUT               FF
  FMC+ECC         102010            76712
  FMC+D-ECC       47858 (-53%)      66880 (-12%)
  NVMe Interface  8585              11456
  DLA             93232             21929
  Total           149675 (68.46%)   100265 (22.93%)

SLIDE 9

Experiment Result


[Charts: Energy (J) and energy efficiency (GOPS/W) on AlexNet, SqueezeNet, ResNet-18, and GoogleNet for CPU, CPU+GPU, CPU+FPGA-1, CPU+FPGA-2, InS-DLA + ECC (sim), and InS-DLA + ECC (FPGA).]

  • InS-DLA, on both the simulator and the FPGA prototype, outperforms all four baselines in terms of energy efficiency.

SLIDE 10

Experiment Result


  Read strategy                            Latency (cycles)   Improvement
  General read command                     60330240           1x (baseline)
  Read cache command                       31809810           47.27%
  General read + flash-aware data layout   8541280            85.84%
  Read cache + flash-aware data layout     3976226            93.41%
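As a sanity check, the improvement percentages follow directly from the latency numbers: each is the relative latency reduction against the general read command baseline. The helper name below is hypothetical.

```python
# Improvement = 1 - latency / baseline_latency, vs. the general read command.
BASELINE = 60330240  # general read command latency (cycles)

def improvement(latency_cycles):
    """Percentage latency reduction relative to the baseline, 2 decimals."""
    return round((1 - latency_cycles / BASELINE) * 100, 2)

print(improvement(31809810))  # 47.27  (read cache command)
print(improvement(8541280))   # 85.84  (general read + flash-aware layout)
print(improvement(3976226))   # 93.41  (read cache + flash-aware layout)
```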

[Charts: Energy (J) and performance (GOPS/W) on AN (AlexNet), SQ (SqueezeNet), RN (ResNet-18), and GN (GoogleNet) for InS-DLA + ECC (sim), InS-DLA + ECC (FPGA), InS-DLA + D-ECC (sim), and InS-DLA + D-ECC (FPGA).]

  • The flash-aware data layout with the read cache command improves throughput.
  • D-ECC reduces the energy cost by 34% on the simulator and 30% on the FPGA prototype compared to the ECC hardware.

SLIDE 11

InS-DLA: An In-SSD Deep Learning Accelerator for Near-Data Processing

Shengwen Liang1,2, Ying Wang1,2, Cheng Liu1,2 Huawei Li1,2,3, Xiaowei Li1,2

1State Key Laboratory of Computer Architecture,

Institute of Computing Technology, Chinese Academy of Sciences, Beijing

2University of Chinese Academy of Sciences, 3Peng Cheng Laboratory, Shenzhen, China

More details are shown in the poster; we look forward to seeing you there.

Thank you for your attention