NMAX: Fast, Modular, Low Latency, Low Cost/Power Neural Inferencing


SLIDE 1

No NDA Required – Public Information

NMAX: Fast, Modular, Low Latency, Low Cost/Power Neural Inferencing


SLIDE 2

Leader in eFPGA

  • TSMC IP Alliance Member
  • Working eFPGA silicon in TSMC 40/28/16/12
  • eFPGA in design for GF14 and TSMC 7/7+
  • >>10 customer deals (design/fab/silicon) spanning 16-180nm; more in negotiation
  • First 6 agreements announced
SLIDE 3

NMAX Value Proposition

  • Inferencing: 8x8 or 16x8 natively; 16x16 at ½ throughput
  • High MAC utilization: more throughput out of less silicon
  • Low DRAM bandwidth: more throughput at less system cost and power
  • Low latency: high performance, low latency at batch = 1
  • Modular: can respond quickly to customer needs
  • Scalable: doubling MACs doubles throughput
  • Flexible: run any NN, or multiple NNs
  • TensorFlow/Caffe: easy to program

SLIDE 4

NMAX Inferencing Applications

  • Edge:
    ▪ Automotive
    ▪ Aerospace
    ▪ Cameras
    ▪ Cell Phones
    ▪ Drones
    ▪ PCs
    ▪ Retail
    ▪ Robotics
    ▪ Speech
    ▪ Surveillance
  • Data Centers


SLIDE 5

Comparing Neural Engine Alternatives


SLIDE 6

Neural Inferencing Challenges & Terminology

  • An inferencing chip with 1000 MACs @ 1GHz has 1 TMACs/sec = 2 TOPS
    ▪ This is a peak number: in practice no workload uses all of the MACs
  • Challenge: move data, at low power, to where it is needed to keep the MACs utilized
  • Challenge: maintain high performance at batch=1 for the lowest latency
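
A minimal sketch of the peak-TOPS arithmetic above (using the standard convention that one MAC is two operations, a multiply plus an add):

```python
# Peak throughput of an inferencing chip: each MAC completes one
# multiply and one add per cycle, so 1 MAC-op = 2 ops.
OPS_PER_MAC = 2

def peak_tops(num_macs: int, clock_ghz: float) -> float:
    """Peak TOPS = MACs * clock (GHz) * 2 ops / 1000."""
    return num_macs * clock_ghz * OPS_PER_MAC / 1000

# 1000 MACs @ 1 GHz -> 1 TMACs/sec -> 2 TOPS (peak, not sustained)
print(peak_tops(1000, 1.0))  # 2.0
```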


SLIDE 7

How do you get the NN throughput you need?

  • 1. Determine how many OPS/MACs are needed for each image
    ▪ YOLOv3 2MP: 400 Billion MACs/image = 800 Billion Operations/image
  • 2. Determine how many images/second you need to process
    ▪ YOLOv3 2MP autonomous driving: 30 images/sec = 24 TOPS throughput
  • 3. The number of MACs you need follows from this formula:
    ▪ Y TOPS Peak = X TOPS Throughput ÷ MAC Utilization
    ▪ MAC utilization will vary based on NN, image size, and batch size
    ▪ Batch=1 is what you need at the edge
    ▪ Number of MACs required = Y TOPS Peak ÷ Frequency of MAC Completion
    ▪ NOTE: there are no shortcuts in the above for model pruning, Winograd, or compression
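
A sketch of steps 1-3 for the YOLOv3 example (the 50% MAC utilization and the 1 GHz clock are assumed values for illustration only; the slides quote their own utilization figures later):

```python
# Sizing sketch for YOLOv3 2MP @ 30 fps (figures from this slide;
# 2 ops per MAC, matching the peak-TOPS definition earlier).
macs_per_image = 400e9               # 400 Billion MACs/image
ops_per_image = 2 * macs_per_image   # 800 Billion Operations/image
fps = 30

throughput_tops = ops_per_image * fps / 1e12        # X = 24 TOPS sustained

mac_utilization = 0.5    # ASSUMED for illustration; varies with NN/image/batch
peak_tops_needed = throughput_tops / mac_utilization  # Y = 48 TOPS peak

clock_hz = 1e9           # ASSUMED frequency of MAC completion (1 GHz)
macs_needed = peak_tops_needed * 1e12 / 2 / clock_hz
print(f"{throughput_tops:.0f} TOPS sustained -> {peak_tops_needed:.0f} TOPS peak "
      f"-> {macs_needed:,.0f} MACs @ 1 GHz")
# 24 TOPS sustained -> 48 TOPS peak -> 24,000 MACs @ 1 GHz
```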


SLIDE 8

MAC Utilization / MAC Efficiency

  • A MAC can only do a useful calculation if both the activation and the weight are available on its inputs; if not, it stalls
  • MAC Utilization = (# of useful MAC calculations) ÷ (# of MACs available)
  • Example:
    ▪ Nvidia Tesla T4 claims 3920 images/second on ResNet-50 @ Batch=28
    ▪ Each ResNet-50 image takes 7 Billion Operations (3.5 Billion MACs)
    ▪ So the T4 is doing 3920 × 7 Billion Ops = 27.44 Trillion Ops/sec = 27.44 TOPS
    ▪ The T4 data sheet claims 130 TOPS (int8)
    ▪ So T4 MAC utilization, for ResNet-50 @ Batch=28, is 21%
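
The same calculation as a short sketch, using only the figures quoted above:

```python
# MAC utilization worked out from the T4 figures on this slide.
images_per_sec = 3920        # claimed, ResNet-50 @ batch=28
ops_per_image = 7e9          # 7 Billion Operations (3.5 Billion MACs)
datasheet_peak_tops = 130    # int8 peak from the T4 data sheet

achieved_tops = images_per_sec * ops_per_image / 1e12   # 27.44 TOPS
utilization = achieved_tops / datasheet_peak_tops
print(f"{achieved_tops:.2f} TOPS achieved -> {utilization:.0%} MAC utilization")
# 27.44 TOPS achieved -> 21% MAC utilization
```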


SLIDE 9

Microsoft BrainWave Slide from HotChips 2018

[Chart: batch size vs. 99th-percentile latency and hardware utilization (%), with a maximum-allowed-latency line; "IDEAL" is marked against existing solutions]

Key point: batching improves HW utilization but increases latency.

SLIDE 10

ResNet-50 Int8
  • Image classification
  • 224 x 224 pixels
  • 50-stage neural network
  • 22.7 Million weights
  • 3.5 Billion MACs per image = 7 Billion Operations per image

SLIDE 11

ResNet-50 Images/Second vs Batch Size: NMAX utilization high at batch=1

[Chart: images per second (scale 2,000-16,000) vs. batch size (1, 5, 10, 28) for Nvidia Tesla T4, Habana Goya, NMAX 6x6, NMAX 12x6, and NMAX 12x12; "EDGE" marks the batch=1 operating point]

SLIDE 12

Real Time Object Recognition: YOLOv3


SLIDE 13

YOLOv3 Int8
  • Real-time object recognition
  • 2 Megapixel images
  • >100-stage neural network
  • 62 Million weights
  • 400 Billion MACs per image = 800 Billion Operations per image

SLIDE 14

NMAX: YOLOv3, 2048x1024, Batch=1, using 2 x 4Gbit LPDDR4 DRAM

10x reduction in DRAM BW requirements vs. competing solutions (<25 GB/s vs. >300 GB/s)

| | NMAX 12x12 | NMAX 12x6 | NMAX 6x6 |
|---|---|---|---|
| SRAM Size | 64MB | 64MB | 32MB |
| TOPS Peak | 147 | 73 | 37 |
| Throughput (@1GHz) | 124 fps | 72 fps | 27 fps |
| Latency | 8 ms | 14 ms | 37 ms |
| Avg. DRAM BW | 12 GB/s | 14 GB/s | 10 GB/s |
| Avg. SRAM BW | 177 GB/s | 103 GB/s | 34 GB/s |
| XFLX & ArrayLINX BW | 18 TB/s | 10 TB/s | 4 TB/s |
| MAC Efficiency (max useable DRAM BW: 25 GB/s) | 67% (98 TOPS throughput) | 78% (58 TOPS throughput) | 58% (22 TOPS throughput) |

T4-class performance (12x12 configuration)
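
As a quick consistency check of the table (a sketch using only numbers from these slides; the small rounding gaps against the table's fps column are expected):

```python
# Cross-check: throughput = peak TOPS * MAC efficiency, and
# fps = sustained ops/sec / (ops per YOLOv3 2MP image).
ops_per_image = 800e9   # from the YOLOv3 slide

configs = {             # array size: (TOPS peak, MAC efficiency)
    "12x12": (147, 0.67),
    "12x6":  (73, 0.78),
    "6x6":   (37, 0.58),
}
for name, (peak, eff) in configs.items():
    tops = peak * eff                    # sustained TOPS
    fps = tops * 1e12 / ops_per_image    # images/second
    print(f"NMAX {name}: {tops:.0f} TOPS -> {fps:.0f} fps")
# 12x12: 98 TOPS -> 123 fps; 12x6: 57 TOPS -> 71 fps; 6x6: 21 TOPS -> 27 fps
```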

SLIDE 15

Why is NMAX the right solution for performance inferencing?

  • The most efficient implementation of any neural network is a hardwired ASIC
  • But customers want reconfigurability
  • NMAX is the closest reconfigurable architecture to a hardwired ASIC
    ▪ Each stage, when configured, executes just like an ASIC
  • NMAX can run any neural network written in TensorFlow/Caffe

SLIDE 16

NMAX512 Tile Microarchitecture: 1 TOPS @ <2mm² in TSMC16FFC/12FFC

Features:
  • 8 NMAX clusters achieve 50-90% MAC efficiency
  • Local eFPGA logic (EFLX) for:
    ▪ Control logic & management
    ▪ Reconfigurable data flow
    ▪ Additional signal processing (e.g. ReLU, Sigmoid, Tanh)
  • Local L1 SRAM for weights & activations
  • L2 SRAM (via RAMLINX)
  • L3 storage through DDR/PCIe
  • High-speed XFLX interconnect links all blocks within the tile
  • High-speed ArrayLINX connects to adjacent NMAX tiles to create larger NMAX arrays by abutment

[Diagram: NMAX512 tile* — NMAX clusters, L1 SRAM, EFLX logic, and EFLX IO joined by the XFLX interconnect; ArrayLINX to adjacent tiles on the tile edges; L2 SRAM via RAMLINX; DDR, PCIe & SoC connections. *Architectural diagram, not to scale]

SLIDE 17

Every Tile is Reconfigured (quickly & differently) for Every Stage

[Diagram: NMAX512 tile (architectural diagram, not to scale) with one stage's datapath highlighted: input activations from L2 SRAM through the NMAX clusters and EFLX logic, output activations back to L2 SRAM]

This example does a matrix multiply of a 512-element activation vector from the prior stage by a weight matrix; the result is then passed through the activation function to produce the activation vector for the next stage. Input activations are read from L2 SRAM, and output activations are written back to L2 SRAM.
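
As a purely functional sketch of what one configured stage computes (ReLU is assumed as the activation function, and all shapes and names here are illustrative, not the NMAX programming model):

```python
import numpy as np

# One stage as configured on the tile: multiply the prior stage's
# 512-element activation vector by this stage's weight matrix, then
# apply the activation function (ReLU assumed for illustration).
def nmax_stage(act_in: np.ndarray, weights: np.ndarray) -> np.ndarray:
    pre_act = weights @ act_in        # matrix-vector multiply on the clusters
    return np.maximum(pre_act, 0.0)   # activation (in EFLX logic) -> next stage

act_in = np.random.rand(512).astype(np.float32)      # read from L2 SRAM
weights = np.random.rand(512, 512).astype(np.float32)
act_out = nmax_stage(act_in, weights)                # written back to L2 SRAM
```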

SLIDE 18

NMAX Clusters Systolically Multiply the Activation Vector by the Weights

  • Example of a 4-element input vector multiplied by a 4x4 weight matrix


Source: Hardware for Neural Networks, page 466, https://page.mi.fu-berlin.de/rojas/neural/chapter/K18.pdf
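
A minimal cycle-by-cycle simulation of a linear systolic array in that style (a sketch of the scheduling idea from the cited chapter, not NMAX's actual microarchitecture):

```python
import numpy as np

# Linear systolic array computing y = W @ x, simulated cycle by cycle.
# Cell i holds input x[i] and weight column W[:, i]; the partial sum for
# each output row enters at cell 0 and moves one cell per cycle, picking
# up W[row, i] * x[i] as it passes cell i.
n = 4
W = np.arange(1.0, 17.0).reshape(n, n)   # 4x4 weight matrix
x = np.array([1.0, 2.0, 3.0, 4.0])       # 4-element input vector

pipe = [None] * n   # pipe[i] = (output row, partial sum) in cell i
y = np.zeros(n)
for cycle in range(2 * n):               # pipeline fills and drains in 2n cycles
    if pipe[-1] is not None:             # a finished dot product leaves the array
        row, acc = pipe[-1]
        y[row] = acc
    pipe = [None] + pipe[:-1]            # partial sums shift right one cell
    if cycle < n:
        pipe[0] = (cycle, 0.0)           # output row `cycle` enters the pipeline
    for i, slot in enumerate(pipe):      # every occupied cell does one MAC
        if slot is not None:
            row, acc = slot
            pipe[i] = (row, acc + W[row, i] * x[i])

assert np.allclose(y, W @ x)             # matches the plain matrix-vector product
```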

SLIDE 19

Modular NMAX Arrays: easily scale from 1 to >100 TOPS

Features:
  • NMAX tiles form arrays by abutment
  • ArrayLINX interconnect on all 4 sides of the NMAX tile connects automatically to provide high-bandwidth, array-wide interconnect
  • Shared L2 SRAM:
    ▪ Local, high-capacity SRAMs placed in between NMAX tiles
    ▪ Holds weights for each layer, as well as activations from one layer to the next
    ▪ EFLX place-and-route algorithms minimize interconnect distances between SRAM and NMAX

[Diagram: 2x2 NMAX512 array* — four NMAX tiles with L2 SRAM blocks between them, a DDR interface, and an SoC/PCIe connection. *Architectural diagram, not to scale]

SLIDE 20

NMAX is dataflow: NMAX Compiler maps from Caffe/TensorFlow

  • The mapped NN automatically "unrolls" onto the NMAX hardware
  • Control logic & data operators map to EFLX reconfigurable logic
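
A toy sketch of the kind of per-layer information such a compiler front end consumes (illustrative only; the real NMAX Compiler is not public, and the stage-record format below is invented for this example):

```python
import tensorflow as tf

# Walk a TensorFlow/Keras model layer by layer and emit one "stage"
# record per layer — the unit that would be unrolled onto NMAX tiles
# (weights destined for L2 SRAM, operators for EFLX logic).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(512,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

stages = [
    {
        "name": layer.name,
        "op": type(layer).__name__,   # compute mapped to NMAX clusters
        "weight_shapes": [w.shape for w in layer.get_weights()],  # held in L2 SRAM
    }
    for layer in model.layers
]
for stage in stages:
    print(stage)
```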

[Diagram: a 4-input dataflow graph (inputs i0-i3, neurons n0-n3, diagonal weights w00/w11/w22/w33, load (ld) operators, and data-out (dout)) shown as mapped onto an NMAX tile]