No NDA Required – Public Information
NMAX: Fast, Modular, Low Latency, Low Cost/Power Neural Inferencing
Leader in eFPGA
▪ TSMC IP Alliance Member
▪ eFPGA working silicon: TSMC 40/28/16/12
▪ eFPGA in design for GF14 and TSMC 7/7+
▪ YOLOv3 2MP = 400 Billion MACs/image = 800 Billion Operations/image
▪ YOLOv3 2MP autonomous driving: 30 images/sec = 24 TOPS throughput
▪ Y TOPS Peak = X TOPS Throughput ÷ MAC utilization
▪ MAC utilization will vary based on NN, image size, batch size
▪ Batch=1 is what you need at the edge
▪ Number of MACs required = Y TOPS Peak ÷ Frequency of MAC Completion
▪ NOTE: no shortcuts in the above for model pruning, Winograd, compression
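The sizing arithmetic in these bullets can be sketched in a few lines of Python. This is a minimal sketch, not a vendor tool: the function names are hypothetical, and the 50% MAC utilization and 1 GHz MAC frequency are illustrative assumptions, not figures from the slide. It also assumes one MAC counts as two operations, as the first bullet implies, so the MAC count divides peak operations per second by twice the MAC frequency.

```python
# Sizing sketch for the YOLOv3 2MP workload described above.
MACS_PER_IMAGE = 400e9   # from the slide: 400 billion MACs per image
OPS_PER_MAC = 2          # from the slide: 400 B MACs = 800 B operations
IMAGES_PER_SEC = 30      # from the slide: autonomous driving frame rate


def throughput_tops(macs_per_image, images_per_sec, ops_per_mac=OPS_PER_MAC):
    """X: sustained TOPS needed to process the workload."""
    return macs_per_image * ops_per_mac * images_per_sec / 1e12


def peak_tops(throughput, mac_utilization):
    """Y: peak TOPS needed, given the fraction of MAC cycles doing useful work."""
    return throughput / mac_utilization


def macs_required(peak, mac_freq_hz, ops_per_mac=OPS_PER_MAC):
    """Number of MAC units needed to reach the peak TOPS at a given MAC frequency."""
    return peak * 1e12 / (ops_per_mac * mac_freq_hz)


x = throughput_tops(MACS_PER_IMAGE, IMAGES_PER_SEC)  # 24 TOPS, matching the slide
y = peak_tops(x, mac_utilization=0.5)                # 0.5 is an assumed utilization
n = macs_required(y, mac_freq_hz=1e9)                # 1 GHz is an assumed frequency

print(f"throughput: {x:.1f} TOPS, peak: {y:.1f} TOPS, MACs: {n:.0f}")
```

Halving the utilization doubles the peak TOPS and therefore the MAC count, which is why utilization at batch=1 matters more than the headline peak figure.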
[Figure: labels "Maximum Allowed Latency", "IDEAL", "Existing Solutions"]
[Chart: axis ticks 2000 to 16000; Batch=1, 5, 10, 28; series Nvidia Tesla T4, Habana Goya, NMAX 6x6, NMAX 6x12, NMAX 12x12]
[Chart: Images Per Second; labels NMAX 12x12, NMAX 12x6, NMAX 6x6, "?", EDGE, Habana Goya]
NMAX array size                                12x12      12x6       6x6
SRAM size                                      64 MB      64 MB      32 MB
TOPS peak                                      147        73         37
Throughput (@1GHz)                             124 fps    72 fps     27 fps
Latency                                        8 ms       14 ms      37 ms
(unlabeled)                                    12 GB/s    14 GB/s    10 GB/s
(unlabeled)                                    177 GB/s   103 GB/s   34 GB/s
XFLX & ArrayLINX BW                            18 TB/s    10 TB/s    4 TB/s
MAC efficiency (max usable DRAM BW: 25 GB/s)   67%        78%        58%
TOPS throughput                                98         58         22
T4-class performance
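The figures above are mutually consistent, which a short script can check. This is a sketch under two labeled assumptions: that each NMAX512 tile contributes 512 MACs (inferred from the tile name, not stated here), and that the fps figures refer to the YOLOv3 2MP workload of 800 billion operations per image defined earlier. The dictionary layout and function name are hypothetical.

```python
# Consistency check for the NMAX array figures.
OPS_PER_IMAGE = 800e9   # assumption: the table uses the YOLOv3 2MP workload
MACS_PER_TILE = 512     # assumption: inferred from the name "NMAX512"
OPS_PER_MAC = 2         # one multiply plus one accumulate
FREQ_HZ = 1e9           # from the table: throughput quoted at 1 GHz

# Tile counts follow from the array sizes; efficiencies are from the table.
CONFIGS = {
    "12x12": {"tiles": 12 * 12, "mac_eff": 0.67},
    "12x6": {"tiles": 12 * 6, "mac_eff": 0.78},
    "6x6": {"tiles": 6 * 6, "mac_eff": 0.58},
}


def summarize(tiles, mac_eff):
    """Return (peak TOPS, throughput TOPS, frames per second) for one array."""
    peak = tiles * MACS_PER_TILE * OPS_PER_MAC * FREQ_HZ / 1e12
    throughput = peak * mac_eff
    fps = throughput * 1e12 / OPS_PER_IMAGE
    return peak, throughput, fps


for name, cfg in CONFIGS.items():
    peak, throughput, fps = summarize(cfg["tiles"], cfg["mac_eff"])
    print(f"{name}: peak {peak:.1f} TOPS, "
          f"throughput {throughput:.1f} TOPS, {fps:.1f} fps")
```

Under these assumptions the computed values land within rounding of the quoted peak TOPS, TOPS throughput, and fps for all three array sizes.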
NMAX512 Tile*
▪ Control logic & management
▪ Reconfigurable data flow
▪ Additional signal processing (e.g. ReLU, Sigmoid, Tanh)
to create larger NMAX arrays by abutment
[Diagram: XFLX Interconnect linking NMAX Clusters, L1 SRAM, EFLX Logic and EFLX IO; ArrayLINX™ to adjacent tiles; L2 SRAM via RAMLINX™; DDR, PCIe & SoC connections]
*architectural diagram, not to scale
This example multiplies a 512-element activation vector from the prior stage by a weight matrix, then applies the activation function (the ACTIVATION block) to produce the activation vector for the next stage.
▪ Input activation from L2 SRAM
▪ Output activation to L2 SRAM
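The arithmetic of that stage can be sketched in plain Python. This illustrates only the math (weight matrix times activation vector, then an elementwise activation), not how the NMAX hardware schedules it. The function names are hypothetical, and the toy example uses a 4-element vector in place of the 512-element one.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# The activation functions named on the tile slide.
ACTIVATIONS = {"relu": relu, "sigmoid": sigmoid, "tanh": math.tanh}


def layer(weights, activations, act=relu):
    """Compute act(W @ a): one output per weight row.

    weights: list of rows, each as long as the input activation vector.
    activations: activation vector from the prior stage.
    """
    for row in weights:
        if len(row) != len(activations):
            raise ValueError("weight row length must match activation vector")
    return [act(sum(w * a for w, a in zip(row, activations))) for row in weights]


# Toy stage: 4 inputs, 2 outputs (a real stage here would use 512 inputs).
weights = [
    [1.0, 0.0, -1.0, 1.0],   # 1 - 3 + 4 = 2
    [-2.0, 0.5, 0.0, 0.0],   # -2 + 1 = -1, clipped to 0 by ReLU
]
prior_stage = [1.0, 2.0, 3.0, 4.0]

next_stage = layer(weights, prior_stage, act=ACTIVATIONS["relu"])
print(next_stage)  # [2.0, 0.0]
```

Each output element is one chain of multiply-accumulates over the input vector, which is why the MAC count per image dominates the sizing arithmetic.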
Source: Hardware for Neural Networks, page 466, https://page.mi.fu-berlin.de/rojas/neural/chapter/K18.pdf
2x2 NMAX512 Array*
[Diagram: four NMAX tiles and four L2 SRAM blocks, with a DDR IF and a SoC / PCIe connection]
NMAX tiles automatically connect to provide a high-bandwidth, array-wide interconnect
▪ Local, high-capacity SRAMs placed in between NMAX tiles
▪ Hold weights for each layer, as well as activations from one layer to the next
▪ EFLX place-and-route algorithms minimize interconnect distances between SRAM and NMAX
*architectural diagram, not to scale
[Diagram: NMAX TILE. Labels: i0 to i3, n0 to n3, w00, w11, w22, w33, iin, i'0 to i'3, ld, n°0 to n°3, dout]