Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das
M-Bits Research Group
1
2
CPU GPU
3
GPU
CPU
4
Figure: 18-core Xeon processor with a 45 MB LLC, organized as 18 LLC slices.
5
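Figure: 18-core Xeon processor with a 45 MB LLC. Each 2.5 MB LLC slice has 20 ways (Way 1 to Way 20) plus CBOX and TMU blocks; the ways are made of 32 kB data banks, which are in turn made of 8 kB arrays.
Totals: 18 LLC slices, 360 ways.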
6
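Figure: the same hierarchy (45 MB LLC, 2.5 MB LLC slice with Way 1 to Way 20, CBOX, TMU, 32 kB data bank, 8 kB array), now zoomed into one 8 kB SRAM array. A row decoder drives the wordlines (WL), each column has a bitline pair (BL/BLB), and rows and columns are numbered up to 255.
Totals: 18 LLC slices, 360 ways, 5760 arrays.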
7
Figure: the same hierarchy down to the 8 kB SRAM array. Operands Array A and Array B are stored as Bit-Slice 0 to Bit-Slice 3, row decoders drive the wordlines (WL), and BL/BLB logic computes A + B.
Totals: 18 LLC slices, 360 ways, 5760 arrays.
8
Figure: the same hierarchy and 8 kB SRAM array (operands Array A and Array B stored as Bit-Slice 0 to Bit-Slice 3, row decoders, BL/BLB logic computing A + B), now with a bitline ALU on each bitline pair.
Bitline ALU: two sense amplifiers (SA) compare BL and BLB against Vref, giving A&B and ~A & ~B, from which A^B is formed. A latch (D, EN, Q, enabled by C_EN) holds the carry C, with Cin and Cout connections. The sum is S = A^B^C, and DR drives the bitlines.
Totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.
9
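The bitline ALU labels on the slide above (A&B, ~A & ~B, A^B, S = A^B^C, Cin, Cout) can be read as a one-bit full adder built from the two sensed values. The following is a minimal sketch of that reading, not the circuit itself. It assumes the two sense amplifiers deliver AND and NOR of the two activated cells; the function name is hypothetical.

```python
from typing import Tuple


def bitline_full_adder(a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """One bitline's add step; returns (sum_bit, carry_out)."""
    and_ab = a & b                  # assumed sensed value: A & B
    nor_ab = (1 - a) & (1 - b)      # assumed sensed value: ~A & ~B
    xor_ab = 1 - (and_ab | nor_ab)  # A ^ B: neither both ones nor both zeros
    sum_bit = xor_ab ^ carry_in     # S = A ^ B ^ C, as on the slide
    carry_out = and_ab | (xor_ab & carry_in)
    return sum_bit, carry_out


# Exhaustive check against ordinary integer addition.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cout = bitline_full_adder(a, b, c)
            assert 2 * cout + s == a + b + c
```

The carry-out expression is the standard full-adder form; the slide only shows the Cout label, so the exact gate arrangement is an assumption.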
Figure: the same hierarchy, 8 kB SRAM array, and bitline ALU (S = A^B^C), with the in-cache operations listed next to it: Add, Multiply, Divide.
Totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.
10
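The totals quoted on the preceding slides (360 ways, 5760 arrays, 1,474,560 ALUs) follow from the sizes in the figure. The sketch below redoes that arithmetic. It assumes each way is an equal share of its slice, that ways divide evenly into 32 kB banks and 8 kB arrays, and that each array contributes one ALU per bitline (256 bitlines per array, as stated later in the deck). The function name is hypothetical.

```python
def llc_geometry(slices: int = 18, ways_per_slice: int = 20) -> dict:
    """Derive way, array, and bitline-ALU counts from the slide's sizes."""
    slice_kb = 2560          # 2.5 MB LLC slice
    bank_kb = 32             # 32 kB data bank
    array_kb = 8             # 8 kB SRAM array
    bitlines_per_array = 256

    ways = slices * ways_per_slice
    way_kb = slice_kb // 20                         # a slice always has 20 ways
    arrays_per_way = (way_kb // bank_kb) * (bank_kb // array_kb)
    arrays = ways * arrays_per_way
    alus = arrays * bitlines_per_array              # one ALU per bitline
    return {"ways": ways, "arrays": arrays, "alus": alus}


full_cache = llc_geometry()
print(full_cache)  # {'ways': 360, 'arrays': 5760, 'alus': 1474560}
```

Calling `llc_geometry(slices=14, ways_per_slice=18)` gives 1,032,192 ALUs, which matches the Neural Cache figure in the evaluation table later in the deck.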
Figure: SRAM array with row decoders and BL/BLB logic; rows and columns are numbered up to 255.
11
Figure: bit-parallel layout. Word 0 to Word 3 of Array A and of Array B are laid out horizontally across the bitlines, and the BL/BLB logic computes A + B.
12
Figure: same bit-parallel layout (Word 0 to Word 3 of Array A and Array B). Wordlines WL1 and WL2 are activated together and the BL/BLB logic produces a sum bit (S) of A + B.
13
Figure: same bit-parallel layout with WL1 and WL2 active. Over the next builds, more sum bits (S) appear while the carry (C) moves from one bitline to the next.
Carry propagation across bitlines
14
15
16
Figure: SRAM array with row decoders and BL/BLB logic; rows and columns are numbered up to 255.
17
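Figure: bit-serial layout with transposed data. Word 0 to Word 3 of Array A and Array B each occupy their own bitline, every bitline has its own Sum and Carry logic, and the carries start at 0 0 0 0. A + B yields one sum bit (S) per word.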
18
Figure: bit-serial addition on transposed data. Wordlines WL1 and WL2 select one bit-slice of Array A and one of Array B (Bit-Slice 0 to Bit-Slice 3), and each bitline (Word 0 to Word 3) computes its own Sum (S) and Carry for A + B.
Cycle 1: carries are 0 0 0 0
19
Cycle 2: carries are C C C C
20
Cycle 3: carries are C C C C
21
Cycle 4: carries are C C C C
22
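The four-cycle walkthrough above can be mimicked in software. The sketch below adds several words at once, one bit-slice per loop iteration, keeping one carry per bitline. It is an illustration under assumptions: bit extraction with shifts stands in for reading a transposed bit-slice, and the extra final step that stores the carry is an assumed reading of the N+1 cycle count given for ADD later in the deck. The function name is hypothetical.

```python
from typing import List, Tuple


def bit_serial_add(a_words: List[int], b_words: List[int],
                   n_bits: int) -> Tuple[List[int], int]:
    """Add word pairs bit-serially; returns (sums, cycles)."""
    columns = len(a_words)
    carry = [0] * columns          # one carry latch per bitline
    sums = [0] * columns
    cycles = 0
    for bit in range(n_bits):      # one bit-slice per cycle, LSB first
        for col in range(columns):  # all bitlines work in parallel in hardware
            a = (a_words[col] >> bit) & 1
            b = (b_words[col] >> bit) & 1
            c = carry[col]
            sums[col] |= (a ^ b ^ c) << bit
            carry[col] = (a & b) | ((a ^ b) & c)
        cycles += 1
    for col in range(columns):     # assumed extra step: store the final carry
        sums[col] |= carry[col] << n_bits
    cycles += 1
    return sums, cycles


print(bit_serial_add([3, 5, 15, 0], [1, 10, 15, 0], n_bits=4))
# ([4, 15, 30, 0], 5)
```

The cycle count does not depend on how many words are added, which is the point of the transposed layout: more bitlines means more additions in the same number of cycles.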
23
Figure: recap of the hierarchy (18-core Xeon processor, 45 MB LLC, 2.5 MB LLC slice with Way 1 to Way 20, CBOX, TMU, 32 kB data bank, 8 kB SRAM array holding Array A and Array B as Bit-Slice 0 to Bit-Slice 3) and the bitline ALU (S = A^B^C).
Totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.
Figure: standard SRAM array with wordlines, bitlines (BL0/BLB0 to BLn/BLBn), a row decoder, and differential sense amplifiers (SA). Labels: Row Decoder, Row Decoder-O.
Changes for compute SRAM:
Single-ended sense amplifiers (each SA referenced to Vref)
Additional row decoder
Reconfigurable sense amplifiers
24
Figure: compute SRAM with two row decoders, bitlines BL0/BLB0 to BLn/BLBn, and single-ended sense amplifiers (SA, referenced to Vref). Example bits shown for rows B and A: 0 1 0 1 1 0 0 1.
25
26
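Figure: in-SRAM addition across 256 bitlines. Row Decoder A and Row Decoder B each select an operand row, single-ended sense amplifiers (SA, referenced to Vref) feed the per-bitline Sum and Carry logic, and the result goes to row P.
Bitline ALU: latch (D, EN, Q, enabled by C_EN) holding the carry C; sensed values A&B and ~A & ~B; A^B; Cin and Cout; S = A^B^C; DR driving BL and BLB.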
27
Figure (builds over three slides): with rows selected by Row Decoder A and Row Decoder B, the per-bitline Sum and Carry values fill in step by step and the result is written to row P.
28
29
30
Figure: in-SRAM multiplication. The array from the addition example (two row decoders, bitlines BL0/BLB0 to BLn/BLBn, single-ended SAs referenced to Vref, per-bitline Sum and Carry) gains a Tag bit.
31
Figure: the Tag is set; product rows are P0, P1, P2.
32
P0 <- A0B0
33
P0 <- A0B0
P1 <- A1B0
34
P0 <- A0B0
P1 <- A1B0
35
P0 <- A0B0
P1 <- A1B0 + A0B1
36
P1 <- P1 + A0B1, carried out as a predicated add: if (B1), P1 <- P1 + A0; else, P1 <- P1
P0 <- A0B0
P1 <- A1B0 + A0B1
P2 <- A1B1
37
P0 <- A0B0
P1 <- A1B0 + A0B1
P2 <- A1B1
P3 <- Cin
38
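The multiplication walkthrough above (P0 <- A0B0, P1 <- A1B0 + A0B1, and the rule "if (B1), P1 <- P1 + A0; else, P1 <- P1") amounts to shift-and-add with a per-bitline predicate. The sketch below illustrates that idea for several word pairs at once. It is an assumption-laden illustration: the `tag` list plays the role of the Tag bit, the product is kept as a list of bit-slices P0, P1, ..., and no attempt is made to model the hardware's cycle count. All names are hypothetical.

```python
from typing import List


def bit_serial_multiply(a_words: List[int], b_words: List[int],
                        n_bits: int) -> List[int]:
    """Multiply word pairs with tag-predicated bit-serial additions."""
    columns = len(a_words)
    # product[k][col] is product bit-slice Pk for the word on bitline col.
    product = [[0] * columns for _ in range(2 * n_bits)]
    for j in range(n_bits):
        # Load the tag with multiplier bit Bj for every bitline.
        tag = [(b >> j) & 1 for b in b_words]
        carry = [0] * columns
        for i in range(n_bits):
            for col in range(columns):
                if not tag[col]:
                    continue                     # predicate off: P unchanged
                a = (a_words[col] >> i) & 1
                p = product[i + j][col]
                c = carry[col]
                product[i + j][col] = a ^ p ^ c  # P(i+j) <- P(i+j) + Ai
                carry[col] = (a & p) | ((a ^ p) & c)
        for col in range(columns):
            if tag[col]:
                product[j + n_bits][col] = carry[col]  # store the carry-out
    return [sum(product[k][col] << k for k in range(2 * n_bits))
            for col in range(columns)]


print(bit_serial_multiply([3, 7, 15, 0], [5, 7, 15, 9], n_bits=4))
# [15, 49, 225, 0]
```

Writing the carry straight into bit-slice j + n_bits is safe because, before step j, the accumulated product is smaller than 2^(n_bits + j), so that slice is still zero.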
Operation     Cycles
ADD           N + 1
SUB           2N + 1
MUL           N^2 + 5N - 2
DIV           1.5N^2 + 5.5N
Comparison    2N + 1
39
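To get a feel for the cycle table above, the formulas can be evaluated for a concrete operand width. The sketch below assumes N is the operand width in bits (the slide does not define N) and that the exponent in the MUL and DIV rows is a square. The function and key names are hypothetical.

```python
def op_cycles(op: str, n_bits: int) -> int:
    """Cycle count for one bit-serial operation, per the slide's formulas."""
    n = n_bits
    formulas = {
        "ADD": n + 1,
        "SUB": 2 * n + 1,
        "MUL": n * n + 5 * n - 2,
        # 1.5N^2 + 5.5N written as (3N^2 + 11N) / 2, which is always whole.
        "DIV": (3 * n * n + 11 * n) // 2,
        "CMP": 2 * n + 1,
    }
    return formulas[op]


for op in ("ADD", "SUB", "MUL", "DIV", "CMP"):
    print(op, op_cycles(op, 8))
# ADD 9, SUB 17, MUL 102, DIV 140, CMP 17
```

Addition and comparison grow linearly with the operand width, while multiplication and division grow quadratically, so narrower operands pay off most for the latter two.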
40
41
Figure: transpose memory unit (TMU) in an LLC slice (Way 1 to Way 20, CBOX, TMU). Elements A0, A1, A2, ... and B0, B1, B2, ... are stored bit by bit from MSB to LSB. The array has a row decoder, a column decoder, control logic, and sense amplifiers (SA) and DR blocks on two sides, which gives two access modes:
Regular read/write
Transpose read/write
42
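The difference between regular and transposed access on the slide above can be shown with a small software model. The sketch below converts a list of words into bit-slice rows and back. It assumes rows are ordered MSB first, following the A0[MSB] ... A0[LSB] labels; the actual orientation in the TMU is not clear from the slide. Function names are hypothetical.

```python
from typing import List


def to_transposed(words: List[int], n_bits: int) -> List[List[int]]:
    """rows[r][c] is one bit of word c; row 0 holds the MSBs."""
    return [[(w >> (n_bits - 1 - r)) & 1 for w in words]
            for r in range(n_bits)]


def from_transposed(rows: List[List[int]]) -> List[int]:
    """Inverse of to_transposed: rebuild each word from its column."""
    n_bits = len(rows)
    columns = len(rows[0])
    return [sum(rows[r][c] << (n_bits - 1 - r) for r in range(n_bits))
            for c in range(columns)]


rows = to_transposed([0b101, 0b011], n_bits=3)
print(rows)                   # [[1, 0], [0, 1], [1, 1]]
print(from_transposed(rows))  # [5, 3]
```

In the transposed form, reading one row returns the same bit position of every word, which is the access pattern bit-serial arithmetic needs.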
43
Convolution layer dimensions:
Input activations (C channels): C x H x W
3D filters (M): each filter has C channels, each channel has RxS weights
Output activations (M channels): M x E x F
44
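Figure: computing one output activation.
Filter weights: M filters, each C x R x S. Input activations: C x H x W. Output activations: M x E x F.
Unroll: the RxS weights of each filter channel and the matching RxS input window are unrolled, one pair per channel (C pairs).
MAC: each channel produces a partial sum (C partial sums).
Reduction: the C partial sums are reduced to 1 output activation.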
45
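The unroll, MAC, and reduction steps on the slide above can be written out for a single output activation. The sketch below keeps one partial sum per channel and then reduces them pairwise. The pairwise (tree) order is an assumption made for illustration; the slide only labels the step "Reduction". Names are hypothetical, and inputs are plain nested lists indexed [C][R][S].

```python
from typing import List

Tensor3 = List[List[List[int]]]


def output_activation(window: Tensor3, weights: Tensor3) -> int:
    """One output value: per-channel MAC, then reduction over channels."""
    # MAC: each channel multiplies its unrolled RxS inputs and weights.
    partial_sums = []
    for channel_in, channel_w in zip(window, weights):
        acc = 0
        for row_in, row_w in zip(channel_in, channel_w):
            for x, w in zip(row_in, row_w):
                acc += x * w
        partial_sums.append(acc)
    # Reduction: add partial sums pairwise until one value is left.
    while len(partial_sums) > 1:
        if len(partial_sums) % 2:
            partial_sums.append(0)
        partial_sums = [partial_sums[i] + partial_sums[i + 1]
                        for i in range(0, len(partial_sums), 2)]
    return partial_sums[0]


window = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]    # C = 2, R = S = 2
weights = [[[1, 1], [1, 1]], [[2, 2], [2, 2]]]
print(output_activation(window, weights))  # 14
```

Here channel 0 contributes 10 and channel 1 contributes 4, and the reduction adds them to 14.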
Figure: data layout in one 8 kB SRAM array (256 wordlines x 256 bitlines). Along the wordlines there are regions for input activations (RxSx8), weights (RxSx8), partial sum (4x8), and output (4x8). Across the bitlines, each bitline holds one channel of Filter 1 (C = 256): channel 1, channel 2, channel 3, channel 4, ..., channel 256.
Figure: mapping onto a 2.5 MB LLC slice. Way 1, Way 2, Way 3, ..., Way 20 are each divided into Quad 1 to Quad 4, with M = 32, and arrays are assigned to Output Position 1, Output Position 2, ...
Output activations: M x E x F
46
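The region sizes on the layout slide above suggest a simple capacity check for one bitline. The sketch below reads RxSx8 as R*S elements of 8 bits stored bit-serially (so R*S*8 wordlines) and 4x8 as 32 wordlines. Both readings are assumptions, as are the function and parameter names.

```python
def wordlines_needed(r: int, s: int, bits: int = 8,
                     partial_sum_rows: int = 4 * 8,
                     output_rows: int = 4 * 8) -> int:
    """Wordlines used on one bitline under the assumed layout."""
    inputs = r * s * bits    # input activation region (RxSx8)
    weights = r * s * bits   # weights region (RxSx8)
    return inputs + weights + partial_sum_rows + output_rows


def fits_in_array(r: int, s: int, wordlines: int = 256) -> bool:
    """True if one RxS filter channel fits in a 256-wordline array."""
    return wordlines_needed(r, s) <= wordlines


print(wordlines_needed(3, 3), fits_in_array(3, 3))  # 208 True
print(wordlines_needed(5, 5), fits_in_array(5, 5))  # 464 False
```

Under these assumptions a 3x3 filter channel fits on one bitline with room to spare, while a 5x5 channel would need to be split; the deck itself does not say how larger filters are handled.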
Figure: the M x E x F output activations are spread over Slice 1 to Slice 14; within each slice the ways are grouped as Way 1-18 and Way 19-20.
47
Figure: system view. Core 1 to Core 14 and LLC Slice 1 to LLC Slice 14 sit on a ring interconnect connected to DRAM. Data shown: filter weights, input activations, output activations. Within each 2.5 MB LLC slice, Way 1, Way 2, Way 3, ... are divided into Quad 1 to Quad 4, and Way 19 is reserved.
48
49
50
CPU (2 sockets)
  Processor: Intel Xeon E5-2597 v3, 2.6 GHz, 28 cores, 56 threads
  On-chip memory: 78.96 MB
  Off-chip memory: 64 GB DRAM
  Profiler / Simulator (Performance): TensorFlow tfprof
  Profiler / Simulator (Energy): Intel RAPL Interface
GPU (1 card)
  Processor: Nvidia Titan Xp, 1.6 GHz, 3840 CUDA cores
  On-chip memory: 9.14 MB
  Off-chip memory: 12 GB DRAM
  Profiler / Simulator (Performance): TensorFlow tfprof
  Profiler / Simulator (Energy): NVIDIA System Management Interface
Neural Cache
  Processor: 2.5 GHz Compute SRAM, 1,032,192 bit-serial ALUs
  On-chip memory: 70 MB (Dual Socket)
  Off-chip memory: 64 GB DRAM
  Profiler / Simulator (Performance): Cycle accurate simulator + C Microbench
  Profiler / Simulator (Energy): SPICE simulation + Intel RAPL Interface
51
Figure: throughput (inferences / sec) versus batch size (1, 4, 16, 64, 256) for CPU - Xeon E5, GPU - Titan Xp, and Neural Cache.
2.2x improved throughput over GPU
Figure: latency for CPU, GPU, and Neural Cache.
7.7x latency improvement over GPU
52
53
Figure: total energy (Joules) and average power (Watts) for CPU, GPU, and Neural Cache.
54
55