

SLIDE 1

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das

M-Bits Research Group

SLIDE 2

Can we transform the CPU into a neural accelerator?

[Figure: CPU (with its on-chip cache) and GPU]

SLIDE 3

Can we transform the CPU into a neural accelerator?

Neural Cache:
  • More parallelism
  • Less data movement
SLIDE 4

Transforming caches into massively parallel vector ALUs

18-core Xeon processor, 45 MB LLC
18 LLC slices

SLIDE 5

Transforming caches into massively parallel vector ALUs

18-core Xeon processor, 45 MB LLC
Each 2.5 MB LLC slice (CBOX, TMU) has 20 ways (Way 1 ... Way 20); each way contains 32 kB data banks built from 8 kB arrays
18 LLC slices, 360 ways

SLIDE 6

Transforming caches into massively parallel vector ALUs

18-core Xeon processor, 45 MB LLC; 2.5 MB slices (CBOX, TMU) with 20 ways of 32 kB data banks
Each 8 kB SRAM array: a row decoder drives the wordlines (WL) across 256 bitline pairs (BL/BLB)
18 LLC slices, 360 ways, 5,760 arrays

SLIDE 7

Transforming caches into massively parallel vector ALUs

In each 8 kB SRAM array, operands are stored as bit-slices: Array A and Array B each hold Bit-Slice 3 ... Bit-Slice 0 of their words on the same bitlines. Activating one row of A and one row of B lets the BL/BLB logic compute A + B directly on the bitlines.
18 LLC slices, 360 ways, 5,760 arrays

SLIDE 8

Transforming caches into massively parallel vector ALUs

Each bitline gets a small ALU: two single-ended sense amplifiers (on BL and BLB, referenced to Vref) produce A & B and ~A & ~B, a carry latch (D/EN/Q) holds Cin/Cout, and the logic forms the sum S = A ^ B ^ C.
18 LLC slices, 360 ways, 5,760 arrays, 1,474,560 bitline ALUs

SLIDE 9

Transforming caches into massively parallel vector ALUs

Passive last-level cache transformed into ~1 million bit-serial active ALUs ✓
  • Add, Multiply, Divide
  • Bit-serial operation @ 2.5 GHz
  • Configurable precision

18 LLC slices, 360 ways, 5,760 arrays, 1,474,560 ALUs

SLIDE 10

Why bit-serial?

Bit-parallel arithmetic (the conventional layout): row decoders select operands and BL/BLB logic computes A + B.

SLIDE 11

Why bit-serial?

Bit-parallel arithmetic: each word (Word 0 ... Word 3) of Array A and Array B is laid out horizontally, one bit per bitline.

SLIDE 12

Why bit-serial?

Bit-parallel arithmetic: wordlines WL1 and WL2 are activated and the bitline logic produces sum bits (S).

SLIDE 13

Why bit-serial?

Bit-parallel arithmetic: the carry (C) must propagate across bitlines, from each bit position to the next.

SLIDE 14

Why bit-serial?

Bit-parallel arithmetic: the carry chain keeps rippling across the bitlines of every word.

SLIDE 15

Why bit-serial?

Bit-parallel arithmetic requires carry propagation across bitlines:
  • High complexity
  • Loss of throughput and efficiency

SLIDE 16

Why bit-serial?

Bit-serial arithmetic

SLIDE 17

Why bit-serial?

Bit-serial arithmetic: data is stored transposed, so each word (Word 0 ... Word 3) runs vertically along its own bitline; a Sum row and a Carry row are reserved, with the carries initialized to 0.

SLIDE 18

Why bit-serial?

Cycle 1: Bit-Slice 0 of A and B is read on every bitline in parallel (wordlines WL1, WL2); each bitline writes its sum bit and latches its carry.

SLIDE 19

Why bit-serial?

Cycle 2: Bit-Slice 1 is processed together with the carries (C) latched in cycle 1.

SLIDE 20

Why bit-serial?

Cycle 3: Bit-Slice 2 is processed, again reusing the per-bitline carries.

SLIDE 21

Why bit-serial?

Cycle 4: Bit-Slice 3, the last slice, is processed.

  ✓ Low area complexity
  ✓ High throughput
  ✓ Configurable & high precision
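
To make the cycle-by-cycle behavior concrete, here is a minimal software sketch of bit-serial addition over transposed data (plain Python; `bit_serial_add`, `n_bits`, and the word lists are illustrative names, not Neural Cache code). The key property the slides emphasize: the cycle count depends only on operand width, not on how many words are added in parallel.

```python
def bit_serial_add(a_words, b_words, n_bits):
    """Add two equal-length lists of unsigned integers bit-serially."""
    num_words = len(a_words)
    sums = [0] * num_words      # results, built up one bit-slice at a time
    carry = [0] * num_words     # one latched carry per bitline (per word)

    for i in range(n_bits):             # cycle i handles bit-slice i of every word
        for w in range(num_words):      # the hardware does this column-parallel
            a = (a_words[w] >> i) & 1
            b = (b_words[w] >> i) & 1
            sums[w] |= (a ^ b ^ carry[w]) << i            # sum bit
            carry[w] = (a & b) | (carry[w] & (a ^ b))     # carry out
    for w in range(num_words):          # one extra cycle stores the final carry,
        sums[w] |= carry[w] << n_bits   # hence ADD takes N + 1 cycles
    return sums

print(bit_serial_add([3, 7, 15], [5, 9, 1], 4))   # -> [8, 16, 16]
```

Doubling the number of words leaves the cycle count unchanged, which is where the throughput of roughly a million concurrently operating bitline ALUs comes from.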

SLIDE 22

Outline

  • Motivation
  • Bit-Serial Arithmetic
  • Transpose
  • Mapping of Convolution to Array
  • Methodology
  • Results


SLIDE 23

In-SRAM Arithmetic

Recap: the 18-core Xeon's 45 MB LLC provides 18 slices, 360 ways, and 5,760 8 kB SRAM arrays, for 1,474,560 bitline ALUs; operands are stored as transposed bit-slices and summed by the bitline logic (S = A ^ B ^ C).

SLIDE 24

Logical Operations In-SRAM

Baseline array: differential sense amplifiers on the bitlines (BL0/BLB0 ... BLn/BLBn), wordlines, and a single row decoder.

Changes:
  • Single-ended sense amplifiers (referenced to Vref)
  • Additional row decoder
  • Reconfigurable sense amplifiers

SLIDE 25

Logical Operations In-SRAM

Both row decoders fire at once, activating the wordlines of operands A and B; the single-ended sense amplifier on BL then reads out A AND B (example bit patterns shown on the bitlines).

SLIDE 26

Logical Operations In-SRAM

With the same two wordlines active, the sense amplifier on BLB simultaneously reads out A NOR B, so AND and NOR are produced in a single array access.
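
A small behavioral model of what the last two slides describe, assuming idealized precharged bitlines (this sketch abstracts away the analog sensing): activating the wordlines of A and B together makes BL behave as a wired-AND and BLB as a wired-NOR.

```python
def sense_two_rows(a, b):
    """Model of one bitline pair with cells A and B both activated."""
    bl  = 1 if (a == 1 and b == 1) else 0   # BL stays high only if A = B = 1 -> A AND B
    blb = 1 if (a == 0 and b == 0) else 0   # BLB stays high only if A = B = 0 -> A NOR B
    return bl, blb

for a in (0, 1):
    for b in (0, 1):
        and_out, nor_out = sense_two_rows(a, b)
        print(f"A={a} B={b}  AND={and_out}  NOR={nor_out}")
```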

SLIDE 27

Addition In-SRAM

Operands A (bits A0, A1) and B (bits B0, B1) are stored transposed, one word per bitline, with a result region P below them; two row decoders (Row Decoder A and Row Decoder B) select one bit of each operand per cycle. Each of the 256 bitlines has an ALU with a carry latch that computes S = A ^ B ^ C from the sense-amp outputs (A & B on BL, ~A & ~B on BLB).
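
The ALU on this slide exposes only two sense-amplifier outputs per bitline, AND and NOR; everything else is derived from them. Below is a minimal behavioral sketch (plain Python, written for this summary rather than taken from the paper) showing that those two signals plus a latched carry are enough to form a full adder.

```python
def bitline_alu(a, b, c_in):
    and_ab = a & b                    # sense amplifier on BL
    nor_ab = (1 - a) & (1 - b)        # sense amplifier on BLB
    xor_ab = 1 - (and_ab | nor_ab)    # A ^ B, derived from the two SA outputs
    s = xor_ab ^ c_in                 # S = A ^ B ^ C
    c_out = and_ab | (c_in & xor_ab)  # carry stored in the D/EN/Q latch
    return s, c_out

# quick truth-table check against a reference adder
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, c_out = bitline_alu(a, b, c)
            assert (c_out << 1) | s == a + b + c
print("bitline ALU matches a full adder for all 8 input combinations")
```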

SLIDE 28

Addition [Cycle 1]

Cycle 1: the wordlines of A0 and B0 are activated; each bitline ALU computes the bit-0 sum, writes it back, and latches the carry.

SLIDE 29

Addition [Cycle 2]

Cycle 2: the wordlines of A1 and B1 are activated; each ALU adds bit 1 together with the latched carry and writes the sum bit.

SLIDE 30

Addition [Cycle 3]

Cycle 3: the final carry is written out, completing the 2-bit addition in N + 1 = 3 cycles.

SLIDE 31

Multiplication In-SRAM

Operands A (A0, A1) and B (B0, B1) are stored transposed with a product region P0-P3; a per-bitline Tag bit serves as a predicate for conditional accumulation.

SLIDE 32

Multiplication [Cycle 1]

2-bit example: A1 A0 × B1 B0 forms partial products A0B0, A1B0, A0B1, A1B1 accumulated into P0-P2 (with a carry into P3). Cycle 1 loads the Tag from multiplier bit B0.

SLIDE 33

Multiplication [Cycle 2]

Cycle 2: predicated copy where the Tag (B0) is set: P0 <- A0B0.

SLIDE 34

Multiplication [Cycle 3]

Cycle 3: P1 <- A1B0, again predicated on the Tag.

SLIDE 35

Multiplication [Cycle 4]

Cycle 4: the Tag is updated for the next multiplier bit; the partial results P0 <- A0B0 and P1 <- A1B0 are unchanged.

SLIDE 36

Multiplication [Cycle 5]

Cycle 5: predicated accumulate, P1 <- A1B0 + A0B1. Concretely: if B1 is set, P1 <- P1 + A0; else P1 is left unchanged.

SLIDE 37

Multiplication [Cycle 6]

Cycle 6: P2 <- A1B1 (plus the carry from P1).

SLIDE 38

Multiplication [Cycle 7]

Cycle 7: P3 <- Cin, the final carry; the full product now occupies P0-P3.
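
The seven cycles above implement predicated shift-and-add. A minimal software sketch follows (plain Python; the explicit `tags` list stands in for the per-bitline Tag register and is an illustrative construction, not the paper's microcode): for each multiplier bit, the Tag is loaded with that bit and the shifted multiplicand is accumulated only where the Tag is set.

```python
def bit_serial_multiply(a_words, b_words, n_bits):
    """Multiply two equal-length lists of unsigned integers bit-serially."""
    num_words = len(a_words)
    products = [0] * num_words

    for i in range(n_bits):                      # one multiplier bit per pass
        tags = [(b >> i) & 1 for b in b_words]   # Tag <- bit i of B (per bitline)
        for w in range(num_words):               # column-parallel in hardware
            if tags[w]:                          # predicated accumulate:
                products[w] += a_words[w] << i   #   if Bi: P <- P + (A << i)
    return products                              #   else:  P unchanged

print(bit_serial_multiply([3, 5, 7], [2, 3, 15], 4))   # -> [6, 15, 105]
```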

SLIDE 39

Supported Arithmetic

Operation     Cycles (N-bit operands)
ADD           N + 1
SUB           2N + 1
MUL           N² + 5N - 2
DIV           1.5N² + 5.5N
Comparison    2N + 1

Area overhead: 7.5% for the synthesized array, 2% for the processor chip.
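
For the 8-bit operands Neural Cache uses, the cycle formulas in the table above evaluate as follows (a quick sanity check in plain Python, not additional data from the paper):

```python
# Cycle counts from the table above, evaluated for N = 8. This only restates
# the formulas; the counts themselves come from the slide.
N = 8
cycles = {
    "ADD":        N + 1,                       # 9
    "SUB":        2 * N + 1,                   # 17
    "MUL":        N * N + 5 * N - 2,           # 102
    "DIV":        int(1.5 * N * N + 5.5 * N),  # 140
    "Comparison": 2 * N + 1,                   # 17
}
for op, c in cycles.items():
    print(f"{op:10s} {c:4d} cycles")
```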

SLIDE 40

Outline

  • Motivation
  • Bit-Serial Arithmetic
  • Transpose
  • Mapping of Convolution to Array
  • Methodology
  • Results


SLIDE 41

Transpose

The TMU next to the CBOX is built from 8-T transpose bit-cells with both a row decoder and a column decoder, so data can be accessed in two orientations: a regular read/write returns a word per row, while a transpose read/write returns a bit-slice per row (A0[MSB] ... A2[MSB] down to A0[LSB] ... A2[LSB], and likewise for B).

SLIDE 42

Transpose

Words enter the TMU in regular layout (A2 A1 A0, B2 B1 B0, C2 C1 C0) and leave in transposed layout, each word's bits lined up along a bitline (A0 A1 A2 | B0 B1 B2 | C0 C1 C2).
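
Functionally, the transpose converts a word-per-row layout into a bit-slice-per-row layout. A minimal sketch, assuming plain Python lists as a stand-in for the 8-T transpose bit-cells and the TMU (`to_transposed` and `from_transposed` are illustrative helpers):

```python
def to_transposed(words, n_bits):
    """Return bit-slices: slices[k][w] = bit k of word w (slice 0 is the LSB)."""
    return [[(w >> k) & 1 for w in words] for k in range(n_bits)]

def from_transposed(slices):
    """Inverse: rebuild the words from their bit-slices."""
    n_bits, num_words = len(slices), len(slices[0])
    return [sum(slices[k][w] << k for k in range(n_bits)) for w in range(num_words)]

words = [0xA0, 0x37, 0x5C]            # A0, A1, A2 in the regular layout
slices = to_transposed(words, 8)      # what a transpose read would produce
assert from_transposed(slices) == words
print(slices[7])                      # MSB slice of A0, A1, A2 -> [1, 0, 0]
```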

SLIDE 43

Outline

  • Motivation
  • Transpose
  • Bit-Serial Arithmetic
  • Mapping of Convolution to Array
  • Methodology
  • Results


SLIDE 44

A Convolutional Layer

M 3D filters, each with C channels of R × S weights, are convolved with the input activations (C channels, H × W) to produce the output activations (M channels, E × F).

SLIDE 45

Mapping CNN to Neural Cache

The filter weights (C channels, R × S each) and the matching input-activation window are unrolled channel by channel. Each channel is assigned its own bitline of an 8 kB SRAM array: along that bitline's 256 wordlines sit R × S × 8 bits of input activations, R × S × 8 bits of weights, and a 4 × 8-bit partial-sum/output field. A bit-serial MAC runs on every bitline in parallel, and a reduction across the C bitlines produces one output activation.
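
A rough functional sketch of this mapping (plain Python with illustrative dimensions R, S, C; this models the arithmetic, not the bit-serial timing): each channel's MAC runs independently, mirroring the per-bitline partial sums, and the final reduction across channels produces one output activation.

```python
import random

R, S, C = 3, 3, 256   # filter height/width and channel count (illustrative)
weights = [[random.randrange(256) for _ in range(R * S)] for _ in range(C)]  # one filter
acts    = [[random.randrange(256) for _ in range(R * S)] for _ in range(C)]  # one input window

# Step 1: per-channel MAC (in hardware, all C bitlines do this in parallel).
partial_sums = [sum(w * a for w, a in zip(weights[c], acts[c])) for c in range(C)]

# Step 2: reduction across channels gives a single output activation.
output_activation = sum(partial_sums)
print(output_activation)
```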

SLIDE 46

Mapping CNN to Neural Cache

Within a 2.5 MB LLC slice (Way 1 ... Way 20, grouped into Quad 1-4), Filter 1's C = 256 channels occupy channel 1 through channel 256 across an array's bitlines; further arrays, ways, and quadrants hold the remaining filters (M = 32) and additional output positions (Output Position 1, 2, ...), so many points of the M × E × F output volume are computed in parallel.

SLIDE 47

Mapping of Convolution to Array

The same mapping is replicated across the chip: Slice 1 through Slice 14, each contributing ways 1-18 for compute (ways 19-20 are set aside), together cover the M × E × F output activations.

SLIDE 48

Put It Together

LLC Slice 1 ... Slice 14, Core 1 ... Core 14, and DRAM communicate over the ring interconnect; filter weights, input activations, and output activations stream across it, and way 19 of each 2.5 MB slice is reserved.

  1. Filter loading
  2. Input loading
  3. MAC + reduction
  4. Output transfer

SLIDE 49

Outline

  • Motivation
  • Transpose
  • Bit-Serial Arithmetic
  • Mapping of Convolution to Array
  • Methodology
  • Results


SLIDE 50

Evaluation Methodology

CPU (2 sockets): Intel Xeon E5-2697 v3, 2.6 GHz, 28 cores / 56 threads; 78.96 MB on-chip memory; 64 GB DRAM; performance profiled with TensorFlow tfprof; energy measured via the Intel RAPL interface

GPU (1 card): Nvidia Titan Xp, 1.6 GHz, 3840 CUDA cores; 9.14 MB on-chip memory; 12 GB DRAM; performance profiled with TensorFlow tfprof; energy measured via the NVIDIA System Management Interface

Neural Cache: 2.5 GHz compute SRAM, 1,032,192 bit-serial ALUs; 70 MB on-chip memory (dual socket); 64 GB DRAM; performance from a cycle-accurate simulator + C microbenchmarks; energy from SPICE simulation + the Intel RAPL interface

DNN model:
  • Inception v3
  • 8-bit weights and inputs

SLIDE 51

Outline

  • Motivation
  • Transpose
  • Bit-Serial Arithmetic
  • Mapping of Convolution to Array
  • Methodology
  • Results


SLIDE 52

Throughput
[Chart: throughput (inferences/sec) vs. batch size (1-256) for CPU (Xeon E5), GPU (Titan Xp), and Neural Cache]
2.2x throughput improvement over the GPU

Latency
[Chart: latency (ms) for CPU, GPU, and Neural Cache]
7.7x latency improvement over the GPU

SLIDE 53

Power/Energy Comparison

[Chart: average power (Watts) and total energy (Joules) for CPU, GPU, and Neural Cache]

SLIDE 54

Neural Cache Summary

Repurpose the cache as a data-parallel DNN accelerator:
  • Massively parallel bit-serial in-SRAM arithmetic
  • Data layout for CNNs

12x / 20x over a server-class CPU at 2% area overhead
2x / 16x over a server-class GPU

SLIDE 55

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das

M-Bits Research Group