
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks - PowerPoint PPT Presentation

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks. Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das. M-Bits Research Group.


  1. Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks. Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das. M-Bits Research Group.

  2. Can we transform the CPU into a neural accelerator? [Figure labels: CPU, GPU, $]

  3. Can we transform the CPU into a neural accelerator? [Figure labels: GPU, CPU, Neural Cache] Neural Cache: more parallelism (++), less data movement (--).

  4. Transforming caches into massively parallel vector ALUs. 18-core Xeon processor: 45 MB LLC, 18 LLC slices.

  5. Transforming caches into massively parallel vector ALUs. 18-core Xeon processor: 45 MB LLC, 18 LLC slices. 2.5 MB LLC slice: TMU, CBOX, Way 1 to Way 20; 32 kB data bank; 8 kB array. Running total: 18 LLC slices, 360 ways.

  6. Transforming caches into massively parallel vector ALUs. Adds the 8 kB SRAM array: bitlines BL/BLB 0 to 255, wordlines (WL) 0 to 255, row decoder. Running total: 18 LLC slices, 360 ways, 5760 arrays.

  7. Transforming caches into massively parallel vector ALUs. Array A and Array B each hold Bit-Slice 0 to Bit-Slice 3, one bit-slice per wordline; row decoders select the slices and logic below the array produces A + B. [Figure: example bit values for A, B, and A + B.] Running total: 18 LLC slices, 360 ways, 5760 arrays.

  8. Transforming caches into massively parallel vector ALUs. Adds the bitline ALU: sense amplifiers (SA) compare BL and BLB against Vref and output A&B and ~A & ~B, from which A^B is formed; the sum is S = A^B^C, with the carry C held in a latch (figure labels: D, Q, EN, C_EN, Cin, Cout, DR). Running total: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.

  9. Transforming caches into massively parallel vector ALUs. Passive last-level cache transformed into ∼1 million bit-serial active ALUs. Supported operations: add ✓, multiply ✓, divide ✓. Configurable precision. Bit-serial operation @ 2.5 GHz. Totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.
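The running totals on slides 4 to 9 can be reproduced with a few lines of arithmetic. This is only a sketch: the sizes come from the slides, while the number of banks per way and arrays per bank are derived here by division rather than stated on the slides, and the count assumes one bit-serial ALU per bitline with 256 bitlines per array.

```python
# Sizes quoted on the slides (binary units assumed: 1 kB = 1024 bytes).
LLC_BYTES = 45 * 1024 * 1024   # 45 MB last-level cache
SLICES = 18                    # LLC slices
WAYS_PER_SLICE = 20            # Way 1 .. Way 20
BANK_BYTES = 32 * 1024         # 32 kB data bank
ARRAY_BYTES = 8 * 1024         # 8 kB SRAM array
BITLINES_PER_ARRAY = 256       # bitlines 0 .. 255

# Derived quantities (assumption: all ways, banks, and arrays are identical).
slice_bytes = LLC_BYTES // SLICES                # 2.5 MB per slice
way_bytes = slice_bytes // WAYS_PER_SLICE        # bytes per way
banks_per_way = way_bytes // BANK_BYTES          # derived, not on the slides
arrays_per_bank = BANK_BYTES // ARRAY_BYTES      # derived, not on the slides
rows_per_array = ARRAY_BYTES * 8 // BITLINES_PER_ARRAY

ways = SLICES * WAYS_PER_SLICE
arrays = ways * banks_per_way * arrays_per_bank
alus = arrays * BITLINES_PER_ARRAY               # one ALU per bitline

print(ways, arrays, alus)  # 360 5760 1474560
```

The figure of 1,474,560 ALUs is therefore just 5760 arrays times 256 bitlines, which the slides round to "∼1 million".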

  10. Why bit-serial? Bit-parallel arithmetic, A + B. [Figure: SRAM array with bitlines BL/BLB 0 to 255, row decoders, and logic below the array.]

  11. Why bit-serial? Bit-parallel arithmetic, A + B. Array A and Array B each hold Word 0 to Word 3; the result A + B is stored below them.

  12. Why bit-serial? Bit-parallel arithmetic, A + B. Wordlines WL1 (Array A) and WL2 (Array B) are activated; the logic produces the first sum bit S.

  13. Why bit-serial? Bit-parallel arithmetic, A + B. Carry propagation across bitlines: a carry C passes to the next bitline to produce the second sum bit S.

  14. Why bit-serial? Bit-parallel arithmetic, A + B. Carry propagation across bitlines continues: third sum bit S, second carry C.

  15. Why bit-serial? Bit-parallel arithmetic, A + B. Carry propagation across bitlines for all four sum bits (three carries). Drawbacks: high complexity; loss of throughput and efficiency.

  16. Why bit-serial? Bit-serial arithmetic, A + B. [Figure: SRAM array with bitlines BL/BLB 0 to 255, row decoders, and logic below the array.]

  17. Why bit-serial? Bit-serial arithmetic, A + B. Transposed data: Word 0 to Word 3 each occupy their own bitline, in Array A, Array B, and the result A + B. Each bitline has its own sum S and carry; carries start at 0.

  18. Why bit-serial? Bit-serial arithmetic, A + B. Cycle 1: WL1 (Array A) and WL2 (Array B) activate Bit-Slice 0; every bitline produces a sum bit S and latches its carry.

  19. Why bit-serial? Bit-serial arithmetic, A + B. Cycle 2: WL1 and WL2 activate Bit-Slice 1; each bitline adds in its own latched carry C.

  20. Why bit-serial? Bit-serial arithmetic, A + B. Cycle 3: WL1 and WL2 activate Bit-Slice 2.

  21. Why bit-serial? Bit-serial arithmetic, A + B. Cycle 4: WL1 and WL2 activate Bit-Slice 3. Benefits: ✓ low area complexity; ✓ high throughput; ✓ configurable and high precision.
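The transposed layout of slides 17 to 21 can be sketched in software by treating each wordline as one Python integer whose bit j belongs to bitline j. The function names below are illustrative only, and the extra carry row written at the end is an assumption of this sketch (the slides show four cycles for four bit-slices and do not show where the final carry goes).

```python
def transpose(words, n_bits):
    """Pack words into bit-slices: row i holds bit i of every word,
    with word j sitting on bitline j (bit j of the row integer)."""
    return [sum(((w >> i) & 1) << j for j, w in enumerate(words))
            for i in range(n_bits)]

def untranspose(rows, n_words):
    """Inverse of transpose: rebuild one integer per bitline."""
    return [sum(((row >> j) & 1) << i for i, row in enumerate(rows))
            for j in range(n_words)]

def bit_serial_add(a_rows, b_rows, n_words):
    """Add all words at once, one bit-slice per cycle.
    The carry never crosses bitlines; each bitline keeps its own carry bit."""
    mask = (1 << n_words) - 1
    carry = 0                      # one carry bit per bitline, all zero
    out = []
    for a, b in zip(a_rows, b_rows):            # cycle k handles bit-slice k
        out.append((a ^ b ^ carry) & mask)      # S = A ^ B ^ C
        carry = ((a & b) | ((a ^ b) & carry)) & mask
    out.append(carry)              # assumption: final carry stored as an extra row
    return out

a = [3, 7, 12, 15]                 # Word 0 .. Word 3 of Array A
b = [5, 9, 1, 15]                  # Word 0 .. Word 3 of Array B
rows = bit_serial_add(transpose(a, 4), transpose(b, 4), len(a))
print(untranspose(rows, len(a)))   # [8, 16, 13, 30]
```

The loop runs once per bit of precision regardless of how many words are stored, which is the throughput argument of slide 21: widening the array adds words, not cycles.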

  22. Outline • Motivation • Bit-Serial Arithmetic • Transpose • Mapping of Convolution to Array • Methodology • Results

  23. In-SRAM Arithmetic. [Figure repeated from slide 8: 18-core Xeon processor, 45 MB LLC, 2.5 MB LLC slice, 8 kB SRAM array, and bitline ALU with S = A^B^C; 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.]

  24. Logical Operations In-SRAM. Changes to the array: an additional row decoder (labelled Row Decoder-O, next to the existing Row Decoder) driving the wordlines, and reconfigurable sense amplifiers, so that the differential sense amplifiers on each bitline pair (BL0/BLB0 to BLn/BLBn) can operate as single-ended sense amplifiers against Vref.

  25. Logical Operations In-SRAM: A AND B. The two row decoders activate rows A and B together; the single-ended sense amplifiers compare the bitlines against Vref and output A AND B. [Figure: example cell values on bitline pairs n and 0.]

  26. Logical Operations In-SRAM: A AND B and A NOR B. With rows A and B activated together, the single-ended sense amplifiers output A AND B and A NOR B. [Figure: example cell values on bitline pairs n and 0.]
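A simplified behavioural model of slides 24 to 26 is sketched below. It assumes the usual precharged-bitline picture, which the slides do not spell out: BL stays high only if no activated cell stores 0, and BLB stays high only if no activated cell stores 1. The function name and the list-of-bits representation are illustrative.

```python
def sense_two_rows(row_a, row_b):
    """Activate two wordlines at once and sense BL and BLB single-endedly.

    Assumed model: a cell storing 0 discharges BL, a cell storing 1
    discharges BLB, so BL reads as AND and BLB reads as NOR.
    """
    and_out, nor_out = [], []
    for a, b in zip(row_a, row_b):
        bl = 0 if (a == 0 or b == 0) else 1    # BL high only if both cells hold 1
        blb = 0 if (a == 1 or b == 1) else 1   # BLB high only if both cells hold 0
        and_out.append(bl)
        nor_out.append(blb)
    return and_out, nor_out

# All four input combinations, one per bitline.
row_a = [0, 0, 1, 1]
row_b = [0, 1, 0, 1]
and_bits, nor_bits = sense_two_rows(row_a, row_b)
print(and_bits)  # [0, 0, 0, 1]
print(nor_bits)  # [1, 0, 0, 0]
```

Both results come out of a single array access, which is why the sense amplifiers need to be reconfigurable from differential to single-ended operation.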

  27. Addition In-SRAM. 256 bitlines. Operand rows A0, A1 and B0, B1; result rows P0, P1, P2. Example values on the two bitlines shown: A0 = 1 1, A1 = 1 0, B0 = 0 1, B1 = 1 1; P0, P1, P2 start at 0 0; Carry = 0 0, Sum = 0 0. Bitline logic: sense amplifiers (against Vref) give A&B and ~A & ~B, then A^B, then S = A^B^C, with the carry held in a latch (figure labels: D, Q, EN, C_EN, Cin, Cout, DR).

  28. Addition [Cycle 1]. Rows A0 and B0 are read; the sum is written to P0 and the carry is latched. [Figure: cell, Carry, and Sum values after cycle 1.]

  29. Addition [Cycle 2]. Rows A1 and B1 are read and added to the latched carry; the sum is written to P1. [Figure: cell, Carry, and Sum values after cycle 2.]

  30. Addition [Cycle 3]. The remaining carry is written to P2. [Figure: cell, Carry, and Sum values after cycle 3.]
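The per-bitline logic of slides 27 to 30 can be sketched as a tiny class with a one-bit carry latch. The slides give S = A^B^C and the two sensed signals A&B and ~A & ~B; the carry-out expression used here is the standard full-adder form and is an assumption, as are the class and function names.

```python
class BitlineALU:
    """One bitline's peripheral logic: sensed AND/NOR plus a carry latch."""

    def __init__(self):
        self.carry = 0                              # carry latch, cleared before an add

    def add_step(self, a, b):
        and_ab = a & b                              # sensed on BL
        nor_ab = (1 - a) & (1 - b)                  # sensed on BLB: ~A & ~B
        xor_ab = 1 - (and_ab | nor_ab)              # A^B = NOR(A&B, ~A&~B)
        s = xor_ab ^ self.carry                     # S = A ^ B ^ C
        self.carry = and_ab | (xor_ab & self.carry) # assumed standard carry-out
        return s

def add_column(a_bits, b_bits):
    """Add two LSB-first bit lists stored on one bitline.
    Returns len(a_bits) + 1 product bits: one sum bit per cycle, then the carry."""
    alu = BitlineALU()
    p = [alu.add_step(a, b) for a, b in zip(a_bits, b_bits)]
    p.append(alu.carry)                             # last cycle writes the carry out
    return p

def to_int(bits):
    return sum(bit << i for i, bit in enumerate(bits))

# 2-bit example: A = 3 (A0 = 1, A1 = 1), B = 2 (B0 = 0, B1 = 1).
p = add_column([1, 1], [0, 1])
print(p, to_int(p))  # [1, 0, 1] 5
```

A 2-bit add takes three steps in this model (two sum bits, then the carry into P2), matching the three cycles shown on the slides.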

  31. Multiplication In-SRAM. Operand rows A0, A1 and B0, B1; product rows P0, P1, P2, P3, all starting at 0 0. Example values on the two bitlines shown: A0 = 1 1, A1 = 1 0, B0 = 0 1, B1 = 1 1. Per-bitline latches: Carry, Sum, Tag.

  32. Multiplication [Cycle 1]. Schoolbook layout shown beside the array: A1 A0 × B1 B0 gives partial products A1B0, A0B0 and A1B1, A0B1, which sum to P2 P1 P0. [Figure: cell, Carry, Sum, and Tag values in cycle 1.]

  33. Multiplication [Cycle 2]. P0 <- A0 B0. [Figure: cell, Carry, Sum, and Tag values in cycle 2.]

  34. Multiplication [Cycle 3]. P0 <- A0 B0; P1 <- A1 B0. [Figure: cell, Carry, Sum, and Tag values in cycle 3.]

  35. Multiplication [Cycle 4]. P0 <- A0 B0; P1 <- A1 B0. [Figure: cell, Carry, Sum, and Tag values in cycle 4.]
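The multiplication slides show a Tag latch next to Carry and Sum, and partial products such as P0 <- A0 B0. One way to read this, sketched below as an assumption rather than as the presenters' exact schedule, is shift-and-add with predication: each multiplier bit is loaded into Tag, and the multiplicand is added into the product rows only on bitlines whose Tag is 1. The function names are illustrative and hardware cycle counts are not modelled.

```python
def bit_serial_multiply(a_bits, b_bits):
    """Multiply two equal-length LSB-first bit lists on one bitline.

    Assumed scheme: for each multiplier bit B[j], load it into the Tag latch,
    then add A into P[j..j+n] only if Tag is 1 (predicated write-back).
    """
    n = len(a_bits)
    p = [0] * (2 * n)                       # product rows P0 .. P(2n-1), cleared
    for j, tag in enumerate(b_bits):        # Tag <- B[j]
        carry = 0                           # carry latch cleared per partial product
        for i, a in enumerate(a_bits):
            if tag:                         # Tag gates the update of P[i + j]
                s = p[i + j] ^ a ^ carry
                carry = (p[i + j] & a) | ((p[i + j] ^ a) & carry)
                p[i + j] = s
        if tag:
            p[j + n] = carry                # row j+n is still untouched at this point
    return p

def to_int(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]

# 2-bit example in the style of the slides: A = 3, B = 3.
p = bit_serial_multiply(to_bits(3, 2), to_bits(3, 2))
print(p, to_int(p))  # [1, 0, 0, 1] 9
```

Because the predicate is per bitline, every bitline in an array can multiply a different pair of operands in the same sequence of steps.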
