

  1. Compute Cache. Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das. Presented by Gefei Zuo and Jiacheng Ma.

  2. Agenda ● Motivation ● Goal ● Design ● Evaluation ● Discussion

  3. Motivation ● Data-centric applications ● High degree of data parallelism in these applications ● Time and energy spent moving data >> time and energy spent on actual computation

  4. Goal: in-place computation in caches ● Massive parallelism: new vector (SIMD) instructions ● Less data movement: new cache organization ● Low area overhead: reuse existing circuits

  5. Design Overview ● Background: cache hierarchy, cache geometry, in-place compute ● Components: 1. ISA 2. bit-line computing 3. operand locality 4. execution management 5. cache coherence

  6. Design: ISA (the instruction table on this slide was a figure; the sketch below covers the instructions the later slides use)
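The ISA table itself did not survive extraction, so here is a minimal software reference model for the Compute Cache instructions named later in this deck. In hardware these execute inside the cache sub-arrays; the C signatures below are an assumption for illustration, not the paper's actual encoding.

```c
/* Reference semantics for several cc_* instructions from the deck.
 * Plain loops only pin down the intended result; signatures assumed. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void cc_and(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = a[i] & b[i];
}
void cc_or(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = a[i] | b[i];
}
void cc_xor(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = a[i] ^ b[i];
}
void cc_copy(uint8_t *dst, const uint8_t *src, size_t n) { memcpy(dst, src, n); }
void cc_buz(uint8_t *dst, size_t n) { memset(dst, 0, n); }   /* bulk zero */
int cc_cmp(const uint8_t *a, const uint8_t *b, size_t n) {
    return memcmp(a, b, n) == 0;                  /* 1 if the blocks match */
}
int cc_search(const uint8_t *hay, size_t n, const uint8_t *key, size_t k) {
    for (size_t i = 0; i + k <= n; i++)           /* first match index, else -1 */
        if (memcmp(hay + i, key, k) == 0) return (int)i;
    return -1;
}
```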

  7. Design: bit-line computing. Background: a six-transistor (6T) bit-cell with two stable states. Read: 1. Precharge BL=1, BLB=1. 2. Assert the word line. 3. The sense amplifier detects the voltage difference. Write: 1. Drive BL and BLB to complementary values (BLB == ~BL). 2. Assert the word line.

  8. Design: bit-line computing cont. Logical operation: 1. Precharge BL=Vref, BLB=Vref. 2. Activate two rows (WLi, WLj). 3. Sense: BL => AND, BLB => NOR. ○ BL is sensed as ‘1’ only if both activated bits are ‘1’; BLB is sensed as ‘1’ only if both are ‘0’. XOR = wired-NOR(NOR, AND), as checked below.
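A truth table is a quick sanity check on these sensing rules. This is a software model of the circuit behavior described on the slide, not hardware:

```c
/* With two rows activated, sensing BL yields AND, sensing BLB yields NOR,
 * and XOR falls out as a wired-NOR of those two sense outputs. */
#include <assert.h>
#include <stdio.h>

int main(void) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            int and_ = a & b;           /* BL stays high only if both cells hold 1 */
            int nor_ = !(a | b);        /* BLB stays high only if both cells hold 0 */
            int xor_ = !(and_ | nor_);  /* wired-NOR(AND, NOR) == a ^ b */
            assert(xor_ == (a ^ b));
            printf("a=%d b=%d -> AND=%d NOR=%d XOR=%d\n", a, b, and_, nor_, xor_);
        }
    return 0;
}
```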

  9. Design: operand locality. Requirement: operands must physically share the same set of bitlines. Design choices: ● All ways in a set are mapped to the same block partition. ● Part of the set-index bits select the bank and block position. ● At most 12 address bits are needed to guarantee operand locality; 2^12 = 4K = page size, so page alignment guarantees operand locality (see the sketch below).
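A small illustration of why page alignment suffices: if bank and block placement are derived only from the low 12 address bits (the 4 KB page offset), operands at the same offset in different pages land on the same bitlines. The 12-bit figure is from the slide; the exact field split here is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

static uint32_t placement_bits(uint64_t paddr) {
    return (uint32_t)(paddr & 0xFFF);   /* low 12 bits = 4 KB page offset */
}

int main(void) {
    uint64_t a = 0x1A000 + 0x240;       /* offset 0x240 in one page  */
    uint64_t b = 0x7F000 + 0x240;       /* same offset, another page */
    printf("co-located on the same bitlines: %s\n",
           placement_bits(a) == placement_bits(b) ? "yes" : "no");
    return 0;
}
```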

  10. Design: operand locality cont. When perfect locality cannot be achieved, fall back to one additional vector logic unit per cache controller: ● High area overhead ○ a 16 MB L3 with 512 sub-arrays needs 128 × 64 B logic units ● Higher latency (22 cycles vs. 14 cycles in-place) ● High energy consumption (60%~80% of energy spent on H-Tree wire transfer)

  11. Design: execution management. (Slide figure: a cc_search instruction is broken into per-block operations, each paired with the key: key+OP, key+OP, ...) Enhanced cache controller: ● Breaks an instruction into operations ○ each operation’s operands fit in at most a single cache block (sketched below) ● Deploys the key to each block for cc_search ● Splits an instruction into multiple instructions if the operands are too long ○ by raising a pipeline exception
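A sketch of the controller's split step: an operation over an arbitrary range is decomposed so each piece touches at most one 64-byte cache block, matching the key+OP pairs in the slide's figure. The function name and the 64 B block size are assumptions for illustration.

```c
#include <stddef.h>
#include <stdio.h>

#define BLOCK 64

struct cc_op { size_t block_addr; size_t len; };  /* key is deployed per block */

static size_t split_into_ops(size_t addr, size_t n, struct cc_op *ops) {
    size_t count = 0;
    while (n > 0) {
        size_t in_block = BLOCK - (addr % BLOCK);     /* bytes left in this block */
        size_t take = n < in_block ? n : in_block;
        ops[count].block_addr = addr - (addr % BLOCK);
        ops[count].len = take;
        count++;
        addr += take;
        n -= take;
    }
    return count;
}

int main(void) {
    struct cc_op ops[8];
    size_t k = split_into_ops(0x1010, 200, ops);      /* unaligned 200-byte range */
    printf("%zu per-block operations\n", k);          /* 48 + 64 + 64 + 24 -> 4   */
    return 0;
}
```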

  12. Design: operation cache level ● Operate at the first cache level that contains (or can be made to contain) all operands ● Example: cc_and with operands A, B, C ○ A in L2, B in L2, C in memory ○ perform the operation in L3 (a sketch follows)
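A sketch of that level-selection rule: take the deepest level any operand currently resides at; an operand still in memory is first fetched into L3, so the operation lands in L3. The enum encoding is an assumption.

```c
#include <stdio.h>

enum level { L1 = 1, L2 = 2, L3 = 3, MEM = 4 };

static enum level pick_level(const enum level *ops, int n) {
    enum level deepest = L1;
    for (int i = 0; i < n; i++)
        if (ops[i] > deepest) deepest = ops[i];
    return deepest > L3 ? L3 : deepest;   /* memory operands are pulled into L3 */
}

int main(void) {
    enum level operands[] = { L2, L2, MEM };   /* the slide's cc_and example */
    printf("operate at L%d\n", pick_level(operands, 3));   /* -> L3 */
    return 0;
}
```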

  13. Design: cache coherence & memory model ● No impact on cache coherence ○ coherence requests are served first ○ when a conflicting request arrives: i. the cache line is unlocked ii. the cache line is marked invalid (if required) iii. the cache line is re-fetched (modeled below) ● No impact on the memory model ○ fence instructions still work with CC
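A minimal model of that conflict sequence, under the slide's description: an in-flight CC operation yields to an incoming coherence request by unlocking the line, invalidating it if the request demands it, and re-fetching before resuming. State names are assumptions.

```c
#include <stdio.h>

enum line_state { CC_LOCKED, UNLOCKED, INVALID, VALID };

static enum line_state serve_coherence_request(enum line_state s, int must_invalidate) {
    if (s == CC_LOCKED) s = UNLOCKED;   /* i.   unlock the cache line       */
    if (must_invalidate) s = INVALID;   /* ii.  mark invalid (if required)  */
    if (s == INVALID) s = VALID;        /* iii. re-fetch before resuming CC */
    return s;
}

int main(void) {
    printf("resumes with a valid line: %d\n",
           serve_coherence_request(CC_LOCKED, 1) == VALID);   /* prints 1 */
    return 0;
}
```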

  14. Evaluation ● Environment ○ SniperSim (simulator) & McPAT (power) ○ 8-core CMP with 32 KB L1d, 256 KB L2, and 16 MB L3 ● Benchmarks (see the sketch below) ○ WordCount: cc_search ○ StringMatch: cc_cmp ○ DB-BitMap: cc_or, cc_and ○ Bit Matrix Multiplication: cc_clmul ○ Checkpointing: cc_copy
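To make the benchmark-to-instruction mapping concrete, here is the DB-BitMap pattern: a bitmap-index query is a chain of bulk AND/OR over bit vectors, exactly the shape cc_and/cc_or accelerate in-cache. Plain loops stand in for the instructions; the column names and data are made up.

```c
#include <stdint.h>
#include <stdio.h>

#define NBYTES 16   /* 128 table rows, one bit per row */

int main(void) {
    uint8_t age_30s[NBYTES] = { 0xF0 };   /* hypothetical: rows with 30 <= age < 40 */
    uint8_t city_sf[NBYTES] = { 0x3C };   /* hypothetical: rows with city == SF     */
    uint8_t result[NBYTES];

    for (int i = 0; i < NBYTES; i++)      /* what cc_and would perform in-cache */
        result[i] = age_30s[i] & city_sf[i];

    printf("matches in first byte: 0x%02X\n", result[0]);   /* 0x30 */
    return 0;
}
```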

  15. Evaluation ● Delay ○ negligible impact on baseline read/write ○ and/or/xor: 3x the latency of a simple access ○ the rest: 2x ● Energy ○ cmp/search/clmul: 1.5x ○ copy/buz/not: 2x ○ the rest: 2.5x ● Area ○ 8% overhead

  16. Evaluation: benefits ● Data movement is reduced by 90%, 89%, 71%, and 92% (per benchmark) ● 53x data parallelism ● Runtime is shortened: BMM 3.2x, WordCount 2x, StringMatch 1.5x, DB-BitMap 1.6x (latency reduction is the key contributor)

  17. Evaluation: different configurations ● L1 vs. L2 vs. L3 ○ CC_L3 still accesses L1 and L2 ● In-place vs. near-place ○ compute inside the cache sub-arrays vs. compute inside the cache controller

  18. Summary & Weakness ● Compute in cache! ○ Compute inside the cache via bit-line hacking ○ A set of new instructions ○ High throughput ○ Great power efficiency ○ Low area overhead ● Weakness: what about moving data to L3? This may cost more energy...

  19. Discussion ● In-cache computing vs. in-memory computing: which do you think is better? ● Any ideas for performing more complex operations, e.g., add? (See Prof. Das's follow-up work.) ● How about making cache computing more programmable, something like an FPGA?
