Compute Cache
Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das
Presented by Gefei Zuo and Jiacheng Ma
Agenda
● Motivation
● Goal
● Design
● Evaluation
● Discussion
Motivation
● Data-centric applications
● High degree of data parallelism in applications
● Time and energy spent on moving data >> time and energy spent on actual computing
Goal
In-place computation in caches
● Massive parallelism: new vector instructions (SIMD)
● Less data movement: new cache organization
● Low area overhead: reuse existing circuits
Design Overview
[Figure: cache hierarchy, cache geometry, in-place compute]
1. ISA
2. Bit-line computing
3. Operand locality
4. Execution management
5. Cache coherence
Design: ISA
Design: bit-line computing
Background: 6-transistor bit-cell, two stable states
Read:
1. Precharge BL=1, BLB=1
2. Assert the word line
3. Sense amplifier detects the voltage difference
Write:
1. Drive BL and BLB to the desired value (BLB = ~BL)
2. Assert the word line
Design: bit-line computing cont.
Logical operation:
1. Precharge BL=Vref, BLB=Vref
2. Activate two rows (WLi, WLj)
3. Sensing BL => AND, sensing BLB => NOR
○ BL is sensed as '1' only if both activated bits are '1'; BLB is sensed as '1' only if both activated bits are '0'
XOR = wired-NOR(NOR, AND)
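The two-row activation can be modeled at the logic level. The following is a minimal Python sketch, assuming each row is a list of 0/1 bits that share bit-lines column by column. The function names are hypothetical, and the model ignores all analog behavior (Vref, sensing margins).

```python
def sense_bitlines(row_i, row_j):
    """Model activating two word lines that share the same bit-lines.

    BL stays high only if both cells store 1  -> AND.
    BLB stays high only if both cells store 0 -> NOR.
    """
    if len(row_i) != len(row_j):
        raise ValueError("rows must share the same set of bit-lines")
    bl = [a & b for a, b in zip(row_i, row_j)]
    blb = [(1 - a) & (1 - b) for a, b in zip(row_i, row_j)]
    return bl, blb


def bitline_xor(row_i, row_j):
    """XOR = NOR(AND, NOR), combining the two sensed values per column."""
    bl, blb = sense_bitlines(row_i, row_j)
    return [1 - (x | y) for x, y in zip(bl, blb)]


row_a = [1, 1, 0, 0]
row_b = [1, 0, 1, 0]
and_bits, nor_bits = sense_bitlines(row_a, row_b)
print(and_bits)                   # [1, 0, 0, 0]
print(nor_bits)                   # [0, 0, 0, 1]
print(bitline_xor(row_a, row_b))  # [0, 1, 1, 0]
```

Every column is computed independently, which is where the parallelism across a whole cache block comes from.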
Design: operand locality
Operand locality requirement: operands must physically share the same set of bit-lines
Design choices:
● All ways in a set are mapped to the same block partition.
● Part of the set-index bits selects the bank and block position.
At most 12 address bits need to match to guarantee operand locality.
2^12 = 4K = page size
Page alignment guarantees operand locality
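The page-alignment rule can be checked in software before issuing an in-cache operation. The following is a minimal Python sketch, assuming that "page alignment" means the operands have the same offset within a 4K page, so their low 12 address bits agree. The helper name and the example addresses are hypothetical.

```python
PAGE_OFFSET_BITS = 12  # 2**12 = 4K, the page size cited above


def same_page_offset(addr_a, addr_b, bits=PAGE_OFFSET_BITS):
    """Return True if both addresses agree on their low `bits` bits."""
    mask = (1 << bits) - 1
    return (addr_a & mask) == (addr_b & mask)


a = 0x7F001000
b = 0x7F205000  # different page, same offset as a
c = 0x7F205040  # different page, different offset
print(same_page_offset(a, b))  # True
print(same_page_offset(a, c))  # False
```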
Design: operand locality cont.
When perfect locality cannot be achieved: one additional vector logic unit per cache controller
● High area overhead
○ 16MB L3 with 512 sub-arrays = 128 * 64B logic units
● High latency (14 cycles vs 22 cycles)
● High energy consumption (60%~80% of energy spent on H-Tree wire transfer)
Design: execution management
[Figure: a cc_search instruction broken into operations (OP), each paired with the key]
Enhanced cache controller:
● Break each instruction into operations
○ Each operation's operands span at most one cache block
● Distribute the key to each block for cc_search
● Split an instruction into multiple instructions if its operands are too long
○ Done by raising a pipeline exception
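The controller's decomposition step can be sketched as follows. This is a minimal Python illustration, not the actual controller logic: the 64-byte block size, the (source, source, destination, length) operand layout, and the function name are assumptions made for the example.

```python
BLOCK_SIZE = 64  # bytes; assumed cache block size


def split_into_block_ops(opcode, src_a, src_b, dest, length, block_size=BLOCK_SIZE):
    """Break one vector instruction into per-cache-block operations.

    Each returned tuple (opcode, src_a, src_b, dest) touches exactly one
    cache block per operand.
    """
    if length <= 0 or length % block_size != 0:
        raise ValueError("length must be a positive multiple of the block size")
    if any(addr % block_size != 0 for addr in (src_a, src_b, dest)):
        raise ValueError("operands must be block-aligned")
    return [
        (opcode, src_a + off, src_b + off, dest + off)
        for off in range(0, length, block_size)
    ]


ops = split_into_block_ops("cc_and", 0x1000, 0x2000, 0x3000, 256)
print(len(ops))  # 4
for op in ops:
    print(op[0], hex(op[1]), hex(op[2]), hex(op[3]))
```

A 256-byte cc_and over 64-byte blocks yields four independent operations, which the controller could issue to different sub-arrays.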
Design: operation cache level
● Operate at the highest cache level (closest to the core) that contains all operands
● Example: cc_and
○ Operands: A, B, C
○ A in L2, B in L2, C in memory
○ Perform the operation in L3
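The level-selection rule can be written as a small function. This is a minimal Python sketch under two assumptions: each operand is described by the set of cache levels currently holding it, and an operand that no level holds forces the operation down to L3, as in the slide's example. The function name and data layout are hypothetical.

```python
LEVELS = ["L1", "L2", "L3"]  # highest level (closest to the core) first


def choose_compute_level(operand_locations):
    """Pick the highest cache level that holds every operand.

    operand_locations maps an operand name to the set of levels holding it;
    an empty set means the operand is only in memory.
    """
    for level in LEVELS:
        if all(level in levels for levels in operand_locations.values()):
            return level
    return "L3"  # assumed fallback: no level holds every operand


# Slide example: A and B are in L2, C is only in memory.
print(choose_compute_level({"A": {"L2"}, "B": {"L2"}, "C": set()}))  # L3
# If all operands are in L2, the operation can run there.
print(choose_compute_level({"A": {"L2", "L3"}, "B": {"L2", "L3"}}))  # L2
```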
Design: cache coherence & memory model
● No impact on cache coherence
○ Cache coherence requests are served first
○ When a request arrives:
i. The cache line is unlocked
ii. The cache line is marked invalid (maybe)
iii. The cache line is re-fetched
● No impact on the memory model
○ The fence instruction is still usable in CC
Evaluation
● Environment
○ SniperSim (simulator) & McPAT (power)
○ 8-core CMP with 32K L1d, 256K L2 and 16M L3
● Benchmarks
○ WordCount: cc_search
○ StringMatch: cc_cmp
○ DB-BitMap: cc_or, cc_and
○ Bit Matrix Multiplication: cc_clmul
○ Checkpointing: cc_copy
Evaluation
● Delay
○ Negligible impact on baseline read/write
○ and/or/xor: 3x the latency of a simple access
○ The rest: 2x
● Energy
○ cmp/search/clmul: 1.5x
○ copy/buz/not: 2x
○ The rest: 2.5x
● Area
○ 8% overhead
Evaluation: Benefits
● Data movement is reduced (figure values: 90%, 89%, 71%, 92%)
● Runtime is shortened
○ BMM: 3.2x
○ WordCount: 2x
○ StringMatch: 1.5x
○ DB-BitMap: 1.6x
[Figure labels: 53x, data parallelism, latency reduction, key distribution]
Evaluation: different configurations
● L1 vs. L2 vs. L3
○ CC_L3 still accesses L1 and L2
● In-place vs. near-place
○ In-place: compute inside the cache
○ Near-place: compute inside the cache controller
Summary & Weakness
● Compute in cache!
○ Compute inside the cache with bit-line hacking
○ A set of new instructions
○ High throughput
○ Great power efficiency
○ Low area overhead
● Weakness: what about moving data to L3?
○ This may cost more energy...
Discussion
● In-Cache Computing vs. In-Memory Computing
○ Which do you think is better?
● Any ideas for performing more complex operations?
○ For example, add? See Prof. Das's follow-up work.
● How about making cache computing more programmable?
○ Something like an FPGA?