Compute Cache
Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das
Presented by Gefei Zuo and Jiacheng Ma
Agenda: Motivation, Goal, Design, Evaluation, Discussion
Goal: in-place computation in caches
Design: cache hierarchy, cache geometry, in-place compute, coherence
Background: the 6-transistor (6T) SRAM bit-cell has two stable states
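As a toy illustration of the two stable states (a sketch of ours, not the paper's circuit model; names are made up), the cell's storage core is a pair of cross-coupled inverters, and only the two complementary node values are fixed points:

    # Toy model of the 6T bit-cell's storage core: two cross-coupled inverters.
    def step(q, qb):
        # Each inverter drives the opposite storage node.
        return int(not qb), int(not q)

    assert step(1, 0) == (1, 0)  # stable state: stores '1'
    assert step(0, 1) == (0, 1)  # stable state: stores '0'
    assert step(1, 1) != (1, 1)  # not a fixed point: metastable, does not hold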
Read:
1. Precharge both bitlines: BL = 1, BLB = 1
2. Assert the word line
3. The sense amplifier detects the voltage difference between BL and BLB

Write:
1. Drive BL and BLB to the desired value (BLB = ~BL)
2. Assert the word line
Logical operation:
1. Precharge both bitlines to a reference voltage: BL = Vref, BLB = Vref
2. Activate two rows at once (WLi, WLj)
3. Sense BL => AND, sense BLB => NOR
○ BL is sensed as ‘1’ only if both activated bits are ‘1’; BLB is sensed as ‘1’ only if both are ‘0’
XOR = wired-NOR(NOR, AND)
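A small functional model may make the wired logic concrete (our sketch with made-up names; the real computation happens in analog on the bitlines, here emulated with bitwise integer operations):

    # Emulate activating rows i and j simultaneously and sensing both bitlines.
    def bitline_ops(row_i, row_j, width):
        mask = (1 << width) - 1
        and_ = row_i & row_j             # sensed on BL:  '1' only if both bits are '1'
        nor_ = ~(row_i | row_j) & mask   # sensed on BLB: '1' only if both bits are '0'
        xor_ = ~(and_ | nor_) & mask     # XOR = wired-NOR(NOR, AND)
        return and_, nor_, xor_

    assert bitline_ops(0b1100, 0b1010, width=4) == (0b1000, 0b0001, 0b0110)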
Operand locality requirement: operands must physically share the same set of bitlines.
Design choices:
○ operands are placed in the same block partition
○ the same address bits select the bank and the block position
At most 12 address bits are needed to guarantee operand locality, and 2^12 = 4K = page size, so page alignment guarantees operand locality (see the sketch after this list).
When perfect locality cannot be achieved: one additional vector logic unit per cache controller computes near-place.
○ e.g., a 16MB L3 with 512 sub-arrays has 128 * 64B logic units
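A minimal sketch of the locality check (helper name is ours; it only assumes that every bit selecting the bank and block position falls within the low 12 address bits, as the slide states):

    PAGE_BITS = 12  # 2^12 = 4K = page size

    def have_operand_locality(addr_a, addr_b):
        # Operands share bitlines iff the bank/block-position bits match,
        # and all of those bits sit inside the page offset.
        page_mask = (1 << PAGE_BITS) - 1
        return (addr_a & page_mask) == (addr_b & page_mask)

    assert have_operand_locality(0x10000, 0x2A000)      # page-aligned: always OK
    assert not have_operand_locality(0x10000, 0x2A040)  # falls back to the logic unit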
Enhanced cache controller:
○ extra logic for cc_search
○ handles the case where the operands are too long
○ reports errors by raising a pipeline exception
(Figure: a CC instruction applies OP to every word of a cache block in parallel; for cc_search, the key is replicated to every word.)
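Functionally, cc_search can be read as follows (a sketch of ours, assuming word-granularity compares and a match bitmask as the result; not the real ISA encoding):

    # All word comparisons happen in parallel in hardware; this loop is only a model.
    def cc_search(block, key):
        mask = 0
        for i, word in enumerate(block):
            if word == key:
                mask |= 1 << i
        return mask

    assert cc_search([3, 7, 3, 9], key=3) == 0b0101  # words 0 and 2 match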
Operate at the cache level where all the operands are present.
Example (see the sketch below):
○ A in L2, B in L2, C in memory
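Under one reading of this placement policy (a sketch; helper names are hypothetical, and inclusive caches are assumed so that a block in L2 is also in L3):

    LEVELS = ["L1", "L2", "L3", "memory"]

    def compute_level(residency):
        # Compute at the deepest level any operand resides in, except that
        # data still in memory is first fetched into the LLC (L3).
        deepest = max(LEVELS.index(level) for level in residency.values())
        return LEVELS[min(deepest, LEVELS.index("L3"))]

    # The slide's example: A and B in L2, C in memory -> fetch C into L3, compute there.
    assert compute_level({"A": "L2", "B": "L2", "C": "memory"}) == "L3"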
○ Cache lines used by an in-flight CC operation are locked, but coherence requests are serviced first (see the toy model below). When a request arrives:
 i. the cache line is unlocked
 ii. the cache line is marked invalid (if needed)
 iii. the cache line is re-fetched
○ fence instructions are still usable with CC
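As a toy model of this interaction (our sketch; class and function names are made up):

    class Line:
        def __init__(self):
            self.locked = False  # held while a CC operation is in flight
            self.valid = True

    def on_coherence_request(line, invalidates):
        # Coherence requests win, even against a locked line.
        line.locked = False      # i.   unlock so the request can be serviced
        if invalidates:
            line.valid = False   # ii.  e.g. a remote write invalidates this copy

    def resume_cc_op(line):
        if not line.valid:
            line.valid = True    # iii. re-fetch the line before retrying
        line.locked = True       # re-acquire the lock and redo the operation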
Methodology:
○ SniperSim (simulator) & McPAT (power)
○ 8-core CMP with 32KB L1d, 256KB L2, and 16MB L3
Benchmarks:
○ WordCount: cc_search
○ StringMatch: cc_cmp
○ DB-BitMap: cc_or, cc_and
○ Bit Matrix Multiplication: cc_clmul
○ Checkpointing: cc_copy
○ negligible impact on the baseline read/write
○ and/or/xor: 3x the latency of a simple access
○ the rest: 2x

○ cmp/search/clmul: 1.5x
○ copy/buz/not: 2x
○ the rest: 2.5x

○ 8% area overhead
Benefits: data parallelism, latency reduction.
(Figure: per-benchmark results — 90%, 89%, 71%, 92%.)
Observations: runtime is shortened; data movement is reduced; results are sensitive to the key distribution.
Speedups:
○ BMM: 3.2x
○ WordCount: 2x
○ StringMatch: 1.5x
○ DB-BitMap: 1.6x
Discussion:
○ in-place vs. near-place
○ L1 vs. L2 vs. L3
○ CC at L3 still accesses L1 and L2
Conclusion:
○ compute inside caches with bit-line hacking
○ a set of new instructions
○ high throughput
○ great power efficiency
○ low area overhead
○ how about moving the data to L3 instead? That may cost more energy...
○ Which do you think is better?
○ What about other operations, for example add? See Prof. Das's follow-up work.
○ Something like an FPGA?