Compute Cache
Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das
Presented by Gefei Zuo and Jiacheng Ma
Agenda: Motivation, Goal, Design, Evaluation, Discussion
Goal: in-place computation in caches
Design: cache hierarchy, cache geometry, in-place compute, coherence
Background: the 6-transistor (6T) SRAM bit-cell has two stable states
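As a toy illustration of the two stable states (a sketch of ours, not the paper's circuit model; names are made up), the cell's storage core is a pair of cross-coupled inverters, and only the two complementary node values are fixed points:

    # Toy model of the 6T bit-cell's storage core: two cross-coupled inverters.
    def step(q, qb):
        # Each inverter drives the opposite storage node.
        return int(not qb), int(not q)

    assert step(1, 0) == (1, 0)  # stable state: stores '1'
    assert step(0, 1) == (0, 1)  # stable state: stores '0'
    assert step(1, 1) != (1, 1)  # not a fixed point: metastable, does not hold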
Read:
1. Precharge both bitlines: BL = 1, BLB = 1
2. Assert the word line
3. The sense amplifier detects the voltage difference between BL and BLB

Write:
1. Drive BL and BLB to the desired value (BLB = ~BL)
2. Assert the word line
Logical operation:
1. Precharge both bitlines to a reference voltage: BL = Vref, BLB = Vref
2. Activate two rows at once (WLi, WLj)
3. Sense BL => AND, sense BLB => NOR
○ BL is sensed as ‘1’ only if both activated bits are ‘1’; BLB is sensed as ‘1’ only if both are ‘0’
XOR = wired-NOR(NOR, AND)
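A small functional model may make the wired logic concrete (our sketch with made-up names; the real computation happens in analog on the bitlines, here emulated with bitwise integer operations):

    # Emulate activating rows i and j simultaneously and sensing both bitlines.
    def bitline_ops(row_i, row_j, width):
        mask = (1 << width) - 1
        and_ = row_i & row_j             # sensed on BL:  '1' only if both bits are '1'
        nor_ = ~(row_i | row_j) & mask   # sensed on BLB: '1' only if both bits are '0'
        xor_ = ~(and_ | nor_) & mask     # XOR = wired-NOR(NOR, AND)
        return and_, nor_, xor_

    assert bitline_ops(0b1100, 0b1010, width=4) == (0b1000, 0b0001, 0b0110)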
Operand locality requirement: operands must physically share the same set of bitlines.
Design choices:
○ operands are placed in the same block partition
○ the same address bits select the bank and the block position
At most 12 address bits are needed to guarantee operand locality, and 2^12 = 4K = page size, so page alignment guarantees operand locality (see the sketch after this list).
When perfect locality cannot be achieved: one additional vector logic unit per cache controller computes near-place.
○ e.g., a 16MB L3 with 512 sub-arrays has 128 * 64B logic units
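A minimal sketch of the locality check (helper name is ours; it only assumes that every bit selecting the bank and block position falls within the low 12 address bits, as the slide states):

    PAGE_BITS = 12  # 2^12 = 4K = page size

    def have_operand_locality(addr_a, addr_b):
        # Operands share bitlines iff the bank/block-position bits match,
        # and all of those bits sit inside the page offset.
        page_mask = (1 << PAGE_BITS) - 1
        return (addr_a & page_mask) == (addr_b & page_mask)

    assert have_operand_locality(0x10000, 0x2A000)      # page-aligned: always OK
    assert not have_operand_locality(0x10000, 0x2A040)  # falls back to the logic unit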
Enhanced cache controller:
○ extra logic for cc_search
○ handles the case where the operands are too long
○ reports errors by raising a pipeline exception
(Figure: a CC instruction applies OP to every word of a cache block in parallel; for cc_search, the key is replicated to every word.)
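Functionally, cc_search can be read as follows (a sketch of ours, assuming word-granularity compares and a match bitmask as the result; not the real ISA encoding):

    # All word comparisons happen in parallel in hardware; this loop is only a model.
    def cc_search(block, key):
        mask = 0
        for i, word in enumerate(block):
            if word == key:
                mask |= 1 << i
        return mask

    assert cc_search([3, 7, 3, 9], key=3) == 0b0101  # words 0 and 2 match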
Operate at the cache level where all the operands are present.
Example (see the sketch below):
○ A in L2, B in L2, C in memory
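Under one reading of this placement policy (a sketch; helper names are hypothetical, and inclusive caches are assumed so that a block in L2 is also in L3):

    LEVELS = ["L1", "L2", "L3", "memory"]

    def compute_level(residency):
        # Compute at the deepest level any operand resides in, except that
        # data still in memory is first fetched into the LLC (L3).
        deepest = max(LEVELS.index(level) for level in residency.values())
        return LEVELS[min(deepest, LEVELS.index("L3"))]

    # The slide's example: A and B in L2, C in memory -> fetch C into L3, compute there.
    assert compute_level({"A": "L2", "B": "L2", "C": "memory"}) == "L3"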
○ Cache lines used by an in-flight CC operation are locked, but coherence requests are serviced first (see the toy model below). When a request arrives:
 i. the cache line is unlocked
 ii. the cache line is marked invalid (if needed)
 iii. the cache line is re-fetched
○ fence instructions are still usable with CC
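As a toy model of this interaction (our sketch; class and function names are made up):

    class Line:
        def __init__(self):
            self.locked = False  # held while a CC operation is in flight
            self.valid = True

    def on_coherence_request(line, invalidates):
        # Coherence requests win, even against a locked line.
        line.locked = False      # i.   unlock so the request can be serviced
        if invalidates:
            line.valid = False   # ii.  e.g. a remote write invalidates this copy

    def resume_cc_op(line):
        if not line.valid:
            line.valid = True    # iii. re-fetch the line before retrying
        line.locked = True       # re-acquire the lock and redo the operation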
Methodology:
○ SniperSim (simulator) & McPAT (power)
○ 8-core CMP with 32KB L1d, 256KB L2, and 16MB L3
Benchmarks:
○ WordCount: cc_search
○ StringMatch: cc_cmp
○ DB-BitMap: cc_or, cc_and
○ Bit Matrix Multiplication: cc_clmul
○ Checkpointing: cc_copy
○ negligible impact on the baseline read/write
○ and/or/xor: 3x the latency of a simple access
○ the rest: 2x

○ cmp/search/clmul: 1.5x
○ copy/buz/not: 2x
○ the rest: 2.5x

○ 8% area overhead
Benefits: data parallelism, latency reduction.
(Figure: per-benchmark results — 90%, 89%, 71%, 92%.)
Observations: runtime is shortened; data movement is reduced; results are sensitive to the key distribution.
Speedups:
○ BMM: 3.2x
○ WordCount: 2x
○ StringMatch: 1.5x
○ DB-BitMap: 1.6x
Discussion:
○ in-place vs. near-place
○ L1 vs. L2 vs. L3
○ CC at L3 still accesses L1 and L2
Conclusion:
○ compute inside caches with bit-line hacking
○ a set of new instructions
○ high throughput
○ great power efficiency
○ low area overhead
○ how about moving the data to L3 instead? That may cost more energy...
○ Which do you think is better?
○ What about other operations, for example add? See Prof. Das's follow-up work.
○ Something like an FPGA?