SLIDE 1

Compute Cache

Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das


Presented by Gefei Zuo and Jiacheng Ma

SLIDE 2

Agenda

  • Motivation
  • Goal
  • Design
  • Evaluation
  • Discussion

SLIDE 3

Motivation

  • Data-centric applications
  • These applications exhibit a high degree of data parallelism
  • Time and energy spent moving data far exceed those spent on actual computation

SLIDE 4

Goal

In-place computation in caches

  • Massive parallelism: new vector instructions (SIMD)
  • Less data movement: new cache organization
  • Low area overhead: reuse existing circuits

SLIDE 5

Design Overview


(Figure: overview of the cache hierarchy, cache geometry, and in-place compute.)

Design components:

  • 1. ISA
  • 2. Bit-line computing
  • 3. Operand locality
  • 4. Execution management
  • 5. Cache coherence

SLIDE 6

Design: ISA

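The slide presents the Compute Cache instruction set as an image. Collecting the instruction names that appear elsewhere in this deck (cc_copy, cc_buz, cc_not, cc_and, cc_or, cc_xor, cc_cmp, cc_search, cc_clmul), a minimal C model of their block-wise semantics might look like the following; the signatures are illustrative, not the paper's actual encodings:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Illustrative software semantics of the CC instructions named in this
     * deck. Real CC instructions operate on cache blocks in place; these
     * signatures are a sketch only. */
    static void cc_copy(uint8_t *dst, const uint8_t *src, size_t n) {
        memcpy(dst, src, n);                   /* bulk block copy */
    }
    static void cc_buz(uint8_t *dst, size_t n) {
        memset(dst, 0, n);                     /* bulk zeroing */
    }
    static void cc_and(uint8_t *dst, const uint8_t *a,
                       const uint8_t *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] & b[i];              /* cc_or/cc_xor/cc_not mirror this */
    }
    static bool cc_cmp(const uint8_t *a, const uint8_t *b, size_t n) {
        return memcmp(a, b, n) == 0;           /* block equality test */
    }
    /* cc_search matches a key against every block of a region; cc_clmul is
     * a carry-less multiply used for bit-matrix multiplication. */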
SLIDE 7

Design: bit-line computing

Background: the 6-transistor (6T) SRAM bit-cell, which has two stable states


Read:
  1. Precharge BL = 1 and BLB = 1
  2. Assert the word line
  3. The sense amplifier detects the voltage difference between BL and BLB

Write:
  1. Drive BL and BLB to the desired values (BLB is the complement of BL)
  2. Assert the word line

SLIDE 8

Design: bit-line computing cont.


Logical operation:
  1. Precharge BL = Vref and BLB = Vref
  2. Activate two rows simultaneously (WLi, WLj)
  3. Sense BL => AND; sense BLB => NOR
     ○ BL is sensed as '1' only if both activated bits are '1'; BLB is sensed as '1' only if both are '0'

XOR = wired-NOR(NOR, AND)
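As a software model of the sensing described above (names and types are mine, not the paper's), operating on two 64-byte cache lines word by word:

    #include <stdint.h>

    #define LINE_WORDS 8  /* one 64-byte cache line as 8 x 64-bit words */

    /* Model of activating two rows at once: sensing BL yields AND, sensing
     * BLB yields NOR, and XOR is the wired-NOR of those two results. */
    static void bitline_compute(const uint64_t a[LINE_WORDS],
                                const uint64_t b[LINE_WORDS],
                                uint64_t and_out[LINE_WORDS],
                                uint64_t nor_out[LINE_WORDS],
                                uint64_t xor_out[LINE_WORDS]) {
        for (int i = 0; i < LINE_WORDS; i++) {
            and_out[i] = a[i] & b[i];                 /* sensed on BL  */
            nor_out[i] = ~(a[i] | b[i]);              /* sensed on BLB */
            xor_out[i] = ~(and_out[i] | nor_out[i]);  /* wired-NOR(NOR, AND) */
        }
    }

A result bit is 1 on BL only when both cells store 1, and 1 on BLB only when both store 0, which is exactly why the two sensed values compose into XOR.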

SLIDE 9

Design: operand locality

Operand locality requirement: operands must physically share the same set of bit-lines.

Design choices:

  • All ways in a set are mapped to the same block partition.
  • Part of the set-index bits is used to select the bank and block partition.

At most 12 address bits are needed to guarantee operand locality. Since 2^12 = 4K = the page size, page alignment guarantees operand locality.
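A toy locality check, assuming (as the slide suggests) that the low 12 address bits select the bank and block partition; the exact field layout is an assumption:

    #include <stdbool.h>
    #include <stdint.h>

    #define LOCALITY_BITS 12  /* bits selecting bank + block partition (assumed) */
    #define LOCALITY_MASK ((1u << LOCALITY_BITS) - 1)

    /* Two operands share bit-lines only if these address bits match.
     * Page-aligned buffers match trivially: 2^12 = 4KB = page size, so
     * corresponding elements have identical in-page offsets. */
    static bool have_operand_locality(uintptr_t a, uintptr_t b) {
        return (a & LOCALITY_MASK) == (b & LOCALITY_MASK);
    }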

SLIDE 10

Design: operand locality cont.

When perfect locality cannot be achieved, a fallback vector logic unit at each cache controller performs the operation instead. Compared with in-place computation, this fallback has:

  • High area overhead
    ○ a 16MB L3 with 512 sub-arrays needs 128 64B-wide logic units
  • Higher latency (22 cycles vs. 14 cycles in place)
  • Higher energy consumption (60%~80% of the energy goes to H-Tree wire transfers)

SLIDE 11

Design: execution management

Enhanced cache controller:

  • Breaks an instruction into operations (sketched below)
    ○ each operation's operands fit within a single cache block
  • Delivers the key to the blocks
    ○ for cc_search
  • Splits an instruction into multiple instructions
    ○ if the operands are too long
    ○ by raising a pipeline exception


(Figure: one instruction fanned out into per-block operations; for cc_search, the key is replicated to each block.)
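A sketch of the controller-side split, with issue_block_op standing in (hypothetically) for dispatching one in-sub-array operation:

    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_BLOCK 64  /* bytes per cache block */

    typedef void (*block_op_fn)(uintptr_t dst, uintptr_t src, size_t len);

    /* Break one CC instruction over n bytes into operations whose operands
     * each fit within a single cache block. */
    static void split_into_block_ops(uintptr_t dst, uintptr_t src, size_t n,
                                     block_op_fn issue_block_op) {
        for (size_t off = 0; off < n; off += CACHE_BLOCK) {
            size_t len = (n - off < CACHE_BLOCK) ? (n - off) : CACHE_BLOCK;
            issue_block_op(dst + off, src + off, len);
        }
    }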

SLIDE 12

Design: operation cache level

Operate at the cache level that:

  • is the highest (farthest from the core)
  • contains all operands

Example (see the sketch below):

  • cc_and
  • operands: A, B, C
    ○ A in L2, B in L2, C in memory
  • the operation is performed in L3
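A minimal sketch of the level-selection rule, under the assumption that levels are numbered L1 < L2 < L3 and that operands still in memory are fetched into L3:

    /* Levels: 1 = L1, 2 = L2, 3 = L3; LVL_MEM means not cached yet, which
     * in this sketch forces execution at L3 (where the block is fetched). */
    enum { LVL_L1 = 1, LVL_L2 = 2, LVL_L3 = 3, LVL_MEM = 4 };

    static int pick_compute_level(const int operand_level[], int n_operands) {
        int level = LVL_L1;
        for (int i = 0; i < n_operands; i++) {
            int l = (operand_level[i] == LVL_MEM) ? LVL_L3 : operand_level[i];
            if (l > level)
                level = l;  /* highest level holding (or receiving) an operand */
        }
        return level;
    }

For the slide's example (A and B in L2, C in memory) this returns L3.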

SLIDE 13

Design: cache coherence & memory model

  • No impact on cache coherence (request handling sketched below)
    ○ coherence requests are serviced first
    ○ when a request arrives for a line under computation:
      i. the cache line is unlocked
      ii. the cache line is marked invalid (if the request requires it)
      iii. the cache line is re-fetched
  • No impact on the memory model
    ○ fence instructions remain usable with CC
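A rough model of the race between a coherence request and an in-progress CC operation; all names here are hypothetical:

    #include <stdbool.h>

    typedef struct {
        bool locked;  /* reserved by an in-progress CC operation */
        bool valid;
    } cc_line_t;

    /* The coherence request always wins: the line is unlocked, invalidated
     * if the request demands it, and later re-fetched so the CC operation
     * can retry. */
    static void on_coherence_request(cc_line_t *line, bool must_invalidate) {
        if (line->locked) {
            line->locked = false;       /* i.  unlock */
            if (must_invalidate)
                line->valid = false;    /* ii. invalidate (maybe) */
            /* iii. re-fetch happens when the CC operation resumes */
        }
    }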

SLIDE 14

Evaluation

  • Environment
    ○ SniperSim (simulator) & McPAT (power model)
    ○ 8-core CMP with 32KB L1d, 256KB L2, and 16MB L3
  • Benchmarks
    ○ WordCount: cc_search
    ○ StringMatch: cc_cmp
    ○ DB-BitMap: cc_or, cc_and
    ○ Bit Matrix Multiplication (BMM): cc_clmul
    ○ Checkpointing: cc_copy

SLIDE 15

Evaluation

  • Delay
    ○ negligible impact on baseline reads/writes
    ○ and/or/xor: 3x the latency of a simple access
    ○ all other operations: 2x
  • Energy
    ○ cmp/search/clmul: 1.5x
    ○ copy/buz/not: 2x
    ○ the rest: 2.5x
  • Area
    ○ 8% overhead

SLIDE 16

Evaluation: Benefits


(Figure: benefits across benchmarks, showing data parallelism, latency reduction, data movement reduction, and key distribution.)

  • 53x data parallelism
  • Runtime is shortened: BMM 3.2x, WordCount 2x, StringMatch 1.5x, DB-BitMap 1.6x
  • Data movement is reduced: 90%, 89%, 71%, 92%

SLIDE 17

Evaluation: different configurations


in-place vs. near-place

  • in-place: compute inside the cache sub-arrays
  • near-place: compute at the cache controller

L1 vs. L2 vs. L3

  • CC_L3 still accesses L1 and L2

SLIDE 18

Summary & Weakness

  • Compute in cache!
    ○ compute inside the cache via bit-line hacking
    ○ a set of new vector instructions
    ○ high throughput
    ○ great power efficiency
    ○ low area overhead
  • Weakness:
    ○ what about first moving operands into L3? That extra data movement may cost more energy...

SLIDE 19

Discussion

  • In-cache computing vs. in-memory computing
    ○ Which do you think is better?
  • Any ideas for performing more complex operations?
    ○ For example, addition? See Prof. Das's follow-up work.
  • How about making cache computing more programmable?
    ○ Something like an FPGA?
