A Memory System Design Framework: Creating Smart Memories

SLIDE 1

http://www.c2s2.org

Amin Firoozshahian, Alex Solomatnikov

Hicamp Systems Inc.

Ofer Shacham, Zain Asgar, Stephen Richardson, Christos Kozyrakis, Mark Horowitz

Stanford University

A Memory System Design Framework: Creating Smart Memories

SLIDE 2

An Era of Chip-Multiprocessors…

  • Single-thread performance scaling has stopped
  • More processor cores on the same die
  • Claim:

Scale performance

Keep design complexity constant

[Die photos: IBM Cell, Intel Nehalem, Sun Rock]

SLIDE 3

Looking a Little More Closely

[Die photo: Sun Rock]

SLIDE 4

Reality…

  • Replicated cores
  • Incredibly complicated memory system

Large amounts of logic

  • Innovation is in the memory system

Transactions, streaming, fast synchronization, security, etc.

  • Never exactly the same
  • Where all the bugs are!

SLIDE 5

ISA for Memory Systems

  • Can we regularize the memory system hardware?
  • “Program” it rather than “Design” it?
  • Benefits:

Reduce design time

Patch errors

Run-time tuning

  • How can we do this?

SLIDE 6

Shared Memory System

  • Resources:

Local memory (data, state bits)

Interconnect

Controllers

  • Operations:

Probing state bits

Track requests

Communication

Data movements (spill / refill)

[Diagram: two processors (Proc), each with a cache ($) and a Cache Controller, connected through the Interconnect to Memory; a miss is forwarded as a message (Msg)]

SLIDE 7

Streaming Memory System

  • Resources:

Local memory

Interconnect

Controllers

  • Operations:

Communication

Data movements

Track outstanding transfers

[Diagram: two processors (Proc), each with a local memory (Local Mem) and a DMA engine, connected through the Interconnect to Memory]

SLIDE 8

Transactional Memory System

  • Resources

Local memory (more state bits)

Interconnect

Controllers

  • Operations

Data movements

State checks / updates

Communication

[Diagram: two processors (Proc), each with a cache ($), an address FIFO (Addr. FIFO), and a Commit Controller, connected through the Interconnect to Memory]

SLIDE 9

Commonalities

  • Same resources and operations
  • Different in:

How the operations are sequenced

Interpretation of state bits

  • We need:

Flexible local storage and interconnect

Programmable controllers

SLIDE 10

Local Memories

  • Programmable memory mat

Data array

State bits

PLA logic

Comparator

  • Accessed by

Address, Opcode

  • Returns

data, state, compare result

[K. Mai et al., “Architecture and Circuit Techniques for a Reconfigurable Memory Block,” IEEE International Solid-State Circuits Conference, February 2004]

[Diagram labels: Address, Opcode, Data, State, Cmp, Update]
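
The access interface above (address and opcode in; data, state, and compare result out) can be pictured with a small software model. This is only a sketch under stated assumptions: the class name `MemoryMat`, the three opcodes, and the rule that state bits change only on a compare match are hypothetical stand-ins for the PLA logic, not the actual behavior of the memory block described in the cited paper.

```python
from dataclasses import dataclass, field

# Hypothetical opcodes; the real mat's opcode set is not given on the slide.
READ, WRITE, CMP_UPDATE = "read", "write", "cmp_update"


@dataclass
class MemoryMat:
    """Toy model of a memory mat: data array, state bits, comparator."""
    size: int
    data: list = field(default_factory=list)
    state: list = field(default_factory=list)

    def __post_init__(self):
        self.data = [0] * self.size
        self.state = [0] * self.size

    def access(self, address, opcode, value=0, new_state=0):
        """Return (data, state, compare_result) as they were before any update."""
        data, state = self.data[address], self.state[address]
        match = data == value                  # comparator output
        if opcode == WRITE:
            self.data[address] = value
            self.state[address] = new_state
        elif opcode == CMP_UPDATE:
            # Assumed stand-in for the PLA: update state bits only on a match.
            if match:
                self.state[address] = new_state
        elif opcode != READ:
            raise ValueError(f"unknown opcode: {opcode}")
        return data, state, match


# Use the mat as a tag array: install a tag, then probe it.
tags = MemoryMat(size=8)
tags.access(3, WRITE, value=0x1A, new_state=1)
probe_hit = tags.access(3, READ, value=0x1A)
probe_miss = tags.access(3, READ, value=0x2B)
```

In this model a tag probe is a single `READ` access whose compare result says hit or miss, which is the kind of state probing the shared-memory and transactional slides list as a common operation.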

SLIDE 11

Programmable Controllers

  • Use an off-the-shelf processor?

FLASH, Typhoon, etc.

  • Too slow

All the way to the L1 cache interface

  • Our approach:

Micro-coded engines (functional units)

Each class of operations in a separate engine

SLIDE 12

Programming

  • A set of subroutines

A set of basic operations

Executed in a functional unit

  • Each one calls the next

Link subroutines to each other

[Diagram: messages (Msg) passing between Unit 1, Unit 2, and Unit 3]
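
The programming model on this slide, where each subroutine runs in one functional unit and names the next subroutine to call, could be sketched roughly as follows. Everything here is illustrative: the unit names, the subroutine names (`lookup`, `update`, `reply`), and the `(unit, subroutine, payload)` message shape are assumptions made for the example, not the framework's real encoding.

```python
from collections import deque


def run(program, first_message):
    """Deliver messages until a subroutine returns no successor."""
    queue = deque([first_message])
    trace = []
    while queue:
        unit, subroutine, payload = queue.popleft()
        trace.append((unit, subroutine))
        # Each subroutine does its work, then names the next one (or None).
        next_message = program[unit][subroutine](payload)
        if next_message is not None:
            queue.append(next_message)
    return trace


def lookup(payload):
    payload["looked_up"] = True
    return ("unit2", "update", payload)


def update(payload):
    payload["updated"] = True
    return ("unit3", "reply", payload)


def reply(payload):
    payload["replied"] = True
    return None                      # end of the chain


# The "program" is just a table: which subroutines live in which unit.
PROGRAM = {
    "unit1": {"lookup": lookup},
    "unit2": {"update": update},
    "unit3": {"reply": reply},
}

request = {"address": 0x40}
trace = run(PROGRAM, ("unit1", "lookup", request))
```

Changing the memory model then amounts to swapping the table and the links between subroutines, while the `run` loop (standing in for the hardware) stays the same.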

SLIDE 13

Microarchitecture

  • A small pipeline
  • Configuration (“program”) memories

Horizontal micro-code

Decide what to do

Decide how to proceed
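
One way to picture a horizontal micro-code word is as a record with two parts: a set of control signals asserted together ("what to do") and a next-address choice ("how to proceed"). The sketch below assumes a hypothetical word layout and made-up action names; the slide does not give the real field layout of the configuration memories.

```python
from typing import NamedTuple, Optional, Tuple


class MicroWord(NamedTuple):
    actions: Tuple[str, ...]          # "what to do": signals asserted in parallel
    next_if_true: Optional[int]       # "how to proceed" when the condition holds
    next_if_false: Optional[int]      # ... and when it does not
    condition: Optional[str] = None   # input to test; None means unconditional


def execute(microcode, inputs, start=0):
    """Step through the microcode and return the actions issued at each step."""
    pc, issued = start, []
    while pc is not None:
        word = microcode[pc]
        issued.append(word.actions)
        taken = word.condition is None or bool(inputs[word.condition])
        pc = word.next_if_true if taken else word.next_if_false
    return issued


# Hypothetical three-word program: probe the tags, then branch on hit or miss.
MICROCODE = [
    MicroWord(("read_tag", "read_state"), 1, 2, condition="hit"),
    MicroWord(("read_data", "reply_to_processor"), None, None),
    MicroWord(("allocate_tracking_entry", "send_miss_message"), None, None),
]

hit_steps = execute(MICROCODE, {"hit": True})
miss_steps = execute(MICROCODE, {"hit": False})
```

Because the actions in a word are listed side by side rather than encoded, the "decode" step in this model is trivial, which is the usual appeal of a horizontal format for a short pipeline.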

SLIDE 14

Organization

[Block diagram: Tracking, State Update, and Data Movement units; USHR, MSHR, and Line Buffers; Processor Interface and Network Interface; Interrupt unit; three DMA engines; connections to/from processors, to/from the network, and to/from local storages]

SLIDE 15

Read Miss Example

[Diagram: the controller block diagram from the Organization slide, annotated with the read-miss message flow; labels: Miss, Read Miss, Evict, Access Tags, Line Read, Access Data, WB / Miss, Spill]
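
The labels on this slide suggest a familiar read-miss sequence: track the miss, evict a victim line (writing it back if needed), send the miss to the network, and finish on refill. The sketch below is a simplified guess at that flow, not a transcription of the diagram. The class `CacheController`, its method names, and the message tuples are hypothetical; the only borrowed term is MSHR (miss status holding register), used here in its conventional sense of an entry that tracks one outstanding miss.

```python
class CacheController:
    """Toy read-miss handling with a small pool of MSHR entries."""

    def __init__(self, num_mshrs=2):
        self.free_mshrs = list(range(num_mshrs))
        self.pending = {}    # line address -> MSHR index of the outstanding miss
        self.lines = {}      # line address -> (data, dirty)
        self.outbox = []     # messages handed to the network interface

    def read_miss(self, addr, victim=None):
        if addr in self.pending:
            return "merged"                  # miss already being tracked
        if not self.free_mshrs:
            return "stall"                   # no tracking entry available
        self.pending[addr] = self.free_mshrs.pop()
        if victim in self.lines:             # evict the victim line
            data, dirty = self.lines.pop(victim)
            if dirty:                        # spill dirty data as a writeback
                self.outbox.append(("writeback", victim, data))
        self.outbox.append(("read_miss", addr))
        return "issued"

    def refill(self, addr, data):
        """Install the returned line and release its tracking entry."""
        self.free_mshrs.append(self.pending.pop(addr))
        self.lines[addr] = (data, False)
        return data


ctrl = CacheController()
ctrl.lines[0x40] = ("old", True)             # a dirty line that will be evicted
first = ctrl.read_miss(0x80, victim=0x40)
second = ctrl.read_miss(0x80)                # same line again while outstanding
filled = ctrl.refill(0x80, "new")
```

In the organization shown on the slides, these steps would presumably be split across the tracking, state update, and data movement units rather than living in one class.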

SLIDE 16

Programming Complexity

  • Cache Coherence

Message types received by controller: 6

From processor: Cache miss, Upgrade miss, Prefetch

From network: Coherence request, Refill, Upgrade

Subroutine types in Tracking unit: 11

  • Streaming

Message types: 5

Direct access, Gather, Scatter, Gather reply, Scatter ack.

Subroutine types in Tracking unit: 9

SLIDE 17

Smart Memories

8-core CMP system

ST 90nm-GP CMOS technology

5.5 ns cycle time (181 MHz)

2.9M gates, 55M transistors

[Die photo: 7.77 mm × 7.77 mm]

SLIDE 18

Status

  • System bring-up
  • System configuration
  • JTAG tests
  • Coherent shared memory tests
  • Transactional tests (TCC)
  • Streaming tests
  • More testing in progress
  • Planning for a 32-processor system

[Photo: Test Chip]

SLIDE 19

Evaluation

  • Comparison with a hardwired controller

But which one? You would claim I am cheating!

  • Compare with an “ideal” controller

Assume controller actions occur in zero time

Account for external actions (data read/write, message send/receive)

  • Gives an upper bound
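
The comparison against an "ideal" controller can be illustrated with a small calculation. The sketch assumes, for illustration only, that each request costs some number of controller actions plus some external cycles, and that overhead is reported as the relative slowdown of the real controller against the ideal one; the event numbers are made up and the function names are hypothetical.

```python
def run_time(events, cycles_per_action):
    """Total cycles for a list of (controller_actions, external_cycles) events."""
    return sum(actions * cycles_per_action + external
               for actions, external in events)


def overhead_percent(real_cycles, ideal_cycles):
    """Slowdown of the real controller relative to the ideal one, in percent."""
    if ideal_cycles <= 0:
        raise ValueError("ideal_cycles must be positive")
    return 100.0 * (real_cycles - ideal_cycles) / ideal_cycles


# Made-up workload: two requests with 3 and 2 controller actions,
# and 20 and 10 cycles of external work (data access, messages).
events = [(3, 20), (2, 10)]

ideal = run_time(events, cycles_per_action=0)   # controller actions take zero time
real = run_time(events, cycles_per_action=1)    # assumed one cycle per action
overhead = overhead_percent(real, ideal)
```

Since no hardwired controller can act in less than zero time, an overhead measured this way should be at least as large as the overhead against any real hardwired design, which is one reading of the "upper bound" bullet.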

SLIDE 20

Average Read Latency

[Bar chart: Average Read Latency, 32-processor system. Y axis: Cycles. Series: Real controllers, Ideal controllers. Benchmarks in axis order: FFT, MPEG2 Enc, Barnes, FMM, 179.art, Bitonic Sort, MPEG2 Enc, Barnes, MP3D, grouped under Coherent Shared Memory, Streaming, and Transactions]

SLIDE 21

Execution Time

  • Total average overhead: 15%

[Bar chart: Average Overhead (%), grouped under Coherent Shared Memory, Streaming, and Transactions. Bar labels in extraction order: 10.64, 14.51, 24.29, 6.93, 7.58, 1.88, 14.14, 8.33, 20.03]

SLIDE 22

Conclusion

  • Strong similarity between memory systems

Common resources and operations

  • A framework for memory systems design

Generate specific “instances”

  • Modest performance overhead

Compared to ideal systems
