ISCA 2007 1
Comparing Memory Systems for Chip Multiprocessors
Jacob Leverich
Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis
Computer Systems Laboratory, Stanford University
2
Cores are the New GHz
90s: ↑GHz & ↑ILP
- Problems: power, complexity, ILP limits
00s: ↑cores
- Multicore, manycore, …
[Diagram: CMP floorplan as a grid of processor (P) and memory (M) tiles]
3
[Diagram: two on-chip memory organizations — Cache-based Memory (cache + cache controller per core) vs. Streaming Memory (scratchpad + DMA engine per core)]
4
Exploit spatial & temporal locality
- Reduce average memory access time
Enable data re-use
- Amortize latency over several accesses
Minimize off-chip bandwidth
- Keep useful data local
[Diagram: local memory managed by either a cache controller or a DMA engine]
5
                 Cache-based        Streaming
Locality
  Data fetch     Reactive           Proactive
  Placement      Limited mapping    Arbitrary
  Replacement    Fixed-policy       Arbitrary
  Granularity    Cache block        Arbitrary
Communication
  Coherence      Hardware           Software
6
Better latency hiding
- Overlap DMA transfers with computation
- Double buffering is macroscopic prefetching
Lower off-chip bandwidth requirements
- Avoid conflict misses
- Avoid superfluous refills for output data
- Avoid write-back of dead data
- Avoid fetching whole lines for sparse accesses
Better energy and area efficiency
- No tag & associativity overhead
- Fewer off-chip accesses
7
How do they differ in Performance?
How do they differ in Scaling?
How do they differ in Energy Efficiency?
How do they differ in Programmability?
8
Unified set of constraints
- Same processor core
- Same capacity of local storage per core
- Same on-chip interconnect
- Same off-chip memory channel
Justification
- VLSI constraints (e.g., local storage capacity)
- No fundamental differences (e.g., core type)
9
Caching performs & scales as well as Streaming
- Well-known cache enhancements eliminate differences
Stream Programming benefits Caching Memory
- Enhances locality patterns
- Improves bandwidth and efficiency of caches
Stream Programming easier with Caches
- Makes memory system amenable to irregular & unpredictable workloads
Streaming Memory likely to be replaced or at
10
1–16 cores: Tensilica LX, 3-way VLIW, 2 FPUs
Clock frequency: 800 MHz – 3.2 GHz
On-chip data memory
- Cache-based: 32kB cache, 32B block, 2-way, MESI
- Streaming: 24kB scratchpad, DMA engine, 8kB cache, 32B block, 2-way
- Both: 512kB L2 cache, 32B block, 16-way
System
- Hierarchical on-chip interconnect
- Simple main memory model (3.2 GB/s – 12.8 GB/s)
11
No “SPEC Streaming”
- Few available apps with streaming & caching versions
Selected 10 “streaming” applications
- Some used to motivate or evaluate Streaming Memory
Co-developed apps for both systems
- Caching: C, threads
- Streaming: C, threads, DMA library
Optimized both versions as best we could
12
Video processing
- Stereo Depth Extraction, H.264 Encoding, MPEG-2 Encoding
Image processing
- JPEG Encode/Decode
Scientific and data-intensive
- 2D Finite Element Method, 1D Finite Impulse Response, Merge Sort, Bitonic Sort
Unpredictable
- KD-tree Raytracer, 179.art
13
Caching performs & scales as well as Streaming
- Well-known cache enhancements eliminate differences
14
MPEG-2 Encoder @ 3.2 GHz
[Chart: normalized time for 1–16 CPUs, Cache vs. Streaming]
6/10 apps little affected by local memory choice
FEM @ 3.2 GHz
[Chart: normalized time for 1–16 CPUs, Cache vs. Streaming]
15
16 cores @ 3.2 GHz
[Chart: normalized time for MPEG-2 and FEM, Cache vs. Stream, split into Useful/Data/Sync]
Intuition
- Apps limited by compute
- Good data reuse, even with large datasets
- Low misses/instruction
Note
- “Sync” includes barriers and DMA wait
16
Intuition
- Non-local accesses: DMAs perform efficient SW prefetching
Note
- The case for memory-intensive apps not bound by memory BW (179.art, Merge Sort)
16 cores @ 3.2 GHz, 12.8 GB/s
[Chart: normalized time for FIR, Cache vs. Stream, split into Useful/Data/Sync]
17
16 cores @ 3.2 GHz, 12.8 GB/s
[Chart: normalized time for FIR — Cache, Cache+Prefetch, Stream, split into Useful/Data/Sync]
Intuition
- HW stream prefetcher overlaps transfers with computation as well
- Predictable & regular access patterns
18
The case for apps with large output streams
- Streaming avoids superfluous refills for output streams
- Not the case for write-allocate, fetch-on-miss caches
[Chart: normalized off-chip traffic (Write/Read) for FIR, Merge Sort, MPEG-2 — Cache vs. Stream]
19
Our system: “Prepare For Store” cache hint
- Allocates cache line but avoids refill of old data
Xbox 360: write-buffer for non-allocating writes
[Chart: normalized off-chip traffic (Write/Read) for FIR, Merge Sort, MPEG-2 — Cache, Cache+PFS, Stream]
20
Intuition
- Energy dominated by DRAM accesses and processor core
- Local store ~2x energy efficiency of cache, but small portion of total energy
Note
- The case for compute-intensive applications
16 cores @ 800 MHz
[Chart: normalized energy for MPEG-2, FEM, FIR — Cache vs. Stream, broken into DRAM, L2-cache, local store, D-cache, I-cache, core]
21
Superfluous off-chip accesses are expensive!
Streaming & SW-guided caching reduce them
[Chart: normalized energy for FIR — Cache, Cache+PFS, Stream, broken into DRAM, L2-cache, local store, D-cache, I-cache, core]
[Chart: normalized off-chip traffic (Write/Read) for FIR — Cache, Cache+PFS, Stream]
22
Stream Programming benefits Caching Memory
- Enhances locality patterns
- Improves bandwidth and efficiency of caches
23
MPEG-2 example
- P() generates a video frame later consumed by T()
- Whole frame is too large to fit in local memory
- No temporal locality
Opportunity
- Computations on frame blocks are independent
[Diagram: P → Predicted Video Frame → T]
24
Introducing temporal locality
- Loop fusion for P() and T() at block level
- Intermediate data are dead once T() is done
[Diagram: P → Predicted Video Frame → T, with one predicted block in flight]
25
Exploiting producer-consumer locality
- Re-use the predicted block buffer
- Dynamic working set reduced
- Fits in local memory; no off-chip traffic
[Diagram: P → predicted block → T]
26
Stream programming
- Exposes locality that improves bandwidth and energy efficiency of local memory
2 cores
[Chart: normalized off-chip traffic (Write/Read) for MPEG-2 — Unoptimized, Optimized, Stream]
27
Stream Programming easier with Caches
- Makes memory system amenable to irregular & unpredictable workloads
28
Streaming Memory: stream programming necessary for correctness
- Must refactor all dataflow
Caches can use stream programming for incremental tuning
- Doesn’t require up-front holistic analysis
Why is this important?
- Many “streaming apps” include some unpredictable patterns
29
Raytracing
- Unpredictable tree accesses
- Software caching on Cell (Benthin ’06): emulation overhead, DMA latency for refills
- Tree accesses have good locality on HW caches
3-D shading
- Unpredictable texture accesses
- Texture accesses have good locality on HW caches
- Caches are ubiquitous on GPUs
30
31
Did not scale beyond 16 cores
- Does cache coherence scale?
Application scope
- May not generalize to other domains
- General-purpose != application-specific
Sensitivity to local storage capacity
- Intractable without language/compiler support
32
Scale beyond 16 cores
- Exploit streaming SW to assist HW coherence
Extend application scope
- Generalize to other domains
- Consider further optimizations
Study sensitivity to local storage capacity
- Introduce language/compiler support
34
L2 caches mitigate:
- Unstructured meshes (FEM)
- Motion estimation: search window, reference frames
16 cores w/o 512kB L2
[Chart: normalized off-chip traffic (Write/Read) for FEM and MPEG-2 — Cache vs. Stream]
35
The problem
- Data-dependent write pattern
Caching
- Automatically tracks modified state
- Writes back only dirty data
Streaming
- Writes back everything, or bears the programming burden of tracking modified state
Bitonic Sort
[Chart: normalized off-chip traffic (Write/Read) — Cache vs. Stream]