SLIDE 1

Memory Systems in the Many-Core Era: Some Challenges and Solution Directions

Onur Mutlu
http://www.ece.cmu.edu/~omutlu
June 5, 2011, ISMM/MSPC

SLIDE 2

Modern Memory System: A Shared Resource

SLIDE 3

The Memory System

• The memory system is a fundamental performance and power bottleneck in almost all computing systems: server, mobile, embedded, desktop, sensor
• The memory system must scale (in size, performance, efficiency, cost) to maintain performance and technology scaling
• Recent technology, architecture, and application trends lead to new requirements from the memory system:
  • Scalability (technology and algorithm)
  • Fairness and QoS-awareness
  • Energy/power efficiency

SLIDE 4

Agenda

• Technology, Application, Architecture Trends
• Requirements from the Memory Hierarchy
• Research Challenges and Solution Directions
  • Main Memory Scalability
  • QoS support: Inter-thread/application interference
• Summary

SLIDE 5

Technology Trends

• DRAM does not scale well beyond N nm [ITRS 2009, 2010]
  • Memory scaling benefits: density, capacity, cost
• Energy/power already key design limiters
  • Memory hierarchy responsible for a large fraction of power
    • IBM servers: ~50% of energy spent in the off-chip memory hierarchy [Lefurgy+, IEEE Computer 2003]
    • DRAM consumes power when idle and needs periodic refresh
• More transistors (cores) on chip
• Pin bandwidth not increasing as fast as the number of transistors
  • Memory is the major shared resource among cores
  • More pressure on the memory hierarchy

SLIDE 6

Application Trends

• Many different threads/applications/virtual machines (will) concurrently share the memory system
  • Cloud computing/servers: many workloads consolidated on chip to improve efficiency
  • GP-GPU, CPU+GPU, accelerators: many threads from multiple applications
  • Mobile: interactive + non-interactive consolidation
• Different applications with different requirements (SLAs)
  • Some applications/threads require performance guarantees
  • Modern hierarchies do not distinguish between applications
• Applications are increasingly data intensive
  • More demand for memory capacity and bandwidth

SLIDE 7

Architecture/System Trends

• Sharing of the memory hierarchy
• More cores and components
  • More pressure on the memory hierarchy
• Asymmetric cores: performance asymmetry, CPU+GPUs, accelerators, …
  • Motivated by energy efficiency and Amdahl's Law
• Different cores have different performance requirements
  • Memory hierarchies do not distinguish between cores
• Different goals for different systems/users
  • System throughput, fairness, per-application performance
  • Modern hierarchies are not flexible/configurable

SLIDE 8

Summary: Major Trends Affecting Memory

• Need for main memory capacity and bandwidth increasing
• New need for handling inter-application interference; providing fairness, QoS
• Need for memory system flexibility increasing
• Main memory energy/power is a key system design concern
• DRAM is not scaling well

SLIDE 9

Agenda

• Technology, Application, Architecture Trends
• Requirements from the Memory Hierarchy
• Research Challenges and Solution Directions
  • Main Memory Scalability
  • QoS support: Inter-thread/application interference
• Summary

SLIDE 10

Requirements from an Ideal Memory System

• Traditional
  • High system performance
  • Enough capacity
  • Low cost
• New
  • Technology scalability
  • QoS support and configurability
  • Energy (and power, bandwidth) efficiency

SLIDE 11

Requirements from an Ideal Memory System

• Traditional
  • High system performance: need to reduce inter-thread interference
  • Enough capacity: emerging technologies and waste management can help
  • Low cost: other memory technologies can help
• New
  • Technology scalability
    • Emerging memory technologies (e.g., PCM) can help
  • QoS support and configurability
    • Need HW mechanisms to control interference and build QoS policies
  • Energy (and power, bandwidth) efficiency
    • One-size-fits-all design wastes energy; emerging technologies can help?

SLIDE 12

Agenda

• Technology, Application, Architecture Trends
• Requirements from the Memory Hierarchy
• Research Challenges and Solution Directions
  • Main Memory Scalability
  • QoS support: Inter-thread/application interference
• Summary

SLIDE 13

The DRAM Scaling Problem

• DRAM stores charge in a capacitor (charge-based memory)
  • Capacitor must be large enough for reliable sensing
  • Access transistor should be large enough for low leakage and high retention time
  • Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]
• DRAM capacity, cost, and energy/power hard to scale

SLIDE 14

Concerns with DRAM as Main Memory

• Need for main memory capacity and bandwidth increasing
  • DRAM capacity hard to scale
• Main memory energy/power is a key system design concern
  • DRAM consumes high power due to leakage and refresh
• DRAM technology scaling is becoming difficult
  • DRAM capacity and cost may not continue to scale

SLIDE 15

Possible Solution 1: Tolerate DRAM

• Overcome DRAM shortcomings with
  • System-level solutions
  • Changes to DRAM microarchitecture, interface, and functions

SLIDE 16

Possible Solution 2: Emerging Technologies

• Some emerging resistive memory technologies are more scalable than DRAM (and they are non-volatile)
• Example: Phase Change Memory
  • Data stored by changing the phase of a special material
  • Data read by detecting the material's resistance
  • Expected to scale to 9nm (2022 [ITRS])
  • Prototyped at 20nm (Raoux+, IBM JRD 2008)
  • Expected to be denser than DRAM: can store multiple bits/cell
• But emerging technologies have shortcomings as well
  • Can they be enabled to replace/augment/surpass DRAM?

SLIDE 17

Phase Change Memory: Pros and Cons

• Pros over DRAM
  • Better technology scaling (capacity and cost)
  • Non-volatility
  • Low idle power (no refresh)
• Cons
  • Higher latencies: ~4-15x DRAM (especially write)
  • Higher active energy: ~2-50x DRAM (especially write)
  • Lower endurance (a cell dies after ~10^8 writes)
• Challenges in enabling PCM as DRAM replacement/helper:
  • Mitigate PCM shortcomings
  • Find the right way to place PCM in the system
  • Ensure secure and fault-tolerant PCM operation
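A quick sanity check on that endurance number (my own arithmetic, not from the talk): at ~10^8 writes per cell, a hot cell rewritten once per microsecond wears out in 10^8 × 10^-6 s = 100 seconds; even at one write per millisecond it lasts only ~10^5 s, about a day. This is why unmanaged wear is fatal, and why the naïve-replacement study later in the talk reports an average lifetime measured in hours (Slide 22).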

SLIDE 18

PCM-based Main Memory (I)

• How should PCM-based (main) memory be organized?
• Hybrid PCM+DRAM [Qureshi+ ISCA'09, Dhiman+ DAC'09]:
  • How to partition/migrate data between PCM and DRAM
    • Energy, performance, endurance
  • Is DRAM a cache for PCM or part of main memory?
  • How to design the hardware and software
    • Exploit the advantages, minimize the disadvantages of each technology

SLIDE 19

PCM-based Main Memory (II)

• How should PCM-based (main) memory be organized?
• Pure PCM main memory [Lee et al., ISCA'09, Top Picks'10]:
  • How to redesign the entire hierarchy (and cores) to overcome PCM shortcomings
    • Energy, performance, endurance

SLIDE 20

PCM-Based Memory Systems: Research Challenges

• Partitioning
  • Should DRAM be a cache or main memory, or configurable?
  • What fraction? How many controllers?
• Data allocation/movement (energy, performance, lifetime)
  • Who manages allocation/movement?
  • What are good control algorithms?
    • Latency-critical, heavily modified → DRAM; otherwise PCM?
    • Preventing denial/degradation of service
• Design of cache hierarchy, memory controllers, OS
  • Mitigate PCM shortcomings, exploit PCM advantages
• Design of PCM/DRAM chips and modules
  • Rethink the design of PCM/DRAM with new requirements

SLIDE 21

An Initial Study: Replace DRAM with PCM

• Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.
  • Surveyed prototypes from 2003-2008 (e.g., IEDM, VLSI, ISSCC)
  • Derived "average" PCM parameters for F=90nm

SLIDE 22

Results: Naïve Replacement of DRAM with PCM

• Replace DRAM with PCM in a 4-core, 4MB L2 system
• PCM organized the same as DRAM: row buffers, banks, peripherals
• 1.6x delay, 2.2x energy, 500-hour average lifetime
• Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.

SLIDE 23

Architecting PCM to Mitigate Shortcomings

• Idea 1: Use narrow row buffers in each PCM chip
  → Reduces write energy and peripheral circuitry
• Idea 2: Use multiple row buffers in each PCM chip
  → Reduces array reads/writes → better endurance, latency, energy
• Idea 3: Write into the array at cache-block or word granularity (sketched below)
  → Reduces unnecessary wear

[Figure: DRAM-style organization vs. the proposed PCM organization]
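To make Idea 3 concrete, here is a minimal sketch of write filtering at word granularity (my illustration, not the paper's exact design; the row and word sizes are assumptions):

    /* Sketch of Idea 3 (partial writes): track per-word dirty bits in the
       row buffer and write back only the words that actually changed,
       reducing cell wear. Sizes are assumptions. */
    #include <stdbool.h>
    #include <stdint.h>

    #define ROW_WORDS 1024            /* assumed 8KB row / 8B words */

    typedef struct {
        uint64_t data[ROW_WORDS];     /* row buffer contents */
        bool     dirty[ROW_WORDS];    /* one dirty bit per word */
    } row_buffer_t;

    /* CPU-side write: update the buffered word and mark it dirty. */
    void rb_write(row_buffer_t *rb, int word, uint64_t value) {
        if (rb->data[word] != value) {    /* optionally skip silent stores */
            rb->data[word] = value;
            rb->dirty[word] = true;
        }
    }

    /* Eviction: write back only dirty words to the PCM array. */
    int rb_writeback(row_buffer_t *rb, uint64_t *pcm_row) {
        int writes = 0;
        for (int w = 0; w < ROW_WORDS; w++) {
            if (rb->dirty[w]) {
                pcm_row[w] = rb->data[w];   /* models one array write */
                rb->dirty[w] = false;
                writes++;
            }
        }
        return writes;  /* wear is proportional to this count */
    }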

SLIDE 24

Results: Architected PCM as Main Memory

• 1.2x delay, 1.0x energy, 5.6-year average lifetime
• Scaling improves energy, endurance, density
• Caveat 1: Worst-case lifetime is much shorter (no guarantees)
• Caveat 2: Intensive applications see large performance and energy hits
• Caveat 3: Optimistic PCM parameters?

SLIDE 25

PCM as Main Memory: Research Challenges

• Many research opportunities from the technology layer to the algorithms layer
• Enabling PCM/NVM
  • How to maximize performance?
  • How to maximize lifetime?
  • How to prevent denial of service?
• Exploiting PCM/NVM
  • How to exploit non-volatility?
  • How to minimize energy consumption?
  • How to minimize cost?
  • How to exploit NVM on chip?

[Figure: the layers of the computing stack: Problems, Algorithms, Programs, Runtime System (VM, OS, MM), ISA, Microarchitecture, Logic, Devices, and the User]

SLIDE 26

Agenda

• Technology, Application, Architecture Trends
• Requirements from the Memory Hierarchy
• Research Challenges and Solution Directions
  • Main Memory Scalability
  • QoS support: Inter-thread/application interference
• Summary

SLIDE 27

Memory System is the Major Shared Resource

[Figure: a many-core chip in which all cores' requests meet in the shared memory system; threads' requests interfere]

SLIDE 28

Inter-Thread/Application Interference

• Problem: Threads share the memory system, but the memory system does not distinguish between threads' requests
• Existing memory systems:
  • Free-for-all, shared based on demand
  • Control algorithms are thread-unaware and thread-unfair
  • Aggressive threads can deny service to others
  • Do not try to reduce or control inter-thread interference

SLIDE 29

Uncontrolled Interference: An Example

[Figure: a multi-core chip with CORE 1 (running stream) and CORE 2 (running random), each with a private L2 cache, connected by an interconnect to the DRAM memory controller and DRAM Banks 0-3; unfairness arises in the shared DRAM memory system]

SLIDE 30

A Memory Performance Hog

STREAM (streaming)

  • Sequential memory access
  • Very high row buffer locality (96% hit rate)
  • Memory intensive

    // initialize large arrays A, B
    for (j = 0; j < N; j++) {
        index = j*linesize;
        A[index] = B[index];
        …
    }

RANDOM (random)

  • Random memory access
  • Very low row buffer locality (3% hit rate)
  • Similarly memory intensive

    // initialize large arrays A, B
    for (j = 0; j < N; j++) {
        index = rand();
        A[index] = B[index];
        …
    }

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.
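For concreteness, here is a self-contained reconstruction of the two kernels (the array size, line size, and iteration count are my assumptions, not from the paper):

    /* Self-contained sketch of the STREAM and RANDOM kernels above.
       Array size, line size, and iteration count are assumptions. */
    #include <stdlib.h>
    #include <stdio.h>

    #define SIZE (64 * 1024 * 1024)   /* bytes; large enough to miss in cache */
    #define LINESIZE 64               /* assumed 64B cache lines */

    static char A[SIZE], B[SIZE];     /* statics are zero-initialized */

    void stream_kernel(long iters) {
        for (long j = 0; j < iters; j++) {
            long index = (j * LINESIZE) % SIZE;  /* sequential: hits the open row */
            A[index] = B[index];
        }
    }

    void random_kernel(long iters) {
        for (long j = 0; j < iters; j++) {
            long index = rand() % SIZE;          /* random: closes rows constantly */
            A[index] = B[index];
        }
    }

    int main(void) {
        stream_kernel(SIZE / LINESIZE);
        random_kernel(SIZE / LINESIZE);
        printf("done\n");
        return 0;
    }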

SLIDE 31

What Does the Memory Hog Do?

[Figure: a DRAM bank (row decoder, row buffer, column mux) and the memory request buffer; T0 (STREAM) floods the buffer with requests to Row 0, while T1's (RANDOM) requests to Rows 16, 111, and 5 wait]

Row size: 8KB, cache block size: 64B
→ 128 (8KB/64B) requests of T0 are serviced before T1

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.
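Why the hog wins: commodity controllers use FR-FCFS scheduling, which prefers row-buffer hits over misses. A minimal sketch of that selection rule (my illustration; the request struct and its fields are assumptions):

    /* Sketch of FR-FCFS request selection, the policy the hog exploits:
       row-buffer hits first, then oldest first. */
    #include <stddef.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        int      thread_id;
        uint32_t row;          /* DRAM row this request targets */
        uint64_t arrival_time; /* for oldest-first tie-breaking */
        bool     valid;
    } request_t;

    /* Pick the next request for a bank whose row buffer holds `open_row`. */
    const request_t *frfcfs_pick(const request_t *buf, size_t n, uint32_t open_row) {
        const request_t *best = NULL;
        for (size_t i = 0; i < n; i++) {
            if (!buf[i].valid) continue;
            bool cand_hit = (buf[i].row == open_row);
            bool best_hit = best && (best->row == open_row);
            /* Rule 1: row hits beat row misses.
               Rule 2: among equals, older wins.
               STREAM's requests always hit the open row, so they starve RANDOM. */
            if (!best ||
                (cand_hit && !best_hit) ||
                (cand_hit == best_hit && buf[i].arrival_time < best->arrival_time))
                best = &buf[i];
        }
        return best;
    }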

SLIDE 32

Effect of the Memory Performance Hog

[Figure: slowdown bars. Co-run together, STREAM slows down 1.18X while RANDOM slows down 2.82X; companion charts show gcc and Virtual PC similarly slowed when co-run with STREAM]

Results on an Intel Pentium D running Windows XP (similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.

SLIDE 33

Problems due to Uncontrolled Interference

• Unfair slowdown of different threads [MICRO'07, ISCA'08, ASPLOS'10]
• Low system performance [MICRO'07, ISCA'08, HPCA'10, MICRO'10]
• Vulnerability to denial of service [USENIX Security'07]
• Priority inversion: unable to enforce priorities/SLAs [MICRO'07]
• Poor performance predictability (no performance isolation)

[Figure: slowdowns in a system where main memory is the only shared resource; a low-priority memory performance hog makes the high-priority cores make very slow progress]

SLIDE 34

Problems due to Uncontrolled Interference

• Unfair slowdown of different threads [MICRO'07, ISCA'08, ASPLOS'10]
• Low system performance [MICRO'07, ISCA'08, HPCA'10, MICRO'10]
• Vulnerability to denial of service [USENIX Security'07]
• Priority inversion: unable to enforce priorities/SLAs [MICRO'07]
• Poor performance predictability (no performance isolation)

SLIDE 35

How Do We Solve The Problem?

• Inter-thread interference is uncontrolled in all memory resources:
  • Memory controller
  • Interconnect
  • Caches
• We need to control it
  • i.e., design an interference-aware (QoS-aware) memory system

SLIDE 36

QoS-Aware Memory Systems: Challenges

• How do we reduce inter-thread interference?
  • Improve system performance and core utilization
  • Reduce request serialization and core starvation
• How do we control inter-thread interference?
  • Provide mechanisms to enable system software to enforce QoS policies
  • While providing high system performance
• How do we make the memory system configurable/flexible?
  • Enable flexible mechanisms that can achieve many goals
    • Provide fairness or throughput when needed
    • Satisfy performance guarantees when needed

SLIDE 37

Designing QoS-Aware Memory Systems: Approaches

• Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism
  • QoS-aware memory controllers [Mutlu+ MICRO'07] [Moscibroda+ USENIX Security'07] [Mutlu+ ISCA'08, Top Picks'09] [Kim+ HPCA'10] [Kim+ MICRO'10, Top Picks'11]
  • QoS-aware interconnects [Das+ MICRO'09, ISCA'10, Top Picks'11] [Grot+ MICRO'09, ISCA'11]
  • QoS-aware caches
• Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping
  • Source throttling to control access to the memory system [Ebrahimi+ ASPLOS'10, ISCA'11] [Ebrahimi+ MICRO'09] [Nychis+ HotNets'10]
  • QoS-aware data mapping to memory controllers [Muralidhara+ CMU TR'11]
  • QoS-aware thread scheduling to cores

SLIDE 38

Agenda

• Technology, Application, Architecture Trends
• Requirements from the Memory Hierarchy
• Research Challenges and Solution Directions
  • Main Memory Scalability
  • QoS support: Inter-thread/application interference
    • Smart Resources: Thread Cluster Memory Scheduling
    • Dumb Resources: Fairness via Source Throttling
• Summary

SLIDE 39

QoS-Aware Memory Scheduling

• How to schedule requests to provide
  • High system performance
  • High fairness to applications
  • Configurability to system software
• The memory controller needs to be aware of threads

[Figure: four cores sharing a memory controller in front of memory; the controller resolves memory contention by scheduling requests]

SLIDE 40

QoS-Aware Memory Scheduling: Evolution

• Stall-time fair memory scheduling [Mutlu+ MICRO'07]
  • Idea: Estimate and balance thread slowdowns
  • Takeaway: Proportional thread progress improves performance, especially when threads are "heavy" (memory intensive)
• Parallelism-aware batch scheduling [Mutlu+ ISCA'08, Top Picks'09]
  • Idea: Rank threads and service in rank order (to preserve bank parallelism); batch requests to prevent starvation
• ATLAS memory scheduler [Kim+ HPCA'10]

SLIDE 41

Within-Thread Bank Parallelism

[Figure: threads A and B each issue requests to Bank 0 and Bank 1. Without ranking, the threads' requests interleave on the memory service timeline and each thread WAITs on serialized accesses; with rank-order service, each thread's requests to different banks are in flight concurrently, shortening the thread execution timeline (SAVED CYCLES). Key idea: preserve within-thread bank parallelism]

SLIDE 42

QoS-Aware Memory Scheduling: Evolution

• Stall-time fair memory scheduling [Mutlu+ MICRO'07]
  • Idea: Estimate and balance thread slowdowns
  • Takeaway: Proportional thread progress improves performance, especially when threads are "heavy" (memory intensive)
• Parallelism-aware batch scheduling [Mutlu+ ISCA'08, Top Picks'09]
  • Idea: Rank threads and service in rank order (to preserve bank parallelism); batch requests to prevent starvation
  • Takeaway: Preserving within-thread bank parallelism improves performance; request batching improves fairness
• ATLAS memory scheduler [Kim+ HPCA'10]
  • Idea: Prioritize threads that have attained the least service from the memory scheduler (sketched below)
  • Takeaway: Prioritizing "light" threads improves performance
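A minimal sketch of the least-attained-service ranking behind ATLAS (thread ranking only; the quantum-based accounting and the rest of the scheduler are omitted, and the struct is an assumption):

    /* Sketch of least-attained-service (LAS) ranking as in ATLAS:
       periodically rank threads by how much memory service they have
       received; the least-served ("lightest") thread gets top priority. */
    #include <stdint.h>
    #include <stdlib.h>

    #define NUM_THREADS 8

    typedef struct {
        int      id;
        uint64_t attained_service;  /* cycles of memory service so far */
    } thread_state_t;

    static int by_attained_service(const void *a, const void *b) {
        const thread_state_t *ta = a, *tb = b;
        return (ta->attained_service > tb->attained_service) -
               (ta->attained_service < tb->attained_service);
    }

    /* Called at the start of each quantum: after sorting, index 0 is the
       least-served thread and therefore the highest priority. */
    void atlas_rank(thread_state_t threads[NUM_THREADS]) {
        qsort(threads, NUM_THREADS, sizeof(threads[0]), by_attained_service);
    }

    /* Charged whenever a bank services a request from thread `t`. */
    void account_service(thread_state_t *t, uint64_t service_cycles) {
        t->attained_service += service_cycles;
    }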

SLIDE 43

Previous Scheduling Algorithms are Biased

[Figure: maximum slowdown (better fairness: lower) vs. weighted speedup (better system throughput: higher) for FCFS, FR-FCFS, STFM, PAR-BS, and ATLAS; 24 cores, 4 memory controllers, 96 workloads. Some algorithms show a system throughput bias, others a fairness bias]

No previous memory scheduling algorithm provides both the best fairness and the best system throughput.

SLIDE 44

Throughput vs. Fairness

• Throughput-biased approach: prioritize less memory-intensive threads (higher priority)
  • Good for throughput, but a thread that is never prioritized can starve → unfairness
• Fairness-biased approach: threads take turns accessing memory
  • Does not starve anyone, but less memory-intensive threads are not prioritized → reduced throughput
• A single policy for all threads is insufficient

SLIDE 45

Achieving the Best of Both Worlds

• For throughput: prioritize memory-non-intensive threads
• For fairness:
  • Unfairness is caused by memory-intensive threads being prioritized over each other → shuffle the thread ranking
  • Memory-intensive threads have different vulnerability to interference → shuffle asymmetrically

SLIDE 46

Thread Cluster Memory Scheduling [Kim+ MICRO'10]

1. Group threads into two clusters
2. Prioritize the non-intensive cluster
3. Use different policies for each cluster

(A sketch of the clustering step follows below.)

[Figure: the threads in the system are split into a memory-non-intensive cluster (prioritized, managed for throughput) and a memory-intensive cluster (managed for fairness), with higher priority given to the non-intensive cluster]
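A minimal sketch of the clustering step (my illustration; TCM's actual grouping uses measured bandwidth consumption and the ClusterThreshold knob shown on Slide 48, so the MPKI proxy and threshold semantics here are assumptions):

    /* Sketch of TCM's clustering: sort threads by memory intensity and
       admit the least intensive threads, up to a threshold, into the
       prioritized non-intensive cluster. */
    #include <stdlib.h>

    #define NUM_THREADS 8

    typedef struct {
        int    id;
        double mpki;        /* proxy for memory intensity */
        int    intensive;   /* cluster assignment: 0 = non-intensive */
    } thread_t;

    static int by_mpki(const void *a, const void *b) {
        const thread_t *ta = a, *tb = b;
        return (ta->mpki > tb->mpki) - (ta->mpki < tb->mpki);
    }

    /* Assign clusters; `threshold` caps the total intensity admitted to
       the non-intensive cluster (standing in for ClusterThreshold). */
    void tcm_cluster(thread_t threads[NUM_THREADS], double threshold) {
        qsort(threads, NUM_THREADS, sizeof(threads[0]), by_mpki);
        double admitted = 0.0;
        for (int i = 0; i < NUM_THREADS; i++) {
            admitted += threads[i].mpki;
            threads[i].intensive = (admitted > threshold);
        }
    }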

SLIDE 47

TCM: Throughput and Fairness

[Figure: maximum slowdown (better fairness: lower) vs. weighted speedup (better system throughput: higher) for FR-FCFS, STFM, PAR-BS, ATLAS, and TCM; 24 cores, 4 memory controllers, 96 workloads]

TCM, a heterogeneous scheduling policy, provides the best fairness and system throughput.

SLIDE 48

TCM: Fairness-Throughput Tradeoff

[Figure: maximum slowdown vs. weighted speedup as the configuration parameter is varied; adjusting ClusterThreshold traces a curve for TCM that dominates FR-FCFS, STFM, PAR-BS, and ATLAS on both axes]

TCM allows a robust fairness-throughput tradeoff.

SLIDE 49

Agenda

• Technology, Application, Architecture Trends
• Requirements from the Memory Hierarchy
• Research Challenges and Solution Directions
  • Main Memory Scalability
  • QoS support: Inter-thread/application interference
    • Smart Resources: Thread Cluster Memory Scheduling
    • Dumb Resources: Fairness via Source Throttling
• Summary

SLIDE 50

Designing QoS-Aware Memory Systems: Approaches

• Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism
  • QoS-aware memory controllers [Mutlu+ MICRO'07] [Moscibroda+ USENIX Security'07] [Mutlu+ ISCA'08, Top Picks'09] [Kim+ HPCA'10] [Kim+ MICRO'10, Top Picks'11]
  • QoS-aware interconnects [Das+ MICRO'09, ISCA'10, Top Picks'11] [Grot+ MICRO'09, ISCA'11]
  • QoS-aware caches
• Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping
  • Source throttling to control access to the memory system [Ebrahimi+ ASPLOS'10, ISCA'11] [Ebrahimi+ MICRO'09] [Nychis+ HotNets'10]
  • QoS-aware data mapping to memory controllers [Muralidhara+ CMU TR'11]
  • QoS-aware thread scheduling to cores

SLIDE 51

Many Shared Resources

[Figure: Cores 0 through N share an on-chip cache and a memory controller; beyond the chip boundary, DRAM Banks 0 through K sit off-chip. All of these are shared memory resources]

SLIDE 52

The Problem with "Smart Resources"

• Independent interference control mechanisms in caches, interconnect, and memory can contradict each other
• Explicitly coordinating mechanisms for different resources requires complex implementation
• How do we enable fair sharing of the entire memory system by controlling interference in a coordinated manner?

SLIDE 53

An Alternative Approach: Source Throttling

• Manage inter-thread interference at the cores, not at the shared resources
• Dynamically estimate unfairness in the memory system
• Feed this information back into a controller
• Throttle cores' memory access rates accordingly
  • Whom to throttle and by how much depends on the performance target (throughput, fairness, per-thread QoS, etc.)
  • E.g., if unfairness > system-software-specified target, then throttle down the core causing unfairness and throttle up the core that was unfairly treated
• Ebrahimi et al., "Fairness via Source Throttling," ASPLOS'10.

SLIDE 54

Fairness via Source Throttling (FST) [ASPLOS'10]

FST operates over time intervals (Interval 1, Interval 2, Interval 3, …); each interval runs two components:

Runtime Unfairness Evaluation (slowdown estimation during the interval):
1. Estimate system unfairness
2. Find the application with the highest slowdown (App-slowest)
3. Find the application causing the most interference for App-slowest (App-interfering)

Dynamic Request Throttling (at the end of the interval):

    if (Unfairness Estimate > Target) {
        1. Throttle down App-interfering (limit injection rate and parallelism)
        2. Throttle up App-slowest
    }
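A minimal interval-based sketch of this control loop (my reconstruction; FST's actual slowdown-estimation hardware is the substantive part and is not shown, and the unfairness metric, struct fields, and throttle representation are assumptions):

    /* Sketch of FST's end-of-interval decision, following the pseudocode
       above. Slowdown estimates are assumed to be produced elsewhere. */
    #define NUM_APPS 4

    typedef struct {
        double slowdown_estimate;   /* estimated shared-vs-alone runtime ratio */
        int    throttle_level;      /* higher = fewer outstanding requests (assumed knob) */
    } app_state_t;

    void fst_interval_end(app_state_t apps[NUM_APPS],
                          const int interfering[NUM_APPS], /* who hurts whom most */
                          double target_unfairness) {
        int slowest = 0;
        double min_sd = apps[0].slowdown_estimate;
        for (int i = 1; i < NUM_APPS; i++) {
            if (apps[i].slowdown_estimate > apps[slowest].slowdown_estimate)
                slowest = i;
            if (apps[i].slowdown_estimate < min_sd)
                min_sd = apps[i].slowdown_estimate;
        }
        /* Assumed unfairness metric: max slowdown over min slowdown. */
        double unfairness = apps[slowest].slowdown_estimate / min_sd;

        if (unfairness > target_unfairness) {
            int offender = interfering[slowest];   /* App-interfering */
            apps[offender].throttle_level++;       /* throttle down: limit injection */
            if (apps[slowest].throttle_level > 0)
                apps[slowest].throttle_level--;    /* throttle up App-slowest */
        }
    }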

SLIDE 55

System Software Support

• Different fairness objectives can be configured by system software:
  • Keep maximum slowdown in check
    • Estimated Max Slowdown < Target Max Slowdown
  • Keep the slowdown of particular applications in check to achieve a particular performance target
    • Estimated Slowdown(i) < Target Slowdown(i)
• Support for thread priorities
  • Weighted Slowdown(i) = Estimated Slowdown(i) x Weight(i)
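As a concrete reading of these three conditions, a hedged sketch of the checks system software might perform (the interface is hypothetical, not FST's actual API):

    /* Hypothetical objective checks corresponding to the conditions above. */
    typedef struct { double slowdown, target, weight; } app_qos_t;

    int max_slowdown_ok(const app_qos_t *apps, int n, double target_max) {
        double max_sd = 0.0;
        for (int i = 0; i < n; i++)
            if (apps[i].slowdown > max_sd) max_sd = apps[i].slowdown;
        return max_sd < target_max;        /* Estimated Max < Target Max */
    }

    int per_app_ok(const app_qos_t *a) {
        return a->slowdown < a->target;    /* Estimated(i) < Target(i) */
    }

    double weighted_slowdown(const app_qos_t *a) {
        return a->slowdown * a->weight;    /* priority-weighted slowdown */
    }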

SLIDE 56

Source Throttling Results: Takeaways

• Source throttling alone provides better performance than a combination of "smart" memory scheduling and fair caching
  • Decisions made at the memory scheduler and the cache sometimes contradict each other
• Neither source throttling alone nor "smart resources" alone provides the best performance
• Combined approaches are even more powerful
  • Source throttling and resource-based interference control

SLIDE 57

Designing QoS-Aware Memory Systems: Approaches

• Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism
  • QoS-aware memory controllers [Mutlu+ MICRO'07] [Moscibroda+ USENIX Security'07] [Mutlu+ ISCA'08, Top Picks'09] [Kim+ HPCA'10] [Kim+ MICRO'10, Top Picks'11]
  • QoS-aware interconnects [Das+ MICRO'09, ISCA'10, Top Picks'11] [Grot+ MICRO'09, ISCA'11]
  • QoS-aware caches
• Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping
  • Source throttling to control access to the memory system [Ebrahimi+ ASPLOS'10, ISCA'11] [Ebrahimi+ MICRO'09] [Nychis+ HotNets'10]
  • QoS-aware data mapping to memory controllers [Muralidhara+ CMU TR'11]
  • QoS-aware thread scheduling to cores

SLIDE 58

Another Way of Reducing Interference

• Memory Channel Partitioning
  • Idea: Map badly-interfering applications' pages to different channels [Muralidhara+ CMU TR'11] (see the sketch below)
    • Separate the data of low/high-intensity and low/high-row-locality applications
    • Especially effective in reducing the interference of threads with "medium" and "heavy" memory intensity
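A minimal sketch of the channel-assignment idea (my illustration; the two-channel policy, the classification thresholds, and the profile fields are assumptions, not the paper's algorithm):

    /* Sketch of memory channel partitioning: classify applications by
       memory intensity and row-buffer locality, then steer their page
       allocations to different channels. */
    #define NUM_CHANNELS 2

    typedef struct {
        double mpki;          /* memory intensity proxy (misses per kilo-instruction) */
        double row_hit_rate;  /* row-buffer locality */
    } app_profile_t;

    /* Choose a channel for an application's pages. Light applications
       share channel 0 with high-locality heavy ones; low-locality heavy
       applications (the worst interferers) are isolated on the last channel. */
    int assign_channel(const app_profile_t *app) {
        if (app->mpki < 1.0)
            return 0;
        return (app->row_hit_rate > 0.5) ? 0 : (NUM_CHANNELS - 1);
    }
    /* The OS would then allocate the application's physical pages from
       frames whose channel bits in the physical address match the
       assigned channel. */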

SLIDE 59

Summary: Memory QoS Approaches and Techniques

• Approaches: smart vs. dumb resources
  • Smart resources: QoS-aware memory scheduling
  • Dumb resources: source throttling; channel partitioning
  • Both approaches are effective in reducing interference
  • No single best approach for all workloads
• Techniques: request scheduling, source throttling, memory partitioning
  • All techniques are effective in reducing interference
  • Can be applied at different levels: hardware vs. software
  • No single best technique for all workloads
• Combined approaches and techniques are the most powerful
  • Integrated memory channel partitioning and scheduling

SLIDE 60

Two Related Talks at ISCA

• How to design QoS-aware memory systems (memory scheduling and source throttling) in the presence of prefetching
  • Ebrahimi et al., "Prefetch-Aware Shared Resource Management for Multi-Core Systems," ISCA'11.
  • Monday afternoon (Session 3B)
• How to design scalable QoS mechanisms in on-chip interconnects
  • Idea: Isolate shared resources in a region, provide QoS support only within the region, and ensure interference-free access to the region
  • Grot et al., "Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees," ISCA'11.
  • Wednesday morning (Session 8B)

SLIDE 61

Agenda

• Technology, Application, Architecture Trends
• Requirements from the Memory Hierarchy
• Research Challenges and Solution Directions
  • Main Memory Scalability
  • QoS support: Inter-thread/application interference
    • Smart Resources: Thread Cluster Memory Scheduling
    • Dumb Resources: Fairness via Source Throttling
• Conclusions

SLIDE 62

Conclusions

• Technology, application, and architecture trends dictate new needs from the memory system
• A fresh look at (re-designing) the memory hierarchy:
  • Scalability: enabling new memory technologies
  • QoS, fairness & performance: reducing and controlling inter-application interference; QoS-aware memory system design
  • Efficiency: customizability, minimal waste, new technologies
• Many exciting research topics in fundamental areas across the system stack
  • Hardware/software/device cooperation is essential

SLIDE 63

Thank you.

SLIDE 64

Memory Systems in the Many-Core Era: Some Challenges and Solution Directions

Onur Mutlu
http://www.ece.cmu.edu/~omutlu
June 5, 2011, ISMM/MSPC