
Memory Hierarchy

Original slides from: Computer Architecture: A Quantitative Approach, Hennessy and Patterson. Modified slides by Yashwant Malaiya, Colorado State University.

Review: Major Components of a Computer

[Figure: block diagram of a computer. The processor (Control + Datapath) connects through the memory hierarchy — Cache, Main Memory, Secondary Memory (Disk) — and to input and output devices.]

Processor-Memory Performance Gap

[Figure: performance vs. year, 1980–2004, log scale 1–10,000. Following "Moore's Law," processor performance grows about 55% per year (2× every 1.5 years) while DRAM performance grows about 7% per year (2× every 10 years), so the processor-memory performance gap grows about 50% per year.]

The Memory Hierarchy Goal

Fact: large memories are slow, and fast memories are small. How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)?

- With hierarchy
- With parallelism


A Typical Memory Hierarchy

[Figure: on-chip components (Control, Datapath, RegFile, ITLB/DTLB, instruction and data caches), a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk), with roughly:]

Level                       Speed (cycles)   Size (bytes)
Register file               ½'s              100's
L1 instr./data caches       1's              10K's
Second-level cache (SRAM)   10's             M's
Main memory (DRAM)          100's            G's
Secondary memory (disk)     10,000's         T's
Cost per byte: highest at the top, lowest at the bottom.

Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.


Memory Technology

Static RAM (SRAM)
- 0.5–2.5 ns; 2010: $2000–$5000 per GB (2015: about the same)

Dynamic RAM (DRAM)
- 50–70 ns; 2010: $20–$75 per GB (2015: under $10 per GB)

Flash memory
- 70–150 ns; 2010: $4–$12 per GB (2015: about $0.14 per GB)

Magnetic disk
- 5–20 ms; 2010: $0.20–$2.00 per GB (2015: about $0.07 per GB)

Ideal memory
- Access time of SRAM
- Capacity and cost/GB of disk



Principle of Locality

Programs access a small proportion of their address space at any time.

Temporal locality
- Items accessed recently are likely to be accessed again soon
- e.g., instructions in a loop, induction variables

Spatial locality
- Items near those accessed recently are likely to be accessed soon
- e.g., sequential instruction access, array data
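A concrete illustration may help; this is a minimal C sketch (the function and array names are made up for illustration): the accumulator and loop counter show temporal locality, while the sequential sweep over the array shows spatial locality.

    /* Temporal locality: `sum` and `i` are reused on every iteration.
       Spatial locality: a[0], a[1], a[2], ... are touched in sequence,
       so each cache block fetched for a[] is fully used. */
    long sum_array(const int *a, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }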

Taking Advantage of Locality

Memory hierarchy: store everything on disk.

Copy recently accessed (and nearby) items from disk to a smaller DRAM memory
- Main memory

Copy more recently accessed (and nearby) items from DRAM to a smaller SRAM memory
- Cache memory attached to the CPU


Memory Hierarchy Levels

Block (aka line): the unit of copying
- May be multiple words

If accessed data is present in the upper level
- Hit: access satisfied by the upper level
- Hit ratio: hits/accesses

If accessed data is absent
- Miss: block copied from the lower level
- Time taken: miss penalty
- Miss ratio: misses/accesses = 1 − hit ratio
- Then the accessed data is supplied from the upper level
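In code form, the two ratios are simple bookkeeping; a minimal C sketch with hypothetical counter names:

    /* Hypothetical counters, incremented on every access to the upper level. */
    static unsigned long hits, accesses;

    double hit_ratio(void)  { return (double)hits / (double)accesses; }
    double miss_ratio(void) { return 1.0 - hit_ratio(); }  /* = misses/accesses */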


Characteristics of the Memory Hierarchy

Access time increases with distance from the processor.

Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which in turn is a subset of what is in secondary memory.

[Figure: Processor ↔ L1$ ↔ L2$ ↔ Main Memory ↔ Secondary Memory, drawn to suggest the (relative) size of the memory at each level; transfer units grow with distance: 4–8 bytes (word), 8–32 bytes (block), 1 to 4 blocks, 1,024+ bytes (disk sector = page).]


Cache Size

[Figure: hit rate and 1/(cycle time) plotted against increasing cache size; hit rate rises while 1/(cycle time) falls, and the optimum cache size lies where the two are balanced.]


Cache Memory

Cache memory
- The level of the memory hierarchy closest to the CPU

Given accesses X1, …, Xn−1, Xn:
- How do we know if the data is present?
- Where do we look?
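Both questions are usually answered by splitting the block address into an index (where to look) and a tag (how to know it is the right block). A minimal C sketch of a direct-mapped lookup with made-up sizes; direct mapping is only the simplest of the placement schemes covered in this chapter:

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_BYTES 16u   /* hypothetical block size      */
    #define NUM_LINES   64u   /* hypothetical number of lines */

    struct line { bool valid; uint32_t tag; };
    static struct line cache[NUM_LINES];

    /* Where do we look?  The index field selects exactly one line.
       Is the data present?  The line is valid and its tag matches. */
    bool is_hit(uint32_t addr)
    {
        uint32_t block = addr / BLOCK_BYTES;
        uint32_t index = block % NUM_LINES;
        uint32_t tag   = block / NUM_LINES;
        return cache[index].valid && cache[index].tag == tag;
    }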


Block Size Considerations

Larger blocks should reduce the miss rate
- Due to spatial locality

But in a fixed-size cache
- Larger blocks ⇒ fewer of them
  - More competition ⇒ increased miss rate
- Larger blocks ⇒ pollution

Larger miss penalty
- Can override the benefit of the reduced miss rate
- Early restart and critical-word-first can help

Increasing Hit Rate

Hit rate increases with cache size; it depends only mildly on block size.

[Figure: hit rate h (equivalently, miss rate = 1 − h) vs. block size (16B–256B) for cache sizes of 4KB, 16KB, and 64KB. Very small blocks reduce the chances of covering large data locality; very large blocks increase the chances of fetching fragmented (unused) data.]


Cache Misses

On a cache hit, the CPU proceeds normally.

On a cache miss
- Stall the CPU pipeline
- Fetch the block from the next level of the hierarchy
- Instruction cache miss: restart instruction fetch
- Data cache miss: complete the data access

Static vs. Dynamic RAMs



Random Access Memory (RAM)

[Figure: a RAM consists of a memory cell array, an address decoder driven by the address bits, and read/write circuits connected to the data bits.]


Six-Transistor SRAM Cell

[Figure: six-transistor SRAM cell with a word line and complementary bit lines carrying bit and its complement.]


Dynamic RAM (DRAM) Cell

[Figure: single-transistor DRAM cell — one access transistor and one capacitor, connected to a word line and a bit line.]

The single-transistor DRAM cell was Robert Dennard's 1967 invention.


Advanced DRAM Organization

Bits in a DRAM are organized as a rectangular array
- A DRAM access reads an entire row
- Burst mode: supply successive words from a row with reduced latency

Double data rate (DDR) DRAM
- Transfer on both rising and falling clock edges

Quad data rate (QDR) DRAM
- Separate DDR inputs and outputs


DRAM Generations

[Figure: DRAM row access time (Trac) and column access time (Tcac), roughly 50–300 ns, vs. year, 1980–2007.]

Year   Capacity   $/GB
1980   64 Kbit    $1,500,000
1983   256 Kbit   $500,000
1985   1 Mbit     $200,000
1989   4 Mbit     $50,000
1992   16 Mbit    $15,000
1996   64 Mbit    $10,000
1998   128 Mbit   $4,000
2000   256 Mbit   $1,000
2004   512 Mbit   $250
2007   1 Gbit     $50


Average Access Time

Hit time is also important for performance.

Average memory access time (AMAT)
- AMAT = Hit time + Miss rate × Miss penalty

Example
- CPU with a 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
- AMAT = 1 + 0.05 × 20 = 2 ns (2 cycles per instruction access)
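The slide's example, worked as a small C program (values taken from the slide):

    #include <stdio.h>

    int main(void)
    {
        double clock_ns     = 1.0;   /* 1 ns clock            */
        double hit_time     = 1.0;   /* cycles                */
        double miss_rate    = 0.05;  /* I-cache miss rate, 5% */
        double miss_penalty = 20.0;  /* cycles                */

        double amat = hit_time + miss_rate * miss_penalty;   /* in cycles */
        printf("AMAT = %.1f cycles = %.1f ns\n", amat, amat * clock_ns);
        return 0;                    /* prints: AMAT = 2.0 cycles = 2.0 ns */
    }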


Performance Summary

As CPU performance increases
- The miss penalty becomes more significant

Cache behavior can't be neglected when evaluating system performance.


Multilevel Caches

Primary cache attached to the CPU
- Small, but fast

Level-2 cache services misses from the primary cache
- Larger, slower, but still faster than main memory

Main memory services L2 cache misses; some high-end systems include an L3 cache.
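The benefit of the L2 shows up in the standard two-level expansion of AMAT; the numbers below are illustrative assumptions, not from the slides: an L1 miss now costs only an L2 access, and only the fraction that also misses in L2 pays the full trip to main memory.

    /* AMAT = L1_hit + L1_miss_rate * (L2_hit + L2_local_miss_rate * mem_penalty)
       All parameters are hypothetical, chosen for illustration only. */
    double amat_two_level(void)
    {
        double l1_hit = 1.0,  l1_miss_rate = 0.05;
        double l2_hit = 10.0, l2_local_miss_rate = 0.25;
        double mem_penalty = 100.0;
        /* = 1 + 0.05 * (10 + 0.25 * 100) = 2.75 cycles */
        return l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * mem_penalty);
    }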


Interactions with Advanced CPUs

Out-of-order CPUs can execute instructions during a cache miss
- Pending stores stay in the load/store unit
- Dependent instructions wait in reservation stations
- Independent instructions continue

The effect of a miss depends on program data flow
- Much harder to analyse
- Use system simulation

Virtual Memory

Use main memory as a "cache" for secondary (disk) storage
- Managed jointly by CPU hardware and the operating system (OS)

Programs share main memory
- Each gets a private virtual address space holding its frequently used code and data
- Protected from other programs

The CPU and OS translate virtual addresses to physical addresses
- A VM "block" is called a page
- A VM translation "miss" is called a page fault


Disk Storage

Nonvolatile, rotating magnetic storage.


Disk Sectors and Access

Each sector records
- A sector ID
- Data (512 bytes; 4096 bytes proposed)
- An error correcting code (ECC), used to hide defects and recording errors
- Synchronization fields and gaps

Access to a sector involves
- Queuing delay, if other accesses are pending
- Seek: move the heads
- Rotational latency
- Data transfer
- Controller overhead


Disk Access Example

Given
- 512 B sector, 15,000 rpm, 4 ms average seek time, 100 MB/s transfer rate, 0.2 ms controller overhead, idle disk

Average read time
- 4 ms seek time
- + ½ / (15,000/60) = 2 ms rotational latency
- + 512 / 100 MB/s = 0.005 ms transfer time
- + 0.2 ms controller delay
- = 6.2 ms

If the actual average seek time is 1 ms
- Average read time = 3.2 ms
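The same arithmetic as a small C program (values from the slide):

    #include <stdio.h>

    int main(void)
    {
        double seek_ms       = 4.0;                          /* average seek      */
        double rot_ms        = 0.5 / (15000.0 / 60.0) * 1e3; /* half a rotation   */
        double transfer_ms   = 512.0 / 100e6 * 1e3;          /* 512 B at 100 MB/s */
        double controller_ms = 0.2;

        printf("average read time = %.3f ms\n",              /* about 6.2 ms      */
               seek_ms + rot_ms + transfer_ms + controller_ms);
        return 0;
    }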

Disk Performance Issues

Manufacturers quote an average seek time
- Based on all possible seeks
- Locality and OS scheduling lead to smaller actual average seek times

Smart disk controllers allocate physical sectors on disk
- Present a logical sector interface to the host
- SCSI, ATA, SATA

Disk drives include caches
- Prefetch sectors in anticipation of access
- Avoid seek and rotational delay

Flash Storage

Nonvolatile semiconductor storage
- 100×–1000× faster than disk
- Smaller, lower power, more robust
- But more $/GB (between disk and DRAM)


Flash Types

NOR flash: bit cell like a NOR gate
- Random read/write access
- Used for instruction memory in embedded systems

NAND flash: bit cell like a NAND gate
- Denser (bits/area), but block-at-a-time access
- Cheaper per GB
- Used for USB keys, media storage, …

Flash bits wear out after 1000's of accesses
- Not suitable as a direct RAM or disk replacement
- Wear leveling: remap data to less-used blocks (see the sketch below)
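A toy sketch of the wear-leveling idea (entirely hypothetical; a real flash translation layer also remaps logical blocks, garbage-collects, and tracks bad blocks): steer each write to the block with the lowest erase count so no block wears out far ahead of the rest.

    #define NUM_BLOCKS 1024           /* hypothetical device size */

    static unsigned erase_count[NUM_BLOCKS];

    /* Return the least-worn block for the next write. */
    int pick_block_for_write(void)
    {
        int best = 0;
        for (int b = 1; b < NUM_BLOCKS; b++)
            if (erase_count[b] < erase_count[best])
                best = b;
        erase_count[best]++;          /* the chosen block gets erased and rewritten */
        return best;
    }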


Virtual vs. Physical Address

The processor assumes a certain memory addressing scheme:
- A block of data is called a virtual page
- An address is called a virtual (or logical) address

Main memory may have a different addressing scheme:
- The real memory address is called a physical address; the MMU translates virtual addresses to physical addresses
- The complete address translation table is large and must therefore reside in main memory
- The MMU contains a TLB (translation lookaside buffer), a small cache of the address translation table
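A minimal C sketch of the translation step, with hypothetical 4 KB pages and a flat (single-level) table; real MMUs use multi-level page tables and consult the TLB first:

    #include <stdint.h>

    #define PAGE_SIZE 4096u                   /* hypothetical page size */

    extern uint32_t frame_of[];               /* frame_of[vpn] = physical frame
                                                 number; illustration only */

    uint32_t translate(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr / PAGE_SIZE;  /* virtual page number      */
        uint32_t offset = vaddr % PAGE_SIZE;  /* unchanged by translation */
        return frame_of[vpn] * PAGE_SIZE + offset;
    }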


Page Fault Penalty

On a page fault, the page must be fetched from disk
- Takes millions of clock cycles
- Handled by OS code

Try to minimize the page fault rate
- Smart replacement algorithms

Memory Protection

Different tasks can share parts of their virtual address spaces
- But they need protection against errant access
- Requires OS assistance

Hardware support for OS protection
- Privileged supervisor mode (aka kernel mode)
- Privileged instructions
- Page tables and other state information accessible only in supervisor mode
- System call exception (e.g., syscall in MIPS)

The Memory Hierarchy

The BIG Picture: common principles apply at all levels of the memory hierarchy
- Based on notions of caching

At each level in the hierarchy
- Block placement
- Finding a block
- Replacement on a miss
- Write policy


Virtual Machines

A host computer emulates the guest operating system and machine resources
- Improved isolation of multiple guests
- Avoids security and reliability problems
- Aids sharing of resources

Virtualization has some performance impact
- Feasible with modern high-performance computers

Examples
- IBM VM/370 (1970s technology!)
- VMware
- Microsoft Virtual PC

Multilevel On-Chip Caches

[Figure: die photo of an Intel Nehalem 4-core processor. Per core: 32KB L1 I-cache, 32KB L1 D-cache, 512KB L2 cache. (§5.10 Real Stuff: The AMD Opteron X4 and Intel Nehalem)]


3-Level Cache Organization

L1 caches (per core)
- Intel Nehalem: L1 I-cache 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a; L1 D-cache 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
- AMD Opteron X4: L1 I-cache 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles; L1 D-cache 32KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles

L2 unified cache (per core)
- Intel Nehalem: 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
- AMD Opteron X4: 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a

L3 unified cache (shared)
- Intel Nehalem: 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a
- AMD Opteron X4: 2MB, 64-byte blocks, 32-way, replace the block shared by the fewest cores, write-back/allocate, hit time 32 cycles

n/a: data not available


Concluding Remarks

Fast memories are small; large memories are slow
- We really want fast, large memories ☹
- Caching gives this illusion ☺

Principle of locality
- Programs use a small part of their memory space frequently

Memory hierarchy
- L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk

Memory system design is critical for multiprocessors.