CS654 Advanced Computer Architecture Lec 8 Memory Hierarchy Review

CS654 Advanced Computer Architecture
Lec 8 – Memory Hierarchy Review
Peter Kemper
Adapted from the slides of EECS 252 by Prof. David Patterson, Electrical Engineering and Computer Sciences, University of California, Berkeley

Review from last lecture
• Quantify and summarize performance
  – Ratios, Geometric Mean, Multiplicative Standard Deviation
• F&P (fallacies and pitfalls): benchmarks age, disks fail, single point of failure danger
• Control via State Machines and Microprogramming
• Just overlap tasks; easy if tasks are independent
• Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:
  $$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$
• Hazards limit performance on computers:
  – Structural: need more HW resources
  – Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  – Control: delayed branch, prediction
• Exceptions, Interrupts add complexity
2/9/09 2 CS W&M
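As a quick check of the speedup formula from the review, here is a minimal sketch in Python. The function name and the sample numbers (a 5-stage pipeline, 0.25 stall cycles per instruction) are illustrative assumptions, not values from the slides.

```python
def pipeline_speedup(depth, stall_cpi, cycle_unpipelined=1.0, cycle_pipelined=1.0):
    """Speedup over an unpipelined machine, assuming an ideal CPI of 1."""
    return depth / (1.0 + stall_cpi) * (cycle_unpipelined / cycle_pipelined)

# Hypothetical 5-stage pipeline with equal cycle times.
ideal = pipeline_speedup(depth=5, stall_cpi=0.0)     # no stalls: speedup equals depth
stalled = pipeline_speedup(depth=5, stall_cpi=0.25)  # hazards add 0.25 stall cycles/instr
print(ideal, stalled)
```

With no stalls the speedup equals the pipeline depth, which is the upper bound stated above; any stall cycles push it below that bound.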

Outline
• Memory hierarchy
• Locality
• Cache design
• Virtual address spaces
• Page table layout
• TLB design options
• Conclusion

Since 1980, CPU has outpaced DRAM ...
[Figure: performance (1/latency) versus year, 1980 to 2000, with ticks at 10, 100, 1000. CPU: 60% per yr (2X in 1.5 yrs). DRAM: 9% per yr (2X in 10 yrs). Gap grew 50% per year.]
Q. How do architects address this gap?
A. Put smaller, faster “cache” memories between CPU and DRAM. Create a “memory hierarchy”.

Levels of the Memory Hierarchy
Upper levels are faster; lower levels are larger.

Level           Capacity     Access time              Cost
CPU Registers   100s Bytes   <10s ns
Cache           K Bytes      10-100 ns                1-0.1 cents/bit
Main Memory     M Bytes      200ns-500ns              $.0001-.00001 cents/bit
Disk            G Bytes      10 ms (10,000,000 ns)    10^-5 - 10^-6 cents/bit
Tape            infinite     sec-min                  10^-8

Staging / transfer unit between adjacent levels:
• Registers and Cache: instr. operands, 1-8 bytes, staged by prog./compiler
• Cache and Main Memory: blocks, 8-128 bytes, staged by cache cntl
• Main Memory and Disk: pages, 512-4K bytes, staged by OS
• Disk and Tape: files, Mbytes, staged by user/operator

Memory Hierarchy: Apple iMac G5 (1.6 GHz)

                   Reg      L1 Inst   L1 Data   L2       DRAM     Disk
Size               1K       64K       32K       512K     256M     80G
Latency (cycles)   1        3         3         11       88       10^7
Latency (time)     0.6 ns   1.9 ns    1.9 ns    6.9 ns   55 ns    12 ms

Managed by: compiler (Reg); hardware (L1 Inst, L1 Data, L2); OS, hardware, application (DRAM, Disk)

Goal: Illusion of large, fast, cheap memory
Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access

iMac’s PowerPC 970: All caches on-chip
[Figure: die photo with labels Registers (1K), L1 (64K Instruction), L1 (32K Data), 512K L2]

The Principle of Locality
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
  – Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)
• Last 15 years, HW relied on locality for speed
Locality is a property of programs that is exploited in machine design.
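To make the two kinds of locality concrete, the sketch below runs three synthetic address streams through a toy direct-mapped cache model. The cache geometry (64 blocks of 64 bytes), the 4-byte element size, and the function name are assumptions chosen for illustration only.

```python
def hit_rate(addresses, num_blocks=64, block_size=64):
    """Hit rate of a toy direct-mapped cache for a list of byte addresses."""
    tags = [None] * num_blocks            # one stored tag per cache block
    hits = 0
    for addr in addresses:
        block = addr // block_size        # block address in memory
        index = block % num_blocks        # cache block it maps to
        tag = block // num_blocks
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag             # miss: fetch block, replace the old one
    return hits / len(addresses)

N = 4096
sequential = hit_rate([4 * i for i in range(N)])        # array walk: spatial locality
repeated = hit_rate([4 * (i % 16) for i in range(N)])   # same 16 words: temporal locality
strided = hit_rate([4096 * i for i in range(N)])        # every access hits a new block
print(sequential, repeated, strided)
```

The sequential walk misses once per 64-byte block and hits on the other 15 words; the repeated stream misses only once; the large-stride stream never reuses a block, so it never hits.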

Programs with locality cache well ...
[Figure: memory address (one dot per access) versus time, with regions labeled Temporal Locality, Spatial Locality, and Bad locality behavior]
Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level
  – Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate)
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on 21264!)
[Figure: the processor exchanges data with the upper-level memory, which holds Blk X; Blk Y resides in the lower-level memory]

Cache Measures
• Hit rate: fraction found in that level
  – So high that we usually talk about miss rate
  – Miss rate fallacy: miss rate is to average memory access time as MIPS is to CPU performance; both can mislead
• Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from lower level, including time to replace in CPU
  – access time: time to lower level = f(latency to lower level)
  – transfer time: time to transfer block = f(BW between upper & lower levels)
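The average memory-access time formula is easy to evaluate directly. The sketch below uses two hypothetical cache designs (all numbers are made up for illustration, in clock cycles) to show why a lower miss rate alone does not guarantee a better design.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical designs, all times in clock cycles.
design_a = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)  # small, fast cache
design_b = amat(hit_time=3, miss_rate=0.04, miss_penalty=100)  # lower miss rate, slower hit
print(design_a, design_b)
```

Design B has the lower miss rate but the higher average access time, because its longer hit time is paid on every access.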

4 Questions for Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?
• Block 12 placed in 8 block cache:
  – Fully associative, direct mapped, 2-way set associative
  – S.A. Mapping = Block Number Modulo Number Sets
[Figure: an 8-block cache (blocks 0-7) above a 32-block memory (blocks 0-31)]
• Direct Mapped: (12 mod 8) = 4, so block 12 can go only in cache block 4
• 2-Way Assoc: (12 mod 4) = 0, so block 12 can go anywhere in set 0
• Fully Associative: block 12 can go anywhere in the cache
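The three placement schemes are the same modulo rule with a different number of sets. A minimal sketch follows; the function name is hypothetical, and it assumes the blocks of a set occupy adjacent cache positions, as in the figure.

```python
def candidate_blocks(block_number, cache_blocks, associativity):
    """Return (set index, cache block positions) where a memory block may be placed.

    associativity = 1 is direct mapped; associativity = cache_blocks is fully
    associative. Assumes the blocks of one set sit next to each other.
    """
    num_sets = cache_blocks // associativity
    set_index = block_number % num_sets          # Block Number Modulo Number Sets
    first = set_index * associativity
    return set_index, list(range(first, first + associativity))

direct = candidate_blocks(12, cache_blocks=8, associativity=1)
two_way = candidate_blocks(12, cache_blocks=8, associativity=2)
fully = candidate_blocks(12, cache_blocks=8, associativity=8)
print(direct, two_way, fully)
```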

Q2: How is a block found if it is in the upper level?
• Tag on each block
  – No need to check index or block offset
• Increasing associativity shrinks index, expands tag
[Figure: address fields: Block Address (Tag, Index) followed by Block Offset]

Q2 Fig: C5, Opteron data cache: 64 KB, 2-way set associative, 64-byte blocks, LRU, write-back, write allocate; 48-bit virtual address, 40-bit physical address; 64-bit registers, so 3 bits of the block offset select the 8 bytes within the block
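The tag, index, and offset widths follow from the cache parameters. The sketch below derives them for the Opteron data cache described above (64 KB, 2-way, 64-byte blocks, 40-bit physical address) and splits an address into its fields. The function names are illustrative, and all sizes are assumed to be powers of two.

```python
def cache_fields(cache_bytes, block_bytes, associativity, address_bits):
    """Return (tag_bits, index_bits, offset_bits); sizes must be powers of two."""
    num_sets = cache_bytes // (block_bytes * associativity)
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

def split_address(addr, index_bits, offset_bits):
    """Split an address into (tag, index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

opteron = cache_fields(64 * 1024, 64, associativity=2, address_bits=40)
direct_mapped = cache_fields(64 * 1024, 64, associativity=1, address_bits=40)
tag_bits, index_bits, offset_bits = opteron
example = split_address((5 << 15) | (3 << 6) | 7, index_bits, offset_bits)
print(opteron, direct_mapped, example)
```

Comparing the 2-way result with the direct-mapped one of the same size shows the index losing a bit and the tag gaining one, which is the "shrinks index, expands tag" effect from Q2.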

Q3: Which block should be replaced on a miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
  – Random
  – LRU (Least Recently Used)

Miss rates by associativity and replacement policy:
          2-way            4-way            8-way
Size      LRU     Ran      LRU     Ran      LRU     Ran
16 KB     5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB     1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB    1.15%   1.17%    1.13%   1.13%    1.12%   1.12%

Q3: After a cache read miss, if there are no empty cache blocks, which block should be removed from the cache?
• A randomly chosen block? Easy to implement, how well does it work?
• The Least Recently Used (LRU) block? Appealing, but hard to implement for high associativity
• Also, try other LRU approx.

Miss Rate for 2-way Set Associative Cache
Size      Random   LRU
16 KB     5.7%     5.2%
64 KB     2.0%     1.9%
256 KB    1.17%    1.15%
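The two replacement policies can be compared with a small simulator. This is a sketch under simplifying assumptions: the trace is a list of block numbers, the function name and the tiny example trace are made up, and it does not reproduce the measured miss rates in the tables above.

```python
import random
from collections import OrderedDict

def miss_rate(block_trace, num_sets, ways, policy="lru", seed=0):
    """Miss rate of a set-associative cache under "lru" or "random" replacement."""
    rng = random.Random(seed)
    sets = [OrderedDict() for _ in range(num_sets)]   # per set: tags, oldest first
    misses = 0
    for block in block_trace:
        lines = sets[block % num_sets]
        tag = block // num_sets
        if tag in lines:
            if policy == "lru":
                lines.move_to_end(tag)        # mark as most recently used
            continue
        misses += 1
        if len(lines) == ways:                # set is full: choose a victim
            if policy == "lru":
                lines.popitem(last=False)     # evict the least recently used tag
            else:
                del lines[rng.choice(list(lines))]
        lines[tag] = None
    return misses / len(block_trace)

# Hypothetical trace: three blocks competing for one 2-way set, block 0 reused often.
trace = [0, 2, 0, 4, 0, 2, 0, 4]
lru = miss_rate(trace, num_sets=2, ways=2, policy="lru")
rnd = miss_rate(trace, num_sets=2, ways=2, policy="random", seed=1)
print(lru, rnd)
```

LRU keeps the frequently reused block 0 resident, so only the three compulsory misses and the two later conflicts between blocks 2 and 4 remain. Random replacement may evict block 0, so its result depends on the seed.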

Q4: What happens on a write?
• Policy
  – Write-Through: data written to cache block is also written to lower-level memory
  – Write-Back: write data only to the cache; update lower level when a block falls out of the cache
• Debug
  – Write-Through: Easy
  – Write-Back: Hard
• Do read misses produce writes?
  – Write-Through: No
  – Write-Back: Yes
• Do repeated writes make it to lower level?
  – Write-Through: Yes
  – Write-Back: No
Additional option -- let writes to an un-cached address allocate a new cache line (“write-allocate”).
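The last two rows of the comparison can be checked by counting the writes that reach the lower level. The sketch below models a direct-mapped cache at block granularity; the function name and operation format are hypothetical, and it assumes write-allocate for both policies.

```python
def lower_level_writes(ops, num_blocks, policy):
    """Count writes sent to the lower level.

    ops is a list of ("r" or "w", block_number); policy is "write-through"
    or "write-back". Assumes a direct-mapped cache with write-allocate.
    """
    resident = [None] * num_blocks     # block number held in each cache block
    dirty = [False] * num_blocks
    writes = 0
    for op, block in ops:
        idx = block % num_blocks
        if resident[idx] != block:                 # miss: replace resident block
            if policy == "write-back" and dirty[idx]:
                writes += 1                        # dirty victim is written back
            resident[idx] = block
            dirty[idx] = False
        if op == "w":
            if policy == "write-through":
                writes += 1                        # every store also goes below
            else:
                dirty[idx] = True                  # deferred until eviction
    return writes

# Four stores to block 0, then a read of block 8, which maps to the same cache block.
ops = [("w", 0)] * 4 + [("r", 8)]
through = lower_level_writes(ops, num_blocks=8, policy="write-through")
back = lower_level_writes(ops, num_blocks=8, policy="write-back")
print(through, back)
```

Write-through sends all four repeated stores down, while write-back sends a single write, and that write is triggered by the read miss that evicts the dirty block.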

Write Buffers for Write-Through Caches
[Figure: Processor, Cache, and Lower Level Memory, with a Write Buffer between the cache and lower-level memory. The write buffer holds data awaiting write-through to lower-level memory.]
Q. Why a write buffer?
A. So CPU doesn’t stall
Q. Why a buffer, why not just one register?
A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for write buffer?
A. Yes! Drain the buffer before the next read, or check the write buffer first and then send the read.
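The second answer to the RAW question (check the write buffer before reading) can be sketched as follows. The class and method names are hypothetical, lower-level memory is modeled as a dictionary, and the buffer is unbounded for simplicity, which a real buffer is not.

```python
from collections import deque

class WriteBuffer:
    """Toy FIFO write buffer in front of a lower-level memory (a dict)."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = deque()                # (address, value), oldest first

    def write(self, addr, value):
        self.pending.append((addr, value))    # CPU continues without stalling

    def drain_one(self):
        """Retire the oldest buffered write to lower-level memory."""
        if self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value

    def read(self, addr):
        # Check buffered writes first, newest first, to avoid a RAW hazard.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

memory = {0x10: 1}
buf = WriteBuffer(memory)
buf.write(0x10, 2)
buf.write(0x10, 3)
value_seen = buf.read(0x10)       # newest buffered value, not the stale memory copy
stale_in_memory = memory[0x10]    # memory has not been updated yet
print(value_seen, stale_in_memory)
```

Reading memory directly at this point would return the stale value, which is exactly the hazard the check avoids.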

6 Basic Cache Optimizations
• Reducing Miss Rate
  1. Larger Block size (compulsory misses)
  2. Larger Cache size (capacity misses)
  3. Higher Associativity (conflict misses)
• Reducing Miss Penalty
  4. Multilevel Caches
  5. Giving Read Misses Priority over Writes
     • E.g., Read complete before earlier writes in write buffer
• Reducing hit time
  6. Avoid Address Translation during Indexing of the Cache

Page size 8 KB; TLB direct mapped, 256 entries; L1 direct mapped, 8 KB; L2 direct mapped, 4 MB; block size 64 bytes; virtual address 64 bits, physical address 41 bits; virtually indexed, physically tagged
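The field widths implied by these parameters can be worked out directly. This is a sketch that assumes every size is a power of two; the variable names are illustrative.

```python
def log2(n):
    """Exact log2 of a power of two."""
    assert n > 0 and n & (n - 1) == 0
    return n.bit_length() - 1

PAGE, BLOCK, TLB_ENTRIES = 8 * 1024, 64, 256
L1, L2 = 8 * 1024, 4 * 1024 * 1024
VIRT_BITS, PHYS_BITS = 64, 41

page_offset_bits = log2(PAGE)                              # untranslated low bits
vpn_bits = VIRT_BITS - page_offset_bits                    # virtual page number
tlb_index_bits = log2(TLB_ENTRIES)                         # direct-mapped TLB
tlb_tag_bits = vpn_bits - tlb_index_bits
block_offset_bits = log2(BLOCK)
l1_index_bits = log2(L1 // BLOCK)                          # direct-mapped L1
l1_tag_bits = PHYS_BITS - page_offset_bits                 # physical page number
l2_index_bits = log2(L2 // BLOCK)                          # direct-mapped L2
l2_tag_bits = PHYS_BITS - l2_index_bits - block_offset_bits

# L1 index + block offset fit inside the page offset, so the L1 can be indexed
# with untranslated bits while the TLB is looked up in parallel.
l1_indexable_before_translation = l1_index_bits + block_offset_bits <= page_offset_bits
print(page_offset_bits, vpn_bits, tlb_index_bits, tlb_tag_bits,
      l1_index_bits, l1_tag_bits, l2_index_bits, l2_tag_bits)
```

Because the L1 index and block offset together span exactly the 13 page-offset bits, the cache lookup needs no translated bits until the tag compare, which is what "virtually indexed, physically tagged" relies on.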


The Limits of Physical Addressing
[Figure: CPU connected directly to Memory by address lines A0-A31 (“physical addresses” of memory locations) and data lines D0-D31]
• All programs share one address space: the physical address space
• Machine language programs must be aware of the machine organization
• No way to prevent a program from accessing any machine resource

Solution: Add a Layer of Indirection
[Figure: the CPU issues “virtual addresses” (A0-A31) to an Address Translation unit, which sends “physical addresses” (A0-A31) to Memory; data lines D0-D31 connect CPU and Memory]
• User programs run in a standardized virtual address space
• Address Translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
• Hardware supports “modern” OS features: Protection, Translation, Sharing
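The three features named above (protection, translation, sharing) can be illustrated with a toy per-process page table. Everything here is a simplified assumption: the 4 KB page size, the dictionary-based table, the entry format, and the exception names are hypothetical and not taken from the slides.

```python
PAGE_SIZE = 4096   # hypothetical page size for this sketch

class PageFault(Exception):
    """Raised when a virtual page has no mapping."""

class ProtectionFault(Exception):
    """Raised on a write to a read-only page."""

def translate(page_table, vaddr, write=False):
    """Map a virtual address to a physical address.

    page_table maps virtual page number -> (physical frame number, writable).
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    entry = page_table.get(vpn)
    if entry is None:
        raise PageFault(hex(vaddr))
    frame, writable = entry
    if write and not writable:
        raise ProtectionFault(hex(vaddr))
    return frame * PAGE_SIZE + offset

# Two processes use the same virtual addresses but get different frames;
# frame 9 is shared read-only between them.
proc_a = {0: (5, True), 1: (9, False)}
proc_b = {0: (7, True), 3: (9, False)}
a_private = translate(proc_a, 0x0010)
b_private = translate(proc_b, 0x0010)
a_shared = translate(proc_a, 0x1004)
b_shared = translate(proc_b, 0x3004)
print(hex(a_private), hex(b_private), hex(a_shared), hex(b_shared))
```

The same virtual address lands in different physical frames for each process (translation), different virtual pages can name one frame (sharing), and a store to the read-only page raises a fault (protection).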
