
COSC 5351 Advanced Computer Architecture
Slides modified from Hennessy CS252 course slides

[Figure: performance (1/latency) versus year for CPU and DRAM]
• CPU: 60% per year (2X in 1.5 years)
• DRAM: 9% per year (2X in 10 years)
• Gap grew 50% per year

Q. How do architects address this gap?
A. Put smaller, faster “cache” memories between CPU and DRAM. Create a “memory hierarchy”.

Apple ][ (1977)
• CPU: 1000 ns
• DRAM: 400 ns
[Photo: Steve Jobs and Steve Wozniak]

Levels of the memory hierarchy, from the upper level (faster) to the lower level (larger). The staging unit is what moves between a level and the next lower one.

Level          Capacity    Access time            Cost                      Staging unit     Managed by      Unit size
CPU registers  100s bytes  <10s ns                                          Instr. operands  prog./compiler  1-8 bytes
Cache          K bytes     10-100 ns              1-0.1 cents/bit           Blocks           cache cntl      8-128 bytes
Main memory    M bytes     200-500 ns             $.0001-.00001 cents/bit   Pages            OS              512-4K bytes
Disk           G bytes     10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit   Files            user/operator   Mbytes
Tape           infinite    sec-min                10^-8

iMac G5, 1.6 GHz
[Die photo labels: registers (1K), L1 (64K Instruction), L1 (32K Data), 512K L2]

                  Reg       L1 Inst   L1 Data   L2        DRAM     Disk
Size              1K        64K       32K       512K      256M     80G
Latency (cycles)  1         3         3         11        88       10^7
Latency (time)    0.6 ns    1.9 ns    1.9 ns    6.9 ns    55 ns    12 ms

• Registers: managed by compiler
• L1 and L2 caches: managed by hardware
• DRAM and disk: managed by OS, hardware, application

Goal: illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
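The cycle and time rows of the iMac G5 latency table are related by the clock rate. The following is a minimal sketch of that conversion, assuming the 1.6 GHz clock quoted on the slide; the names `cycles_to_ns` and `LATENCY_CYCLES` are illustrative and not part of the original slides.

```python
# Convert latencies from clock cycles to nanoseconds.
# Assumption: a fixed 1.6 GHz clock, as quoted for the iMac G5 on the slide.
CLOCK_GHZ = 1.6

# Latencies in cycles, taken from the slide's table (disk omitted).
LATENCY_CYCLES = {"Reg": 1, "L1 Inst": 3, "L1 Data": 3, "L2": 11, "DRAM": 88}


def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """One cycle lasts 1/clock_ghz nanoseconds, so divide by the clock rate."""
    return cycles / clock_ghz


if __name__ == "__main__":
    for level, cycles in LATENCY_CYCLES.items():
        print(f"{level:8s} {cycles:3d} cycles = {cycles_to_ns(cycles):.2f} ns")
```

Rounded to one decimal place, the results agree with the time row of the table for registers through DRAM.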

[Figure: memory address versus time, one dot per access, with regions labeled temporal locality, spatial locality, and bad locality behavior. Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]

The Principle of Locality
• Programs access a relatively small portion of the address space at any instant of time. (This is like real life: we all have a lot of friends, but at any given time most of us can only keep in touch with a small group of them.)
• Two different types of locality:
  ◦ Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  ◦ Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 15 years, hardware has relied on locality for speed. Locality is a property of programs which is exploited in machine design.

Hits and misses
• Hit: data appears in some block in the upper level (example: Block X)
  ◦ Hit Rate: the fraction of memory accesses found in the upper level
  ◦ Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
  ◦ Miss Rate = 1 - (Hit Rate)
  ◦ Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty
[Figure: processor exchanging data with the upper level memory (Blk X), which is backed by the lower level memory (Blk Y)]

Measuring performance
• Hit rate: fraction found in that level
  ◦ So high that we usually talk about miss rate instead
  ◦ Miss rate fallacy: miss rate is to average memory access time what MIPS is to CPU performance, a proxy that can mislead
• Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including time to replace in the CPU
  ◦ access time: time to lower level = f(latency to lower level)
  ◦ transfer time: time to transfer block = f(BW between upper & lower levels)

Effective access time
• Te: effective memory access time in a cache memory system
• Tc: cache access time
• Tm: main memory access time
• h: hit rate

Te = Tc + (1 - h) Tm

Example: Tc = 0.4 ns, Tm = 1.2 ns, h = 0.85 (85%)
Te = 0.4 + (1 - 0.85) × 1.2 = 0.58 ns

Four questions for the memory hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
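The two access-time formulas above can be checked numerically. This is a minimal sketch using the slide's example values (cache access time 0.4 ns, main memory access time 1.2 ns, hit rate 85%); the function names are illustrative, and the second call uses made-up numbers purely to show the formula's shape.

```python
def effective_access_time(t_cache, t_memory, hit_rate):
    """Te = Tc + (1 - h) * Tm: every access pays Tc, misses also pay Tm."""
    return t_cache + (1.0 - hit_rate) * t_memory


def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (all in ns or all in clocks)."""
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    # Slide example: Tc = 0.4 ns, Tm = 1.2 ns, h = 0.85.
    te = effective_access_time(0.4, 1.2, 0.85)
    print(f"Te = {te:.2f} ns")

    # Hypothetical values (not from the slides): 1-cycle hit, 5% misses,
    # 20-cycle miss penalty.
    amat = average_memory_access_time(1, 0.05, 20)
    print(f"AMAT = {amat:.2f} cycles")
```

Both functions compute the same quantity: the miss rate is 1 - h, and the main memory access time plays the role of the miss penalty.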

Q1: Where can a block be placed in the upper level?
• Block 12 placed in an 8-block cache:
  ◦ Fully associative, direct mapped, 2-way set associative
  ◦ Set-associative mapping = Block Number modulo Number of Sets
• Direct mapped: (12 mod 8) = 4
• 2-way set associative: (12 mod 4) = 0
• Fully associative: any block frame
[Figure: cache block frames 0-7 shown for the direct-mapped, 2-way set-associative, and fully associative cases, above memory blocks 0-31]

Q2: How is a block found if it is in the upper level?
• Tag on each block
  ◦ No need to check index or block offset
• Increasing associativity shrinks the index and expands the tag

Address layout: | Block Address: Tag | Index | Block Offset |

Q3: Which block should be replaced on a miss?
• Easy for direct mapped
• Set associative or fully associative:
  ◦ Random
  ◦ LRU (Least Recently Used)
  ◦ FIFO, MRU, LFU (least frequently used), MFU
• A randomly chosen block? Easy to implement, but how well does it work?
• The Least Recently Used (LRU) block? Appealing, but hard to implement for high associativity
• Also, try other LRU approximations

Miss rates by cache size, associativity, and replacement policy:

          2-way            4-way            8-way
Size      LRU     Random   LRU     Random   LRU     Random
16 KB     5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB     1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB    1.15%   1.17%    1.13%   1.13%    1.12%   1.12%

Q4: What happens on a write?

                                            Write-Through                 Write-Back
Policy                                      Data written to cache block   Write data only to the cache;
                                            is also written to            update lower level when a
                                            lower-level memory            block falls out of the cache
Debug                                       Easy                          Hard
Do read misses produce writes?              No                            Yes
Do repeated writes make it to lower level?  Yes                           No

Additional option: let writes to an un-cached address allocate a new cache line (“write-allocate”).

Write buffers for write-through caches
[Figure: Processor, Cache, Write Buffer, Lower Level Memory. The write buffer holds data awaiting write-through to lower level memory.]

Q. Why a write buffer?
A. So the CPU doesn’t stall.
Q. Why a buffer, why not just one register?
A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Drain the buffer before the next read, or send the read first after checking the write buffers.
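The placement rule and the tag/index/offset split can be sketched in a few lines. This is an illustration under stated assumptions, not the slides' own code: block frames within a set are assumed to be numbered contiguously, block size and number of sets are assumed to be powers of two, and the names `candidate_frames` and `split_address` are hypothetical.

```python
def candidate_frames(block_number, num_frames, associativity):
    """Return (set index, frames) where a memory block may be placed.

    associativity = 1 is direct mapped; associativity = num_frames is
    fully associative. Frames of a set are assumed to be contiguous.
    """
    if num_frames % associativity != 0:
        raise ValueError("associativity must divide the number of frames")
    num_sets = num_frames // associativity
    set_index = block_number % num_sets  # Block Number modulo Number of Sets
    first = set_index * associativity
    return set_index, list(range(first, first + associativity))


def split_address(address, block_size, num_sets):
    """Split a byte address into (tag, index, block offset)."""
    offset = address % block_size
    block_address = address // block_size
    index = block_address % num_sets
    tag = block_address // num_sets
    return tag, index, offset


if __name__ == "__main__":
    # Block 12 in an 8-frame cache, as on the slide.
    for name, ways in [("direct mapped", 1), ("2-way", 2), ("fully associative", 8)]:
        set_index, frames = candidate_frames(12, 8, ways)
        print(f"{name:18s} set {set_index}, frames {frames}")

    # Fewer sets (higher associativity) means fewer index bits and a longer tag.
    print(split_address(0x1234, block_size=32, num_sets=128))
    print(split_address(0x1234, block_size=32, num_sets=64))
```

Halving the number of sets in the last two calls moves one bit from the index into the tag, which is the effect described under Q2.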

Basic cache optimizations
• Reducing miss rate
  1. Larger block size (compulsory misses)
  2. Larger cache size (capacity misses)
  3. Higher associativity (conflict misses)
• Reducing miss penalty
  4. Multilevel caches
• Reducing hit time
  5. Giving reads priority over writes
     • E.g., a read completes before earlier writes in the write buffer

The physically addressed machine
[Figure: CPU connected to Memory by address lines A0-A31 carrying “physical addresses” of memory locations, and data lines D0-D31]
• All programs share one address space: the physical address space
• Machine language programs must be aware of the machine organization
• No way to prevent a program from accessing any machine resource

The virtually addressed machine
[Figure: CPU issues “virtual addresses” on A0-A31 to an Address Translation unit, which sends “physical addresses” on A0-A31 to Memory; data lines D0-D31 connect CPU and Memory]
• User programs run in a standardized virtual address space
• Address translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
• Hardware supports “modern” OS features: Protection, Translation, Sharing

Three advantages of virtual memory
• Translation:
  ◦ A program can be given a consistent view of memory, even though physical memory is scrambled
  ◦ Makes multithreading reasonable (now used a lot!)
  ◦ Only the most important part of the program (“Working Set”) must be in physical memory
  ◦ Contiguous structures (like stacks) use only as much physical memory as necessary yet can still grow later
• Protection:
  ◦ Different threads (or processes) are protected from each other
  ◦ Different pages can be given special behavior (read only, invisible to user programs, etc.)
  ◦ Kernel data is protected from user programs
  ◦ Very important for protection from malicious programs
• Sharing:
  ◦ Can map the same physical page to multiple users (“Shared memory”)

Page tables
[Figure: a virtual address space mapped through a page table to frames in the physical memory space]
• A virtual address space is divided into blocks of memory called pages
• A machine usually supports pages of a few sizes (MIPS R4000)
• A page table is indexed by a virtual address
• A valid page table entry codes the physical memory “frame” address for the page
• The OS manages the page table for each ASID

Details of the page table
[Figure: the virtual address is split into a virtual page number and a 12-bit offset. The Page Table Base Register plus the virtual page number index into the page table, which is located in physical memory. Each entry holds a valid bit (V), access rights, and a physical address (PA). The physical address is the physical page number followed by the same 12-bit offset.]
• The page table maps virtual page numbers to physical frames (“PTE” = Page Table Entry)
• Virtual memory => treat memory as a cache for disk
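The translation step in the page-table figure can be sketched as a single-level lookup. This is a simplified illustration, assuming one flat page table held in a Python dict, the 12-bit page offset shown on the slide, and entries with only a valid bit and a frame number; `translate`, `PageFault`, and the entry layout are hypothetical names, and access-rights checks are omitted.

```python
PAGE_OFFSET_BITS = 12                      # 12-bit offset, i.e. 4 KB pages
PAGE_OFFSET_MASK = (1 << PAGE_OFFSET_BITS) - 1


class PageFault(Exception):
    """Raised when no valid page table entry exists for a virtual page."""


def translate(virtual_address, page_table):
    """Map a virtual address to a physical address via a flat page table.

    page_table maps virtual page number -> {"valid": bool, "frame": int}.
    """
    vpn = virtual_address >> PAGE_OFFSET_BITS       # virtual page number
    offset = virtual_address & PAGE_OFFSET_MASK     # unchanged by translation
    entry = page_table.get(vpn)
    if entry is None or not entry["valid"]:
        raise PageFault(f"no valid mapping for virtual page {vpn:#x}")
    return (entry["frame"] << PAGE_OFFSET_BITS) | offset


if __name__ == "__main__":
    # Hypothetical mapping: virtual page 0x5 lives in physical frame 0x2A.
    table = {0x5: {"valid": True, "frame": 0x2A}, 0x6: {"valid": False, "frame": 0}}
    print(hex(translate(0x5ABC, table)))
    try:
        translate(0x6000, table)
    except PageFault as fault:
        print("page fault:", fault)
```

In a real system the OS would handle the fault by bringing the page in from disk, which is the sense in which memory acts as a cache for disk.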
