Large and Fast: Exploiting Memory Hierarchy - PDF document

微算機系統第七章 Large and Fast: Exploiting Memory Hierarchy 陳伯寧教授電信工程學系國立交通大學 chap7-1 Po-Ning Chen@CM.NCTU Memories: Review � SRAM: � value is stored on a pair of inverting gates � very fast but takes up more space than DRAM (4 to 6 transistors) � DRAM: � value is stored as a charge on capacitor (must be refreshed) � very small but slower than SRAM (factor of 5 to 10) Wordline Wordline Passtransistor Passtransistor Capacitor Capacitor Bitline Bitline Po-Ning Chen@CM.NCTU chap7-2

Exploiting memory hierarchy � Users want large and fast memories! (Information collected on 2004) � SRAM access times are 0.5~5ns at cost of US$4,000~$10,000 per Gbyte. � DRAM access times are 50~70ns at cost of US$100~200 per Gbyte. � Disk access times are 5 to 20 million ns at cost of US$0.5~$2 per Gbyte. � A cost-effective approach � build a memory hierarchy Po-Ning Chen@CM.NCTU chap7-3 Exploiting memory hierarchy � An unbreakable rule � The data cannot be present in level i unless it is present in level i+1. CPU CPU Increasing distance Increasing distance Level 1 Level 1 from the CPU in from the CPU in access time access time Level 2 Level 2 Levels in the Levels in the Memory hierarchy Memory hierarchy Level Level n n Size of the memory at each level Size of the memory at each level Po-Ning Chen@CM.NCTU chap7-4

Locality � Locality: A principle that makes having a memory hierarchy a good idea � If an item is referenced, temporal locality (locality in time) : it will tend to be referenced again soon spatial locality (locality in space): nearby items will tend to be referenced soon. � Our initial focus: Take two-level hierarchy as an example (upper, lower) � block: minimum unit of data transferring between two levels � hit: data requested is in the upper level � miss: data requested is not in the upper level Po-Ning Chen@CM.NCTU chap7-5 Locality � Example of temporal locality in programs � Instructions in Loops � Example of spatial locality in programs � Sequentially executed instructions � Example of two-level hierarchy � Main memory (DRAM) – lower level � Cache (SRAM) – upper level Po-Ning Chen@CM.NCTU chap7-6

Terminologies � Hit rate or hit ratio � Fraction of memory accesses found in the upper level � Miss rate or miss ratio � Fraction of memory accesses not found in the upper level � Hit time � The time to access the upper level of memory hierarchy, including the time to determine whether the access is a hit or a miss, but not including the retrieve time of a block from lower memory to higher memory when a miss occurs. � The block retrieve time is referred to as miss penalty. Po-Ning Chen@CM.NCTU chap7-7 How to take advantage of program locality? � Keep more recently accessed data items closer to the processor (take advantage of temporal locality) � Move blocks consisting of multiple continuous words in memory ((take advantage of spatial locality) Po-Ning Chen@CM.NCTU chap7-8

Cache � Two issues: � How do we know if a data item is in the cache? � If it is, how do we find it? � Our first (simplified) example: � block size is one word of data � "direct mapped" For each item of data at the lower level, there is exactly one location in the cache where it might be. e.g., lots of items at the lower level share locations in the upper level Po-Ning Chen@CM.NCTU chap7-9 Direct mapped cache � Mapping: Cache Cache cache block Advantage: If the number of 000 001 000 001 010 010 011 011 100 100 101 101 110 110 111 111 address = cache blocks is a power of memory block two, the cache can be assessed directly with the lower-order address modulo bits (e.g., 3 bits in this graph). # of cache blocks in the cache 00 00 00 01 1 00 01 10 01 1 0 01 10 00 01 1 01 11 10 01 1 10 00 00 01 1 10 01 10 01 1 11 10 00 01 1 11 11 10 01 1 0 0 0 1 1 1 1 Memory Po-Ning Chen@CM.NCTU Memory chap7-10

In this design, In this design, Direct mapped cache what kind of what kind of for MIPS locality are we locality are we Address (showing bit positions) Address (showing bit positions) taking advantage taking advantage 3 3 1 1 3 3 0 0 1 1 3 3 1 1 2 2 1 1 1 1 2 1 2 1 0 0 Byte Byte of? of? offset offset 1 0 0 2 0 2 0 1 Hit Hit D a D a t ta a T a T a g g I I n nd d e e x x In n d de e x x Valid Tag Data I Valid Tag Data 0 0 1 1 2 2 1 0 1 0 2 2 1 1 1 1 0 0 2 2 2 2 1 0 0 2 2 3 3 1 2 0 2 0 3 3 2 2 Need a tag to identify a hit. Need a valid bit for data validity test. chap7-11 Po-Ning Chen@CM.NCTU A true example: Caches in DECStation 3100 Address (showing bit positions) Address (showing bit positions) 3 1 1 3 3 0 0 1 7 7 1 1 6 6 1 1 5 5 2 1 1 0 0 3 1 2 Byte Byte offset offset 1 6 6 1 1 4 4 1 Hit Hit D a D a t ta a T a T a g g I n I nd d e e x x In I n d de e x x Valid Valid Tag Tag Data Data 0 0 1 1 2 2 1 6 6 3 3 8 8 1 1 1 1 6 6 3 3 8 8 2 2 1 1 6 1 6 3 3 8 8 3 3 1 6 1 6 3 2 3 2 Two caches of the same structure: Instruction cache and data cache. chap7-12 Po-Ning Chen@CM.NCTU

Miss rate for DECStation 3100 Program Instruction Data miss Overall miss miss rate rate rate gcc 6.1% 2.1% 5.4% spice 1.2% 1.3% 1.2% Po-Ning Chen@CM.NCTU chap7-13 How about a hit in memory write? � “Memory read” is straightforward for a cache system. � A miss causes a re-fill of the cache from the memory. � How about a hit in memory write? � Can we write into the cache without changing the respective memory content? � Answer: Yes, but the “inconsistent” between the cache content and memory content should be explicitly handled. � Approach taken by DECStation 3100: Write-through cache � Always write the data into both the memory and cache. Po-Ning Chen@CM.NCTU chap7-14

How about a miss in memory write? � Approach taken by DECStation 3100 for a miss in memory write: Write-through cache � Always write the data into both the memory and cache. � Observation � Poor performance of a write-through cache: A memory access is resulted for every write, regardless of whether it is a miss or a hit. � Example: Program gcc, whose CPI without considering memory miss is 1.2, requires 13% of memory write. If a memory-write requires 10 cycles, then the resultant CPI = 1.2 + 10 * 13% = 2.5—two times slower in performance. Po-Ning Chen@CM.NCTU chap7-15 Performance improvement of write- through caches � Solution 1: Using write buffer � Writing to a quick write buffer instead of directly to memory. � Then “execution” and “buffer-to-memory-update” can be done simultaneously. CPU- -generate generate- -write rate write rate Buffer- -memory memory- -update rate update rate CPU Buffer � Solution 2: Write-back cache � When a write occurs, the new value is written only to the block in the cache. � The modified block is written to memory when a miss causes a replace in that block. Po-Ning Chen@CM.NCTU chap7-16

Direct mapped cache considering spatial locality A Ad dd dr re es ss s ( (s sh ho ow wi in ng g b bi it t p po os si it ti io on ns s) ) Taking advantage of 3 31 1 16 1 6 1 15 5 4 4 3 3 2 2 1 1 0 0 spatial locality: 1 16 6 1 12 2 2 2 B By yt te e Hi H it t D Da at ta a Ta T ag g o of ff fs se et t In I nd de ex x Bl B lo oc ck k o of ff fs se et t 16 1 6 b bi it ts s 12 1 28 8 b bi it ts s V V Ta ag g Da D at ta a T 4K 4 K e en nt tr ri ie es s 1 16 6 32 3 2 32 3 2 32 3 2 3 32 2 Mu M ux x 32 3 2 Po-Ning Chen@CM.NCTU chap7-17 Hit versus miss: A summary � Read hits � this is what we want! � Read misses � stall the CPU, fetch block from memory, deliver to cache, restart � Write hits: � can replace data in cache and memory (write-through) � write the data only into the cache (write-back the cache later) � Write misses: � read the entire block into the cache, then write the word Po-Ning Chen@CM.NCTU chap7-18

Hit versus miss: A summary � “Block transfer latency” could be a penalty to large block size. Solutions: � Early restart – Resume execution as soon as the requested word of the block is returned. � Requested word first or critical word first – Memory transfer starts with the address of the requested word; then wraps around to the beginning of the block. � Large block transfer bandwidth – E.g., structures b and c in the next slide. Po-Ning Chen@CM.NCTU chap7-19 Hardware issues � Make reading multiple words easier by using banks of memory C CP PU U CP PU U CP C PU U C M Mu ul lt ti ip pl le ex xo or r C Ca ac ch he e Ca C ac ch he e Ca C ac ch he e Bu B us s Bu B us s Bu B us s M Me em mo or ry y M Me em mo or ry y M Me em mo or ry y M Me em mo or ry y Me M em mo or ry y b ba an nk k 0 0 ba b an nk k 1 1 b ba an nk k 2 2 ba b an nk k 3 3 c. . I In nt te er rl le ea av ve ed d m me em mo or ry y o or rg ga an ni iz za at ti io on n b. b . W Wi id de e m me em mo or ry y o or rg ga an ni iz za at ti io on n c Me em mo or ry y M a. a . O On ne e- -w wo or rd d- -w wi id de e me m em mo or ry y o or rg ga an ni iz za at ti io on n Po-Ning Chen@CM.NCTU chap7-20

Large and Fast: Exploiting Memory Hierarchy - PDF document

Large and Fast: Exploiting Memory Hierarchy chap7-1 Po-Ning Chen@CM.NCTU Memories: Review SRAM: value is stored on a pair of inverting gates very

Why memory hierarchy (3 rd Ed: p.468-487, 4 th Ed: p. 452-470) users want unlimited fast

Virtual Memory 1 Memory Hierarchy Memory 4GB Cache 1M Registers 1K Question: What if

Memory Hierarchy Design Memory Hierarchy Design Chapter 5 and Appendix C 1 Overview

What Is Memory Hierarchy A typical memory hierarchy today: Lecture 13: Cache Basics and Cache

Memory Hierarchy Motivation, Definitions, Four Questions about Memory Hierarchy Soner Onder

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5.1 Introduction Principle of Locality

Chapt hapter er 5 5 Large and Fast: Exploiting Memory Hierarchy 5.1 Introduction

Abstractions for Practical Systems Caching and the memory hierarchy Operating systems and the

1 5.1 Introduction A Typical Memory Hierarchy A Typical Memory Hierarchy Memory Technology

Slide Set #15: Exploiting Memory Hierarchy 1 ADMIN Chapter 5 Reading 5.1, 5.3, 5.4 2

9.4 Local Perception Filters 9.4 Local Perception Filters Exploiting Exploiting Perceptual

Memory Management Ideally programmers want memory that is large fast non

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Memory Hierarchy Philipp Koehn 14 October 2019 Philipp Koehn Computer Systems Fundamentals:

Memory Hierarchy: Caching CSE 141, S2'06 Jeff Brown The memory subsystem Computer Control

EE 457 Unit 7a Cache and Memory Hierarchy 2 Memory Hierarchy & Caching Use several

SOUTH AFRICA IRF REGIONAL CONFERENCE The N3Toll Concession(Pty)Ltd: Lessons Learnt during the

Track and Trace Applying Technology To The Paddock to Plate Supply Chain Potential for Error is

Climate Change Impact Assessment over West Central Florida using CLAREnCE10 data from FSU Results

SMAC (Social Media Advisory Council ) October 18, 2011 | QR Codes QR Codes - How They Work

Generic Programming Galore Using D Andrei Alexandrescu, PhD Research Scientist Facebook 1 / 33

FEM FEMA A Ove Overvi rview Exte ew Externa rnal Comme l Commerci rcial al Borr

Sunteck Realty Ltd. Disclaimer By attending the meeting where this presentation is made and any

Q3 FY2016 Earnings Presentation February 5, 2016 Important Notice This presentation contains

Large and Fast: Exploiting Memory Hierarchy - PDF document

Large and Fast: Exploiting Memory Hierarchy chap7-1 Po-Ning Chen@CM.NCTU Memories: Review SRAM: value is stored on a pair of inverting gates very

Why memory hierarchy (3 rd Ed: p.468-487, 4 th Ed: p. 452-470) users want unlimited fast

Virtual Memory 1 Memory Hierarchy Memory 4GB Cache 1M Registers 1K Question: What if

Memory Hierarchy Design Memory Hierarchy Design Chapter 5 and Appendix C 1 Overview

What Is Memory Hierarchy A typical memory hierarchy today: Lecture 13: Cache Basics and Cache

Memory Hierarchy Motivation, Definitions, Four Questions about Memory Hierarchy Soner Onder

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5.1 Introduction Principle of Locality

Chapt hapter er 5 5 Large and Fast: Exploiting Memory Hierarchy 5.1 Introduction

Abstractions for Practical Systems Caching and the memory hierarchy Operating systems and the

1 5.1 Introduction A Typical Memory Hierarchy A Typical Memory Hierarchy Memory Technology

Slide Set #15: Exploiting Memory Hierarchy 1 ADMIN Chapter 5 Reading 5.1, 5.3, 5.4 2

9.4 Local Perception Filters 9.4 Local Perception Filters Exploiting Exploiting Perceptual

Memory Management Ideally programmers want memory that is large fast non

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Memory Hierarchy Philipp Koehn 14 October 2019 Philipp Koehn Computer Systems Fundamentals:

Memory Hierarchy: Caching CSE 141, S2'06 Jeff Brown The memory subsystem Computer Control

EE 457 Unit 7a Cache and Memory Hierarchy 2 Memory Hierarchy &amp; Caching Use several

SOUTH AFRICA IRF REGIONAL CONFERENCE The N3Toll Concession(Pty)Ltd: Lessons Learnt during the

Track and Trace Applying Technology To The Paddock to Plate Supply Chain Potential for Error is

Climate Change Impact Assessment over West Central Florida using CLAREnCE10 data from FSU Results

SMAC (Social Media Advisory Council ) October 18, 2011 | QR Codes QR Codes - How They Work

Generic Programming Galore Using D Andrei Alexandrescu, PhD Research Scientist Facebook 1 / 33

FEM FEMA A Ove Overvi rview Exte ew Externa rnal Comme l Commerci rcial al Borr

Sunteck Realty Ltd. Disclaimer By attending the meeting where this presentation is made and any

Q3 FY2016 Earnings Presentation February 5, 2016 Important Notice This presentation contains

EE 457 Unit 7a Cache and Memory Hierarchy 2 Memory Hierarchy & Caching Use several