Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 - PDF document

Morgan Kaufmann Publishers 9 May, 2012 §5.1 Introduction Memory Technology � Static RAM (SRAM) �� 0.5ns – 2.5ns, $2000 – $5000 per GB � Dynamic RAM (DRAM) �� 50ns – 70ns, $20 – $75 per GB �� Magnetic disk �� 5ms – 20ms, $0.20 – $2 per GB � Ideal memory � Access time of SRAM � Capacity and cost/GB of disk Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 2 Principle of Locality Taking Advantage of Locality � Programs access a small proportion of � Memory hierarchy their address space at any time � Store everything on disk � Temporal locality � Copy recently accessed (and nearby) � Items accessed recently are likely to be items from disk to smaller DRAM memory accessed again soon � Main memory � e.g., instructions in a loop, induction variables � Copy more recently accessed (and � Spatial locality nearby) items from DRAM to smaller � Items near those accessed recently are likely SRAM memory to be accessed soon � Cache memory attached to CPU � E.g., sequential instruction access, array data Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 3 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 4 §5.2 The Basics of Caches Memory Hierarchy Levels Cache Memory � Block (aka line): unit of copying � Cache memory � May be multiple words � The level of the memory hierarchy closest to � If accessed data is present in the CPU upper level � Given accesses X 1 , …, X n–1 , X n � Hit: access satisfied by upper level � Hit ratio: hits/accesses � If accessed data is absent � How do we know if � Miss: block copied from lower level � Time taken: miss penalty the data is present? � Miss ratio: misses/accesses � Where do we look? = 1 – hit ratio � Then accessed data supplied from upper level Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 5 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 6 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy 1

Morgan Kaufmann Publishers 9 May, 2012 Direct Mapped Cache Tags and Valid Bits � Location determined by address � How do we know which particular block is stored in a cache location? � Direct mapped: only one choice � Store block address as well as the data � (Block address) modulo (#Blocks in cache) � Actually, only need the high-order bits � Called the tag � What if there is no data in a location? � #Blocks is a power of 2 � Valid bit: 1 = present, 0 = not present � Use low-order � Initially 0 address bits Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 7 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 8 Cache Example Cache Example � 8-blocks, 1 word/block, direct mapped Word addr Binary addr Hit/miss Cache block 22 10 110 Miss 110 � Initial state Index V Tag Data Index V Tag Data 000 N 000 N 001 N 001 N 010 N 010 N 011 N 011 N 100 N 100 N 101 N 101 N 110 N 110 Y 10 Mem[10110] 111 N 111 N Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 9 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 10 Cache Example Cache Example Word addr Binary addr Hit/miss Cache block Word addr Binary addr Hit/miss Cache block 26 11 010 Miss 010 22 10 110 Hit 110 26 11 010 Hit 010 Index V Tag Data Index V Tag Data 000 N 000 N 001 N 001 N 010 Y 11 Mem[11010] 010 Y 11 Mem[11010] 011 N 011 N 100 N 100 N 101 N 101 N 110 Y 10 Mem[10110] 110 Y 10 Mem[10110] 111 N 111 N Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 11 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 12 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy 2

Morgan Kaufmann Publishers 9 May, 2012 Cache Example Cache Example Word addr Binary addr Hit/miss Cache block Word addr Binary addr Hit/miss Cache block 16 10 000 Miss 000 18 10 010 Miss 010 3 00 011 Miss 011 16 10 000 Hit 000 Index V Tag Data Index V Tag Data 000 Y 10 Mem[10000] 000 Y 10 Mem[10000] 001 N 001 N 010 Y 11 Mem[11010] 010 Y 10 Mem[10010] 011 Y 00 Mem[00011] 011 Y 00 Mem[00011] 100 N 100 N 101 N 101 N 110 Y 10 Mem[10110] 110 Y 10 Mem[10110] 111 N 111 N Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 13 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 14 Address Subdivision Example: Larger Block Size � 64 blocks, 16 bytes/block � To what block number does address 1200 map? � Block address = � 1200/16 � = 75 � Block number = 75 modulo 64 = 11 31 10 9 4 3 0 Tag Index Offset 22 bits 6 bits 4 bits Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 15 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 16 Block Size Considerations Cache Misses � Larger blocks should reduce miss rate � On cache hit, CPU proceeds normally � Due to spatial locality � On cache miss � But in a fixed-sized cache � Stall the CPU pipeline � Larger blocks � fewer of them � Fetch block from next level of hierarchy � More competition � increased miss rate � Instruction cache miss � Larger blocks � pollution � Restart instruction fetch � Data cache miss � Larger miss penalty � Complete data access � Can override benefit of reduced miss rate � Early restart and critical-word-first can help Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 17 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 18 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy 3

Morgan Kaufmann Publishers 9 May, 2012 Write-Through Write-Back � On data-write hit, could just update the block in � Alternative: On data-write hit, just update cache the block in cache � But then cache and memory would be inconsistent � Keep track of whether each block is dirty � Write through: also update memory � When a dirty block is replaced � But makes writes take longer � e.g., if base CPI = 1, 10% of instructions are stores, � Write it back to memory write to memory takes 100 cycles � Can use a write buffer to allow replacing block � Effective CPI = 1 + 0.1×100 = 11 to be read first � Solution: write buffer � Holds data waiting to be written to memory � CPU continues immediately � Only stalls on write if write buffer is already full Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 19 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 20 Write Allocation Example: Intrinsity FastMATH � Embedded MIPS processor � What should happen on a write miss? � 12-stage pipeline � Alternatives for write-through � Instruction and data access on each cycle � Allocate on miss: fetch the block � Split cache: separate I-cache and D-cache � Write around: don’t fetch the block � Each 16KB: 256 blocks × 16 words/block � Since programs often write a whole block before � D-cache: write-through or write-back reading it (e.g., initialization) � SPEC2000 miss rates � For write-back � I-cache: 0.4% � Usually fetch the block � D-cache: 11.4% � Weighted average: 3.2% Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 21 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 22 Example: Intrinsity FastMATH Main Memory Supporting Caches � Use DRAMs for main memory � Fixed width (e.g., 1 word) � Connected by fixed-width clocked bus � Bus clock is typically slower than CPU clock � Example cache block read � 1 bus cycle for address transfer � 15 bus cycles per DRAM access � 1 bus cycle per data transfer � For 4-word block, 1-word-wide DRAM � Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles � Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 23 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 24 Chapter 5 — Large and Fast: Exploiting Memory Hierarchy 4

Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 - PDF document

Morgan Kaufmann Publishers 9 May, 2012 5.1 Introduction Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM)

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

1 Memory SoC Persistent Memory-Driven Memory Memory Processor-Centric Memory SoC SoC

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Virtual Memory 1 Memory Hierarchy Memory 4GB Cache 1M Registers 1K Question: What if

Personal SE Computer Memory Addresses C Pointers Computer Memory Organization Memory is a

Memory Memory processing is the ability to: Acquire (Short term memory) Manipulate

Memory Management Memory Manager Requirements Minimize primary memory access time

UNIFIED MEMORY IN CUDA 6 MARK HARRIS NVIDIA CONFIDENTIAL Unified Memory Dramatically Lower

Virtual Memory and Virtual Memory and Demand Paging Demand Paging Virtual Memory Illustrated

Dynamic Memory Management 333 Dynamic Memory Management Process Memory Layout Process Memory

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Memory Hierarchy: Caching CSE 141, S2'06 Jeff Brown The memory subsystem Computer Control

Memory Management Ideally programmers want memory that is large fast non

28.05.04 09:50 Memory Management The computer memory is a limited resource so the Memory

Welcome to The Memory Class An Introduction to Memory Problems and the Memory Center Agenda For

Long-Term Memory Introduction STM versus LTM Episodic Memory Semantic Memory

CS654 Advanced Computer Architecture Lec 8 Memory Hierarchy Review Peter Kemper Adapted

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group

A Fast Analytical Model of Fully Associative Caches spcl.inf.ethz.ch @spcl_eth The Cost of Data

TESLA V100 GPU Xudong Shao Houxiang Ji Hao Gao The history of GPU architecture 2017 Volta

Leases and Cache Coherence Leases Lease - a time-limited right to do something - can be renewed

Virtual Memory & Caching (Chapter 12-17) CS 4410 Operating Systems Last Time: Address

Enhancing Software-Defined RAN with Ruozhou Yu, Shuang Qin, Mehdi Bennis, Xianfu Chen,

Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex

Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 - PDF document

Morgan Kaufmann Publishers 9 May, 2012 5.1 Introduction Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM)

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

1 Memory SoC Persistent Memory-Driven Memory Memory Processor-Centric Memory SoC SoC

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Virtual Memory 1 Memory Hierarchy Memory 4GB Cache 1M Registers 1K Question: What if

Personal SE Computer Memory Addresses C Pointers Computer Memory Organization Memory is a

Memory Memory processing is the ability to: Acquire (Short term memory) Manipulate

Memory Management Memory Manager Requirements Minimize primary memory access time

UNIFIED MEMORY IN CUDA 6 MARK HARRIS NVIDIA CONFIDENTIAL Unified Memory Dramatically Lower

Virtual Memory and Virtual Memory and Demand Paging Demand Paging Virtual Memory Illustrated

Dynamic Memory Management 333 Dynamic Memory Management Process Memory Layout Process Memory

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Memory Hierarchy: Caching CSE 141, S2'06 Jeff Brown The memory subsystem Computer Control

Memory Management Ideally programmers want memory that is large fast non

28.05.04 09:50 Memory Management The computer memory is a limited resource so the Memory

Welcome to The Memory Class An Introduction to Memory Problems and the Memory Center Agenda For

Long-Term Memory Introduction STM versus LTM Episodic Memory Semantic Memory

CS654 Advanced Computer Architecture Lec 8 Memory Hierarchy Review Peter Kemper Adapted

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group

A Fast Analytical Model of Fully Associative Caches spcl.inf.ethz.ch @spcl_eth The Cost of Data

TESLA V100 GPU Xudong Shao Houxiang Ji Hao Gao The history of GPU architecture 2017 Volta

Leases and Cache Coherence Leases Lease - a time-limited right to do something - can be renewed

Virtual Memory &amp; Caching (Chapter 12-17) CS 4410 Operating Systems Last Time: Address

Enhancing Software-Defined RAN with Ruozhou Yu, Shuang Qin, Mehdi Bennis, Xianfu Chen,

Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex

Virtual Memory & Caching (Chapter 12-17) CS 4410 Operating Systems Last Time: Address