Memory Hierarchy
Heechul Yun
Topics
– Introduction to Real-Time Systems, CPS
– CPS Applications
– Real-time architecture/OS
– Fault tolerance, safety, security
– Amazon Prime Air
– Real-time cache and DRAM controller designs
– Real-time microarchitecture/OS support
– Real-time support for GPU/FPGA
3
– Performance: average timing
– Determinism: variance and worst-case timing
– Focused on determinism
– So that the system can be analyzed at design time
– Many challenges exist in computer architecture
– In general, performance demands were not high
– Such as self-driving cars and UAVs (intelligent robots)
– Demand both performance and determinism
– More difficult to satisfy both
4
5
[Figure: performance vs. predictability – high-performance architecture, real-time architecture, and the goal: a high-performance real-time architecture]
6
[Figure: multicore SoC – Core1, Core2, GPU, and accelerator sharing the memory controller (MC), shared cache, and DRAM]
“… toward μP based platforms”
“… predictable real-time behavior on high-performance platforms”
7
(*) Arne Hamann (Bosch), “Industrial Challenges: moving from classical to high-performance real-time systems." In Waters 2019
8
https://www.faa.gov/aircraft/air_cert/design_approvals/air_software/cast/cast_papers/media/cast-32A.pdf
– Guidance from certification agencies on the use of multicore processors
– Concerned with interference channels that affect software timing
– Certification requires showing that all interference channels are taken care of (“robust partitioning”) – nobody can do this (yet)
9
– Shared hardware resources can leak secrets
10
https://meltdownattack.com/
11
[Figure: multicore SoC – Core1, Core2, GPU, and accelerator sharing the memory controller (MC), shared cache, and DRAM]
– Mapping: physical address → mapping function → set index (sketched below)
– Replacement: select a victim line among the ways
– It just works!
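For concreteness, a minimal C sketch of the mapping step (the line size and set count below are illustrative, not those of a particular processor):

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  64          /* cache-line size L in bytes (illustrative) */
#define NUM_SETS   1024        /* number of sets S (illustrative)           */

/* Split a physical address into line offset, set index, and tag. */
static void cache_map(uint64_t paddr)
{
    uint64_t offset = paddr % LINE_SIZE;                 /* bits [0:5]   */
    uint64_t set    = (paddr / LINE_SIZE) % NUM_SETS;    /* bits [6:15]  */
    uint64_t tag    = paddr / (LINE_SIZE * NUM_SETS);    /* bits [16:..] */
    printf("paddr=0x%llx -> set=%llu tag=0x%llx offset=%llu\n",
           (unsigned long long)paddr, (unsigned long long)set,
           (unsigned long long)tag, (unsigned long long)offset);
}

int main(void)
{
    cache_map(0x12345678);
    return 0;
}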
12
13
[Figure: cache organization – a physical address is split into tag, set index, and cache-line offset; the cache has S sets with cache-lines of size L]
14
[Figure: a 4-way set-associative cache – the tag and set index fields of the physical address select one of the S sets, which holds 4 cache-lines (ways 1–4) of size L]
– Evict the least recently used cache-line
– “Good” (analyzable) policy; tight analysis exists
– Expensive to maintain the full ordering; not used for large caches
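A minimal C sketch of LRU bookkeeping for a single 4-way set, using age counters instead of an ordered list (illustrative only):

#include <stdint.h>

#define WAYS 4

/* Per-set LRU state: age 0 = most recently used, WAYS-1 = least recently used.
 * The ages always form a permutation of 0..WAYS-1. */
struct lru_set {
    uint64_t tag[WAYS];
    int      valid[WAYS];
    int      age[WAYS];
};

void lru_init(struct lru_set *s)
{
    for (int i = 0; i < WAYS; i++) {
        s->valid[i] = 0;
        s->age[i] = i;          /* any permutation works as a starting order */
    }
}

/* On a hit to (or fill of) way 'w': w becomes MRU, ways that were more
 * recent than w age by one. */
void lru_touch(struct lru_set *s, int w)
{
    for (int i = 0; i < WAYS; i++)
        if (s->age[i] < s->age[w])
            s->age[i]++;
    s->age[w] = 0;
}

/* On a miss: pick an invalid way if there is one, otherwise the LRU way. */
int lru_victim(const struct lru_set *s)
{
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!s->valid[i])
            return i;
        if (s->age[i] > s->age[victim])
            victim = i;
    }
    return victim;
}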
15
– Use a binary tree
– Each node records which half is older
– On a miss, follow the older path and flip the bits along the way
– Approximates LRU; no need to sort; practical
– But the analysis is more pessimistic
16
[Figure: tree-PLRU for 8 cache-lines L0–L7 – a binary tree of direction bits where each bit points toward the older half]
Image credit: Prof. Mikko H. Lipasti
17
Image credit: https://en.wikipedia.org/wiki/Pseudo-LRU
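A minimal C sketch of tree-PLRU for an 8-way set (the bit convention below – a set bit points toward the half holding the next victim – is one of several equivalent conventions, so treat this as illustrative):

#include <stdio.h>

#define WAYS 8                  /* 8 ways -> 7 tree nodes */

static int plru_bits[WAYS - 1]; /* node i's children are 2i+1 and 2i+2;
                                   bit = 0: victim is in the left subtree,
                                   bit = 1: victim is in the right subtree */

/* Follow the direction bits from the root to find the victim way. */
int plru_victim(void)
{
    int node = 0;
    while (node < WAYS - 1)
        node = 2 * node + 1 + plru_bits[node];
    return node - (WAYS - 1);   /* leaf index -> way number */
}

/* On an access to 'way', flip the bits on the root-to-leaf path so that
 * they point away from the just-used way. */
void plru_touch(int way)
{
    int node = way + (WAYS - 1);            /* leaf index */
    while (node > 0) {
        int parent = (node - 1) / 2;
        plru_bits[parent] = (node == 2 * parent + 1) ? 1 : 0;
        node = parent;
    }
}

int main(void)
{
    for (int w = 0; w < 4; w++)             /* touch ways 0..3 */
        plru_touch(w);
    printf("next victim: way %d\n", plru_victim());  /* a way among 4..7 */
    return 0;
}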
– One MRU bit per cache-line
– Set to 1 on access; when the last remaining 0 bit is set to 1, all other bits are reset to 0
– On a cache miss, the line with the lowest index whose MRU bit is 0 is replaced
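A minimal C sketch of this MRU-bit policy (illustrative):

#include <stdint.h>

#define WAYS 8

static uint8_t mru[WAYS];       /* one MRU bit per cache-line (way) */

/* On an access: set the way's MRU bit; if that was the last remaining
 * 0 bit, reset all the other bits to 0. */
void mrubit_touch(int way)
{
    mru[way] = 1;
    for (int i = 0; i < WAYS; i++)
        if (mru[i] == 0)
            return;             /* some 0 bit remains, nothing to reset */
    for (int i = 0; i < WAYS; i++)
        mru[i] = (i == way);    /* all bits were 1: keep only the newest */
}

/* On a miss: replace the lowest-indexed way whose MRU bit is 0. */
int mrubit_victim(void)
{
    for (int i = 0; i < WAYS; i++)
        if (mru[i] == 0)
            return i;
    return 0;                   /* unreachable if touch() keeps a 0 bit */
}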
18
Udacity Lecture: https://www.youtube.com/watch?v=8CjifA2yw7s
– The manual (if you are lucky)
– Reverse engineering
19
Image source: [Abel and Reineke, RTAS 2013]
– Problem: the longest path can take less time to finish than shorter paths if your system has a cache(s)!
– Path 1: 1000 instructions, 0 cache misses
– Path 2: 500 instructions, 100 cache misses
– Cache hit: 1 cycle; cache miss: 100 cycles
– Path 2 takes much longer
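– A rough calculation (assuming every miss fully stalls the pipeline): Path 1 ≈ 1000 × 1 = 1,000 cycles, while Path 2 ≈ 400 × 1 + 100 × 100 = 10,400 cycles, so the “shorter” path determines the WCET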
20
– Problem: extremely pessimistic
– 1000 instructions, 100 mem accesses, 10 misses
– Actual: 900 + 90 × 1 + 10 × 100 = 1,990 ≈ 2,000 cycles
– WCET (all-miss): 900 + 100 × 100 = 10,900 ≈ 11,000 cycles
21
– To reduce pessimism in WCET estimation
– If we make some assumptions about the cache (e.g., the replacement policy and the accessed addresses)
– Then we can statically determine hits/misses
22
23
[Figure: multicore SoC – Core1, Core2, GPU, and accelerator sharing the memory controller (MC), shared cache, and DRAM]
– Way partitioning: requires h/w support
– Set partitioning: can be done in s/w as long as there is an MMU
– Page coloring
24
– E.g., Freescale P4080, Intel
25
[Figure: way partitioning – the ways (1–4) of the shared cache are divided among Core1–Core4]
– Intel’s way partitioning mechanism
– Maps a thread/VM logical id → a resource (cache) partition
– CAT: Cache Allocation Technology
– CMT: Cache Monitoring Technology
– MBM: Memory Bandwidth Monitoring
– CDP: Code/Data Prioritization
– C. Peng, “Achieving QoS in Server Virtualization,” 2016
[Figure: set partitioning – the cache sets of the shared cache are divided among Core1–Core4]
– Page coloring: control the physical addresses (and hence the cache set indices) of allocated pages
– Allocate pages of certain colors only to certain CPU cores
32
[Figure: page coloring – the physical page chosen for an allocated page determines which group of cache sets (color) it maps to]
33
Color index: OS-controlled address bits
[Figure: physical address fields in a 32-bit example
– Page offset: bits 0–11; physical page frame number: bits 12–31
– L1 cache (private): cache-line offset bits 0–5, set index bits 6–13, tag bits 14–31
– L2 cache (shared): cache-line offset bits 0–5, set index bits 6–16, tag bits 17–31
– OS-controlled bits for L2 partitioning: bits 12–16, where the L2 set index overlaps the page frame number]
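A minimal C sketch of how the OS can compute a page’s color from these overlapping bits (assuming the 32-bit example above, i.e., 4KB pages and color bits 12–16; the constants are illustrative):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12          /* 4KB pages: page offset = bits 0-11          */
#define COLOR_BITS  5           /* bits 12-16: overlap of L2 set index and the
                                   physical page frame number (example above)  */

/* Cache color of a physical page: the low frame-number bits that also
 * feed the shared cache's set index. */
unsigned page_color(uint64_t paddr)
{
    return (paddr >> PAGE_SHIFT) & ((1u << COLOR_BITS) - 1);
}

int main(void)
{
    /* Two pages that are 128KB (32 pages) apart share the same color. */
    printf("color of 0x00012000 = %u\n", page_color(0x00012000));
    printf("color of 0x00032000 = %u\n", page_color(0x00032000));
    return 0;
}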
34
[Figure: address bits used by each cache level on an Intel platform
– (private) L1: 32K I/D (4/8-way), 4KB/way (12 bits)
– (private) L2: 256KB (8-way), 32KB/way (12 + 3 bits)
– (shared) L3: 8MB (16-way), 512KB/way (12 + 7 bits)
– DRAM bank bits overlap with the cache set-index bits (around bits 12–14 and 19–21 in the figure)]
– Partitioning the L3 using bits 12–13 also partitions the L2 cache and DRAM banks due to the address overlap
35
– Slice id + set index → unique cache location
[Liu et al., IEEE S&P 2015] “Last-Level Cache Side-Channel Attacks are Practical”
36
– Static: assign the mapping statically, once
  – Pros: simple
  – Cons: what if the assignment is not ideal?
– Dynamic: the assignment may change over time
  – Pros: can adapt to changes in behavior
  – Cons: page recoloring is costly
37
– Pros
– Cons
– Pros
– Cons
38
39
– MSHRs, WBBuffer, …
[Figure: multicore SoC – Core1, Core2, GPU, and accelerator sharing the memory controller (MC), shared cache, and DRAM]
– Bigger, more complex applications
– Large, high-bandwidth data processing
– DRAM is often a performance bottleneck
40
– Out-of-order core
– Multicore
– Accelerator
41
performance?
Part 1 Part 2 Part 3 Part 4
42
[Figure: Core1–Core4, per-core LLCs, a memory controller, and DRAM]
43
[Figure: a multicore chip – CORE 0–3, each with a private L2 cache (L2 CACHE 0–3), a SHARED L3 CACHE, a DRAM INTERFACE / DRAM MEMORY CONTROLLER, and DRAM BANKS]
This slide is from Prof. Onur Mutlu
44
[Figure: DRAM organization (channels, DIMMs, ranks) – the processor connects to DIMMs (dual in-line memory modules) over memory channels; a DIMM has a front and a back side; Rank 0 (front) and Rank 1 (back) are each a collection of 8 chips that share the channel’s 64-bit data bus (Data <0:63>), chip-select lines (CS <0:1>), and Addr/Cmd signals]
This slide is from Prof. Onur Mutlu
[Figure: breaking down a rank and a chip – Rank 0 consists of Chip 0 … Chip 7; each chip drives 8 bits of the 64-bit data bus (<0:7>, <8:15>, …, <56:63>); inside a chip, multiple banks (Bank 0, …) share the chip’s 8-bit data interface]
This slide is from Prof. Onur Mutlu
[Figure: breaking down a bank – Bank 0 is an array of rows (row 0 … row 16k-1), each 2kB wide; an activated row is held in the row-buffer, from which 1B columns are read or written over the chip’s 8-bit interface]
This slide is from Prof. Onur Mutlu
[Figure: how a 64B cache block maps to DRAM – the physical memory space (0x00, 0x40, …, 0xFFFF…F) is divided into 64B cache blocks; a block maps to a channel, DIMM, and rank (here Channel 0, DIMM 0, Rank 0); within the rank, each 8B beat is striped across Chips 0–7 (8 bits per chip), and successive 8B beats come from successive columns of the same row (Row 0, Col 0, Col 1, …)]
A 64B cache block takes 8 I/O cycles to transfer. During the process, 8 columns are read sequentially.
This slide is from Prof. Onur Mutlu
[Figure: Core1–Core4 share an L3 cache and a memory controller (MC) connected to a DRAM DIMM with Bank 1–Bank 4]
– DRAM banks can be accessed in parallel
– Example: DDR3 1333MHz (peak ≈ 10.6 GB/s)
– In practice, less than the peak bandwidth – how much?
[Figure: DRAM bank operation – each bank has rows (Row 1 … Row 5) and a row buffer; a READ (Bank 1, Row 3, Col 7) may require a precharge and an activate before the read/write of Col 7 from the row buffer]
– Row miss: 19 cycles, row hit: 9 cycles
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
65
Kim et al., “Bounding Memory Interference Delay in COTS-based Multi-Core Systems,” RTAS’14
– Must satisfy DRAM timing/resource constraints
– Translates requests into DRAM command sequences
– Timing constraints: e.g., minimum write-to-read delay, activation time, …
– Resource conflicts: bank, bus, channel
– Buffers, reorders, and pipelines requests when scheduling them
66
– Buffers read/write requests from CPU cores
– Unpredictable queuing delay due to reordering
67
Bruce Jacob et al, “Memory Systems: Cache, DRAM, Disk” Fig 13.1.
68
[Figure: request reordering
– Initial queue: Core1: READ Row 1, Col 1 → Core2: READ Row 2, Col 1 → Core1: READ Row 1, Col 2 (2 row switches)
– Reordered queue: Core1: READ Row 1, Col 1 → Core1: READ Row 1, Col 2 → Core2: READ Row 2, Col 1 (1 row switch)]
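A minimal C sketch of the first-ready, first-come-first-served (FR-FCFS) choice that produces this reordering (illustrative; a real controller also tracks per-bank timing state):

#include <stddef.h>

struct mem_req {
    int core;
    int bank;
    int row;
    int col;
};

/* Pick the next request: prefer the oldest row-hit request (its row is
 * already open in the target bank's row buffer); otherwise the oldest one. */
int frfcfs_pick(const struct mem_req *q, size_t n, const int *open_row)
{
    for (size_t i = 0; i < n; i++)              /* oldest-first scan */
        if (q[i].row == open_row[q[i].bank])
            return (int)i;                      /* first-ready (row hit) */
    return n > 0 ? 0 : -1;                      /* fall back to the oldest */
}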
– Open-page policy: keep the row open after an access
– Close-page policy: close (precharge) the row after an access
69
– High/low watermark based switching
70
PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms
Heechul Yun*, Renato Mancuso+, Zheng-Pei Wu#, Rodolfo Pellizzoni# *University of Kansas, +University of Illinois , #University of Waterloo
71
[Figure (recap): Core1–Core4 share an L3 cache and a memory controller (MC) connected to a DRAM DIMM with Bank 1–Bank 4]
– DRAM banks can be accessed in parallel
– Example: DDR3 1333MHz
– In practice, less than the peak bandwidth – how much?
77
– Row misses are slow but can overlap when they target different banks
– Row hits are fast but can still suffer interference on the shared bus
Heechul Yun, Rodolfo Pellizzoni, Prathap Kumar Valsan. Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems. Euromicro Conference on Real-Time Systems (ECRTS), 2015. [pdf] [ppt]
[Figure: Core1–Core4 on an SMP OS sharing the L3 and memory controller; each task’s pages are spread over multiple DRAM banks]
– Memory pages are spread all over the DRAM banks
– Unpredictable memory performance
78
[Figure: a DRAM bank-aware SMP OS allocating pages to specific banks through the memory controller]
– The OS is aware of the DRAM bank mapping
– Pages can be allocated to a desired DRAM bank
– Flexible allocation policy
79
[Figure: private banking – each core’s pages are placed on its own DRAM banks]
– Allocate each core’s pages on exclusively assigned banks
– Eliminates inter-core bank conflicts
80
mechanism
81
82
[Figure: identified DRAM bank, channel, and cache-set address bits for the two evaluation platforms]
– Intel Xeon 3530 + 4GiB DDR3 DIMM (16 banks)
– Freescale P4080 + 2x2GiB DDR3 DIMM (32 banks)
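As an illustration only, a C sketch of what such a mapping function looks like; the bit positions below are hypothetical, and the real mapping must be reverse-engineered per platform (e.g., with the map detector linked below):

#include <stdint.h>

/* Hypothetical mapping: 16 banks selected by bits 13-16 XORed with
 * bits 17-20. XOR schemes like this are common, but the exact bits
 * differ per memory controller and must be measured. */
unsigned dram_bank(uint64_t paddr)
{
    unsigned lo = (paddr >> 13) & 0xF;
    unsigned hi = (paddr >> 17) & 0xF;
    return lo ^ hi;
}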
83
https://github.com/heechul/palloc/blob/master/README-map-detector.md
84
https://github.com/IAIK/drama
85
– DRAM bank-aware page frame allocation at each page fault
86
87
When does an application allocate memory from the kernel?
– On a page fault
– The kernel allocates a page (e.g., 4KB)
User-level allocator (e.g., malloc)
– Doesn’t physically allocate pages
– Manages the process’s heap
– Variable-size objects in the heap
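A small C experiment (Linux-specific) that shows physical frames are only allocated on first touch; mincore() reports which pages are resident:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t npages = 16, pgsz = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char vec[16];
    int resident;

    /* Reserve 16 pages of anonymous memory: no physical frames yet. */
    char *buf = mmap(NULL, npages * pgsz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    resident = 0;
    mincore(buf, npages * pgsz, vec);
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;
    printf("resident after mmap : %d pages\n", resident);   /* typically 0 */

    buf[0] = 1;                  /* first touch -> page fault -> frame allocated */
    buf[pgsz] = 1;               /* touch a second page */

    resident = 0;
    mincore(buf, npages * pgsz, vec);
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;
    printf("resident after touch: %d pages\n", resident);   /* typically 2 */

    munmap(buf, npages * pgsz);
    return 0;
}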
88
– Page granularity (4K): the buddy allocator
– Fine-grained allocations: the slab, kmalloc, and vmalloc allocators
89
90
[Figure: Linux kernel memory allocators – kernel code uses kmalloc for arbitrary-size objects (backed by the SLAB allocator’s multiple fixed-size object caches), vmalloc for (large) non-physically-contiguous memory, and the buddy page allocator, which hands out power-of-two numbers of pages: 4K, 8K, 16K, …]
– A free chunk is split into two buddies of the next-lower power of 2
91
[Figure: buddy allocator free lists, one per chunk size – 4KB, 8KB, 16KB, 32KB, …]
– Assume 256KB chunk available, kernel requests 21KB
92
[Figure: buddy splitting for the 21KB (→ 32KB) request – 256KB Free → 128KB + 128KB Free → 64KB (to split) + 64KB Free + 128KB Free → 32KB (A, allocated) + 32KB Free + 64KB Free + 128KB Free]
– Free A
93
[Figure: coalescing after freeing A – the freed 32KB merges with its 32KB buddy into 64KB, then with the free 64KB into 128KB, and finally with the free 128KB back into a single 256KB free chunk]
94
Simplified Pseudocode
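A minimal, self-contained C sketch of the buddy splitting logic, matching the 256KB/21KB example above (illustrative only; the actual kernel code in __rmqueue_smallest() operates on per-order free lists of struct page):

#include <stdio.h>

#define MAX_ORDER 11                 /* orders 0..10: 4KB .. 4MB chunks */

static int free_count[MAX_ORDER];    /* free chunks available per order */

/* Allocate one chunk of 2^order pages; returns 0 on success, -1 if no memory. */
static int buddy_alloc(int order)
{
    for (int cur = order; cur < MAX_ORDER; cur++) {
        if (free_count[cur] == 0)
            continue;                /* nothing free at this order: try bigger */
        free_count[cur]--;           /* take one chunk of order 'cur'          */
        while (cur > order) {        /* split: free one buddy at each level    */
            cur--;
            free_count[cur]++;
        }
        return 0;                    /* the caller now owns a 2^order chunk    */
    }
    return -1;
}

int main(void)
{
    free_count[6] = 1;               /* one free 256KB chunk (2^6 pages)       */
    buddy_alloc(3);                  /* 21KB rounds up to 32KB = 8 pages = order 3 */
    for (int o = 0; o <= 6; o++)
        printf("order %d (%3d KB): %d free\n", o, 4 << o, free_count[o]);
    return 0;
}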
95
96
# cd /sys/fs/cgroup
# mkdir core0 core1 core2 core3            (create 4 cgroup partitions)
# echo 0-3 > core0/palloc.dram_bank        (assign bank 0 ~ 3 for the core0 partition)
# echo 4-7 > core1/palloc.dram_bank
# echo 8-11 > core2/palloc.dram_bank
# echo 12-15 > core3/palloc.dram_bank
Intel Xeon platform
– X86-64, 4 cores, 8MB shared L3 cache
– 1 x 4GB DDR3 DRAM module (16 banks)
– Modified Linux 3.6.0
Freescale P4080 platform
– PowerPC, 8 cores, 2MB shared LLC
– 2 x 2GB DDR3 DRAM module (32 banks)
– Modified Linux 3.0.6
97
– Zero interference !!!
98
99
[Figure legend: Buddy (solo), PALLOC (different banks), Buddy]
100
– Reduced MLP (memory-level parallelism), but not significant for most benchmarks
101
[Figure: normalized IPC (0.0–1.2) with 4, 8, and 16 private banks, compared to Buddy]
102
[Figure: slowdown ratios (0–7) under buddy, PB, and PB+PC]
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas
103
– (1) alone, (2) with co-runners
– LLC is partitioned (equal partitions) using PALLOC (*)
104
[Figure: setup – the subject task runs on Core1 and the co-runner(s) on Core2–Core4; all share the LLC and DRAM]
(*) Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni. “PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14
– Latency benchmark: a linked-list traversal; data dependency, one outstanding miss
– Bandwidth benchmark: array reads or writes; no data dependency, multiple outstanding misses (see the sketches below)
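Minimal C sketches of the two kinds of microbenchmarks (illustrative versions, not the exact code from the paper):

#include <stddef.h>
#include <stdint.h>

/* Latency: pointer chasing over a linked list (each element stores the
 * address of the next). Every load depends on the previous one, so at
 * most one cache miss is outstanding at a time. */
long latency_bench(void **list, long iters)
{
    void **p = list;
    for (long i = 0; i < iters; i++)
        p = (void **)*p;                 /* serialized, dependent loads */
    return (long)(uintptr_t)p;           /* defeat dead-code elimination */
}

/* Bandwidth: sequential array reads with no dependence between iterations,
 * so the core (and its MSHRs) can keep many misses outstanding.
 * Size the array < 1/4 of the LLC for the cache-hit case, > 2x LLC for DRAM. */
long bandwidth_bench(const long *array, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i += 8)      /* one access per 64B cache line */
        sum += array[i];
    return sum;
}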
105
Working-set size: (LLC) < ¼ of the LLC → cache hits; (DRAM) > 2X the LLC → cache misses
– despite partitioned LLC
106
– On all tested out-of-order cores (A9, A15, Nehalem)
107
– Due to write-backs
108
– Co-runners: BwWrite(DRAM)
109
[Figure: results on the Cortex-A7 (in-order) and Cortex-A15 (out-of-order)]
– Page-coloring based kernel-level memory allocator
– Can partition the shared cache and DRAM banks
– Space partitioning (dedicated DRAM banks, cache partitions) improves performance isolation
– But it is not perfect: memory bus contention, MSHRs, …
– Limitations: multichannel and NUMA systems; complex addressing schemes in modern DRAM controllers
110
https://github.com/heechul/palloc
111
– When a virtual page is not mapped to a physical address, the MMU generates a trap (page fault) to the OS
– Step 1: allocate a free page frame
– Step 2: bring in the stored page from disk (if necessary)
– Step 3: update the PTE (mapping and valid bit)
– Step 4: restart the faulting instruction
112
[Figure: demand paging over time – a process’s address space (code, data, heap, stack) starts with unmapped pages; when the CPU accesses the next instruction in an unmapped page, a page fault occurs; the OS 1) allocates a free page frame, 2) loads the missed page from the disk (exec file), 3) updates the page table entry; over time, more pages are mapped as needed]
– handle_pte_fault
  – pte_alloc(vma->mm, .., fe->address)
  – page = alloc_zeroed_user_highpage_movable
    » alloc_pages_nodemask
      – buffered_rmqueue
        – __rmqueue_smallest
120
Simplified Pseudocode
121
ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]);
cmap = ph->cmap;
if (order == 0) {
    page = palloc_find_cmap(zone, cmap, 0, c_stat);
    if (page)
        return page;
    /* Search the entire list. Make color cache in the process */
    for (current_order = 0; current_order < MAX_ORDER; ++current_order) {
        area = &(zone->free_area[current_order]);
        list_for_each(curr, tmp, &area->free_list[migratetype]) {
            palloc_insert(zone, page, current_order);
            page = palloc_find_cmap(zone, cmap, 0, c_stat);
            /* ... (excerpt truncated) */
122
123
$ cat /proc/buddyinfo
Node 0, zone    DMA       1     1     1     1     1     1     1     0     1     1     3
Node 0, zone  DMA32       4     5     6     8     7     7     8     7     9     6   696
Node 0, zone Normal    3390 17134  4920  1413   537   239    79    46    11     4  5659
(each column is the number of free chunks of order 0, 1, 2, …, i.e., 4KB, 8KB, 16KB, …)
$ cat /proc/buddyinfo
Node 0, zone    DMA        0      0      1      0      2      1      1      0      1      1      3
Node 0, zone  DMA32    18761  13471  10910   7731   5786   2960    756     57      1      0      0
Node 0, zone Normal   129885  67082   9032   1217     68     10      8      1      0      0      0
Node 1, zone Normal   294117 197469 109366  84395  10394    818      7      4      3      0      0
$ cat /proc/buddyinfo
Node 0, zone Normal    2975   263  3898  1816    66    30    16    11     4     2    71
124
– /proc/<pid>/pagemap: a 64-bit value for each virtual page
– /proc/<pid>/maps: the process’s mapped virtual memory regions
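A small C sketch (Linux-specific) that reads /proc/self/pagemap to find the physical page frame backing a virtual address; bits 0–54 of each 64-bit entry hold the page frame number and bit 63 is the present flag (recent kernels require root to expose the PFN):

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Return the physical frame number of the page containing 'vaddr', or 0. */
uint64_t virt_to_pfn(uintptr_t vaddr)
{
    uint64_t entry = 0;
    long pgsz = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return 0;
    /* one 64-bit entry per virtual page */
    off_t off = (off_t)(vaddr / (uintptr_t)pgsz) * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
        entry = 0;
    close(fd);
    if (!(entry >> 63))                 /* bit 63: page present in RAM */
        return 0;
    return entry & ((1ULL << 55) - 1);  /* bits 0-54: page frame number */
}

int main(void)
{
    static int x = 42;                  /* some already-touched page in .data */
    printf("pfn = 0x%" PRIx64 "\n", virt_to_pfn((uintptr_t)&x));
    return 0;
}

Combined with a known cache/DRAM address mapping, the returned frame number gives the page’s color and bank.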
125
# pagetype -L -p `pidof bandwidth`
voffset  offset  flags
10       2a6f6   color=1 __RU_lA____M______________________
11       2a6f7   color=1 __RU_lA____M______________________
21       1bf67   color=1 ___U_lA____Ma_b___________________
22       26f73   color=0 ___UDlA____Ma_b___________________
d3e      251c7   color=1 ___U_lA____Ma_b___________________
769de    32d13   color=0 ___UDlA____Ma_b___________________
769df    31ebe   color=3 ___UDlA____Ma_b___________________
769e0    31a4d   color=3 ___UDlA____Ma_b___________________
769e1    20c74   color=1 ___U_lA____Ma_b___________________
769e2    285ae   color=3 ___UDlA____Ma_b___________________
769e3    313f5   color=1 ___U_lA____Ma_b___________________
769e4    154b7   color=1 ___U_lA____Ma_b___________________
769e5    1fa36   color=1 ___U_lA____Ma_b___________________
769e6    7945    color=1 ___U_lA____Ma_b___________________
126
– Columns: virtual page number (<<12), physical page frame number (<<12), page color, and page table entry flags