Memory Hierarchy
Heechul Yun
Topics
– Introduction to Real-Time Systems, CPS
– CPS Applications
– Real-time architecture/OS
– Fault tolerance, safety, security
– Amazon Prime Air
– Real-time cache and DRAM controller designs
– Real-time microarchitecture/OS support
– Real-time support for GPU/FPGA
3
– Performance: average timing
– Determinism: variance and worst-case timing
– Focused on determinism
– So that the system can be analyzed at design time
– Many challenges exist in computer architecture
– In general, performance demands were not high
– Such as self-driving cars and UAVs (intelligent robots)
– Demand both performance and determinism
– More difficult to satisfy both
4
5
[Figure: performance vs. predictability – high-performance architecture, real-time architecture, and the goal: a high-performance real-time architecture]
6
[Figure: multicore SoC – Core1, Core2, GPU, and accelerator sharing the memory controller (MC), shared cache, and DRAM]
“… toward μP based platforms”
“… predictable real-time behavior on high-performance platforms”
7
(*) Arne Hamann (Bosch), “Industrial Challenges: moving from classical to high-performance real-time systems." In Waters 2019
8
https://www.faa.gov/aircraft/air_cert/design_approvals/air_software/cast/cast_papers/media/cast-32A.pdf
– Guidance from certification agencies on the use of multicore processors
– Concerned with interference channels that affect software timing
– Certification requires showing that all interference channels are taken care of (“robust partitioning”) – nobody can do this (yet)
9
– Shared hardware resources can leak secrets
10
https://meltdownattack.com/
11
[Figure: multicore SoC – Core1, Core2, GPU, and accelerator sharing the memory controller (MC), shared cache, and DRAM]
– Mapping: physical address → mapping function → set index (sketched below)
– Replacement: select a victim line among the ways
– It just works!
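For concreteness, a minimal C sketch of the mapping step (the line size and set count below are illustrative, not those of a particular processor):

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  64          /* cache-line size L in bytes (illustrative) */
#define NUM_SETS   1024        /* number of sets S (illustrative)           */

/* Split a physical address into line offset, set index, and tag. */
static void cache_map(uint64_t paddr)
{
    uint64_t offset = paddr % LINE_SIZE;                 /* bits [0:5]   */
    uint64_t set    = (paddr / LINE_SIZE) % NUM_SETS;    /* bits [6:15]  */
    uint64_t tag    = paddr / (LINE_SIZE * NUM_SETS);    /* bits [16:..] */
    printf("paddr=0x%llx -> set=%llu tag=0x%llx offset=%llu\n",
           (unsigned long long)paddr, (unsigned long long)set,
           (unsigned long long)tag, (unsigned long long)offset);
}

int main(void)
{
    cache_map(0x12345678);
    return 0;
}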
12
13
[Figure: cache organization – a physical address is split into tag, set index, and cache-line offset; the cache has S sets with cache-lines of size L]
14
[Figure: a 4-way set-associative cache – the tag and set index fields of the physical address select one of the S sets, which holds 4 cache-lines (ways 1–4) of size L]
– Evict the least recently used cache-line
– “Good” (analyzable) policy; tight analysis exists
– Expensive to maintain the full ordering; not used for large caches
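A minimal C sketch of LRU bookkeeping for a single 4-way set, using age counters instead of an ordered list (illustrative only):

#include <stdint.h>

#define WAYS 4

/* Per-set LRU state: age 0 = most recently used, WAYS-1 = least recently used.
 * The ages always form a permutation of 0..WAYS-1. */
struct lru_set {
    uint64_t tag[WAYS];
    int      valid[WAYS];
    int      age[WAYS];
};

void lru_init(struct lru_set *s)
{
    for (int i = 0; i < WAYS; i++) {
        s->valid[i] = 0;
        s->age[i] = i;          /* any permutation works as a starting order */
    }
}

/* On a hit to (or fill of) way 'w': w becomes MRU, ways that were more
 * recent than w age by one. */
void lru_touch(struct lru_set *s, int w)
{
    for (int i = 0; i < WAYS; i++)
        if (s->age[i] < s->age[w])
            s->age[i]++;
    s->age[w] = 0;
}

/* On a miss: pick an invalid way if there is one, otherwise the LRU way. */
int lru_victim(const struct lru_set *s)
{
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!s->valid[i])
            return i;
        if (s->age[i] > s->age[victim])
            victim = i;
    }
    return victim;
}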
15
– Use a binary tree
– Each node records which half is older
– On a miss, follow the older path and flip the bits along the way
– Approximates LRU; no need to sort; practical
– But the analysis is more pessimistic
16
[Figure: tree-PLRU for 8 cache-lines L0–L7 – a binary tree of direction bits where each bit points toward the older half]
Image credit: Prof. Mikko H. Lipasti
17
Image credit: https://en.wikipedia.org/wiki/Pseudo-LRU
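A minimal C sketch of tree-PLRU for an 8-way set (the bit convention below – a set bit points toward the half holding the next victim – is one of several equivalent conventions, so treat this as illustrative):

#include <stdio.h>

#define WAYS 8                  /* 8 ways -> 7 tree nodes */

static int plru_bits[WAYS - 1]; /* node i's children are 2i+1 and 2i+2;
                                   bit = 0: victim is in the left subtree,
                                   bit = 1: victim is in the right subtree */

/* Follow the direction bits from the root to find the victim way. */
int plru_victim(void)
{
    int node = 0;
    while (node < WAYS - 1)
        node = 2 * node + 1 + plru_bits[node];
    return node - (WAYS - 1);   /* leaf index -> way number */
}

/* On an access to 'way', flip the bits on the root-to-leaf path so that
 * they point away from the just-used way. */
void plru_touch(int way)
{
    int node = way + (WAYS - 1);            /* leaf index */
    while (node > 0) {
        int parent = (node - 1) / 2;
        plru_bits[parent] = (node == 2 * parent + 1) ? 1 : 0;
        node = parent;
    }
}

int main(void)
{
    for (int w = 0; w < 4; w++)             /* touch ways 0..3 */
        plru_touch(w);
    printf("next victim: way %d\n", plru_victim());  /* a way among 4..7 */
    return 0;
}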
– One MRU bit per cache-line
– Set to 1 on access; when the last remaining 0 bit is set to 1, all other bits are reset to 0
– On a cache miss, the line with the lowest index whose MRU bit is 0 is replaced
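A minimal C sketch of this MRU-bit policy (illustrative):

#include <stdint.h>

#define WAYS 8

static uint8_t mru[WAYS];       /* one MRU bit per cache-line (way) */

/* On an access: set the way's MRU bit; if that was the last remaining
 * 0 bit, reset all the other bits to 0. */
void mrubit_touch(int way)
{
    mru[way] = 1;
    for (int i = 0; i < WAYS; i++)
        if (mru[i] == 0)
            return;             /* some 0 bit remains, nothing to reset */
    for (int i = 0; i < WAYS; i++)
        mru[i] = (i == way);    /* all bits were 1: keep only the newest */
}

/* On a miss: replace the lowest-indexed way whose MRU bit is 0. */
int mrubit_victim(void)
{
    for (int i = 0; i < WAYS; i++)
        if (mru[i] == 0)
            return i;
    return 0;                   /* unreachable if touch() keeps a 0 bit */
}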
18
Udacity Lecture: https://www.youtube.com/watch?v=8CjifA2yw7s
– The manual (if you are lucky)
– Reverse engineering
19
Image source: [Abel and Reineke, RTAS 2013]
– Problem: the longest path can take less time to finish than shorter paths if your system has a cache(s)!
– Path 1: 1000 instructions, 0 cache misses
– Path 2: 500 instructions, 100 cache misses
– Cache hit: 1 cycle; cache miss: 100 cycles
– Path 2 takes much longer
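– A rough calculation (assuming every miss fully stalls the pipeline): Path 1 ≈ 1000 × 1 = 1,000 cycles, while Path 2 ≈ 400 × 1 + 100 × 100 = 10,400 cycles, so the “shorter” path determines the WCET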
20
– Problem: extremely pessimistic
– 1000 instructions, 100 mem accesses, 10 misses
– Actual: 900 + 90 × 1 + 10 × 100 = 1,990 ≈ 2,000 cycles
– WCET (all-miss): 900 + 100 × 100 = 10,900 ≈ 11,000 cycles
21
– To reduce pessimism in WCET estimation
– If we make some assumptions about the cache (e.g., the replacement policy and the accessed addresses)
– Then we can statically determine hits/misses
22
23
[Figure: multicore SoC – Core1, Core2, GPU, and accelerator sharing the memory controller (MC), shared cache, and DRAM]
– Way partitioning: requires h/w support
– Set partitioning: can be done in s/w as long as there is an MMU
– Page coloring
24
– E.g., Freescale P4080, Intel
25
[Figure: way partitioning – the ways (1–4) of the shared cache are divided among Core1–Core4]
– Intel’s way partitioning mechanism
– Maps a thread/VM logical id → a resource (cache) partition
– CAT: Cache Allocation Technology
– CMT: Cache Monitoring Technology
– MBM: Memory Bandwidth Monitoring
– CDP: Code/Data Prioritization
– C. Peng, “Achieving QoS in Server Virtualization,” 2016
[Figure: set partitioning – the cache sets of the shared cache are divided among Core1–Core4]
– Page coloring: control the physical addresses (and hence the cache set indices) of allocated pages
– Allocate pages of certain colors only to certain CPU cores
32
[Figure: page coloring – the physical page chosen for an allocated page determines which group of cache sets (color) it maps to]
33
Color index: OS-controlled address bits
[Figure: physical address fields in a 32-bit example
– Page offset: bits 0–11; physical page frame number: bits 12–31
– L1 cache (private): cache-line offset bits 0–5, set index bits 6–13, tag bits 14–31
– L2 cache (shared): cache-line offset bits 0–5, set index bits 6–16, tag bits 17–31
– OS-controlled bits for L2 partitioning: bits 12–16, where the L2 set index overlaps the page frame number]
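A minimal C sketch of how the OS can compute a page’s color from these overlapping bits (assuming the 32-bit example above, i.e., 4KB pages and color bits 12–16; the constants are illustrative):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12          /* 4KB pages: page offset = bits 0-11          */
#define COLOR_BITS  5           /* bits 12-16: overlap of L2 set index and the
                                   physical page frame number (example above)  */

/* Cache color of a physical page: the low frame-number bits that also
 * feed the shared cache's set index. */
unsigned page_color(uint64_t paddr)
{
    return (paddr >> PAGE_SHIFT) & ((1u << COLOR_BITS) - 1);
}

int main(void)
{
    /* Two pages that are 128KB (32 pages) apart share the same color. */
    printf("color of 0x00012000 = %u\n", page_color(0x00012000));
    printf("color of 0x00032000 = %u\n", page_color(0x00032000));
    return 0;
}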
34
[Figure: address bits used by each cache level on an Intel platform
– (private) L1: 32K I/D (4/8-way), 4KB/way (12 bits)
– (private) L2: 256KB (8-way), 32KB/way (12 + 3 bits)
– (shared) L3: 8MB (16-way), 512KB/way (12 + 7 bits)
– DRAM bank bits overlap with the cache set-index bits (around bits 12–14 and 19–21 in the figure)]
– Partitioning the L3 using bits 12–13 also partitions the L2 cache and DRAM banks due to the address overlap
35
– Slice id + set index → unique cache location
[Liu et al., IEEE S&P 2015] “Last-Level Cache Side-Channel Attacks are Practical”
36
– Static: assign the mapping statically, once
  – Pros: simple
  – Cons: what if the assignment is not ideal?
– Dynamic: the assignment may change over time
  – Pros: can adapt to changes in behavior
  – Cons: page recoloring is costly
37
– Pros
– Cons
– Pros
– Cons
38
39
– MSHRs, WBBuffer, …
[Figure: multicore SoC – Core1, Core2, GPU, and accelerator sharing the memory controller (MC), shared cache, and DRAM]
– Bigger, more complex applications
– Large, high-bandwidth data processing
– DRAM is often a performance bottleneck
40
– Out-of-order core
– Multicore
– Accelerator
41
performance?
Part 1 Part 2 Part 3 Part 4
42
[Figure: Core1–Core4, per-core LLCs, a memory controller, and DRAM]
43
[Figure: a multicore chip – CORE 0–3, each with a private L2 cache (L2 CACHE 0–3), a SHARED L3 CACHE, a DRAM INTERFACE / DRAM MEMORY CONTROLLER, and DRAM BANKS]
This slide is from Prof. Onur Mutlu
44
[Figure: DRAM organization (channels, DIMMs, ranks) – the processor connects to DIMMs (dual in-line memory modules) over memory channels; a DIMM has a front and a back side; Rank 0 (front) and Rank 1 (back) are each a collection of 8 chips that share the channel’s 64-bit data bus (Data <0:63>), chip-select lines (CS <0:1>), and Addr/Cmd signals]
This slide is from Prof. Onur Mutlu
[Figure: breaking down a rank and a chip – Rank 0 consists of Chip 0 … Chip 7; each chip drives 8 bits of the 64-bit data bus (<0:7>, <8:15>, …, <56:63>); inside a chip, multiple banks (Bank 0, …) share the chip’s 8-bit data interface]
This slide is from Prof. Onur Mutlu
[Figure: breaking down a bank – Bank 0 is an array of rows (row 0 … row 16k-1), each 2kB wide; an activated row is held in the row-buffer, from which 1B columns are read or written over the chip’s 8-bit interface]
This slide is from Prof. Onur Mutlu
[Figure: how a 64B cache block maps to DRAM – the physical memory space (0x00, 0x40, …, 0xFFFF…F) is divided into 64B cache blocks; a block maps to a channel, DIMM, and rank (here Channel 0, DIMM 0, Rank 0); within the rank, each 8B beat is striped across Chips 0–7 (8 bits per chip), and successive 8B beats come from successive columns of the same row (Row 0, Col 0, Col 1, …)]
A 64B cache block takes 8 I/O cycles to transfer. During the process, 8 columns are read sequentially.
This slide is from Prof. Onur Mutlu
[Figure: Core1–Core4 share an L3 cache and a memory controller (MC) connected to a DRAM DIMM with Bank 1–Bank 4]
– DRAM banks can be accessed in parallel
– Example: DDR3 1333MHz (peak ≈ 10.6 GB/s)
– In practice, less than the peak bandwidth – how much?
[Figure: DRAM bank operation – each bank has rows (Row 1 … Row 5) and a row buffer; a READ (Bank 1, Row 3, Col 7) may require a precharge and an activate before the read/write of Col 7 from the row buffer]
– Row miss: 19 cycles, row hit: 9 cycles
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
65
Kim et al., “Bounding Memory Interference Delay in COTS-based Multi-Core Systems,” RTAS’14
– Must satisfy DRAM timing/resource constraints
– Translates requests into DRAM command sequences
– Timing constraints: e.g., minimum write-to-read delay, activation time, …
– Resource conflicts: bank, bus, channel
– Buffers, reorders, and pipelines requests when scheduling them
66
– Buffers read/write requests from CPU cores
– Unpredictable queuing delay due to reordering
67
Bruce Jacob et al, “Memory Systems: Cache, DRAM, Disk” Fig 13.1.
68
[Figure: request reordering
– Initial queue: Core1: READ Row 1, Col 1 → Core2: READ Row 2, Col 1 → Core1: READ Row 1, Col 2 (2 row switches)
– Reordered queue: Core1: READ Row 1, Col 1 → Core1: READ Row 1, Col 2 → Core2: READ Row 2, Col 1 (1 row switch)]
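A minimal C sketch of the first-ready, first-come-first-served (FR-FCFS) choice that produces this reordering (illustrative; a real controller also tracks per-bank timing state):

#include <stddef.h>

struct mem_req {
    int core;
    int bank;
    int row;
    int col;
};

/* Pick the next request: prefer the oldest row-hit request (its row is
 * already open in the target bank's row buffer); otherwise the oldest one. */
int frfcfs_pick(const struct mem_req *q, size_t n, const int *open_row)
{
    for (size_t i = 0; i < n; i++)              /* oldest-first scan */
        if (q[i].row == open_row[q[i].bank])
            return (int)i;                      /* first-ready (row hit) */
    return n > 0 ? 0 : -1;                      /* fall back to the oldest */
}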
– Open-page policy: keep the row open after an access
– Close-page policy: close (precharge) the row after an access
69
– High/low watermark based switching
70
PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms
Heechul Yun*, Renato Mancuso+, Zheng-Pei Wu#, Rodolfo Pellizzoni# *University of Kansas, +University of Illinois , #University of Waterloo
71
[Figure (recap): Core1–Core4 share an L3 cache and a memory controller (MC) connected to a DRAM DIMM with Bank 1–Bank 4]
– DRAM banks can be accessed in parallel
– Example: DDR3 1333MHz
– In practice, less than the peak bandwidth – how much?
77
– Row misses are slow but can overlap when they target different banks
– Row hits are fast but can still suffer interference on the shared bus
Heechul Yun, Rodolfo Pellizzoni, Prathap Kumar Valsan. Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems. Euromicro Conference on Real-Time Systems (ECRTS), 2015. [pdf] [ppt]
[Figure: Core1–Core4 on an SMP OS sharing the L3 and memory controller; each task’s pages are spread over multiple DRAM banks]
– Memory pages are spread all over the DRAM banks
– Unpredictable memory performance
78
[Figure: a DRAM bank-aware SMP OS allocating pages to specific banks through the memory controller]
– The OS is aware of the DRAM bank mapping
– Pages can be allocated to a desired DRAM bank
– Flexible allocation policy
79
[Figure: private banking – each core’s pages are placed on its own DRAM banks]
– Allocate each core’s pages on exclusively assigned banks
– Eliminates inter-core bank conflicts
80
mechanism
81
82
[Figure: identified DRAM bank, channel, and cache-set address bits for the two evaluation platforms]
– Intel Xeon 3530 + 4GiB DDR3 DIMM (16 banks)
– Freescale P4080 + 2x2GiB DDR3 DIMM (32 banks)
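As an illustration only, a C sketch of what such a mapping function looks like; the bit positions below are hypothetical, and the real mapping must be reverse-engineered per platform (e.g., with the map detector linked below):

#include <stdint.h>

/* Hypothetical mapping: 16 banks selected by bits 13-16 XORed with
 * bits 17-20. XOR schemes like this are common, but the exact bits
 * differ per memory controller and must be measured. */
unsigned dram_bank(uint64_t paddr)
{
    unsigned lo = (paddr >> 13) & 0xF;
    unsigned hi = (paddr >> 17) & 0xF;
    return lo ^ hi;
}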
83
https://github.com/heechul/palloc/blob/master/README-map-detector.md
84
https://github.com/IAIK/drama
85
– DRAM bank-aware page frame allocation at each page fault
86
87
When does an application allocate memory from the kernel?
– On a page fault
– The kernel allocates a page (e.g., 4KB)
User-level allocator (e.g., malloc)
– Doesn’t physically allocate pages
– Manages the process’s heap
– Variable-size objects in the heap
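A small C experiment (Linux-specific) that shows physical frames are only allocated on first touch; mincore() reports which pages are resident:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t npages = 16, pgsz = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char vec[16];
    int resident;

    /* Reserve 16 pages of anonymous memory: no physical frames yet. */
    char *buf = mmap(NULL, npages * pgsz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    resident = 0;
    mincore(buf, npages * pgsz, vec);
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;
    printf("resident after mmap : %d pages\n", resident);   /* typically 0 */

    buf[0] = 1;                  /* first touch -> page fault -> frame allocated */
    buf[pgsz] = 1;               /* touch a second page */

    resident = 0;
    mincore(buf, npages * pgsz, vec);
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;
    printf("resident after touch: %d pages\n", resident);   /* typically 2 */

    munmap(buf, npages * pgsz);
    return 0;
}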
88
– Page granularity (4K): the buddy allocator
– Fine-grained allocations: the slab, kmalloc, and vmalloc allocators
89
90
[Figure: Linux kernel memory allocators – kernel code uses kmalloc for arbitrary-size objects (backed by the SLAB allocator’s multiple fixed-size object caches), vmalloc for (large) non-physically-contiguous memory, and the buddy page allocator, which hands out power-of-two numbers of pages: 4K, 8K, 16K, …]
– A free chunk is split into two buddies of the next-lower power of 2
91
[Figure: buddy allocator free lists, one per chunk size – 4KB, 8KB, 16KB, 32KB, …]
– Assume 256KB chunk available, kernel requests 21KB
92
[Figure: buddy splitting for the 21KB (→ 32KB) request – 256KB Free → 128KB + 128KB Free → 64KB (to split) + 64KB Free + 128KB Free → 32KB (A, allocated) + 32KB Free + 64KB Free + 128KB Free]
– Free A
93
[Figure: coalescing after freeing A – the freed 32KB merges with its 32KB buddy into 64KB, then with the free 64KB into 128KB, and finally with the free 128KB back into a single 256KB free chunk]
94
Simplified Pseudocode
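A minimal, self-contained C sketch of the buddy splitting logic, matching the 256KB/21KB example above (illustrative only; the actual kernel code in __rmqueue_smallest() operates on per-order free lists of struct page):

#include <stdio.h>

#define MAX_ORDER 11                 /* orders 0..10: 4KB .. 4MB chunks */

static int free_count[MAX_ORDER];    /* free chunks available per order */

/* Allocate one chunk of 2^order pages; returns 0 on success, -1 if no memory. */
static int buddy_alloc(int order)
{
    for (int cur = order; cur < MAX_ORDER; cur++) {
        if (free_count[cur] == 0)
            continue;                /* nothing free at this order: try bigger */
        free_count[cur]--;           /* take one chunk of order 'cur'          */
        while (cur > order) {        /* split: free one buddy at each level    */
            cur--;
            free_count[cur]++;
        }
        return 0;                    /* the caller now owns a 2^order chunk    */
    }
    return -1;
}

int main(void)
{
    free_count[6] = 1;               /* one free 256KB chunk (2^6 pages)       */
    buddy_alloc(3);                  /* 21KB rounds up to 32KB = 8 pages = order 3 */
    for (int o = 0; o <= 6; o++)
        printf("order %d (%3d KB): %d free\n", o, 4 << o, free_count[o]);
    return 0;
}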
95
96
# cd /sys/fs/cgroup
# mkdir core0 core1 core2 core3            (create 4 cgroup partitions)
# echo 0-3 > core0/palloc.dram_bank        (assign bank 0 ~ 3 for the core0 partition)
# echo 4-7 > core1/palloc.dram_bank
# echo 8-11 > core2/palloc.dram_bank
# echo 12-15 > core3/palloc.dram_bank
Intel Xeon platform
– X86-64, 4 cores, 8MB shared L3 cache
– 1 x 4GB DDR3 DRAM module (16 banks)
– Modified Linux 3.6.0
Freescale P4080 platform
– PowerPC, 8 cores, 2MB shared LLC
– 2 x 2GB DDR3 DRAM module (32 banks)
– Modified Linux 3.0.6
97
– Zero interference !!!
98
99
[Figure legend: Buddy (solo), PALLOC (different banks), Buddy]
100
– Reduced MLP (memory-level parallelism), but not significant for most benchmarks
101
[Figure: normalized IPC (0.0–1.2) with 4, 8, and 16 private banks, compared to Buddy]
102
[Figure: slowdown ratios (0–7) under buddy, PB, and PB+PC]
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas
103
– (1) alone, (2) with co-runners
– LLC is partitioned (equal partitions) using PALLOC (*)
104
[Figure: setup – the subject task runs on Core1 and the co-runner(s) on Core2–Core4; all share the LLC and DRAM]
(*) Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni. “PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14
– Latency benchmark: a linked-list traversal; data dependency, one outstanding miss
– Bandwidth benchmark: array reads or writes; no data dependency, multiple outstanding misses (see the sketches below)
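Minimal C sketches of the two kinds of microbenchmarks (illustrative versions, not the exact code from the paper):

#include <stddef.h>
#include <stdint.h>

/* Latency: pointer chasing over a linked list (each element stores the
 * address of the next). Every load depends on the previous one, so at
 * most one cache miss is outstanding at a time. */
long latency_bench(void **list, long iters)
{
    void **p = list;
    for (long i = 0; i < iters; i++)
        p = (void **)*p;                 /* serialized, dependent loads */
    return (long)(uintptr_t)p;           /* defeat dead-code elimination */
}

/* Bandwidth: sequential array reads with no dependence between iterations,
 * so the core (and its MSHRs) can keep many misses outstanding.
 * Size the array < 1/4 of the LLC for the cache-hit case, > 2x LLC for DRAM. */
long bandwidth_bench(const long *array, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i += 8)      /* one access per 64B cache line */
        sum += array[i];
    return sum;
}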
105
Working-set size: (LLC) < ¼ of the LLC → cache hits; (DRAM) > 2X the LLC → cache misses
– despite partitioned LLC
106
– On all tested out-of-order cores (A9, A15, Nehalem)
107
– Due to write-backs
108
– Co-runners: BwWrite(DRAM)
109
[Figure: results on the Cortex-A7 (in-order) and Cortex-A15 (out-of-order)]
– Page-coloring based kernel-level memory allocator
– Can partition the shared cache and DRAM banks
– Space partitioning (dedicated DRAM banks, cache partitions) improves performance isolation
– But it is not perfect: memory bus contention, MSHRs, …
– Limitations: multichannel and NUMA systems; complex addressing schemes in modern DRAM controllers
110
https://github.com/heechul/palloc
111
– When a virtual page is not mapped to a physical address, the MMU generates a trap (page fault) to the OS
– Step 1: allocate a free page frame
– Step 2: bring in the stored page from disk (if necessary)
– Step 3: update the PTE (mapping and valid bit)
– Step 4: restart the faulting instruction
112
[Figure: demand paging over time – a process’s address space (code, data, heap, stack) starts with unmapped pages; when the CPU accesses the next instruction in an unmapped page, a page fault occurs; the OS 1) allocates a free page frame, 2) loads the missed page from the disk (exec file), 3) updates the page table entry; over time, more pages are mapped as needed]
– handle_pte_fault
  – pte_alloc(vma->mm, .., fe->address)
  – page = alloc_zeroed_user_highpage_movable
    » alloc_pages_nodemask
      – buffered_rmqueue
        – __rmqueue_smallest
120
Simplified Pseudocode
121
ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]);
cmap = ph->cmap;
if (order == 0) {
    page = palloc_find_cmap(zone, cmap, 0, c_stat);
    if (page)
        return page;
    /* Search the entire list. Make color cache in the process */
    for (current_order = 0; current_order < MAX_ORDER; ++current_order) {
        area = &(zone->free_area[current_order]);
        list_for_each(curr, tmp, &area->free_list[migratetype]) {
            palloc_insert(zone, page, current_order);
            page = palloc_find_cmap(zone, cmap, 0, c_stat);
            /* ... (excerpt truncated) */
122
123
$ cat /proc/buddyinfo
Node 0, zone    DMA       1     1     1     1     1     1     1     0     1     1     3
Node 0, zone  DMA32       4     5     6     8     7     7     8     7     9     6   696
Node 0, zone Normal    3390 17134  4920  1413   537   239    79    46    11     4  5659
(each column is the number of free chunks of order 0, 1, 2, …, i.e., 4KB, 8KB, 16KB, …)
$ cat /proc/buddyinfo
Node 0, zone    DMA        0      0      1      0      2      1      1      0      1      1      3
Node 0, zone  DMA32    18761  13471  10910   7731   5786   2960    756     57      1      0      0
Node 0, zone Normal   129885  67082   9032   1217     68     10      8      1      0      0      0
Node 1, zone Normal   294117 197469 109366  84395  10394    818      7      4      3      0      0
$ cat /proc/buddyinfo
Node 0, zone Normal    2975   263  3898  1816    66    30    16    11     4     2    71
124
– /proc/<pid>/pagemap: a 64-bit value for each virtual page
– /proc/<pid>/maps: the process’s mapped virtual memory regions
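A small C sketch (Linux-specific) that reads /proc/self/pagemap to find the physical page frame backing a virtual address; bits 0–54 of each 64-bit entry hold the page frame number and bit 63 is the present flag (recent kernels require root to expose the PFN):

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Return the physical frame number of the page containing 'vaddr', or 0. */
uint64_t virt_to_pfn(uintptr_t vaddr)
{
    uint64_t entry = 0;
    long pgsz = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return 0;
    /* one 64-bit entry per virtual page */
    off_t off = (off_t)(vaddr / (uintptr_t)pgsz) * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
        entry = 0;
    close(fd);
    if (!(entry >> 63))                 /* bit 63: page present in RAM */
        return 0;
    return entry & ((1ULL << 55) - 1);  /* bits 0-54: page frame number */
}

int main(void)
{
    static int x = 42;                  /* some already-touched page in .data */
    printf("pfn = 0x%" PRIx64 "\n", virt_to_pfn((uintptr_t)&x));
    return 0;
}

Combined with a known cache/DRAM address mapping, the returned frame number gives the page’s color and bank.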
125
# pagetype -L -p `pidof bandwidth`
voffset  offset  flags
10       2a6f6   color=1 __RU_lA____M______________________
11       2a6f7   color=1 __RU_lA____M______________________
21       1bf67   color=1 ___U_lA____Ma_b___________________
22       26f73   color=0 ___UDlA____Ma_b___________________
d3e      251c7   color=1 ___U_lA____Ma_b___________________
769de    32d13   color=0 ___UDlA____Ma_b___________________
769df    31ebe   color=3 ___UDlA____Ma_b___________________
769e0    31a4d   color=3 ___UDlA____Ma_b___________________
769e1    20c74   color=1 ___U_lA____Ma_b___________________
769e2    285ae   color=3 ___UDlA____Ma_b___________________
769e3    313f5   color=1 ___U_lA____Ma_b___________________
769e4    154b7   color=1 ___U_lA____Ma_b___________________
769e5    1fa36   color=1 ___U_lA____Ma_b___________________
769e6    7945    color=1 ___U_lA____Ma_b___________________
126
– Columns: virtual page number (<<12), physical page frame number (<<12), page color, and page table entry flags