

slide-1
SLIDE 1

Memory Hierarchy

Heechul Yun

1

slide-2
SLIDE 2

Topics

  • Introduction to Real-Time Systems, CPS
  • CPS Applications
  • Real-time architecture/OS
  • Fault tolerance, safety, security

2

Amazon Prime Air

slide-3
SLIDE 3

Topics

  • Introduction to Real-Time Systems, CPS
  • CPS Applications
  • Real-time architecture/OS

– Real-time cache, DRAM controller designs
– Real-time microarchitecture/OS support
– Real-time support for GPU/FPGA

  • Fault tolerance, safety, security

3

slide-4
SLIDE 4

Real-Time Computing

  • Performance vs. Determinism

– Performance: average timing
– Determinism: variance and worst-case timing

  • Traditional real-time systems

– Focused on determinism
– So that we can analyze the system at design time
– Many challenges exist in computer architecture
– In general, performance demand was not high

  • High performance real-time systems

– Such as self-driving cars and UAVs (intelligent robots)
– Demand both performance and determinism
– More difficult to satisfy both

4

slide-5
SLIDE 5

Architecture for Intelligent Robots

  • Time predictability
  • High performance

5

(Figure: the design space spanned by performance and predictability, contrasting high-performance architectures, real-time architectures, and the goal of high-performance real-time architectures)

slide-6
SLIDE 6

Challenge: Memory Hierarchy

6

  • Many important hardware resources are shared
  • Each is locally optimized for average performance
  • Software has limited ability to reason and control
  • Unpredictable software execution timing

(Figure: Core1, Core2, GPU, and accelerators share the shared cache, the memory controller (MC), and DRAM)

slide-7
SLIDE 7

Automotive Industry Challenges

  • “Automotive industry is transitioning from µC toward µP based platforms”
  • “Interference effects (on µPs) are more severe by orders of magnitude compared to µC platforms”
  • “Goal for automotive systems engineering: predictable real-time behavior on high-performance platforms”

7

(*) Arne Hamann (Bosch), “Industrial Challenges: moving from classical to high-performance real-time systems." In Waters 2019

slide-8
SLIDE 8

Certification Challenges in Aviation

8

https://www.faa.gov/aircraft/air_cert/design_approvals/air_software/cast/cast_papers/media/cast-32A.pdf

slide-9
SLIDE 9

CAST-32A: Multicore-Processors

  • A position paper by FAA and other certification agencies on multicore processors
  • Discusses interference channels of multicore processors that affect software timing
  • Suggestion 1: disable all but one core
  • Suggestion 2: provide evidence that all interference channels are taken care of (“robust partitioning”) → nobody can do this (yet)

9

slide-10
SLIDE 10

Timing Matters for Security

  • Measurable timing differences in accessing shared hardware resources can leak secrets

10

https://meltdownattack.com/

slide-11
SLIDE 11

Challenge: Memory Hierarchy

11

  • Many important hardware resources are shared
  • Each is locally optimized for average performance
  • Software has limited ability to reason and control
  • Unpredictable software execution timing

(Figure: Core1, Core2, GPU, and accelerators share the shared cache, the memory controller (MC), and DRAM)

slide-12
SLIDE 12

Cache

  • Small but fast memory (SRAM)
  • Hardware (cache controller) managed storage

– Mapping: physical address → mapping function → set index
– Replacement: select a victim line among the ways

  • Improve average performance
  • Transparent to software

– It just works!

12

slide-13
SLIDE 13

Direct-Map Cache

  • Cache-line size = 2^L bytes
  • # of cache-sets = 2^S
  • Cache size = 2^(L+S) bytes

13

(Figure: the physical address splits into tag, set index (S bits), and cache-line offset (L bits); the set index selects one of the 2^S cache sets)

slide-14
SLIDE 14


Set-Associative Cache

  • Cache-line size = 2^L bytes
  • # of cache-sets = 2^S
  • # of ways = W
  • Cache size = W × 2^(L+S) bytes

14

(Figure: the same tag / set index / offset split of the physical address; each of the 2^S sets now holds W cache-lines, one per way)
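To make the mapping concrete, here is a minimal C sketch (my illustration, not from the slides) that decodes an address into its cache-line offset, set index, and tag; the line size and set count below are assumed example values, not a specific processor's parameters.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed example parameters: 64B lines (L = 6), 1024 sets (S = 10). */
    #define LINE_BITS 6
    #define SET_BITS  10

    static void decode(uint64_t paddr)
    {
        uint64_t offset = paddr & ((1ULL << LINE_BITS) - 1);               /* bits [L-1:0]   */
        uint64_t index  = (paddr >> LINE_BITS) & ((1ULL << SET_BITS) - 1); /* bits [L+S-1:L] */
        uint64_t tag    = paddr >> (LINE_BITS + SET_BITS);                 /* remaining bits */
        printf("addr=0x%llx -> tag=0x%llx set=%llu offset=%llu\n",
               (unsigned long long)paddr, (unsigned long long)tag,
               (unsigned long long)index, (unsigned long long)offset);
    }

    int main(void)
    {
        decode(0x12345678ULL);   /* any address value works as an example */
        return 0;
    }

For a W-way set-associative cache the same index computation applies; the W lines of the selected set are then searched by tag.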

slide-15
SLIDE 15

Cache Replacement Policy

  • Least Recently Used (LRU)

– Evict the least recently used cache-line
– “Good” (analyzable) policy; tight analysis exists
– Expensive to maintain order; not used for large caches

15

slide-16
SLIDE 16

Cache Replacement Policy

  • (Tree) Pseudo-LRU

– Use a binary tree
– Each node records which half is older
– On a miss, follow the older path and flip the bits along the way
– Approximate LRU; no need to sort; practical
– But analysis is more pessimistic

16

(Figure: a binary tree of direction bits over cache-lines L0-L7; each bit marks which half of its subtree is older)

Image credit: Prof. Mikko H. Lipasti
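As an illustration (mine, not from the slides), a minimal C sketch of tree-PLRU for one 8-way set: 7 tree bits, where every access flips the bits along its path to point away from the accessed way, and a victim is found by following the bits toward the older half. The bit encoding (0 = left half older) is an assumption for this sketch.

    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 8                    /* assumed associativity; 7 internal tree nodes */

    /* plru[i] == 0: left subtree is older; 1: right subtree is older (assumed encoding) */
    static uint8_t plru[WAYS - 1];

    /* On every hit/fill: make the tree point AWAY from the accessed way. */
    static void plru_touch(int way)
    {
        int node = 0;
        for (int level = 2; level >= 0; level--) {
            int bit = (way >> level) & 1;    /* which child holds the accessed way */
            plru[node] = !bit;               /* the older half is the other child  */
            node = 2 * node + 1 + bit;
        }
    }

    /* On a miss: follow the "older" bits down the tree to pick a victim. */
    static int plru_victim(void)
    {
        int node = 0, way = 0;
        for (int level = 0; level < 3; level++) {
            int bit = plru[node];            /* descend toward the older half */
            way = (way << 1) | bit;
            node = 2 * node + 1 + bit;
        }
        return way;
    }

    int main(void)
    {
        plru_touch(3); plru_touch(5); plru_touch(0);
        printf("victim way = %d\n", plru_victim());
        return 0;
    }

Only 7 bits per set are needed instead of a full LRU ordering, which is why hardware prefers it; the price is that the victim is only approximately the least recently used line, which is what makes static analysis more pessimistic.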

slide-17
SLIDE 17

Cache Replacement Policy

  • (Tree) Pseudo-LRU

17

Image credit: https://en.wikipedia.org/wiki/Pseudo-LRU

slide-18
SLIDE 18

Cache Replacement Policy

  • (Bit) PLRU or NRU (Not Recently Used)

– One MRU bit per cache-line
– Set the bit to 1 on access; when the last remaining 0 bit is set to 1, all other bits are reset to 0
– On a cache miss, the line with the lowest index whose MRU bit is 0 is replaced

18

Udacity Lecture: https://www.youtube.com/watch?v=8CjifA2yw7s
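A minimal C sketch (my illustration, not from the slides) of the bit-PLRU/NRU bookkeeping for one set: one MRU bit per line, reset all other bits when they would all become 1, and evict the lowest-indexed line whose bit is 0.

    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 8                    /* assumed associativity */

    static uint8_t mru[WAYS];         /* one MRU bit per cache-line in the set */

    static void nru_touch(int way)
    {
        mru[way] = 1;
        int all_set = 1;
        for (int i = 0; i < WAYS; i++)
            if (!mru[i]) { all_set = 0; break; }
        if (all_set) {                /* keep only the just-accessed bit set */
            for (int i = 0; i < WAYS; i++)
                mru[i] = 0;
            mru[way] = 1;
        }
    }

    static int nru_victim(void)
    {
        for (int i = 0; i < WAYS; i++)
            if (mru[i] == 0)
                return i;             /* lowest index with MRU bit 0 */
        return 0;                     /* unreachable: nru_touch always leaves a 0 bit */
    }

    int main(void)
    {
        for (int w = 0; w < 5; w++) nru_touch(w);
        printf("victim way = %d\n", nru_victim());   /* prints 5 */
        return 0;
    }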

slide-19
SLIDE 19

Cache Replacement Policies

  • How to know which policy is used?

– Manual (if you are lucky)
– Reverse engineering

19

Image source: [Abel and Reineke, RTAS 2013]

slide-20
SLIDE 20

WCET and Caches

  • How to determine the WCET of a task?
  • The longest execution path of the task?

– Problem: the longest path can take less time to finish than a shorter path if the system has caches!

  • Example

– Path 1: 1000 instructions, 0 cache misses
– Path 2: 500 instructions, 100 cache misses
– Cache hit: 1 cycle, cache miss: 100 cycles
– Path 2 takes much longer

20

slide-21
SLIDE 21

WCET and Caches

  • Treat all memory accesses as cache-misses?

– Problem: extremely pessimistic

  • Example

– 1000 instructions, 100 mem accesses, 10 misses

  • Cache hit: 1 cycle, cache miss: 100 cycles

– Actual = 900 + 90×1 + 10×100 = 1990 ≈ 2000 cycles
– WCET (all-miss) = 900 + 100×100 = 10900 ≈ 11000 cycles

  • >5X higher

21

slide-22
SLIDE 22

WCET and Caches

  • Take cache hits/misses into account?

– To reduce pessimism in WCET estimation

  • How to know cache hits/misses of a given job?

– If we assume

  • the path (instruction stream) is given
  • the job is not interrupted.
  • A known “good” cache replacement policy is used

– Then we can statically determine hits/misses

  • But less so when “bad” replacement policies are used

22

slide-23
SLIDE 23

Shared Cache

23

  • Cache space is shared between cores
  • How to improve isolation?

(Figure: Core1, Core2, GPU, and accelerators share the shared cache, the memory controller (MC), and DRAM)

slide-24
SLIDE 24

Cache Partitioning

  • Way-partitioning

– Requires h/w support

  • Set-partitioning

– Can be done in s/w as long as there’s an MMU

  • MMU: virtual -> physical address translation h/w
  • Page table: translation table managed by the OS
  • Most (but not all) processors support MMU

– Page-coloring

24

slide-25
SLIDE 25

Way Partitioning

  • H/W support is needed

– E.g., Freescale P4080, Intel

25

(Figure: the ways of the shared cache are divided among Core1-Core4; each core is restricted to its assigned ways across all sets)

slide-26
SLIDE 26

Intel CAT

  • Cache Allocation Technology (CAT)

– Intel’s way partitioning mechanism
– Thread/VM → logical ID → resource (cache) partition

  • Part of intel’s platform QoS techniques

– CAT: cache allocation technology
– CMT: cache monitoring technology
– MBM: memory bandwidth monitoring
– CDP: code/data prioritization

  • Some slides are from

– C. Peng, “Achieving QoS in Server Virtualization,” 2016

26
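On Linux, CAT is exposed through the resctrl filesystem rather than raw MSRs. The following is a minimal C sketch (my illustration, not from the slides): it assumes resctrl is already mounted at /sys/fs/resctrl, that the L3 capacity bitmask has at least 12 bits, and uses a made-up group name and PID.

    #include <errno.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        /* Assumes: mount -t resctrl resctrl /sys/fs/resctrl was done beforehand. */
        if (mkdir("/sys/fs/resctrl/rt_part", 0755) != 0 && errno != EEXIST) {
            perror("mkdir");
            return 1;
        }

        /* Give this group 4 of the L3 ways (capacity bitmask 0x00f) on cache id 0. */
        FILE *f = fopen("/sys/fs/resctrl/rt_part/schemata", "w");
        if (!f) { perror("schemata"); return 1; }
        fprintf(f, "L3:0=00f\n");
        fclose(f);

        /* Move a task (hypothetical PID 1234) into the partition. */
        f = fopen("/sys/fs/resctrl/rt_part/tasks", "w");
        if (!f) { perror("tasks"); return 1; }
        fprintf(f, "%d\n", 1234);
        fclose(f);
        return 0;
    }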

slide-27
SLIDE 27

27

slide-28
SLIDE 28

28

slide-29
SLIDE 29

29

slide-30
SLIDE 30

30

slide-31
SLIDE 31

Set Partitioning

31

(Figure: the sets of the shared cache are divided among Core1-Core4; each core is restricted to its assigned range of sets)

  • Can be done in S/W

– Page coloring: control the physical addresses (and hence the cache set indices) of allocated pages

slide-32
SLIDE 32

Page Coloring

  • Cache can be divided into page colors
  • Assign certain colors to certain CPU cores

32

(Figure: page colors correspond to disjoint groups of cache sets; Core1-Core4 are assigned different colors)

slide-33
SLIDE 33

Page Coloring

  • The OS’s memory allocator controls the color of each allocated page

33

Color index: OS-controlled address bits
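To make this concrete, a small C sketch (my illustration; the bit positions below are assumed example values for 4KB pages and 64B lines): the page color is formed by the cache set-index bits that lie above the page offset, which are exactly the bits the OS controls when it picks a physical page frame.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12   /* 4KB pages */
    #define LINE_SHIFT  6   /* 64B cache lines (assumed) */
    #define SET_BITS   13   /* e.g., 8MB / 16-way / 64B = 8192 sets (assumed) */

    /* Color = set-index bits above the page offset (the OS-controlled bits). */
    static unsigned page_color(uint64_t phys_addr)
    {
        uint64_t set_index = (phys_addr >> LINE_SHIFT) & ((1ULL << SET_BITS) - 1);
        return (unsigned)(set_index >> (PAGE_SHIFT - LINE_SHIFT));
    }

    int main(void)
    {
        /* Two frames that differ only in bit 12 land in different colors. */
        printf("color of frame 0x10000: %u\n", page_color(0x10000));
        printf("color of frame 0x11000: %u\n", page_color(0x11000));
        return 0;
    }

With these assumed parameters there are 2^7 = 128 colors; a coloring allocator simply refuses to hand a core any frame whose color is outside that core's assigned color set.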

slide-34
SLIDE 34

(Figure: ARM Cortex-A15 physical address breakdown for the private L1 and shared L2 caches: cache-line offset in the low bits, set index above it, tag in the remaining upper bits; the page offset covers bits 11:0, so only the L2 set-index bits at bit 12 and above are OS-controlled and usable for L2 partitioning)

ARM Cortex-A15

  • Why not use bits 12 and 13?

34

slide-35
SLIDE 35

(Figure: Intel Nehalem address-bit usage; the DRAM bank bits and L3 set-index bits overlap the L2 set-index bits just above bit 12)
(private) L1: 32K I/D (4/8-way) → 4KB/way (12 bits)
(private) L2: 256KB (8-way) → 32KB/way (12 + 3 bits)
(shared) L3: 8MB (16-way) → 512KB/way (12 + 7 bits)

Intel Nehalem

  • Co-partitioning problem

– Partitioning the L3 using bits 12 and 13 also partitions the L2 cache and DRAM banks due to address-bit overlap

35

slide-36
SLIDE 36

Intel Sandy Bridge

  • Modern Intel processors use more complex mapping

– Slice ID + set index → unique cache location

“Last-Level Cache Side-Channel Attacks are Practical,” IEEE S&P’15

36

slide-37
SLIDE 37

How to Partition?

  • Decision: core (thread) → page colors
  • Static partitioning

– Assign the mapping statically at once
– Pros: simple
– Cons: what if the assignment is not ideal?

  • Dynamic

– Assignment may change over time
– Pros: can adapt to changes in behavior
– Cons: page recoloring is costly

37

slide-38
SLIDE 38

Summary

  • Cache way partitioning

– Pros

  • Low re-partition overhead

– Cons

  • Hardware support is needed.
  • Cache set partitioning (Page coloring)

– Pros

  • SW (OS) approach. No HW support is needed

– Cons

  • Co-partitioning problem
  • Re-partitioning (coloring) overhead is high

38

slide-39
SLIDE 39

Main Memory

39

  • Cache space
  • Cache internal hardware structures

– MSHRs, WBBuffer, …

(Figure: Core1, Core2, GPU, and accelerators share the shared cache, the memory controller (MC), and DRAM)

slide-40
SLIDE 40

Why is DRAM Important?

  • Why do we need bigger and faster memory?
  • Data intensive computing

– Bigger, more complex applications
– Large and high-bandwidth data processing
– DRAM is often a performance bottleneck

40

slide-41
SLIDE 41

Why is DRAM Important?

  • Parallelism

– Out-of-order core

  • A single core can generate many memory requests

– Multicore

  • Multiple cores share DRAM

– Accelerator

  • GPU

41

slide-42
SLIDE 42

Memory Performance Isolation

  • Q. How to guarantee predictable memory performance?

(Figure: the DRAM is divided into Parts 1-4, one partition per core)

42

(Figure: Core1-Core4, each with a private LLC partition, share the memory controller and DRAM)

slide-43
SLIDE 43

Memory System Architecture

43

(Figure: CORE 0-3 with private L2 caches, a shared L3 cache, the DRAM memory controller and DRAM interface, and the DRAM banks)

This slide is from Prof. Onur Mutlu

slide-44
SLIDE 44

DRAM Organization

  • Channel
  • Rank
  • Chip
  • Bank
  • Row
  • Row/Column

44

slide-45
SLIDE 45

The DRAM subsystem

(Figure: the processor connects over memory channels to DIMMs (dual in-line memory modules))

This slide is from Prof. Onur Mutlu

slide-46
SLIDE 46

Breaking down a DIMM

(Figure: side view of a DIMM, with chips on the front and back of the module)

This slide is from Prof. Onur Mutlu

slide-47
SLIDE 47

Breaking down a DIMM

(Figure: the DIMM has two ranks: Rank 0, a collection of 8 chips on the front, and Rank 1 on the back)

This slide is from Prof. Onur Mutlu

slide-48
SLIDE 48

Rank

(Figure: Rank 0 (front) and Rank 1 (back) share the 64-bit data bus and the address/command bus of the memory channel; chip-select (CS) signals choose the rank)

This slide is from Prof. Onur Mutlu

slide-49
SLIDE 49

Breaking down a Rank

(Figure: a rank consists of 8 chips; Chip 0 drives data bits <0:7>, Chip 1 bits <8:15>, …, Chip 7 bits <56:63>, together forming the 64-bit data bus)

This slide is from Prof. Onur Mutlu

slide-50
SLIDE 50

Breaking down a Chip

(Figure: a chip contains multiple banks, all sharing the chip’s 8-bit data path <0:7>)

This slide is from Prof. Onur Mutlu

slide-51
SLIDE 51

Breaking down a Bank

(Figure: a bank is a 2D array of rows (row 0 … row 16k-1), each about 2kB; an entire row is read into the row buffer, from which 1B columns are accessed over the chip’s 8-bit interface)

This slide is from Prof. Onur Mutlu

slide-52
SLIDE 52

Example: Transferring a cache block

(Figure: a 64B cache block in the physical address space maps to Channel 0, DIMM 0, Rank 0)

This slide is from Prof. Onur Mutlu

slide-53
SLIDE 53

Example: Transferring a cache block

(Figure: the 64B cache block is spread across the 8 chips of Rank 0, each chip providing 8 of the 64 data bits)

This slide is from Prof. Onur Mutlu

slide-54
SLIDE 54

Example: Transferring a cache block

(Figure: the first access reads Row 0, Col 0 in every chip of the rank)

This slide is from Prof. Onur Mutlu

slide-55
SLIDE 55

Example: Transferring a cache block

(Figure: the 8 chips together return 8B per access; Row 0, Col 0 supplies the first 8B of the block)

This slide is from Prof. Onur Mutlu

slide-56
SLIDE 56

Example: Transferring a cache block

(Figure: the next access reads Row 0, Col 1 in every chip)

This slide is from Prof. Onur Mutlu

slide-57
SLIDE 57

Example: Transferring a cache block

(Figure: Row 0, Col 1 supplies the next 8B of the block)

This slide is from Prof. Onur Mutlu

slide-58
SLIDE 58

Example: Transferring a cache block

(Figure: the transfer continues column by column through Row 0)

A 64B cache block takes 8 I/O cycles to transfer. During the process, 8 columns are read sequentially.


This slide is from Prof. Onur Mutlu

slide-59
SLIDE 59

DRAM Organization

(Figure: Core1-Core4 share the L3, the memory controller (MC), and a DRAM DIMM with Banks 1-4)

  • Have multiple banks
  • Different banks can be accessed in parallel

slide-60
SLIDE 60

Best-case

(Figure: each core streams to a different DRAM bank, so requests proceed in parallel)

Fast

  • Peak = 10.6 GB/s

– DDR3 1333 MHz

slide-61
SLIDE 61

Best-case

(Figure: same best-case layout: each core accesses its own DRAM bank)

  • Peak = 10.6 GB/s

– DDR3 1333 MHz

  • Out-of-order processors

Fast

slide-62
SLIDE 62

Most-cases

(Figure: requests from all cores are spread across the banks and interleave unpredictably)

Mess

  • Performance = ??
slide-63
SLIDE 63

Worst-case

  • 1-bank b/w

– Less than peak b/w
– How much?

Slow

(Figure: all four cores contend for the same DRAM bank)

slide-64
SLIDE 64

DRAM Chip

(Figure: a DRAM chip with multiple banks (Banks 1-4); each bank contains rows (Row 1-5 shown) and a row buffer; a row is activated into the row buffer before a read/write and precharged before a different row can be activated)

  • State-dependent access latency

– Row miss: 19 cycles, Row hit: 9 cycles

(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)

(Figure: example request: READ (Bank 1, Row 3, Col 7), served from the Bank 1 row buffer)
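A minimal C sketch (my illustration; the cycle counts are the ones quoted above for the PC6400-DDR2 setting) of why DRAM latency is state dependent: under an open-row policy, an access to the currently open row costs only a column access, while an access to a different row pays precharge + activate + column access.

    #include <stdio.h>

    #define ROW_HIT_CYCLES   9    /* row hit (from the slide) */
    #define ROW_MISS_CYCLES 19    /* row miss: precharge + activate + read (from the slide) */

    struct bank { int open_row; };          /* -1 = no row open */

    /* Returns the access cost and updates the row-buffer state (open-row policy). */
    static int access_cost(struct bank *b, int row)
    {
        int cost = (b->open_row == row) ? ROW_HIT_CYCLES : ROW_MISS_CYCLES;
        b->open_row = row;                  /* the accessed row stays open */
        return cost;
    }

    int main(void)
    {
        struct bank b = { .open_row = -1 };
        int rows[] = { 3, 3, 3, 7, 7, 2 };  /* a short access pattern */
        int total = 0;
        for (int i = 0; i < 6; i++)
            total += access_cost(&b, rows[i]);
        printf("total = %d cycles\n", total);   /* 19+9+9+19+9+19 = 84 */
        return 0;
    }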

slide-65
SLIDE 65

DDR3 Timing Parameters

65

Kim et al., “Bounding Memory Interference Delay in COTS-based Multi-Core Systems,” RTAS’14

slide-66
SLIDE 66

DRAM Controller

  • Service DRAM requests (from the CPU) while obeying timing/resource constraints

– Translate requests into DRAM command sequences
– Timing constraints: e.g., minimum write-to-read delay, activation time, …
– Resource conflicts: bank, bus, channel

  • Maximize performance

– Buffering, reordering, pipelining in scheduling requests

66

slide-67
SLIDE 67

DRAM Controller

  • Request queue

– Buffer read/write requests from the CPU cores
– Unpredictable queuing delay due to reordering

67

Bruce Jacob et al, “Memory Systems: Cache, DRAM, Disk” Fig 13.1.

slide-68
SLIDE 68

Request Reordering

  • Improve row hit ratio and throughput
  • Unpredictable queuing delay

68

Initial queue (2 row switches):
Core1: READ Row 1, Col 1
Core2: READ Row 2, Col 1
Core1: READ Row 1, Col 2

Reordered queue (1 row switch):
Core1: READ Row 1, Col 1
Core1: READ Row 1, Col 2
Core2: READ Row 2, Col 1
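The reordering above can be sketched in C as a first-ready, first-come-first-served (FR-FCFS-style) pick (my simplification, not a real controller's implementation): among pending requests, prefer row hits to the currently open row, and break ties by age. This is exactly why a core's queuing delay depends on what the other cores have queued.

    #include <stdio.h>

    struct req { int core, row, col, arrival; };

    /* Pick the next request FR-FCFS style: row hits first, then oldest. */
    static int pick_next(struct req *q, int n, int open_row)
    {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (best < 0) { best = i; continue; }
            int hit_i = (q[i].row == open_row), hit_b = (q[best].row == open_row);
            if (hit_i != hit_b) { if (hit_i) best = i; }          /* prefer row hits */
            else if (q[i].arrival < q[best].arrival) best = i;    /* then FCFS */
        }
        return best;
    }

    int main(void)
    {
        struct req q[] = {
            { 1, 1, 1, 0 },   /* Core1: Row 1, Col 1 */
            { 2, 2, 1, 1 },   /* Core2: Row 2, Col 1 */
            { 1, 1, 2, 2 },   /* Core1: Row 1, Col 2 */
        };
        int open_row = -1;                      /* no row open initially */
        for (int served = 0; served < 3; served++) {
            int i = pick_next(q, 3, open_row);
            printf("serve Core%d Row%d Col%d\n", q[i].core, q[i].row, q[i].col);
            open_row = q[i].row;
            q[i].arrival = 1 << 30;             /* crude removal: make it oldest-last */
            q[i].row = -1;                      /* and never a row hit again */
        }
        return 0;
    }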

slide-69
SLIDE 69

Row Management Policy

  • Open row

– Keep the row open after an access

  • If next access targets the same row: CAS
  • If next access targets a different row: PRE + ACT + CAS
  • Close row

– Close the row after an access

  • always pay the same (longer) cost: ACT + CAS
  • Adaptive policies

69

slide-70
SLIDE 70

COTS DRAM Controller

  • FR-FCFS scheduling (out-of-order)
  • Separate read/write buffers

– High/low watermark based switching

  • Reads prioritized over writes

70

slide-71
SLIDE 71

PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms

Heechul Yun*, Renato Mancuso+, Zheng-Pei Wu#, Rodolfo Pellizzoni# *University of Kansas, +University of Illinois , #University of Waterloo

71

slide-72
SLIDE 72

Background: DRAM Organization

(Figure: Core1-Core4 share the L3, the memory controller (MC), and a DRAM DIMM with Banks 1-4)

  • Have multiple banks
  • Different banks can be accessed in parallel

72

slide-73
SLIDE 73

Best-case

(Figure: each core streams to a different DRAM bank, so requests proceed in parallel)

Fast

  • Peak = 10.6 GB/s

– DDR3 1333 MHz

73

slide-74
SLIDE 74

Best-case

(Figure: same best-case layout: each core accesses its own DRAM bank)

  • Peak = 10.6 GB/s

– DDR3 1333 MHz

  • Out-of-order processors

Fast

74

slide-75
SLIDE 75

Most-cases

(Figure: requests from all cores are spread across the banks and interleave unpredictably)

Mess

  • Performance = ??

75

slide-76
SLIDE 76

Worst-case

  • 1-bank b/w

– Less than peak b/w
– How much?

Slow

(Figure: all four cores contend for the same DRAM bank)

76

slide-77
SLIDE 77

Timing Diagram

77

Row misses are slow, but they can overlap when they target different banks. Row hits are fast, but they can still suffer interference at the bus.

Heechul Yun, Rodolfo Pellizzoni, Prathap Kumar Valsan. Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems. Euromicro Conference on Real-Time Systems (ECRTS), 2015. [pdf] [ppt]

slide-78
SLIDE 78

Problem

(Figure: Core1-Core4, shared L3, memory controller (MC), and DRAM DIMM with Banks 1-4)

  • The OS does NOT know about DRAM banks
  • OS memory pages are spread all over multiple banks

Unpredictable memory performance

SMP OS

78

slide-79
SLIDE 79


PALLOC

(Figure: Core1-Core4, shared cache, memory controller (MC), and DRAM DIMM with Banks 1-4; PALLOC decides which bank each page goes to)

  • The OS is aware of the DRAM mapping
  • Each page can be allocated to a desired DRAM bank

Flexible allocation policy

SMP OS

79

slide-80
SLIDE 80

PALLOC

(Figure: with private banking, each core’s pages map only to its own DRAM bank)

  • Private banking

– Allocate pages on certain exclusively assigned banks

Eliminates inter-core bank conflicts

80

slide-81
SLIDE 81

Challenges

  • Finding the DRAM address mapping
  • Modifying Linux’s memory allocation mechanism

81

slide-82
SLIDE 82

Identifying Memory Mapping

  • Memory mappings are platform specific
  • We developed a detection tool software

82

(Figure: reverse-engineered address maps: DRAM bank, channel, and cache-set bit positions for the two platforms below)

Intel Xeon 3530 + 4GiB DDR3 DIMM (16 banks)
Freescale P4080 + 2x2 GiB DDR3 DIMM (32 banks)

slide-83
SLIDE 83

DRAM Address Map Detector

83

https://github.com/heechul/palloc/blob/master/README-map-detector.md

(Figure: measured access latency: slow when the two test addresses map to the same bank (row conflicts), fast when they map to different banks)
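The idea behind such a detector can be sketched in C as follows (my simplification, not the PALLOC tool itself, and x86-specific because it uses clflush/rdtsc): time alternating uncached accesses to two addresses that differ in one candidate bit; a consistently higher latency suggests the pair lands in the same bank but different rows. A real detector works on physical addresses (via huge pages or /proc/pid/pagemap); here only the timing side is illustrated on a virtual buffer.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtsc (x86 only) */

    /* Average latency of N alternating, uncached accesses to two addresses. */
    static uint64_t time_pair(volatile char *a, volatile char *b, int iters)
    {
        uint64_t start = __rdtsc();
        for (int i = 0; i < iters; i++) {
            _mm_clflush((const void *)a);
            _mm_clflush((const void *)b);
            _mm_mfence();
            (void)*a;
            (void)*b;
        }
        return (__rdtsc() - start) / iters;
    }

    int main(void)
    {
        size_t size = 64 * 1024 * 1024;
        char *buf = malloc(size);
        if (!buf) return 1;
        for (size_t i = 0; i < size; i += 4096) buf[i] = 1;   /* fault pages in */

        for (int bit = 12; bit < 21; bit++) {
            uint64_t t = time_pair(buf, buf + (1UL << bit), 10000);
            printf("bit %2d: avg %llu cycles\n", bit, (unsigned long long)t);
        }
        free(buf);
        return 0;
    }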

slide-84
SLIDE 84

More Recent Solution

84

https://github.com/IAIK/drama

slide-85
SLIDE 85

Complex Addressing

85

slide-86
SLIDE 86

Implementation

  • Modified Linux kernel’s buddy allocator

– DRAM bank-aware page frame allocation at each page fault

86

slide-87
SLIDE 87

Background

  • On how Linux allocates memory

87

slide-88
SLIDE 88

User-level Memory Allocation

  • When does a process actually allocate memory from the kernel?

– On a page fault
– Allocate a page (e.g., 4KB)

  • What does malloc() do?

– Doesn’t physically allocate pages
– Manages a process’s heap
– Variable-size objects in the heap

88
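A small C illustration (mine, not from the slides) of this point: malloc only reserves virtual address space, and physical page frames are handed out by the kernel's page allocator when each page is first touched, one page fault per page.

    #include <stdio.h>
    #include <stdlib.h>

    #define SZ (64UL * 1024 * 1024)   /* 64 MB */

    int main(void)
    {
        /* malloc (via the heap / mmap) only sets up virtual mappings. */
        char *p = malloc(SZ);
        if (!p) return 1;

        /* Each first write to a 4KB page triggers a page fault; the kernel's
         * buddy allocator supplies a physical frame at that moment.
         * (The minor page-fault count grows by roughly SZ/4096.) */
        for (size_t off = 0; off < SZ; off += 4096)
            p[off] = 1;

        free(p);
        return 0;
    }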

slide-89
SLIDE 89

Kernel-level Memory Allocation

  • Page-level allocator (low-level)

– Page granularity (4K)
– Buddy allocator

  • Other kernel-memory allocators

– Support fine-grained allocations
– Slab, kmalloc, vmalloc allocators

89

slide-90
SLIDE 90

Kernel-Level Memory Allocators

90

Kernel code calls into:
kmalloc: arbitrary size objects
SLAB allocator: multiple fixed-sized object caches
Page allocator (buddy): allocate power-of-two pages (4K, 8K, 16K, …)
vmalloc: (large) non-physically contiguous memory

slide-91
SLIDE 91

Buddy Allocator

  • Linux’s page-level allocator
  • Allocate power of two number of pages: 1, 2, 4, 8, … pages.
  • Request rounded up to next highest power of 2
  • When a smaller allocation is needed than is available, the current chunk is split into two buddies of the next-lower power of 2

  • Quickly expand/shrink across the lists

91

(Figure: per-order free lists: 4KB, 8KB, 16KB, 32KB, …)
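As a toy illustration (not the kernel's actual code), the core of buddy allocation is: round the request up to an order (a power-of-two number of pages), take a block from the smallest non-empty free list of at least that order, and split it repeatedly, returning each unused buddy half to the lower-order free list. The free lists here are just counters.

    #include <stddef.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096
    #define MAX_ORDER 11                  /* like Linux: orders 0..10 */

    static int free_count[MAX_ORDER];     /* toy free lists: a count per order */

    /* Smallest order whose block (2^order pages) covers 'bytes'. */
    static int order_for(size_t bytes)
    {
        size_t pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
        int order = 0;
        while ((1UL << order) < pages)
            order++;
        return order;
    }

    /* Allocate by splitting the first available larger block (toy version). */
    static int alloc_order(int want)
    {
        int o = want;
        while (o < MAX_ORDER && free_count[o] == 0)
            o++;
        if (o == MAX_ORDER) return -1;    /* out of memory */
        free_count[o]--;
        while (o > want) {                /* split: keep one half, free its buddy */
            o--;
            free_count[o]++;
        }
        return want;
    }

    int main(void)
    {
        free_count[6] = 1;                /* one free 256KB chunk (2^6 pages) */
        int o = order_for(21 * 1024);     /* 21KB request -> order 3 (32KB) */
        printf("request 21KB -> order %d (%lu KB)\n",
               o, (1UL << o) * PAGE_SIZE / 1024);
        alloc_order(o);
        for (int i = 0; i < 8; i++)
            printf("order %d (%3lu KB) free blocks: %d\n",
                   i, (1UL << i) * PAGE_SIZE / 1024, free_count[i]);
        return 0;
    }

Running this reproduces the 256KB/21KB example on the next slide: after the allocation, one 32KB, one 64KB, and one 128KB block remain free.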

slide-92
SLIDE 92

Buddy Allocator

  • Example

– Assume 256KB chunk available, kernel requests 21KB

92

(Figure: splitting a 256KB free chunk for a 21KB request, which rounds up to 32KB)
256 Free
128 Free | 128 Free
64 Free | 64 Free | 128 Free
32 Free | 32 Free | 64 Free | 128 Free
32 A | 32 Free | 64 Free | 128 Free

slide-93
SLIDE 93

Buddy Allocator

  • Example

– Free A

93

(Figure: freeing A, after which buddies coalesce back up)
32 A | 32 Free | 64 Free | 128 Free
32 Free | 32 Free | 64 Free | 128 Free
64 Free | 64 Free | 128 Free
128 Free | 128 Free
256 Free

slide-94
SLIDE 94

PALLOC in Linux

94

Kernel code calls into:
kmalloc: arbitrary size objects
SLAB allocator: multiple fixed-sized object caches
PALLOC (in place of the buddy page allocator)
vmalloc: (large) non-physically contiguous memory

slide-95
SLIDE 95

Simplified Pseudocode

95

slide-96
SLIDE 96

PALLOC Interface

  • Example: per-core private banking (PB)

96

# cd /sys/fs/cgroup
# mkdir core0 core1 core2 core3            → create 4 cgroup partitions
# echo 0-3 > core0/palloc.dram_bank        → assign bank 0 ~ 3 for the core0 partition
# echo 4-7 > core1/palloc.dram_bank
# echo 8-11 > core2/palloc.dram_bank
# echo 12-15 > core3/palloc.dram_bank

slide-97
SLIDE 97

Evaluation Platforms

  • Platform #1: Intel Xeon 3530

– x86-64, 4 cores, 8MB shared L3 cache
– 1 x 4GB DDR3 DRAM module (16 banks)
– Modified Linux 3.6.0

  • Platform #2: Freescale P4080

– PowerPC, 8 cores, 2MB shared LLC
– 2 x 2GB DDR3 DRAM module (32 banks)
– Modified Linux 3.0.6

97

slide-98
SLIDE 98

Samebank vs. Diffbank

  • Samebank: all cores → Bank 0
  • Diffbank: Core0 → Bank 0, Core1-3 → Banks 1-3

– Zero interference!!!

98

slide-99
SLIDE 99

Real-Time Performance

  • Setup: HRT → Core0, X-server → Core1
  • Buddy: no bank control (uses all of Banks 0-15)
  • Diffbank: Core0 → Banks 0-7, Core1 → Banks 8-15

99

(Figure legend: Buddy (solo), PALLOC (diffbank), Buddy)

slide-100
SLIDE 100

SPEC2006

  • Uses 15 benchmarks of high to medium memory intensity

100

slide-101
SLIDE 101

Performance Impact on Unicore

  • # of banks vs. performance on a single core
  • Finding: bank partitioning negatively affects performance due to reduced MLP (memory-level parallelism), but not significantly for most benchmarks

101

(Figure: normalized IPC (0.0-1.2) per benchmark with 4, 8, and 16 private banks vs. Buddy)

slide-102
SLIDE 102

Performance Isolation on 4 Cores

  • Setup: Core0: X-axis, Core1-3: 470.lbm x 3 (interference)
  • PB: DRAM bank partitioning only;
  • PB+PC: DRAM bank and Cache partitioning
  • Finding: bank (and cache) partitioning improves isolation, but far from ideal

102

(Figure: slowdown ratio (0-7) of the Core0 benchmark under buddy, PB, and PB+PC)

slide-103
SLIDE 103

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas

103

slide-104
SLIDE 104

Cache Interference Experiments

  • Measure the performance of the ‘subject’

– (1) alone, (2) with co-runners
– LLC is partitioned (equal partition) using PALLOC (*)

  • Q: Does cache partitioning provide isolation?

104

(Figure: the subject runs on Core1 and the co-runners on Core2-4; all share the LLC and DRAM)

(*) Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni. “PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14

slide-105
SLIDE 105

IsolBench: Synthetic Workloads

  • Latency

– A linked-list traversal, data dependency, one outstanding miss

  • Bandwidth

– Array reads or writes, no data dependency, multiple outstanding misses

  • Subject benchmarks: LLC partition fitting

105

Working-set size: (LLC) < ¼ of the LLC → cache hits; (DRAM) > 2× the LLC → cache misses
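A minimal C sketch of the two kinds of workloads (my rendition; IsolBench's actual code differs): Latency chases pointers through a randomly shuffled list, so each load depends on the previous one and only one miss is outstanding; Bandwidth streams through an array with independent accesses, so the core can keep many misses in flight.

    #include <stdio.h>
    #include <stdlib.h>

    /* "Latency": pointer chasing; each load depends on the previous one. */
    static long chase(size_t *next, size_t start, long iters)
    {
        size_t idx = start;
        for (long i = 0; i < iters; i++)
            idx = next[idx];               /* serialized cache misses */
        return (long)idx;                  /* defeat dead-code elimination */
    }

    /* "Bandwidth": independent sequential reads; misses can overlap. */
    static long stream(long *arr, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i += 8)  /* one access per 64B cache line */
            sum += arr[i];
        return sum;
    }

    int main(void)
    {
        size_t n = 1 << 22;                /* size the working set vs. the LLC partition */
        size_t *next = malloc(n * sizeof(*next));
        long   *arr  = malloc(n * sizeof(*arr));
        if (!next || !arr) return 1;

        /* Build a random single-cycle permutation for pointer chasing (Sattolo). */
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        for (size_t i = 0; i < n; i++) arr[i] = (long)i;

        printf("%ld %ld\n", chase(next, 0, 1000000), stream(arr, n));
        free(next); free(arr);
        return 0;
    }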

slide-106
SLIDE 106

Latency(LLC) vs. BwRead(DRAM)

  • No interference on Cortex-A7 and Nehalem
  • On Cortex-A15, Latency(LLC) suffers 3.8X slowdown

– despite partitioned LLC

106

slide-107
SLIDE 107

BwRead(LLC) vs. BwRead(DRAM)

  • Up to 10.6X slowdown on Cortex-A15
  • Cache partitioning != performance isolation

– On all tested out-of-order cores (A9, A15, Nehalem)

107

slide-108
SLIDE 108

BwRead(LLC) vs. BwWrite(DRAM)

  • Up to 21X slowdown on Cortex-A15
  • Writes generally cause more slowdowns

– Due to write-backs

108

slide-109
SLIDE 109

EEMBC and SD-VBS

  • X-axis: EEMBC, SD-VBS (cache partition fitting)

– Co-runners: BwWrite(DRAM)

  • Cache partitioning != performance isolation

109

Cortex-A7 (in-order) Cortex-A15 (out-of-order)

slide-110
SLIDE 110

Summary

  • PALLOC

– Page-coloring based kernel-level memory allocator
– Can partition the cache and DRAM banks

  • Findings

– Space partitioning (dedicated DRAM banks, cache partition) improves performance isolation
– But not perfect: memory bus contention, MSHR, …

  • Things to consider

– Multichannel, NUMA systems
– Complex addressing schemes in modern DRAM controllers

110

https://github.com/heechul/palloc

slide-111
SLIDE 111

PALLOC in Linux

111

Kernel code calls into:
kmalloc: arbitrary size objects
SLAB allocator: multiple fixed-sized object caches
PALLOC (in place of the buddy page allocator)
vmalloc: (large) non-physically contiguous memory

slide-112
SLIDE 112

Page Fault

  • When a virtual address cannot be translated to a physical address, the MMU generates a trap to the OS

  • Page fault handling procedure

– Step 1: allocate a free page frame
– Step 2: bring in the stored page from disk (if necessary)
– Step 3: update the PTE (mapping and valid bit)
– Step 4: restart the instruction

112

slide-113
SLIDE 113

Page Fault Handling

113

slide-114
SLIDE 114

Demand Paging

114

slide-115
SLIDE 115

Starting Up a Process

115

Unmapped pages

Code Data Heap Stack

slide-116
SLIDE 116

Starting Up a Process

116

Access next instruction

Code Data Heap Stack

slide-117
SLIDE 117

Starting Up a Process

117

Page fault

Code Data Heap Stack

slide-118
SLIDE 118

Starting Up a Process

118

OS:
1) allocates a free page frame
2) loads the missed page from the disk (exec file)
3) updates the page table entry

Code Data Heap Stack

slide-119
SLIDE 119

Starting Up a Process

119

Code Data Heap Stack

Over time, more pages are mapped as needed

slide-120
SLIDE 120

Page Fault Handling in Linux

  • __handle_mm_fault (mm/memory.c)

– handle_pte_fault

  • do_anonymous_page

– pte_alloc(vma->mm, .., fe->address)
– page = alloc_zeroed_user_highpage_movable
  » alloc_pages_nodemask

  • alloc_pages_nodemask (mm/page_alloc.c)

– buffered_rmqueue

__rmqueue_smallest

120

slide-121
SLIDE 121

Simplified Pseudocode

121

slide-122
SLIDE 122

rmqueue_smallest

    ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]);
    cmap = ph->cmap;
    if (order == 0) {
        page = palloc_find_cmap(zone, cmap, 0, c_stat);
        if (page)
            return page;
        /* Search the entire list. Make color cache in the process */
        for (current_order = 0; current_order < MAX_ORDER; ++current_order) {
            area = &(zone->free_area[current_order]);
            list_for_each(curr, tmp, &area->free_list[migratetype]) {
                palloc_insert(zone, page, current_order);
                page = palloc_find_cmap(zone, cmap, 0, c_stat);
                …

122

slide-123
SLIDE 123

Node, Zone

123

slide-124
SLIDE 124

Node, Zone

  • x86 PC

$ cat /proc/buddyinfo
Node 0, zone    DMA     1     1    1    1   1   1  1  0  1 1    3
Node 0, zone  DMA32     4     5    6    8   7   7  8  7  9 6  696
Node 0, zone Normal  3390 17134 4920 1413 537 239 79 46 11 4 5659

  • x86 server (NUMA)

$ cat /proc/buddyinfo
Node 0, zone    DMA      0      0      1     0     2    1   1  0 1 1 3
Node 0, zone  DMA32  18761  13471  10910  7731  5786 2960 756 57 1 0 0
Node 0, zone Normal 129885  67082   9032  1217    68   10   8  1 0 0 0
Node 1, zone Normal 294117 197469 109366 84395 10394  818   7  4 3 0 0

  • Pi3

$ cat /proc/buddyinfo
Node 0, zone Normal 2975 263 3898 1816 66 30 16 11 4 2 71

124

slide-125
SLIDE 125

Pagemap

  • /proc/<pid>/pagemap

– A 64-bit value for each virtual page

  • Bits 0-54: page frame number (PFN), if present
  • Bit 62: page swapped
  • Bit 63: page present
  • /proc/<pid>/maps

– The process’s mapped virtual memory regions

125
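For illustration (my sketch, not PALLOC's pagetype tool), reading /proc/self/pagemap from C to find the physical frame behind a virtual address; the bank or color bits can then be computed from it as in the earlier examples. On recent kernels, seeing PFNs requires root (or CAP_SYS_ADMIN).

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Returns the physical address backing 'vaddr' of the calling process,
     * or 0 if the page is not present (or PFNs are hidden from us). */
    static uint64_t virt_to_phys(uint64_t vaddr)
    {
        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0) return 0;
        uint64_t entry = 0;
        off_t off = (off_t)(vaddr / 4096) * 8;     /* one 64-bit entry per page */
        if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry)) entry = 0;
        close(fd);
        if (!(entry & (1ULL << 63))) return 0;     /* bit 63: page present */
        uint64_t pfn = entry & ((1ULL << 55) - 1); /* bits 0-54: PFN */
        return (pfn * 4096) | (vaddr & 0xfff);
    }

    int main(void)
    {
        char *p = malloc(4096);
        p[0] = 1;                                  /* touch it so a frame exists */
        printf("virt %p -> phys 0x%llx\n", (void *)p,
               (unsigned long long)virt_to_phys((uint64_t)(uintptr_t)p));
        free(p);
        return 0;
    }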

slide-126
SLIDE 126

Pagetype

# pagetype -L -p `pidof bandwidth`
voffset offset flags
10      2a6f6  color=1 __RU_lA____M______________________
11      2a6f7  color=1 __RU_lA____M______________________
21      1bf67  color=1 ___U_lA____Ma_b___________________
22      26f73  color=0 ___UDlA____Ma_b___________________
d3e     251c7  color=1 ___U_lA____Ma_b___________________
769de   32d13  color=0 ___UDlA____Ma_b___________________
769df   31ebe  color=3 ___UDlA____Ma_b___________________
769e0   31a4d  color=3 ___UDlA____Ma_b___________________
769e1   20c74  color=1 ___U_lA____Ma_b___________________
769e2   285ae  color=3 ___UDlA____Ma_b___________________
769e3   313f5  color=1 ___U_lA____Ma_b___________________
769e4   154b7  color=1 ___U_lA____Ma_b___________________
769e5   1fa36  color=1 ___U_lA____Ma_b___________________
769e6   7945   color=1 ___U_lA____Ma_b___________________

126

(Columns: virtual page number (<<12), physical page frame number (<<12), page color, and page table entry flags)