slide-1
SLIDE 1

Changelog

Changes not seen in first lecture:

19 March 2020: move page usage slides later
19 March 2020: adjust PF counting exercise to specify addresses, not offsets
19 March 2020: Linux maps: correct shown mmap call for 0x400000

slide-2
SLIDE 2

virtual memory 3

slide-3
SLIDE 3

Zoom logistics

recommend: exit full screen, open chat + participants window

participants window has non-verbal feedback features
I will try to monitor the chat window
I can take questions via raise hand + turn on your audio…
  but probably text is usually easier/more reliable?
I intend to record these (both through Zoom and locally)

slide-4
SLIDE 4

general logistics

lectures streamed via Zoom with questions
videos + audio-recordings + slides available
  if you have trouble getting at anything, let us know
please use Piazza
office hours via Discord with queue
quizzes still happening

slide-5
SLIDE 5

last time

virtual memory: two-level tables

page fault handling
  return from page fault normally → retry instruction
  trick: fix page table before returning

allocate-on-demand
  pretend to allocate right away
  actually allocate later (on use)

copy-on-write
  pretend to copy right away
  actually allocate later (on write)

slide-6
SLIDE 6

xv6: adding space on demand

    struct proc {
        uint sz;   // Size of process memory (bytes)
        ...
    };

xv6 tracks “end of heap” (now just for sbrk())

adding allocate-on-demand logic for the heap:
  on sbrk(): don’t change page table right away
  on page fault:
    case 1: if address ≥ sz: out of bounds: kill process
    case 2: otherwise, allocate page containing address, return from trap
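
The two page-fault cases above can be sketched as a small user-space simulation. This is not xv6 code: `toy_proc`, `toy_sbrk`, and `toy_page_fault` are hypothetical names, the size limit is an assumption, and the "page table" is just an array of flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PGSIZE   4096u
#define MAXPAGES 64u            /* toy limit on process size (assumption) */

/* Toy stand-in for the relevant part of xv6's struct proc. */
struct toy_proc {
    uint32_t sz;                /* size of process memory (bytes) */
    bool mapped[MAXPAGES];      /* toy "page table": is this page allocated? */
};

/* sbrk(): only move the end of the heap; do not touch the page table.
   Returns the old size, or -1 if the toy limit would be exceeded. */
long toy_sbrk(struct toy_proc *p, uint32_t n) {
    assert(p != NULL);
    if (n > MAXPAGES * PGSIZE - p->sz)
        return -1;
    uint32_t old = p->sz;
    p->sz += n;
    return (long)old;
}

/* Page fault at address addr.
   Returns -1 for case 1 (out of bounds: kill the process),
   0 for case 2 (page allocated; return from trap and retry). */
int toy_page_fault(struct toy_proc *p, uint32_t addr) {
    assert(p != NULL);
    if (addr >= p->sz)
        return -1;
    /* stands in for allocating a physical page and mapping it */
    p->mapped[addr / PGSIZE] = true;
    return 0;
}
```

After `toy_sbrk(&p, 2 * PGSIZE)` no page is marked as mapped; a page only becomes mapped when `toy_page_fault` is called for an address inside it.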

slide-7
SLIDE 7

versus more complicated OSes

typical desktop/server: range of valid addresses is not just 0 to maximum
need some more complicated data structure to represent it

slide-8
SLIDE 8

copy-on-write cases

trying to write forbidden page (e.g. kernel memory)
  kill program instead of making it writable

fault from trying to write read-only page:
  case 1: multiple processes’ page table entries refer to it
    copy the page
    replace read-only page table entry to point to copy
  case 2: only one page table entry refers to it
    make it writeable
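
The cases above can be sketched as a toy write-fault handler. The types and the `cow` flag are hypothetical, chosen only to make the decision visible; a real kernel tracks this information differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PGSIZE 4096

/* Toy physical page with a count of page table entries referring to it. */
struct toy_page {
    int refcount;
    unsigned char data[PGSIZE];
};

/* Toy page table entry. */
struct toy_pte {
    struct toy_page *page;
    bool writable;
    bool cow;        /* read-only only because of copy-on-write? */
};

/* Handle a write fault on a read-only entry.
   Returns -1 if the write is forbidden (caller should kill the program)
   or memory runs out, 0 if the entry is now writable and the write can
   be retried. */
int toy_write_fault(struct toy_pte *pte) {
    assert(pte != NULL && pte->page != NULL);
    if (!pte->cow)
        return -1;                       /* forbidden page, e.g. kernel memory */
    if (pte->page->refcount > 1) {       /* case 1: page is shared */
        struct toy_page *copy = malloc(sizeof *copy);
        if (copy == NULL)
            return -1;
        memcpy(copy->data, pte->page->data, PGSIZE);
        copy->refcount = 1;
        pte->page->refcount--;           /* this entry no longer refers to it */
        pte->page = copy;
    }
    /* case 2 (or after copying): sole owner, so just allow writes */
    pte->writable = true;
    pte->cow = false;
    return 0;
}
```

With two entries sharing one page, the first faulting entry gets a copy (case 1); the second is then the only remaining reference and is simply made writable (case 2).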

slide-9
SLIDE 9

mmap

Linux/Unix has a function to “map” a file to memory

    int file = open("somefile.dat", O_RDWR);
    // data is region of memory that represents file
    char *data = mmap(..., file, 0);
    // read byte 6 (zero-indexed) from somefile.dat
    char seventh_char = data[6];
    // modifies byte 100 of somefile.dat
    data[100] = 'x';
    // can continue to use 'data' like an array
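
A fuller version of the snippet above, with error checks, might look like the sketch below. The slide elides the first mmap arguments; `NULL`, `PROT_READ | PROT_WRITE`, and `MAP_SHARED` are one plausible choice and are assumptions here, as is the function name `read_seventh_and_mark`.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the whole file at path, read byte 6, and overwrite byte 100 with 'x'.
   Returns byte 6 (as an unsigned char value), or -1 on any failure,
   including a file shorter than 101 bytes. */
int read_seventh_and_mark(const char *path) {
    assert(path != NULL);
    int file = open(path, O_RDWR);
    if (file < 0)
        return -1;

    struct stat st;
    if (fstat(file, &st) < 0 || st.st_size < 101) {
        close(file);
        return -1;
    }

    /* Assumed arguments: let the OS pick the address, allow reads and
       writes, and make writes visible in the file (MAP_SHARED). */
    char *data = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, file, 0);
    close(file);                 /* the mapping stays valid after close() */
    if (data == MAP_FAILED)
        return -1;

    unsigned char seventh_char = (unsigned char)data[6];  /* read byte 6 */
    data[100] = 'x';             /* modifies byte 100 of the file */

    munmap(data, (size_t)st.st_size);
    return seventh_char;
}
```

The size check matters: touching a mapped page that lies entirely past the end of the file raises a signal instead of returning data.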

slide-10
SLIDE 10

mmap options (1)

    #include <sys/mman.h>
    void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

length bytes from open file fd starting at byte offset
  (Linux extension: can omit fd with special value of flags)

protection flags prot, bitwise or together 1 or more of:
  PROT_READ
  PROT_WRITE
  PROT_EXEC
  PROT_NONE (for forcing segfaults)

slide-13
SLIDE 13

mmap options (2)

    #include <sys/mman.h>
    void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

flags, choose one of:

MAP_SHARED: changing memory changes file and vice-versa
  multiple processes mmap same file: get same physical pages
  read()/write() must use same physical pages
  changes to memory (if writable) must be sent to disk eventually

MAP_PRIVATE: make a copy of data in file
  changes to memory do not change file
  almost as if copied during mmap call
  but probably actually copied using copy-on-write
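
The difference between the two flags can be observed from user space. The sketch below (the helper name `write_via_mapping` is hypothetical) stores a byte through a mapping and then checks what read() sees in the same file. It relies on the Linux behavior described above, where read() and shared mappings use the same cached pages.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the start of path with the given flags (MAP_SHARED or MAP_PRIVATE),
   store c into byte 0 through the mapping, then read byte 0 back with
   read().  path must name an existing, non-empty file.
   Returns the byte read() sees, or -1 on failure. */
int write_via_mapping(const char *path, int flags, char c) {
    assert(flags == MAP_SHARED || flags == MAP_PRIVATE);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    p[0] = c;                          /* write through the mapping */

    char seen;
    ssize_t n = read(fd, &seen, 1);    /* file offset is still 0 */
    munmap(p, 1);
    close(fd);
    return n == 1 ? (unsigned char)seen : -1;
}
```

With MAP_SHARED the returned byte is the one just stored; with MAP_PRIVATE the store lands in a private copy and read() still returns the file's old contents.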

slide-18
SLIDE 18

mmap options (3)

    #include <sys/mman.h>
    void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

flags, choose one of:
  MAP_SHARED: changing memory changes file and vice-versa
  MAP_PRIVATE: make a copy of data in file
…or’d with optional additional flags
  Linux: MAP_ANONYMOUS: ignore fd, allocate empty space
    trick: Linux tracks process’s memory as list of mmap’s
    …‘normal’ memory heap, just special case w/o file
  and more (see manual page)
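
As a minimal sketch of MAP_ANONYMOUS, the helpers below allocate and release memory with no backing file. The names `anon_alloc` and `anon_free` are hypothetical; the mmap and munmap calls are the real interface.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Allocate len bytes of zero-filled memory that is not backed by any file.
   Returns NULL on failure. */
void *anon_alloc(size_t len) {
    assert(len > 0);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1 /* fd is ignored */, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release memory obtained from anon_alloc(); returns 0 on success. */
int anon_free(void *p, size_t len) {
    return munmap(p, len);
}
```

Passing -1 for fd is the portable convention when MAP_ANONYMOUS is set; the memory starts out zero-filled.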

slide-19
SLIDE 19

mmap options (4)

    #include <sys/mman.h>
    void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

addr, suggestion about where to put mapping (may be ignored)
  not mandatory unless MAP_FIXED is used (which is rare)
  can pass NULL: “choose for me”

address chosen will be returned
  MAP_FAILED (constant) on failure

read()/write()/etc. use same physical memory that’s referenced by process’s page table
  …and OS must eventually modify disk with changes

slide-20
SLIDE 20

mmap exercise

suppose hello.txt initially contains “foo”:

    int fd = open("hello.txt", O_RDWR);
    char *p1 = mmap(NULL, 3 /* size */, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    char *p2 = mmap(NULL, 3, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
    char *p3 = mmap(NULL, 3, PROT_READ, MAP_SHARED, fd, 0);
    p2[2] = 'b';
    p1[2] = 'x';
    p1[1] = 'i';
    char buffer[3];
    read(fd, buffer, 3);
    // %.3s: print at most 3 bytes, since buffer has no terminating '\0'
    printf("%.3s/%.3s/%.3s\n", buffer, p2, p3);

What is the output? (Assume no failures.)

  • A. foo/fob/foo
  • B. fix/fob/foo
  • C. fix/fix/fix
  • D. fix/fob/fix
  • E. fix/fob/fob
  • F. something else

slide-22
SLIDE 22

Linux maps

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 08:01 48328831 /bin/cat
0060a000-0060b000 r--p 0000a000 08:01 48328831 /bin/cat
0060b000-0060c000 rw-p 0000b000 08:01 48328831 /bin/cat
01974000-01995000 rw-p 00000000 00:00 0 [heap]
7f60c718b000-7f60c7490000 r--p 00000000 08:01 77483660 /usr/lib/locale/locale-archive
7f60c7490000-7f60c764e000 r-xp 00000000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c764e000-7f60c784e000 ---p 001be000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c784e000-7f60c7852000 r--p 001be000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c7852000-7f60c7854000 rw-p 001c2000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c7854000-7f60c7859000 rw-p 00000000 00:00 0
7f60c7859000-7f60c787c000 r-xp 00000000 08:01 96659109 /lib/x86_64-linux-gnu/ld-2.19.so
7f60c7a39000-7f60c7a3b000 rw-p 00000000 00:00 0
7f60c7a7a000-7f60c7a7b000 rw-p 00000000 00:00 0
7f60c7a7b000-7f60c7a7c000 r--p 00022000 08:01 96659109 /lib/x86_64-linux-gnu/ld-2.19.so
7f60c7a7c000-7f60c7a7d000 rw-p 00023000 08:01 96659109 /lib/x86_64-linux-gnu/ld-2.19.so
7f60c7a7d000-7f60c7a7e000 rw-p 00000000 00:00 0
7ffc5d2b2000-7ffc5d2d3000 rw-p 00000000 00:00 0 [stack]
7ffc5d3b0000-7ffc5d3b3000 r--p 00000000 00:00 0 [vvar]
7ffc5d3b3000-7ffc5d3b5000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

first line (00400000-0040b000 r-xp … /bin/cat):
  at virtual addresses 0x400000–0x40b000
  read, not write, execute, private
    private = copy-on-write (if writeable)
  starting at offset 0 of the file /bin/cat
  device major number 8, device minor number 1, inode 48328831
    more on what this means when we talk about filesystems
  as if:

    int fd = open("/bin/cat", O_RDONLY);
    mmap(0x400000, 0xb000, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0x0);

third line (0060b000-0060c000 rw-p … /bin/cat):
  read/write, copy-on-write (private) mapping
  (aside: probably used for global variables)
  as if:

    int fd = open("/bin/cat", O_RDONLY);
    mmap(0x60b000, 0x1000, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0xb000);

[heap] line: no corresponding file
  allocated using sbrk()
  but can get same effect with mmap() call

lines with no file, e.g. the 0x5000-byte region at 7f60c7854000-7f60c7859000, as if:

    mmap(..., 0x5000, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS /* = no file */, ...);
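
To make the field-by-field reading above concrete, here is a sketch that parses one maps line and recovers the equivalent mmap arguments. The struct and function names are this sketch's own, and it assumes a 64-bit `unsigned long`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* One parsed line of /proc/PID/maps (field names are this sketch's own). */
struct map_entry {
    unsigned long start, end;     /* virtual address range [start, end) */
    unsigned long offset;         /* offset into the mapped file */
    int prot;                     /* PROT_* bits equivalent to "rwx" */
    bool is_private;              /* 'p' = MAP_PRIVATE, 's' = MAP_SHARED */
    unsigned long inode;          /* 0 means no file (anonymous) */
};

/* Parse one maps line into *e.  Returns 0 on success, -1 if malformed. */
int parse_maps_line(const char *line, struct map_entry *e) {
    assert(line != NULL && e != NULL);
    char perms[5] = {0};
    unsigned int major, minor;    /* device numbers, parsed but not kept */
    if (sscanf(line, "%lx-%lx %4s %lx %x:%x %lu",
               &e->start, &e->end, perms, &e->offset,
               &major, &minor, &e->inode) != 7)
        return -1;
    if (perms[3] != 'p' && perms[3] != 's')
        return -1;
    e->prot = PROT_NONE;
    if (perms[0] == 'r') e->prot |= PROT_READ;
    if (perms[1] == 'w') e->prot |= PROT_WRITE;
    if (perms[2] == 'x') e->prot |= PROT_EXEC;
    e->is_private = (perms[3] == 'p');
    return 0;
}
```

For the first line of the listing this yields start 0x400000, length `end - start` of 0xb000, `PROT_READ | PROT_EXEC`, a private mapping, and offset 0, which are the arguments of the "as if" mmap call.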

slide-28
SLIDE 28

mapped pages (read-only)

[figure: virtual pages mapped to file; page table (part); file data, cached in memory; file data on disk/SSD]

initially: all page table entries invalid? (could also prefill entries…)

read from second page? page fault
  PF handler: find cached page
  update page table, retry

read from first page? page fault
  PF handler: no cached page, so first read in page
  now point page table entry to page
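
The fault handling sequence above can be sketched as a toy simulation. All names are hypothetical, and the four-page file is an assumption; the arrays stand in for the page table and the OS's cache of file data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4                 /* toy file of 4 pages (assumption) */

enum fault_kind { NO_FAULT, FOUND_IN_CACHE, READ_FROM_DISK };

struct toy_mapping {
    bool pte_valid[NPAGES];      /* page table entries for the mapping */
    bool cached[NPAGES];         /* is this file page cached in memory? */
    int disk_reads;              /* how many pages were read from disk */
};

/* Simulate a read of file page n through the mapping and report what the
   page fault handler had to do before the access could be retried. */
enum fault_kind toy_read_page(struct toy_mapping *m, size_t n) {
    assert(m != NULL && n < NPAGES);
    if (m->pte_valid[n])
        return NO_FAULT;                 /* valid entry: no fault at all */
    enum fault_kind kind = FOUND_IN_CACHE;
    if (!m->cached[n]) {                 /* no cached page: read it in first */
        m->disk_reads++;
        m->cached[n] = true;
        kind = READ_FROM_DISK;
    }
    m->pte_valid[n] = true;              /* point the entry at the cached page */
    return kind;                         /* the faulting read is then retried */
}
```

Starting with every entry invalid, the first access to each page faults once; later accesses to the same page find a valid entry and do not fault.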

slide-35
SLIDE 35

shared mmap

    int fd = open("/tmp/somefile.dat", O_RDWR);
    mmap(0, 64 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

from /proc/PID/maps for this program:

    7f93ad877000-7f93ad887000 rw-s 00000000 08:01 1839758 /tmp/somefile.dat

slide-36
SLIDE 36

mapped pages (read/write, shared)

[figure: virtual pages mapped to file; page table (part); file data, cached in memory; file data on disk/SSD]

write to page? update cached file data
  data on disk out of date
eventually free memory… write update to disk
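
A program that cannot wait for "eventually" can request the write-back itself with msync(). The sketch below uses a hypothetical helper name, `write_and_flush`; the system calls are the standard ones.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Overwrite byte 0 of path through a shared mapping, then ask the OS to
   write the change to disk now rather than at some later time.
   path must name an existing, non-empty file.  Returns 0 on success. */
int write_and_flush(const char *path, char c) {
    assert(path != NULL);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                           /* mapping remains valid */
    if (p == MAP_FAILED)
        return -1;

    p[0] = c;                            /* updates the cached file data */
    int rc = msync(p, 1, MS_SYNC);       /* waits for the write-back */
    munmap(p, 1);
    return rc;
}
```

Without the msync() call the change is still visible to other readers of the file through the cached page; only the copy on disk lags behind.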

slide-40
SLIDE 40

minor and major faults

minor page fault
  page is already in memory (“page cache”)
  just fill in page table entry

major page fault
  page not already in memory (“page cache”)
  need to allocate space
  possibly need to read data from disk/etc.

slide-41
SLIDE 41

Linux: reporting minor/major faults

$ /usr/bin/time --verbose some-command
    Command being timed: "some-command"
    User time (seconds): 18.15
    System time (seconds): 0.35
    Percent of CPU this job got: 94%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.57
    ...
    Maximum resident set size (kbytes): 749820
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 230166
    Voluntary context switches: 1423
    Involuntary context switches: 53
    Swaps: 0
    ...
    Exit status: 0
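
A process can also read its own fault counters with getrusage(). The sketch below uses hypothetical helper names and an assumed 4096-byte page size. The exact number of faults it reports depends on the kernel and its configuration, so no particular value should be expected.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

struct fault_counts { long minor, major; };

/* Page fault counts for the calling process so far. */
struct fault_counts current_faults(void) {
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    return (struct fault_counts){ ru.ru_minflt, ru.ru_majflt };
}

/* Touch npages freshly mapped anonymous pages and return how many minor
   faults the process took while doing so, or -1 if mmap() fails. */
long minor_faults_from_touching(size_t npages) {
    const size_t pgsize = 4096;          /* assumed page size */
    size_t len = npages * pgsize;
    volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    struct fault_counts before = current_faults();
    for (size_t i = 0; i < npages; i++)
        p[i * pgsize] = 1;               /* first touch of each page */
    struct fault_counts after = current_faults();
    munmap((void *)p, len);
    return after.minor - before.minor;
}
```

The `ru_minflt` and `ru_majflt` fields correspond to the "Minor" and "Major" lines in the output above.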

slide-43
SLIDE 43

mapped pages (copy-on-write)

[figure: virtual pages mapped to file; page table (part); file data, cached in memory; copies of file data, modified; file data on disk/SSD]

reads like before

write to second page? protection fault
  page table entry says read-only
  fault handler: make copy, update page table

slide-47
SLIDE 47

maps counting

4KB (0x1000 byte) pages
virtual 0x10000-0x1FFFF (64KB) → “foo.dat” bytes 0-0x0FFFF
  map setup private (copy-on-write)
  bytes 0-0x3FFF and 0x5000-0x6FFF cached in memory

program reads addresses 0x13800–0x15800
then, program overwrites addresses 0x14800–0x15100
assume: program page table filled in on demand only
  smarter OS would probably proactively fill in multiple pages

question: how many page/protection faults?

1: set PTE for offset 0x3000-0x3FFF (use cached version)
2,3: read from disk + set PTE for 0x4000-0x4FFF; set PTE for 0x5000-0x5FFF
4,5: copy for 0x4000-0x4FFF, 0x5000-0x5FFF
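
The count above can be checked mechanically with a small simulation of on-demand page table filling. The types and names below are hypothetical. One modeling assumption goes beyond the slide: a write to a page that is not yet present is handled in a single fault, which does not arise in this exercise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PGSIZE 0x1000u
#define NPAGES 16u               /* 64KB mapping / 4KB pages */

struct fault_tally {
    int minor;        /* page faults satisfied from cached file data */
    int major;        /* page faults that had to read from disk */
    int protection;   /* write faults on present pages (make a copy) */
};

struct cow_mapping {
    uint32_t base;               /* first virtual address of the mapping */
    bool cached[NPAGES];         /* file page already in memory? */
    bool present[NPAGES];        /* page table entry filled in? */
    bool writable[NPAGES];       /* private copy already made? */
};

/* Access every page overlapping virtual addresses [first, last] (inclusive),
   filling in the page table on demand and adding any faults to *t. */
void touch_range(struct cow_mapping *m, uint32_t first, uint32_t last,
                 bool is_write, struct fault_tally *t) {
    assert(m != NULL && t != NULL);
    assert(first >= m->base && first <= last);
    uint32_t first_pg = (first - m->base) / PGSIZE;
    uint32_t last_pg = (last - m->base) / PGSIZE;
    assert(last_pg < NPAGES);
    for (uint32_t pg = first_pg; pg <= last_pg; pg++) {
        if (!m->present[pg]) {
            if (m->cached[pg]) {
                t->minor++;              /* use cached version */
            } else {
                t->major++;              /* read from disk first */
                m->cached[pg] = true;
            }
            m->present[pg] = true;
            if (is_write)
                m->writable[pg] = true;  /* assumed: copied in same fault */
        } else if (is_write && !m->writable[pg]) {
            t->protection++;             /* copy page, then allow writes */
            m->writable[pg] = true;
        }
    }
}
```

Marking file pages 0-3 and 5-6 as cached, then calling `touch_range` for the read of 0x13800–0x15800 and the write of 0x14800–0x15100, tallies 2 minor faults, 1 major fault, and 2 protection faults, matching the five steps listed above.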

slide-48
SLIDE 48

maps counting

4KB (0x1000 byte) pages virtual 0x10000-0x1FFFF (64KB) → “foo.dat” bytes 0-0x0FFFF

map setup private (copy-on-write) bytes 0-0x3FFF and 0x5000-0x6FFF cached in memory

program reads addresses 0x13800–0x15800 then, program overwrites addresses 0x14800–0x15100 assume: program page table fjlled in on demand only

smarter OS would probably proactively fjll in multiple pages

question: how much page/protection faults?

1: set PTE for ofgset 0x3000-0x3FFF (use cached version) 2,3: read from disk + set PTE for 0x4000-0x4FFF; set PTE for 0x5000-0x5FFF 4,5: copy for 0x4000-0x4FFF, 0x5000-0x5FFF

25

slide-49
SLIDE 49

maps counting

4KB (0x1000 byte) pages virtual 0x10000-0x1FFFF (64KB) → “foo.dat” bytes 0-0x0FFFF

map setup private (copy-on-write) bytes 0-0x3FFF and 0x5000-0x6FFF cached in memory

program reads addresses 0x13800–0x15800 then, program overwrites addresses 0x14800–0x15100 assume: program page table fjlled in on demand only

smarter OS would probably proactively fjll in multiple pages

question: how much page/protection faults?

1: set PTE for ofgset 0x3000-0x3FFF (use cached version) 2,3: read from disk + set PTE for 0x4000-0x4FFF; set PTE for 0x5000-0x5FFF 4,5: copy for 0x4000-0x4FFF, 0x5000-0x5FFF

26

slide-50
SLIDE 50

Linux maps

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 08:01 48328831 /bin/cat
0060a000-0060b000 r--p 0000a000 08:01 48328831 /bin/cat
0060b000-0060c000 rw-p 0000b000 08:01 48328831 /bin/cat
01974000-01995000 rw-p 00000000 00:00 0 [heap]
7f60c718b000-7f60c7490000 r--p 00000000 08:01 77483660 /usr/lib/locale/locale-archive
7f60c7490000-7f60c764e000 r-xp 00000000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c764e000-7f60c784e000 ---p 001be000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c784e000-7f60c7852000 r--p 001be000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c7852000-7f60c7854000 rw-p 001c2000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c7854000-7f60c7859000 rw-p 00000000 00:00 0
7f60c7859000-7f60c787c000 r-xp 00000000 08:01 96659109 /lib/x86_64-linux-gnu/ld-2.19.so
7f60c7a39000-7f60c7a3b000 rw-p 00000000 00:00 0
7f60c7a7a000-7f60c7a7b000 rw-p 00000000 00:00 0
7f60c7a7b000-7f60c7a7c000 r--p 00022000 08:01 96659109 /lib/x86_64-linux-gnu/ld-2.19.so
7f60c7a7c000-7f60c7a7d000 rw-p 00023000 08:01 96659109 /lib/x86_64-linux-gnu/ld-2.19.so
7f60c7a7d000-7f60c7a7e000 rw-p 00000000 00:00 0
7ffc5d2b2000-7ffc5d2d3000 rw-p 00000000 00:00 0 [stack]
7ffc5d3b0000-7ffc5d3b3000 r--p 00000000 00:00 0 [vvar]
7ffc5d3b3000-7ffc5d3b5000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

first line: at virtual addresses 0x400000-0x40b000
read, not write, execute, private
private = copy-on-write (if writeable)
starting at offset 0 of the file /bin/cat
device major number 8, device minor number 1, inode 48328831
more on what this means when we talk about filesystems

as if:

int fd = open("/bin/cat", O_RDONLY);
mmap((void *) 0x400000, 0xb000, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0x0);

third line (0x60b000-0x60c000): read/write, copy-on-write (private) mapping
(aside: probably used for global variables)

as if:

int fd = open("/bin/cat", O_RDONLY);
mmap((void *) 0x60b000, 0x1000, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0xb000);

[heap]: no corresponding file
allocated using sbrk(), but can get same effect with mmap() call

mapping with no file, as if:

mmap(..., 0x5000, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS /* = no file */, ...);
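To make the column meanings concrete, here is a small parser sketch for one line of this output. The struct and function names are invented for illustration, and it assumes 64-bit unsigned long and backing paths without spaces:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one /proc/<pid>/maps line. */
struct map_entry {
    unsigned long start, end;      /* virtual address range [start, end) */
    char perms[5];                 /* e.g. "r-xp": read, no write, execute, private */
    unsigned long offset;          /* offset in the backing file */
    unsigned dev_major, dev_minor; /* device numbers (hexadecimal in the file) */
    unsigned long inode;           /* 0 if there is no backing file */
    char path[256];                /* empty if no name is shown */
};

/* Returns 1 on success, 0 if the line does not look like a maps entry. */
int parse_maps_line(const char *line, struct map_entry *e) {
    assert(line != NULL && e != NULL);
    e->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255s",
                   &e->start, &e->end, e->perms, &e->offset,
                   &e->dev_major, &e->dev_minor, &e->inode, e->path);
    return n >= 7;                 /* the path field is optional */
}
```

For the first line above this gives start 0x400000, end 0x40b000, permissions "r-xp", device 8:1 and inode 48328831.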


slide-52
SLIDE 52

mapped pages (no backing file)

virtual pages w/o backing file
page table (part)
data in memory
data on disk (if any): “swapped out”

access new page: page fault handler allocates on demand
need more memory? save page to disk, AKA “swap out”


slide-60
SLIDE 60

swapping with copy-on-write

virtual pages mapped to file
page table (part)
file data, cached in memory
file data on disk/SSD
copies of file data, modified

free up space by removing cached copies of file
need to free up more space? can move copied data to disk
“swapped out” modified data


slide-65
SLIDE 65

swapping

historical major use of virtual memory is supporting “swapping”
using disk (or SSD, …) as the next level of the memory hierarchy

process is allocated space on disk/SSD
memory is a cache for disk/SSD
only need to keep ‘currently active’ pages in physical memory

swapping ≈ mmap with “default” files to use

slide-66
SLIDE 66

HDDs/SSDs are slow

HDD reads and writes: milliseconds to tens of milliseconds

minimum size: 512 bytes
writing tens of kilobytes basically as fast as writing 512 bytes

SSD reads and writes: hundreds of microseconds

designed for writes/reads of kilobytes (not much smaller)

page fault handler is going to switch to another program


slide-69
SLIDE 69

the page cache

memory is a cache for disk
files and program memory have a place on disk

running low on memory? always have room on disk
assumption: disk space approximately infinite

physical memory pages: disk contents ‘temporarily’ kept in faster storage

possibly being used by one or more processes?
possibly part of a file on disk being read/written?
possibly both

goal: manage this cache intelligently


slide-73
SLIDE 73

page cache components [text]

mapping: virtual address or file+offset → physical page

handle cache hits

find backing location based on virtual address/file+offset

handle cache misses

track information about each physical page

handle page allocation
handle cache eviction

slide-74
SLIDE 74

page cache components

virtual address (used by program)
file + offset (for read()/write())
physical page (if cached)
disk location
page usage (recently used? etc.)

[diagram: arrows between the boxes above, labelled “page table” and “OS data structure”]

cache hit: OS lookup for read()/write(); CPU lookup in page table

cache miss: OS looks up location on disk

allocating a physical page
choose page that’s not being used much
might need to evict used page
requires removing pointers to it
need reverse mappings to find pointers to remove


slide-76
SLIDE 76

virtual addr/file offset to physical page

virtual address (used by program)
file + offset (for read()/write())
physical page (if cached)
disk location

virtual address → physical page: page table
for cache hit on memory access
structure determined by hardware!

file + offset → physical page: OS data structure
kernel data structure for cache hit on read/write (or page fault for mmap’d memory)
multiple designs; one idea: balanced tree
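A sketch of the file + offset lookup, with a sorted array and binary search standing in for the balanced tree. That swap is made only to keep the example short. The struct and function names are hypothetical, and 4KB pages are assumed:

```c
#include <stddef.h>

/* Hypothetical record: one cached page of one file. */
struct cached_page {
    unsigned long file_page;   /* page index within the file (offset / 4096) */
    unsigned long phys_page;   /* physical page number holding the data */
};

/* Binary search in an array sorted by file_page.
   Returns NULL on a cache miss (data must be fetched from disk). */
const struct cached_page *cache_lookup(const struct cached_page *pages, size_t n,
                                       unsigned long file_offset) {
    unsigned long key = file_offset / 4096;
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pages[mid].file_page == key) return &pages[mid];
        if (pages[mid].file_page < key) lo = mid + 1;
        else hi = mid;
    }
    return NULL;
}
```

A tree gives the same logarithmic lookup while also allowing cheap insertion and removal as pages are cached and evicted, which a sorted array does not.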


slide-79
SLIDE 79

Linux: forward mapping

process control block (task_struct)
mmap region info (vm_area_struct)
open file info (struct file)
file on disk info (struct inode)
cached physical pages for file (address_space)
page table

[diagram arrow labels: used to fill (for mmap); read()/write()]


slide-84
SLIDE 84

mapped pages (read/write, shared)

file data, cached in memory
file data on disk/SSD

slide-85
SLIDE 85

page replacement

step 1: evict a page to free a physical page
case 1: there’s an unused page, just use that (easy)
case 2: need to remove whatever is in that page (more work)

step 2: load new, more important page in its place
needs some way of knowing location of data


slide-88
SLIDE 88

virtual address/file offset → location on disk

virtual address (used by program)
file + offset (for read()/write())
physical page (if cached)
disk location

file + offset → disk location: OS data structure based on filesystem (later topic)

virtual address → disk location (Linux):
part of file: track mmap ‘regions’
swapped out non-file: trick: unused PTEs


slide-92
SLIDE 92

Linux maps: list of maps

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 08:01 48328831 /bin/cat
0060a000-0060b000 r--p 0000a000 08:01 48328831 /bin/cat
0060b000-0060c000 rw-p 0000b000 08:01 48328831 /bin/cat
01974000-01995000 rw-p 00000000 00:00 0 [heap]
7f60c718b000-7f60c7490000 r--p 00000000 08:01 77483660 /usr/lib/locale/locale-archive
7f60c7490000-7f60c764e000 r-xp 00000000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c764e000-7f60c784e000 ---p 001be000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c784e000-7f60c7852000 r--p 001be000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c7852000-7f60c7854000 rw-p 001c2000 08:01 96659129 /lib/x86_64-linux-gnu/libc-2.19.so
7f60c7854000-7f60c7859000 rw-p 00000000 00:00 0
7f60c7859000-7f60c787c000 r-xp 00000000 08:01 96659109 /lib/x86_64-linux-gnu/ld-2.19.so
7f60c7a39000-7f60c7a3b000 rw-p 00000000 00:00 0
7f60c7a7a000-7f60c7a7b000 rw-p 00000000 00:00 0
7f60c7a7b000-7f60c7a7c000 r--p 00022000 08:01 96659109 /lib/x86_64-linux-gnu/ld-2.19.so
7f60c7a7c000-7f60c7a7d000 rw-p 00023000 08:01 96659109 /lib/x86_64-linux-gnu/ld-2.19.so
7f60c7a7d000-7f60c7a7e000 rw-p 00000000 00:00 0
7ffc5d2b2000-7ffc5d2d3000 rw-p 00000000 00:00 0 [stack]
7ffc5d3b0000-7ffc5d3b3000 r--p 00000000 00:00 0 [vvar]
7ffc5d3b3000-7ffc5d3b5000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

PCB contains list of struct vm_area_struct with:

(shown in this output):
virtual address start, end
permissions
offset in backing file (if any)
pointer to backing file (if any)

(not shown):
info about sharing of non-file data (e.g. heap after fork)
…


slide-94
SLIDE 94

Linux: tracking swapped out pages

need to look up location on disk
potentially one location for every virtual page

trick: store location in “ignored” part of page table entry

instead of physical page #, permission bits, etc., store offset on disk
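One way the trick could look in code. The bit layout below is an assumption made for this sketch, not Linux's actual swap entry format. It relies only on the fact that when the present bit of an x86-64 PTE is 0, the hardware ignores the remaining bits:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pte_t;

#define PTE_PRESENT ((pte_t) 1)   /* bit 0: if clear, the CPU ignores the other bits */

/* Assumed layout for a not-present entry: bit 0 = 0, upper bits = swap slot + 1.
   The +1 keeps an all-zero entry meaning "never allocated". */
pte_t make_swap_pte(uint64_t swap_slot) { return (swap_slot + 1) << 1; }

int pte_is_present(pte_t pte) { return (pte & PTE_PRESENT) != 0; }

int pte_is_swapped(pte_t pte) { return !pte_is_present(pte) && pte != 0; }

/* Recover the disk location from a swapped-out entry. */
uint64_t swap_slot_of(pte_t pte) {
    assert(pte_is_swapped(pte));
    return (pte >> 1) - 1;
}
```

On a page fault, the handler can test pte_is_swapped() and, if it holds, read the page back from the slot returned by swap_slot_of().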


slide-96
SLIDE 96

evicting a page

remove victim page from page table, etc.

every page table it is referenced by
every list of file pages
…

if needed, save victim page to disk

going to require:
way to find page tables, etc. using page
way to detect whether it needs to be saved to disk


slide-99
SLIDE 99

tracking physical pages: finding mappings

want to evict a page? remove from page tables, etc.
need to track where every page is used!

common solution: structure for every physical page
with info about every cached file/page table using page
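A sketch of that common solution, assuming a small fixed limit on how many page table entries can share one physical page. The names are hypothetical and only loosely modelled on the idea behind Linux's struct page:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_MAPPERS = 4 };   /* arbitrary limit, chosen for the sketch */

/* Hypothetical per-physical-page record. */
struct phys_page {
    uint64_t *ptes[MAX_MAPPERS];  /* reverse mappings: PTEs that point at this page */
    size_t num_ptes;
};

/* Record that *pte maps this page; returns 0 on success, -1 if the list is full. */
int page_add_mapping(struct phys_page *pg, uint64_t *pte) {
    assert(pg != NULL && pte != NULL);
    if (pg->num_ptes == MAX_MAPPERS) return -1;
    pg->ptes[pg->num_ptes++] = pte;
    return 0;
}

/* Evict: clear every PTE that refers to the page, so later accesses fault.
   Returns how many entries were cleared. */
size_t page_evict(struct phys_page *pg) {
    size_t cleared = pg->num_ptes;
    for (size_t i = 0; i < pg->num_ptes; i++)
        *pg->ptes[i] = 0;
    pg->num_ptes = 0;
    return cleared;
}
```

A real kernel would also need to flush stale TLB entries and write the page to disk first if it was modified; both are left out here.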


slide-100
SLIDE 100

Linux: reverse mapping (file pages)

process control block (task_struct)
mmap region info (vm_area_struct)
open file info (struct file)
file on disk info (struct inode)
cached physical pages for file (address_space)
page table
per-physical page info (struct page)

given page number, find references to that page (e.g. to remove/change them)


slide-102
SLIDE 102

backup slides


slide-103
SLIDE 103

fast copies

recall: fork() creates a copy of an entire program!
(usually, the copy then calls execve, which replaces it with another program)

how isn’t this really slow?

slide-104
SLIDE 104

do we really need a complete copy?

bash: Used by OS; Stack; Heap / other dynamic; Writable data; Code + Constants
new copy of bash: Used by OS; Stack; Heap / other dynamic; Writable data; Code + Constants

Code + Constants: shared as read-only
Stack, Heap / other dynamic, Writable data: can’t be shared?


slide-107
SLIDE 107

trick for extra sharing

sharing writeable data is fine, until either process modifies the copy
can we detect modifications?

trick: tell CPU (via page table) shared part is read-only
processor will trigger a fault when it’s written

slide-108
SLIDE 108

copy-on-write and page tables

VPN      valid?  write?  physical page    VPN      valid?  write?  physical page
…        …       …       …                …        …       …       …
0x00601  1       1       0x12345          0x00601  1       0       0x12345
0x00602  1       1       0x12347          0x00602  1       0       0x12347
0x00603  1       1       0x12340          0x00603  1       0       0x12340
0x00604  1       1       0x200DF          0x00604  1       0       0x200DF
0x00605  1       1       0x200AF          0x00605  1       0       0x200AF
…        …       …       …                …        …       …       …

copy operation actually duplicates page table
both processes share all physical pages
but marks pages in both copies as read-only

when either process tries to write read-only page
triggers a fault: OS actually copies the page

after allocating a copy, OS reruns the write instruction
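The steps above can be sketched as a toy simulation over five-entry page tables. The types and function names are invented for this example; a real kernel would also track how many tables share each physical page and copy the page contents:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { NUM_PAGES = 5 };

struct pte { bool valid, writable; uint32_t ppn; };

/* "fork": mark the parent's pages read-only, then duplicate the page table,
   so both tables point at the same physical pages. */
void cow_copy(struct pte *parent, struct pte *child) {
    for (int i = 0; i < NUM_PAGES; i++)
        parent[i].writable = false;
    memcpy(child, parent, NUM_PAGES * sizeof *parent);
}

/* Write-fault handler: point the faulting table at a private, writable page.
   new_ppn stands in for a freshly allocated page holding the copied data. */
void cow_write_fault(struct pte *table, int index, uint32_t new_ppn) {
    assert(index >= 0 && index < NUM_PAGES && table[index].valid);
    table[index].ppn = new_ppn;
    table[index].writable = true;
}
```

After cow_copy(), a write by the second process to VPN 0x00605 (index 4) would call cow_write_fault(child, 4, 0x300FD), leaving the first table's entry untouched.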


slide-109
SLIDE 109

copy-on-write and page tables

(after the copy: both page tables share the same physical pages, marked read-only)

VPN      valid?  write?  physical page    VPN      valid?  write?  physical page
…        …       …       …                …        …       …       …
0x00601  1       0       0x12345          0x00601  1       0       0x12345
0x00602  1       0       0x12347          0x00602  1       0       0x12347
0x00603  1       0       0x12340          0x00603  1       0       0x12340
0x00604  1       0       0x200DF          0x00604  1       0       0x200DF
0x00605  1       0       0x200AF          0x00605  1       0       0x200AF
…        …       …       …                …        …       …       …


slide-111
SLIDE 111

copy-on-write and page tables

(after a write to VPN 0x00605 through the second page table: that entry now points to a private, writable copy)

VPN      valid?  write?  physical page    VPN      valid?  write?  physical page
…        …       …       …                …        …       …       …
0x00601  1       0       0x12345          0x00601  1       0       0x12345
0x00602  1       0       0x12347          0x00602  1       0       0x12347
0x00603  1       0       0x12340          0x00603  1       0       0x12340
0x00604  1       0       0x200DF          0x00604  1       0       0x200DF
0x00605  1       0       0x200AF          0x00605  1       1       0x300FD
…        …       …       …                …        …       …       …

slide-112
SLIDE 112

sketch: implementing mmap

access mapped file for first time, read from disk

(like swapping when memory was swapped out)

write “mapped” memory, write to disk eventually

need to detect whether writes happened
usually hardware support: dirty bit

extra detail: other processes should see changes

all accesses to file use same physical memory
how? OS tracks copies of files in memory

slide-113
SLIDE 113

aside: Zipf model

working set model makes sense for programs
but not the only use of caches

example: Wikipedia, most popular articles

slide-114
SLIDE 114

Wikipedia page views for 1 hour

[plot: # views versus rank; rank axis from 10^0 to 10^6, views axis from 10^0 to 10^5]

NOTE: log-log scale

slide-115
SLIDE 115

Zipf distribution

Zipf distribution: straight line on log-log graph of rank v. count

a few items are much more popular than others

most caching benefit here

long tail: lots of items accessed a very small number of times

more cache is less efficient, but still does something
unlike the working set model, where extra cache beyond the working set gives no further benefit

slide-116
SLIDE 116

good caching strategy for Zipf

keep the most recently popular things, up till what you have room for

still benefit to caching things used 100 times/hour versus 1000

LRU is okay: popular things always recently used

seems to be what Wikipedia’s caches do?
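A small sketch of why LRU does reasonably well when one item is much more popular than the rest. The two-slot cache size and the function name are choices made for this example:

```c
#include <stddef.h>

enum { CACHE_SLOTS = 2 };   /* tiny cache, chosen so a short trace is interesting */

/* Returns the number of hits when the trace is run through an LRU cache. */
size_t lru_hits(const int *trace, size_t n) {
    int item[CACHE_SLOTS];
    size_t last_used[CACHE_SLOTS];
    size_t used = 0, hits = 0;
    for (size_t t = 0; t < n; t++) {
        size_t slot = used;                     /* "not found" marker */
        for (size_t i = 0; i < used; i++)
            if (item[i] == trace[t]) slot = i;
        if (slot < used) {
            hits++;                             /* already cached */
        } else if (used < CACHE_SLOTS) {
            slot = used++;                      /* a free slot is available */
            item[slot] = trace[t];
        } else {
            slot = 0;                           /* evict the least recently used */
            for (size_t i = 1; i < CACHE_SLOTS; i++)
                if (last_used[i] < last_used[slot]) slot = i;
            item[slot] = trace[t];
        }
        last_used[slot] = t;
    }
    return hits;
}
```

On the trace 1, 2, 1, 3, 1, 4, 1, 5 the popular item 1 stays cached after its first miss while the one-off items evict each other. A trace that cycles through three items with no favourite gets no hits from a two-slot cache.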


slide-118
SLIDE 118

alternative policies for Zipf

least frequently used

very simple policy
if pure Zipf distribution: what you want
practical problem: what about changes in popularity?

least frequently used + adjustments for ‘recentness’

more?

slide-119
SLIDE 119

models of reuse

working set/locality

active things are likely to be active soon
what’s popular changes over time
want: something like least-recently used

Zipf distribution

some things are just popular always
want: something like least-frequently used

other models?

when X is loaded, Y is always needed?

want: identify pairs of related values, load/discard together

some things are only used once

want: identify these, do not cache

slide-120
SLIDE 120

page cache versus processor cache

unlike processor cache, page cache…

stores multi-kilobyte blocks

add/remove whole 4+KB pages versus 64-128B blocks
smaller page tables; better for hard drives/SSDs

handles misses (get value if not cached) in software

OS data structures track data on disk/SSDs
hardware doesn’t know/care about them
hardware only knows how to invoke page fault handler

has no restrictions on where values are stored in cache

any physical page can be used for any virtual page
(processor caches have limited associativity)

73

slide-125
SLIDE 125

Linux: tracking memory regions

struct vm_area_struct {
    ...
    unsigned long vm_start; /* Our start address within vm_mm. */
    unsigned long vm_end;   /* The first byte after our end address within vm_mm. */
    ...
    pgprot_t vm_page_prot;  /* Access permissions of this VMA. */
    unsigned long vm_flags; /* Flags, see mm.h. */
    ...
    struct anon_vma *anon_vma; /* Serialized by page_table_lock */
    ...
    unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE units */
    struct file * vm_file;  /* File we map to (can be NULL). */
    ...
} __randomize_layout;

vm_start, vm_end: virtual addresses of mapping
mappings are part of sorted list/tree to allow finding by start/end address
vm_page_prot: permissions (read/write/execute)
vm_flags: private or shared? …
    private = copy-on-write
    shared = make changes to underlying file
anon_vma: for finding other uses of non-file pages, e.g. two copies after fork

process control block (task_struct)
sorted list of mmap’s (vm_area_structs)
open files (struct file)
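
To illustrate how these fields get used on a page fault, here is a user-space sketch that finds the region containing an address and then works out which page of the file backs it. It assumes the regions sit in a sorted array searched by binary search, where the kernel keeps a sorted list/tree; struct region, find_region, and file_page_index are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Hypothetical user-space stand-in for a few vm_area_struct fields. */
struct region {
    unsigned long start;   /* like vm_start */
    unsigned long end;     /* like vm_end: first byte after the region */
    unsigned long pgoff;   /* like vm_pgoff: file offset in pages */
};

/* Binary search over regions sorted by address and non-overlapping. */
const struct region *find_region(const struct region *r, size_t n,
                                 unsigned long addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < r[mid].start)
            hi = mid;
        else if (addr >= r[mid].end)
            lo = mid + 1;
        else
            return &r[mid];
    }
    return NULL;   /* nothing is mapped at addr */
}

/* Which page of the mapped file backs `addr` within region `r`? */
unsigned long file_page_index(const struct region *r, unsigned long addr)
{
    assert(addr >= r->start && addr < r->end);
    return r->pgoff + (addr - r->start) / PAGE_SIZE;
}
```

For a region starting at 0x700000 with a page offset of 16, a fault at 0x702010 falls in the third page of the region, so it is backed by file page 18.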

75

slide-131
SLIDE 131

Linux: tracking files in memory

struct file {
    ...
    struct inode *f_inode;
    ...
};
...
struct inode {
    ...
    struct address_space i_data;
    ...
};
...
struct address_space {
    ...
    struct radix_tree_root i_pages;  /* cached pages */
    atomic_t i_mmap_writable;        /* count VM_SHARED mappings */
    struct rb_root_cached i_mmap;    /* tree of private and shared mappings */
    ...
};

process control block (task_struct)
open file info (struct file)
file on disk info (struct inode)
address_space
cached physical pages for file
mmap() virtual addresses for file

77

slide-133
SLIDE 133

Linux: reverse mapping (non-file pages)

process control block (task_struct)
mmap region info (vm_area_struct)
linked list of mmap regions (anon_vma)
page table
per-physical page info (struct page)
page number

given non-file page (heap, copied-on-write copy of file, etc.)
find references to that page (may be multiple because of fork, etc.)

78

slide-134
SLIDE 134

list of allocations per page

naive solution: separate list for each page?

a lot of overhead (many tens of bytes per 4K page?)

but, trick: many pages ‘copied’ at the same time (e.g. fork)
idea: share one list between all those pages

initially: list of one mmap region

on fork: add to existing list; create a new one
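
The sharing trick can be modeled in user space roughly as follows. This is a simplified sketch, not the kernel's anon_vma code: the fixed-size array and every name here (region_list, region_init, region_fork, page_allocated) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REGIONS 8   /* hypothetical fixed capacity */

struct region;          /* stand-in for an mmap region */

/* One list shared by every page first allocated in the same region. */
struct region_list {
    const struct region *regions[MAX_REGIONS];
    size_t count;
};

struct region {
    struct region_list *own_list;   /* list for pages this region allocates */
};

/* Per-physical-page info: one pointer, not a whole list. */
struct page_info {
    struct region_list *users;
};

static void list_add(struct region_list *l, const struct region *r)
{
    assert(l->count < MAX_REGIONS);
    l->regions[l->count++] = r;
}

/* initially: list of one mmap region */
void region_init(struct region *r, struct region_list *fresh)
{
    fresh->count = 0;
    list_add(fresh, r);
    r->own_list = fresh;
}

/* on fork: add child to the parent's existing list; create a new list */
void region_fork(const struct region *parent, struct region *child,
                 struct region_list *fresh)
{
    list_add(parent->own_list, child);
    region_init(child, fresh);
}

/* A page allocated on behalf of `r` just points at r's list. */
void page_allocated(struct page_info *p, const struct region *r)
{
    p->users = r->own_list;
}
```

In this model a shared list can name a region that does not actually map a given page (for example, a parent page allocated after the fork), so a lookup would still have to check each listed region's page table.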

79

slide-135
SLIDE 135

Linux: physical page → file → PTE

Linux tracking where file pages are in page tables:

struct page {
    ...
    struct address_space *mapping;
    pgoff_t index;                   /* Our offset within mapping. */
    ...
};
struct address_space {
    ...
    struct rb_root_cached i_mmap;    /* tree of private and shared mappings */
    ...
};

tree of mappings lets us find vm_area_structs and PTEs
rather complicated lookup (but writing to disk is already slow)
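
The core arithmetic of that lookup, as a sketch: given a file page's index and one candidate mapping of the file, decide whether the mapping covers that page and, if so, at which virtual address. struct file_mapping and region_maps_page are hypothetical stand-ins; the kernel would use the i_mmap tree to find candidate mappings rather than testing them one by one.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Hypothetical stand-in for the vm_area_struct fields used here. */
struct file_mapping {
    unsigned long start;   /* like vm_start */
    unsigned long end;     /* like vm_end (exclusive) */
    unsigned long pgoff;   /* like vm_pgoff: first file page mapped */
};

/*
 * If mapping `m` covers file page `index`, store the virtual address of
 * that page in *vaddr and return 1; otherwise return 0.
 */
int region_maps_page(const struct file_mapping *m, unsigned long index,
                     unsigned long *vaddr)
{
    unsigned long pages;

    assert(m->start < m->end);
    pages = (m->end - m->start) / PAGE_SIZE;
    if (index < m->pgoff || index - m->pgoff >= pages)
        return 0;
    *vaddr = m->start + (index - m->pgoff) * PAGE_SIZE;
    return 1;
}
```

Once the virtual address is known, the PTE is found by walking the page table of the process that owns the mapping.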

80