RadixVM: Scalable address spaces for multithreaded applications - - PowerPoint PPT Presentation


slide-1
SLIDE 1

RadixVM: Scalable address spaces for multithreaded applications

Austin T. Clements

M. Frans Kaashoek

Nickolai Zeldovich

MIT CSAIL

slide-2
SLIDE 2

Parallel applications use VM intensively

[Diagram layers: Application, Hardware]

slide-3
SLIDE 3

Parallel applications use VM intensively

[Diagram layers: Application, Kernel (virtual memory system), Hardware]

Every popular operating system serializes basic VM operations like mmap and munmap.

slide-4
SLIDE 4

Parallel applications use VM intensively

[Diagram layers: Application (malloc, free), Runtime (garbage collector, memory manager), Kernel (virtual memory system; mmap, munmap), Hardware]

Every popular operating system serializes basic VM operations like mmap and munmap.

slide-5
SLIDE 5

Application performance suffers

[Figure: total throughput (jobs/hour, axis up to 1200) versus number of cores (1 to 80) for multithreaded MapReduce [Mao '10] on Linux 3.5.7]

slide-6
SLIDE 6

Inside parallel applications

Independent VM operations on non-overlapping regions.

slide-7
SLIDE 7

Inside parallel applications

Independent VM operations on non-overlapping regions. Common pattern for parallel applications.
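The pattern described on this slide can be sketched from user space. The example below is only an illustration under stated assumptions: it uses Python's standard `mmap` module to create anonymous mappings, and the names `worker` and `run` are hypothetical. Each thread maps, touches, and unmaps its own region, so no two threads ever operate on overlapping addresses.

```python
import mmap
import threading

PAGE = mmap.PAGESIZE

def worker(results, idx, rounds=50, pages=4):
    """Repeatedly map, touch, and unmap a private anonymous region."""
    touched = 0
    for _ in range(rounds):
        region = mmap.mmap(-1, pages * PAGE)      # anonymous mapping (mmap)
        for off in range(0, pages * PAGE, PAGE):
            region[off] = 1                       # write one byte in each page
            touched += 1
        region.close()                            # release the mapping (munmap)
    results[idx] = touched

def run(num_threads=4):
    """Run the workers concurrently; each works on its own regions only."""
    results = [0] * num_threads
    threads = [threading.Thread(target=worker, args=(results, i))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run(4)` returns the number of pages each thread touched. The operations are logically independent, which is why serializing them inside the kernel is unnecessary in principle.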

slide-8
SLIDE 8

Goal

Perfectly scalable mmap, munmap, and page fault operations on non-overlapping address space regions.

slide-9
SLIDE 9

Structure of a VM system

[Diagram: per-CPU TLBs and a hardware page table, each a table of Virt/Phys entries (for example 8a4bd to 00382, 87c38 to 0049c), plus a memory map of regions with s/r/w/x flags backed by /bin/ls and (anon)]

slide-10
SLIDE 10

Structure of a VM system

[Diagram: per-CPU TLBs, hardware page table, and memory map]

Memory map: tracks mapped regions and region metadata.

slide-11
SLIDE 11

Structure of a VM system

[Diagram: per-CPU TLBs, hardware page table, and memory map]

Hardware page table: shared by OS and hardware. Maps virtual to physical.

slide-12
SLIDE 12

Structure of a VM system

[Diagram: per-CPU TLBs, hardware page table, and memory map]

Per-CPU TLB: caches page table entries. Internal to CPU.

slide-13
SLIDE 13

Structure of a VM system

[Diagram: the application issues mmap, munmap, and load/store operations against the memory map, hardware page table, and per-CPU TLBs]

slide-14
SLIDE 14

Structure of a VM system

[Diagram: the application issues mmap, munmap, and load/store operations against the memory map, hardware page table, and per-CPU TLBs; the memory map gains a new (anon) region]

slide-15
SLIDE 15

Structure of a VM system

[Diagram: the application issues mmap, munmap, and load/store operations against the memory map, hardware page table, and per-CPU TLBs, with page faults and TLB misses marked]

slide-16
SLIDE 16

Structure of a VM system

[Diagram: the application issues mmap, munmap, and load/store operations against the memory map, hardware page table, and per-CPU TLBs, with page faults and TLB misses marked]

slide-17
SLIDE 17

Structure of a VM system

[Diagram: the application issues mmap, munmap, and load/store operations against the memory map, hardware page table, and per-CPU TLBs, with page faults and TLB misses marked]

Seems reasonable. Why doesn't it scale?

slide-18
SLIDE 18

Structure of a VM system

[Diagram: the application issues mmap, munmap, and load/store operations against the memory map, hardware page table, and per-CPU TLBs, with page faults and TLB misses marked]

Seems reasonable. Why doesn't it scale?

Locking

slide-19
SLIDE 19

Structure of a VM system

[Diagram: the application issues mmap, munmap, and load/store operations against the memory map, hardware page table, and per-CPU TLBs, with page faults and TLB misses marked]

Seems reasonable. Why doesn't it scale?

Locking

Shootdown broadcast

slide-20
SLIDE 20

Structure of a VM system

[Diagram: the application issues mmap, munmap, and load/store operations against the memory map, hardware page table, and per-CPU TLBs, with page faults and TLB misses marked]

Seems reasonable. Why doesn't it scale?

Locking

Shootdown broadcast

Cache contention

slide-21
SLIDE 21

Structure of a VM system

[Diagram: the application issues mmap, munmap, and load/store operations against the memory map, hardware page table, and per-CPU TLBs, with page faults and TLB misses marked]

Seems reasonable. Why doesn't it scale?

Locking

Shootdown broadcast

Cache contention

Cross-core communication

slide-22
SLIDE 22

This talk: RadixVM

Concurrent memory map representation

Method of targeting TLB shootdowns

Scalable, space-efficient reference counting

To achieve perfectly scalable non-overlapping operations, we eliminate communication between such operations.

slide-23
SLIDE 23

Metadata management

Need to store OS-level memory mapping metadata

slide-24
SLIDE 24

Metadata management

Need to store OS-level memory mapping metadata

Popular operating systems use a balanced tree of region objects.

slide-25
SLIDE 25

Metadata management

Need to store OS-level memory mapping metadata

Popular operating systems use a balanced tree of region objects.

Memory-efficient

slide-26
SLIDE 26

Metadata management

Need to store OS-level memory mapping metadata

Popular operating systems use a balanced tree of region objects.

Memory-efficient

Unnecessary communication

slide-27
SLIDE 27

Metadata management

Need to store OS-level memory mapping metadata

Popular operating systems use a balanced tree of region objects.

Memory-efficient

Unnecessary communication

slide-28
SLIDE 28

Metadata management

Need to store OS-level memory mapping metadata

Popular operating systems use a balanced tree of region objects.

Memory-efficient

Unnecessary communication

~1000 cycles

slide-29
SLIDE 29

Metadata management

Need to store OS-level memory mapping metadata

Popular operating systems use a balanced tree of region objects.

Memory-efficient

Unnecessary communication

~1000 cycles

Most potential data structures (skip lists, B-trees, etc.) induce communication between disjoint operations.

slide-30
SLIDE 30

Array-based memory map

slide-31
SLIDE 31

Array-based memory map

slide-32
SLIDE 32

Array-based memory map

2^35

slide-33
SLIDE 33

Array-based memory map

[Diagram: array of 2^35 entries with columns s, r, w, x, file; four (anon) entries followed by six /lib/libc entries]

slide-34
SLIDE 34

Array-based memory map

[Diagram: array of 2^35 entries with columns s, r, w, x, file; four (anon) entries followed by six /lib/libc entries]

Good: Operations on non-overlapping regions are concurrent and induce no communication.

slide-35
SLIDE 35

Array-based memory map

[Diagram: array of 2^35 entries with columns s, r, w, x, file; four (anon) entries followed by six /lib/libc entries]

Good: Operations on non-overlapping regions are concurrent and induce no communication.

Bad: Space use is obscene, and time is proportional to region size.

slide-36
SLIDE 36

Array-based memory map

[Diagram: array of 2^35 entries with columns s, r, w, x, file; four (anon) entries followed by six /lib/libc entries]

Good: Operations on non-overlapping regions are concurrent and induce no communication.

Bad: Space use is obscene, and time is proportional to region size.

How can we achieve good concurrency while keeping space and time under control?
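The trade-off of the array-based map can be made concrete with a small sketch. This is not the authors' implementation: `ArrayMemoryMap` and `PageMeta` are hypothetical names, and a tiny array stands in for the full per-page array. It shows why disjoint regions touch disjoint slots, and why the cost of each operation grows with the region size.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class PageMeta:
    perms: str   # e.g. "rw-" (illustrative encoding)
    file: str    # backing file, or "(anon)"

class ArrayMemoryMap:
    """One metadata slot per virtual page (hypothetical sketch)."""

    def __init__(self, num_pages: int):
        self.slots: List[Optional[PageMeta]] = [None] * num_pages

    def mmap(self, start: int, npages: int, meta: PageMeta) -> int:
        # Writes only the slots in [start, start + npages): disjoint
        # regions never write the same slot.
        for page in range(start, start + npages):
            self.slots[page] = meta
        return npages   # slots written: time proportional to region size

    def munmap(self, start: int, npages: int) -> int:
        for page in range(start, start + npages):
            self.slots[page] = None
        return npages

    def lookup(self, page: int) -> Optional[PageMeta]:
        return self.slots[page]
```

The space cost is one slot for every page of the address space whether or not it is mapped, which is the problem the radix tree is meant to address.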

slide-37
SLIDE 37

Radix tree

[Diagram: array of 2^35 entries with columns s, r, w, x, file; four (anon) entries followed by six /lib/libc entries]

Solution: Range-oriented radix tree

slide-38
SLIDE 38

Radix tree

[Diagram: radix tree being built over the per-page entries (columns s, r, w, x, file): four (anon) entries followed by six /lib/libc entries]

Solution: Range-oriented radix tree

slide-39
SLIDE 39

Radix tree

[Diagram: radix tree being built over the per-page entries (columns s, r, w, x, file): four (anon) entries followed by six /lib/libc entries]

Solution: Range-oriented radix tree

slide-40
SLIDE 40

Radix tree

[Diagram: radix tree being built over the per-page entries (columns s, r, w, x, file): four (anon) entries followed by six /lib/libc entries]

Solution: Range-oriented radix tree

slide-41
SLIDE 41

Radix tree

[Diagram: radix tree over the per-page entries (columns s, r, w, x, file): four (anon) entries followed by six /lib/libc entries]

Solution: Range-oriented radix tree

Fold constant-valued chunks into parent, recursively.

slide-42
SLIDE 42

Radix tree

[Diagram: radix tree over the per-page entries (columns s, r, w, x, file): four (anon) entries followed by six /lib/libc entries]

Solution: Range-oriented radix tree

Fold constant-valued chunks into parent, recursively.

slide-43
SLIDE 43

Radix tree

[Diagram: radix tree after folding (columns s, r, w, x, file): four (anon) leaf entries remain, and the /lib/libc entries appear as a single /lib/libc entry in the parent]

Solution: Range-oriented radix tree

Fold constant-valued chunks into parent, recursively.

slide-44
SLIDE 44

Radix tree

[Diagram: radix tree after folding (columns s, r, w, x, file): four (anon) leaf entries remain, and the /lib/libc entries appear as a single /lib/libc entry in the parent]

Solution: Range-oriented radix tree

Fold constant-valued chunks into parent, recursively.

2-3x the size of the balanced region tree

slide-45
SLIDE 45

Radix tree

[Diagram: radix tree after folding (columns s, r, w, x, file): four (anon) leaf entries remain, and the /lib/libc entries appear as a single /lib/libc entry in the parent]

Solution: Range-oriented radix tree

Fold constant-valued chunks into parent, recursively.

2-3x the size of the balanced region tree

We can achieve array-like concurrency with time and space similar to the balanced tree.
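The folding idea can be illustrated with a toy range-oriented radix tree. This is a minimal single-threaded sketch, not RadixVM's data structure: the class names, the fanout of 4, and the three levels are all assumptions chosen to keep the example small, and no locking is shown. A slot holds either a value that applies to the whole chunk it covers or a child node.

```python
class RadixNode:
    def __init__(self, fill, fanout):
        # Each slot is either a value covering the whole chunk, or a child node.
        self.slots = [fill] * fanout

class RangeRadixTree:
    """Toy range-oriented radix tree over page numbers (hypothetical sketch)."""

    def __init__(self, fanout=4, levels=3):
        self.fanout = fanout
        self.size = fanout ** levels          # number of pages covered
        self.root = RadixNode(None, fanout)

    def set_range(self, lo, hi, value):
        """Map pages [lo, hi) to value (None means unmapped)."""
        self._set(self.root, 0, self.size, lo, hi, value)

    def _set(self, node, base, span, lo, hi, value):
        step = span // self.fanout
        for i in range(self.fanout):
            s, e = base + i * step, base + (i + 1) * step
            if e <= lo or hi <= s:
                continue                      # chunk not touched
            if lo <= s and e <= hi:
                node.slots[i] = value         # whole chunk is constant: keep it in the parent
                continue
            child = node.slots[i]
            if not isinstance(child, RadixNode):
                child = RadixNode(child, self.fanout)   # expand a folded chunk
                node.slots[i] = child
            self._set(child, s, step, lo, hi, value)
            first = child.slots[0]
            if all(not isinstance(x, RadixNode) and x == first for x in child.slots):
                node.slots[i] = first         # fold a constant-valued child back into the parent

    def lookup(self, page):
        node, base, span = self.root, 0, self.size
        while True:
            step = span // self.fanout
            i = (page - base) // step
            slot = node.slots[i]
            if not isinstance(slot, RadixNode):
                return slot
            node, base, span = slot, base + i * step, step

    def node_count(self):
        def count(n):
            return 1 + sum(count(c) for c in n.slots if isinstance(c, RadixNode))
        return count(self.root)
```

Mapping a large aligned range writes a single slot near the root, while a small range only expands the nodes on its own path, so the node count tracks the number of distinct regions and not the number of pages.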

slide-46
SLIDE 46

TLB shootdown

munmap must notify cores of changes to cached mappings

slide-47
SLIDE 47

TLB shootdown

munmap must notify cores of changes to cached mappings

Which cores have a mapping cached? Who knows?!

slide-48
SLIDE 48

TLB shootdown

munmap must notify cores of changes to cached mappings

Which cores have a mapping cached? Who knows?!

slide-49
SLIDE 49

TLB shootdown

munmap must notify cores of changes to cached mappings

Which cores have a mapping cached? Who knows?!

slide-50
SLIDE 50

TLB shootdown

munmap must notify cores of changes to cached mappings

Which cores have a mapping cached? Who knows?!

slide-51
SLIDE 51

TLB shootdown

munmap must notify cores of changes to cached mappings

Which cores have a mapping cached? Who knows?!

In the common case, there is little or no sharing.

slide-52
SLIDE 52

TLB tracking

A software-managed TLB would make this easy.

[Diagram: per-CPU TLBs, hardware page table, and memory map, with page faults and TLB misses marked and annotated with question marks]

slide-53
SLIDE 53

TLB tracking

A software-managed TLB would make this easy.

[Diagram: per-CPU TLBs, hardware page table, and memory map, with page faults and TLB misses marked and annotated with question marks]

Trap and track

slide-54
SLIDE 54

TLB tracking

A software-managed TLB would make this easy.

[Diagram: per-CPU TLBs, hardware page table, and memory map, with page faults and TLB misses marked and annotated with question marks]

Trap and track

slide-55
SLIDE 55

Soft TLBs, the hard way

Solution: Per-core page tables for precise TLB tracking

[Diagram: per-core page tables and TLBs with the memory map, with page faults and TLB misses marked]

slide-56
SLIDE 56

Soft TLBs, the hard way

Solution: Per-core page tables for precise TLB tracking

[Diagram: per-core page tables and TLBs with the memory map, with page faults and TLB misses marked]

slide-57
SLIDE 57

Soft TLBs, the hard way

Solution: Per-core page tables for precise TLB tracking

[Diagram: per-core page tables and TLBs with the memory map, with page faults and TLB misses marked]

Trap and track

slide-58
SLIDE 58

Soft TLBs, the hard way

Solution: Per-core page tables for precise TLB tracking

[Diagram: per-core page tables and TLBs with the memory map, with page faults and TLB misses marked]

Trap and track

TLB tracking allows us to target TLB shootdowns, eliminating unnecessary shootdown communication.
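A rough sketch of "trap and track" follows. It is an illustration under assumptions, not the RadixVM code: `TrackedAddressSpace` and its methods are hypothetical names, per-core page tables are modeled as dictionaries, and the set of target cores stands in for the shootdown interrupts that a kernel would send.

```python
from collections import defaultdict

class TrackedAddressSpace:
    """Per-core page tables plus a record of which cores faulted on each page."""

    def __init__(self, num_cores):
        self.page_tables = [dict() for _ in range(num_cores)]  # core -> {vpage: ppage}
        self.sharers = defaultdict(set)                        # vpage -> cores that faulted

    def page_fault(self, core, vpage, ppage):
        # The fault traps into the OS, which installs the mapping only in the
        # faulting core's page table and records that core.
        self.page_tables[core][vpage] = ppage
        self.sharers[vpage].add(core)

    def munmap(self, vpages):
        """Clear the mappings and return only the cores that need a shootdown."""
        targets = set()
        for vpage in vpages:
            for core in self.sharers.pop(vpage, set()):
                self.page_tables[core].pop(vpage, None)
                targets.add(core)
        return targets
```

Because a core can only have a translation cached if it first faulted the page into its own page table, the recorded set is a safe upper bound on the cores to notify. When a region was used by a single core, `munmap` returns just that core and no broadcast is needed.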

slide-59
SLIDE 59

Reference counting

Reference counting for physical pages and radix nodes

slide-60
SLIDE 60

Reference counting

Reference counting for physical pages and radix nodes

                      Scalable inc/dec
Shared counters       N

slide-61
SLIDE 61

Reference counting

Reference counting for physical pages and radix nodes

                      Scalable inc/dec   Space     Zero-detection cost
Shared counters       N                  O(1)      O(1)
Distributed counters  Y                  O(cpus)   O(objs*cpus)

slide-62
SLIDE 62

Reference counting

Reference counting for physical pages and radix nodes

                      Scalable inc/dec   Space     Zero-detection cost
Shared counters       N                  O(1)      O(1)
Distributed counters  Y                  O(cpus)   O(objs*cpus)
SNZIs [Ellen '07]     Mostly             O(cpus)   O(1)

slide-63
SLIDE 63

Reference counting

Reference counting for physical pages and radix nodes

                      Scalable inc/dec   Space     Zero-detection cost
Shared counters       N                  O(1)      O(1)
Distributed counters  Y                  O(cpus)   O(objs*cpus)
SNZIs [Ellen '07]     Mostly             O(cpus)   O(1)
Refcache              Y                  O(1)      O(1)

slide-64
SLIDE 64

Reference counting

Reference counting for physical pages and radix nodes

                      Scalable inc/dec   Space     Zero-detection cost   Immediate zero detection
Shared counters       N                  O(1)      O(1)                  Y
Distributed counters  Y                  O(cpus)   O(objs*cpus)          N
SNZIs [Ellen '07]     Mostly             O(cpus)   O(1)                  Y
Refcache              Y                  O(1)      O(1)                  N

slide-65
SLIDE 65

Refcache

Approach: Shared counters with per-core delta caches

slide-66
SLIDE 66

Refcache

Approach: Shared counters with per-core delta caches

[Diagram: Object A with global_count = 1 and Object B with global_count = 2]

Single counter per object

slide-67
SLIDE 67

Refcache

Approach: Shared counters with per-core delta caches

[Diagram: Object A with global_count = 1 and Object B with global_count = 2; CPUs 0 to 3 each hold a delta cache with columns Object, V, Delta]

Single counter per object

Caches changes, not values

slide-68
SLIDE 68

Refcache

Approach: Shared counters with per-core delta caches

[Diagram: Object A with global_count = 1 and Object B with global_count = 2; CPUs 0 to 3 each hold a delta cache with columns Object, V, Delta. Two CPUs run inc(A), and each caches an entry for A with V = 1 and delta +1]

Single counter per object

Caches changes, not values

Operations are local

slide-69
SLIDE 69

Refcache

Approach: Shared counters with per-core delta caches

[Diagram: Object A with global_count = 1 and Object B with global_count = 2; CPUs 0 to 3 each hold a delta cache with columns Object, V, Delta. Two CPUs run inc(A), and each caches an entry for A with V = 1 and delta +1. Another CPU runs dec(B) and caches an entry for B with V = 1 and delta -1]

Single counter per object

Caches changes, not values

Operations are local

slide-70
SLIDE 70

Refcache

Approach: Shared counters with per-core delta caches

[Diagram: Object A with global_count = 1 and Object B with global_count = 2; CPUs 0 to 3 each hold a delta cache with columns Object, V, Delta. Two CPUs run inc(A), and each caches an entry for A with V = 1 and delta +1. Another CPU runs dec(B) and caches an entry for B with V = 1 and delta -1]

Single counter per object

Caches changes, not values

Operations are local

True count = global count + ∑ cached deltas (generally unknown)
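The relationship between the global count, the cached deltas, and the true count can be sketched in a few lines. This is only an illustration: `Obj`, `Core`, and `true_count` are hypothetical names, and summing over every core's cache is exactly the cross-core work a real system wants to avoid, so `true_count` exists here purely to check the arithmetic.

```python
class Obj:
    def __init__(self, name, global_count):
        self.name = name
        self.global_count = global_count   # the single shared counter

class Core:
    def __init__(self):
        self.deltas = {}                   # object -> cached change (not a value)

    def inc(self, obj):
        self.deltas[obj] = self.deltas.get(obj, 0) + 1   # touches only local state

    def dec(self, obj):
        self.deltas[obj] = self.deltas.get(obj, 0) - 1   # touches only local state

    def flush(self):
        # Apply cached changes to the shared counters, then empty the cache.
        for obj, delta in self.deltas.items():
            obj.global_count += delta
        self.deltas.clear()

def true_count(obj, cores):
    """Global count plus every core's cached delta (for checking only)."""
    return obj.global_count + sum(core.deltas.get(obj, 0) for core in cores)
```

With A starting at 1 and B at 2, two `inc(A)` calls and one `dec(B)` leave both global counts unchanged while the true counts become 3 and 1. After every core flushes, the global counts catch up.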

slide-71
SLIDE 71

Refcache

When is the true count zero?

slide-72
SLIDE 72

Refcache

When is the true count zero?

Assumption: When the true count is zero, it will stay zero.

slide-73
SLIDE 73

Refcache

When is the true count zero?

Assumption: When the true count is zero, it will stay zero.

Divide time into epochs. Each epoch, all CPUs flush their delta caches. If an object's global count stays zero for a whole epoch, then its true count is zero.

[Timeline: time t on the horizontal axis]

slide-74
SLIDE 74

Refcache

When is the true count zero?

Assumption: When the true count is zero, it will stay zero.

Divide time into epochs. Each epoch, all CPUs flush their delta caches. If an object's global count stays zero for a whole epoch, then its true count is zero.

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3)]

slide-75
SLIDE 75

Refcache

When is the true count zero?

Assumption: When the true count is zero, it will stay zero.

Divide time into epochs. Each epoch, all CPUs flush their delta caches. If an object's global count stays zero for a whole epoch, then its true count is zero.

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3) plus a row for the global count]

slide-76
SLIDE 76

Refcache example

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3), plus rows for the true and global counts]

Initially: Global count is 1, no cached deltas (so true count is 1).

slide-77
SLIDE 77

Refcache example

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3), plus rows for the true and global counts]

Initially: Global count is 1, no cached deltas (so true count is 1).

slide-78
SLIDE 78

Refcache example

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3), plus rows for the true and global counts]

Initially: Global count is 1, no cached deltas (so true count is 1).

CPU 0 decrements and flushes; global count is now 0. What about true count?

slide-79
SLIDE 79

Refcache example

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3), plus rows for the true and global counts]

Initially: Global count is 1, no cached deltas (so true count is 1).

CPU 0 decrements and flushes; global count is now 0. What about true count?

CPU 1 increments after flush, before CPU 0's decrement.

slide-80
SLIDE 80

Refcache example

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3), plus rows for the true and global counts]

Initially: Global count is 1, no cached deltas (so true count is 1).

CPU 0 decrements and flushes; global count is now 0. What about true count?

CPU 1 increments after flush, before CPU 0's decrement.

The true count is the sum of everything up to right now.

slide-81
SLIDE 81

Refcache example

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3), plus rows for the true and global counts; flushed operations are shaded blue and still-cached operations orange]

Initially: Global count is 1, no cached deltas (so true count is 1).

CPU 0 decrements and flushes; global count is now 0. What about true count?

CPU 1 increments after flush, before CPU 0's decrement.

The true count is the sum of everything up to right now.

But the global count only reflects the blue region. Operations in the orange region are still cached.

slide-82
SLIDE 82

Refcache example

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3), plus rows for the true and global counts; flushed operations are shaded blue and still-cached operations orange]

Initially: Global count is 1, no cached deltas (so true count is 1).

CPU 0 decrements and flushes; global count is now 0. What about true count?

CPU 1 increments after flush, before CPU 0's decrement.

The true count is the sum of everything up to right now.

But the global count only reflects the blue region. Operations in the orange region are still cached.

slide-83
SLIDE 83

Refcache example

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3), plus rows for the true and global counts; flushed operations are shaded blue and still-cached operations orange]

Initially: Global count is 1, no cached deltas (so true count is 1).

CPU 0 decrements and flushes; global count is now 0. What about true count?

CPU 1 increments after flush, before CPU 0's decrement.

The true count is the sum of everything up to right now.

But the global count only reflects the blue region. Operations in the orange region are still cached.

Global count now reflects cached ops.

slide-84
SLIDE 84

Refcache example

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3), plus rows for the true and global counts; flushed operations are shaded blue and still-cached operations orange]

Initially: Global count is 1, no cached deltas (so true count is 1).

CPU 0 decrements and flushes; global count is now 0. What about true count?

CPU 1 increments after flush, before CPU 0's decrement.

The true count is the sum of everything up to right now.

But the global count only reflects the blue region. Operations in the orange region are still cached.

Global count now reflects cached ops.

Abort delete.

slide-85
SLIDE 85

Refcache example

[Timeline: time t on the horizontal axis, one row per CPU (0 to 3), plus rows for the true and global counts; flushed operations are shaded blue and still-cached operations orange]

Initially: Global count is 1, no cached deltas (so true count is 1).

CPU 0 decrements and flushes; global count is now 0. What about true count?

CPU 1 increments after flush, before CPU 0's decrement.

The true count is the sum of everything up to right now.

But the global count only reflects the blue region. Operations in the orange region are still cached.

Global count now reflects cached ops.

Abort delete.

Refcache enables time- and space-efficient scalable reference counting with minimal latency.
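The example on these slides can be replayed with a small single-threaded simulation. This is a simplified sketch under assumptions, not the authors' algorithm: the names are hypothetical, `end_epoch` flushes every core at once, and an object queued in one epoch is reviewed at the end of the next, so its global count must have stayed zero for a whole epoch before it is freed.

```python
class Obj:
    def __init__(self, count):
        self.global_count = count
        self.on_review = False   # waiting to see whether the count stays zero
        self.dirty = False       # count left zero and returned while waiting
        self.freed = False

class Refcache:
    """Simplified simulation of epoch-based zero detection (hypothetical sketch)."""

    def __init__(self, ncores):
        self.deltas = [dict() for _ in range(ncores)]   # per-core delta caches
        self.review = []                                 # (object, epoch when queued)
        self.epoch = 0

    def inc(self, core, obj):
        self.deltas[core][obj] = self.deltas[core].get(obj, 0) + 1

    def dec(self, core, obj):
        self.deltas[core][obj] = self.deltas[core].get(obj, 0) - 1

    def flush(self, core):
        for obj, delta in self.deltas[core].items():
            if delta == 0:
                continue
            obj.global_count += delta
            if obj.global_count == 0:
                if obj.on_review:
                    obj.dirty = True                     # zero again: restart the wait later
                else:
                    obj.on_review, obj.dirty = True, False
                    self.review.append((obj, self.epoch))
        self.deltas[core].clear()

    def end_epoch(self):
        for core in range(len(self.deltas)):
            self.flush(core)                             # every core flushes each epoch
        waiting = []
        for obj, queued in self.review:
            if queued == self.epoch:
                waiting.append((obj, queued))            # not yet zero for a whole epoch
            elif obj.global_count != 0:
                obj.on_review = False                    # abort delete
            elif obj.dirty:
                obj.dirty = False
                waiting.append((obj, self.epoch))        # wait one more epoch
            else:
                obj.freed = True                         # stayed zero: true count is zero
        self.review = waiting
        self.epoch += 1
```

Replaying the slides: with the global count at 1, `inc(1, obj)` followed by `dec(0, obj)` and `flush(0)` drives the global count to 0 while the true count is still 1. The next `end_epoch()` flushes CPU 1's cached increment, and the review one epoch later sees a nonzero count and aborts the delete.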

slide-86
SLIDE 86

Bringing it all together

[Diagram: per-core page tables (one per core), the radix tree memory map (columns s, r, w, x, file, holding (anon) entries), and reference-counted physical pages]

slide-87
SLIDE 87

Bringing it all together

[Diagram: per-core page tables (one per core), the radix tree memory map (columns s, r, w, x, file, holding (anon) entries), and reference-counted physical pages; a second group of (anon) entries appears in the memory map]

mmap

slide-88
SLIDE 88

Bringing it all together

[Diagram: per-core page tables (one per core), the radix tree memory map (columns s, r, w, x, file, holding two groups of (anon) entries), and reference-counted physical pages]

slide-89
SLIDE 89

Bringing it all together

[Diagram: per-core page tables (one per core), the radix tree memory map (columns s, r, w, x, file, holding two groups of (anon) entries), and reference-counted physical pages]

Page fault

slide-90
SLIDE 90

Bringing it all together

[Diagram: per-core page tables (one per core), the radix tree memory map (columns s, r, w, x, file, holding two groups of (anon) entries), and reference-counted physical pages]

Page fault

Install in local page table

Allocate backing page

Record faulting CPU

slide-91
SLIDE 91

Bringing it all together

[Diagram: per-core page tables (one per core), the radix tree memory map (columns s, r, w, x, file, holding two groups of (anon) entries), and reference-counted physical pages]

slide-92
SLIDE 92

Bringing it all together

[Diagram: cores, each with a per-core page table; radix tree memory map (entry fields: s r w x file, here "(anon)"); reference-counted physical pages]

munmap

  • Shootdown
  • Clear page table
  • Release backing pages
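The munmap steps on this slide can be sketched in the same toy style. The class and field names are hypothetical and the ordering of the steps is an assumption. The point illustrated is that recording which cores faulted on a page lets munmap send shootdowns only to those cores.

```python
class ToyAddressSpace:
    def __init__(self, ncores):
        self.memory_map = {}                        # vpn -> {"frame": int, "cpus": set}
        self.page_tables = [{} for _ in range(ncores)]
        self.refcount = {}                          # frame -> reference count
        self.interrupted = []                       # cores sent a shootdown, per munmap

    def touch(self, cpu, vpn, frame):
        """Simulate a page fault on `cpu` that maps `vpn` to `frame`."""
        meta = self.memory_map.get(vpn)
        if meta is None:
            meta = self.memory_map[vpn] = {"frame": frame, "cpus": set()}
            self.refcount[frame] = self.refcount.get(frame, 0) + 1
        meta["cpus"].add(cpu)
        self.page_tables[cpu][vpn] = meta["frame"]

    def munmap(self, vpn_start, npages):
        targets, frames = set(), []
        for vpn in range(vpn_start, vpn_start + npages):
            meta = self.memory_map.pop(vpn, None)
            if meta is None:
                continue
            for cpu in meta["cpus"]:
                del self.page_tables[cpu][vpn]      # clear page table
            targets |= meta["cpus"]
            frames.append(meta["frame"])
        self.interrupted.append(sorted(targets))    # shootdown: only recorded cores
        for frame in frames:                        # release backing pages
            self.refcount[frame] -= 1
            if self.refcount[frame] == 0:
                del self.refcount[frame]
        return targets


aspace = ToyAddressSpace(ncores=80)
aspace.touch(cpu=2, vpn=100, frame=7)
print(aspace.munmap(100, 1))    # set of cores that needed a shootdown
```

In this model a page touched by a single core is unmapped without interrupting any of the other 79 cores.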

slide-93
SLIDE 93

Bringing it all together


slide-94
SLIDE 94

Implementation

We built RadixVM in a custom research kernel. We believe RadixVM could be built in a mainstream kernel. All benchmarks are source-compatible with Linux.

slide-95
SLIDE 95

The other 99% is perspiration

  • Booting 80 cores (ACPI, x2APIC, IOMMU, oh my!)
  • NUMA-aware everything (memory allocation, per-CPU data, etc.)
  • Performance analysis tools (NMI profiling, PEBS, load latency profiling, statistics counters)
  • Hardware curve balls (false sharing, bad prefetch behavior, etc.)

All necessary for good results; all standard engineering.

slide-96
SLIDE 96

Evaluation

  • Does parallel mmap/munmap matter to applications?
  • Are all of RadixVM's components necessary for scalability?

slide-97
SLIDE 97

RadixVM improves application scalability

Metis multicore MapReduce [Mao '10], inverse indexing application

[Plot: total throughput (jobs/hour, up to 1200) vs. number of cores (1 to 80); series: RadixVM, Linux 3.5.7]

slide-98
SLIDE 98

RadixVM improves application scalability

Metis multicore MapReduce [Mao '10], inverse indexing application

[Plot: total throughput (jobs/hour, up to 1200) vs. number of cores (1 to 80); series: RadixVM, Linux 3.5.7]

Page fault lock contention

slide-99
SLIDE 99

RadixVM improves application scalability

Metis multicore MapReduce [Mao '10], inverse indexing application

[Plot: total throughput (jobs/hour, up to 1200) vs. number of cores (1 to 80); series: RadixVM, Linux 3.5.7]

  • Page fault lock contention
  • Pairwise sharing

slide-100
SLIDE 100

Radix trees avoid communication

  • lookup existing keys
  • insert/delete random keys
  • ~No communication
  • Linear scalability

[Plot: total throughput (lookups/sec, up to 350M) vs. number of cores (1 to 80; n/2 readers, n/2 writers); series: Radix tree, Lock-free skiplist]
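One way to see why a radix tree needs little communication is that the path to a key is determined by the key's bits alone, so operations on distant keys end up in different leaf nodes. The sketch below is only an illustration of that property: `ToyRadixTree`, the 9-bit fan-out, and the fixed depth of 4 are assumptions, and Python dictionaries say nothing about cache-line behavior on real hardware.

```python
BITS = 9        # assumed 512-way fan-out per level, as in x86-64 page tables
LEVELS = 4      # assumed fixed depth, covering 36-bit keys
MASK = (1 << BITS) - 1


class ToyRadixTree:
    def __init__(self):
        self.root = {}

    def _indices(self, key):
        # Index at each level, top level first.
        return [(key >> (BITS * level)) & MASK for level in reversed(range(LEVELS))]

    def leaf(self, key, create=False):
        """Return the leaf node that holds `key`, or None if it does not exist."""
        node = self.root
        for idx in self._indices(key)[:-1]:
            child = node.get(idx)
            if child is None:
                if not create:
                    return None
                child = node[idx] = {}
            node = child
        return node

    def insert(self, key, value):
        self.leaf(key, create=True)[key & MASK] = value

    def lookup(self, key):
        leaf = self.leaf(key)
        return None if leaf is None else leaf.get(key & MASK)

    def delete(self, key):
        leaf = self.leaf(key)
        if leaf is not None:
            leaf.pop(key & MASK, None)


tree = ToyRadixTree()
tree.insert(0, "a")
tree.insert(512, "b")
print(tree.leaf(0) is tree.leaf(512))   # keys 512 apart live in different leaves
```

Once the interior nodes exist, inserting or deleting a key writes only its own leaf, whereas a balanced tree or skiplist may restructure nodes that other operations are traversing.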

slide-101
SLIDE 101

Refcache avoids cache line sharing

map/unmap a shared physical page

[Plot: total map-unmap pairs/sec (up to 80M) vs. number of cores (1 to 80); series: Refcache, Shared counter]
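The contrast in this benchmark can be approximated with a toy comparison between a single shared counter and a counter that buffers per-core deltas. This is a simplified sketch under the assumption that per-core deltas are flushed to the global count only occasionally. It is not Refcache itself, which must also decide safely when a count has really reached zero. Both class names are hypothetical, and `shared_writes` is a stand-in for writes to a shared cache line.

```python
class SharedCounter:
    """Every inc/dec writes the one shared location."""
    def __init__(self):
        self.count = 0
        self.shared_writes = 0

    def inc(self, cpu):
        self.count += 1
        self.shared_writes += 1

    def dec(self, cpu):
        self.count -= 1
        self.shared_writes += 1


class DeltaCachedCounter:
    """inc/dec touch only a per-core delta; flush() applies nonzero deltas."""
    def __init__(self, ncores):
        self.global_count = 0
        self.deltas = [0] * ncores
        self.shared_writes = 0

    def inc(self, cpu):
        self.deltas[cpu] += 1

    def dec(self, cpu):
        self.deltas[cpu] -= 1

    def flush(self):
        for cpu, delta in enumerate(self.deltas):
            if delta:
                self.global_count += delta
                self.shared_writes += 1
                self.deltas[cpu] = 0
        return self.global_count


shared, cached = SharedCounter(), DeltaCachedCounter(ncores=4)
for cpu in range(4):
    for _ in range(1000):               # 1000 map/unmap pairs per core
        shared.inc(cpu); shared.dec(cpu)
        cached.inc(cpu); cached.dec(cpu)
cached.flush()
print(shared.shared_writes, cached.shared_writes)
```

Balanced map/unmap pairs cancel in the per-core delta before any flush, so in this model they never write the shared count at all.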

slide-102
SLIDE 102

Targeted TLB shootdown improves scalability

slide-103
SLIDE 103

Targeted TLB shootdown improves scalability

Core-local address space use

[Plot: total pages/sec (up to 12M) vs. number of cores (1 to 80); series: Per-core, Shared]

  • No TLB shootdowns

slide-104
SLIDE 104

Targeted TLB shootdown improves scalability

Core-local address space use

[Plot: total pages/sec (up to 12M) vs. number of cores (1 to 80); series: Per-core, Shared]

  • No TLB shootdowns

Global address space use

[Plot: total pages/sec (up to 600k) vs. number of cores (1 to 80); series: Per-core, Shared]

  • Page table contention
  • Per-core overhead
slide-105
SLIDE 105

Related work

Scalable VM systems

  • K42 [Krieger '06]
  • Corey [Boyd-Wickizer '08]
  • Bonsai [Clements '12]

Scalable reference counters

  • Modula-2+ local refs [DeTreville '90]
  • Distributed counters [Appavoo '07]
  • Scalable non-zero indicators [Ellen '07]
  • Sloppy counters [Boyd-Wickizer '10]
slide-106
SLIDE 106

Conclusion

  • Radix trees
  • Per-core page tables
  • Refcache

slide-107
SLIDE 107

Conclusion

  • Radix trees
  • Per-core page tables
  • Refcache

Perfect scalability for non-overlapping VM operations

slide-108
SLIDE 108

Conclusion

  • Radix trees
  • Per-core page tables
  • Refcache

Perfect scalability for non-overlapping VM operations

Check it out: http://pdos.csail.mit.edu/multicore