Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for Parallelism
Dimitrios Skarlatos, Apostolos Kokolis, Tianyin Xu, Josep Torrellas University of Illinois at Urbana-Champaign skarlat2.web.engr.illinois.edu ASPLOS 2020
2
Application: Issue LD VA 1
Core: TLB, L1 Cache, L2 Cache, L3 Cache
Main Memory: Page Tables (VA 1 → PA 4, VA 8 → PA 1), PA 4, PA 1
TLB Miss! → “Page Walk” = fetch the entry from the page table
Entry fetched into the TLB: VA1 → PA4
4
Main Memory
PA 4 PA 1
5
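Main Memory: PA 4, PA 1
Virtual Address (Address A):
bits 47 … 39 (9 bits): index into the PGD
bits 38 … 30 (9 bits): index into the PUD
bits 29 … 21 (9 bits): index into the PMD
bits 20 … 12 (9 bits): index into the PTE
bits 11 … 0: Page Offset
Walk: CR3 + PGD index → pgd; pgd + PUD index → pud; pud + PMD index → pmd; pmd + PTE index → pte
TLB Entry: VA1 → PA4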
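The bit layout above can be sketched in a few lines of Python. This is a minimal model, not the hardware walker: nested dicts stand in for the PGD/PUD/PMD/PTE tables, and the names `split_va`, `map_page`, and `walk` are hypothetical helpers introduced only for illustration. It assumes 48-bit virtual addresses and 4KB pages, as on the slide.

```python
# Hypothetical model of a 4-level radix page table walk (48-bit VA, 4KB pages).
SHIFTS = (39, 30, 21, 12)   # lowest bit of the pgd, pud, pmd, pte index fields
INDEX_MASK = 0x1FF          # each index field is 9 bits wide
OFFSET_MASK = 0xFFF         # 12-bit page offset


def split_va(va):
    """Return the four 9-bit table indices and the 12-bit page offset."""
    indices = tuple((va >> shift) & INDEX_MASK for shift in SHIFTS)
    return indices, va & OFFSET_MASK


def map_page(root, va, frame):
    """Install a 4KB mapping, creating intermediate tables as needed."""
    (i_pgd, i_pud, i_pmd, i_pte), _ = split_va(va)
    pte_table = root.setdefault(i_pgd, {}).setdefault(i_pud, {}).setdefault(i_pmd, {})
    pte_table[i_pte] = frame


def walk(root, va):
    """Return (physical_address, memory_accesses).

    A missing entry raises KeyError, which stands in for a page fault.
    """
    indices, offset = split_va(va)
    node = root                 # plays the role of the table CR3 points to
    accesses = 0
    for index in indices:       # each step needs the result of the previous one
        node = node[index]
        accesses += 1
    return (node << 12) | offset, accesses


root = {}
map_page(root, 0x7F12_3456_7000, frame=4)
physical_address, accesses = walk(root, 0x7F12_3456_7ABC)
print(hex(physical_address), accesses)
```

The loop in `walk` makes the dependency chain visible: the four accesses cannot overlap, because each index is applied to the table returned by the previous access.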
6
Application: Issue LD VA 1
Core: TLB, L1 Cache, L2 Cache, L3 Cache
Main Memory: Radix Page Tables (PGD, PUD, PMD, PTE), PA 4, PA 1
TLB Miss → “Page Walk” = fetch the entry from the radix page table
The walk fetches one entry per level, in sequence: pgd → pud → pmd → pte
10
Application: Issue LD VA 1
Core: TLB (now holds VA1 → PA4), L1 Cache, L2 Cache, L3 Cache
Main Memory: Page Tables (VA 1 → PA 4, VA 8 → PA 1), PA 4, PA 1
Application
Core: L1 TLB, L2 TLB, MMU Cache, L1 Cache, L2 Cache, L3 Cache
Main Memory: Radix Page Tables (PGD, PUD, PMD, PTE), PA 4, PA 1
MMU Cache: caches upper-level entries (pgd, pud, pmd)
Non-Volatile Memory Technology
Sunny Cove introduces 5-Level Radix Page Tables!!
15
→ parallel single-step lookups
16
The old approach from Intel and IBM
17
Global Hash Table: VA 1 → H, VA 9 → H, VA 6 → H; entries carry a Tag
Shared by Application A and Application B
COLLISIONS: OS is invoked to resolve them!
PAGE SHARING, PAGE SIZES: How to share pages? Multiple page sizes? New level of indirection!!
The old approach from Intel and IBM: switched to radix page tables!
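A toy model can make the collision problem of a single global hashed table concrete. Everything in this sketch is an assumption made for illustration: the hash function, the table size, the use of an address-space identifier (`asid`) in the tag, and resolving collisions by chaining. The slides only say that the OS is invoked on a collision.

```python
# Hypothetical sketch of one global hashed page table shared by all processes.
NUM_SLOTS = 8


def slot_of(asid, vpn):
    # Toy hash function chosen for illustration only.
    return (vpn * 31 + asid * 17) % NUM_SLOTS


class GlobalHashedPageTable:
    def __init__(self):
        self.slots = [[] for _ in range(NUM_SLOTS)]   # each slot is a collision chain
        self.collisions = 0

    def insert(self, asid, vpn, frame):
        chain = self.slots[slot_of(asid, vpn)]
        if chain:
            self.collisions += 1      # stands in for invoking the OS
        chain.append(((asid, vpn), frame))   # the (asid, vpn) pair is the tag

    def lookup(self, asid, vpn):
        """Return (frame, tag_comparisons); frame is None when not mapped."""
        chain = self.slots[slot_of(asid, vpn)]
        for comparisons, (tag, frame) in enumerate(chain, start=1):
            if tag == (asid, vpn):
                return frame, comparisons
        return None, len(chain)


table = GlobalHashedPageTable()
table.insert(asid=0, vpn=1, frame=4)   # application A, VA 1
table.insert(asid=0, vpn=9, frame=1)   # application A, VA 9: same slot as VA 1
table.insert(asid=1, vpn=6, frame=7)   # application B, VA 6
print(table.collisions, table.lookup(0, 9))
```

With this toy hash, VA 1 and VA 9 land in the same slot, so the second lookup needs two tag comparisons instead of one.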
Rethinking virtual memory translation for parallelism
23
24
d-ary Cuckoo Hash Table: ways T1, T2, T3 with hash functions H1, H2, H3
Table holds a, b, c, d, f, g
Each element (for example d) has one candidate slot per way: H1 in T1, H2 in T2, H3 in T3
Insert e: H1(e) points to the slot holding f → e takes the slot, f is evicted
Rehash f: H3(f) points to the slot holding b → f takes the slot, b is evicted
Rehash b: H2(b) points to an empty slot → b is placed
Table now holds a, b, c, d, e, f, g
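The insert-and-evict sequence above can be sketched as a small Python class. This is a minimal sketch under stated assumptions: the class name, the CRC32-based per-way hash, the random choice of the next way, and the `max_kicks` bound are all illustrative choices, not details taken from the slides.

```python
import random
import zlib


class CuckooHashTable:
    """Hypothetical d-ary cuckoo hash table: d ways, one hash function per way."""

    def __init__(self, d=3, slots_per_way=8, max_kicks=32, seed=0):
        assert d >= 2, "eviction needs at least two ways"
        self.d = d
        self.size = slots_per_way
        self.max_kicks = max_kicks
        self.ways = [[None] * slots_per_way for _ in range(d)]
        self.rng = random.Random(seed)

    def _hash(self, way, key):
        # Illustrative per-way hash (H1..Hd); CRC32 keeps runs reproducible.
        return zlib.crc32(f"{way}:{key}".encode()) % self.size

    def lookup(self, key):
        """Probe one slot per way: at most d independent accesses."""
        for way in range(self.d):
            entry = self.ways[way][self._hash(way, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        """Insert a key that is not yet present.

        Returns None on success, or the (key, value) pair left without a
        slot after max_kicks evictions (a caller would then resize).
        """
        item = (key, value)
        way = self.rng.randrange(self.d)
        for _ in range(self.max_kicks):
            idx = self._hash(way, item[0])
            # Take the slot; whatever was there becomes the item to place.
            item, self.ways[way][idx] = self.ways[way][idx], item
            if item is None:
                return None
            # The evicted element is rehashed into a different way.
            way = (way + 1 + self.rng.randrange(self.d - 1)) % self.d
        return item


table = CuckooHashTable()
leftover = [table.insert(key, key.upper()) for key in "abcdefg"]
print(table.lookup("e"), table.lookup("z"))
```

The d probes in `lookup` do not depend on each other, which is the property the rest of the deck builds on.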
31
d-ary Cuckoo Hash Table: a, b, c, d, e, f, g in T1, T2, T3
COLLISIONS PAGE SHARING PAGE SIZES
PRIVATE PAGE TABLES
33
PRIVATE PAGE TABLES
COLLISIONS PAGE SHARING PAGE SIZES
Main Memory: Page Tables A (App A), Page Tables B (App B)
Private page tables cannot be too big
Need to dynamically resize: entries (VA 1 → PA 4, VA 8 → PA 1) move from Page Tables A to New Page Tables A
While the program is running: Gradual Resizing!
39
Old d-ary Cuckoo Hash Table (T1, T2, T3; hash functions H1, H2, H3): a, b, c, e, f, g, k, l
New d-ary Cuckoo Hash Table (T1’, T2’, T3’; hash functions H'1, H'2, H'3)
At every insert → rehash one element
Insert m: m goes into the new table (H'1)
Lookup m: probe H1, H2, H3 in the old table and H'1, H'2, H'3 in the new table
2 x d Lookups!
43
Old d-ary Cuckoo Hash Table (T1, T2, T3) and New d-ary Cuckoo Hash Table (T1’, T2’, T3’)
Rehashing Pointers P1, P2, P3: one per way of the old table; the slots a pointer has passed form the Migrated Region
Insert m: m goes into the new table (H'1)
b and k migrate from the old table to the new table
Old table now holds a, c, e, f, g, l; new table holds b, k, m
Lookup m: compute H1, H2, H3 and compare each with its pointer (H1 < P1? H2 < P2? H3 < P3?)
A way whose hash falls in the Migrated Region is probed in the new table (here H'1 and H'3); the remaining way is probed in the old table
Only need d lookups during resizing
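The rehashing-pointer rule can be sketched as follows. This is an illustrative model, not the authors' implementation: the class and method names, the CRC32-based hashes, the table sizes, and the policy of advancing one pointer by one slot per insert (even when that slot is empty) are assumptions of the sketch.

```python
import random
import zlib


class ElasticCuckooTable:
    """Hypothetical sketch of gradual resizing with one rehashing pointer per way."""

    def __init__(self, d=3, old_size=4, new_size=8, seed=0):
        assert d >= 2, "eviction needs at least two ways"
        self.d = d
        self.old = [[None] * old_size for _ in range(d)]   # T1..Td
        self.new = [[None] * new_size for _ in range(d)]   # T1'..Td'
        self.ptr = [0] * d        # P1..Pd: old slots below Pi are migrated
        self.unplaced = []        # items that ran out of evictions
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(way, key, size):
        # Illustrative hash; mixing in the size makes Hi and H'i differ.
        return zlib.crc32(f"{size}:{way}:{key}".encode()) % size

    def _slot(self, way, key):
        """Pick the single (table, index) that this way designates for key."""
        h = self._hash(way, key, len(self.old[way]))
        if h < self.ptr[way]:     # Hi(key) < Pi: migrated region, use H'i
            new_way = self.new[way]
            return new_way, self._hash(way, key, len(new_way))
        return self.old[way], h

    def lookup(self, key):
        """Exactly d probes, even while both tables are live."""
        for way in range(self.d):
            table, idx = self._slot(way, key)
            entry = table[idx]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def _place(self, item, way, max_kicks=64):
        for _ in range(max_kicks):
            table, idx = self._slot(way, item[0])
            item, table[idx] = table[idx], item
            if item is None:
                return
            way = (way + 1 + self.rng.randrange(self.d - 1)) % self.d
        self.unplaced.append(item)

    def resizing(self):
        return any(p < len(t) for p, t in zip(self.ptr, self.old))

    def rehash_one(self):
        """Migrate the old-table slot under one rehashing pointer."""
        live = [w for w in range(self.d) if self.ptr[w] < len(self.old[w])]
        if not live:
            return
        way = self.rng.choice(live)
        idx = self.ptr[way]
        item, self.old[way][idx] = self.old[way][idx], None
        self.ptr[way] += 1        # slot idx is now in the migrated region
        if item is not None:
            self._place(item, way)   # lands in the new table via H'i

    def insert(self, key, value):
        """Insert a new key; every insert also migrates one old-table slot."""
        self.rehash_one()
        self._place((key, value), self.rng.randrange(self.d))


table = ElasticCuckooTable()
for key in "abcefgklm":
    table.insert(key, key.upper())
print(table.ptr, table.lookup("m"))
```

Because `_slot` returns one location per way, `lookup` never probes both tables for the same way, which is how the 2 x d lookups of the naive scheme drop back to d.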
d-ary Elastic Cuckoo Hash Table (T1, T2, T3; H1, H2, H3) holding a, c, d, f, k
d-ary Elastic Cuckoo Page Table: the key is the VA and each element is a page table entry (pte a, pte c, pte d, pte f, pte k)
Lookup VAk: hash with H1, H2, H3 and probe T1, T2, T3
No sequential page walk (unlike radix)
At most d accesses always
Leverages multiple-issue processors
55
One elastic cuckoo page table per page size, each with ways T1, T2, T3 and hash functions H1, H2, H3
d-ary Elastic Cuckoo Page Table, 4KB PTE Entries: key VA4KB; holds pte a, pte c, pte d, pte f, pte k
Elastic Cuckoo Page Table, 2MB PMD Entries: key VA2MB; holds pmd a, pmd c, pmd d
Elastic Cuckoo Page Table, 1GB PUD Entries: key VA1GB; holds pud c, pud d, pud f
The VA is hashed for every page size: VA → VA4KB, VA2MB, VA1GB
MMU Cache
57
MMU Cache narrows the lookup to one page size: VA → VA2MB → Elastic Cuckoo Page Table, 2MB PMD Entries (T1, T2, T3; H1, H2, H3; pmd a, pmd c, pmd d)
MMU Cache narrows it further to one way: VA2MB → H2 → T2 → pmd a
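The effect of the MMU cache on the number of probes can be illustrated with a short sketch. The function name and the `size_hint` and `way_hint` parameters are hypothetical stand-ins for whatever the MMU cache supplies; only the page-offset widths (12, 21, and 30 bits for 4KB, 2MB, and 1GB pages) are fixed facts.

```python
PAGE_SHIFT = {"4KB": 12, "2MB": 21, "1GB": 30}   # page-offset bits per page size
D = 3                                            # ways per table (T1, T2, T3)


def candidate_probes(va, size_hint=None, way_hint=None):
    """Return the (page_size, way, vpn) probes needed to translate va.

    size_hint and way_hint stand in for what an MMU cache could supply;
    both are hypothetical parameters of this sketch. Ways are numbered
    from 0, so T2 is way 1.
    """
    sizes = [size_hint] if size_hint else list(PAGE_SHIFT)
    ways = [way_hint] if way_hint is not None else list(range(D))
    return [(size, way, va >> PAGE_SHIFT[size]) for size in sizes for way in ways]


va = 0x4020_0ABC
print(len(candidate_probes(va)))                           # every size, every way
print(candidate_probes(va, size_hint="2MB"))               # one table, every way
print(candidate_probes(va, size_hint="2MB", way_hint=1))   # one table, way T2
```

Without hints the walk issues one probe per way per page size; all of them are independent, so they can be sent in parallel. Each hint removes probes rather than adding a dependent step.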
[Figure: Speedup (y-axis, ticks 0.4 to 1.4) for BC, BFS, CC, DC, DFS, GUPS, MUMmer, PR, SSSP, SysBench, TC, and Mean. Series: Radix 4KB, Radix THP, Cuckoo 4KB, Cuckoo THP. Off-scale bar labels: 3.31 and 1.64 (shown once Radix THP is added), 3.45 and 1.82 (shown once Cuckoo THP is added).]
66
Elastic Cuckoo Page Tables over Radix: speedup of 3-28% (only 4KB pages) and 3-18% (+ huge pages)
[Figure: Time in Translations (y-axis, ticks 0.2 to 1) for BC, BFS, CC, DC, DFS, GUPS, MUMmer, PR, SSSP, SysBench, TC, and Mean. Series: Radix 4KB, Radix 2MB, Cuckoo 4KB, Cuckoo THP.]
68
Elastic Cuckoo Page Tables Reduce Time Spent in Translation by 41% on Average
69