Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation

Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for Parallelism
Dimitrios Skarlatos, Apostolos Kokolis, Tianyin Xu, Josep Torrellas
University of Illinois at Urbana-Champaign
skarlat2.web.engr.illinois.edu
ASPLOS 2020


SLIDE 1

Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for Parallelism

Dimitrios Skarlatos, Apostolos Kokolis, Tianyin Xu, Josep Torrellas
University of Illinois at Urbana-Champaign
skarlat2.web.engr.illinois.edu
ASPLOS 2020

SLIDE 2

Virtual Memory Translation is Expensive

[Figure: a core with a TLB and L1, L2, and L3 caches in front of main memory. Main memory holds the data at PA 1 and PA 4 and the page tables (VA 1 → PA 4, VA 8 → PA 1). The application issues LD VA 1 and the TLB misses.]

SLIDE 3

Virtual Memory Translation is Expensive

TLB miss → “Page Walk” = fetch entry from page table.

[Figure: same system as Slide 2. The page walk fetches the VA 1 → PA 4 entry from the page tables for the TLB.]

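SLIDE 4

x86-64 Radix Page Tables

[Figure: main memory holding the data at PA 1 and PA 4.]

SLIDE 5

x86-64 Radix Page Tables

Virtual address layout:

  Bits 47-39    Bits 38-30    Bits 29-21    Bits 20-12    Bits 11-0
  PGD index     PUD index     PMD index     PTE index     Page offset
  9 bits        9 bits        9 bits        9 bits        12 bits

[Figure: starting from CR3, each 9-bit index is added (+) to the base of the PGD, PUD, PMD, and PTE tables in turn, selecting the pgd, pud, pmd, and pte entries. The resulting translation (VA 1 → PA 4) becomes a TLB entry.]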
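The bit ranges on this slide can be turned into a few shifts and masks. The following is a minimal sketch, not code from the talk; the function name `split_va` and the sample address are made up for illustration.

```python
def split_va(va: int) -> dict:
    """Split a 48-bit x86-64 virtual address into its four 9-bit indices and offset."""
    mask9 = (1 << 9) - 1              # 9-bit field
    return {
        "pgd": (va >> 39) & mask9,    # bits 47-39
        "pud": (va >> 30) & mask9,    # bits 38-30
        "pmd": (va >> 21) & mask9,    # bits 29-21
        "pte": (va >> 12) & mask9,    # bits 20-12
        "offset": va & ((1 << 12) - 1),  # bits 11-0
    }


# Hypothetical address built so that each field is easy to recognize.
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
fields = split_va(va)
```

Each index selects one of 512 entries in the table of its level, which is why every level consumes exactly 9 bits.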

SLIDE 6

Virtual Memory Translation is Expensive

TLB miss → “Page Walk” = fetch entry from radix page table.

[Figure: the application issues LD VA 1 and misses in the TLB. The radix page tables (PGD, PUD, PMD, PTE) sit in main memory behind the L1, L2, and L3 caches. Walk step 1: the pgd entry is fetched.]

SLIDE 7

Virtual Memory Translation is Expensive

TLB miss → “Page Walk” = fetch entry from radix page table.

[Figure: same system. Walk step 2: after the pgd entry, the pud entry is fetched.]

SLIDE 8

Virtual Memory Translation is Expensive

TLB miss → “Page Walk” = fetch entry from radix page table.

[Figure: same system. Walk step 3: after the pud entry, the pmd entry is fetched.]

SLIDE 9

Virtual Memory Translation is Expensive

TLB miss → “Page Walk” = fetch entry from radix page table.

[Figure: same system. Walk step 4: after the pmd entry, the pte entry is fetched.]
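The four walk steps on Slides 6 to 9 are dependent: the address of each access comes from the entry returned by the previous one. A minimal sketch of that dependency chain, assuming a toy page table built from nested Python dictionaries (the structure, the names, and the frame number are illustrative, not taken from the talk):

```python
def radix_indices(va: int) -> list:
    """PGD, PUD, PMD, and PTE indices of a 48-bit virtual address."""
    return [(va >> shift) & 0x1FF for shift in (39, 30, 21, 12)]


def radix_walk(root: dict, va: int):
    """Return (frame, accesses). Raises KeyError if the address is unmapped."""
    table, accesses = root, 0
    for index in radix_indices(va):
        # The next table is only known once this entry has been read,
        # so the four accesses cannot be issued at the same time.
        table = table[index]
        accesses += 1
    return table, accesses


# Toy tree with a single mapping; frame number 4 is a made-up value.
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12)
root = {1: {2: {3: {4: 4}}}}
frame, accesses = radix_walk(root, va)
```

In this toy model a walk always costs four accesses, one per level, regardless of how small the mapped address space is.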

SLIDE 10

Virtual Memory Translation is Expensive

[Figure: the walk is complete. The TLB now holds the entry VA 1 → PA 4, fetched from the page tables (VA 1 → PA 4, VA 8 → PA 1), for the access LD VA 1.]

SLIDE 11

Multilevel TLBs

[Figure: the core has an L1 TLB and an L2 TLB. The radix page tables (PGD, PUD, PMD, PTE) with their pgd, pud, pmd, and pte entries sit in main memory behind the L1, L2, and L3 caches.]

SLIDE 12

Memory Management Unit (MMU) Cache

[Figure: same system as Slide 11, with an MMU cache added next to the L1 and L2 TLBs.]

SLIDE 13

Translations in Data Caches

[Figure: same system as Slide 12. Copies of pgd, pud, pmd, and pte entries also reside in the L1, L2, and L3 data caches.]

SLIDE 14

NVM will Make the Problem Worse

Non-Volatile Memory Technology
Sunny Cove introduces 5-Level Radix Page Tables!!

[Figure: same system as Slide 13, with main memory labeled as non-volatile memory technology.]

SLIDE 15

Contribution: Elastic Cuckoo Page Tables

  • Rethinking virtual memory translation for parallelism
  • Idea: Dynamically resizable page tables based on cuckoo hashing
  • No sequential page table lookups → parallel single-step lookups
  • Application speedup over state-of-the-art:
      • 3-28% with 4KB pages
      • 3-18% with huge pages

SLIDE 16

Alternative: A Global Hashed Page Table

SLIDE 17

Alternative: A Global Hashed Page Table

The old approach from Intel and IBM

[Figure: an application's VA 1 and VA 9 are each hashed (H) into a single global hash table of tagged entries and collide.]

COLLISIONS: OS is invoked to resolve them!

SLIDE 18

Alternative: A Global Hashed Page Table

The old approach from Intel and IBM

[Figure: Application A (VA 1, VA 9) and Application B (VA 6) hash (H) into the same global hash table of tagged entries.]

How to share pages? Multiple page sizes?

SLIDE 19

Alternative: A Global Hashed Page Table

The old approach from Intel and IBM

[Figure: same as Slide 18.]

How to share pages? Multiple page sizes? New level of indirection!!

SLIDE 20

Alternative: A Global Hashed Page Table

The old approach from Intel and IBM

[Figure: same as Slide 18, with an additional tagged entry reached through the new level of indirection.]

How to share pages? Multiple page sizes? New level of indirection!!

SLIDE 21

Alternative: A Global Hashed Page Table

The old approach from Intel and IBM

[Figure: same as Slide 20.]

Problems: COLLISIONS, PAGE SHARING, PAGE SIZES

SLIDE 22

Alternative: A Global Hashed Page Table

The old approach from Intel and IBM. Switched to radix page tables!

[Figure: same global hash table shared by Application A (VA 1, VA 9) and Application B (VA 6).]

DEAD END

SLIDE 23

Elastic Cuckoo Page Tables

Rethinking virtual memory translation for parallelism

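SLIDE 24

Cuckoo Hashing [Pagh 2001, Fotakis 2005]

[Figure: a d-ary cuckoo hash table with three ways (T1, T2, T3) holding elements a, b, c, d, f, and g. An element is hashed with H1, H2, and H3 to one candidate slot per way.]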

SLIDE 25

Insertions with Cuckoo Hashing

[Figure: d-ary cuckoo hash table (T1, T2, T3) holding a, b, c, d, f, g. Element e is inserted: H1 maps it to the slot occupied by f.]

SLIDE 26

Insertions with Cuckoo Hashing

[Figure: e takes the slot; f is evicted. The table holds a, b, c, d, e, g.]

SLIDE 27

Insertions with Cuckoo Hashing

[Figure: f is rehashed with H3 to the slot occupied by b.]

SLIDE 28

Insertions with Cuckoo Hashing

[Figure: f takes the slot; b is evicted. The table holds a, c, d, e, f, g.]

SLIDE 29

Insertions with Cuckoo Hashing

[Figure: b is rehashed with H2.]

SLIDE 30

Insertions with Cuckoo Hashing

[Figure: b is placed and the insertion completes. The table holds a, b, c, d, e, f, g.]
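The insertion sequence on these slides (insert, evict the occupant, rehash the evicted element into another way) can be sketched as follows. This is an illustrative sketch, not the authors' code: the class name, the CRC-based stand-in for H1 to H3, the random choice of way, and the `max_kicks` bound are all assumptions.

```python
import random
import zlib


class CuckooTable:
    """d-ary cuckoo hash table: every key has one candidate slot per way."""

    def __init__(self, d=3, size=16, max_kicks=100, seed=0):
        self.d, self.size, self.max_kicks = d, size, max_kicks
        self.ways = [[None] * size for _ in range(d)]
        self.rng = random.Random(seed)

    def _slot(self, way, key):
        # Stand-in for the hash function of this way (assumption).
        return zlib.crc32(f"{way}:{key}".encode()) % self.size

    def insert(self, key, value):
        """Insert a key that is not yet present. Returns False if it gives up."""
        entry, way = (key, value), self.rng.randrange(self.d)
        for _ in range(self.max_kicks):
            i = self._slot(way, entry[0])
            # Place the entry and pick up whatever occupied the slot.
            self.ways[way][i], entry = entry, self.ways[way][i]
            if entry is None:
                return True
            # An occupant was evicted: retry it in a different way.
            way = self.rng.choice([w for w in range(self.d) if w != way])
        return False  # the last evicted entry is dropped; a real table would resize

    def lookup(self, key):
        """Check the d candidate slots; they are independent of each other."""
        for way in range(self.d):
            entry = self.ways[way][self._slot(way, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None


table = CuckooTable()
inserted = [table.insert(k, k.upper()) for k in "abcdefg"]
```

A lookup never follows a chain: it inspects exactly one slot in each of the d ways.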

SLIDE 31

Insertions with Cuckoo Hashing

[Figure: d-ary cuckoo hash table (T1, T2, T3) holding a, b, c, d, e, f, g. Checklist: COLLISIONS, PAGE SHARING, PAGE SIZES.]

SLIDE 32

Private Hashed Page Tables

[Figure: same table. Checklist: COLLISIONS, PAGE SHARING, PAGE SIZES. Approach: PRIVATE PAGE TABLES.]

SLIDE 33

Cannot Be Too Big → Waste Memory

Private page tables cannot be too big.

[Figure: main memory holds Page Tables A for App A and Page Tables B for App B. Checklist: COLLISIONS, PAGE SHARING, PAGE SIZES. Approach: PRIVATE PAGE TABLES.]

SLIDE 34

Need to Dynamically Resize

Private page tables cannot be too big. Need to dynamically resize.

[Figure: Page Tables A in main memory hold VA 1 → PA 4 and VA 8 → PA 1 for App A. Checklist as on Slide 33.]

SLIDE 35

Need to Dynamically Resize

Private page tables cannot be too big. Need to dynamically resize.

[Figure: New Page Tables A are allocated in main memory next to Page Tables A, which still hold VA 1 → PA 4 and VA 8 → PA 1.]

SLIDE 36

Cannot Rehash All Entries at Once

Private page tables cannot be too big. Need to dynamically resize.

[Figure: the entries VA 1 → PA 4 and VA 8 → PA 1 move from Page Tables A to New Page Tables A.]

SLIDE 37

Cannot Rehash All Entries at Once

Private page tables cannot be too big. Need to dynamically resize.

[Figure: same as Slide 36.]

SLIDE 38

Cannot Rehash All Entries at Once

Private page tables cannot be too big. Need to dynamically resize while the program is running: gradual resizing!

[Figure: same as Slide 36.]

SLIDE 39

Gradual Resizing Cuckoo Hash Tables

At every insert → rehash one element.

[Figure: the old d-ary cuckoo hash table (T1, T2, T3) holds a, b, c, e, f, g, k, l. A new d-ary cuckoo hash table (T1’, T2’, T3’) has been allocated. Element m is being inserted.]

SLIDE 40

Gradual Resizing Cuckoo Hash Tables

At every insert → rehash one element.

[Figure: same tables. Element m is hashed with H'1 into the new table.]

SLIDE 41

Lookup During Gradual Resizing

[Figure: the old table (T1, T2, T3) holds a, b, c, e, f, g, k, l; the new table (T1’, T2’, T3’) holds m. A lookup of m probes the old table with H1, H2, H3 and the new table with H'1, H'2, H'3.]

SLIDE 42

Problem of Resizing: Double #Lookups

[Figure: same as Slide 41.]

2 × d lookups!
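The doubling follows directly from not knowing which table holds an element: all d slots of the old table and all d slots of the new table are candidates. A small sketch that enumerates them, with a CRC-based stand-in for the hash functions and made-up table sizes (both are assumptions for illustration):

```python
import zlib


def slot(way: int, key: str, size: int) -> int:
    """Stand-in for the hash function of one way of a table with `size` slots."""
    return zlib.crc32(f"{way}:{key}".encode()) % size


def probes_during_naive_resize(key, d, old_size, new_size):
    """Every (table, way, index) that must be checked while both tables are live."""
    old = [("old", way, slot(way, key, old_size)) for way in range(d)]
    new = [("new", way, slot(way, key, new_size)) for way in range(d)]
    return old + new


probes = probes_during_naive_resize("m", d=3, old_size=8, new_size=16)
```

With d = 3, the list has six entries: three for T1 to T3 and three for T1’ to T3’.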

SLIDE 43

Contribution: Elastic Cuckoo Hashing

Rehashing Pointers

[Figure: the old d-ary cuckoo hash table (T1, T2, T3) holds a, b, c, e, f, g, k, l. The new table (T1’, T2’, T3’) is empty. Each way of the old table has a rehashing pointer (P1, P2, P3). Element m is to be inserted.]

SLIDE 44

Elastic Cuckoo Migration

[Figure: same tables. The rehashing pointers P1, P2, P3 delimit the migrated region of each old way.]

SLIDE 45

Elastic Cuckoo Migration

[Figure: same tables. Element m is hashed with H'1 into the new table.]

SLIDE 46

Elastic Cuckoo Migration

[Figure: m is in the new table. Elements b and k have left the old table, which now holds a, c, e, f, g, l, and the migrated region has grown.]

SLIDE 47

Elastic Cuckoo Migration

[Figure: the old table holds a, c, e, f, g, l. The new table holds b, k, m.]

SLIDE 48

Elastic Cuckoo Lookup

[Figure: the old table (T1, T2, T3) holds a, c, e, f, g, l; the new table (T1’, T2’, T3’) holds b, k, m. To look up m, each hash value H1, H2, H3 is compared (<) with the rehashing pointer of its way (P1, P2, P3) to check whether it falls in the migrated region.]

SLIDE 49

Elastic Cuckoo Lookup

[Figure: same tables. For ways 1 and 3 the hash falls in the migrated region, so the new table is probed with H'1 and H'3.]

SLIDE 50

Elastic Cuckoo Lookup

[Figure: same as Slide 49.]

Only need d lookups during resizing.
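The pointer comparison on these slides can be sketched in code. The sketch below assumes the following invariant, which is an interpretation of the slides rather than a statement from them: within way i, an element lives in the old table if its old hash is at or beyond Pi, and in way i of the new table otherwise. Class and method names, the CRC-based hashes, and the table sizes are hypothetical.

```python
import random
import zlib


def h(way, key, size):
    """Stand-in for the per-way hash functions of a table with `size` slots."""
    return zlib.crc32(f"{way}:{key}".encode()) % size


class ElasticCuckoo:
    def __init__(self, d=3, old_size=8, new_size=16, seed=0):
        self.d, self.old_size, self.new_size = d, old_size, new_size
        self.old = [[None] * old_size for _ in range(d)]
        self.new = [[None] * new_size for _ in range(d)]
        self.ptr = [0] * d               # rehashing pointers, one per way
        self.rng = random.Random(seed)

    def _home(self, way, key):
        """The single slot of `way` where `key` may live."""
        i = h(way, key, self.old_size)
        if i < self.ptr[way]:            # migrated region: use the new table
            return self.new[way], h(way, key, self.new_size)
        return self.old[way], i          # live region: use the old table

    def _place(self, entry, way, max_kicks=100):
        for _ in range(max_kicks):
            slots, i = self._home(way, entry[0])
            slots[i], entry = entry, slots[i]
            if entry is None:
                return
            # An occupant was evicted: retry it in a different way.
            way = self.rng.choice([w for w in range(self.d) if w != way])
        raise RuntimeError("too many evictions; a larger table is needed")

    def insert(self, key, value):
        self._place((key, value), self.rng.randrange(self.d))

    def migrate_one(self, way):
        """Move the element under the pointer of `way` and advance the pointer."""
        p = self.ptr[way]
        if p == self.old_size:
            return
        entry, self.old[way][p] = self.old[way][p], None
        self.ptr[way] = p + 1
        if entry is not None:
            self._place(entry, way)      # its old hash is now below the pointer

    def lookup(self, key):
        """Return (value, probes); never more than d probes."""
        probes = 0
        for way in range(self.d):
            slots, i = self._home(way, key)
            probes += 1
            if slots[i] is not None and slots[i][0] == key:
                return slots[i][1], probes
        return None, probes
```

Because each way contributes exactly one candidate slot, chosen by comparing the old hash with the pointer, a lookup during resizing touches d slots instead of 2 × d.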

SLIDE 51

Exploiting Parallelism in Virtual Translation

[Figure: a d-ary elastic cuckoo hash table with ways T1, T2, T3 holding elements a, c, d, f, k, each associated with a page table entry (pte a, pte c, pte d, pte f, pte k). A key is hashed with H1, H2, H3.]

SLIDE 52

Exploiting Parallelism in Virtual Translation

[Figure: the same structure used as a d-ary elastic cuckoo page table. The VA is hashed with H1, H2, H3 to select one candidate entry per way.]

SLIDE 53

Exploiting Parallelism in Virtual Translation

[Figure: same page table. VA k is hashed with H1, H2, H3 and the three ways are probed for pte k.]

  • No sequential page walk (unlike radix)
  • At most d accesses always
  • Leverages multiple-issue out-of-order processors

SLIDE 54

Exploiting Parallelism in Virtual Translation

[Figure: same as Slide 53.]

  • Leverages multiple-issue out-of-order processors
  • No sequential page walk (unlike radix)
  • At most d accesses always
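A back-of-the-envelope model shows why independent probes matter. It is a deliberately idealized sketch: it assumes a uniform memory access latency, no caching of page table entries, and a memory system that can serve all d probes concurrently. The 100-cycle latency is a made-up number, not a measurement from the talk.

```python
def radix_walk_latency(levels: int, access_latency: int) -> int:
    # Each level's address comes from the previous level's entry,
    # so the accesses are serialized.
    return levels * access_latency


def cuckoo_walk_latency(d: int, access_latency: int) -> int:
    # The d probe addresses depend only on the virtual address,
    # so in this idealized model the probes fully overlap.
    return access_latency if d > 0 else 0


LATENCY = 100  # hypothetical cycles per memory access
radix = radix_walk_latency(levels=4, access_latency=LATENCY)
cuckoo = cuckoo_walk_latency(d=3, access_latency=LATENCY)
```

The cuckoo walk issues up to d accesses instead of four, but none of them waits for another, which is what a multiple-issue out-of-order core can exploit.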

SLIDE 55

Lookup Multiple Page Sizes in Parallel

[Figure: the VA is looked up in three elastic cuckoo page tables at once, each with ways T1, T2, T3 and hash functions H1, H2, H3:
  • 4KB PTE entries: VA4KB is hashed into a d-ary table holding pte a, pte c, pte d, pte f, pte k
  • 2MB PMD entries: VA2MB is hashed into a table holding pmd a, pmd c, pmd d
  • 1GB PUD entries: VA1GB is hashed into a table holding pud c, pud d, pud f]
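Looking up three page sizes at once means forming three different keys from the same virtual address, one per table, and probing d ways for each. A minimal sketch, assuming the keys are simply the address shifted by the page-size bits (function names and the sample address are illustrative):

```python
PAGE_SHIFTS = {"4KB": 12, "2MB": 21, "1GB": 30}  # log2 of each page size


def vpns_for_all_sizes(va: int) -> dict:
    """Virtual page number of `va` under each supported page size."""
    return {size: va >> shift for size, shift in PAGE_SHIFTS.items()}


def all_probes(va: int, d: int = 3) -> list:
    """One (page size, way, key) probe per way of each page-size table."""
    return [(size, way, vpn)
            for size, vpn in vpns_for_all_sizes(va).items()
            for way in range(d)]


va = (1 << 30) + (1 << 21) + (1 << 12)   # hypothetical address
vpns = vpns_for_all_sizes(va)
probes = all_probes(va)
```

With three page sizes and d = 3 this yields nine independent probes per translation.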

SLIDE 56

New MMU Cache to Prune Parallelism

[Figure: the VA first consults an MMU cache placed in front of the three elastic cuckoo page tables (4KB PTE entries, 2MB PMD entries, 1GB PUD entries), each with ways T1, T2, T3 and hash functions H1, H2, H3.]

SLIDE 57

New MMU Cache to Prune Parallelism

[Figure: with the MMU cache, only the 2MB PMD table is probed. VA2MB is hashed with H1, H2, H3 into T1, T2, T3 (pmd a, pmd c, pmd d).]

SLIDE 58

New MMU Cache to Prune Parallelism

[Figure: same as Slide 57.]

SLIDE 59

New MMU Cache to Prune Parallelism

[Figure: with the MMU cache, only way T2 of the 2MB PMD table is probed. VA2MB is hashed with H2 and returns pmd a.]
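The pruning shown on these slides can be sketched as a hint lookup that narrows the set of probes. Everything about the hint structure here is hypothetical: the sketch assumes one hint per 1GB region, stored in a dictionary, holding a page size and optionally a way. The slides do not specify the actual organization of the MMU cache.

```python
PAGE_SHIFTS = {"4KB": 12, "2MB": 21, "1GB": 30}
REGION_SHIFT = 30  # assumption: one hint per 1GB region of the address space


def pruned_probes(va: int, hints: dict, d: int = 3) -> list:
    """(page size, way, key) probes left after consulting the hint cache."""
    hint = hints.get(va >> REGION_SHIFT)
    if hint is None:
        # No hint: probe every way of every page-size table.
        sizes, ways = list(PAGE_SHIFTS), range(d)
    else:
        size, way = hint
        sizes = [size]                                # only one table
        ways = range(d) if way is None else [way]     # possibly only one way
    return [(s, w, va >> PAGE_SHIFTS[s]) for s in sizes for w in ways]


va = (1 << 30) + (1 << 21) + (1 << 12)   # hypothetical address
no_hint = pruned_probes(va, {})
size_hint = pruned_probes(va, {1: ("2MB", None)})
way_hint = pruned_probes(va, {1: ("2MB", 1)})
```

In this sketch the probe count drops from nine to three with a page-size hint and to one with a page-size and way hint, mirroring the progression from Slide 56 to Slide 59.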

SLIDE 60

Evaluation

SLIDE 61

Application Speedup

[Figure: bar chart of speedup (y-axis from 0.4 to 1.4) for BC, BFS, CC, DC, DFS, GUPS, MUMmer, PR, SSSP, SysBench, TC, and Mean. Series shown: Radix 4KB.]

SLIDE 62

Application Speedup

[Figure: same chart as Slide 61.]

SLIDE 63

Application Speedup

[Figure: same chart with Radix THP added. Off-scale bars are labeled 3.31 and 1.64.]

SLIDE 64

Application Speedup

[Figure: same chart with Cuckoo 4KB added. Off-scale bars are labeled 3.31 and 1.64.]

SLIDE 65

Application Speedup

[Figure: same chart with all four series: Radix 4KB, Radix THP, Cuckoo 4KB, Cuckoo THP. Off-scale bars are labeled 3.31, 3.45, 1.64, and 1.82.]

SLIDE 66

Application Speedup

[Figure: same chart as Slide 65.]

Elastic Cuckoo Page Tables over Radix:

  • 3-28% (only 4KB pages)
  • 3-18% (+ huge pages)

SLIDE 67

Time Spent in Translations

[Figure: bar chart of time in translations (y-axis from 0.2 to 1) for BC, BFS, CC, DC, DFS, GUPS, MUMmer, PR, SSSP, SysBench, TC, and Mean. Series: Radix 4KB, Radix 2MB, Cuckoo 4KB, Cuckoo THP.]

SLIDE 68

Time Spent in Translations

[Figure: same chart as Slide 67.]

Elastic Cuckoo Page Tables reduce time spent in translation by 41% on average.

SLIDE 69

More in the Paper

  • Elastic cuckoo hashing operation
  • Design of MMU cache and other structures
  • More evaluation:
      • MMU and cache-subsystem characterization
      • Cuckoo walks characterization
      • Memory consumption of page tables

SLIDE 70

Takeaway: Elastic Cuckoo Page Tables

Better alternative to existing radix page tables

  • Exploits parallelism in virtual translation for the first time
  • Reduces the cost of dynamic resizing of hash tables
  • Application speedup over state-of-the-art:
      • 3-28% with 4KB pages
      • 3-18% with huge pages
  • Expected high performance impact on:
      • Virtualized environments and nested page tables (ongoing work)

SLIDE 71

Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for Parallelism

Dimitrios Skarlatos, Apostolos Kokolis, Tianyin Xu, Josep Torrellas
University of Illinois at Urbana-Champaign
skarlat2.web.engr.illinois.edu
ASPLOS 2020