Hybrid TLB Coalescing: Improving TLB Translation Coverage under Diverse Fragmented Memory Allocations

slide-1
SLIDE 1

Hybrid TLB Coalescing: Improving TLB Translation Coverage under Diverse Fragmented Memory Allocations

Chang Hyun Park, Taekyung Heo, Jungi Jeong, and Jaehyuk Huh

slide-2
SLIDE 2

Introduction

  • Virtual memory provides rich features
  • Requires an address translation
  • Workloads have grown in size, pressuring the TLB
  • Contiguous memory allocations to the rescue!

slide-3
SLIDE 3

Past Proposals: Large Pages

  • Large pages represent larger mappings (2MB)
  • Strict alignment required
  • Exact size match required

[Figure: virtual pages mapped to physical pages. Labels: Large page, Not Aligned, Size mismatch, V->P x 4, V->P x 1]

Efficient when large pages provided by OS
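
The two requirements above (strict alignment, exact size match) can be sketched as a simple eligibility check. This is only an illustration: the function name and the byte-address interface are assumptions, not something the slides define, and it tests coverage by exactly one 2MB page.

```python
PAGE_SIZE = 4 * 1024              # regular page: 4KB
LARGE_PAGE_SIZE = 2 * 1024 ** 2   # large page: 2MB, as on the slide


def can_use_large_page(vaddr: int, paddr: int, length: int) -> bool:
    """Hypothetical check: can [vaddr, vaddr + length) be mapped by ONE large page?

    Both the virtual and the physical start must be 2MB-aligned
    (strict alignment), and the region must be exactly 2MB long
    (exact size match).
    """
    aligned = vaddr % LARGE_PAGE_SIZE == 0 and paddr % LARGE_PAGE_SIZE == 0
    exact_size = length == LARGE_PAGE_SIZE
    return aligned and exact_size
```

A region that is contiguous but starts one 4KB page off alignment, or is one page short of 2MB, fails the check and falls back to regular pages.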

slide-15
SLIDE 15

Past Proposals: Cluster TLB

  • HW oriented clustering[5]
  • Cluster TLB represents flexible mapping within cluster
  • Provides flexible mapping within cluster block
  • However cluster size is fixed at design time

[Figure: virtual pages mapped to physical pages in clusters. Labels: V->P 1 3 2 (Clustered), V->P 1 X 2]

Efficient with small clustering opportunities

[5] Pham et al. HPCA ’14

slide-25
SLIDE 25

Past Proposals: Direct Segments

  • Segment based translation[1]
  • Three values represent contiguous translation of any size
  • Fully assoc. lookup for multiple segments (limits size of TLB)
  • Redundant Memory Mappings (RMM) [6] -> 32-entry fully-associative TLB

[Figure: virtual pages mapped to physical pages through two direct segments, each described by Base, Limit, and Offset]

Efficient with small number of big memory chunks

[1] Basu et al. ISCA ’13 [6] Karakostas et al. ISCA ’15
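
A minimal sketch of segment-based translation with the three values named in the figure (Base, Limit, Offset). The `Segment` class, the exclusive limit, and the linear scan standing in for a fully associative compare are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    base: int    # first virtual address covered by the segment
    limit: int   # first virtual address past the segment (assumed exclusive)
    offset: int  # physical address = virtual address + offset


def translate(segments: List[Segment], vaddr: int) -> Optional[int]:
    """Return the physical address, or None if no segment covers vaddr.

    Hardware would compare against all segments in parallel (fully
    associative); the loop here only models that behaviour.
    """
    for seg in segments:
        if seg.base <= vaddr < seg.limit:
            return vaddr + seg.offset
    return None
```

Because every lookup compares against every segment, the number of segments has to stay small, which matches the 32-entry limit mentioned above.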

slide-35
SLIDE 35

Past Proposals: Summary

  • Large pages
      • Affinity for large pages (2MB)
  • Cluster TLB
      • Affinity for clustering of mapping of up to 8 pages
  • Segment translations
      • Affinity for small number of large chunks (32 entry TLB)

Prior proposals efficiently support specific memory mapping scenarios

slide-37
SLIDE 37

Large Contiguity vs. Memory Non-Uniformity

  • Conflicting goals of NUMA systems and large pages[2]
  • Memory traffic balance vs. efficient address translation
  • Heterogeneous memory worsens non-uniformity [3][4]

[Figure: two rows of four memory nodes, one labeled Regular Pages and one labeled Large Page; further labels: Cold Pages, Hot Pages]

Different systems have different memory mapping needs

[2] Baptiste et al. ATC ’14 [3] Lee et al. ISCA ’15 [4] Agarwal et al. ASPLOS ’17

slide-57
SLIDE 57

Need for an All-Rounder Solution

  • Contiguity distribution varies among workloads
  • Also varies within the same workload[7]

[Figure: CDF of process memory, with regions labeled "Well suited for Cluster", "Well suited for Large pages", and "Well suited for ??"]

Can we make a TLB scheme that works well for diverse scenarios?

[7] Kwon et al. OSDI ’16

slide-64
SLIDE 64

Hybrid TLB Coalescing

  • HW-SW Joint Effort
  • HW offers adjustable TLB coverage
      • Number of TLB entries fixed
      • Coverage of entry adjustable
  • OS decides best TLB coverage
      • Adjusts TLB coverage per process
  • OS identifies contiguous chunks
      • Marks onto process page table

[Figure: TLB (Hardware) and Page Table (Operating System)]

We propose a TLB with adjustable coverage

slide-72
SLIDE 72

Anchor

  • Anchors are special entries in the page table
  • Placed at every alignment of the anchor distance
  • Anchor distance is a power of 2 (for encoding efficiency)
  • Anchor distance configurable by OS

[Figure: page table with anchors marked for Anchor Distance = 8 and for Anchor Distance = 4]
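
Because the anchor distance is a power of 2, finding the anchor that governs a page reduces to bit masking. The sketch below shows that arithmetic; the helper names are hypothetical and virtual page numbers (VPNs) are plain integers.

```python
def _check_distance(distance: int) -> None:
    """Reject anchor distances that are not a positive power of 2."""
    if distance <= 0 or distance & (distance - 1) != 0:
        raise ValueError("anchor distance must be a power of 2")


def anchor_vpn(vpn: int, distance: int) -> int:
    """VPN of the anchor at or below vpn (vpn rounded down to a multiple of distance)."""
    _check_distance(distance)
    return vpn & ~(distance - 1)


def anchor_offset(vpn: int, distance: int) -> int:
    """Number of pages between vpn and its anchor."""
    _check_distance(distance)
    return vpn & (distance - 1)
```

For example, with an anchor distance of 8, page 13 belongs to the anchor at page 8 with offset 5; with a distance of 4, the same page belongs to the anchor at page 12 with offset 1.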

slide-79
SLIDE 79

Anchor Page Table

  • Uses the Page Table
  • Anchor covers up to distance (4 in this example) contiguous pages
  • Each anchor represents contiguity that begins at anchor
  • OS marks contiguity onto the anchor page table

[Figure: virtual pages mapped to physical pages through anchor mappings and regular mappings; numeric labels 2, 3, 4, 1, 4]
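
One way the OS-side marking could work is sketched below: for every anchor position, count how many pages starting at the anchor map to consecutive physical frames, capped at the anchor distance as the slide states. The dictionary-based page table and the function name are assumptions for illustration only.

```python
from typing import Dict


def mark_anchor_contiguity(page_table: Dict[int, int], distance: int) -> Dict[int, int]:
    """Return {anchor VPN: contiguity} for a VPN -> PFN page table.

    Contiguity is the length of the run of pages, beginning at the
    anchor, whose PFNs increase by one per page. It is capped at
    `distance` pages.
    """
    contiguity = {}
    for vpn, pfn in page_table.items():
        if vpn % distance != 0:
            continue  # not an anchor position
        run = 1
        while run < distance and page_table.get(vpn + run) == pfn + run:
            run += 1
        contiguity[vpn] = run
    return contiguity
```

With a distance of 4 and a table where pages 0 to 2 map to frames 100 to 102 and page 3 maps elsewhere, the anchor at page 0 is marked with contiguity 3.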

slide-82
SLIDE 82

Anchor TLB

  • Integrated into the L2 TLB
  • L1 keeps regular entries
  • Caches both regular and anchor page table entries
  • Regular and anchor indexed differently

[Figure: Anchor TLB (4 sets) holding anchor entries and regular entries for the virtual pages of the previous example. Entry format: Tag | Contiguity. Entries shown: 0 | 2, 0 | 3, 0 | 4, 1 | 4, and three entries 3 | X]
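
The slide does not spell out how the two index functions differ. One plausible scheme, shown here purely as an assumption, indexes regular entries with the low VPN bits and anchor entries with the bits just above the anchor-distance bits, so that consecutive anchors spread across sets instead of colliding.

```python
NUM_SETS = 4  # matches the "Anchor TLB (4 sets)" example


def regular_set_index(vpn: int, num_sets: int = NUM_SETS) -> int:
    """Assumed regular indexing: low bits of the VPN."""
    return vpn % num_sets


def anchor_set_index(vpn: int, distance: int, num_sets: int = NUM_SETS) -> int:
    """Assumed anchor indexing: drop the log2(distance) low bits first.

    Every page under the same anchor yields the same set, so a lookup
    for any of those pages finds the anchor entry.
    """
    return (vpn // distance) % num_sets
```

With a distance of 4, pages 4 to 7 all index set 1 for their anchor lookup, while their regular lookups index sets 0 to 3.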

slide-83
SLIDE 83

Anchor TLB Lookup

  • On an L1 TLB miss, the Anchor TLB looks up
      • Regular TLB first
      • Anchor TLB next

[Figure: Anchor TLB (4 sets) from the previous slide; the example lookup matches anchor entry 0 | 3]

Offset (2) < Contiguity (3): HIT, return Anchor PFN + offset
Otherwise: MISS, start page walk
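
The lookup order above can be sketched as follows. Dictionaries stand in for the set-associative hardware, and the function name and data layout are assumptions; only the hit rule (offset < contiguity, return anchor PFN + offset) comes from the slide.

```python
from typing import Dict, Optional, Tuple


def l2_lookup(
    vpn: int,
    regular: Dict[int, int],              # VPN -> PFN
    anchors: Dict[int, Tuple[int, int]],  # anchor VPN -> (anchor PFN, contiguity)
    distance: int,                        # anchor distance, a power of 2
) -> Optional[int]:
    """Return the PFN on a hit, or None when a page walk is needed."""
    # 1) Regular entry first.
    if vpn in regular:
        return regular[vpn]
    # 2) Anchor entry next: locate the anchor covering this VPN.
    avpn = vpn & ~(distance - 1)
    entry = anchors.get(avpn)
    if entry is not None:
        anchor_pfn, contiguity = entry
        offset = vpn - avpn
        if offset < contiguity:
            return anchor_pfn + offset  # HIT
    return None  # MISS: start page walk
```

In the slide's example the anchor has contiguity 3 and the requested page is at offset 2, so the lookup hits and returns the anchor PFN plus 2. A page at offset 3 would miss in the anchor entry.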

slide-91
SLIDE 91

Operating System Responsibilities

  • OS periodically selects process anchor distance
      • Heuristic algorithm to minimize TLB entry count
  • OS adjusts anchor distance
      • Anchor distance based on selection algorithm
  • OS marks mapping contiguity
      • Memory mapping contiguity in anchor page table entry
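
The slides do not give the selection heuristic itself. The sketch below is one hypothetical cost model in that spirit: for each candidate distance, estimate how many TLB entries would be needed to cover a process's contiguous chunks, then pick the distance with the lowest estimate. The chunk representation, candidate list, and cost function are all assumptions.

```python
from typing import Iterable, List, Tuple

Chunk = Tuple[int, int]  # (start VPN, length in pages) of one contiguous mapping


def entries_needed(chunks: Iterable[Chunk], distance: int) -> int:
    """Estimated TLB entries to cover all chunks at a given anchor distance."""
    total = 0
    for start, length in chunks:
        end = start + length
        # First anchor position at or after the chunk start (round up).
        first_anchor = ((start + distance - 1) // distance) * distance
        if first_anchor >= end:
            total += length  # no anchor inside the chunk: regular entries only
            continue
        total += first_anchor - start  # pages before the first anchor
        # One anchor entry per distance-sized piece of the remainder.
        total += (end - first_anchor + distance - 1) // distance
    return total


def select_anchor_distance(chunks: List[Chunk], candidates=(2, 4, 8, 16)) -> int:
    """Pick the candidate with the fewest estimated entries (ties: smallest first)."""
    return min(candidates, key=lambda d: entries_needed(chunks, d))
```

A single 16-page chunk favors the largest distance, while many aligned 4-page chunks favor a distance of 4; a distance that is too large leaves chunks without any anchor inside them.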

slide-92
SLIDE 92

Simulation Methodology

  • Trace based TLB simulator (Based on Intel Haswell)

TLB Configuration

Common L1                      | 4KB: 64 entry, 4 way; 2MB: 32 entry, 4 way
Baseline L2 / THP              | 4KB/2MB: 1024 entry, 8 way
Cluster                        | Regular (4KB/2MB): 768 entry, 6 way; Cluster-8: 320 entry, 5 way
RMM (Multiple segments)        | Baseline L2 TLB + RMM: 32 entry, fully-assoc.
Anchor (Selected/Static Ideal) | 4KB/2MB/anchor: 1024 entry, 8 way

slide-93
SLIDE 93

Memory Mapping Scenarios

  • Two classes of memory mapping scenarios
      • Two real system memory mappings
      • Four synthetic memory mappings

Name   | Trace information
demand | Default Linux memory mapping
eager  | ‘Eager’ allocation
low    | 1 – 16 pages (4KB – 64KB)
medium | 1 – 512 pages (4KB – 2MB)
high   | 512 – 64K pages (2MB – 256MB)
max    | Maximum contiguity

slide-94
SLIDE 94

Evaluation – TLB Misses of demand mapping

[Figure: Relative TLB Misses (%), axis ticks 10 to 100, for THP, Cluster, RMM, Anchor Selected, and Anchor Ideal]

Anchor TLB adjusted to satisfy small contiguities

slide-98
SLIDE 98

Evaluation – TLB Misses of medium mapping

[Figure: Relative TLB Misses (%), axis ticks 10 to 100, for THP, Cluster, RMM, Anchor Selected, and Anchor Ideal]

Anchor adjusted coverage to provide best TLB reduction

slide-100
SLIDE 100

Evaluation – TLB Misses of all mappings

[Figure: Relative TLB Misses (%), axis ticks 10 to 100, for Baseline, THP, Cluster, RMM, Anchor Selected, and Anchor Ideal across demand, eager, low cont., med cont., high cont., and max cont. mappings]

Anchor TLB performs well for diverse mapping scenarios

slide-103
SLIDE 103

Conclusion

  • Hybrid TLB Coalescing is a HW-SW joint effort
  • Anchor TLB provides adjustable coverage
  • TLB entry coverage grows and shrinks dynamically
  • OS provides contiguity hint using the page table
  • OS picks adequate contiguity per-process
  • Hybrid TLB Coalescing performs:
      • Best for small to intermediate contiguities
      • Similar to best prior scheme for large contiguities