slide-1
SLIDE 1

Latr: Lazy Translation Coherence

Mohan Kumar*, Steffen Maass*, Sanidhya Kashyap, Ján Veselý‡, Zi Yan‡, Taesoo Kim, Abhishek Bhattacharjee‡, Tushar Krishna

Georgia Institute of Technology

‡Rutgers University

* Co-First Authors

March 28, 2018

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 1 / 24

slide-5
SLIDE 5

Motivation

Large NUMA machines

Terabytes of memory

Microsecond latency

⇒ Problem of microsecond latency in system services

⇒ TLB coherence is a contributor in an important subset

slide-7
SLIDE 7

Impact of TLB coherence on applications

Multi-core MapReduce application

Prior research: 10x increase in shootdown time with increasing core counts

Web servers (e.g., Apache)

Prior research and our findings: ≈35% of time spent in TLB shootdown

Die-stacked Memory

Swapping between on-chip and off-chip memory

Disaggregated Memory

Swapping between local and remote memory

⇒ Can we mitigate this costly TLB shootdown?


slide-8
SLIDE 8

Table of contents

1. TLB Shootdown Background
2. Latr: Asynchronous TLB Shootdowns
3. Evaluation
4. Conclusion

slide-10
SLIDE 10

Translation lookaside buffer: Introduction

Cache for virtual → physical mappings, a per-core structure

Accessed on every load/store

Unlike data caches (L3, etc.), coherence is managed by the OS

TLB coherence significantly impacts application performance

[Figure: a virtual address is looked up in the TLB. On a hit, the TLB returns the physical address; on a miss, a page table walk (PGD → PUD → PMD → PTE) produces the physical address.]
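To make the hit/miss path and the coherence problem concrete, here is a minimal sketch of a per-core TLB as a small LRU cache. The `Tlb` class, the 4 KiB page size, and the flat dictionary standing in for the PGD/PUD/PMD/PTE walk are all assumptions made for illustration; none of them come from the talk.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumption: 4 KiB pages


class Tlb:
    """Toy per-core TLB: an LRU cache from virtual page number to frame number."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr, page_table):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.entries:
            # TLB hit: no page table access at all.
            self.hits += 1
            self.entries.move_to_end(vpn)
        else:
            # TLB miss: "walk" the page table (a flat dict in this sketch).
            # A missing mapping raises KeyError, standing in for a fault.
            self.misses += 1
            self.entries[vpn] = page_table[vpn]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return (self.entries[vpn] << PAGE_SHIFT) | offset

    def invalidate(self, vpn):
        """Software-initiated flush of one entry (what a shootdown does)."""
        self.entries.pop(vpn, None)
```

In this model, removing a mapping from `page_table` does not affect a core that already cached it: `translate` keeps returning the old frame until `invalidate` is called on that core, which is the situation a TLB shootdown exists to resolve.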

slide-12
SLIDE 12

TLB coherence: Background

Hardware-based Approaches

Providing cache coherence to TLBs

ISA-level instruction support (ARM)

Microcode-based approaches

Software-based Approaches

Current commodity OS design: Use Inter-Processor Interrupts (IPI)

Optimization: Reduce number of shootdowns, better tracking

Multikernel design: Use Message-Passing

⇒ More hardware complexity

⇒ TLB shootdowns still significant

slide-14
SLIDE 14

TLB shootdown internals in Linux

munmap() on core 1, application running on cores 1, 2, and 5:

[Figure: timeline across Core1 to Core8, each core with its own TLB; application threads App1, App2, and App5 run on top of the OS, the remaining cores are idle.]

Timeline: ❶ munmap()

slide-15
SLIDE 15

TLB shootdown internals in Linux

Context switch on core 1, local TLB shootdown:

[Figure: eight-core timeline (Core1 to Core8); core 1 enters the OS and flushes its own TLB.]

Timeline: ❶ munmap() ❷ Local Shootdown

slide-16
SLIDE 16

TLB shootdown internals in Linux

Notify cores 2 and 5 via IPI, application blocked on core 1:

[Figure: eight-core timeline (Core1 to Core8); core 1 sends the IPIs and then spin-waits. Time marker at 2.2 µs.]

Timeline: ❶ munmap() ❷ Local Shootdown ❸ Send IPIs

slide-17
SLIDE 17

TLB shootdown internals in Linux

Execute context switch and TLB shootdown on cores 2 and 5:

[Figure: eight-core timeline (Core1 to Core8); the two remote cores enter the OS and flush their TLBs while core 1 spin-waits. Time marker at 2.2 µs.]

Timeline: ❶ munmap() ❷ Local Shootdown ❸ Send IPIs ❹ Remote Shootdown

slide-18
SLIDE 18

TLB shootdown internals in Linux

Cores 2 and 5 respond with an ACK via shared memory:

[Figure: eight-core timeline (Core1 to Core8); the two remote cores acknowledge while core 1 spin-waits. Time marker at 2.2 µs.]

Timeline: ❶ munmap() ❷ Local Shootdown ❸ Send IPIs ❹ Remote Shootdown ❺ IPI ACK

slide-19
SLIDE 19

TLB shootdown internals in Linux

Control is returned on all cores, TLB shootdown completed:

[Figure: eight-core timeline (Core1 to Core8); all cores resume the application. Time markers at 2.2 µs and 5.9 µs; a brace on the timeline is labeled "Savings potential for asynchronous approach with LATR".]

Timeline: ❶ munmap() ❷ Local Shootdown ❸ Send IPIs ❹ Remote Shootdown ❺ IPI ACK ❻ munmap() complete
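The six steps of the synchronous path can be sketched as a small sequential model. This is only an illustration of the ordering: the `Core` class, `sync_shootdown`, and the `log` list are hypothetical names, IPI delivery is modeled as a direct function call, and no real timing is simulated.

```python
class Core:
    """Toy core: a set of cached virtual page numbers plus an ACK flag."""

    def __init__(self, cid):
        self.cid = cid
        self.tlb = set()
        self.acked = False

    def handle_ipi(self, pages):
        # Step 4: remote shootdown inside the interrupt handler.
        self.tlb -= pages
        # Step 5: acknowledge through a flag in shared memory.
        self.acked = True


def sync_shootdown(initiator, cores, targets, pages, log):
    """Synchronous shootdown: the initiator does not return until all ACKs arrive."""
    log.append("munmap()")                      # step 1
    initiator.tlb -= pages                      # step 2: local shootdown
    log.append("local shootdown")
    for cid in targets:                         # step 3: IPIs sent one at a time
        cores[cid].acked = False
        log.append(f"IPI to core {cid}")
        cores[cid].handle_ipi(pages)            # delivery modeled as a call
    # Spin-wait: in this sequential model the handlers have already run,
    # so the loop exits immediately; on real hardware this is the stall.
    while not all(cores[cid].acked for cid in targets):
        pass
    log.append("munmap() complete")             # step 6
```

Everything between "local shootdown" and "munmap() complete" sits on the initiating core's critical path, which is the portion an asynchronous scheme tries to remove.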

slide-20
SLIDE 20

Observation

Synchronous TLB shootdown is expensive:

Up to 6 µs delay with two sockets

Processing IPIs is expensive:

Interrupt handler on remote core

Long wait time on initiating core

IPI send-and-wait delay:

Unicast delivery of the IPIs (one at a time)

slide-22
SLIDE 22

TLB shootdown: A necessary evil

Cost of a simple memory unmap operation (munmap()):

1 page on 16 cores with 2 sockets: up to 8 µs

≈ 70% from TLB shootdown alone

More expensive with more sockets:

[Figure: munmap() latency (µs, axis 1 to 8) versus number of cores (2 to 16), with series for 1 socket and 2 sockets.]

slide-23
SLIDE 23

TLB shootdown: A necessary evil

Cost of a simple memory unmap operation (munmap()):

1 page on 16 cores with 2 sockets: up to 8 µs

≈ 70% from TLB shootdown alone

More expensive with more sockets:

[Figure: latency (µs, axis 1 to 8) versus number of cores (2 to 16), with series for munmap() and for the TLB shootdown alone.]

slide-24
SLIDE 24

Table of contents

1. TLB Shootdown Background
2. Latr: Asynchronous TLB Shootdowns
3. Evaluation
4. Conclusion

slide-26
SLIDE 26

In this talk: Latr

Latr: Lazy Translation Coherence

Perform asynchronous TLB shootdown

Remove remote shootdown from the critical path

Take advantage of a change in the ABI without affecting applications’ correctness

Use shared memory instead of IPI

Eliminate send-and-wait delay of IPIs

Scope:

free operations (in this talk)

migration operations (see our paper)

⇒ But: How to perform asynchronous shootdown?

slide-27
SLIDE 27

Latr States

Store virtual addresses to be flushed

Remote cores shoot down their local TLB during:

OS context switch

OS scheduler tick (upper bound: 1 ms in Linux)

[Figure: eight cores (Core1 to Core8), each with a TLB and its own LATR states (S1, S2, ..., S64), shared across cores through cache coherency (QPI). Each state holds: start; end; mm; flags; core list; active.]
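The per-core state array described above could be modeled roughly as follows. The field names follow the slide (start, end, mm, flags, core list, active) and the 64 slots follow the S1 to S64 labels; the `LatrState` and `PerCoreStates` classes, the `publish` method, and the behavior when no slot is free are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

STATES_PER_CORE = 64  # the slide shows states S1 .. S64 per core


@dataclass
class LatrState:
    start: int = 0                                  # first virtual address to flush
    end: int = 0                                    # last virtual address to flush
    mm: int = 0                                     # identifies the address space
    flags: int = 0
    cores: Set[int] = field(default_factory=set)    # cores that still must flush
    active: bool = False


class PerCoreStates:
    """Fixed-size array of states owned by one core, readable by all others."""

    def __init__(self):
        self.slots = [LatrState() for _ in range(STATES_PER_CORE)]

    def publish(self, start, end, mm, flags, cores) -> Optional[int]:
        """Claim the first inactive slot; return its index, or None if full."""
        for i, slot in enumerate(self.slots):
            if not slot.active:
                self.slots[i] = LatrState(start, end, mm, flags, set(cores), True)
                return i
        # Assumption: a real system needs some fallback here (for example
        # the synchronous path); that is outside the scope of this sketch.
        return None
```

For example, `PerCoreStates().publish(0x01, 0x0F, 0x1234, 0x1, {2, 5})` fills slot 0 with the values used in the walkthrough that follows in the talk.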

slide-29
SLIDE 29

Latr: Example

munmap() initiated on core 1:

[Figure: eight-core timeline (Core1 to Core8), each core with its own LATR states; application threads App1, App2, and App5 run on top of the OS, the remaining cores are idle.]

Timeline: ❶ munmap()

slide-30
SLIDE 30

Latr: Example

Set up Latr state (for cores 2 and 5), local shootdown:

[Figure: eight-core timeline (Core1 to Core8), each core with its own LATR states.]

Core1, LATR State1:

start | end | mm | flags | Core list | active
0x01 | 0x0F | 0x1234 | 0x1 | {2, 5} | True

Timeline: ❶ munmap() ❷ Local Shootdown ❸ Create LATR State

slide-31
SLIDE 31

Latr: Example

Return control on core 1. Time taken: 2.3 µs, 70% reduction:

[Figure: eight-core timeline (Core1 to Core8), each core with its own LATR states. Time marker at 2.3 µs.]

Core1, LATR State1:

start | end | mm | flags | Core list | active
0x01 | 0x0F | 0x1234 | 0x1 | {2, 5} | True

Timeline: ❶ munmap() ❷ Local Shootdown ❸ Create LATR State ❹ munmap() complete

slide-32
SLIDE 32

Latr: Example

Scheduler tick on core 2, local shootdown, reset state:

[Figure: eight-core timeline (Core1 to Core8), each core with its own LATR states.]

Core1, LATR State1:

start | end | mm | flags | Core list | active
0x01 | 0x0F | 0x1234 | 0x1 | {5} | True

Timeline: ❶ munmap() ❷ Local Shootdown ❸ Create LATR State ❹ munmap() complete ❺ Shootdown Core2

slide-33
SLIDE 33

Latr: Example

Scheduler tick on core 5, local shootdown, reset state:

[Figure: eight-core timeline (Core1 to Core8), each core with its own LATR states.]

Core1, LATR State1:

start | end | mm | flags | Core list | active
0x01 | 0x0F | 0x1234 | 0x1 | {} | False

Timeline: ❶ munmap() ❷ Local Shootdown ❸ Create LATR State ❹ munmap() complete ❺ Shootdown Core2 ❻ Shootdown Core5

slide-34
SLIDE 34

Latr: Example

Shootdown complete, Latr entry can be reused:

[Figure: eight-core timeline (Core1 to Core8), each core with its own LATR states.]

Core1, LATR State1:

start | end | mm | flags | Core list | active
0x01 | 0x0F | 0x1234 | 0x1 | {} | False

Timeline: ❶ munmap() ❷ Local Shootdown ❸ Create LATR State ❹ munmap() complete ❺ Shootdown Core2 ❻ Shootdown Core5 ❼ Shootdown complete
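The walkthrough above can be replayed with a small simulation. It is a sketch under stated assumptions: `LatrSim`, its method names, the dictionary-based state, and treating `end` as inclusive are all illustrative choices, and TLBs are modeled as plain sets of page numbers.

```python
class LatrSim:
    """Toy model of the lazy path: munmap() returns early, ticks do the rest."""

    def __init__(self, ncores):
        self.tlbs = {core: set() for core in range(1, ncores + 1)}
        self.states = []  # published states, visible to every core

    def munmap(self, core, start, end, mm, targets):
        pages = set(range(start, end + 1))  # assumption: end is inclusive
        self.tlbs[core] -= pages            # local shootdown on the initiator
        state = {"start": start, "end": end, "mm": mm,
                 "cores": set(targets), "active": True}
        self.states.append(state)           # create the LATR state
        return state                        # return at once: no IPI, no waiting

    def scheduler_tick(self, core):
        """Run on each core at a scheduler tick or context switch."""
        for state in self.states:
            if state["active"] and core in state["cores"]:
                pages = set(range(state["start"], state["end"] + 1))
                self.tlbs[core] -= pages        # flush own TLB only
                state["cores"].discard(core)    # remove self from the core list
                if not state["cores"]:
                    state["active"] = False     # entry can now be reused
```

Replaying the example: after `munmap(1, 0x01, 0x0F, 0x1234, {2, 5})`, cores 2 and 5 still hold stale entries; `scheduler_tick(2)` leaves the core list at `{5}`, and `scheduler_tick(5)` empties it and clears `active`.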

slide-35
SLIDE 35

Lazy TLB shootdown: Correctness

Problem: the same physical or virtual memory is reused before the lazy shootdown has completed

Leads to memory corruption

⇒ Avoid same physical/virtual page reuse

Upper bound for TLB shootdown with Latr is 1 ms

OS physical/virtual memory reclamation is delayed by two scheduler ticks (2 ms)

Memory overhead is bounded by 21 MB
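The reuse rule can be illustrated with a deferred free list that holds pages back for two ticks. This is a minimal sketch: `LazyReclaimer`, `defer_free`, and the tick counter are hypothetical names, and only the two-tick delay is taken from the slide.

```python
from collections import deque

RECLAIM_DELAY_TICKS = 2  # from the slide: reclamation delayed by two scheduler ticks


class LazyReclaimer:
    """Holds freed pages back until every core has had a chance to flush."""

    def __init__(self):
        self.now = 0
        self.pending = deque()   # (tick at which reuse becomes safe, page)
        self.free_pages = []     # pages that may be handed out again

    def defer_free(self, page):
        # The page is unmapped now but must not be reused yet, because a
        # remote TLB may still hold a stale translation to it.
        self.pending.append((self.now + RECLAIM_DELAY_TICKS, page))

    def tick(self):
        self.now += 1
        # Entries are queued in release order, so only the head needs checking.
        while self.pending and self.pending[0][0] <= self.now:
            self.free_pages.append(self.pending.popleft()[1])
```

A page passed to `defer_free` stays unavailable after one call to `tick` and moves to `free_pages` on the second, by which point each remote core has passed at least one scheduler tick and flushed its stale entry.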

slide-36
SLIDE 36

Lazy TLB shootdown: Incorrect accesses

Memory accesses before Latr shootdown:

Consequence of an incorrect application: use after free

Before Latr shootdown, access (reads and writes) is allowed

Exists in the current OS implementation

After Latr shootdown, access results in a segmentation fault

slide-37
SLIDE 37

Scope of Latr

ABI change for free operations

Support limited to a few, frequently used operations:

Classification | Operations | Lazy operation possible
Free | munmap(): unmap address range | ✓
Free | madvise(): free memory range | ✓
Migration | AutoNUMA page migration (⇒ see paper) | ✓
Migration | Page swap: swap page to disk | ✓
Permission | mprotect(): change page permission | -
Ownership | CoW: Copy on Write | -
Remap | mremap(): change physical address | -

slide-38
SLIDE 38

Table of contents

1. TLB Shootdown Background
2. Latr: Asynchronous TLB Shootdowns
3. Evaluation
4. Conclusion

slide-39
SLIDE 39

Evaluation: Questions

Latr prototype developed for Linux 4.10

Evaluation questions:

What are Latr’s benefits with microbenchmarks?

What are Latr’s benefits with real-world applications exhibiting many TLB shootdowns?

What is the cost of Latr?

slide-40
SLIDE 40

Microbenchmark on eight sockets

Linux and Latr calling munmap() with one page on 120 cores:

[Figure: two panels, "Cost of munmap" and "Cost of TLB Shootdown", each plotting latency (µs) against number of cores (20 to 120) for Linux and Latr.]

⇒ Up to 66.7% reduction for munmap()

slide-41
SLIDE 41

Serving files with Apache

Linux, ABIS [ATC17], and Latr on 2 sockets:

[Figure: two panels plotted against number of cores (2 to 12) for Linux, ABIS, and Latr: "Apache Performance" (requests per second) and "TLB Shootdowns per second".]

⇒ Up to 59.9% more requests per second than Linux, 37.9% higher than ABIS.

slide-42
SLIDE 42

Cost of Latr

Memory overhead is bounded by 21 MB

Performance overheads for applications with few TLB shootdowns:

[Figure: normalized application performance (axis 0.97 to 1.03) and TLB shootdowns per second (axis 20 to 100) for nginx1, Apache1, bodytrack16, canneal16, facesim16, ferret16, and streamcluster16.]

⇒ Latr shows small performance overheads of up to 1.7% due to added operations during scheduling.

slide-43
SLIDE 43

Future work

Further applications of Latr in:

Disaggregated data centers

Heterogeneous memory

Applicability to PCID/ASID-based approaches

Impact on new features such as KPTI, ...?

slide-45
SLIDE 45

Latr: Takeaways

The synchronous TLB shootdown is expensive

We propose a software-based asynchronous shootdown mechanism

Significant improvement in application performance with Latr:

70% reduction for munmap(), for 16-core and 120-core machines

Improves Apache’s throughput by 60%

Asynchronous mechanism applicable to other services:

AutoNUMA (see our paper)

Thanks!