Latr: Lazy Translation Coherence

Mohan Kumar*, Steffen Maass*, Sanidhya Kashyap, Ján Veselý‡, Zi Yan‡, Taesoo Kim, Abhishek Bhattacharjee‡, Tushar Krishna
‡ Rutgers University
Georgia Institute of Technology
* Co-First Authors
March 28, 2018

Mohan Kumar · Latr: Lazy Translation Coherence · March 28, 2018

Motivation

- Large NUMA machines
- Terabytes of memory
- Microsecond latency

⇒ Problem of microsecond latency in system services
⇒ TLB coherence is a contributor in an important subset

Impact of TLB coherence on applications

- Multi-core MapReduce application
  - Prior research: 10x increase in shootdown time with increasing core counts
- Web servers (e.g., Apache)
  - Prior research and our findings: ≈ 35% of time spent in TLB shootdown
- Die-stacked memory
  - Swapping between on-chip and off-chip memory
- Disaggregated memory
  - Swapping between local and remote memory

⇒ Can we mitigate this costly TLB shootdown?

Table of contents

1. TLB Shootdown Background
2. Latr: Asynchronous TLB Shootdowns
3. Evaluation
4. Conclusion

Translation lookaside buffer: Introduction

- Cache for virtual → physical mapping, per-core structures
- Accessed on every load/store
- Unlike data caches (L3, etc.), coherence managed by OS
- TLB coherence significantly impacts application performance

[Figure: a virtual address is looked up in the TLB. Hit: physical address. Miss: page table walk (PGD, PUD, PMD, PTE), then physical address.]
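
The hit and miss paths in the figure can be illustrated with a small model. This is only a sketch: the bit layout (9 index bits per level, 12 offset bits) assumes x86-64 4-level paging with 4 KiB pages, which the slide does not state, and `split_virtual_address` and `Tlb` are hypothetical names introduced here for illustration.

```python
# Assumption: x86-64 4-level paging with 4 KiB pages (9 index bits per level,
# 12 offset bits). The slide names the levels but not the bit layout.
PAGE_SHIFT = 12
BITS_PER_LEVEL = 9
LEVELS = ("PGD", "PUD", "PMD", "PTE")  # walked top-down on a TLB miss


def split_virtual_address(vaddr: int) -> dict:
    """Return the per-level table indices and the page offset for vaddr."""
    parts = {"offset": vaddr & ((1 << PAGE_SHIFT) - 1)}
    mask = (1 << BITS_PER_LEVEL) - 1
    for depth, level in enumerate(LEVELS):
        shift = PAGE_SHIFT + BITS_PER_LEVEL * (len(LEVELS) - 1 - depth)
        parts[level] = (vaddr >> shift) & mask
    return parts


class Tlb:
    """Toy per-core TLB: caches virtual page number -> physical frame number."""

    def __init__(self, page_table: dict):
        self.page_table = page_table  # stands in for the in-memory page table
        self.entries = {}             # cached translations
        self.walks = 0                # number of simulated page table walks

    def translate(self, vaddr: int) -> int:
        vpn = vaddr >> PAGE_SHIFT
        if vpn not in self.entries:   # miss: walk the page table, then cache
            self.walks += 1
            self.entries[vpn] = self.page_table[vpn]
        pfn = self.entries[vpn]       # hit: use the cached translation
        return (pfn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
```

Because the cached entry is used without consulting the page table again, a mapping that is later removed stays visible to that core until its TLB entry is invalidated. This is why the OS has to manage TLB coherence explicitly.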

TLB coherence: Background

Hardware-based approaches ⇒ more hardware complexity
- Providing cache coherence to TLBs
- ISA-level instruction support (ARM)
- Microcode-based approaches

Software-based approaches ⇒ TLB shootdowns still significant
- Current commodity OS design: use inter-processor interrupts (IPIs)
- Optimization: reduce number of shootdowns, better tracking
- Multikernel design: use message passing

TLB shootdown internals in Linux

Setup (from the figure): eight cores, each with its own TLB. The application runs on cores 1, 2, and 5; the remaining cores are idle.

1. munmap(): called on core 1 while the application is running on cores 1, 2, and 5.
2. Local shootdown: context switch on core 1, local TLB shootdown.
3. Send IPIs: core 1 notifies cores 2 and 5 via IPI; the application is blocked on core 1, which spin-waits.
4. Remote shootdown: cores 2 and 5 execute a context switch and a TLB shootdown.
5. IPI ACK: cores 2 and 5 respond with an ACK via shared memory.
6. munmap() complete: control is returned on all cores, TLB shootdown completed.

Timeline (from the figure): marks at 2.2 µs and 5.9 µs, with a bracketed span labeled "Savings potential for asynchronous approach with Latr".
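
The six steps above can be traced in a small sequential simulation. It is a sketch, not Linux code: `Core`, `handle_shootdown_ipi`, and `synchronous_munmap` are hypothetical names, and a direct function call stands in for IPI delivery, the interrupt handler, and the shared-memory ACK.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    cpu: int
    tlb: set = field(default_factory=set)  # cached virtual page numbers
    interrupts_taken: int = 0

    def handle_shootdown_ipi(self, vpn: int) -> bool:
        """Stand-in for the interrupt handler on a remote core."""
        self.interrupts_taken += 1
        self.tlb.discard(vpn)  # step 4: remote shootdown
        return True            # step 5: ACK back to the initiator


def synchronous_munmap(initiator: Core, running_on: list,
                       page_table: dict, vpn: int) -> int:
    """Unmap vpn synchronously; return the number of IPIs sent and awaited."""
    page_table.pop(vpn, None)              # step 1: munmap() removes the mapping
    initiator.tlb.discard(vpn)             # step 2: local shootdown
    remotes = [c for c in running_on if c is not initiator]
    acks = 0
    for core in remotes:                   # step 3: IPIs go out one at a time
        if core.handle_shootdown_ipi(vpn):
            acks += 1
    assert acks == len(remotes)            # initiator waits for every ACK
    return acks                            # step 6: munmap() completes
```

With the application on cores 1, 2, and 5, the call interrupts cores 2 and 5 and returns only after both have invalidated the entry. The idle cores are never interrupted.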

Observation

- Synchronous TLB shootdown is expensive: up to 6 µs delay with two sockets
- Processing IPIs is expensive:
  - Interrupt handler on remote core
  - Long wait time on initiating core
- IPI send-and-wait delay:
  - Unicast delivery of the IPIs (one at a time)

TLB shootdown: A necessary evil

- Cost of a simple memory unmap operation (munmap()):
  - 1 page on 16 cores with 2 sockets: up to 8 µs
  - ≈ 70% from TLB shootdown alone
- More expensive with more sockets

[Figure: munmap() latency (µs, 0 to 8) versus number of cores (2 to 16), with the 1-socket and 2-socket ranges marked and the TLB shootdown share highlighted.]

In this talk: Latr

Latr: Lazy Translation Coherence

- Perform asynchronous TLB shootdown
  - Remove remote shootdown from the critical path
  - Take advantage of change in ABI without affecting applications' correctness
- Use shared memory instead of IPI
  - Eliminate send-and-wait delay of IPIs
- Scope:
  - free operations (in this talk)
  - migration operations (see our paper)

⇒ But: How to perform asynchronous shootdown?
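
One possible shape of an asynchronous shootdown for free operations is sketched here. It is not the authors' implementation. It assumes that the initiator records a pending invalidation in shared memory, that every other core applies it at some later regular point, and that the physical frame is reused only after all cores have done so. `LazyState`, `LazyShootdown`, and `sweep` are hypothetical names, and when `sweep` runs is deliberately left open because this part of the deck does not say.

```python
from dataclasses import dataclass, field


@dataclass
class LazyState:
    vpn: int      # virtual page whose translation must be dropped
    pending: set  # CPUs that still have to invalidate locally
    frame: int    # physical frame to free once pending is empty


@dataclass
class Core:
    cpu: int
    tlb: set = field(default_factory=set)  # cached virtual page numbers


class LazyShootdown:
    def __init__(self, cores):
        self.cores = {c.cpu: c for c in cores}
        self.states = []       # stands in for a shared-memory state area
        self.free_frames = []  # frames that are safe to reuse

    def munmap(self, initiator_cpu, running_on, page_table, vpn):
        """Unmap vpn without sending IPIs and without waiting for other cores."""
        frame = page_table.pop(vpn)
        self.cores[initiator_cpu].tlb.discard(vpn)  # local shootdown only
        pending = set(running_on) - {initiator_cpu}
        if pending:
            self.states.append(LazyState(vpn, pending, frame))
        else:
            self.free_frames.append(frame)

    def sweep(self, cpu):
        """Assumed hook: each core calls this later, off the critical path."""
        remaining = []
        for state in self.states:
            if cpu in state.pending:
                self.cores[cpu].tlb.discard(state.vpn)
                state.pending.discard(cpu)
            if state.pending:
                remaining.append(state)
            else:
                self.free_frames.append(state.frame)  # all cores are clean
        self.states = remaining
```

In this model `munmap()` returns immediately, and stale entries remain on the other cores until they sweep. Holding the frame back until the pending set is empty means a stale translation can never point at reused memory, which is one way to read the slide's "without affecting applications' correctness".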
