Relaxed Systems Architecture: Instruction Fetching (Ben Simner)

Relaxed Systems Architecture: Instruction Fetching
Ben Simner, University of Cambridge
In collaboration with Shaked Flur, Christopher Pulte, Alasdair Armstrong, Jean Pichon, Luc Maranget (INRIA Paris) and Peter Sewell 1/41

Motivation: Why?
Want to understand: TLBs, Instruction Caches, Interrupts
Want to prove: Operating Systems, JITs, Hypervisors 2/41

But first... Computers are fast... but terrible! 3/41

Intel (Skylake) die. Source: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client) 4/41

Intel (Skylake) die 5/41

x86: Observable complexity
Dekker's/Peterson's mutual exclusion algorithm (extract)
Thread A: flagA ← 1; while flagB {}; print("A")
Thread B: flagB ← 1; while flagA {}; print("B")
x86 hardware can execute both prints! 6/41

x86: TSO Architecture
Source Code:
Thread A: flagA ← 1; print(flagA)
Thread B: flagB ← 1; print(flagB)
Model (diagram): CPU0 and CPU1 each have a Store Buffer (holding flagA = 1 and flagB = 1 respectively) in front of a shared RAM (flagA = 0, flagB = 0, ...) 7/41

State of the Art
Models:
- Abstract Hardware Operational
- Axiomatic-Style
8/41

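x86-TSO: Operational Semantics
State = Abstracted Machine State:
\[ m : \langle\, M : \mathit{addr} \to \mathit{value};\ B : \mathit{tid} \to (\mathit{addr} \times \mathit{value})\ \mathit{list} \,\rangle \]
Structural Operational Semantics (rule WB):
\[ \frac{m' = \langle\, m \text{ with } B := m.B \oplus (t \mapsto ((x, v) : m.B\ t)) \,\rangle}{m \xrightarrow{\ t \,:\, W\,x = v\ } m'} \quad (\mathrm{WB}) \]
9/41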
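The abstract machine above can be sketched as a small exhaustive explorer. This is a minimal sketch, not the authors' model: the program encoding, the `explore` function, the read-forwarding rule and the flush rule are assumptions added for illustration (the slide only shows the WB rule). It runs the two-thread test `x ← 1; print(y) || y ← 1; print(x)` and collects every pair of printed values.

```python
from collections import deque

# Each thread is a list of operations: ("W", addr, value) or ("R", addr).
# Hypothetical encoding of: x <- 1; print(y)  ||  y <- 1; print(x)
PROGRAM = {
    0: [("W", "x", 1), ("R", "y")],
    1: [("W", "y", 1), ("R", "x")],
}


def replace(tup, i, item):
    """Return a copy of tuple `tup` with position i set to `item`."""
    return tup[:i] + (item,) + tup[i + 1:]


def explore(program):
    """Explore all interleavings; return the set of per-thread read results."""
    tids = sorted(program)
    # State = (memory M, store buffers B, program counters, values read so far).
    init = (
        (("x", 0), ("y", 0)),
        tuple(() for _ in tids),
        tuple(0 for _ in tids),
        tuple(() for _ in tids),
    )
    seen, todo, finals = {init}, deque([init]), set()
    while todo:
        mem, bufs, pcs, reads = todo.popleft()
        succs = []
        for i, t in enumerate(tids):
            if pcs[i] < len(program[t]):
                op = program[t][pcs[i]]
                new_pcs = replace(pcs, i, pcs[i] + 1)
                if op[0] == "W":
                    # Rule WB: push (x, v) onto the head of thread t's buffer.
                    new_buf = ((op[1], op[2]),) + bufs[i]
                    succs.append((mem, replace(bufs, i, new_buf), new_pcs, reads))
                else:
                    # Assumed read rule: newest own buffered write, else memory.
                    hits = [v for (a, v) in bufs[i] if a == op[1]]
                    value = hits[0] if hits else dict(mem)[op[1]]
                    new_reads = replace(reads, i, reads[i] + (value,))
                    succs.append((mem, bufs, new_pcs, new_reads))
            if bufs[i]:
                # Assumed flush rule: the oldest buffered write reaches memory.
                addr, value = bufs[i][-1]
                new_mem = tuple(sorted({**dict(mem), addr: value}.items()))
                succs.append((new_mem, replace(bufs, i, bufs[i][:-1]), pcs, reads))
        if not succs:
            finals.add(reads)  # all threads finished and all buffers drained
        for s in succs:
            if s not in seen:
                seen.add(s)
                todo.append(s)
    return finals


outcomes = explore(PROGRAM)
print(sorted(outcomes))
```

The result set contains `((0,), (0,))`, the outcome in which both threads read 0. A sequentially consistent machine cannot produce it; here it appears because both writes can sit in their store buffers while the reads go to memory.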

x86-TSO: Axiomatic-Style
Source Code: x ← 1; print(y) || y ← 1; print(x)
Potential Execution #1: W x=1, W y=1, R y=0, R x=1
Potential Execution #2: W x=1, W y=1, R y=1, R x=0
... 10/41

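A Candidate Execution
Pre-execution = Set of Events + Induced Binary Relations (po/data/addr)
Candidate = Pre-execution + Existentially Quantified Relations (co/rf)
Definition of a valid Candidate ("Axiomatic Model"):
\[ \begin{aligned}
\mathit{poWR} &= \mathit{po} \cap (W \times R) \\
\mathit{uniproc} &= \mathit{po\text{-}loc} \cup (\mathit{po} \setminus \mathit{poWR}) \\
\mathit{fr} &= \mathit{rf}^{-1} ; \mathit{co} \\
\mathit{tso} &= \mathit{rf} \cup \mathit{fr} \cup \mathit{co} \\
\text{axiom:}\ & \mathrm{acyclic}(\mathit{uniproc} \cup \mathit{tso})
\end{aligned} \]
Allowed Execution (diagram): events W x=1, W y=1, R y=0, R x=1, connected by po and rf edges.
po = Program-Order, rf = Reads-From 11/41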

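TSO: Forbidden Execution
Axiomatic Model:
\[ \begin{aligned}
\mathit{poWR} &= \mathit{po} \cap (W \times R) \\
\mathit{uniproc} &= \mathit{po\text{-}loc} \cup (\mathit{po} \setminus \mathit{poWR}) \\
\mathit{fr} &= \mathit{rf}^{-1} ; \mathit{co} \\
\mathit{tso} &= \mathit{rf} \cup \mathit{fr} \cup \mathit{co} \\
\text{axiom:}\ & \mathrm{acyclic}(\mathit{uniproc} \cup \mathit{tso})
\end{aligned} \]
Forbidden Execution (diagram): events W x=1, W y=1, R y=1, R x=0, connected by po, rf and fr edges.
po = Program-Order, rf = Reads-From, fr = From-Reads 12/41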

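TSO: Allowed Execution
Axiomatic Model:
\[ \begin{aligned}
\mathit{poWR} &= \mathit{po} \cap (W \times R) \\
\mathit{uniproc} &= \mathit{po\text{-}loc} \cup (\mathit{po} \setminus \mathit{poWR}) \\
\mathit{fr} &= \mathit{rf}^{-1} ; \mathit{co} \\
\mathit{tso} &= \mathit{rf} \cup \mathit{fr} \cup \mathit{co} \\
\text{axiom:}\ & \mathrm{acyclic}(\mathit{uniproc} \cup \mathit{tso})
\end{aligned} \]
Allowed Execution (diagram): events W x=1, W y=1, R y=0, R x=0, connected by po, rf and fr edges.
po = Program-Order, rf = Reads-From, fr = From-Reads 13/41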
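The axiom on these slides can be checked mechanically for a given candidate execution. The sketch below is an illustration under assumptions: the event names, the explicit initial-write events `ix` and `iy`, the thread layout of the two executions, and the helper names `compose`, `inverse`, `acyclic` and `allowed` are not in the slides. Relations are plain sets of pairs.

```python
def compose(r, s):
    """Relational composition r ; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def inverse(r):
    return {(b, a) for (a, b) in r}


def acyclic(edges):
    """Depth-first search for a cycle in the directed graph `edges`."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    state = {}  # node -> "active" while on the DFS stack, "done" afterwards

    def visit(n):
        state[n] = "active"
        for m in graph.get(n, ()):
            if state.get(m) == "active":
                return False  # back edge: cycle found
            if m not in state and not visit(m):
                return False
        state[n] = "done"
        return True

    return all(n in state or visit(n) for n in list(graph))


def allowed(events, po, rf, co):
    """Apply the slide's axiom: acyclic(uniproc | tso)."""
    kind = {e: k for (e, k, _) in events}
    loc = {e: l for (e, _, l) in events}
    po_wr = {(a, b) for (a, b) in po if kind[a] == "W" and kind[b] == "R"}
    po_loc = {(a, b) for (a, b) in po if loc[a] == loc[b]}
    uniproc = po_loc | (po - po_wr)
    fr = compose(inverse(rf), co)
    tso = rf | fr | co
    return acyclic(uniproc | tso)


# Assumed initial writes of 0 to x and y; they are not po-related to anything.
INIT = [("ix", "W", "x"), ("iy", "W", "y")]

# Assumed shape of the forbidden execution: W x=1; W y=1 || R y=1; R x=0
mp_events = INIT + [("a", "W", "x"), ("b", "W", "y"), ("c", "R", "y"), ("d", "R", "x")]
mp = allowed(
    mp_events,
    po={("a", "b"), ("c", "d")},
    rf={("b", "c"), ("ix", "d")},
    co={("ix", "a"), ("iy", "b")},
)

# Assumed shape of the allowed execution: W x=1; R y=0 || W y=1; R x=0
sb_events = INIT + [("a", "W", "x"), ("b", "R", "y"), ("c", "W", "y"), ("d", "R", "x")]
sb = allowed(
    sb_events,
    po={("a", "b"), ("c", "d")},
    rf={("iy", "b"), ("ix", "d")},
    co={("ix", "a"), ("iy", "c")},
)

print("forbidden-shape execution allowed?", mp)
print("allowed-shape execution allowed?", sb)
```

In the first execution the cycle runs a → b (po), b → c (rf), c → d (po), d → a (fr). In the second, both po edges are write-to-read pairs on different locations, so removing poWR leaves nothing in uniproc and no cycle remains.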

"user-mode" concurrency
Much work not covered here:
- Fences
- Atomics
- Mixed-size
- Multi-copy atomicity
- Other Architectures: IBM Power, Arm, RISC-V
14/41

Systems Architecture Semantics Exceptions and Interrupts Instruction Fetch ESOP2020 with Ohad Kammar Pagetables and TLBs Devices and NVME Future Work . . . 15/41

JITs: Just-In-Time Compilation
Source Code: CALL f; ...; CALL g; ...; CALL f; ...
Jump Table: Jump 0x1000; Jump 0x2000; ...
Compiled Code: f: ...; g: ...
Optimized code now unsound, have to re-compile! 16/41

JITs. JIT: de-opt after executing g
Source Code: CALL f; ...; CALL g; ...; CALL f; ... (with a PC marker)
Jump Table: Jump 0x1000; Jump 0x2000; ...
Compiled Code: f: ...; g: ...
Optimized code now unsound, have to re-compile! 17/41

JITs. JIT: re-compile
Source Code: CALL f; ...; CALL g; ...; CALL f; ... (with a PC marker)
Jump Table: Jump 0x1000; Jump 0x2000; Jump 0x3000; ...
Compiled Code: f: ...; g: ...; f: ... (re-compiled)
Optimized code now unsound, have to re-compile! 18/41
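The jump-table indirection on these JIT slides can be sketched in a few lines. Everything here is hypothetical: the `code` and `jump_table` dictionaries, the `call` and `recompile` helpers, and the assumption that the re-compiled `f` lives at 0x3000. The sketch models only the indirection, not how a real processor fetches the updated jump.

```python
# Hypothetical code store: address -> compiled function body.
code = {
    0x1000: lambda: "optimized f",
    0x2000: lambda: "g",
}

# Hypothetical jump table: one entry per source-level function.
jump_table = {"f": 0x1000, "g": 0x2000}


def call(name):
    """CALL name: go through the jump table, then run the code it points at."""
    return code[jump_table[name]]()


def recompile(name, new_addr, new_body):
    """Place freshly compiled code at new_addr and retarget the table entry."""
    code[new_addr] = new_body
    jump_table[name] = new_addr


before = call("f")
recompile("f", 0x3000, lambda: "re-compiled f")
after = call("f")
print(before, "->", after)
```

In the slides the table entries are themselves jump instructions, so retargeting an entry is a write to code. That is why the rest of the talk asks when another fetch is guaranteed to see such a write.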

ARMv8: How to safely modify code? 19/41

RISC-V/x86/Power: How to?
- Similar for IBM Power
- Much easier on x86
- RISC-V not decided yet...
Focus on ARMv8-A for rest of talk... 20/41

An Instruction Fetching Test
Overwrite code of function f. Then, call f.
Source Code: Write f = "print(2)"; CALL f; ...
Memory: f: print(1); RETURN; ... 21/41

Real A64 Assembly
Initial state: 0:W0="B l1", 0:X1=f
Thread 0:
    STR W0,[X1]
    BL f
f:
    f:  B l0
    l1: MOV X0,#2
        RET
    l0: MOV X0,#1
        RET
Allowed: 0:X0=1
Relaxed result observed in ~99% of experimental runs on multiple devices. 22/41

An Architectural Model!
Source Code: Write f = "print(2)"; CALL f; ...; f: print(1); RETURN; ...
Model components (diagram):
- Thread: issues fetch requests, writes data and reads data
- Fetch Queue (per-thread, new): holds fetched instructions until decode
- Abstract icache: writes are added to the icache; accounts for prefetching and stale instructions
- Abstract global dcache: accounts for data buffering
- Memory
23/41
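The effect of the abstract icache on the instruction-fetching test can be shown with a toy model. This is a deliberately simplified sketch and not the model from the talk: it has one thread, no fetch queue and no dcache, and the `run_test` function and its two flags are hypothetical. The `invalidate_icache` flag stands in for a cache-maintenance sequence, which the slides do not spell out.

```python
def run_test(prefetch_before_write, invalidate_icache=False):
    """Run: write f := "B l1"; call f. Return the final value of X0."""
    memory = {"f": "B l0"}  # initial code at f branches to l0
    icache = {}             # abstract instruction cache for this thread

    if prefetch_before_write:
        # Prefetching: the icache may fill from memory at any earlier time.
        icache["f"] = memory["f"]

    # STR W0,[X1]: the write reaches memory but does not update the icache.
    memory["f"] = "B l1"

    if invalidate_icache:
        # Assumed stand-in for cache maintenance: drop the cached line.
        icache.pop("f", None)

    # BL f: fetch from the icache on a hit, otherwise from memory.
    instr = icache.get("f", memory["f"])

    # l0: MOV X0,#1 ; l1: MOV X0,#2
    return {"B l0": 1, "B l1": 2}[instr]


relaxed = {run_test(p) for p in (False, True)}
maintained = {run_test(p, invalidate_icache=True) for p in (False, True)}
print("without maintenance:", sorted(relaxed))
print("with invalidation:", sorted(maintained))
```

Without invalidation both X0=1 (stale fetch) and X0=2 are reachable, matching the relaxed result in the A64 test. With the cached line dropped before the fetch, only X0=2 remains in this toy model.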

Unexpected Coherence!
Thread A: f = "print(2)"; ...
Thread B: CALL f; print(f); ...
Memory: f: print(1); RETURN; ...
If f executes print(2), then print(f) must print the updated memory (2). 24/41

Real A64 Assembly
Initial state: 0:W0="B l1", 0:X1=f, 1:X2=f
Thread 0:
    STR W0,[X1]
Thread 1:
    BL f
    LDR X1,[X2]
f:
    f:  B l0
    l1: MOV X0,#2
        RET
    l0: MOV X0,#1
        RET
Forbidden: 1:X0=2, 1:X1="B l0" 25/41

Other Phenomena
Not Mentioned Here:
- (In)coherence
- Multiple images in I-cache
- Multiple images in D-cache(s)
- Direct Data Intervention
- Speculating cache maintenance
- O/S Migration
- and others...
26/41

Operational Model
Components (diagram):
- Thread: issues fetch requests, writes data and reads data
- Fetch Queue (per-thread, new): holds fetched instructions until decode
- Abstract icache: writes are added to the icache
- Abstract global dcache
- Memory
27/41

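Operational State
\[ m : \langle\, \mathit{ts} : \mathit{tid} \mapsto \mathit{instruction\_tree};\ \mathit{ss} : \mathit{storage\_subsystem} \,\rangle \]
\[ \mathit{storage\_subsystem} : \langle\, \mathit{mem} : \mathit{write}\ \mathit{list};\ \mathit{icache} : \mathit{tid} \mapsto \mathit{write}\ \mathit{set};\ \mathit{dcache} : \mathit{write}\ \mathit{list};\ \dots \,\rangle \]
28/41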

Thread State (diagram): Explicit Speculation; Sequential ISA Spec 29/41

Operational: Transitions
Transitions:
- Step ISA Spec
- Memory Read/Write
- ...
- Fetch Request
- Fetch Instruction (from icache)
- Decode Instruction
- ...
- Update Instruction Cache
- Flow Writes into Memory
- Reset Instruction
New! (marked on the fetch and decode transitions in the slide)
*exact names may vary 30/41

Operational Rule (prose)
Flow Writes into Memory
An instruction i in the state Perform_DC(address, state_cont) can complete if all po-previous DMB ISH and DSB ISH instructions have finished.
Action:
1. For the most recent writes ws which are in the same data cache line of minimum size in the abstract data cache as address, update the memory with ws;
2. Remove all those writes from the abstract data cache;
3. Set the state of i to Plain(state_cont).
31/41

Operational Rule (lem)
    let flat_propagate_dc params state _cmr addr =
      (* remove all to that cacheline from buffer *)
      let (overlapping, fetch_buf) =
        List.partition
          (write_overlaps_with_addr (cache_line_fp addr))
          state.flat_ss_fetch_buf in
      (* flow the overlapping writes into memory *)
      List.foldr
        (fun write state -> flat_write_to_memory params state write)
        (<| state with flat_ss_fetch_buf = fetch_buf |>)
        overlapping
32/41
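The "Flow Writes into Memory" rule can be mirrored in Python to make its partition-then-apply structure concrete. This is a rough analogue and not a translation of the Lem: the `Write` and `Storage` classes, the 64-byte `CACHE_LINE`, the oldest-first buffer order and the `propagate_dc` name are all assumptions, and the thread-state update (step 3 of the prose rule) is omitted.

```python
from dataclasses import dataclass, field

CACHE_LINE = 64  # assumed cache line size in bytes


@dataclass(frozen=True)
class Write:
    addr: int
    value: int


@dataclass
class Storage:
    memory: dict = field(default_factory=dict)  # addr -> value
    buffer: list = field(default_factory=list)  # buffered writes, oldest first


def same_line(a, b):
    """True if addresses a and b fall in the same (assumed) cache line."""
    return a // CACHE_LINE == b // CACHE_LINE


def propagate_dc(storage, addr):
    """Flow every buffered write in addr's cache line into memory."""
    # Partition the buffer into writes on this line and the rest.
    overlapping = [w for w in storage.buffer if same_line(w.addr, addr)]
    storage.buffer = [w for w in storage.buffer if not same_line(w.addr, addr)]
    # Apply oldest first, so the most recent write to each address wins.
    for w in overlapping:
        storage.memory[w.addr] = w.value
    return storage


ss = Storage(buffer=[Write(0, 1), Write(8, 2), Write(128, 3), Write(0, 4)])
propagate_dc(ss, 16)
print(ss.memory, ss.buffer)
```

After the call, the writes to addresses 0 and 8 have reached memory, with the later write to 0 winning. The write to address 128 is on a different line and stays buffered.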

RMEM https://www.cl.cam.ac.uk/~pes20/rmem/ 33/41
