
Relaxed Systems Architecture: Instruction Fetching (Ben Simner)



  1. Relaxed Systems Architecture: Instruction Fetching. Ben Simner, University of Cambridge. In collaboration with Shaked Flur, Christopher Pulte, Alasdair Armstrong, Jean Pichon, Luc Maranget (INRIA Paris), and Peter Sewell.

  2. Motivation: Why? Want to understand: TLBs, instruction caches, interrupts. Want to prove: operating systems, JITs, hypervisors.

  3. But first... Computers are fast... but terrible!

  4. Intel (Skylake) die. Source: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)


  9. x86: Observable complexity. Dekker's/Peterson's mutual exclusion algorithm (extract). Thread A: flagA ← 1; while flagB {}; print("A"). Thread B: flagB ← 1; while flagA {}; print("B"). x86 hardware can execute both prints!

  10. x86: TSO Architecture. Source code: Thread A: flagA ← 1; print(flagA). Thread B: flagB ← 1; print(flagB). Model: [Figure: CPU0 and CPU1 each have a store buffer in front of RAM. CPU0's store buffer holds flagA = 1 and CPU1's holds flagB = 1, while RAM still holds flagA = 0 and flagB = 0.]
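
The store-buffer picture can be sketched as a small exhaustive search. This is a toy machine written for illustration, not the x86-TSO model itself: each thread writes its own flag and then reads the other thread's flag (the core of the Dekker extract), writes go into a per-thread FIFO buffer, and a buffered write may drain to memory at any time. The names `THREAD_A`, `THREAD_B`, and `tso_outcomes` are hypothetical.

```python
# Each thread is a list of ("write", addr, value) or ("read", addr) steps.
THREAD_A = [("write", "flagA", 1), ("read", "flagB")]
THREAD_B = [("write", "flagB", 1), ("read", "flagA")]

def tso_outcomes(threads):
    """Return every tuple of per-thread read results reachable on the toy machine."""
    n = len(threads)
    start = (
        (0,) * n,      # program counter per thread
        ((),) * n,     # FIFO store buffer per thread, oldest entry first
        frozenset(),   # memory as (addr, value) pairs; a missing addr reads as 0
        ((),) * n,     # values read so far, per thread
    )
    seen, stack, finals = {start}, [start], set()
    while stack:
        pcs, bufs, mem, reads = stack.pop()
        memory = dict(mem)
        successors = []
        for t in range(n):
            # Transition 1: thread t executes its next instruction.
            if pcs[t] < len(threads[t]):
                op = threads[t][pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if op[0] == "write":
                    # Writes are buffered, not sent to memory directly.
                    new_buf = bufs[t] + ((op[1], op[2]),)
                    new_bufs = bufs[:t] + (new_buf,) + bufs[t + 1:]
                    successors.append((new_pcs, new_bufs, mem, reads))
                else:
                    # Reads see the newest matching write in the thread's own
                    # buffer, and fall back to memory otherwise.
                    pending = [v for (a, v) in bufs[t] if a == op[1]]
                    value = pending[-1] if pending else memory.get(op[1], 0)
                    new_reads = reads[:t] + (reads[t] + (value,),) + reads[t + 1:]
                    successors.append((new_pcs, bufs, mem, new_reads))
            # Transition 2: the oldest buffered write of thread t drains to memory.
            if bufs[t]:
                addr, value = bufs[t][0]
                new_memory = dict(memory)
                new_memory[addr] = value
                new_bufs = bufs[:t] + (bufs[t][1:],) + bufs[t + 1:]
                successors.append((pcs, new_bufs, frozenset(new_memory.items()), reads))
        if not successors:
            finals.add(reads)  # all threads finished and all buffers drained
        for state in successors:
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return finals

outcomes = tso_outcomes([THREAD_A, THREAD_B])
```

In this sketch `outcomes` contains `((0,), (0,))`: both threads can read the other flag as 0 while their own write is still buffered, which is the situation in which both loops exit and both prints run.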

  11. State of the Art. Models: abstract-hardware operational; axiomatic-style.

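  12. x86-TSO: Operational Semantics. State = abstracted machine state: $m : \langle M : \mathit{addr} \to \mathit{value};\ B : \mathit{tid} \to (\mathit{addr} \times \mathit{value})\ \mathit{list} \rangle$. Structural operational semantics, rule WB: $$\frac{m' = \langle m \ \text{with}\ B := m.B \oplus (t \mapsto ((x, v) : m.B\ t)) \rangle}{m \xrightarrow{\ t \,:\, W\,x = v\ } m'}$$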

  13. x86-TSO: Axiomatic-Style. Source code: Thread 0: x ← 1; print(y). Thread 1: y ← 1; print(x). Potential execution #1: W x=1, R y=0, W y=1, R x=1. Potential execution #2: W x=1, R y=1, W y=1, R x=0. ...

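  14. A Candidate Execution. Pre-execution = set of events + induced binary relations (po/data/addr). Candidate = pre-execution + existentially quantified relations (co/rf). Definition of a valid candidate ("axiomatic model"): $\mathit{poWR} = \mathit{po} \cap (W \times R)$, $\mathit{uniproc} = \text{po-loc} \cup (\mathit{po} \setminus \mathit{poWR})$, $\mathit{fr} = \mathit{rf}^{-1} ; \mathit{co}$, $\mathit{tso} = \mathit{rf} \cup \mathit{fr} \cup \mathit{co}$, with the axiom $\mathrm{acyclic}(\mathit{uniproc} \cup \mathit{tso})$. Allowed execution: [Figure: events W y=1, W x=1, R y=0, R x=1, connected by po and rf edges.] Legend: po = Program-Order, rf = Reads-From.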

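  15. TSO: Forbidden Execution. Axiomatic model: $\mathit{poWR} = \mathit{po} \cap (W \times R)$, $\mathit{uniproc} = \text{po-loc} \cup (\mathit{po} \setminus \mathit{poWR})$, $\mathit{fr} = \mathit{rf}^{-1} ; \mathit{co}$, $\mathit{tso} = \mathit{rf} \cup \mathit{fr} \cup \mathit{co}$, with the axiom $\mathrm{acyclic}(\mathit{uniproc} \cup \mathit{tso})$. Forbidden execution: [Figure: events W x=1, W y=1, R y=1, R x=0, connected by po, rf, and fr edges.] Legend: po = Program-Order, rf = Reads-From, fr = From-Reads.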

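  16. TSO: Allowed Execution. Axiomatic model: $\mathit{poWR} = \mathit{po} \cap (W \times R)$, $\mathit{uniproc} = \text{po-loc} \cup (\mathit{po} \setminus \mathit{poWR})$, $\mathit{fr} = \mathit{rf}^{-1} ; \mathit{co}$, $\mathit{tso} = \mathit{rf} \cup \mathit{fr} \cup \mathit{co}$, with the axiom $\mathrm{acyclic}(\mathit{uniproc} \cup \mathit{tso})$. Allowed execution: [Figure: events W y=1, W x=1, R y=0, R x=0, connected by po, rf, and fr edges.] Legend: po = Program-Order, rf = Reads-From, fr = From-Reads.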
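
The axiomatic model on these slides is small enough to check mechanically. The following is a minimal sketch, assuming relations are represented as Python sets of event pairs and that initial values are modelled as explicit initial write events `ix` and `iy` (an assumption of this sketch, not something the slides state). The event names, the helper functions, and the assignment of events to threads are illustrative readings of the two execution diagrams.

```python
def compose(r, s):
    """Relational composition r ; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}

def acyclic(rel):
    """True if the relation has no cycle, checked via its transitive closure."""
    closure = set(rel)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return all(a != b for (a, b) in closure)
        closure |= extra

def tso_allows(events, po, rf, co):
    """Evaluate the slide's axiom: acyclic(uniproc | tso)."""
    writes = {e for e, (kind, _) in events.items() if kind == "W"}
    reads = {e for e, (kind, _) in events.items() if kind == "R"}
    po_wr = {(a, b) for (a, b) in po if a in writes and b in reads}
    po_loc = {(a, b) for (a, b) in po if events[a][1] == events[b][1]}
    uniproc = po_loc | (po - po_wr)
    fr = compose({(r, w) for (w, r) in rf}, co)  # fr = rf^-1 ; co
    tso = rf | fr | co
    return acyclic(uniproc | tso)

# Events: name -> (kind, location). "ix" and "iy" are the assumed initial writes.
EVENTS = {
    "ix": ("W", "x"), "iy": ("W", "y"),
    "Wx": ("W", "x"), "Wy": ("W", "y"),
    "Rx": ("R", "x"), "Ry": ("R", "y"),
}
CO = {("ix", "Wx"), ("iy", "Wy")}  # each initial write is coherence-before the later write

# Read as: Thread 0 = W x=1; R y=0 and Thread 1 = W y=1; R x=0.
allowed_case = tso_allows(
    EVENTS,
    po={("Wx", "Ry"), ("Wy", "Rx")},
    rf={("iy", "Ry"), ("ix", "Rx")},
    co=CO,
)

# Read as: Thread 0 = W x=1; W y=1 and Thread 1 = R y=1; R x=0.
forbidden_case = tso_allows(
    EVENTS,
    po={("Wx", "Wy"), ("Ry", "Rx")},
    rf={("Wy", "Ry"), ("ix", "Rx")},
    co=CO,
)
```

Under these assumptions `allowed_case` is `True` and `forbidden_case` is `False`: in the second execution the edges Wx → Wy (po), Wy → Ry (rf), Ry → Rx (po), and Rx → Wx (fr) form a cycle.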

  17. "User-mode" concurrency. Much work not covered here: fences; atomics; mixed-size; multi-copy atomicity; other architectures (IBM Power, Arm, RISC-V).

  18. Systems Architecture Semantics: Exceptions and Interrupts, Instruction Fetch, ESOP2020, with Ohad Kammar, Pagetables and TLBs, Devices and NVME, Future Work...

  19. JITs: Just-In-Time Compilation. [Figure: source code (CALL f; ...; CALL g; ...; CALL f), a jump table (Jump 0x1000, Jump 0x2000), and compiled code for f and g.] Optimized code now unsound, have to re-compile!

  20. JITs. JIT: de-opt after executing g. [Figure: source code (CALL f; ...; CALL g; ...; CALL f) with the PC marked, a jump table (Jump 0x1000, Jump 0x2000), and compiled code for f and g.] Optimized code now unsound, have to re-compile!

  21. JITs. JIT: re-compile. [Figure: source code (CALL f; ...; CALL g; ...; CALL f) with the PC marked, a jump table (Jump 0x1000, Jump 0x2000, Jump 0x3000), and compiled code for f, g, and a second f.] Optimized code now unsound, have to re-compile!

  22. ARMv8: How to safely modify code?

  23. RISC-V/x86/Power: How to? Similar for IBM Power. Much easier on x86. RISC-V not decided yet... Focus on ARMv8-A for the rest of the talk...

  24. An Instruction Fetching Test. Overwrite the code of function f, then call f. Source code: write f = "print(2)"; CALL f. Memory: f: print(1); RETURN.

  25. Real A64 Assembly. Initial state: 0:W0="B l1", 0:X1=f. Thread 0: STR W0,[X1]; BL f. f: B l0; l1: MOV X0,#2; RET; l0: MOV X0,#1; RET. Allowed: 1:X0=1. Relaxed result, observed in ~99% of experimental runs on multiple devices.

  26. An Architectural Model! [Figure: source code (write f = "print(2)"; CALL f) running on a thread; a per-thread fetch queue with decode, new fetch, and fetch request steps; an abstract icache (add to icache; prefetching; stale instructions); an abstract global dcache (write data, read data; data buffering); and memory holding f: print(1); RETURN.]
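
The effect of the abstract icache can be illustrated with a deliberately tiny sketch. It assumes a single thread, collapses the abstract dcache and memory into one dictionary, and uses a hypothetical `invalidate_icache` method as a stand-in for cache maintenance; none of these names come from the talk, and the sketch is not the operational model described on the later slides.

```python
class ToyMachine:
    """Toy machine with a data-side memory and a separate, possibly stale icache."""

    def __init__(self, code):
        self.memory = dict(code)  # addr -> instruction, as seen by loads and stores
        self.icache = {}          # addr -> instruction, as seen by instruction fetch

    def prefetch(self, addr):
        """Copy the current memory contents for addr into the icache."""
        self.icache[addr] = self.memory[addr]

    def store(self, addr, instruction):
        """A data write updates memory but does not touch the icache."""
        self.memory[addr] = instruction

    def fetch(self, addr):
        """Fetch from the icache, filling it from memory only on a miss."""
        if addr not in self.icache:
            self.prefetch(addr)
        return self.icache[addr]

    def invalidate_icache(self, addr):
        """Hypothetical stand-in for cache maintenance: drop the cached copy."""
        self.icache.pop(addr, None)


machine = ToyMachine({"f": "B l0"})
machine.prefetch("f")               # the icache picks up the old instruction early
machine.store("f", "B l1")          # the program overwrites f
stale = machine.fetch("f")          # the fetch still sees the old instruction
machine.invalidate_icache("f")
fresh = machine.fetch("f")          # after invalidation the fetch sees the new one
```

Here `stale` is `"B l0"` and `fresh` is `"B l1"`, mirroring the relaxed result of the test on the previous slide: a write to code followed by a call can still execute the old instruction.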


  28. Unexpected Coherence! Thread A: f = "print(2)". Thread B: CALL f; print(f). Memory: f: print(1); RETURN. If f executes print(2), then print(f) must print the updated memory (2).

  29. Real A64 Assembly. Initial state: 0:W0="B l1", 0:X1=f, 1:X2=f. Thread 0: STR W0,[X1]. Thread 1: BL f; LDR X1,[X2]. f: B l0; l1: MOV X0,#2; RET; l0: MOV X0,#1; RET. Forbidden: 1:X0=2, 1:X1="B l0".

  30. Other Phenomena Not Mentioned Here: (in)coherence; multiple images in I-cache; multiple images in D-cache(s); direct data intervention; speculating cache maintenance; O/S migration; and others...

  31. Operational Model. [Figure: a thread with a per-thread fetch queue (decode, new fetch, fetch request); an abstract icache (add to icache); an abstract global dcache (write data, read data); and memory.]

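  32. Operational State. $m : \langle \mathit{ts} : \mathit{tid} \mapsto \mathit{instruction\_tree};\ \mathit{ss} : \mathit{storage\_subsystem} \rangle$, where $\mathit{storage\_subsystem} : \langle \mathit{mem} : \mathit{write}\ \mathit{list};\ \mathit{icache} : \mathit{tid} \mapsto \mathit{write}\ \mathit{set};\ \mathit{dcache} : \mathit{write}\ \mathit{list} \rangle$ ...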

  33. Thread State. [Figure: thread state, with parts labelled "Explicit Speculation" and "Sequential ISA Spec".]

  36. Operational: Transitions. Step ISA Spec; Memory Read/Write; ...; Fetch Request; Fetch Instruction (from icache); Decode Instruction ("New!"); ...; Update Instruction Cache; Flow Writes into Memory; Reset Instruction. (*Exact names may vary.)

  37. Operational Rule (prose): Flow Writes into Memory. An instruction i in the state Perform_DC (address, state_cont) can complete if all po-previous DMB ISH and DSB ISH instructions have finished. Action:
      1. For the most recent writes ws which are in the same data cache line of minimum size in the abstract data cache as address, update the memory with ws.
      2. Remove all those writes from the abstract data cache.
      3. Set the state of i to Plain (state_cont).

  38. Operational Rule (lem)

      let flat_propagate_dc params state _cmr addr =
        (* remove all to that cacheline from buffer *)
        let (overlapping, fetch_buf) =
          List.partition
            (write_overlaps_with_addr (cache_line_fp addr))
            state.flat_ss_fetch_buf in
        (* flow the overlapping writes into memory *)
        List.foldr
          (fun write state -> flat_write_to_memory params state write)
          (<| state with flat_ss_fetch_buf = fetch_buf |>)
          overlapping
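
The "Flow Writes into Memory" action can also be sketched outside Lem. The following is a rough Python rendering of steps 1 and 2 only, assuming writes are `(address, value)` pairs held oldest-first and a hypothetical 64-byte cache line. It does not model the precondition on po-previous barriers or the change of instruction state to Plain, and the names `CACHE_LINE_BYTES`, `same_cache_line`, and `flow_writes_into_memory` are illustrative rather than taken from RMEM.

```python
CACHE_LINE_BYTES = 64  # assumed line size, for illustration only

def same_cache_line(a, b):
    """True if byte addresses a and b fall in the same cache line."""
    return a // CACHE_LINE_BYTES == b // CACHE_LINE_BYTES

def flow_writes_into_memory(memory, dcache, address):
    """Flow the buffered writes for address's cache line into memory.

    memory: dict mapping address -> value.
    dcache: list of (address, value) writes, oldest first.
    Returns (new_memory, new_dcache) without mutating the inputs.
    """
    # Partition the abstract data cache into writes in the line and the rest.
    overlapping = [w for w in dcache if same_cache_line(w[0], address)]
    remaining = [w for w in dcache if not same_cache_line(w[0], address)]
    # Apply the overlapping writes oldest-first so the most recent one wins.
    new_memory = dict(memory)
    for addr, value in overlapping:
        new_memory[addr] = value
    return new_memory, remaining

memory = {0: "old"}
dcache = [(0, "a"), (128, "b"), (8, "c"), (0, "d")]
new_memory, new_dcache = flow_writes_into_memory(memory, dcache, address=4)
```

With these inputs, `new_memory` is `{0: "d", 8: "c"}` and `new_dcache` is `[(128, "b")]`: the writes in the line containing address 4 reach memory and leave the abstract data cache, while the write to another line stays buffered.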

  39. RMEM: https://www.cl.cam.ac.uk/~pes20/rmem/

  40. Axiomatic-Style Model

      let iseq = [W];(wco&scl);[DC];
                 (wco&scl);[IC]

      (* Observed-by *)
      let obs = rfe | fr | wco
        | irf | (ifr;iseq)

      (* Fetch-ordered-before *)
      let fob = [IF]; fpo; [IF]
        | [IF]; fe
        | [ISB]; fe^-1; fpo

      (* Dependency-ordered-before *)
      let dob = addr | data
        | ctrl; [W]
        | (ctrl | (addr; po)); [ISB]
        | addr; po; [W]
        | (addr | data); rfi

      (* Atomic-ordered-before *)
      let aob = rmw
        | [range(rmw)]; rfi; [A|Q]

      (* Barrier-ordered-before *)
      let bob = [R|W]; po; [dmb.sy]
        | [dmb.sy]; po; [R|W]
        | [L]; po; [A]
        | [R]; po; [dmb.ld]
        | [dmb.ld]; po; [R|W]
        | [A|Q]; po; [R|W]
        | [W]; po; [dmb.st]
        | [dmb.st]; po; [W]
        | [R|W]; po; [L]
        | [R|W|F|DC|IC]; po; [dsb.ish]
        | [dsb.ish]; po; [R|W|F|DC|IC]
        | [dmb.sy]; po; [DC]

      (* Cache-op-ordered-before *)
      let cob = [R|W]; (po&scl); [DC]
        | [DC]; (po&scl); [DC]

      (* Ordered-before *)
      let ob = obs|fob|dob|aob|bob|cob

      (* Internal visibility requirement *)
      acyclic (po-loc|fr|co|rf) as internal

      (* External visibility requirement *)
      acyclic ob as external

      (* Atomic *)
      empty rmw & (fre; coe) as atomic

      (* Constrained unpredictable *)
      let cff = ([W];loc;[IF]) \ ob+^-1 \ (co;iseq;ob+)
      cff_bad cff ≡ CU
