Verification with the Check suite Yatin Manerkar Princeton - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Yatin Manerkar

Automated Full-Stack Memory Model Verification with the Check suite

http://check.cs.princeton.edu/

Princeton University ARM Cambridge, July 20th, 2018

slide-2
SLIDE 2

What are Memory (Consistency) Models?

[Figure: stack diagram. Languages and IRs (Java Bytecode, C11/C++11, CUDA, OpenCL, JVM, LLVM IR, PTX, SPIR) sit above hardware (x86 CPU, ARM CPU, Power CPU, Nvidia GPU, AMD GPU, …), all over Shared Virtual Memory.]

Memory Consistency Models (MCMs) specify rules and guarantees about the ordering and visibility of accesses to shared memory [Sorin et al., 2011].

slide-3
SLIDE 3

What are Memory (Consistency) Models?

HLL MCMs

slide-4
SLIDE 4

What are Memory (Consistency) Models?

ISA-level MCMs

slide-5
SLIDE 5

Sequential Consistency (SC) - Interleaving Model

▪Defined by [Lamport 1979]; execution is the same as if:

  (R1) Memory ops of each processor appear in program order
  (R2) Memory ops of all processors were executed in some total order (a load reads the value of the last store to its address in the total order)

Program (mp litmus test; all addrs initially 0)

  Core 0    Core 1
  x=1       r1=y
  y=1       r2=x

Legal Executions

  x=1   y=1   r1=y  r2=x
  x=1   r1=y  y=1   r2=x
  x=1   r1=y  r2=x  y=1
  r1=y  r2=x  x=1   y=1
  r1=y  x=1   r2=x  y=1
  r1=y  x=1   y=1   r2=x

Outcomes of the legal executions: r1=1 r2=1; r1=0 r2=1; r1=0 r2=0
Illegal Outcome: r1=1 r2=0
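The interleaving definition above can be checked mechanically. The following is a minimal sketch, not part of the original slides: it enumerates every merge of the two cores' instruction lists that respects program order (R1), runs each as a single total order (R2), and collects the resulting register values. The tuple encoding of instructions and the function names are assumptions made for illustration.

```python
from itertools import combinations

# The mp litmus test from the slide; all addresses start at 0.
# Instruction encoding (hypothetical): (kind, address, value-or-register).
CORE0 = [("store", "x", 1), ("store", "y", 1)]
CORE1 = [("load", "y", "r1"), ("load", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order (R1)."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def run(total_order):
    """Execute one total order (R2): a load reads the last store to its address."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, addr, arg in total_order:
        if kind == "store":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["r1"], regs["r2"]


def sc_outcomes():
    """Return the set of (r1, r2) pairs reachable under SC."""
    return {run(order) for order in interleavings(CORE0, CORE1)}


if __name__ == "__main__":
    for r1, r2 in sorted(sc_outcomes()):
        print(f"r1={r1} r2={r2}")
```

The pair (1, 0) never appears in the result, which matches the slide's illegal outcome: seeing y=1 means x=1 was already performed, and the load of x comes later in Core 1's program order.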
slide-10
SLIDE 10

Hardware Implements Weak Memory Models

▪Most processors don’t implement SC

  • x86: Total Store Order (TSO): Relaxes Write->Read ordering
  • ARMv8 and Power relax more orderings

▪Compilation to weak memory ISAs must maintain ordering guarantees

  • [Owens et al. TPHOLS 2009], [Batty et al. POPL 2011, POPL 2012], [Wickerson et al. OOPSLA 2015], …

C11 Source Code

  atomic<int> x = 0;
  atomic<int> y = 0;

  Thread 0    Thread 1
  x = 1;      r1 = y;
  y = 1;      r2 = x;

  C11 Forbids: r1 = 1, r2 = 0

Compile

ARMv8 Assembly Language

  Initially, [x] = [y] = 0

  Core 0         Core 1
  stl #1, [x]    lda r1, [y]
  stl #1, [y]    lda r2, [x]

  ARMv8 forbids: r1 = 1, r2 = 0

Is the ARMv8 hardware correctly implementing the ARMv8 MCM?
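To make "TSO relaxes Write->Read ordering" concrete, here is a rough sketch of a store-buffer machine, offered as an illustration rather than as any tool from the talk. Each core queues its stores in a private FIFO buffer that drains to memory at arbitrary times, and a load reads its own youngest buffered store to the address if there is one, otherwise memory. The model has no fences, and the `SB` test (store then load on each core) is a hypothetical addition that does not appear in the slides.

```python
def tso_outcomes(programs):
    """Explore every execution of `programs` on a simple store-buffer model."""
    outcomes = set()
    n = len(programs)

    def explore(pcs, bufs, mem, regs):
        done = True
        for c in range(n):
            # Choice 1: drain core c's oldest buffered store to memory.
            if bufs[c]:
                done = False
                addr, val = bufs[c][0]
                drained = bufs[:c] + (bufs[c][1:],) + bufs[c + 1:]
                explore(pcs, drained, {**mem, addr: val}, regs)
            # Choice 2: execute core c's next instruction.
            if pcs[c] < len(programs[c]):
                done = False
                kind, addr, arg = programs[c][pcs[c]]
                advanced = pcs[:c] + (pcs[c] + 1,) + pcs[c + 1:]
                if kind == "store":
                    queued = bufs[:c] + (bufs[c] + ((addr, arg),),) + bufs[c + 1:]
                    explore(advanced, queued, mem, regs)
                else:
                    # Forward from the youngest matching buffered store, else read memory.
                    own = [v for a, v in bufs[c] if a == addr]
                    explore(advanced, bufs, mem, {**regs, arg: own[-1] if own else mem[addr]})
        if done:
            outcomes.add(tuple(sorted(regs.items())))

    addrs = {addr for prog in programs for _, addr, _ in prog}
    explore((0,) * n, ((),) * n, dict.fromkeys(addrs, 0), {})
    return outcomes


# mp, as on the slides.
MP = ([("store", "x", 1), ("store", "y", 1)],
      [("load", "y", "r1"), ("load", "x", "r2")])

# Hypothetical sb test (not in the slides): each core stores, then loads the other address.
SB = ([("store", "x", 1), ("load", "y", "r1")],
      [("store", "y", 1), ("load", "x", "r2")])
```

On this model mp still cannot produce r1 = 1, r2 = 0, because buffers drain in order and loads execute in order. The sb test can produce r1 = 0, r2 = 0, which no SC interleaving allows, since each load may run while the same core's earlier store is still buffered.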

slide-14
SLIDE 14

MCM Verification is a Full-Stack Problem!

▪Each layer has responsibilities for ensuring correct MCM operation
▪Need MCM checking tools at all layers of the computing stack!

[Figure: computing stack, with a question attached to each layer]

  • High-Level Languages (HLL)
  • Compiler: Is compiler maintaining HLL guarantees? [Batty et al. POPL 2011, POPL 2012] [Wickerson et al. OOPSLA 2015] …
  • OS: Are virtual memory mappings correct?
  • Architecture (ISA): Is the ISA-level MCM formally defined? [Alglave et al. TOPLAS 2014]
  • Microarchitecture: Is hardware incorrectly reordering instructions?
  • Processor RTL: Is RTL correctly implementing microarchitecture?
slide-15
SLIDE 15

Check Suite: Full-Stack Automated MCM Analysis

▪Suite of tools at various levels of computing stack
▪Automated Full-Stack MCM checking across litmus test suites

[Figure: computing stack (High-Level Languages (HLL), Compiler, OS, Architecture (ISA), Microarchitecture, Processor RTL) annotated with the tools below]

  • PipeCheck [Lustig et al. MICRO 2014] & CCICheck [Manerkar et al. MICRO 2015]: Does microarchitecture correctly implement ISA MCM?
  • COATCheck [Lustig et al. ASPLOS 2016]
  • TriCheck [Trippel et al. ASPLOS 2017]: Do HLL, compiler, and microarchitecture work together correctly?
  • RTLCheck [Manerkar et al. MICRO 2017]: Does RTL like Verilog correctly implement microarchitecture?

slide-19
SLIDE 19

Check Suite: Full-Stack Automated MCM Analysis

So far, tools have found bugs in:

  • Widely-used gem5 Research simulator
  • Cache coherence paper (TSO-CC)
  • IBM XL C++ compiler (fixed in v13.1.5)
  • In-design commercial processors
  • RISC-V draft ISA specification
  • Compiler mapping proofs
  • C11 memory model
  • Open-source processor RTL
slide-22
SLIDE 22

Modelling Microarchitecture: Going below the ISA

▪Hardware enforces consistency model using smaller localized orderings

  • In-order fetch/decode/execute…
  • Orderings enforced by memory hierarchy
  • …and many more

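[Figure: two cores, each with a Fetch → Dec. → Exec. → Mem. → WB pipeline, a store buffer (SB), and an L1 cache, sharing an L2; one path is labelled "Lds." and the caches are labelled "Memory Hierarchy"]

Pipeline stages may be FIFO to ensure in-order execution

Do individual orderings correctly work together to satisfy consistency model?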

slide-24
SLIDE 24

Microarchitectural Consistency Checking

  Axiom "Decode_is_FIFO":
    ...
    EdgeExists ((i1, Decode), (i2, Decode)) =>
    AddEdge ((i1, Execute), (i2, Execute)).

  Axiom "PO_Fetch":
    ...
    SameCore i1 i2 /\ ProgramOrder i1 i2 =>
    AddEdge ((i1, Fetch), (i2, Fetch)).

[Figure labels: Microarchitecture in µspec DSL; Litmus Test]

Each axiom specifies an ordering that µarch should respect
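One way to read the two axioms is as rules that add edges to a graph whose nodes are (instruction, stage) pairs. The sketch below is an illustrative Python rendering, not the µspec semantics or the Check tools' implementation. The parts of each axiom elided as "..." on the slide are assumed here to quantify over all ordered pairs of distinct instructions, and the `(core, index)` instruction encoding is hypothetical.

```python
from itertools import permutations

# Hypothetical encoding: an instruction is (core, position in program order),
# and a µhb node is (instruction, stage).


def po_fetch(instrs):
    """PO_Fetch: same core and program order => edge between the Fetch nodes."""
    return {((i1, "Fetch"), (i2, "Fetch"))
            for i1, i2 in permutations(instrs, 2)
            if i1[0] == i2[0] and i1[1] < i2[1]}


def decode_is_fifo(instrs, edges):
    """Decode_is_FIFO: a Decode->Decode edge implies an Execute->Execute edge."""
    return {((i1, "Execute"), (i2, "Execute"))
            for i1, i2 in permutations(instrs, 2)
            if ((i1, "Decode"), (i2, "Decode")) in edges}


def apply_axioms(instrs, edges):
    """Add edges until neither axiom contributes anything new."""
    edges = set(edges)
    while True:
        new = po_fetch(instrs) | decode_is_fifo(instrs, edges)
        if new <= edges:
            return edges
        edges |= new
```

Starting from a single Decode edge between two instructions on one core, `apply_axioms` adds the matching Execute edge and the program-order Fetch edge. A fuller model would need further axioms (for example, one relating Fetch order to Decode order), which the slide does not show.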

slide-27
SLIDE 27

Microarchitectural Consistency Checking

[Figure labels: Microarchitecture in µspec DSL; Litmus Test; Microarchitectural happens-before (µhb) graphs]

  • Microarch. verification checks that combination of axioms satisfies MCM

slide-35
SLIDE 35

PipeCheck: Executions as µhb Graphs [Lustig et al. MICRO 2014]

[Figure: µhb graph for Litmus Test mp. Columns are instructions (i1) and (i2) on Core 0 and (i3) and (i4) on Core 1. Rows are Fetch, Dec., Exec., Mem., WB for all four instructions, plus SB and Mem. Hier. for (i1) and (i2).]
slide-41
SLIDE 41

PipeCheck: Microarchitectural Correctness

▪Cycle in µhb graph => event has to happen before itself (impossible)
▪Cyclic graph → unobservable on µarch
▪Acyclic graph → observable on µarch
▪Exhaustively enumerate and check all possible execs of litmus test on µarch

  • Implemented using fast SMT solvers
  • Compare against ISA-level outcome from herd [Alglave et al. TOPLAS 2014]

[Figure: µhb graph for Litmus Test mp]

  ISA-Level Outcome   Observable (≥ 1 Graph Acyclic)   Not Observable (All Graphs Cyclic)
  Allowed             OK                               OK (stricter than necessary)
  Forbidden           Consistency violation!           OK

Abstracted memory hierarchy prevents verification of complex coherence issues!
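The decision table above reduces to two pieces: a cycle check on each candidate µhb graph, and a comparison with the ISA-level verdict. The sketch below illustrates that logic with a plain depth-first search. It is not PipeCheck's SMT-based search, and the function names and the edge-list graph representation are assumptions.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)
        succ.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    colour = dict.fromkeys(succ, WHITE)

    def visit(node):
        colour[node] = GREY
        for nxt in succ[node]:
            if colour[nxt] == GREY:  # back edge: nxt is on the current path
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in succ)


def verdict(isa_allowed, graphs):
    """Classify one litmus-test outcome using the table on the slide.

    `graphs` holds the candidate µhb graphs for the outcome, each as an edge
    list; the outcome is observable if at least one graph is acyclic.
    """
    observable = any(not has_cycle(g) for g in graphs)
    if observable and not isa_allowed:
        return "Consistency violation!"
    if not observable and isa_allowed:
        return "OK (stricter than necessary)"
    return "OK"
```

For example, a forbidden outcome whose only candidate graph contains the cycle a → b → a is classified as "OK", while the same forbidden outcome with any acyclic candidate graph is reported as a consistency violation.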

slide-43
SLIDE 43

CCICheck: Coherence vs Consistency

▪ Memory hierarchy is a collection of caches

  • Coherence protocols ensure that all caches agree on the value of any variable

▪ CCICheck [Manerkar et al. MICRO 2015] shows that consistency verification often cannot simply treat memory hierarchy abstractly

  • Nominated for Best Paper at MICRO 2015

[Figure: the two-core pipeline diagram (Fetch, Dec., Exec., Mem., WB, SB, L1, shared L2, "Lds." path) with the memory hierarchy labelled "Coherence Protocol (SWMR, DVI, etc.)", beside the computing stack: High-Level Languages (HLL), Compiler, OS, Architecture (ISA), Microarchitecture, Processor RTL]
slide-44
SLIDE 44

Coherence Protocol Example

▪If P1 updates the value of x to 200, the stale value of x in other processors must be invalidated
▪If P3 wants to subsequently read/write x, it must request the new value
▪SWMR = Single-Writer Multiple Readers, DVI = Data Value Invariant

[Figure, built up step by step: processors P1, P2, P3, each with a cache]

  1. All three caches hold x = 100
  2. P1 performs St x = 200
  3. Invalidations go to the other caches, whose copies still hold x = 100
  4. P1's cache holds x = 200
  5. P3 performs Ld x and sends Request Data
  6. Data Response delivers x = 200 to P3
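The example can be mimicked with a toy invalidation-based protocol. This is a deliberately simplified sketch: the `Cache` and `Bus` classes are hypothetical, the protocol is write-through to keep the code short, and real protocols track per-line states such as Modified, Shared, and Invalid rather than the presence of a dictionary key.

```python
class Cache:
    """A private cache; an address missing from `lines` is treated as Invalid."""

    def __init__(self, name):
        self.name = name
        self.lines = {}


class Bus:
    """Toy invalidation-based protocol with a single writer at a time (SWMR)."""

    def __init__(self, caches, memory):
        self.caches = caches
        self.memory = dict(memory)

    def store(self, writer, addr, value):
        # Invalidate every other copy before the write becomes visible.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)
        writer.lines[addr] = value
        self.memory[addr] = value  # write-through, purely to keep the sketch short

    def load(self, reader, addr):
        # On a miss, request the current value (the data response).
        if addr not in reader.lines:
            reader.lines[addr] = self.memory[addr]
        return reader.lines[addr]


def run_example():
    """Replay the slide: P1 stores x = 200, then P3 loads x."""
    p1, p2, p3 = Cache("P1"), Cache("P2"), Cache("P3")
    bus = Bus([p1, p2, p3], {"x": 100})
    for cache in (p1, p2, p3):
        bus.load(cache, "x")          # every cache starts with x = 100
    bus.store(p1, "x", 200)           # invalidates the copies in P2 and P3
    return bus.load(p3, "x"), "x" in p2.lines
```

`run_example()` returns the value P3 reads after the store and whether P2 still holds a copy. In this sketch P3 observes 200 and P2's copy is gone, which matches the behaviour the slide describes.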

slide-53
SLIDE 53

Motivating Example – “Peekaboo” [Sorin et al. Primer 2011]

▪Three optimizations: correct individually, but not in combination

  • 1. Prefetching
  • 2. Invalidation before use
      • Invalidation can arrive before data
      • Acknowledge Inv early rather than wait for data to arrive
      • But repeated inv before use → livelock [Kubiatowicz et al. ASPLOS 1992]
  • 3. Livelock avoidance: allow destination core to perform one operation on data when it arrives, even if already invalidated [Sorin et al. Primer 2011]
      • Does not break coherence
      • Sometimes intentionally returns stale data

slide-54
SLIDE 54

Motivating Example – “Peekaboo”

▪ Consider mp with the livelock-avoidance mechanism:

  Core 0       Core 1
  [x] ← 1      r1 ← [y]
  [y] ← 1      r2 ← [x]

Optimizations:

  • 1. Prefetching
  • 2. Invalidation-before-use
  • 3. Livelock avoidance

[Figure, built up step by step]

  1. Initially, Core 0 holds x: Shared, y: Modified; Core 1 holds x: Invalid, y: Invalid
  2. Core 1 sends Prefetch x; the response Data (x = 0) is sent but has not yet arrived
  3. Inv for x reaches Core 1, which replies with Inv-Ack
  4. Core 0 now holds x: Modified, y: Modified and performs [x] ← 1 and [y] ← 1
  5. Core 1 sends Request y and receives Data (y = 1); y is Shared in both cores, and r1 = 1
  6. Data (x = 0) arrives at Core 1, where x is still Invalid
  7. Livelock avoidance lets r2 ← [x] use the data once, so r2 = 0
  8. Result: r1 = 1, r2 = 0, the outcome shown as illegal for mp under SC
slide-65
SLIDE 65

The Coherence-Consistency Interface (CCI)

▪CCI = guarantees the coherence protocol provides to microarch. + orderings microarch. expects from coherence protocol

[Figure, built up step by step: an equation of the form "coherence + microarch. = result"]

  • Expected coherence (SWMR, DVI, No Stale Data) gives Consistency
  • Coherence that provides SWMR, DVI, No Livelock instead is a CCI Mismatch: Consistency Violation!

slide-73
SLIDE 73

ViCL: Value in Cache Lifetime

▪Need a way to model cache occupancy and coherence events for:

  • Coherence protocol optimizations (e.g. Peekaboo)
  • Partial incoherence and lazy coherence (GPUs, etc.)

▪A ViCL is a 4-tuple:

  (cache_id, address, data_value, generation_id)

▪cache_id and generation_id uniquely identify each cache line
▪A ViCL 4-tuple maps on to the period of time over which the cache line serves the data value for the address
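As a data structure, the 4-tuple could be written down as follows. This is a sketch only: the field types, the `line_key` helper, and the example values are assumptions, not CCICheck's actual representation.

```python
from typing import NamedTuple


class ViCL(NamedTuple):
    """Value in Cache Lifetime: one period during which a cache line serves a value."""

    cache_id: str        # which cache holds the line
    address: str         # the memory address being served
    data_value: int      # the value served during this lifetime
    generation_id: int   # distinguishes successive occupants of the cache

    def line_key(self):
        """cache_id and generation_id together identify the cache line."""
        return (self.cache_id, self.generation_id)


# Hypothetical example: Core 1's L1 holds x = 0, is invalidated,
# and later holds x = 1 as a new generation.
old = ViCL("core1.L1", "x", 0, generation_id=0)
new = ViCL("core1.L1", "x", 1, generation_id=1)
```

Because the generation differs, `old` and `new` are distinct ViCLs even though they share a cache and an address.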

slide-74
SLIDE 74

ViCLs in µhb Graphs

▪ViCLs start at a ViCL Create event and end at a ViCL Expire event

  • Correspond to nodes in µhb graphs
  • Axioms over these nodes and edges enforce coherence and data movement orderings

▪Use pipeline model from PipeCheck, but add ViCL nodes and edges

[Figure: µhb graph for Litmus Test co-mp]
slide-79
SLIDE 79

µhb Graph for the Peekaboo Problem

▪Additional nodes represent ViCL requests and invalidations
▪Solution: Invalidated data only usable if accessing load/store is oldest in program order at time of request [Sorin et al. Primer 2011]
▪TSO-CC protocol [Elver and Nagarajan HPCA 2014] was vulnerable to variant of Peekaboo!

  • Now fixed
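The effect of the solution can be seen by replaying the Peekaboo message order with and without the oldest-in-program-order rule. The script below is a hand-written replay of that one scenario under stated assumptions, not a coherence-protocol simulator and not CCICheck's µhb-graph analysis. The function and variable names are hypothetical.

```python
def replay_peekaboo(require_oldest_at_request):
    """Replay the message order from the Peekaboo example and return (r1, r2)."""
    memory = {"x": 0, "y": 0}  # all addresses initially 0

    # Core 1 prefetches x while its older load of y is still outstanding,
    # so the load of x is not the oldest instruction at request time.
    oldest_at_request = False
    in_flight_x = memory["x"]  # the response carries x = 0 and is delayed

    # The Inv for x overtakes the data; Core 1 sends Inv-Ack early.
    x_invalidated = True

    # Core 0 now holds x and y in Modified and performs both stores.
    memory["x"] = 1
    memory["y"] = 1

    # Core 1 requests y and receives Data (y = 1).
    r1 = memory["y"]

    # The delayed Data (x = 0) arrives. Livelock avoidance permits one use of
    # invalidated data; the fix additionally requires the load to have been
    # oldest in program order when the request was made.
    may_use = (not x_invalidated) or (not require_oldest_at_request) or oldest_at_request
    r2 = in_flight_x if may_use else memory["x"]  # otherwise re-request the current value
    return r1, r2


if __name__ == "__main__":
    print("without fix:", replay_peekaboo(require_oldest_at_request=False))
    print("with fix:   ", replay_peekaboo(require_oldest_at_request=True))
```

Without the rule the replay yields r1 = 1, r2 = 0. With it, the stale prefetch response is discarded, the load of x re-requests the line, and the replay yields r1 = 1, r2 = 1.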
slide-84
SLIDE 84

CCICheck Takeaways

▪Coherence & consistency often closely coupled in implementations
▪In such cases, coherence & consistency cannot be verified separately
▪CCICheck: CCI-aware microarchitectural MCM checking

  • Uses ViCL (Value in Cache Lifetime) abstraction

▪Discovered bug in TSO-CC lazy coherence protocol

slide-88
SLIDE 88

ISA-level MCMs in the Hardware-Software Stack

High-Level Languages (HLLs)
  Which orderings does the compiler need to enforce?
New ISA-level MCM
  Which orderings must be guaranteed by hardware?
Hardware

TriCheck checks that HLL, compiler, ISA, and hardware align on MCM requirements

slide-89
SLIDE 89

TriCheck: Layers of the Stack are Intertwined

[Figure: hardware-software stack with High-Level Languages (HLL), Compiler, Architecture (ISA), Microarchitecture, OS, Processor RTL]

▪ISA-level MCMs should allow microarchitectural optimizations but also be compatible with HLLs
▪TriCheck [Trippel et al. ASPLOS 2017] enables holistic analysis of HLL memory model, ISA-level MCM, compiler mappings, and microarchitectures

  • Mapping: translation of HLL synchronization primitives to one or more assembly language instructions

▪Also useful for checking HLL compiler mappings to ISA-level MCMs
▪Selected as one of 12 “Top Picks of Comp. Arch. Conferences” for 2017

slide-90
SLIDE 90

TriCheck: Comparing HLL to Microarchitecture

HLL to ISA Compiler Mapping

HLL Litmus Test Variants

HLL Model e.g. C11 µspec Microarch. Model Four Primary Inputs

slide-91
SLIDE 91

TriCheck: Comparing HLL to Microarchitecture

HLL to ISA Compiler Mapping

HLL Litmus Test Variants

HLL Model e.g. C11 µspec Microarch. Model Examine all C11 memory_order combinations (release, acquire, relaxed, seq_cst) for HLL litmus tests
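
A rough sketch of what generating litmus test variants could look like, using the mp test as a template. The template encoding and the `is_legal` filter are assumptions made for illustration: C11 does not permit acquire ordering on a plain store or release ordering on a plain load, so this sketch skips those combinations. Whether TriCheck's own generator filters in the same way is not stated on the slide.

```python
from itertools import product

MEMORY_ORDERS = ["relaxed", "acquire", "release", "seq_cst"]

# mp template: one entry per atomic access (thread, kind, address). Illustrative encoding.
MP_TEMPLATE = [
    ("T0", "store", "x"),
    ("T0", "store", "y"),
    ("T1", "load", "y"),
    ("T1", "load", "x"),
]

def is_legal(kind, order):
    """C11 disallows acquire on a store and release on a load."""
    if kind == "store":
        return order != "acquire"
    return order != "release"

def variants(template):
    """Yield every legal assignment of a memory_order to each access in the template."""
    for orders in product(MEMORY_ORDERS, repeat=len(template)):
        if all(is_legal(kind, o) for (_, kind, _), o in zip(template, orders)):
            yield [(t, kind, addr, o) for (t, kind, addr), o in zip(template, orders)]

all_variants = list(variants(MP_TEMPLATE))
```

With three legal orders per access and four accesses, the sketch produces 3^4 = 81 variants of mp.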

slide-92
SLIDE 92

TriCheck: Comparing HLL to Microarchitecture

HLL to ISA Compiler Mapping

HLL Litmus Test Variants ISA-level litmus tests

HLL Model e.g. C11 µspec Microarch. Model Translate HLL Litmus Tests to ISA-level litmus tests

slide-93
SLIDE 93

TriCheck: Comparing HLL to Microarchitecture

HLL to ISA Compiler Mapping HLL Outcome Forbidden/Allowed?

HLL Litmus Test Variants Herd [Alglave et al. TOPLAS 2014] ISA-level litmus tests

HLL Model e.g. C11 µspec Microarch. Model

Use Herd to check HLL outcomes
slide-94
SLIDE 94

TriCheck: Comparing HLL to Microarchitecture

HLL to ISA Compiler Mapping HLL Outcome Forbidden/Allowed?

Microarch. Outcome Observable/Unobservable?

HLL Litmus Test Variants Herd [Alglave et al. TOPLAS 2014] µhb Analysis with Check ISA-level litmus tests

HLL Model e.g. C11 µspec Microarch. Model

Use µhb analysis to check microarch. outcomes
slide-97
SLIDE 97

TriCheck: Comparing HLL to Microarchitecture

HLL to ISA Compiler Mapping HLL Outcome Forbidden/Allowed?

Microarch. Outcome Observable/Unobservable?

HLL Litmus Test Variants Herd [Alglave et al. TOPLAS 2014] µhb Analysis with Check ISA-level litmus tests

HLL Model e.g. C11 µspec Microarch. Model

Compare HLL and microarch. outcomes

Forbidden Observable

BUG!
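
The comparison step can be sketched as a small decision function. Only the "forbidden but observable" case is labelled on the slide; the label used here for the opposite mismatch (allowed by the HLL but never observable on the microarchitecture) is an assumption added for illustration.

```python
def classify(hll_allowed, uarch_observable):
    """Compare the HLL verdict with the microarchitectural verdict for one litmus test."""
    if not hll_allowed and uarch_observable:
        return "BUG"  # HLL forbids the outcome, yet the hardware can produce it
    if hll_allowed and not uarch_observable:
        return "STRICTER THAN NEEDED"  # assumed label: correct, but possibly over-constrained
    return "OK"

# Hypothetical results for three litmus test variants.
results = {
    "variant_a": classify(hll_allowed=False, uarch_observable=True),
    "variant_b": classify(hll_allowed=True, uarch_observable=False),
    "variant_c": classify(hll_allowed=False, uarch_observable=False),
}
```

Any variant classified as `"BUG"` points at a problem in at least one of the inputs: the HLL model, the compiler mapping, the ISA-level MCM, or the microarchitecture model.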

slide-98
SLIDE 98

TriCheck: Comparing HLL to Microarchitecture

HLL to ISA Compiler Mapping HLL Outcome Forbidden/Allowed?

Microarch. Outcome Observable/Unobservable?

HLL Litmus Test Variants Herd [Alglave et al. TOPLAS 2014] µhb Analysis with Check ISA-level litmus tests

HLL Model e.g. C11 µspec Microarch. Model Forbidden Observable

BUG!

If bugs found, iterate by changing the inputs and re-run

slide-99
SLIDE 99

Using TriCheck for ISA MCM Design: RISC-V

▪Ran TriCheck on draft RISC-V ISA MCM with

  • C11 HLL MCM [Batty et al. POPL 2011] [Batty et al. POPL 2016]
  • Compiler mappings based on RISC-V manual
  • Variety of microarchitectures that relaxed various memory orderings

    − All legal according to draft RISC-V spec
    − Ranging from SC microarchitecture to one with reorderings allowed by ARM/Power

▪Draft RISC-V MCM for Base ISA incapable of correctly compiling C11:

  • C11 outcome forbidden, but impossible to forbid on hardware
  • RISC-V fences too weak to restore orderings that implementations could relax
slide-100
SLIDE 100

Current RISC-V Status

▪In response to our findings, RISC-V Memory Model Working Group was formed (we are members)

  • Mandate to create an MCM for RISC-V that satisfies community needs

▪Working Group has developed an MCM proposal that fixes the aforementioned bugs (and other issues)
▪MCM proposal recently passed the 45-day public feedback period!

  • Well on its way to being included in the next version of the RISC-V ISA spec
slide-101
SLIDE 101

TriCheck: Analysing Compiler Mappings

HLL to ISA Compiler Mapping HLL Outcome Forbidden/Allowed?

Microarch. Outcome Observable/Unobservable?

HLL Litmus Test Variants Herd [Alglave et al. TOPLAS 2014] µhb Analysis with Check ISA-level litmus tests

?

HLL Model e.g. C11 µspec Microarch. Model Fix HLL model, microarch model, and ISA-level MCM

slide-102
SLIDE 102

TriCheck: Analysing Compiler Mappings

HLL to ISA Compiler Mapping HLL Outcome Forbidden/Allowed?

Microarch. Outcome Observable/Unobservable?

HLL Litmus Test Variants Herd [Alglave et al. TOPLAS 2014] µhb Analysis with Check ISA-level litmus tests

HLL Model e.g. C11 µspec Microarch. Model Forbidden Observable

BUG!

slide-103
SLIDE 103

Checking C11 Mappings to ARMv7/Power

▪Ran TriCheck on microarch. with reordering similar to ARMv7/Power

  • Utilised “trailing-sync” compiler mapping [Batty et al. POPL 2012]
  • Discovered 2 cases where C11 outcome forbidden, but allowed by hardware!
  • Deduced that the mapping must be flawed

▪Mapping was supposedly proven correct [Batty et al. POPL 2012]

  • Traced the loophole in the proof [Manerkar et al. CoRR’16]

▪Problem: C11 model slightly too strong for mappings

  • C11 has happens-before (hb) ordering and total order on all SC accesses (sc)
  • hb and sc orders must agree with each other
  • Trailing-sync mapping does not guarantee this for our counterexamples
slide-104
SLIDE 104

Current state of C11

▪“Leading-sync” mapping [McKenney and Silvera 2011]

  • Counterexample discovered concurrently to us [Lahav et al. PLDI 2017]

▪Both mappings currently broken
▪Possible solutions under discussion by C11 memory model committee:

  • RC11 [Lahav et al. PLDI 2017]: remove req. that sc and hb orders agree

    − Current mappings work, but reduces intuition in an already complicated C11 model

  • Adding extra fences to mappings

    − low performance, requires recompilation, counterexample pattern not common

slide-105
SLIDE 105

TriCheck Takeaways

▪Both HLL memory models and microarchitectural optimizations influence the design of ISA-level MCMs
▪TriCheck enables holistic analysis of HLL memory model, ISA-level MCM, compiler mappings, and microarchitectural implementations
▪TriCheck discovered numerous issues with draft RISC-V MCM

  • Influenced the design of the new RISC-V MCM

▪Discovered two counterexamples to C11 -> ARMv7/Power compiler mappings

  • Mappings were previously “proven” correct; isolated flaw in proof
slide-106
SLIDE 106

[Figure: two pipelines (Fetch, Dec., Exec., Mem., WB) with SB and L1, plus Lds., an L2, and a Coherence Protocol (SWMR, DVI, etc.)]

Memory Consistency Checking for RTL

Microarchitecture Checking

slide-109
SLIDE 109

RTL implementation [RTL Image: Christopher Batten]

[Figure: two pipelines (Fetch, Dec., Exec., Mem., WB) with SB and L1, plus Lds., an L2, and a Coherence Protocol (SWMR, DVI, etc.)]

How to ensure RTL maintains orderings?

Memory Consistency Checking for RTL

✓ 

Microarchitecture Checking

slide-110
SLIDE 110

RTLCheck: Checking RTL Implementations

High-Level Languages (HLL) Compiler Architecture (ISA) Microarchitecture OS

▪RTLCheck [Manerkar et al. MICRO 2017] enables checking microarchitectural axioms against an implementation’s Verilog RTL for litmus test suites
▪This helps ensure that the RTL maintains orderings required for consistency
▪Selected as an Honorable Mention from the “Top Picks of Comp. Arch. Conferences” for 2017

Processor RTL
slide-115
SLIDE 115

RTL Verification is Maturing…

▪…but usually ignores memory consistency!
▪Often use SystemVerilog Assertions (SVA)

ISA-Formal [Reid et al. CAV 2016]

  • Instr. Operational Semantics
  • No MCM verification

DOGReL [Stewart et al. DIFTS 2014]

  • Memory subsystem transactions
  • No multicore MCM verification (?)

Kami [Vijayaraghavan et al. CAV 2015] [Choi et al. ICFP 2017]

  • MCM correctness for all programs, but…
  • Needs Bluespec design and manual proofs!

Lack of automated memory consistency verification at RTL!

slide-116
SLIDE 116

RTLCheck: Checking RTL Consistency Orderings

RTL Design µspec Microarch. Axioms Litmus Test Mapping Functions Temporal SystemVerilog Assertions (SVA) Cadence JasperGold (RTL Verifier)

RTLCheck

Proven?

slide-117
SLIDE 117

RTLCheck: Checking RTL Consistency Orderings

RTL Design µspec Microarch. Axioms Litmus Test Mapping Functions Temporal SystemVerilog Assertions (SVA) Cadence JasperGold (RTL Verifier)

RTLCheck

Proven?

User-provided mapping functions translate microarch. primitives to RTL equivalents

slide-118
SLIDE 118

RTLCheck: Checking RTL Consistency Orderings

RTL Design µspec Microarch. Axioms Litmus Test Mapping Functions Temporal SystemVerilog Assertions (SVA) Cadence JasperGold (RTL Verifier)

RTLCheck

Proven?

RTLCheck automatically translates µarch. ordering axioms to temporal properties

slide-119
SLIDE 119

RTLCheck: Checking RTL Consistency Orderings

RTL Design µspec Microarch. Axioms Litmus Test Mapping Functions Temporal SystemVerilog Assertions (SVA) Cadence JasperGold (RTL Verifier)

RTLCheck

Proven?

Properties may be proven or counterexample found
slide-122
SLIDE 122

Meaning can be Lost in Translation!

[Image: Barbara Younger] [Inspiration: Tae Jun Ham]

小心地滑

(Caution: Slippery Floor)

slide-127
SLIDE 127

RTLCheck: Checking Consistency at RTL

Axiomatic Microarch. Analysis: abstract nodes and happens-before edges

Temporal RTL Verification (SVA, etc): concrete signals and clock cycles

[Figure: waveform over cycles 2 to 7 for clk, Core[0].DX, Core[0].WB, Core[1].DX, Core[1].WB, Core[0].SData and Core[1].LData, showing St x, St y, Ld y and Ld x with data value 0x1]

Axiomatic/Temporal Mismatch!

slide-129
SLIDE 129

Outcome Filtering in Axiomatic Analysis

mp (Message Passing)

Core 0          Core 1
(i1) x = 1;     (i3) r1 = y;
(i2) y = 1;     (i4) r2 = x;

Outcome: r1 = 1, r2 = 1

▪Outcome Filtering: Restrict test outcome to one particular outcome

  • Allows for more efficient verification

▪Axiomatic models make outcome filtering easy

Execution examined as a whole, so outcome can be enforced!
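
To make "execution examined as a whole" concrete, the sketch below enumerates every sequentially consistent interleaving of mp and collects the final (r1, r2) values. This is a brute-force Python illustration with an assumed encoding of the test, not how the Check tools or Herd work internally; it only shows that a whole-execution view lets one ask directly which outcomes exist.

```python
def sc_outcomes():
    """Enumerate all SC interleavings of mp and return the set of (r1, r2) outcomes."""
    threads = [
        [("st", "x", 1), ("st", "y", 1)],        # Core 0: (i1), (i2)
        [("ld", "y", "r1"), ("ld", "x", "r2")],  # Core 1: (i3), (i4)
    ]
    outcomes = set()

    def run(pcs, mem, regs):
        if all(pc == len(t) for pc, t in zip(pcs, threads)):
            outcomes.add((regs["r1"], regs["r2"]))
            return
        for i, thread in enumerate(threads):
            if pcs[i] == len(thread):
                continue  # this core has finished
            op, addr, arg = thread[pcs[i]]
            new_mem, new_regs = dict(mem), dict(regs)
            if op == "st":
                new_mem[addr] = arg
            else:
                new_regs[arg] = new_mem[addr]
            new_pcs = list(pcs)
            new_pcs[i] += 1
            run(new_pcs, new_mem, new_regs)

    run([0, 0], {"x": 0, "y": 0}, {})
    return outcomes

mp_outcomes = sc_outcomes()
```

Under SC the reachable outcomes are (0, 0), (0, 1) and (1, 1); the outcome r1 = 1, r2 = 0 never appears.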

slide-137
SLIDE 137

Outcome Filtering in Temporal Verification

▪Filtering executions by outcome requires expensive global analysis

  • Not done by many SVA verifiers, including JasperGold!

mp

Core 0          Core 1
(i1) x = 1;     (i3) r1 = y;
(i2) y = 1;     (i4) r2 = x;

Is r1 = 1, r2 = 0 possible?

Step 1: (i1) x = 1
Step 2: (i2) y = 1
Step 3: (i3) r1 = y = 1, or (i3) r1 = y = 0 …
Step 4: (i4) r2 = x = 1, or (i4) r2 = x = 0?

Need to examine all possible paths from current step to end of execution: too expensive!

SVA Verifier Approximation: Only check if constraints hold up to current step

Makes Outcome Filtering impossible!

slide-138
SLIDE 138

µspec Analysis Uses Outcome Filtering

mp

Core 0          Core 1
(i1) x = 1;     (i3) r1 = y;
(i2) y = 1;     (i4) r2 = x;

SC Forbids: r1 = 1, r2 = 0

Axiom "Read_Values": Every load either reads BeforeAllWrites OR reads FromLatestWrite

Note: Axioms abstracted for brevity

slide-140
SLIDE 140

µspec Analysis Uses Outcome Filtering

mp

Core 0          Core 1
(i1) x = 1;     (i3) r1 = y;
(i2) y = 1;     (i4) r2 = x;

SC Forbids: r1 = 1, r2 = 0

Axiom "Read_Values": Every load either reads BeforeAllWrites OR reads FromLatestWrite

No write for load to read from!

Note: Axioms abstracted for brevity

slide-141
SLIDE 141

µspec Analysis Uses Outcome Filtering

mp

Core 0          Core 1
(i1) x = 1;     (i3) r1 = y;
(i2) y = 1;     (i4) r2 = x;

SC Forbids: r1 = 1, r2 = 0

Axiom "Read_Values": Every load either reads BeforeAllWrites OR reads FromLatestWrite

Outcome Filtering leads to simpler axioms!

Note: Axioms abstracted for brevity

slide-146
SLIDE 146

Temporal Outcome Filtering Fails!

mp

Core 0          Core 1
(i1) x = 1;     (i3) r1 = y;
(i2) y = 1;     (i4) r2 = x;

SC Forbids: r1 = 1, r2 = 0

Filtered Read_Values: Unless Load returns non-zero value, Load happens before all stores to its address

[Figure: waveform of clk, Core[0].Commit, Core[1].Commit, Core[0].SData and Core[1].LData against Time (cycles): St x (0x1) at cycle 3, St y (0x1) at cycle 4, Ld y (0x1) at cycle 5, Ld x (0x1) at cycle 6]

After 3 cycles: Store happens before load! Property Violated?
After 6 cycles: Load does not read 0. No Violation!
But SVA verifiers don’t check future cycles!

Note: Axioms/properties abstracted for brevity
slide-147
SLIDE 147

Temporal Outcome Filtering Fails!

mp

Core 0          Core 1
(i1) x = 1;     (i3) r1 = y;
(i2) y = 1;     (i4) r2 = x;

SC Forbids: r1 = 1, r2 = 0

Filtered Read_Values: Unless Load returns non-zero value, Load happens before all stores to its address

[Figure: waveform of clk, Core[0].Commit, Core[1].Commit, Core[0].SData and Core[1].LData against Time (cycles), cut off after St x (0x1) at cycle 3]

After 3 cycles: Store happens before load! Property Violated?
After 6 cycles: Load does not read 0. No Violation!
But SVA verifiers don’t check future cycles!

Counterexample flagged despite hardware doing nothing wrong!

Note: Axioms/properties abstracted for brevity
slide-148
SLIDE 148

Solution: Load Value Constraints

mp

Core 0          Core 1
(i1) x = 1;     (i3) r1 = y;
(i2) y = 1;     (i4) r2 = x;

SC Forbids: r1 = 1, r2 = 0

Axiom "Read_Values": Every load either reads BeforeAllWrites OR reads FromLatestWrite

▪Don’t simplify axioms; translate all cases
▪Tag each case with appropriate load value constraints

  • reflect the data constraints required for edge(s)

Property to check: mapNode(Ld x → St x, Ld x == 0) or mapNode(St x → Ld x, Ld x == 1);

Note: Axioms and properties abstracted for brevity
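
The difference between the filtered property and the load-value-constrained property can be illustrated with a toy prefix checker. This is a Python approximation written for this example: it is not SVA, and it is not RTLCheck's actual translation. The trace mirrors the waveform from the failing example (St x at cycle 3 through Ld x at cycle 6), and the checker judges each property only on the events seen so far, which is the assumed stand-in for "only check if constraints hold up to current step".

```python
# Each event is (cycle, op, addr, value), following the waveform in the failing example.
TRACE = [
    (3, "St", "x", 1),
    (4, "St", "y", 1),
    (5, "Ld", "y", 1),
    (6, "Ld", "x", 1),
]

def find(prefix, op, addr):
    """Return (cycle, value) of the first matching event in the prefix, or None."""
    for cycle, o, a, v in prefix:
        if (o, a) == (op, addr):
            return cycle, v
    return None

def filtered_read_values(prefix):
    """Naive property: Ld x happens before St x unless Ld x returns non-zero."""
    st, ld = find(prefix, "St", "x"), find(prefix, "Ld", "x")
    if ld is not None and ld[1] != 0:
        return True  # escape clause, but only usable once the load has happened
    if st is None:
        return True  # no store seen yet, so the ordering can still hold
    return ld is not None and ld[0] < st[0]

def constrained_read_values(prefix):
    """(Ld x -> St x with Ld x == 0) or (St x -> Ld x with Ld x == 1)."""
    st, ld = find(prefix, "St", "x"), find(prefix, "Ld", "x")
    if ld is None:
        return True  # nothing to judge until the load value is known
    ld_cycle, ld_val = ld
    before = ld_val == 0 and (st is None or ld_cycle < st[0])
    after = ld_val == 1 and st is not None and st[0] < ld_cycle
    return before or after

def first_failure(prop, trace):
    """Check the property on every prefix; return the cycle of the first failure, if any."""
    for n in range(1, len(trace) + 1):
        if not prop(trace[:n]):
            return trace[n - 1][0]
    return None

naive_result = first_failure(filtered_read_values, TRACE)
constrained_result = first_failure(constrained_read_values, TRACE)
```

In this toy model the naive property is flagged at cycle 3, before the load has had a chance to return 1, while the constrained property is never flagged on the same trace.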
slide-152
SLIDE 152

Multi-V-scale: a Multicore Case Study

Core 0 Core 1 Core 2 Core 3

Arbiter Memory WB DX IF WB DX IF WB DX IF WB DX IF

slide-153
SLIDE 153

Multi-V-scale: a Multicore Case Study

Core 0 Core 1 Core 2 Core 3

Arbiter Memory WB DX IF WB DX IF WB DX IF WB DX IF

3-stage in-order pipelines

slide-154
SLIDE 154

Multi-V-scale: a Multicore Case Study

Core 0 Core 1 Core 2 Core 3

Arbiter Memory WB DX IF WB DX IF WB DX IF WB DX IF

Arbiter enforces that only one core can access memory at any time

slide-155
SLIDE 155

Bug Discovered in V-scale

▪ V-scale memory internally writes stores to wdata register
▪ wdata pushed to memory when subsequent store occurs
▪ Akin to single-entry store buffer
▪ When two stores are sent to memory in successive cycles, first of two stores is dropped by memory!
▪ Fixed bug by eliminating wdata
▪ V-scale has since been deprecated by RISC-V Foundation

[Figure: Core 0 to Core 3 (each with IF, DX, WB stages) connected through the Arbiter to Memory; inside Memory, stores (x = 1, y = 1) pass through the wdata register before reaching the Mem array]
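
A toy behavioural model of the bug as described in the bullets above. The class names, the cycle-gap condition and the forwarding from `wdata` on loads are all assumptions made for illustration; the real V-scale RTL mechanism is not reproduced here.

```python
class BuggyMemory:
    """Toy model: a store sits in wdata until a later store pushes it to the array."""

    def __init__(self):
        self.array = {}
        self.wdata = None  # (addr, value, cycle) of the buffered store

    def store(self, cycle, addr, value):
        if self.wdata is not None:
            old_addr, old_value, old_cycle = self.wdata
            if cycle - old_cycle > 1:
                self.array[old_addr] = old_value  # normal case: push the buffered store
            # Assumed bug condition: on back-to-back stores the buffered store is lost.
        self.wdata = (addr, value, cycle)

    def load(self, addr):
        if self.wdata is not None and self.wdata[0] == addr:
            return self.wdata[1]  # assumed forwarding, as a store buffer would do
        return self.array.get(addr, 0)


class FixedMemory:
    """Toy model of the fix: no wdata register, stores go straight to the array."""

    def __init__(self):
        self.array = {}

    def store(self, cycle, addr, value):
        self.array[addr] = value

    def load(self, addr):
        return self.array.get(addr, 0)


def run(memory, store_cycles):
    """Issue x = 1 then y = 1 at the given cycles and read both locations back."""
    memory.store(store_cycles[0], "x", 1)
    memory.store(store_cycles[1], "y", 1)
    return memory.load("x"), memory.load("y")

back_to_back_buggy = run(BuggyMemory(), (1, 2))
spaced_buggy = run(BuggyMemory(), (1, 3))
back_to_back_fixed = run(FixedMemory(), (1, 2))
```

In this model the buggy memory loses the store to x only when the two stores arrive in successive cycles, and the version without `wdata` keeps both.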

slide-158
SLIDE 158

RTLCheck Takeaways

▪Microarchitectural models must be validated against RTL
▪RTLCheck: Automated translation of microarch. axioms into equivalent temporal SVA properties for litmus test suites

  • Translation is complicated by the axiomatic-temporal mismatch
  • JasperGold was able to prove 90% of properties/test in 11 hours runtime

▪Last piece of the Check suite; now have tools at all levels of the stack!

slide-159
SLIDE 159

Conclusion

High-Level Languages (HLL) Compiler Architecture (ISA) Microarchitecture OS

▪The Check suite provides automated full-stack MCM checking of implementations
▪Litmus-test based verification to concentrate on error-prone cases
▪Can check:

  • Implementation of HLL requirements
  • Virtual memory implementation
  • HLL Compiler mappings
  • Microarchitectural Orderings (including coherence)
  • and even RTL (Verilog)!

▪All tools are open-source and publicly available!

Processor RTL
slide-160
SLIDE 160

With Thanks to…

▪Collaborators:

  • Margaret Martonosi
  • Daniel Lustig
  • Caroline Trippel
  • Michael Pellauer
  • Aarti Gupta

▪Funding:

  • Princeton Wallace Memorial Honorific Fellowship
  • STARnet C-FAR (Center for Future Architectures Research)
  • JUMP ADA Center (Applications Driving Architectures)
  • National Science Foundation
slide-161
SLIDE 161

Questions?

http://check.cs.princeton.edu/
http://www.cs.princeton.edu/~manerkar

  • Yatin A. Manerkar, Daniel Lustig, Margaret Martonosi, and Michael Pellauer. RTLCheck: Verifying the Memory Consistency of RTL Designs. The 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), October 2017.
  • Yatin A. Manerkar, Caroline Trippel, Daniel Lustig, Michael Pellauer, and Margaret Martonosi. Counterexamples and Proof Loophole for the C/C++ to POWER and ARMv7 Trailing-Sync Compiler Mappings. CoRR abs/1611.01507, November 2016.
  • Caroline Trippel, Yatin A. Manerkar, Daniel Lustig, Michael Pellauer, and Margaret Martonosi. TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA. The 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2017.
  • Yatin A. Manerkar, Daniel Lustig, Michael Pellauer, and Margaret Martonosi. CCICheck: Using µhb Graphs to Verify the Coherence-Consistency Interface. The 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2015.
slide-165
SLIDE 165

Coherence and Consistency

[Figure: Coherence and Consistency boxes, shown first in a Conceptual view and then in a Real Implementations view labelled CCI]

Real Implementations: Coherence and consistency often interwoven

  • Verifiers can’t ignore consistency implications!
  • Verifiers can’t assume abstract coherence/memory hierarchy!

▪Most coherence protocols are not that simple!

  • Partial incoherence (e.g. GPUs) [Wickerson et al. OOPSLA 2016]
  • Lazy coherence (e.g. TSO-CC) [Elver and Nagarajan HPCA 2014]

▪CCI: Coherence-Consistency Interface

slide-166
SLIDE 166

Issue with Draft RISC-V MCM: Cumulativity

▪Consider this litmus test variant (WRC):

  • C11 atomics can specify memory orderings: REL = release, ACQ = acquire

Thread 0            Thread 1              Thread 2
St (x, 1, REL)      r0 = Ld (x, ACQ)      r1 = Ld (y, ACQ)
                    St (y, 1, REL)        r2 = Ld (x, ACQ)

Forbidden by C11: r0 = 1, r1 = 1, r2 = 0

▪RISC-V lacked cumulative fences to enforce this ordering:

  • (x5 and x6 contain addresses of x and y)

Core 0              Core 1                Core 2
sw x1, (x5)         lw x2, (x5)           lw x3, (x6)
                    fence r, rw           fence r, rw
                    fence rw, w           lw x4, (x5)
                    sw x2, (x6)

Allowed by draft RISC-V: x1 = 1, x2 = 1, x3 = 1, x4 = 0
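
A toy model of why non-cumulative fences are not enough for WRC. It assumes a machine where a store can become visible to different cores at different times, and it treats each core's fenced code as running in program order at one point in time; both are simplifications chosen for this sketch, and the event names are hypothetical. Setting `multi_copy_atomic=True` forces the store to x to reach Core 1 and Core 2 together.

```python
from itertools import permutations

# Events: a store becoming visible to one core, or a core running its (fenced) code.
EVENTS = ["x->C1", "x->C2", "run C1", "y->C2", "run C2"]

def wrc_outcomes(multi_copy_atomic):
    """Return the set of (r0, r1, r2) outcomes reachable in the toy model."""
    outcomes = set()
    for order in permutations(EVENTS):
        pos = {event: i for i, event in enumerate(order)}
        if pos["y->C2"] < pos["run C1"]:
            continue  # y = 1 only exists after Core 1 has run its store
        if multi_copy_atomic and abs(pos["x->C1"] - pos["x->C2"]) != 1:
            continue  # x = 1 must reach both cores with nothing in between
        view = {"C1": {"x": 0}, "C2": {"x": 0, "y": 0}}
        regs = {}
        for event in order:
            if event == "x->C1":
                view["C1"]["x"] = 1
            elif event == "x->C2":
                view["C2"]["x"] = 1
            elif event == "y->C2":
                view["C2"]["y"] = 1
            elif event == "run C1":
                regs["r0"] = view["C1"]["x"]  # load x; the store to y follows in order
            else:  # "run C2": load y, then load x, kept in order by the fence
                regs["r1"] = view["C2"]["y"]
                regs["r2"] = view["C2"]["x"]
        outcomes.add((regs["r0"], regs["r1"], regs["r2"]))
    return outcomes

non_atomic = wrc_outcomes(multi_copy_atomic=False)
atomic = wrc_outcomes(multi_copy_atomic=True)
```

In this model the outcome r0 = 1, r1 = 1, r2 = 0 is reachable when x = 1 can reach Core 1 before Core 2, even though every core keeps its own accesses in order, and it disappears once the store is made visible to both cores together.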
slide-172
SLIDE 172

ARMv7/Power Trailing-Sync Counterexample

▪Consider this litmus test variant (IRIW):

  • Total order over all SC atomic accesses is required

Thread 0         Thread 1         Thread 2            Thread 3
St (x, 1, SC)    St (y, 1, SC)    r0 = Ld (x, ACQ)    r2 = Ld (y, ACQ)
                                  r1 = Ld (y, SC)     r3 = Ld (x, SC)

Forbidden by C11: r0 = 1, r1 = 0, r2 = 1, r3 = 0

▪With the trailing-sync mapping, this compiles to the following:

Core 0        Core 1        Core 2               Core 3
str 1, [x]    str 1, [y]    ldr r1, [x]          ldr r3, [y]
                            ctrlisb/ctrlisync    ctrlisb/ctrlisync
                            ldr r2, [y]          ldr r4, [x]

Allowed by Power/ARMv7: r1 = 1, r2 = 0, r3 = 1, r4 = 0

  • Allowed on Power [Sarkar et al. PLDI 2011] and ARMv7 [Alglave et al. TOPLAS 2014]
slide-173
SLIDE 173

ARMv7/Power Trailing-Sync Counterexample

▪Consider this litmus test variant (IRIW):

  • Total order over all SC atomic accesses is required

▪SC total order must respect happens-before, i.e. (sb ∪ sw)+

Thread 0         Thread 1         Thread 2            Thread 3
St (x, 1, SC)    St (y, 1, SC)    r0 = Ld (x, ACQ)    r2 = Ld (y, ACQ)
                                  r1 = Ld (y, SC)     r3 = Ld (x, SC)

Forbidden by C11: r0 = 1, r1 = 0, r2 = 1, r3 = 0

[Generated with CPPMEM from Cambridge]
slide-176
SLIDE 176

ARMv7/Power Trailing-Sync Counterexample

▪Consider this litmus test variant (IRIW):

  • Total order over all SC atomic accesses is required

▪SC total order must respect happens-before, i.e. (sb ∪ sw)+

Thread 0         Thread 1         Thread 2            Thread 3
St (x, 1, SC)    St (y, 1, SC)    r0 = Ld (x, ACQ)    r2 = Ld (y, ACQ)
                                  r1 = Ld (y, SC)     r3 = Ld (x, SC)

Forbidden by C11: r0 = 1, r1 = 0, r2 = 1, r3 = 0

[Generated with CPPMEM from Cambridge]

c: Wsc x = 1, d: Wsc y = 1, f: Rsc y = 0, h: Rsc x = 0

slide-177
SLIDE 177

ARMv7/Power Trailing-Sync Counterexample

▪Consider this litmus test variant (IRIW):

  • Total order over all SC atomic accesses is required

▪SC reads must be before later SC writes

Thread 0         Thread 1         Thread 2            Thread 3
St (x, 1, SC)    St (y, 1, SC)    r0 = Ld (x, ACQ)    r2 = Ld (y, ACQ)
                                  r1 = Ld (y, SC)     r3 = Ld (x, SC)

Forbidden by C11: r0 = 1, r1 = 0, r2 = 1, r3 = 0

[Generated with CPPMEM from Cambridge]

c: Wsc x = 1, d: Wsc y = 1, f: Rsc y = 0, h: Rsc x = 0

slide-178
SLIDE 178

ARMv7/Power Trailing-Sync Counterexample

▪Consider this litmus test variant (IRIW):

  • Total order over all SC atomic accesses is required

▪SC reads must be before later SC writes

Thread 0         Thread 1         Thread 2            Thread 3
St (x, 1, SC)    St (y, 1, SC)    r0 = Ld (x, ACQ)    r2 = Ld (y, ACQ)
                                  r1 = Ld (y, SC)     r3 = Ld (x, SC)

Forbidden by C11: r0 = 1, r1 = 0, r2 = 1, r3 = 0

[Generated with CPPMEM from Cambridge]

c: Wsc x = 1, d: Wsc y = 1, f: Rsc y = 0, h: Rsc x = 0

  • Cycle in the SC order implies outcome is forbidden
  • But compiled code allows the behaviour!
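
The cycle argument can be made concrete with a short Python sketch. It is an illustration only: the graph encoding and the `find_cycle` helper are assumptions for this example, and the labels e and g for the two acquire loads are assumed (the slide labels only c, d, f and h). Each edge is a constraint that the SC total order would have to satisfy for the outcome r0 = 1, r1 = 0, r2 = 1, r3 = 0.

```python
# Required "must come earlier in the SC order" constraints between SC accesses.
sc_constraints = {
    "c": ["f"],  # c -sw-> e -sb-> f, so c happens-before f
    "d": ["h"],  # d -sw-> g -sb-> h, so d happens-before h
    "f": ["d"],  # f reads y = 0, so f must precede the SC write d
    "h": ["c"],  # h reads x = 0, so h must precede the SC write c
}


def find_cycle(graph):
    """Depth-first search; return one cycle as a list of nodes, or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for succ in graph.get(node, []):
            if colour[succ] == GREY:
                # Back edge: the cycle is the stack from succ onwards.
                return stack[stack.index(succ):] + [succ]
            if colour[succ] == WHITE:
                cycle = visit(succ)
                if cycle:
                    return cycle
        stack.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


cycle = find_cycle(sc_constraints)
print(cycle)
```

A total order cannot contain a cycle, so no SC order exists for this outcome, which is why C11 forbids it. Dropping any one of the four constraints leaves the graph acyclic.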
slide-179
SLIDE 179

What went wrong?

▪It was thought that program order and coherence edges directly between SC accesses were all that needed enforcing [Batty et al. POPL 2012]

▪But ℎ𝑐 edges can arise between SC accesses through the transitive composition of edges to and from a non-SC intermediate access

▪Occurs in IRIW counterexample:

slide-182
SLIDE 182

Assumption Generation

▪Need to restrict executions to those of litmus test

▪Three classes of assumptions:

  • Memory initialization

− Instr. mem and data mem

  • Register initialization
  • Value assumptions

− Load value assumptions: loads return correct value (when they occur)

− Final value assumptions: Required final values of memory are respected

▪RTLCheck generates SystemVerilog Assumptions to constrain executions

  • Utilises user-provided program mapping function
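
As a rough illustration of what generating such assumptions could look like, the Python sketch below emits SystemVerilog `assume property` strings for a memory-initialization assumption and a load value assumption. It is not RTLCheck's actual output or interface: the function names and every signal name (`clk`, `first_cycle`, `mem`, `core[i].pc_wb`, `core[i].load_data`) are hypothetical, chosen only for this example. In a real flow these names would come from the user-provided mapping for the design.

```python
def memory_init_assumption(addr, value, width=32):
    """Assume a data memory word holds `value` in the first cycle.

    `first_cycle` and `mem` are hypothetical signal names.
    """
    return (
        "assume property (@(posedge clk) "
        f"first_cycle |-> (mem[{addr}] == {width}'d{value}));"
    )


def load_value_assumption(core, pc, value, width=32):
    """Assume that when the load at `pc` writes back, it returns `value`.

    `pc_wb` and `load_data` are hypothetical signal names.
    """
    return (
        "assume property (@(posedge clk) "
        f"(core[{core}].pc_wb == {width}'d{pc}) |-> "
        f"(core[{core}].load_data == {width}'d{value}));"
    )


# mp: memory starts at 0; on core 1, Ld y returns 1 and Ld x returns 0.
assumptions = [
    memory_init_assumption(0, 0),
    memory_init_assumption(1, 0),
    load_value_assumption(core=1, pc=0, value=1),
    load_value_assumption(core=1, pc=4, value=0),
]
for line in assumptions:
    print(line)
```

The implication form (`|->`) reflects the "when they occur" wording: the load value is constrained only in the cycles where that load actually writes back.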

slide-183
SLIDE 183

Assumption Generation

▪Covering trace: execution where assumption condition is enforced

  • E.g.: execution where load of x returns 0
  • Must obey all assumptions

▪Covering final value assumption == finding forbidden execution!

  • No covering trace => equivalent to verifying overall test!

▪Quicker verification for some tests

  • Expect benefit to be largest for small designs

slide-184
SLIDE 184

The Benefits of Final Value Assumptions

▪Why generate final value assumptions if test has no final conditions?

▪Answer: Covering traces can lead to faster verification

▪These are traces where assumption condition occurs and can be enforced

[Figure: timing diagram over clock cycles 2 to 7 of signals clk, Core[0].DX, Core[0].WB, Core[0].SData, Core[1].DX, Core[1].WB and Core[1].LData, showing St x and St y on Core 0 and Ld y and Ld x on Core 1]
slide-185
SLIDE 185

The Benefits of Final Value Assumptions

▪Why generate final value assumptions if test has no final conditions?

▪Answer: Covering traces can lead to faster verification

▪These are traces where assumption condition occurs and can be enforced

[Figure: timing diagram over clock cycles 2 to 7 of signals clk, Core[0].DX, Core[0].WB, Core[0].SData, Core[1].DX, Core[1].WB and Core[1].LData, showing St x and St y on Core 0 and Ld y and Ld x on Core 1]

Covering trace for final val assumption is complete execution of litmus test
slide-186
SLIDE 186

The Benefits of Final Value Assumptions

▪Why generate final value assumptions if test has no final conditions?

▪Answer: Covering traces can lead to faster verification

▪These are traces where assumption condition occurs and can be enforced

[Figure: timing diagram over clock cycles 2 to 7 of signals clk, Core[0].DX, Core[0].WB, Core[0].SData, Core[1].DX, Core[1].WB and Core[1].LData, showing St x and St y on Core 0 and Ld y and Ld x on Core 1]

Covering trace for final val assumption is complete execution of litmus test

Covering trace must also obey other assumptions, including load val assumptions (For mp, Ld y = 1 and Ld x = 0)
slide-187
SLIDE 187

The Benefits of Final Value Assumptions

▪Why generate final value assumptions if test has no final conditions?

▪Answer: Covering traces can lead to faster verification

▪These are traces where assumption condition occurs and can be enforced

[Figure: timing diagram over clock cycles 2 to 7 of signals clk, Core[0].DX, Core[0].WB, Core[0].SData, Core[1].DX, Core[1].WB and Core[1].LData, showing St x and St y on Core 0 and Ld y and Ld x on Core 1]

Covering trace for final val assumption is complete execution of litmus test

Covering trace must also obey other assumptions, including load val assumptions (For mp, Ld y = 1 and Ld x = 0)

Thus, covering trace for mp final val assumption (full execution with Ld y=1 and Ld x=0) is equivalent to finding forbidden execution of mp!
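
The covering-trace idea can be illustrated with a toy Python search. This is a sketch under a strong simplifying assumption: a plain interleaving model with in-order cores stands in for the RTL design, so it is not how RTLCheck itself explores executions. All function names are hypothetical. The search looks for a complete execution of mp that obeys the load value assumptions. Finding none corresponds to the forbidden outcome being unreachable.

```python
def interleavings(t0, t1):
    """Yield every interleaving of two per-thread operation lists."""
    if not t0:
        yield list(t1)
        return
    if not t1:
        yield list(t0)
        return
    for rest in interleavings(t0[1:], t1):
        yield [t0[0]] + rest
    for rest in interleavings(t0, t1[1:]):
        yield [t1[0]] + rest


# mp litmus test: Core 0 stores x then y; Core 1 loads y then x.
CORE0 = [("st", "x", 1), ("st", "y", 1)]
CORE1 = [("ld", "y"), ("ld", "x")]


def run(trace):
    """Execute a trace against a single shared memory; return load results."""
    mem = {"x": 0, "y": 0}  # memory initialization assumption
    loads = {}
    for op in trace:
        if op[0] == "st":
            mem[op[1]] = op[2]
        else:
            loads[op[1]] = mem[op[1]]
    return loads


def find_covering_trace(assumed_loads):
    """Return a complete execution obeying the load value assumptions, or None."""
    for trace in interleavings(CORE0, CORE1):
        if run(trace) == assumed_loads:
            return trace
    return None


# Forbidden outcome of mp: Ld y = 1 and Ld x = 0.
print(find_covering_trace({"y": 1, "x": 0}))
# A permitted outcome does have a covering trace.
print(find_covering_trace({"y": 1, "x": 1}))
```

In this toy model the first search fails and the second succeeds. On a real design the search is over RTL traces, and a failed search for a covering trace is what lets the test be verified quickly.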

slide-188
SLIDE 188

Results: Time to Prove Properties

▪Two configurations (Hybrid and Full_Proof), avg. runtime 6.2 hrs

  • See paper for configuration details

[Bar chart: time in hours (axis 0 to 12) for each litmus test and the mean, for the Hybrid and Full_Proof configurations]
slide-189
SLIDE 189

Results: Time to Prove Properties

▪Two configurations (Hybrid and Full_Proof), avg. runtime 6.2 hrs

  • See paper for configuration details

[Bar chart: time in hours (axis 0 to 12) for each litmus test and the mean, for the Hybrid and Full_Proof configurations]

Complete quickly due to covering traces

slide-190
SLIDE 190

Results: Time to Prove Properties

▪Two configurations (Hybrid and Full_Proof), avg. runtime 6.2 hrs

  • See paper for configuration details

[Bar chart: time in hours (axis 0 to 12) for each litmus test and the mean, for the Hybrid and Full_Proof configurations]

Max runtime 11 hours (if some properties unproven)

slide-191
SLIDE 191

Results: Proven Properties

▪Full_Proof generally better (90%/test) than Hybrid (81%/test)

▪On average, Full_Proof can prove more properties in same time

[Bar chart: % proven properties (axis 0 to 100) for each litmus test and the mean, for the Hybrid and Full_Proof configurations]
slide-192
SLIDE 192

Results: Proven Properties

▪Full_Proof generally better (90%/test) than Hybrid (81%/test)

▪On average, Full_Proof can prove more properties in same time

[Bar chart: % proven properties (axis 0 to 100) for each litmus test and the mean, for the Hybrid and Full_Proof configurations]

Hybrid better for only a few tests