SLIDE 1

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures

Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader

SLIDE 2

Challenges of Design Verification

  • Contemporary hardware designs require millions of lines of RTL code

– More lines of code written for verification than for the implementation itself

  • Tradeoff between performance and design complexity

– Speculative execution, shared caches, instruction reordering
– Performance wins out


SLIDE 3

Performance vs. Design Complexity

  • Programmer burden

– Requires correct usage of synchronization

  • Time to market

– Earlier remediation of bugs is less costly
– Re-spins on tapeout are expensive

  • Significant time spent on verification

– Verification techniques are often NP-complete


SLIDE 4

Memory Consistency Models

  • Contract between SW and HW regarding the semantics of memory operations

  • Classic example: Sequential Consistency (SC)

– All processors observe the same ordering of operations serviced by memory
– Too strict for modern optimizations/architectures

  • Nomenclature

– ST[A] → 1: “Wrote a value of 1 to location A”
– LD[B] ← 2: “Read a value of 2 from location B”


SLIDE 5

ARM Idiosyncrasies

  • Our focus: ARMv8

  • Speculative execution is allowed

  • Threads can reorder reads and writes

– Assuming no dependency exists

  • Writes are not guaranteed to be simultaneously visible to other cores


SLIDE 6

Problem Setup

  • Given an inst. trace from a simulator, RTL, or silicon
  • 1. Construct an initial graph

– Vertices represent load, store, and barrier insts
– Edges represent memory ordering, based on architectural rules

  • 2. Iteratively infer additional edges to the graph

– Based on existing relationships

  • 3. Check for cycles (see the sketch below)

– If one exists: contradiction!

[Figure: example ordering graph built from a two-CPU trace — operations ST[B] → 90, ST[B] → 92, LD[B] ← 92, LD[B] ← 93, LD[A] ← 2 across CPU 0 and CPU 1]
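The cycle check in step 3 is standard graph machinery. A minimal sketch (illustrative code, not the paper's implementation) using Kahn's topological sort: if the sort cannot consume every vertex, the leftover vertices lie on a cycle, so the trace contradicts the model.

#include <cstddef>
#include <queue>
#include <vector>

// adj[v] lists the successors of vertex v in the ordering graph.
using Graph = std::vector<std::vector<int>>;

bool has_cycle(const Graph& adj) {
  std::vector<int> indegree(adj.size(), 0);
  for (const auto& succs : adj)
    for (int w : succs) ++indegree[w];

  std::queue<int> ready;  // vertices with no unprocessed predecessors
  for (std::size_t v = 0; v < adj.size(); ++v)
    if (indegree[v] == 0) ready.push(static_cast<int>(v));

  std::size_t processed = 0;
  while (!ready.empty()) {
    int v = ready.front();
    ready.pop();
    ++processed;
    for (int w : adj[v])
      if (--indegree[w] == 0) ready.push(w);
  }
  return processed != adj.size();  // anything left over sits on a cycle
}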
SLIDE 7

TSOtool

  • Hangal et al., ISCA ’04

– Designed for SPARC, but portable to ARM

  • Each store writes a unique value to memory

– Easily map a load to the store that wrote its data

  • Tradeoff between accuracy and runtime

– Polynomial time, but false positives are possible
– If a cycle is found, a bug indeed exists
– If no cycles are found, execution appears consistent

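Because every store writes a unique value, resolving a load to the store it read from is a single hash lookup per instruction. A small sketch with illustrative types (not TSOtool's actual interface):

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

enum class Op { LD, ST, DMB };
struct Inst { Op type; std::uint64_t addr; std::uint64_t data; };

// Map each load's index to the index of the store whose unique value it
// observed; loads of never-written values simply stay unmapped.
std::unordered_map<std::size_t, std::size_t>
map_loads_to_stores(const std::vector<Inst>& trace) {
  std::unordered_map<std::uint64_t, std::size_t> writer;  // value -> store
  for (std::size_t i = 0; i < trace.size(); ++i)
    if (trace[i].type == Op::ST) writer[trace[i].data] = i;

  std::unordered_map<std::size_t, std::size_t> reads_from;
  for (std::size_t i = 0; i < trace.size(); ++i)
    if (trace[i].type == Op::LD) {
      auto it = writer.find(trace[i].data);
      if (it != writer.end()) reads_from[i] = it->second;
    }
  return reads_from;
}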

SLIDE 8

Need for Scalability

  • Must run many tests to maximize coverage

– Stress different portions of the memory subsystem

  • Longer tests put supporting logic in more interesting states

– Many instructions are required to build history in an LRU cache, for instance

  • Using a CPU cluster does not suffice

– The results of one set of tests dictate the structure of the ensuing tests
– Faster tests help with interactivity!

  • Solution: Efficient algorithms and parallelism


SLIDE 9

Inferred Edge Insertions (Rule 6)

  • S can reach X
  • X does not load data from S


[Figure: vertices S: ST[A] → 1, W: ST[A] → 2, X: LD[A] ← 2]

SLIDE 10

Inferred Edge Insertions (Rule 6)

  • S can reach X
  • X does not load data from S
  • S comes before the node that stored X’s data


[Figure: vertices S: ST[A] → 1, W: ST[A] → 2, X: LD[A] ← 2, with the inferred edge S → W]

SLIDE 11

Inferred Edge Insertions (Rule 7)

  • S can reach X
  • Loads read data from S, not X


[Figure: stores S: ST[A] → 1 and X: ST[A] → 2, with loads L: LD[A] ← 1 and M: LD[A] ← 1 that read from S]

SLIDE 12

Inferred Edge Insertions (Rule 7)

  • S can reach X
  • Loads read data from S, not X
  • Loads came before X


[Figure: stores S: ST[A] → 1 and X: ST[A] → 2, with loads L: LD[A] ← 1 and M: LD[A] ← 1, and the inferred edges L → X and M → X]

SLIDE 13

Initial Algorithm for Inferring Edges

for_each(store vertex S) {
  for_each(reachable vertex X from S) { //Getting this set is expensive!
    if(location[S] == location[X]) {
      if((type[X] == LD) && (data[S] != data[X])) {
        //Add Rule 6 edge from S to W, the store that X read from
      }
      else if(type[X] == ST) {
        for_each(load vertex L that reads data from S) {
          //Add Rule 7 edge from L to X
        }
      } //End if instruction type is store
    } //End if location
  } //End for each reachable vertex
} //End for each store


SLIDE 14

Virtual Processors (vprocs)

  • Split instructions from physical to virtual processors
  • Each vproc is sequentially consistent

– Program order ↔ Memory order


[Figure: CPU 0’s trace ST[B] → 91, ST[A] → 1, LD[A] ← 2, ST[B] → 92 split into VPROC 0: ST[A] → 1, LD[A] ← 2; VPROC 1: ST[B] → 91; VPROC 2: ST[B] → 92]
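A deliberately simplified sketch of one way to split a trace (the paper's actual rules follow the architecture's ordering requirements and are richer than this): an instruction stays on its predecessor's vproc only when a dependency pins program order to memory order, so every vproc is trivially sequentially consistent.

#include <cstddef>
#include <vector>

// Only the field needed for the sketch; a real record would also carry
// the opcode, address, and data.
struct TraceInst { bool depends_on_prev; };

// Returns, for one CPU's trace, the vprocs as lists of instruction
// indices in program order.
std::vector<std::vector<std::size_t>>
split_into_vprocs(const std::vector<TraceInst>& cpu_trace) {
  std::vector<std::vector<std::size_t>> vprocs;
  for (std::size_t i = 0; i < cpu_trace.size(); ++i) {
    if (i > 0 && cpu_trace[i].depends_on_prev)
      vprocs.back().push_back(i);  // ordered pair stays on one vproc
    else
      vprocs.push_back({i});       // unordered w.r.t. predecessor: new vproc
  }
  return vprocs;
}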

SLIDE 15

Reverse Time Vector Clocks (RTVC)

  • Consider the RTVC of ST[B] → 90

– Purple: ST[B] → 92; Blue: NULL; Green: LD[B] ← 92; Orange: LD[B] ← 92 (colors refer to the vprocs in the figure)

  • Track the earliest successor from each vertex to each vproc

– Captures transitivity


[Figure: example ordering graph over the two-CPU trace — ST[B] → 90, ST[B] → 92, LD[B] ← 92, LD[B] ← 93, LD[A] ← 2 across CPU 0 and CPU 1]

Complexity of inferring edges: O(n²p²d_max)

SLIDE 16

Updating RTVCs

  • Computing RTVCs once is fast

– Process vertices in the reverse order of a topological sort
– Check neighbors directly, then their RTVCs

  • Every time a new edge is inserted, RTVC values need to change

– # of edge insertions ≈ m


  • TSOtool implements both vprocs and RTVCs
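A sketch of the one-shot RTVC computation described above, with an illustrative layout: rtvc[v][p] holds the earliest successor of v on vproc p. Visiting vertices in reverse topological order guarantees every successor's RTVC is final before it is folded into v's, which is how transitivity is captured without a transitive closure.

#include <vector>

constexpr int NONE = -1;

// vproc_of[v] is v's vproc id; pos[v] is v's position within that vproc.
std::vector<std::vector<int>> compute_rtvcs(
    const std::vector<std::vector<int>>& adj, const std::vector<int>& vproc_of,
    const std::vector<int>& pos, const std::vector<int>& rev_topo_order,
    int num_vprocs) {
  std::vector<std::vector<int>> rtvc(adj.size(),
                                     std::vector<int>(num_vprocs, NONE));
  auto earlier = [&](int a, int b) { return b == NONE || pos[a] < pos[b]; };
  for (int v : rev_topo_order) {  // successors are already finalized
    for (int w : adj[v]) {
      if (earlier(w, rtvc[v][vproc_of[w]]))  // the neighbor itself...
        rtvc[v][vproc_of[w]] = w;
      for (int p = 0; p < num_vprocs; ++p)   // ...then its RTVC entries
        if (rtvc[w][p] != NONE && earlier(rtvc[w][p], rtvc[v][p]))
          rtvc[v][p] = rtvc[w][p];
    }
  }
  return rtvc;
}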
SLIDE 17

Facilitating Parallelism

  • Repeatedly updating RTVCs is expensive

– For k edge insertions, RTVC updates take O(kpn) time

  • k = O(n²), but usually k is a small multiple of n

  • Idea: Update RTVCs once per iteration rather than per edge insertion

– For i iterations, RTVC updates take O(ipn) time

  • i ≪ k (less than 10 for all test cases)

– Less communication between threads

  • Complexity of inferring edges: O(n²p)

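A driver sketch of this lazy schedule (all names illustrative): edge inference reads a stale but safe RTVC snapshot, the new edges are committed in bulk, and RTVCs are rebuilt once per iteration rather than once per insertion.

#include <vector>

struct Edge { int src, dst; };

// Generic fixed-point loop: refresh RTVCs, infer against that snapshot,
// commit the batch, repeat until no new edges appear.
template <class RefreshRtvcs, class InferEdges, class CommitEdges>
void infer_until_fixed_point(RefreshRtvcs refresh, InferEdges infer,
                             CommitEdges commit) {
  while (true) {
    refresh();                          // one bulk update per iteration
    std::vector<Edge> added = infer();  // stale RTVCs are still sound
    if (added.empty()) break;           // fixed point reached
    commit(added);
  }
}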

SLIDE 18

Correctness

  • Inferred edges found by our approach will not be the same as the edges found by TSOtool

– Might not infer an edge that TSOtool does (RTVC for TSOtool can change mid-iteration)
– Might infer an edge that TSOtool does not (our approach will have “stale” RTVC values)

  • Both approaches make forward progress

– Number of edges monotonically increases

  • Any edge inserted by our approach could have been inserted by the naïve approach [Thm 1]
  • If TSOtool finds a cycle, we will also find a cycle [Thm 2]


SLIDE 19

Parallel Implementations

  • OpenMP (see the sketch after this list)

– Each thread keeps its own partition of added edges
– After each iteration of inferring edges, reduce

  • CUDA

– Assign threads to each store instruction
– Threads independently traverse the vprocs of this store
– Atomically add edges to a preallocated array in global memory

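A sketch of the OpenMP scheme just described, with the rule checks elided: threads scan disjoint stores, append inferred edges to private partitions, and the partitions are merged after the loop (the per-iteration reduce).

#include <omp.h>
#include <vector>

struct Edge { int src, dst; };

std::vector<Edge> infer_edges_openmp(int num_stores) {
  // One private edge partition per thread avoids locking in the loop.
  std::vector<std::vector<Edge>> local(omp_get_max_threads());
  #pragma omp parallel for schedule(dynamic)
  for (int s = 0; s < num_stores; ++s) {
    std::vector<Edge>& mine = local[omp_get_thread_num()];
    // ... walk store s's RTVC entries, apply Rules 6 and 7, and push
    // any inferred edges onto `mine` (checks omitted in this sketch) ...
    (void)mine;
  }
  std::vector<Edge> merged;  // reduce: concatenate the partitions
  for (auto& part : local)
    merged.insert(merged.end(), part.begin(), part.end());
  return merged;
}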

SLIDE 20

Experimental Setup

  • Intel Core i7-2600K CPU

– Quad core, 3.4GHz, 8MB LLC, 16GB DRAM

  • NVIDIA GeForce GTX Titan

– 14 SMs, 837 MHz base clock, 6GB DRAM

  • ARM system under test

– Cortex-A57, quad core

  • Instruction graphs range from n = 2¹⁸ to n = 2²² vertices, with m ≈ n

– Sparse, high-diameter, low-degree
– Tests vary by their distribution of LD/ST/DMB instructions, # of vprocs, and inst dependencies


SLIDE 21

Importance of Scaling


  • 512K instructions per core
  • 2M total instructions

SLIDE 22

Speedup over TSOtool (Application)

Graph Size     # of tests   Lazy RTVC   OMP 2    OMP 4    GPU
64K*4 = 256K   27           5.64x       7.62x    9.43x    10.79x
128K*4 = 512K  27           5.31x       7.12x    8.90x    10.76x
256K*4 = 1M    23           6.30x       9.05x    12.13x   15.47x
512K*4 = 2M    10           3.68x       6.41x    10.81x   24.55x
1M*4 = 4M      2            3.05x       5.58x    9.97x    37.64x


  • GPU is always best; scales much better to larger tests
  • Extreme case: 9 hours using TSOtool → under 10 minutes using our GPU approach
  • Avg. parallel speedups over our improved sequential approach:

– 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)

SLIDE 23

Summary

  • Relaxing the updates to RTVCs led to a better sequential approach and facilitated parallel implementations

– Tradeoff between redundant work and parallelism

  • Faster execution leads to interactive bug-finding
  • The GPU scales well to larger problem instances

– Helpful for corner case bugs that slip through pre-silicon verification

  • For the twelve largest test cases our GPU implementation achieves a 26.36x average application speedup


SLIDE 24

Acknowledgments

  • Shankar Govindaraju and Tom Hart, for their help in understanding NVIDIA’s implementation of TSOtool for ARM


SLIDE 25

Questions

“To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science.” – Albert Einstein


SLIDE 26

Backup


SLIDE 27

Sequential Consistency Examples

  • Valid
  • Invalid


[Valid]   P1: ST[x]→1 | P2: LD[x]←1, LD[x]←2 | P3: LD[x]←1, LD[x]←2 | P4: ST[x]→2   (t = 0, 1, 2)
[Invalid] P1: ST[x]→1 | P2: LD[x]←1, LD[x]←2 | P3: LD[x]←2, LD[x]←1 | P4: ST[x]→2   (t = 0, 1, 2)

  • ST[x]→1 handled before ST[x]→2
  • Writes propagate to P2 and P3 in a different order

– Valid for weaker memory models
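Both outcomes can be checked mechanically. A small sketch (illustrative, not tooling from the talk): under SC, each processor's observed value sequence for a location must embed, in order, into a single total order of the writes to that location.

#include <cstddef>
#include <vector>

bool consistent_with_order(const std::vector<int>& total_order,
                           const std::vector<std::vector<int>>& observed) {
  for (const auto& seq : observed) {
    std::size_t pos = 0;  // scan the total order left to right
    for (int v : seq) {
      while (pos < total_order.size() && total_order[pos] != v) ++pos;
      if (pos == total_order.size()) return false;  // observed out of order
      // stay at pos: several loads may observe the same write
    }
  }
  return true;
}

With total_order = {1, 2}, the valid case above passes (P2 and P3 both read 1 then 2), while the invalid case fails because P3 reads 2 before 1.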

SLIDE 28

Weaker Models

  • SC is intuitive, but is too strict

– Prevents common compiler/arch. optimizations

  • Commercial products use weaker models

– x86: Total Store Order (TSO)
– Power/ARM: Relaxed Memory Ordering (RMO)

  • Weaker models allow for greater optimization opportunities

– Cost: More complicated semantics


SLIDE 29

Initial Algorithm: Weaknesses

  • Expensive to compute

– O(n³), assuming edges can be inserted in O(1) time
– Repeated iteratively until a fixed point is reached

  • Requires the transitive closure of the graph

– Expensive to store
– Capturing n² relationships (does vertex i reach vertex j?)

  • Adds lots of redundant edges

– Should leverage transitivity when possible


[Figure: vertices A, B, C illustrating a redundant transitive edge]

SLIDE 30

Reverse Time Vector Clocks (RTVCs)

  • vprocs provide implicit orderings


[Figure: vproc traces — ST[A] → 1; ST[B] → 91; ST[B] → 92]

SLIDE 31

Reverse Time Vector Clocks (RTVCs)

  • vprocs provide implicit orderings
  • Reverse Time Vector Clock

– Track the earliest successor from each vertex to each vproc

  • Bounds the number of reachable edges to be inspected by p, the number of vprocs

– No need to compute or store the transitive closure!


[Figure: vproc traces — ST[A] → 1; ST[B] → 91; ST[B] → 92]

SLIDE 32

Reverse Time Vector Clocks (RTVCs)

  • Track the earliest successor from each vertex to each vproc

– Captures transitivity

  • Traverse vprocs rather than the graph itself

– No need to check every reachable vertex

  • Bounds the number of reachable edges to be inspected by p, the number of vprocs

– No need to compute or store the transitive closure!


SLIDE 33

Superfluous work?

  • Our approach tends to add more edges than TSOtool, some of which are redundant

– Worst case: 36% additional edges
– The redundancy is well worth the performance benefits


SLIDE 34

Test Info

n = |V|     m = |E| (TSOtool)   m = |E| (Inferred)   Iterations   ST/LD/BAR (%)
2,097,963   3,799,254           4,487,224            5            76/24/0
2,098,219   3,686,624           4,411,887            4            79/21/0
1,977,832   4,453,340           5,179,108            5            46/53/1
2,097,741   3,875,831           4,635,852            7            77/23/0
1,936,321   5,109,990           5,236,671            5            44/54/2
2,098,321   2,491,062           4,257,077            6            80/20/0
2,097,809   4,321,793           4,404,753            7            78/21/1
1,871,831   3,660,617           4,861,044            6            44/54/2
2,097,809   4,434,120           4,418,555            5            80/20/0
4,195,405   6,934,725           9,338,902            7            76/23/1
4,194,961   7,960,567           8,963,281            6            78/22/0


SLIDE 35

Speedup over TSOtool (Inferring edges)

Graph Size     # of tests   Lazy RTVC   OMP 2    OMP 4    GPU
64K*4 = 256K   27           15.09x      29.31x   53.45x   57.90x
128K*4 = 512K  27           16.41x      31.49x   57.34x   76.98x
256K*4 = 1M    23           14.51x      27.98x   51.68x   72.32x
512K*4 = 2M    10           4.01x       7.52x    14.19x   42.90x
1M*4 = 4M      2            3.08x       5.70x    10.39x   45.16x


  • Number of tests decreases with test size because of industrial time constraints

– Motivation for this work

  • Avg. parallel speedups over our improved sequential approach:

– 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)

SLIDE 36

Problem Setup

  • Given an inst. trace from a simulator, RTL, or silicon
  • 1. Construct an initial graph

– Vertices represent load, store, and barrier insts
– Edges represent memory ordering, based on architectural rules

  • 2. Iteratively infer additional edges to the graph

– Based on existing relationships

  • 3. Check for cycles

– If one exists: contradiction!

[Figure: example ordering graph built from a two-CPU trace — operations ST[B] → 90, ST[B] → 92, LD[B] ← 92, LD[B] ← 93, LD[A] ← 2 across CPU 0 and CPU 1]
SLIDE 37

Importance of Scaling


  • 128K instructions per core
  • 512K total instructions

SLIDE 38

Importance of Scaling


  • 256K instructions per core
  • 1M total instructions