Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader - PowerPoint PPT Presentation

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader

Challenges of Design Verification • Contemporary hardware designs require mi mill llions ons of lines of RTL code – More lines of code written for verification than for the implementation itself • Tradeoff between performance and design complexity – Speculative execution, shared caches, instruction reordering – Performance wins out GTC 2016, San Jose, CA 2

Performance vs. Design Complexity • Programmer burden – Requires correct usage of synchronization • Time to market – Earlier remediation of bugs is less costly – Re-spins on tapeout are expensive • Significant time spent of verification – Verification techniques are often NP- complete GTC 2016, San Jose, CA 3

Memory Consistency Models • Contract between SW and HW regarding the semantics of memory operations • Classic example: Sequential Consistency (SC) – All processors observe the same ordering of operations serviced by memory – Too strict for modern optimizations/architectures • Nomenclature – ST[A] → 1 1 “Wrote a value of 1 to location A” – LD[B] ← 2 2 “Read a value of 2 from location B” GTC 2016, San Jose, CA 4

ARM Idiosyncrasies • Our focus: ARMv8 Mv8 • Speculative eculative Ex Execution cution is allowed • Threads can reo eorde rder rea eads ds an and wr writes tes – Assuming no dependency exists • Writes are not t guar aranteed anteed to to be be simu multaneo ltaneously usly vi visible ible to other cores GTC 2016, San Jose, CA 5

Problem Setup • Given an inst. trace from a simulator, RTL, or silicon 1. Construct an initial graph – Vertices represent load, store, CPU 0 and barrier insts ST[B] → 90 – Edges represent memory ordering ST[B] → 92 CPU 1 • Based on architectural rules 2. Iter 2. erati tive vely ly infer er add ddit itio ional LD[B] ← 92 LD[A] ← 2 edge ed ges to the e gr graph ph LD[B] ← 92 LD[B] ← 93 – Based on existing relationships 3. Check for cycles – If one exists: contradiction! GTC 2016, San Jose, CA 6

TSOtool • Hangal et al. , ISCA ’04 – Designed for SPARC, but portable to ARM • Each store writes a unique value to memory – Easily map a load to the store that wrote its data • Tradeoff between accuracy and runtime – Polynomial time, but false positives are possible – If a cycle is found, a bug indeed exists – If no cycles are found, execution appears ars consistent GTC 2016, San Jose, CA 7

Need for Scalability • Must run many tests to maximize coverage – Stress different portions of the memory subsystem • Longer tests put supporting logic in more interesting states – Many instructions are required to build history in an LRU cache, for instance • Using a CPU cluster does not suffice – The results of one set of tests dictate the structure of the ensuing tests – Faster tests help with interactivity! • Solution: Efficient algorithms and parallelism GTC 2016, San Jose, CA 8

Inferred Edge Insertions (Rule 6) • S can reach X S: ST[A] → 1 • X does not load data from S W: ST[A] → 2 X: LD[A] ← 2 GTC 2016, San Jose, CA 9

Inferred Edge Insertions (Rule 6) • S can reach X S: ST[A] → 1 • X does not load data from S W: ST[A] → 2 • S co comes mes be before ore th the node that stored X’s X: LD[A] ← 2 dat ata GTC 2016, San Jose, CA 10

Inferred Edge Insertions (Rule 7) • S can reach X • Loads read data from S, not X S: ST[A] → 1 L: LD[A] → 1 M: LD[A] → 1 X: ST[A] → 2 GTC 2016, San Jose, CA 11

Inferred Edge Insertions (Rule 7) • S can reach X • Loads read data from S, not X • Load ads s ca came me be befor ore e X S: ST[A] → 1 L: LD[A] → 1 M: LD[A] → 1 X: ST[A] → 2 GTC 2016, San Jose, CA 12

Initial Algorithm for Inferring Edges for_each(store vertex S) { for_each(reachable vertex X from S) //Getting this set is expensive! { if(location[S] == location[X]) { if((type[X] == LD) && (data[S] != data[X])) { //Add Rule 6 edge from S to W, the store that X read from } else if(type[X] == ST) { for_each(load vertex L that reads data from S) { //Add Rule 7 edge from L to X } } //End if instruction type is store } //End if location } //End for each reachable vertex } //End for each store GTC 2016, San Jose, CA 13

Virtual Processors (vprocs) • Split instructions from physical to virtual processors • Each vproc is sequentially consistent – Program order ↔ Memory order CPU 0 VPROC 0 VPROC 1 ST[B] → 91 ST[A] → 1 ST[B] → 91 ST[A] → 1 VPROC 2 ST[B] → 92 LD[A] ← 2 LD[A] ← 2 ST[B] → 92 GTC 2016, San Jose, CA 14

Reverse Time Vector Clocks (RTVC) Consider the RTVC of • CPU 0 ST[B] = 90 ST[B] → 90 Purple: ST[B] = 92 Blue: NULL ST[B] → 92 CPU 1 Green: LD[B] = 92 Orange: LD[B] = 92 LD[B] ← 92 LD[A] ← 2 • Track the earliest successor from each LD[B] ← 92 LD[B] ← 93 vertex to each vproc – Captures transitivity Complexity of inferring edges: 𝑃 𝑜 2 𝑞 2 𝑒 𝑛𝑏𝑦 GTC 2016, San Jose, CA 15

Updating RTVCs • Computing RTVCs once is fast – Process vertices in the reverse order of a topological sort – Check neighbors directly, then their RTVCs • Every time a new edge is inserted, RTVC values need to change – # of edge insertions ≈ 𝑛 • TSOtool implements both vprocs and RTVCs GTC 2016, San Jose, CA 16

Facilitating Parallelism • Repeatedly updating RTVCs is expensive – For 𝑙 edge insertions, RTVC updates take 𝑃(𝑙𝑞𝑜) time • 𝑙 = 𝑃 𝑜 2 , but usually is a small multiple of 𝑜 • Idea: Update RTVCs once per iteration rather than per edge insertion – For 𝑗 iterations RTVC updates take 𝑃(𝑗𝑞𝑜) time • 𝑗 ≪ 𝑙 (less than 10 for all test cases) – Less communication between threads • Complexity of inferring edges: 𝑃(𝑜 2 𝑞) GTC 2016, San Jose, CA 17

Correctness • Inferred edges found by our approach will not be the same as the edges found by TSOtool – Might not infer an edge that TSOtool does • RTVC for TSOtool can change mid-iteration – Might infer an edge that TSOtool does not • Our approach will have “stale” RTVC values • Both approaches make forward progress – Number of edges monotonic otonically ally increases eases • Any edge inserted by our approach could have been inserted by the naïve approach [Thm 1] • If TSOtool finds a cycle, we will also find a cycle [Thm 2] GTC 2016, San Jose, CA 18

Parallel Implementations • OpenMP – Each thread keeps its own partition of added edges – After each iteration of inferring edges, reduce • CUDA – Assign threads to each store instruction – Threads independently traverse the vprocs of this store – Atomically add edges to a preallocated array in global memory GTC 2016, San Jose, CA 19

Experimental Setup • Intel Core i7-2600K CPU – Quad core, 3.4GHz, 8MB LLC, 16GB DRAM • NVIDIA GeForce GTX Titan – 14 SMs, 837 MHz base clock, 6GB DRAM • ARM system under test – Cortex-A57, quad core • Instruction graphs range from 𝑜 = 2 18 to 𝑜 = 2 22 vertices, 𝑜 ≈ 𝑛 – Sparse, high-diameter, low-degree – Tests vary by their distribution of LD/ST/DMB instructions, # of vprocs, and inst dependencies GTC 2016, San Jose, CA 20

Importance of Scaling • 512K instructions per core • 2M total instructions GTC 2016, San Jose, CA 21

Speedup over TSOtool (Application) Graph Size # of tests Lazy RTVC OMP 2 OMP 4 GPU 64K*4 = 256K 27 5.64x 7.62x 9.43x 10.79x 128K*4 = 512K 27 5.31x 7.12x 8.90x 10.76x 256K*4 = 1M 23 6.30x 9.05x 12.13x 15.47x 512K*4 = 2M 10 3.68x 6.41x 10.81x 24.55x 1M*4 = 4M 2 3.05x 5.58x 9.97x 37.64x • GPU is always best; scales much better to larger tests • Extreme case: 9 hours rs using TSOtool → unde der 10 min inutes es using our GPU approach • Avg. Parallel speedups over our improved sequential approach: – 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU) GTC 2016, San Jose, CA 22

Summary • Relaxing the updates to RTVCs lead to a better sequential approach and and facilitated parallel implementations – Trade off between redundant work and parallelism • Faster execution leads to interactive bug-finding • The GPU scales well to larger problem instances – Helpful for corner case bugs that slip through pre-silicon verification • For the twelve largest test cases our GPU implementation achieves a 26.36x average application speedup GTC 2016, San Jose, CA 23

Acknowledgments • Shankar Govindaraju, and Tom Hart for their help on understanding NVIDIA’s implementation of TSOtool for ARM GTC 2016, San Jose, CA 24

Questions “ To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science .”– Albert Einstein GTC 2016, San Jose, CA 25

Backup GTC 2016, San Jose, CA 26

Sequential Consistency Examples • ST[x] → 1 handled before • Valid ST[x] → 2 P1: ST[x] → 1 P2: LD[x] ← 1 LD[x] ← 2 P3: LD[x] ← 1 LD[x] ← 2 P4: ST[x] → 2 t=0 t=1 t=2 • Writes propagate to P2 • Invalid and P3 in a different P1: ST[x] → 1 P2: LD[x] ← 1 LD[x] ← 2 order P3: LD[x] ← 2 LD[x] ← 1 – Valid for weaker memory P4: ST[x] → 2 t=0 t=1 t=2 models GTC 2016, San Jose, CA 27

Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader - PowerPoint PPT Presentation

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware designs require mi mill llions ons

CSR SpMV with guaranteed workload balance Merge-based Parallel Decomposition NVIDIA Research

CUSTOMER LOGO MEETING WITH CUSTOMER LOCAL, DATE GARLAND With over 240 years in the market

NORTH GARLAND MEDICAL CENTER 2300 LOOKOUT DRIVE, GARLAND, TX 19.35 AC 6.611 AC 42.0 AC LOOKOUT

Bojan Zdrnja, Nevil Brownlee and Duane Wessels Bojan Zdrnja , Nevil Brownlee and Duane

By: John McLaughlin June 2014 - Post Elect McLaughlin & Associates www.mclaughlinonline.com

Portable EXPath Portable EXPath Extension Functions Extension Functions Adam Retter Adam

Bank of America Merrill Lynch Bank of America Merrill Lynch Global Transportation

Signals, Information and Sampling Steve McLaughlin University of Edinburgh, 16 th January,

Geospatial I ntelligence at the Environm ental Protection Agency Casey McLaughlin, GISP

From Uncertainty to Belief: Inferring the Specification Within Stephen McLaughlin Stephen

shelter diversion COMBATING ANIMAL WELFARE . Jen Clarkson + Kelly McLaughlin MEET YOUR

T ennessee State Council Meeting Monday, April 16, 2018 2:00PM CST Webinar 1 SUCCESSFUL

Council Meeting Wednesday, May 1, 2019 10:00 AM Central Webinar SUCCESSFUL EDUCATIONAL

Surface Simplification Using Quadric Error Metrics Michael Garland and Paul S. Heckbert 1 The

GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN * , INDRANI PAUL , JOSEPH GREATHOUSE , SRILATHA MANNE

What is the Welfare State? A Sociological Restatement Professor David Garland Professor Nicola

2. . 3. . 4. .. .. 11. Reasoning with respect to Time 2 U NDERSTANDING T IME IN T EXT

Medical T ext Data Sendong (Stan) Zhao + , Meng Jiang * , Ming Liu + , Bing Qin + , Ting Liu + +

Calibrating misspecified ERGMs for Bayesian inference Nial Friel University College Dublin

of a Single Logical Form for Inference in Court AI & Evidentiary Inference Workshop in

Modeling (Salons 6 & 7) Mathematcal psychology in the wild - why and how? Insights from

Decision Procedures and Verifjcation NAIL094 Petr Kuera Charles University 2019/20 (6th

Implementing CIDOC CRM Search Based on Fundamental Relations and OWLIM Rules Vladimir Alexiev,

Chapter 2: Typicality and the Classical View of Categories G. Murphy (2002) The Big Book of

Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader - PowerPoint PPT Presentation

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware designs require mi mill llions ons

CSR SpMV with guaranteed workload balance Merge-based Parallel Decomposition NVIDIA Research

CUSTOMER LOGO MEETING WITH CUSTOMER LOCAL, DATE GARLAND With over 240 years in the market

NORTH GARLAND MEDICAL CENTER 2300 LOOKOUT DRIVE, GARLAND, TX 19.35 AC 6.611 AC 42.0 AC LOOKOUT

Bojan Zdrnja, Nevil Brownlee and Duane Wessels Bojan Zdrnja , Nevil Brownlee and Duane

By: John McLaughlin June 2014 - Post Elect McLaughlin &amp; Associates www.mclaughlinonline.com

Portable EXPath Portable EXPath Extension Functions Extension Functions Adam Retter Adam

Bank of America Merrill Lynch Bank of America Merrill Lynch Global Transportation

Signals, Information and Sampling Steve McLaughlin University of Edinburgh, 16 th January,

Geospatial I ntelligence at the Environm ental Protection Agency Casey McLaughlin, GISP

From Uncertainty to Belief: Inferring the Specification Within Stephen McLaughlin Stephen

shelter diversion COMBATING ANIMAL WELFARE . Jen Clarkson + Kelly McLaughlin MEET YOUR

T ennessee State Council Meeting Monday, April 16, 2018 2:00PM CST Webinar 1 SUCCESSFUL

Council Meeting Wednesday, May 1, 2019 10:00 AM Central Webinar SUCCESSFUL EDUCATIONAL

Surface Simplification Using Quadric Error Metrics Michael Garland and Paul S. Heckbert 1 The

GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN * , INDRANI PAUL , JOSEPH GREATHOUSE , SRILATHA MANNE

What is the Welfare State? A Sociological Restatement Professor David Garland Professor Nicola

2. . 3. . 4. .. .. 11. Reasoning with respect to Time 2 U NDERSTANDING T IME IN T EXT

Medical T ext Data Sendong (Stan) Zhao + , Meng Jiang * , Ming Liu + , Bing Qin + , Ting Liu + +

Calibrating misspecified ERGMs for Bayesian inference Nial Friel University College Dublin

of a Single Logical Form for Inference in Court AI &amp; Evidentiary Inference Workshop in

Modeling (Salons 6 &amp; 7) Mathematcal psychology in the wild - why and how? Insights from

Decision Procedures and Verifjcation NAIL094 Petr Kuera Charles University 2019/20 (6th

Implementing CIDOC CRM Search Based on Fundamental Relations and OWLIM Rules Vladimir Alexiev,

Chapter 2: Typicality and the Classical View of Categories G. Murphy (2002) The Big Book of

By: John McLaughlin June 2014 - Post Elect McLaughlin & Associates www.mclaughlinonline.com

of a Single Logical Form for Inference in Court AI & Evidentiary Inference Workshop in

Modeling (Salons 6 & 7) Mathematcal psychology in the wild - why and how? Insights from