
Correctness Checking Concepts and Tools for High Performance Computing
Ganesh Gopalakrishnan, Wei-Fan Chiang, and Alexey Solovyev
School of Computing, University of Utah, Salt Lake City, UT 84112
URL: http://www.cs.utah.edu/fv
Supported by


  1.–2. A “feel” of HPC Correctness
  • Constant pressure: the “most science per dollar”
  • Many dimensions of correctness
  • HPC explores unknown aspects of the sciences
  • Algorithmic approximations are often made
  • Growing heterogeneity in HPC platforms (e.g., the RIKEN K machine)
  • Floating-point representation is inexact
  • “Bit flips”
  • Correctness training is lacking
  • Busy enough doing science
  • Finding and keeping “Pi men” is difficult
  • It always makes sense to switch to the latest HW
  • HPC is often the poorest-documented of the sciences (Lazowska)
  (A build slide adds “FM” beside “HPC”: our twist, HPC + formal methods.)

  3.–5. A Heterogeneity-induced bug (Berzins, Meng, Humphrey, XSEDE’12)

  P = 0.421874999999999944488848768742172978818416595458984375
  C = 0.0026041666666666665221063770019327421323396265506744384765625
  Compute: floor(P / C)

  Xeon:     P / C = 161.9999…,  floor(P / C) = 161  (expecting 161 msgs)
  Xeon Phi: P / C = 162,        floor(P / C) = 162  (sent 162 msgs)

  • Authors’ fix: used double precision for P / C
  • Question: is there a more deft solution?
  • More important question: what exactly went wrong?? (the XSEDE’12 authors moved along…; see the sketch below)
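
  An aside on why floor(P / C) is so fragile here: P is the double just below 27/64 and C is the double just below 1/384, so the exact quotient sits a hair under 162, right at a rounding boundary. The sketch below (ours, not the authors' code; it assumes an x86-64 host where long double is 80-bit extended) shows one machine producing both answers, depending only on the precision of the division:

      #include <cstdio>
      #include <cmath>

      int main() {
          // The two constants from the slide; both are exactly representable doubles.
          double P = 0.421874999999999944488848768742172978818416595458984375;
          double C = 0.0026041666666666665221063770019327421323396265506744384765625;

          // The exact quotient is a hair below 162. Dividing in double rounds
          // it up to exactly 162.0; dividing in 80-bit extended keeps it below.
          double      q64 = P / C;
          long double q80 = (long double)P / (long double)C;

          printf("double:      P/C = %.17g,  floor = %g\n",  q64, std::floor(q64));
          printf("long double: P/C = %.21Lg, floor = %Lg\n", q80, std::floor(q80));
          return 0;
      }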

  6. Resilience
  • ~7B transistors per GPU (and many billions for CPUs), plus a ton of memory
  • 10^18 transistors throbbing at GHz for weeks
  • Some bit changes MUST be unplanned ones
  • In HPC, results combine more (than, say, in the “cloud”)
  • “Bit flip” is a catch-all term for:
    • High speed-variability of devices, coupled with DVFS jitter
    • Local hot spots that develop, aging the chip electronics
    • Particle strikes
  • Energy is the main currency
  • Some of the energy-saving “games” that must be played (this invites bit flips):
    • Dynamic slack detection, followed by lowering voltage + frequency
    • One PNNL study (Kevin Barker): 36 KW -> 18 KW

  7. Our Position (1)
  • Despite “bit flips” and such, it is amply clear that sequential and concurrency bugs still ought to be our principal focus
    • They occur quite predictably (unlike bit flips)
    • They are something we can control (and eliminate in many cases)

  8. Our Position (2)
  • Unless we can debug in the small, there is NO WAY we can debug in the large

  9. Our Observations (3)
  • There are SO MANY instances where experts are getting it wrong — and spreading the wrong ideas

  10.–11. Example-1
  • IBM documentation: “If you debug your MPI program under zero Eager Limit (buffering for MPI sends), then adding additional buffering does not cause new deadlocks”
  • It can! (A sketch follows.)
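
  A minimal three-rank sketch of how extra buffering creates a deadlock (our own illustration, not IBM's example). Under zero buffering, rank 0's first send must rendezvous before its second, so rank 1's wildcard receive can only match rank 0. With eager buffering, both of rank 0's sends complete immediately, rank 2's send races in, and if the wildcard consumes it, rank 1's final receive waits forever:

      #include <mpi.h>

      int main(int argc, char **argv) {
          int rank, x = 0;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          if (rank == 0) {
              MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  // S1: to rank 1
              MPI_Send(&x, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);  // S2: to rank 2
          } else if (rank == 2) {
              MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  // S3: to rank 1
          } else if (rank == 1) {
              // Zero buffering: S3 cannot be posted yet (rank 2 still waits on S2,
              // which follows S1), so this wildcard MUST match S1. With buffering,
              // S1 and S3 race; if S3 wins here...
              MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              // ...this receive waits for a second message from rank 2: deadlock.
              MPI_Recv(&x, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          }
          MPI_Finalize();
          return 0;
      }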


  12. Example-2
  • A reduction kernel given as an early-chapter example of a recent CUDA book is broken
  • Reason: it assumes that CUDA atomic-add has “fence” semantics (the idiom is sketched below)
  • An erratum has been issued on the book’s website
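
  The idiom at issue, reconstructed below as a sketch (the general pattern, not the book's code): each block writes a partial sum to global memory, then uses an atomic counter to elect the last block to combine them. CUDA's atomicAdd does not order ordinary memory accesses, so without the __threadfence() the elected block may read stale partial sums:

      __device__ unsigned int blocksDone = 0;   // reset to 0 before each launch

      __global__ void reduceSum(const float *in, float *partial, float *out, int n) {
          extern __shared__ float sdata[];      // launch with blockDim.x floats
          float sum = 0.0f;
          for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
               i += blockDim.x * gridDim.x)
              sum += in[i];
          sdata[threadIdx.x] = sum;
          __syncthreads();
          for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree-reduce the block
              if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
              __syncthreads();
          }
          __shared__ bool amLast;
          if (threadIdx.x == 0) {
              partial[blockIdx.x] = sdata[0];
              __threadfence();  // make partial[] visible to other blocks BEFORE
                                // signaling; atomicAdd alone does not do this
              amLast = (atomicAdd(&blocksDone, 1u) == gridDim.x - 1);
          }
          __syncthreads();
          if (amLast && threadIdx.x == 0) {     // the elected last block combines
              float total = 0.0f;
              for (unsigned b = 0; b < gridDim.x; ++b) total += partial[b];
              *out = total;
          }
      }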

  13. Example-3
  • A work-stealing queue in “GPU Gems” is incorrect
  • Reason: it assumes “store-store” ordering between two sequentially issued stores; a fence is needed in between (sketch below)
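
  The failure mode is generic to any single-producer queue, sketched below (our illustration, not the GPU Gems code): push writes the task record, then publishes the new tail with a second store. Without a fence between the two, a thief that sees the new tail may still read a stale task:

      struct Task { int payload; };

      struct WorkQueue {                 // single producer assumed
          Task tasks[1024];
          volatile unsigned tail;        // published index; thieves poll this

          __device__ void push(const Task &t) {
              unsigned i = tail;
              tasks[i] = t;        // store #1: the task record
              __threadfence();     // REQUIRED: order store #1 before store #2;
                                   // two plain stores may be observed reordered
              tail = i + 1;        // store #2: publish the new tail
          }
      };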

  14. A feature of GPU programming
  • Programmers face concurrency corner-cases quite frequently
  • As opposed to (e.g.) OS programming, where low-level concurrency is usually hidden inside the kernel

  15. Example-4
  • Claim: “If your code ran correctly in FORTRAN, it will also run correctly in C”

  16. Example-4 invalidated  [figure]

  17.–18. Example-5
  • Simple questions can’t be answered by today’s tools
  • Does this program deadlock? (Yes.)
  [Program shown as an image; a build slide adds a “Match” annotation to the figure.]

  19.–24. Example-6 : Does Warp-Synchronous Programming Help Avoid a __syncthreads? (build slides consolidated)

  Initially: x[i] == y[i] == i;  warp size = 32

  __global__ void kernel(int *x, int *y)
  {
    int index = threadIdx.x;
    y[index] = x[index] + y[index];
    if (index != 63 && index != 31)
      y[index + 1] = 1111;
  }

  Expected answer: 0, 1111, 1111, …, 1111, 64, 1111, …
  (threads 31 and 63 skip the second write, so y[32] keeps x[32] + y[32] = 64)

  • The hardware schedules these instructions in “warps” (SIMD groups)
  • However, this “warp view” often appears to be lost (e.g., when compiling with optimizations)
    New answer: 0, 2, 4, 6, 8, …
  • But if you read the CUDA documentation carefully, you notice you had to declare x[] and y[] as C volatile, which restored the “correct” answers! (the variant is sketched below)
  • But the ability to “rescue the correct answer” is no longer a guarantee (since CUDA 5.0)
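
  For reference, the volatile variant that the pre-5.0 documentation pointed to is just a qualifier change (a sketch; volatile forces every access to memory, which happened to restore the warp-lockstep result):

      __global__ void kernel(volatile int *x, volatile int *y)
      {
          int index = threadIdx.x;
          y[index] = x[index] + y[index];   // each load/store now touches memory
          if (index != 63 && index != 31)
              y[index + 1] = 1111;
      }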

  25. So you really trust your compilers?
  • Talk to Prof. John Regehr of Utah
  • Csmith: differential testing of compilers
  • The single most impressive compiler-testing work (IMHO) in recent times
  • Has found goof-ups in -O0 for short programs
  • Many bugs around C volatiles
  • Learned that NOTHING is known about how compilers (ought to) treat floating-point

  26. Without swift action, the “din” of the blind leading the blind will sow more confusion
  • Some threads offer advice such as “use volatiles” (was in early CUDA documentation; gone since 5.0)
  • Others advocate the use of __syncthreads (barriers), or querying device registers to learn the warp size:
    https://devtalk.nvidia.com/default/topic/512376/
    https://devtalk.nvidia.com/default/topic/499715/
    https://devtalk.nvidia.com/default/topic/382928/
  • And several threads simply discuss the issue:
    https://devtalk.nvidia.com/default/topic/632471
    https://devtalk.nvidia.com/default/topic/377816/
  • There isn’t a comprehensive picture of the dos and don’ts, and WHY

  27. Discussions on “warp-synchronous” code
  https://devtalk.nvidia.com/default/topic/499715/are-threads-of-a-warp-really-sync-/?offset=2

  28. Example-8 : Do GPUs obey coherence? (Coherence = per-location sequential consistency)
  • Ask me after the talk……. :)
  • We are stress-testing real GPUs
  • and finding things out (sketch below)
  • (work is inspired by Bill Collier, who called it “X-raying real machines” in his famous RAPA book)
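
  A flavor of those stress tests, as a minimal sketch (ours, not the actual test harness; the kernel name coRR is made up): coherence, i.e. per-location SC, forbids one observer from seeing two same-location writes out of order, so any run ending with r0 == 2 and r1 == 1 is a violation:

      // CoRR litmus: two reads of one location in a single observer thread.
      __global__ void coRR(volatile int *x, int *r0, int *r1) {
          if (blockIdx.x == 0 && threadIdx.x == 0) {
              *x = 1;          // write w1
              *x = 2;          // write w2, after w1 in program order
          } else if (blockIdx.x == 1 && threadIdx.x == 0) {
              int a = *x;      // read r0
              int b = *x;      // read r1
              *r0 = a;
              *r1 = b;         // outcome (r0, r1) == (2, 1) breaks coherence
          }
      }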

  29. Our (humble) suggestion
  • There is NO WAY the complexity of anything can be conquered without mathematics
  • The complexity of debugging needs the “mathematics of debugging” — the true mathematics of Software Engineering
    • i.e., formal methods
  • We must develop the “right kind” of formal methods:
    • Coexist with the grubby
    • Take on problems in context
    • Win practitioner friends early — and KEEP THEM

  30. What is hard about HPC Concurrency?
  • The scale of concurrency, and the number of interacting APIs
    • MPI-2, MPI-3, OpenMP, CUDA, OpenCL, OpenACC, PThreads, use of non-blocking data structures, dynamic scheduling
    • Each API thinks it “owns” the machine
  • The everyday programmer’s exposure to low-level concurrency is a worrisome reality!
  • Memory consistency models matter (litmus sketch below)
    • They govern visibility across threads / fences
    • Yet they are very poorly specified / understood
  • Compiler optimizations — not even basic studies exist
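
  As a concrete taste of “memory consistency models matter”, here is the classic store-buffering litmus test (our illustration, in host C++): with relaxed atomics both threads may read 0, an outcome sequential consistency forbids:

      #include <atomic>
      #include <cstdio>
      #include <thread>

      std::atomic<int> x{0}, y{0};
      int r0 = -1, r1 = -1;

      int main() {
          // Store buffering (SB): r0 == 0 && r1 == 0 is allowed under relaxed
          // ordering (and observable on real hardware), forbidden under seq_cst.
          std::thread t0([] { x.store(1, std::memory_order_relaxed);
                              r0 = y.load(std::memory_order_relaxed); });
          std::thread t1([] { y.store(1, std::memory_order_relaxed);
                              r1 = x.load(std::memory_order_relaxed); });
          t0.join(); t1.join();
          printf("r0 = %d, r1 = %d%s\n", r0, r1,
                 (r0 == 0 && r1 == 0) ? "  (non-SC outcome!)" : "");
          return 0;
      }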

  31. Is there a role for Formal Methods?
  • Yes indeed!
  • For instance, why is it that microprocessors don’t do “Pentium FDIV” any more?
    • Processor ALUs have only become even more complex
    • Answer: formal methods get serious use in industry
      • Intel: Symbolic Trajectory Evaluation
      • Others: similar methods
  • Processors get FV, to varying degrees, for other subsystems
    • E.g., cache coherence (at the protocol level)

  32. Is there a role for Formal Methods?
  • Yes indeed!
  • There is a fascinating array of correctness challenges
  • Very little involvement from the mainstream CS side
    • Lack of exposure; limited interactions across departments
  • Need “cool show-pieces” to draw students to HPC research…

  33. An example “cool project”
  • The Utah Pi “cluster”, built by Utah PhD students “Mo” Mohammed Saeed Al Mahfoudh and Simone Atzeni
  • (Under $500; runs MPI, Habanero Java, …)

  34.–35. Anyone wanting to do software testing for concurrency must slay two exponentials


  36. An FM grab-bag for anyone wanting to debug concurrent programs
  • Slay the input-space exponential using:
    • Symbolic execution
  • Slay the schedule-space exponential by:
    • Not jiggling schedules that are happens-before equivalent

  37. Not Exploring HB-Equivalent Schedules  [figure]

  38. An FM grab-bag for anyone wanting to debug concurrent programs
  • Concepts in the fuel-tank must include:
    • Lamport’s “happens-before” (vector-clock sketch below)
      • Define concurrency coverage using it
      • Design active-testing methods that systematically explore the schedule space
    • Memory consistency models
    • Data races and how to detect them
    • Symbolic execution
      • Helps achieve input-space coverage
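
  The happens-before relation in that list is computable on the fly with vector clocks. A toy model (ours, not any particular tool): each thread carries a clock, a release/acquire pair transfers it, and two same-location accesses race iff neither timestamp is <= the other:

      #include <algorithm>
      #include <array>
      #include <cstdio>

      constexpr int N = 2;                  // threads in this toy model
      using VC = std::array<int, N>;

      bool leq(const VC &a, const VC &b) {  // a happens-before (or equals) b
          for (int i = 0; i < N; ++i) if (a[i] > b[i]) return false;
          return true;
      }

      int main() {
          VC t[N] = {{{1, 0}}, {{0, 1}}};   // per-thread clocks
          VC lock{};                        // clock carried by a mutex

          VC w1 = t[0];                     // thread 0 writes x: timestamp w1
          lock = t[0];                      // thread 0 releases the lock
          t[0][0]++;

          for (int i = 0; i < N; ++i)       // thread 1 acquires the lock:
              t[1][i] = std::max(t[1][i], lock[i]);   // join the clocks
          t[1][1]++;
          VC w2 = t[1];                     // thread 1 writes x: timestamp w2

          bool race = !leq(w1, w2) && !leq(w2, w1);   // HB-unordered accesses?
          printf("race? %s\n", race ? "yes" : "no");  // "no": ordered via lock
          return 0;
      }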

  39. Overview of our (active) projects
  • HPC concurrency
    • Dynamic verification methods for MPI: CACM, Dec 2011
    • GPU data-race checking: PPoPP’12, SC’12, SC’14
  • Floating-point
    • Finding inputs that cause the highest relative error (“sour spot search”): PPoPP’14
  • Detecting and root-causing non-determinism
    • Pruner project at LLNL: combined static / dynamic analysis for OpenMP race checking
  • System resilience
    • We have developed an LLVM-level fault injector called KULFI
    • Using coalesced stack trace graphs to highlight behavioral differences
  • Our main focus continues to be correctness tools for HPC concurrency

  40. Biggest Gain due to Formal Methods: Conceptual Cohesion!
  • Example: helps understand that concurrency and sequential abstractions tessellate
  • Helps understand that Sequential == Deterministic
  • Helps understand data races as breaking the sequential contract

  41. Concurrency and Sequential Abstractions Tessellate
  • Solving A x = B
  • Shared-memory or message-passing based parallelism
  • Sequential program abstractions (e.g., the ISA)
  • Concurrent state machines using gates and flops
  • Sequential view of Boolean functions (gates)
  • Fine-grained concurrency of transistor-level circuits

  42. Why Fixate on Data Races?
  • The key assumption that enables sequential thinking
  • Sequential almost always means deterministic
    • In an out-of-order CPU, nothing is sequential
    • Yet we think of assembly programs as “sequential”, only because they yield deterministic results
  • Create hazards (say, in a time-sensitive way), and we lose this sequential / deterministic abstraction
  • Parallel programming almost always strives to produce sequential, i.e. deterministic, outcomes!

  43.–45. Races and Race-Free, Generalized (the abstraction ladder, annotated per level across three build slides)
  • Solving A x = B
  • Shared-memory or message-passing based parallelism
    • Data races break sequential consistency here (unsynchronized interleavings matter)
  • Sequential program abstractions (e.g., the ISA)
  • Concurrent state machines using gates and flops
    • Races between clocks and data break the sequential abstraction here
  • Sequential view of Boolean functions (gates)
    • Critical races give gates that spike here (broken Boolean abstraction)
  • Fine-grained concurrency of transistor-level circuits

  46. Results on UT Lonestar Benchmarks  [results figure]

  47. Results on UIUC Parboil Benchmarks  [results figure]

  48. Uintah: A Scalable Computational Framework for Multi-physics Problems
  • Under continuous development over the past decade
    • Scalability to 700K CPU cores possible now
    • ~1M LOC or more
  • Modular extensibility to accommodate GPUs and Xeon Phis
  • Partitions concerns:
    • The app developer writes sequential apps
    • The infrastructure developer tunes / improves performance

  49. Uintah Organization
  • Application packages: ICE, MPM, ARCHES
  • Abstract directed acyclic task graph (tasks t1 … t13)
  • Runtime system: Simulation Controller, Load Balancer, Scheduler

  50. Case Study: Data Warehouse Error
  • Collect coalesced call-paths leading to DW::put()
  • Diffed across two scheduler versions to isolate the bug
  [Figure: coalesced stack trace graph diff; node labels not recoverable]

  51. Conceptual view of Uintah equipped with a monitoring network (future work)
  Recoverable labels from the architecture diagram:
  • Hierarchical task graph; task compilation and monitoring using high-level standardized interfaces
  • Scheduler internals: internal / external ready queues, device-enabled queue, post MPI receives, check MPI receives, CPU sends, task completed, host-to-device and device-to-host copies
  • Active testing to generate salient internal events; internal events to cross-check
  • Automaton learning from traces to trigger CSTG collection (./sus); tailor learning for hybrid concurrency events
  • Build cross-layer monitoring hierarchies; derive system control invariants to document + debug via CSTGs
  • Static analysis of the data warehouse (DWH) and scheduler helps refine CSTGs
  [Figure: annotated CSTG with nodes such as AMRSim::run, AMRSim::executeTimestep, AMRSim::doInitialTimestep, DW::override, MPIScheduler::execute, UnifiedScheduler::execute, MPIScheduler::runTask, UnifiedScheduler::runTask, MPIScheduler::initiateReduction, DetailedTask::doit, MPIScheduler::runReductionTask, Task::doit, DW::reduceMPI]

  52. Concluding Remarks
  • Slaying bugs in HPC is essential for exascale
  • We need a mix of methods, from empirical to formal
    • Formal helps with concurrency coverage
    • Formal helps write clear, unambiguous, and validated specs
    • …and helps us educate sure-footedly

  53. Thanks!
  • www.cs.utah.edu/fv
  • Thanks to my former students, who have taught me everything I know about FV and its relevance in industry

  54.–55. The rest of the talk
  • Some results in GPU data-race checking
    • Demo of symbolic execution and GKLEE
    • Data race detection in GPU programs
  • Computational frameworks
    • Uintah
    • How coalesced stack trace graphs help debug
  • Other projects: floating-point correctness and system resilience
  • Concluding remarks


  56. The key to data race checking
  • For the most part, CUDA code is synchronized via barriers (__syncthreads)
  • Thus, explore one “canonical” interleaving, hoping to detect the “first race” if there is any race (a sketch of the check follows)
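
  A minimal model of that check (our sketch, not GKLEE's implementation): run the threads of each barrier interval in one canonical order, log reads and writes per address, and flag a race whenever two different threads touch an address with at least one write before the next barrier:

      #include <cstdio>
      #include <map>
      #include <set>

      struct IntervalLog {                  // one barrier interval's accesses
          std::map<int, std::set<int>> readers, writers;   // addr -> thread ids

          void read (int tid, int addr) { readers[addr].insert(tid); check(addr); }
          void write(int tid, int addr) { writers[addr].insert(tid); check(addr); }

          void check(int addr) {
              auto &w = writers[addr];
              auto &r = readers[addr];
              // Race: two distinct writers, or a writer plus a distinct reader.
              if (w.size() > 1 ||
                  (!w.empty() && (r.size() > 1 ||
                   (r.size() == 1 && *r.begin() != *w.begin()))))
                  printf("race on address %d in this barrier interval\n", addr);
          }
          void barrier() { readers.clear(); writers.clear(); }  // __syncthreads
      };

      int main() {
          IntervalLog log;
          // Canonical order: thread 0 fully runs, then thread 1 (y[i] = y[i+1]).
          log.read (0, 1);  log.write(0, 0);
          log.read (1, 2);  log.write(1, 1);   // conflicts with t0's read of 1
          log.barrier();
          return 0;
      }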

  57.–59. Interleaving Exploration / Finding Representative Interleavings
  • For example: if the green dots are local thread actions, then all schedules that arrive at the “cut line” are equivalent!

  60.–61. GKLEE Examines a Canonical Schedule
  • Instead of considering all schedules and all potential races…
  • …consider JUST THIS SINGLE CANONICAL SCHEDULE!!
  • Folk theorem (proved in our paper): “We will find A RACE if there is ANY race”!!

  62.–64. An Example with Two Data Races (code shown as an image; build slides consolidated; reconstruction below)
  • The “classic race”: threads i and i+1 race
  • The “porting race”: the evaluation order between divergent warps is unspecified
    • Not explained in any CUDA book as a race
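
  The slide's code is an image; the “classic” pattern it refers to looks roughly like this reconstruction (ours): thread i's read of a[i+1] races with thread i+1's write of the same element, and once the threads span more than one warp the outcome also depends on which warp runs first (the “porting race”):

      // Each thread shifts its right neighbor's value one slot to the left.
      __global__ void shiftLeft(int *a, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n - 1)
              a[i] = a[i + 1];  // thread i READS a[i+1] while thread i+1 WRITES it
      }
      // A fix (within one block): read into a temporary, __syncthreads(),
      // then write; that is, exactly the barrier "warp-synchronous" style skips.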

  65.–67. GKLEE’s steps (build slides consolidated)
  • Symbolic execution
  • Compute conflicts, and solve for races

  68. GKLEE of PPoPP 2012
  • Input: C++ CUDA programs with symbolic variable declarations
  • Pipeline: LLVM-GCC → LLVM byte-code → symbolic analyzer and scheduler (executes the LLVM instructions)
  • Error monitors: deadlocks, data races, bank conflicts, warp divergences, non-coalesced accesses
  • Output: concrete test inputs / test cases that provide high coverage and can be run on HW

  69.–70. The advantage of a symbolic-execution-based GPU race checker: it produces concrete witnesses!

  __global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
    const int threadPos = ((threadIdx.x & (~63)) >> 0)
                        | ((threadIdx.x & 15) << 2)
                        | ((threadIdx.x & 48) >> 4);
    ...
    __syncthreads();
    for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
         pos += IMUL(blockDim.x, gridDim.x)) {
      unsigned data4 = d_Data[pos];
      ...
      addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
    }
    __syncthreads(); ...
  }

  inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)
  { s_Hist[threadPos + IMUL(data, THREAD_N)]++; }

  “GKLEE: Is there a race?”
  GKLEE: Threads 5 and 13 have a WW race when d_Data[5] = 0x04040404 and d_Data[13] = 0. (Arithmetic check below.)
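
  A quick check of that witness, assuming THREAD_N = 32 (our assumption; the slide does not state it): threadPos only permutes bits of threadIdx.x, and the bin offset is data * THREAD_N, so two different threads can address the same s_Hist byte:

      // threadPos(5)  = ((5  & 15) << 2) | ((5  & 48) >> 4) = 20
      // threadPos(13) = ((13 & 15) << 2) | ((13 & 48) >> 4) = 52
      // thread 5:  data = (0x04040404 >> 26) & 0x3F = 1 -> s_Hist[20 + 1*32] = s_Hist[52]
      // thread 13: data = (0x00000000 >> 26) & 0x3F = 0 -> s_Hist[52 + 0*32] = s_Hist[52]
      // Both increment the same byte with no synchronization: a WW race.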
