Formal Analysis Techniques for GPU kernels Nathan Chong - - PowerPoint PPT Presentation

formal analysis techniques for gpu kernels
SMART_READER_LITE
LIVE PREVIEW

Formal Analysis Techniques for GPU kernels Nathan Chong - - PowerPoint PPT Presentation

Formal Analysis Techniques for GPU kernels Nathan Chong (nyc04@imperial.ac.uk) Leap Conference, 22 May 2013 1 Reports and Articles Social Processes and Proofs of Theorems and Programs Richard A. De Millo Georgia Institute of Technology


slide-1
SLIDE 1

Formal Analysis Techniques for GPU kernels

Nathan Chong (nyc04@imperial.ac.uk) Leap Conference, 22 May 2013

1

slide-2
SLIDE 2

Reports and Articles

Social Processes and Proofs of Theorems and Programs

Richard A. De Millo Georgia Institute of Technology Richard J. Lipton and Alan J. Perlis Yale University

It is argued that formal verifications of programs, no matter how obtained, will not play the same key role in the development of computer science and software engineering as proofs do in mathematics. Furthermore the absence of continuity, the inevitability of change, and the complexity of specification of significantly many real programs make the formal verification process difficult to justify and manage. It is felt that ease of formal verification should not dominate program language design. Key Words and Phrases: formal mathematics, mathematical proofs, program verification, program specification CR Categories: 2.10, 4.6, 5.24

I should like to ask the same question that Descartes asked. You are proposing to give a precise definition of logical correctness which is to be the same as my vague intuitive feeling for logical

  • correctness. How do you intend to show that they are the same?

... The average mathematician should not forget that intuition is the final authority.

  • J. Barkley Rosser

Many people have argued that computer program- ming should strive to become more like mathematics. Maybe so, but not in the way they seem to think. The aim of program verification, an attempt to make pro- gramming more mathematics-like, is to increase dramat- ically one's confidence in the correct functioning of a piece of software, and the device that verifiers use to achieve this goal is a long chain of formal, deductive

  • logic. In mathematics, the aim is to increase one's con-

fidence in the correctness of a theorem, and it's true that

“It is argued that formal verifications of programs, no matter how obtained, will not play the same key role in the development

  • f computer science and software

engineering as proofs do in mathematics”

2

slide-3
SLIDE 3

Verification as a powerful and practical complement to Testing

3

slide-4
SLIDE 4

“It was a real bug, and it caused real issues in the results. It took significant debugging time to find the problem.” Lars Nyland (Senior Architect, NVIDIA)

4

slide-5
SLIDE 5

Schedule

5

  • Data races and Barrier Divergence
  • Examples, Examples, Examples
  • Anatomy of GPUVerify
  • Further Examples
  • Close and Questions
slide-6
SLIDE 6

Data Races and Barrier Divergence

6

slide-7
SLIDE 7

gpu

global memory local memory

host cpu

7

slide-8
SLIDE 8

global memory local memory

8

slide-9
SLIDE 9

global memory local memory

9

slide-10
SLIDE 10

global memory local memory

X intra- group race

10

slide-11
SLIDE 11

global memory local memory

X inter- group race

11

slide-12
SLIDE 12

__kernel void add_nbor(__local int *A, int offset) { int tid = get_local_id(0); A[tid] += A[tid+offset]; }

12

slide-13
SLIDE 13

s s+offset __kernel void add_nbor(__local int *A, int offset) { int tid = get_local_id(0); A[tid] += A[tid+offset]; }

13

slide-14
SLIDE 14

s s+offset __kernel void add_nbor(__local int *A, int offset) { int tid = get_local_id(0); A[tid] += A[tid+offset]; }

14

slide-15
SLIDE 15

t t+offset s s+offset t __kernel void add_nbor(__local int *A, int offset) { int tid = get_local_id(0); A[tid] += A[tid+offset]; }

15

slide-16
SLIDE 16

t t+offset s s+offset t X __kernel void add_nbor(__local int *A, int offset) { int tid = get_local_id(0); A[tid] += A[tid+offset]; }

16

slide-17
SLIDE 17

__kernel void diverge() { int tid = get_local_id(0); if (tid == 0) barrier(); else barrier(); }

17

slide-18
SLIDE 18

18

If barrier is inside a conditional statement, then all threads must enter the conditional if any thread enters the conditional statement and executes the barrier. If barrier is inside a loop, all threads must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier. OpenCL Specification (6.12.8 Synchronization Functions)

slide-19
SLIDE 19

Reduction Demo

19

0,2,4,6 1,3,5,7 SUM 0,4 1,5 2,6 3,7 1 2 3 4 5 6 7

slide-20
SLIDE 20

Examples, Examples, Examples

20

slide-21
SLIDE 21

Be Skeptical

21

  • Is the verification easier or harder than

building a test harness?

  • A common or rare type of bug?
  • The impact of not catching this bug
  • Limitations of technique
slide-22
SLIDE 22

1 Races

22

slide-23
SLIDE 23

t t+offset s s+offset t X __kernel void add_nbor(__local int *A, int offset) { int tid = get_local_id(0); A[tid] += A[tid+offset]; }

23

slide-24
SLIDE 24
  • Run GPUVerify on nbor.cl
  • Can you fix the datarace?
  • Does GPUVerify like your fix?
  • Are there more problems with this kernel?

24

$ cd 1_simple_race $ gpuverify --local_size=8 --num_groups=1 nbor.cl

slide-25
SLIDE 25

Lessons

  • GPUVerify can find possible data races,

giving a counterexample for you to evaluate

  • By fixing bugs, you increase your

confidence in the verification result

  • But still, the verification is limited. For

example, we don’t prove absence of array- bounds or functional correctness

25

slide-26
SLIDE 26

2 Benign Races

26

slide-27
SLIDE 27

__kernel void allsame(__local int *p, int val) { *p = val; }

27

slide-28
SLIDE 28
  • Run GPUVerify on allsame.cl
  • Try adding “--no-benign” to the command
  • Change “val” to “get_local_id(0)”
  • Have a look at the example in find.cl

28

$ cd 2_benign_race $ gpuverify --local_size=8 --num_groups=1 allsame.cl

slide-29
SLIDE 29

Lessons

  • Benign data races do not lead to

nondeterminism

  • Use --no-benign flag to warn about

benign data races

29

slide-30
SLIDE 30

3 Barrier Divergence

30

slide-31
SLIDE 31

__kernel void diverge() { int tid = get_local_id(0); if (tid == 0) barrier(); else barrier(); }

31

slide-32
SLIDE 32

__kernel void inloop() { int x = tid == 0 ? 4 : 1; int y = tid == 0 ? 1 : 4; int i = 0; while (i < x) { int j = 0; while (j < y) { barrier(); j++; } i++; } }

32

slide-33
SLIDE 33
  • Run GPUVerify on these examples
  • Is the inloop kernel barrier divergent?
  • What does the inloop kernel try to do?

33

$ cd 3_barrier_divergence $ gpuverify --local_size=8 --num_groups=1 diverge.cl $ gpuverify --local_size=8 --num_groups=1 inloop.cl

slide-34
SLIDE 34

34

If barrier is inside a conditional statement, then all threads must enter the conditional if any thread enters the conditional statement and executes the barrier. If barrier is inside a loop, all threads must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier. OpenCL Specification (6.12.8 Synchronization Functions)

slide-35
SLIDE 35

35

GPU Final state of A NVIDIA Tesla C2050 {{0,1,0,1},{1,0,1,0}} AMD Tahiti {{0,1,2,3},{1,2,3,0}} ARM Mali-T600 {{0,1,2,3},{3,0,1,2}} Intel Xeon X5650 {{*,*,*,1},{3,0,1,2}}

slide-36
SLIDE 36

Lessons

  • Barrier divergence results in undefined

behaviour

  • GPUVerify can detect such problems
  • Arguably, this is a rare bug?

36

slide-37
SLIDE 37

4 Asserts and Assumes

37

slide-38
SLIDE 38

__kernel void simple(__local int *A) { A[tid] = tid; __assert(A[tid] == tid); __assert(A[tid] != get_local_size(0)); __assert(__implies( __write(A), __write_offset(A)/sizeof(int) == tid)); }

38

slide-39
SLIDE 39
  • Run GPUVerify on these examples
  • Try writing your own assertions
  • Have a look at vacuous.cl
  • Does this surprise you?

39

$ cd 4_asserts_and_assumes $ gpuverify --local_size=8 --num_groups=1 assert.cl

slide-40
SLIDE 40

Lessons

  • Use asserts to state expected details of

your kernel at a particular program point

  • The dangers of inconsistent assumptions
  • Use __assert(false) to test for

inconsistency

40

slide-41
SLIDE 41

5 Loops

41

slide-42
SLIDE 42

__kernel void inc(int x) { int i = 0; while (i < x) { i = i + 1; } __assert(i == x); }

42

slide-43
SLIDE 43

__kernel void inc(int x) { __requires (0 < x); int i = 0; while (i < x) { i = i + 1; } __assert(i == x); }

43

slide-44
SLIDE 44

__kernel void inc(int x) { __requires (0 < x); int i = 0; while (__invariant(?), i < x) { i = i + 1; } __assert(i == x); }

44

slide-45
SLIDE 45
  • Run GPUVerify on these examples
  • Try running with the “--findbugs” flag
  • Can you find an invariant for the loop?
  • Take a look at stride.cl

45

$ cd 5_loops $ gpuverify --local_size=8 --num_groups=1 inc.cl

slide-46
SLIDE 46

Lessons

  • Loop invariants are assertions that are true

at every loop iteration

  • GPUVerify attempts to guess invariants
  • They may be necessary to strengthen

verification to avoid false-positives

  • Use --findbugs to do loop unwinding

46

slide-47
SLIDE 47

Anatomy of GPUVerify

47

slide-48
SLIDE 48

2-thread reduction

s t X

48

slide-49
SLIDE 49

barrier() // b1 barrier() // b2

Arbitrary threads s and t

49

slide-50
SLIDE 50

barrier() // b1 barrier() // b2

run s from b1 to b2 log all accesses Arbitrary threads s and t

50

slide-51
SLIDE 51

barrier() // b1 barrier() // b2

run s from b1 to b2 log all accesses run t from b1 to b2 check all accesses against s abort on race Arbitrary threads s and t

51

slide-52
SLIDE 52

2-thread reduction gives scalable verification

52

slide-53
SLIDE 53

Translate parallel kernel K into sequential program P such that P correct implies K is race-free

53

slide-54
SLIDE 54

54

OpenCL kernel CUDA kernel

Kernel Transformation Engine

sequential Boogie program

Z3 SMT Solver

candidate loop invariants

Frontend (built on LLVM/CLANG) Boogie Verification Engine

slide-55
SLIDE 55

55

OpenCL kernel CUDA kernel

Frontend (built on LLVM/CLANG) Kernel Transformation Engine

sequential Boogie program

Boogie Verification Engine Z3 SMT Solver

candidate loop invariants Widely used, very robust The only magic is here

slide-56
SLIDE 56

Further Examples

56

slide-57
SLIDE 57

__kernel void dbl_indirect(__local int *A) { A[tid] = tid; barrier(); A[A[(tid+1)%N]] = tid; }

57

1 2 3 4 5 6 7

slide-58
SLIDE 58

__kernel void dbl_indirect(__local int *A) { A[tid] = tid; barrier(); A[A[(tid+1)%N]] = tid; }

58

7 1 2 3 4 5 6

slide-59
SLIDE 59

barrier() // b1 barrier() // b2 barrier() // b3

run s from b1 to b2 log all accesses run t from b1 to b2 check all accesses against s run s from b2 to b3 log all accesses run t from b2 to b3 check all accesses against s

59

slide-60
SLIDE 60

barrier() // b1 barrier() // b2 barrier() // b3

run s from b1 to b2 log all accesses run t from b1 to b2 check all accesses against s run s from b2 to b3 log all accesses run t from b2 to b3 check all accesses against s

unsound

60

slide-61
SLIDE 61

barrier() // b1 barrier() // b2 barrier() // b3

run s from b1 to b2 log all accesses run t from b1 to b2 check all accesses against s run s from b2 to b3 log all accesses run t from b2 to b3 check all accesses against s havoc shared state

61

slide-62
SLIDE 62

Shared state abstraction is necessary for soundness

62

slide-63
SLIDE 63

GPUVerify: sound and scalable verification for GPU kernels

GPUVerify: A Verifier for GPU Kernels ∗

Adam Betts1 Nathan Chong1 Alastair F. Donaldson1 Shaz Qadeer2 Paul Thomson1

1Department of Computing, Imperial College London, UK 2Microsoft Research, Redmond, USA

{abetts,nyc04,afd,pt1110}@imperial.ac.uk qadeer@microsoft.com

Abstract

We present a technique for verifying race- and divergence- freedom of GPU kernels that are written in mainstream ker- nel programming languages such as OpenCL and CUDA. Our approach is founded on a novel formal operational se- mantics for GPU programming termed synchronous, delayed visibility (SDV) semantics. The SDV semantics provides a precise definition of barrier divergence in GPU kernels and allows kernel verification to be reduced to analysis of a sequential program, thereby completely avoiding the need to reason about thread interleavings, and allowing existing modular techniques for program verification to be leveraged. We describe an efficient encoding for data race detection and propose a method for automatically inferring loop invari- ants required for verification. We have implemented these techniques as a practical verification tool, GPUVerify, which can be applied directly to OpenCL and CUDA source code. We evaluate GPUVerify with respect to a set of 163 kernels drawn from public and commercial sources. Our evaluation demonstrates that GPUVerify is capable of efficient, auto- matic verification of a large number of real-world kernels. Categories and Subject Descriptors F3.1 [Logics and Meanings of Programs]: Specifying, Verifying & Reason- such as AMD and NVIDIA, have become widely available to end-users. Accelerators offer tremendous compute power at a low cost, and tasks such as media processing, medical imaging and eye-tracking can be accelerated to beat CPU performance by orders of magnitude. GPUs present a serious challenge for software develop-

  • ers. A system may contain one or more of the plethora of

devices on the market, with many more products anticipated in the immediate future. Applications must exhibit portable correctness, operating correctly on any GPU accelerator. Software bugs in media processing domains can have serious financial implications, and GPUs are being used increasingly in domains such as medical image processing [37] where safety is critical. Thus there is an urgent need for verifica- tion techniques to aid construction of correct GPU software. This paper addresses the problem of static verification

  • f GPU kernels written in kernel programming languages

such as OpenCL [17], CUDA [30] and C++ AMP [28]. We focus on two classes of bugs which make writing correct GPU kernels harder than writing correct sequential code: data races and barrier divergence. In contrast to the well-understood notion of data races, there does not appear to be a formal definition of barrier di- vergence for GPU programming. Our work begins by giving

In OOPSLA’12

63

slide-64
SLIDE 64

A B C D E F G H 1 1 1 1 1 1 2 3 3 3 4 A B C D E F G H A C D G data flag idx compact

  • ut

64

slide-65
SLIDE 65

65

slide-66
SLIDE 66

66

s

slide-67
SLIDE 67

67

s t X

slide-68
SLIDE 68

68

slide-69
SLIDE 69

__kernel void iterall_edges( __local uint2 *edges, __local uint *edgecolour, __local float *node_val ) { __requires(?); for (uint c=0; c < MAX_COLOUR; c++) { if (c == edgecolour[tid]) { node_val[edges[tid].lo] = ...; node_val[edges[tid].hi] = ...; } barrier(); } }

69

slide-70
SLIDE 70

70

  • Write a precondition that satisfies the

colouring requirement

  • Preconditions and assertions are a kind of

executable documentation

$ cd 6_further $ gpuverify --local_size=8 --num_groups=1 graph.cl

slide-71
SLIDE 71

71

1 2 3 4 5 6 7

1 2 3 4 5 6 7

1

8 9 10 11 12 13 14 15

2

16 17 18 19 20 21 22 23

3

24 25 26 27 28 29 30 31

4

32 33 34 35 36 37 38 39

5

40 41 42 43 44 45 46 47

6

48 49 50 51 52 53 54 55

7

56 57 58 59 60 61 62 63

height = 8 width = 8 Row Major Aij stored at i + (width*j) (0,0) (0,1) (1,0) (1,1)

slide-72
SLIDE 72
  • Check out transpose.cu
  • Involves tricky loop invariants for reasoning

about data accesses of individual threads

  • More invariants than lines of code!

72

$ cd 6_further $ gpuverify --blockDim=[4,2] --gridDim=[2,2]

  • DWIDTH=8 -DHEIGHT=8 -DTILE_DIM=4 -DBLOCK_ROWS=2

transpose.cu

slide-73
SLIDE 73

Lessons

  • Valuable to know the limitations of the

tools you use

  • Discovering loop invariants can be time-

consuming (but rewarding!)

  • It is possible to reason about complicated

kernels if the engineering investment is worthwhile

73

slide-74
SLIDE 74

Closing

74

slide-75
SLIDE 75

Verification as a powerful and practical complement to Testing

75

slide-76
SLIDE 76

Formal reasoning as a valuable discipline

76

slide-77
SLIDE 77

Search ‘GPUVerify’ on YouTube

77

slide-78
SLIDE 78

78

http://multicore.doc.ic.ac.uk/tools/GPUVerify

slide-79
SLIDE 79

Alastair Dona Donaldson Microsoft Research Shaz Qadeer Frontend Adam Betts Peter Collingbourne Semantics heavy lifting Jeroen Ketema PhD students Paul Thomson Nathan Chong Dan Liew UROP students Egor Kyshtymov Cassie Epps

Work supported by EU FP7 STREP project CARP (project number 287767) and EPSRC PSL project (EP/I006761/1).