Weak Memory Behaviors in GPU Applications - Tyler Sorensen - PowerPoint PPT Presentation



slide-1
SLIDE 1

Weak Memory Behaviors in GPU Applications

Tyler Sorensen Supervisors: Alastair F. Donaldson and James Brotherston 15 July 2015 Imperial Concurrency Workshop

1

slide-2
SLIDE 2

Overview

  • Current techniques for reasoning about GPU applications under weak memory models are limited to hand analysis

2

slide-3
SLIDE 3

Overview

  • Current techniques for reasoning about GPU applications under weak memory models are limited to hand analysis
  • This is laborious, error-prone, and requires a formal model

3

slide-4
SLIDE 4

Overview

  • Current techniques for reasoning about GPU applications under weak memory models are limited to hand analysis
  • This is laborious, error-prone, and requires a formal model
  • We propose a new methodology based on stressing and fuzz testing

4

slide-5
SLIDE 5

Overview

GPU application -> Add stressing/fuzzing hooks and postcondition -> Annotated GPU application

  • Run annotated application for many iterations and check for postcondition violations.

5
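The test loop above can be sketched host-side in Python. This is an illustrative harness only; the function and application names are mine, not from the slides:

```python
import random

def run_annotated_app(app, postcondition, iterations=1000):
    """Hypothetical harness: run an annotated application many times
    and count how often its postcondition is violated."""
    violations = 0
    for _ in range(iterations):
        result = app()                # one run of the annotated application
        if not postcondition(result):
            violations += 1
    return violations

# Toy stand-in for an application that occasionally misbehaves
def flaky_app():
    return 0 if random.random() < 0.01 else 42

count = run_annotated_app(flaky_app, lambda r: r == 42, iterations=500)
```

On real GPU applications, each iteration is one kernel launch with the stressing/fuzzing hooks enabled.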

slide-6
SLIDE 6

Overview

  • Buggy dot product routine

6

slide-7
SLIDE 7

Overview

  • Buggy dot product routine
  • Running the program for 1 hour (~2 seconds per run), the number of failed postconditions is:

No stress | Stress/fuzzing

7

slide-8
SLIDE 8

Overview

  • Buggy dot product routine
  • Running the program for 1 hour (~2 seconds per run), the number of failed postconditions is:

No stress | Stress/fuzzing
-         | 396

8

slide-9
SLIDE 9

Roadmap

  • Background
  • Stress testing details
  • Results

9

slide-10
SLIDE 10

Weak memory models

  • consider the test known as message passing (MP)

10

slide-11
SLIDE 11

Weak memory models

  • consider the test known as message passing (MP)

11

slide-12
SLIDE 12

Weak memory models

  • consider the test known as message passing (MP)

12

slide-13
SLIDE 13

Weak memory models

  • consider the test known as message passing (MP)

13

slide-14
SLIDE 14

Message passing (MP) test

  • Tests how to implement a handshake idiom

Data

14

slide-15
SLIDE 15

Message passing (MP) test

  • Tests how to implement a handshake idiom

Flag

15

slide-16
SLIDE 16

Message passing (MP) test

  • Tests how to implement a handshake idiom

Stale Data

16

slide-17
SLIDE 17

17

slide-18
SLIDE 18

18

slide-19
SLIDE 19

19

slide-20
SLIDE 20

20

slide-21
SLIDE 21

21

slide-22
SLIDE 22

The assertion cannot be satisfied by any interleaving of the threads; this interleaving-based model is known as Lamport's sequential consistency (SC)

22

slide-23
SLIDE 23

Weak memory models

  • Can we assume the assertion will never pass?

23

slide-24
SLIDE 24

Weak memory models

  • Can we assume the assertion will never pass? No!

24

slide-25
SLIDE 25

Weak memory models

  • Alglave et al. report this assertion passes 41 million times out of 5 billion test runs on a Tegra 2 ARM processor1

1http://diy.inria.fr/cats/tables.html

25

slide-26
SLIDE 26

Weak memory models

  • what happened?

26

slide-27
SLIDE 27

Weak memory models

  • What happened?
  • Architectures implement weak memory models, where the hardware is allowed to re-order certain memory instructions.
  • Weak memory models can allow weak behaviors (executions that do not correspond to any interleaving)

27

slide-28
SLIDE 28

GPU programming

Global Memory

Block 0 | Block 1 | ... | Block n (Threads)

Shared memory for block 0 | Shared memory for block 1 | ... | Shared memory for block n

Within blocks, threads are grouped into warps

28

slide-29
SLIDE 29

GPU programming

Global Memory

Threads

29

slide-30
SLIDE 30

GPU programming

Global Memory

Block 0 Block 1 Block n Threads

30

slide-31
SLIDE 31

GPU programming

Global Memory

Block 0 Block 1 Block n Threads

Shared memory for block 0 Shared memory for block 1 Shared memory for block n

31

slide-32
SLIDE 32

GPU programming

Global Memory

Block 0 Block 1 Block n Threads

Within blocks, threads are grouped into warps

Shared memory for block 0 Shared memory for block 1 Shared memory for block n

32

slide-33
SLIDE 33

Roadmap

  • Background
  • Stress testing details
  • Results

33

slide-34
SLIDE 34

GPU memory models

34

  • Previous work1 showed that GPUs empirically have weak memory models.

  • Done using a tool which ran litmus tests on GPUs
  • Required heuristics for weak behaviors to appear

1GPU concurrency: Weak behaviours and programming assumptions. ASPLOS ’15.

slide-35
SLIDE 35

Litmus tests

35

slide-36
SLIDE 36

Memory stress

T0: run T0 test program
T1: run T1 test program
extra threads 1..n: loop: read or write to scratchpad

36

slide-37
SLIDE 37

Memory stress

T0: run T0 test program
T1: run T1 test program
extra threads 1..n: loop: read or write to scratchpad

Memory

37

slide-38
SLIDE 38

Memory stress

T0: run T0 test program
T1: run T1 test program
extra threads 1..n: loop: read or write to scratchpad

Memory: X Y

38

slide-39
SLIDE 39

Memory stress

T0: run T0 test program
T1: run T1 test program
extra threads 1..n: loop: read or write to scratchpad

Memory: X Y, scratch regions

39
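The stressing setup in the diagrams above can be mimicked host-side in Python. This is an illustrative sketch with invented names; on a GPU the stressors are extra threads/blocks, not OS threads:

```python
import random
import threading

def run_with_stress(test_fn, n_extra=4, pad_size=1024):
    """Run a test function while extra 'stressor' threads hammer a
    shared scratchpad with random reads and writes (sketch only)."""
    scratchpad = [0] * pad_size
    stop = threading.Event()

    def stressor(seed):
        rng = random.Random(seed)
        while not stop.is_set():
            i = rng.randrange(len(scratchpad))
            if rng.random() < 0.5:
                _ = scratchpad[i]                    # read
            else:
                scratchpad[i] = rng.randrange(256)   # write

    extras = [threading.Thread(target=stressor, args=(s,)) for s in range(n_extra)]
    for t in extras:
        t.start()
    result = test_fn()      # T0/T1 run their test program meanwhile
    stop.set()
    for t in extras:
        t.join()
    return result
```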

slide-40
SLIDE 40

Memory stress

  • Can we extend memory stress for testing applications?

40

slide-41
SLIDE 41

Memory stress

blocks 0..n: run application
extra blocks 0..x: memory stress

Memory: application memory | scratchpad memory

41

slide-42
SLIDE 42

Memory stress

blocks 0..n: run application
extra blocks 0..x: memory stress

Memory: application memory | scratchpad memory

42

slide-43
SLIDE 43

Memory stress

  • Goal: design stress to reveal weak behaviors with no a priori knowledge about the application.
  • We investigate using the litmus tests MP, SB, and LB

Memory stress

43

slide-44
SLIDE 44

Memory stress

Where to stress:

44

slide-45
SLIDE 45

Memory stress

Where to stress:

  • For each distance D:

X Y

45

slide-46
SLIDE 46

Memory stress

Where to stress:

  • For each distance D:

X Y

46

slide-47
SLIDE 47

Memory stress

Where to stress:

  • For each distance D:

X Y

47

slide-48
SLIDE 48

Memory stress

Where to stress:

  • For each distance D:

X Y

48

slide-49
SLIDE 49

Memory stress

Where to stress:

  • For each distance D:

X D Y

49

slide-50
SLIDE 50

Memory stress

Where to stress:

  • For each distance D:
  • For each scratchpad location I:

X D Y I

50

slide-51
SLIDE 51

Memory stress

Where to stress:

  • For each distance D:
  • For each scratchpad location I:

X D Y I I

51

slide-52
SLIDE 52

Memory stress

Where to stress:

  • For each distance D:
  • For each scratchpad location I:

X D Y I I

52

slide-53
SLIDE 53

Memory stress

Where to stress:

  • For each distance D:
  • For each scratchpad location I:
  • Run MP, SB, and LB litmus tests at distance D, stressing only location I, for 1000 iterations

X D Y I I

53
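The sweep described above (for each distance D, for each scratchpad location I, run each litmus test for 1000 iterations) can be sketched as a configuration generator; the function and field names here are mine:

```python
def sweep_configs(distances, scratchpad_size, tests=("MP", "SB", "LB"), iters=1000):
    """Enumerate the (distance D, stressed location I, litmus test) grid.
    Each yielded config would be run for `iters` iterations on the GPU."""
    for d in distances:
        for i in range(scratchpad_size):
            for t in tests:
                yield {"D": d, "I": i, "test": t, "iterations": iters}

# Small example grid: 3 distances x 8 locations x 3 tests = 72 configs
configs = list(sweep_configs(distances=[1, 2, 4], scratchpad_size=8))
```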

slide-54
SLIDE 54

Memory stress

54

slide-55
SLIDE 55

Memory stress

Distance D

55

slide-56
SLIDE 56

Memory stress

Distance D

X D Y

56

slide-57
SLIDE 57

Memory stress

Distance D Index I stressed

57

slide-58
SLIDE 58

Memory stress

Distance D Index I stressed

I I

58

slide-59
SLIDE 59

Memory stress

Distance D Index I stressed Litmus test

59

slide-60
SLIDE 60

Memory stress

Vertical bar represents the magnitude of weak behaviors observed

60

slide-61
SLIDE 61

Memory stress

  • Visualization samples

61

slide-62
SLIDE 62

Memory stress

  • Visualization samples

62

slide-63
SLIDE 63

Memory stress

  • Visualization samples

63

slide-64
SLIDE 64

Memory stress

  • Visualization samples

64

slide-65
SLIDE 65

Memory stress

  • What does this tell us?

65

slide-66
SLIDE 66

Memory stress

  • What does this tell us?
  • To reveal weak behaviors we only need to stress 1 in every 32 locations*
  • We call a contiguous region of 32 elements a patch

*64 for some chips

66
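The patch arithmetic from the observation above can be sketched as follows (names are mine; patch size is 32 here, 64 on some chips):

```python
PATCH = 32   # elements per patch; 64 on some chips

def patch_index(location):
    """Which patch a scratchpad location falls in."""
    return location // PATCH

def one_per_patch(scratchpad_size):
    """One stressed location per patch: per the observation above,
    stressing 1 in every 32 locations suffices to reveal weak behaviors."""
    return [p * PATCH for p in range(scratchpad_size // PATCH)]

locs = one_per_patch(128)   # 128-element scratchpad -> 4 patches
```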

slide-67
SLIDE 67

Memory stress

  • How many patches can we effectively stress?
  • If D is unknown (as in applications), we would like to stress as many disjoint patches as possible

67

slide-68
SLIDE 68

Memory stress

  • The scratchpad has a size of 64 patches
  • We try stressing n randomly selected patches, for n from 1 to 64

68

slide-69
SLIDE 69

69

slide-70
SLIDE 70

Zoom in

  • first 8 values of n

70

slide-71
SLIDE 71

71

slide-72
SLIDE 72

Stressing 2 random patches is most effective

72

slide-73
SLIDE 73

Memory stress

  • Now we have a memory stressing strategy!
  • Stress two random patches in the scratchpad
  • Patch size may change per chip

73
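The final strategy above (stress two randomly chosen patches) can be sketched like this; the helper name and dictionary layout are my own:

```python
import random

PATCH = 32   # may be 64 on some chips

def pick_stress_patches(n_patches, k=2, rng=None):
    """Select k distinct patches at random and expand each to the
    scratchpad locations it covers (illustrative sketch)."""
    rng = rng or random.Random()
    chosen = rng.sample(range(n_patches), k)   # k distinct patch indices
    return {p: list(range(p * PATCH, (p + 1) * PATCH)) for p in chosen}

# 64-patch scratchpad, stress 2 random patches (seeded for repeatability)
patches = pick_stress_patches(64, k=2, rng=random.Random(0))
```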

slide-74
SLIDE 74

Roadmap

  • Background
  • Stress testing details
  • Results

74

slide-75
SLIDE 75

Application

N-body particle simulation in Lonestar GPU benchmark1

1see: http://iss.ices.utexas.edu/?p=projects/galois/lonestargpu

75

slide-76
SLIDE 76

Application

N-body particle simulation in Lonestar GPU benchmark1

  • Documented to have communication across blocks
  • No other a priori information is needed for our testing
  • The postcondition checks the final location of particles

1see: http://iss.ices.utexas.edu/?p=projects/galois/lonestargpu

76

slide-77
SLIDE 77

Application

Executing the application for 1 hour (~2 seconds per run), the number of erroneous runs on a Quadro K5200:

77

slide-78
SLIDE 78

Application

No stress | With stress

Executing the application for 1 hour (~2 seconds per run), the number of erroneous runs on a Quadro K5200:

78

slide-79
SLIDE 79

Application

No stress | With stress
-         | 48

Executing the application for 1 hour (~2 seconds per run), the number of erroneous runs on a Quadro K5200:

79

slide-80
SLIDE 80

Comparing stresses

  • Does it matter how we stress?
  • We compare our systematic stressing method to 2 other stressing strategies

80

slide-81
SLIDE 81

Comparing stresses

Cache stress:

  • Each stressing block streams over a scratchpad the size of the L2 cache, performing one read and one write at each location

Scratch (L2 cache size) | Extra blocks

81
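The cache-stress strategy above can be sketched host-side; the buffer size and names here are illustrative stand-ins for an L2-cache-sized scratchpad:

```python
def cache_stress_pass(scratchpad):
    """One streaming pass: one read and one write at every location,
    mimicking the cache-stress strategy described above."""
    total = 0
    for i in range(len(scratchpad)):
        total += scratchpad[i]     # read
        scratchpad[i] = i & 0xFF   # write
    return total

pad = [1] * 4096   # stand-in for an L2-sized buffer
s = cache_stress_pass(pad)
```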

slide-82
SLIDE 82

Comparing stresses

Random stress:

  • Each thread randomly selects a location in the scratchpad and randomly performs a read or write to that location

Scratch | T0 T1 T2 T3 T4 T5 T6 T7 T8 ...

82
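The random-stress strategy above can likewise be sketched as a single per-thread step; names are mine:

```python
import random

def random_stress_op(scratchpad, rng):
    """One step of the random-stress strategy: pick a random location,
    then randomly read or write it (illustrative sketch)."""
    i = rng.randrange(len(scratchpad))
    if rng.random() < 0.5:
        return ("read", i, scratchpad[i])
    scratchpad[i] = rng.randrange(256)
    return ("write", i, scratchpad[i])

rng = random.Random(1)
pad = [0] * 64
ops = [random_stress_op(pad, rng) for _ in range(100)]
```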

slide-83
SLIDE 83

Application

Executing the application for 1 hour (~2 seconds per run), the number of erroneous runs:

No stress | Systematic stress | Cache stress | Random stress
-         | 48                |              |

83

slide-84
SLIDE 84

Application

Executing the application for 1 hour (~2 seconds per run), the number of erroneous runs:

No stress | Systematic stress | Cache stress | Random stress
-         | 48                |              |

84

slide-85
SLIDE 85

Application

Executing the application for 1 hour (~2 seconds per run), the number of erroneous runs:

No stress | Systematic stress | Cache stress | Random stress
-         | 48                |              |

85

slide-86
SLIDE 86

Application

How about the dot product application?

Chip    | No stress | S. stress | Cache stress | Random stress
Titan   | -         | 396       |              |
GTX 980 | -         | 495       |              |

86

slide-87
SLIDE 87

Application

How about the dot product application?

Chip    | No stress | S. stress | Cache stress | Random stress
Titan   | -         | 396       |              |
GTX 980 | -         | 495       | 2            |

87

slide-88
SLIDE 88

Application

How about the dot product application?

Chip    | No stress | S. stress | Cache stress | Random stress
Titan   | -         | 396       | 2            |
GTX 980 | -         | 495       | 2            | 1

88

slide-89
SLIDE 89

Full experimental study

  • Tested:
  • 4 chips across 3 major Nvidia architectures
  • 10 applications
  • 3 different stress settings for each chip/application combination

89

slide-90
SLIDE 90

Full experimental study

  • Total of 40 chip/application combinations
  • Observed weak behaviors in 32 of the combinations

                                      Systematic Stress | Random Stress | Cache Stress
# combinations showing weak memory:   32                | 8             | 8
# combinations most effective stress: 28                | 2             | 2

90

slide-91
SLIDE 91

Some results:

  • We provided empirical confirmation of 3 bugs reported in prior work
  • We discovered unreported weak memory bugs in 2 applications

91

slide-92
SLIDE 92

Future work

  • Use our testing framework to automatically insert memory fences

  • Benchmark the cost of fences on GPUs

92

slide-93
SLIDE 93

Questions

93