Weak Memory Behaviors in GPU Applications
Tyler Sorensen Supervisors: Alastair F. Donaldson and James Brotherston 15 July 2015 Imperial Concurrency Workshop
under weak memory models are limited to hand analysis
a new methodology based on stress and fuzz testing
GPU application -> add stressing/fuzzing hooks and postcondition -> annotated GPU application
postcondition violations.
(~2 seconds per run) the number of failed postconditions is:
No stress   Stress/fuzzing   396
Tests how to implement a handshake idiom
Data
Flag
Stale Data
assertion cannot be satisfied by any interleaving; this is known as Lamport's sequential consistency (or SC)
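To make the interleaving argument concrete, here is a minimal sketch that enumerates every interleaving of a handshake-shaped test and collects the values T1 can observe. The test program, the location names `data` and `flag`, and the registers `r0` and `r1` are assumptions chosen for illustration; they are not taken from the slides.

```python
from itertools import combinations

# Hypothetical handshake test (names are illustrative, not from the slides).
# T0 writes the data and then raises the flag; T1 reads the flag, then the data.
T0 = [("store", "data", 1), ("store", "flag", 1)]
T1 = [("load", "flag", "r0"), ("load", "data", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"data": 0, "flag": 0}
    regs = {}
    for op, loc, arg in schedule:
        if op == "store":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r0"], regs["r1"]


# Every (r0, r1) pair that some interleaving can produce.
sc_outcomes = {run(s) for s in interleavings(T0, T1)}
print(sorted(sc_outcomes))
```

Under these assumptions the stale-data outcome, flag seen as 1 but data still 0, never appears in `sc_outcomes`, which is what it means for the assertion to be unsatisfiable under SC.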
5 billion test runs on Tegra2 ARM processor¹
¹ http://diy.inria.fr/cats/tables.html
hardware is allowed to re-order certain memory instructions.
that do not correspond to an interleaving)
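A toy illustration of how re-ordering produces outcomes that no interleaving explains. The sketch uses a store-buffering (SB) shaped test and a deliberately crude rule: accesses by one thread to different locations may be executed in either order. That rule is an assumption made for illustration only; it is not a model of any real GPU or CPU.

```python
from itertools import permutations

# SB-shaped test. Each op is (thread, program position, kind, location, register).
OPS = [
    (0, 0, "store", "x", None), (0, 1, "load", "y", "r0"),
    (1, 0, "store", "y", None), (1, 1, "load", "x", "r1"),
]


def keeps_order(schedule, relaxed):
    """Check program order; if relaxed, only same-location pairs must stay ordered."""
    for i, later in enumerate(schedule):
        for earlier in schedule[:i]:
            same_thread = earlier[0] == later[0]
            out_of_order = earlier[1] > later[1]
            if same_thread and out_of_order:
                if not relaxed or earlier[3] == later[3]:
                    return False
    return True


def outcomes(relaxed):
    """Collect every (r0, r1) pair reachable by an allowed schedule."""
    results = set()
    for schedule in permutations(OPS):
        if not keeps_order(schedule, relaxed):
            continue
        mem, regs = {"x": 0, "y": 0}, {}
        for _, _, kind, loc, reg in schedule:
            if kind == "store":
                mem[loc] = 1
            else:
                regs[reg] = mem[loc]
        results.add((regs["r0"], regs["r1"]))
    return results


sc = outcomes(relaxed=False)          # interleavings only
weak = outcomes(relaxed=True) - sc    # outcomes that need a re-ordering
print(sorted(sc), sorted(weak))
```

In this toy model the only extra outcome is both loads returning 0, the classic weak behaviour of the SB test.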
Global Memory
Block 0, Block 1, ..., Block n: threads
Shared memory for block 0, shared memory for block 1, ..., shared memory for block n
Within blocks, threads are grouped into warps
memory models.
¹ GPU concurrency: Weak behaviours and programming assumptions. ASPLOS '15.
T0: run T0 test program
T1: run T1 test program
extra thread 1 ... extra thread n: loop: read or write to scratchpad
Memory: Scratch, X, Y, Scratch, Scratch
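The layout above can be sketched structurally as follows. This only shows how the two test threads and the extra stressing threads are arranged; CPython threads will not exhibit weak memory behaviours, and the test program, scratchpad size, and thread count are all assumptions for illustration.

```python
import random
import threading


def run_once(n_extra=4, scratch_size=64, seed=0):
    """Run T0 and T1 once while n_extra threads hammer a scratchpad."""
    memory = {"x": 0, "y": 0}      # test locations X and Y
    scratch = [0] * scratch_size   # scratchpad touched only by the extra threads
    regs = {}
    stop = threading.Event()

    def t0():                      # hypothetical T0 test program: write X, then Y
        memory["x"] = 1
        memory["y"] = 1

    def t1():                      # hypothetical T1 test program: read Y, then X
        regs["r0"] = memory["y"]
        regs["r1"] = memory["x"]

    def stress(thread_seed):       # extra thread: loop, read or write to scratchpad
        rng = random.Random(thread_seed)
        while not stop.is_set():
            i = rng.randrange(scratch_size)
            if rng.random() < 0.5:
                scratch[i] += 1
            else:
                _ = scratch[i]

    extras = [threading.Thread(target=stress, args=(seed + k,)) for k in range(n_extra)]
    tests = [threading.Thread(target=t0), threading.Thread(target=t1)]
    for th in extras + tests:
        th.start()
    for th in tests:               # wait for the test programs to finish
        th.join()
    stop.set()                     # then tell the stressing threads to stop
    for th in extras:
        th.join()
    return regs["r0"], regs["r1"]


print(run_once())
```

On a GPU the same roles would be played by the testing threads and the extra stressing threads of a kernel; the stop signal here is just a convenience of the sketch.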
block 0 ... block n: run application
extra block 0 ... extra block x: memory stress
Memory: application memory, scratchpad memory
behaviors with no a priori knowledge about the application.
, SB, and LB
Memory stress
Where to stress:
, SB, LB at distance D litmus tests stressing only location I for 1000 iterations
Memory: X and Y at distance D, location I stressed
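The experiment described here is a sweep over three parameters: the litmus test, the distance D between the test locations, and the stressed location I. A minimal sketch of that sweep follows. The callable `run_litmus` is hypothetical and stands in for launching one litmus test on the GPU and reporting whether a weak behaviour was observed; the stub used in the demo always returns False and produces no real data.

```python
from collections import Counter


def sweep(run_litmus, tests, distances, stress_indices, iterations=1000):
    """Count weak outcomes for every (test, D, I) combination.

    run_litmus(test, d, i) is a hypothetical callable that runs one litmus
    test with its locations d apart while location i is stressed, and
    returns True if a weak behaviour was observed.
    """
    counts = Counter()
    for test in tests:
        for d in distances:
            for i in stress_indices:
                for _ in range(iterations):
                    if run_litmus(test, d, i):
                        counts[(test, d, i)] += 1
    return counts


def never_weak(test, d, i):
    """Stub runner for the demo only; a real runner would launch a GPU kernel."""
    return False


demo = sweep(never_weak, ["SB", "LB"], distances=range(4), stress_indices=range(4))
print(sum(demo.values()))
```

Plotting `counts` with D on one axis and I on the other, one panel per litmus test, would give a chart in the style the following slides describe.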
Figure axes: distance D, index I stressed, litmus test
Vertical bar represents the magnitude
locations*
atch
*64 for some chips
many disjoint patches as possible
64 for n
Zoom in
Stressing 2 random patches is most effective
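A minimal sketch of picking random disjoint patches in a scratchpad. It assumes patches are aligned to multiples of the patch size, which guarantees they cannot overlap; the function names and the scratchpad size are illustrative, and the patch size of 64 in the usage line is the value the slides mention for some chips.

```python
import random


def random_patches(scratch_size, patch_size, count=2, rng=random):
    """Return the start offsets of `count` disjoint, patch-aligned regions."""
    n_patches = scratch_size // patch_size
    if count > n_patches:
        raise ValueError("scratchpad too small for the requested patches")
    chosen = rng.sample(range(n_patches), count)   # distinct patch indices
    return sorted(p * patch_size for p in chosen)


def stress_locations(starts, patch_size):
    """Expand patch start offsets into the individual locations to stress."""
    return [s + off for s in starts for off in range(patch_size)]


starts = random_patches(scratch_size=4096, patch_size=64, count=2)
print(starts, len(stress_locations(starts, 64)))
```

Each stressing thread would then be given one of the returned locations to read or write in its loop.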
N-body particle simulation in Lonestar GPU benchmark¹
¹ see: http://iss.ices.utexas.edu/?p=projects/galois/lonestargpu
Executing the application for 1 hour (~2 seconds per run), the number of erroneous runs on a Quadro K5200:
No stress   With stress   48
stressing strategies
Cache stress:
L2 cache, performing one read and write at each location
Scratch (L2 cache size) Extra blocks
Random stress:
randomly performs a read or write to that location
Scratch T0 T1 T2 T3 T4 T5 T6 T7 T8 …
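The two alternative strategies can be sketched as the access pattern each stressing thread performs. How locations are divided among threads in the cache strategy, and the 50/50 choice between read and write in the random strategy, are assumptions of this sketch rather than details given on the slides.

```python
import random


def cache_stress(scratch, thread_id, n_threads):
    """Cache strategy: one read and one write at each location in this thread's share."""
    for i in range(thread_id, len(scratch), n_threads):
        value = scratch[i]        # read
        scratch[i] = value + 1    # write


def random_stress(scratch, location, iterations, rng):
    """Random strategy: each iteration randomly reads or writes one fixed location."""
    reads = writes = 0
    for _ in range(iterations):
        if rng.random() < 0.5:
            value = scratch[location]   # read
            reads += 1
        else:
            scratch[location] += 1      # write
            writes += 1
    return reads, writes


scratch = [0] * 16                      # stand-in for an L2-sized scratchpad
for t in range(4):
    cache_stress(scratch, t, 4)
print(scratch, random_stress(scratch, 3, 100, random.Random(1)))
```

In the cache strategy the scratchpad would be sized to the L2 cache, so that every location in the cache is touched once per pass.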
Executing the application for 1 hour (~2 seconds per run), the number of erroneous runs:
No stress   Systematic stress   Cache stress   Random stress   48
How about the dot product application?
Chip   No stress   Systematic stress   Cache stress   Random stress
Titan: 396, 2
GTX 980: 495, 2, 1
Tested:
10 applications
                                       Systematic stress   Random stress   Cache stress
# combinations showing weak memory     32                  8               8
# combinations most effective stress   28                  2               2
prior work
2 applications
fences