SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in - - PowerPoint PPT Presentation

sword a bounded memory overhead detector of openmp data
SMART_READER_LITE
LIVE PREVIEW

SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in - - PowerPoint PPT Presentation

SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in Production Runs Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamaric School of Computing, University of Utah, Salt Lake City , UT 84112 Presented at IPDPS 2018 See paper for


slide-1
SLIDE 1

Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamaric School of Computing, University of Utah, Salt Lake City, UT 84112 Presented at IPDPS 2018 See paper for details Ignacio Laguna, Greg L. Lee, Dong H. Ahn Lawrence Livermore National Laboratory, Livermore, CA

Github.com / PRUNERS

SWORD: A Bounded Memory-Overhead Detector

  • f OpenMP Data Races

in Production Runs

Courtesy Pinterest

slide-2
SLIDE 2

What is a data race?

slide-3
SLIDE 3

What is a data race?

Thread 1 Thread 2

slide-4
SLIDE 4

What is a data race?

Thread 1 Thread 2 W R/W

slide-5
SLIDE 5

What is a data race?

Thread 1 Thread 2 W R/W No synchronizations

slide-6
SLIDE 6

T0 T1 W R/W

One way to eliminate this race

slide-7
SLIDE 7

T0 T1 W R/W

One way to eliminate this race

UNLOCK LOCK UNLOCK LOCK

slide-8
SLIDE 8

T0 T1 W R/W

One way to eliminate this race

UNLOCK LOCK UNLOCK LOCK

slide-9
SLIDE 9

Another way to eliminate this race

T0 T1 W R/W

slide-10
SLIDE 10

Another way to eliminate this race

T0 T1 W R/W RELEASE ACQUIRE

Signal using `special’ variables

  • Java ‘volatile’ annotations
  • NOT C ‘volatiles’ !
  • C++11 ’atomic’ annotations
slide-11
SLIDE 11

A third way

T0 T1 W R/W

slide-12
SLIDE 12

A third way

T0 T1 W R/W

Put a barrier

slide-13
SLIDE 13

Why eliminate races?

slide-14
SLIDE 14

Popular answer: “avoid nondeterminism”

T0 T1

X = 0 t = X

slide-15
SLIDE 15

Unclear what “nondeterminism” means..

slide-16
SLIDE 16

Execution Order is Still Nondeterministic

T0 T1

X = 0 t = X

UNLOCK LOCK UNLOCK LOCK

slide-17
SLIDE 17

More relevant: Avoid “pink elephants” !

slide-18
SLIDE 18

More relevant: Avoid “pink elephants” !

Pink elephant (Sutter) : “A value you never wrote but managed to read” Aka ”out of thin air” value

slide-19
SLIDE 19

The birth of a pink elephant…

T0 T1 X = 0 t = X T0 T1 X = 24 t = X

Compiler Optimizations

t is 0 here

24

read here You may never have written “24” in your program

slide-20
SLIDE 20

Details of how a pink elephant is made!

T0 T1 X = 0 t = X Y = 23 X = Y + 1 T0 T1 t = X Y = 23 The compiler has NO IDEA that the user meant to communicate here !! Compiler

  • ptimizations

create these pink-elephant values…

24

read here X = 24

slide-21
SLIDE 21

This is why code containing data races

  • ften fail (only) when optimized!
slide-22
SLIDE 22

Race-freedom ensures intended communications

T0 T1 W R/W

  • You don’t observe

“half baked” values

  • Code does not reorder

around sync. points

  • No “word tearing”
  • Pending writes flushed

(fences inserted)nly

slide-23
SLIDE 23

Exploding a myth!

There is no such thing as a benign race !!

slide-24
SLIDE 24

Races in OpenMP programs are hard to spot

  • See#tinyurl.com/ompRaces if#you#wish#
  • but$later$!
  • Static#analysis#tools#never#shown#to#work#well
  • First#usable#OpenMP dynamic#race#checker#(afaik)
  • Archer$[Atzeni,$IPDPS’16]
  • More$on$that$soon
  • This$talk$will#present#the#second#usable#dynamic#race#checker
  • Sword
slide-25
SLIDE 25

This talk: Why and how of another OMP race checker

slide-26
SLIDE 26
  • HYDRA&porting&on&Sequoia&at&LLNL
  • Large&multiphysics MPI/OpenMP application
  • Non@deterministic&crashes&in&OpenMP region
  • Only&when&the&code&was&optimized!
  • Suspected&data&race
  • Emergency&hack:
  • Disabled&OpenMP&in&Hypre
  • Root@cause&found&by&Archer&:
  • two&threads&writing&0 to&a&common&location&without&synchronization

The Pink Elephant Actually Struck Us!

slide-27
SLIDE 27

Archer to the rescue!

slide-28
SLIDE 28

Archer [IPDPS’16]

  • Utah: Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamaric
  • LLNL: Dong H. Ahn, Ignacio Laguna, Martin Schulz, Gregory L. Lee
  • RWTH: Joachim Protze, Matthias S. Muller

– In production use at LLNL Part of the “PRUNERS” tool suite

PRUNERS was a finalist of the 2017 R&D 100 Award Selection

Archer to the rescue!

slide-29
SLIDE 29

Archer’s “find”

Two$threads$writing$0 to$the$same$location$ without$synchronization

slide-30
SLIDE 30

Archer’s “find”

Two$threads$writing$0 to$the$same$location$ without$synchronization

slide-31
SLIDE 31

Did we live “happily ever after?”

slide-32
SLIDE 32

No !

slide-33
SLIDE 33

Archer has “memory-outs”; also misses races

slide-34
SLIDE 34
  • Archer&increases&memory&500%
  • It&also&misses&races!
  • These&were&known&issues
  • Finally'surfaced'with'the'”right'large'example”
  • Root9cause'found'by'Archer':
  • two'threads'writing'0 to'a'common'location'

without'synchronization

Archer has “memory-outs”; also misses races

slide-35
SLIDE 35

Reason: Archer employs “shadow cells”

Core 0 Core 1 Core 2 Core 3 ss0 ss1 ss2 ss3 A0 ss0 ss1 ss2 ss3 A1 ss0 ss1 ss2 ss3 Amax …. A programmable number of cells per address (4 shown, and is typical)

slide-36
SLIDE 36

~4 shadow cells per application location

Core 0 Core 1 Core 2 Core 3 ss0 ss1 ss2 ss3 A0 ss0 ss1 ss2 ss3 A1 ss0 ss1 ss2 ss3 Amax …. A programmable number of cells per address (4 shown, and is typical) Shadow-cells immediately increase memory demand by a factor of four

slide-37
SLIDE 37

Archer misses races due to shadow cell eviction

slide-38
SLIDE 38

Archer misses races due to shadow cell eviction

Core Core 1 Core 2 Core 3 ss0 ss1 ss2 ss3 A0 ss0 ss1 ss2 ss3 A1 ss0 ss1 ss2 ss3 Amax ….

slide-39
SLIDE 39

Core Core 1 Core 2 Core 3 ss0 ss1 ss2 ss3 A0 ss0 ss1 ss2 ss3 A1 ss0 ss1 ss2 ss3 Amax …. All threads read a[3] Thread 3 writes a[3] All threads read A[3] Thread 3 writes A[3]

Archer misses races due to shadow cell eviction

slide-40
SLIDE 40

Capacity conflict ! evict shadow cell

Core 0 Core 1 Core 2 Core 3 ss0 ss1 ss2 ss3 A0 ss0 ss1 ss2 ss3 A1 ss0 ss1 ss2 ss3 Amax …. With shadow-cell evicted, races are missed

slide-41
SLIDE 41

Archer misses races due to HB-masking

slide-42
SLIDE 42

Archer misses races due to HB-masking

These are concurrent; there are two races here! These races are missed in this interleaving!

slide-43
SLIDE 43

Solution : Get rid of shadow cells !!

slide-44
SLIDE 44

Offline Analysis

Core 0 Core 1 Core 2 Core 3

Need New Approach with Online/Offline split

Race Reports

Compression Compression Compression Compression

slide-45
SLIDE 45

Details of the online phase

Core 0 Core 1 Core 2 Core 3

  • Collect'traces'per'core'un#coordinated
  • Trace-collection-speeds-increased;-we-use-the-OMPT-tracing-method
  • Employ-data-compression-to-bring-FULL-traces-out
  • Only'2.5'MB'compression'buffer'per'thread'(fits-in-L3-cache)

Compression Compression Compression Compression

slide-46
SLIDE 46

Consequences for the offline phase

Core 0 Core 1 Core 2 Core 3

  • We#would#have#lost#all#the#synchronization#information
  • We#only#know#what#each#thread#is#doing
  • We#must#recover#the#concurrency#structure
  • And#in#the#context#of#its#happens;before#order,#detect#races!

Compression Compression Compression Compression

slide-47
SLIDE 47

Offline synchronization recovery and analysis

0 - [0,1] 1 - [0,1][0,2] 2 - [0,1][1,2] 3 - [0,1][0,2][0,2] 4 - [0,1][0,2][1,2] 7 - [0,1][2,2] 5 - [0,1][1,2][0,2] 6 - [0,1][1,2][1,2] 11 - [0,1][3,2] 12 - [1,1] 8 - [0,1][2,2][0,2] 9 - [0,1][2,2][1,2] 10 - [0,1][4,2] IBarrier(3) Barrier(1) read(x) write(y) write(x) m_acq() m_rel() read(y) m_acq(M1) m_rel(M1) IBarrier(4) Barrier(2) write(y) m_acq(M1) m_rel(M1) write(x) m_acq() m_rel() IBarrier(6) FOR-LOOP IBarrier(7) R1: race on y R2: race on y R3: race on x IBarrier(5)

Core 0 Core 1 Core 2 Core 3 Compression Compression Compression Compression

OpSem (HIPS’18)

slide-48
SLIDE 48

Offset-Span Labels: How we record concurrency

(Mellor-Crummey, 1991)

slide-49
SLIDE 49

Key state in OpSem: Maintain Barrier Intervals

0 - [0,1] 1 - [0,1][0,2] 2 - [0,1][1,2] 3 - [0,1][0,2][0,2] 4 - [0,1][0,2][1,2] 7 - [0,1][2,2] 5 - [0,1][1,2][0,2] 6 - [0,1][1,2][1,2] 11 - [0,1][3,2] 12 - [1,1] 8 - [0,1][2,2][0,2] 9 - [0,1][2,2][1,2] 10 - [0,1][4,2] IBarrier(3) Barrier(1) read(x) write(y) write(x) m_acq() m_rel() read(y) m_acq(M1) m_rel(M1) IBarrier(4) Barrier(2) write(y) m_acq(M1) m_rel(M1) write(x) m_acq() m_rel() IBarrier(6) FOR-LOOP IBarrier(7) IBarrier(5)

Barrier&Interval&1 Barrier&Interval&3 Barrier&Interval&2 Barrier&Interval&5

slide-50
SLIDE 50

Examples of Races Reported

0 - [0,1] 1 - [0,1][0,2] 2 - [0,1][1,2] 3 - [0,1][0,2][0,2] 4 - [0,1][0,2][1,2] 7 - [0,1][2,2] 5 - [0,1][1,2][0,2] 6 - [0,1][1,2][1,2] 11 - [0,1][3,2] 12 - [1,1] 8 - [0,1][2,2][0,2] 9 - [0,1][2,2][1,2] 10 - [0,1][4,2] IBarrier(3) Barrier(1) read(x) write(y) write(x) m_acq() m_rel() read(y) m_acq(M1) m_rel(M1) IBarrier(4) Barrier(2) write(y) m_acq(M1) m_rel(M1) write(x) m_acq() m_rel() IBarrier(6) FOR-LOOP IBarrier(7) R1: race on y R2: race on y R3: race on x IBarrier(5)

Barrier& Interval&3

Race&within& same& barrier& interval

slide-51
SLIDE 51

Examples of Races Reported

0 - [0,1] 1 - [0,1][0,2] 2 - [0,1][1,2] 3 - [0,1][0,2][0,2] 4 - [0,1][0,2][1,2] 7 - [0,1][2,2] 5 - [0,1][1,2][0,2] 6 - [0,1][1,2][1,2] 11 - [0,1][3,2] 12 - [1,1] 8 - [0,1][2,2][0,2] 9 - [0,1][2,2][1,2] 10 - [0,1][4,2] IBarrier(3) Barrier(1) read(x) write(y) write(x) m_acq() m_rel() read(y) m_acq(M1) m_rel(M1) IBarrier(4) Barrier(2) write(y) m_acq(M1) m_rel(M1) write(x) m_acq() m_rel() IBarrier(6) FOR-LOOP IBarrier(7) R1: race on y R2: race on y R3: race on x IBarrier(5)

Barrier& Interval&3

Races&across& parallel&regions

Barrier&Interval&2 Barrier&Interval&5

slide-52
SLIDE 52

Good news

  • Online&analysis&proved&really&good
  • No#memory#pressure#!!
slide-53
SLIDE 53

Bad news

Offline'analysis'took$a$day$to$$finish$on “medium$sized”$examples

slide-54
SLIDE 54

T wo Key Innovations Saved the Approach

  • Self%balancing,red%black interval,trees
  • On%the%fly,generation,of,Integer,Linear,Programs
slide-55
SLIDE 55
  • Decompress,*record*strided accesses)in*self0balancing*red0black interval*trees
  • Generate*Integer*Linear*Programs*on0the0fly,*and*check*for*overlaps
  • Handles)bursts)of)accesses)efficiently

Core 0 Core 1 Core 2 Core 3

Reducing “a day” to “under a minute”

Compression Compression Compression Compression

slide-56
SLIDE 56

OMP read/writes are bursty with strides!

slide-57
SLIDE 57

OMP read/writes are bursty with strides!

Each of this is a multi-word access Build Integer Linear Programs for each constant-stride interval ILP system encodes accessed byte-addresses in each “burst”

slide-58
SLIDE 58

Overlap of Access Bursts: ILP Generation!

slide-59
SLIDE 59

Interval Trees to record accesses

[335820,335820],1 R,4,4208860 [335820,335820],1 W,4,4208658 [335824,335824],1 W,4,4208639 [335816,335816],1 W,4,4208677 [335820,335820],1 R,4,4208822 [335820,335820],1 W,4,4208884 [335920,335920],1 R,4,4208736 [335812,335812],1 W,4,4208696 [335820,335820],1 R,4,4208926 [335824,335824],1 R,4,4208902 [337888,339884],500 R,4,4208985 [337892,339888],500 W,4,4209028

  • Recorded info is: [Begin, End], #Accesses, Kind, Stride, AtWhichPCValue
  • Allows efficient comparison of access bursts across threads
  • These Red-Black trees are highly tuned
  • Used within Linux to realize fair scheduling methods
slide-60
SLIDE 60

Concluding Remarks: Sword is now practical!

Both%Archer%and%Sword%are%available

Github.com /%PRUNERS

slide-61
SLIDE 61

Conclusions: Time for “Medium” Examples

Online Offline Total Efficacy

Archer

1 1

Misses races

Sword

1 10* 11

Finds all races within the execution**

* : can be brought down to 1 by using an MPI cluster ** : we define the formal semantics of OMP race checking [HIPS’18]

slide-62
SLIDE 62

Online Offline Total Efficacy

Archer

1 1

Misses races

Sword

1 10* 11

Finds all races within the execution**

Conclusions: Time for Larger Examples

Memory

slide-63
SLIDE 63
  • Sword&works&well&;&finds&more&races&than&Archer
  • Applied&to&realistic&benchmarks
  • Archer&test&suite
  • RaceBench from&LLNL
  • Offline&analysis&can&be&parallelized
  • Still&“decent”&on&standard&multicore&platforms
  • It&took&many%ideas%working&together&to&realize&Sword
  • Formal&semantics&of&OpenMPConcurrency
  • Online&/&Offline&checking&split
  • Data&compression
  • SelfAbalancing&interval&trees
  • ILPAsystems&to&compress&traces
  • Employs&standard&tracing&methods&based&on&OMPT

More Concluding Remarks

slide-64
SLIDE 64
  • Continue to(debug(/(tune(Sword
  • Incorporate ideas(from(upcoming(pubs
  • GPU(race checking

Future Work

slide-65
SLIDE 65

Group Credits

Simone Zvonimir Dong Ignacio Greg

slide-66
SLIDE 66

Extras

slide-67
SLIDE 67
  • High-level code is just “fiction”
  • Code%optimizations%are%done%on%a%PER%THREAD%basis
  • Races%occur%if%you%don’t%tell%a%compiler%what’s%shared

while(!f)%{}%%%%%! r%=%f;%%while%(!r)%{}%%%%%:%this%is%OK%if%“f”%is%purely%local while(!f)%{}%%%%%! r%=%f;%%while%(!r)%{}%%%%%:%not%OK%if%f is%shared%and%you%don’t%tell this%to%the%compiler

  • How to inform a compiler
  • Put%the%variables%inside%a%mutex (or%other%synchronization%block)
  • Declare%them%to%be%a%Java%volatile%or%C++11%atomic
  • C-volatiles won’t do (they don’t have a definite concurrency semantics)

Data Races: Gist

slide-68
SLIDE 68
  • High-level code is just “fiction”
  • Code%optimizations%are%done%on%a%PER%THREAD%basis
  • Races%occur%if%you%don’t%tell%a%compiler%what’s%shared

while(!f)%{}%%%%%! r%=%f;%%while%(!r)%{}%%%%%:%this%is%OK%if%“f”%is%purely%local while(!f)%{}%%%%%! r%=%f;%%while%(!r)%{}%%%%%:%not%OK%if%f is%shared%and%you%don’t%tell this%to%the%compiler

Data Races: Gist

slide-69
SLIDE 69

GPUs races also can lead to “pink-elephants”

Analogy due to Herb Sutter

__global__'void'kernel(int*'x,'int*'y)' {' ''int'index'='threadIdx.x;' ''y[index]'='x[index]'+'y[index];' ''if'(index'!='63'&&'index'!='31)' ''''y[index+1]'='1111;' }' Ini$ally(:(x[i](==(y[i](==(i( Warp1size(=(32( The'hardware'schedules'these'instrucKons'in' “warps”'(SIMD'groups).'' However,'this'“warp'view”'oSen'appears' to'be'lost' E.g.'When'compiling'with'opKmizaKons' Expected(Answer:(0,(1111,(1111,(…,(1111,(64,(1111,(…(' New(Answer:(0,(2,(4,(6,(8,(…'