SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in Production Runs

SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in Production Runs. Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamaric, School of Computing, University of Utah, Salt Lake City, UT 84112. Presented at IPDPS 2018. See paper for details.


  1. SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in Production Runs. Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamaric, School of Computing, University of Utah, Salt Lake City, UT 84112. Ignacio Laguna, Greg L. Lee, Dong H. Ahn, Lawrence Livermore National Laboratory, Livermore, CA. Presented at IPDPS 2018. See paper for details. github.com/PRUNERS. Courtesy Pinterest.

  2. What is a data race?

  3. What is a data race? Thread 1 Thread 2

  4. What is a data race? Thread 1: R/W. Thread 2: W.

  5. What is a data race? Thread 1: R/W. Thread 2: W. No synchronization between the two accesses.

  6. One way to eliminate this race. T0: W. T1: R/W.

  7. One way to eliminate this race. T0: LOCK, W, UNLOCK. T1: LOCK, R/W, UNLOCK.

  8. One way to eliminate this race. T0: LOCK, W, UNLOCK. T1: LOCK, R/W, UNLOCK.
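The LOCK/UNLOCK pattern of slides 7 and 8 can be sketched in OpenMP with a critical section. This is an illustrative example only: `sum_with_critical` is a hypothetical helper, not code from the talk, and `#pragma omp critical` stands in for the explicit LOCK/UNLOCK pair shown on the slide.

```cpp
#include <vector>

// Sums `values`, protecting every update of the shared variable `total`.
// Hypothetical helper for illustration; not taken from the talk.
long sum_with_critical(const std::vector<int>& values) {
    long total = 0;
    const long n = static_cast<long>(values.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        // LOCK ... UNLOCK: only one thread at a time executes this block.
        #pragma omp critical
        {
            total += values[i];
        }
    }
    return total;
}
```

Without the critical section, `total += values[i]` is an unsynchronized R/W by several threads, which is exactly the race pictured on slide 5. Build with OpenMP enabled (for example `-fopenmp` with GCC or Clang); without that flag the pragmas are ignored and the loop runs serially.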

  9. Another way to eliminate this race. T0: W. T1: R/W.

  10. Another way to eliminate this race: signal using ‘special’ variables (RELEASE after the W, ACQUIRE before the R/W) • Java ‘volatile’ annotations • NOT C ‘volatiles’! • C++11 ‘atomic’ annotations
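A minimal sketch of the release/acquire signalling on slide 10, using C++11 atomics. The names `publish_and_read`, `payload`, and `ready` are made up for this example; the writer releases after its W and the reader acquires before its R.

```cpp
#include <atomic>
#include <thread>

// One thread writes plain data and then signals through an atomic flag;
// the other waits on the flag before reading. Hypothetical example code.
int publish_and_read() {
    int payload = 0;                 // ordinary, non-atomic shared data
    std::atomic<bool> ready{false};  // the "special" variable
    int seen = -1;

    std::thread writer([&] {
        payload = 42;                                  // W
        ready.store(true, std::memory_order_release);  // RELEASE
    });
    std::thread reader([&] {
        while (!ready.load(std::memory_order_acquire)) {  // ACQUIRE
            std::this_thread::yield();
        }
        seen = payload;  // R: ordered after the W by release/acquire
    });

    writer.join();
    reader.join();
    return seen;
}
```

Replacing `std::atomic<bool>` with a C `volatile bool` would not give this guarantee, which is the point of the "NOT C volatiles" bullet.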

  11. A third way. T0: W. T1: R/W.

  12. A third way: put a barrier between T0's W and T1's R/W.
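The barrier approach of slide 12 might look as follows in OpenMP. This is a sketch under assumptions: `count_stale_reads` is a hypothetical function, and the write is done by whichever thread enters the `single` construct.

```cpp
// One thread writes `value`; a barrier separates that write from the reads
// performed by every thread. Returns how many threads saw a stale value.
// Hypothetical example code, not taken from the talk.
int count_stale_reads() {
    int value = 0;
    int stale = 0;
    #pragma omp parallel shared(value) reduction(+ : stale)
    {
        #pragma omp single nowait
        {
            value = 42;  // W, by exactly one thread
        }
        // All threads wait here, so the W above happens before every R below.
        #pragma omp barrier
        if (value != 42) {  // R
            ++stale;
        }
    }
    return stale;
}
```

With the barrier in place the function returns 0. The `nowait` clause removes the implicit barrier of `single`, so the explicit `#pragma omp barrier` is the only thing ordering the write before the reads.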

  13. Why eliminate races?

  14. Popular answer: “avoid nondeterminism”. T0: t = X. T1: X = 0.

  15. It is unclear what “nondeterminism” means.

  16. Execution order is still nondeterministic. T0: LOCK, X = 0, UNLOCK. T1: LOCK, t = X, UNLOCK.

  17. More relevant: avoid “pink elephants”!

  18. More relevant: avoid “pink elephants”! Pink elephant (Sutter): “A value you never wrote but managed to read”, aka an “out of thin air” value.

  19. The birth of a pink elephant… As written: T0 has X = 0 and T1 has t = X (t is 0 here). After compiler optimizations: T0 has X = 24 and T1 has t = X (24 read here). You may never have written “24” in your program.

  20. Details of how a pink elephant is made! As written, T0 contains X = 0, Y = 23, and X = Y + 1, and T1 contains t = X. After compiler optimizations, T0 contains X = 24 and Y = 23, and T1's t = X reads 24. The compiler has NO IDEA that the user meant to communicate here!! Compiler optimizations create these pink-elephant values…

  21. This is why code containing data races often fails (only) when optimized!

  22. Race-freedom ensures intended communications (T0: W. T1: R/W.) • You don’t observe “half baked” values • Code does not reorder around sync. points • No “word tearing” • Pending writes flushed (fences inserted)

  23. Exploding a myth! There is no such thing as a benign race!!

  24. Races in OpenMP programs are hard to spot • See tinyurl.com/ompRaces if you wish • but later! • Static analysis tools never shown to work well • First usable OpenMP dynamic race checker (afaik) • Archer [Atzeni, IPDPS’16] • More on that soon • This talk will present the second usable dynamic race checker • Sword

  25. This talk: Why and how of another OMP race checker

  26. The Pink Elephant Actually Struck Us! • HYDRA porting on Sequoia at LLNL • Large multiphysics MPI/OpenMP application • Non-deterministic crashes in OpenMP region • Only when the code was optimized! • Suspected data race • Emergency hack: • Disabled OpenMP in Hypre • Root-cause found by Archer: • two threads writing 0 to a common location without synchronization

  27. Archer to the rescue!

  28. Archer to the rescue! Archer [IPDPS’16] • Utah: Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamaric • LLNL: Dong H. Ahn, Ignacio Laguna, Martin Schulz, Gregory L. Lee • RWTH: Joachim Protze, Matthias S. Muller • In production use at LLNL • Part of the “PRUNERS” tool suite • PRUNERS was a finalist of the 2017 R&D 100 Award Selection

  29. Archer’s “find”: two threads writing 0 to the same location without synchronization

  30. Archer’s “find”: two threads writing 0 to the same location without synchronization

  31. Did we live “happily ever after?”

  32. No!

  33. Archer has “memory-outs”; also misses races

  34. Archer has “memory-outs”; also misses races • Archer increases memory 500% • It also misses races! • These were known issues • Finally surfaced with the “right large example” • Root-cause found by Archer: • two threads writing 0 to a common location without synchronization

  35. Reason: Archer employs “shadow cells”. [Diagram: Core 0 to Core 3; application addresses A0, A1, …, Amax, each with shadow cells ss0 to ss3.] A programmable number of cells per address (4 shown, and is typical).

  36. ~4 shadow cells per application location. [Diagram: Core 0 to Core 3; application addresses A0, A1, …, Amax, each with shadow cells ss0 to ss3.] A programmable number of cells per address (4 shown, and is typical). Shadow cells immediately increase memory demand by a factor of four.

  37. Archer misses races due to shadow cell eviction

  38. Archer misses races due to shadow cell eviction. [Diagram: Core 0 to Core 3; application addresses A0, A1, …, Amax, each with shadow cells ss0 to ss3.]

  39. Archer misses races due to shadow cell eviction. [Diagram: Core 0 to Core 3; application addresses A0, A1, …, Amax, each with shadow cells ss0 to ss3.] Thread 3 writes a[3]; all threads read a[3]. All threads read A[3]; thread 3 writes A[3].

  40. Capacity conflict: evict shadow cell. [Diagram: Core 0 to Core 3; application addresses A0, A1, …, Amax, each with shadow cells ss0 to ss3.] With the shadow cell evicted, races are missed.
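A toy model of the eviction problem on slides 37 to 40. It is not Archer's implementation: the round-robin eviction policy, the `ordered_after` set used to stand in for happens-before information, and all names are assumptions made for this sketch. A capacity of 0 means "keep every access", which is the behavior a full trace would give.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// One recorded access to a single application address (hypothetical model).
struct Access {
    int thread;
    bool is_write;
    std::set<int> ordered_after;  // threads whose earlier accesses happen-before this one
};

// Two accesses conflict if they come from different threads, at least one
// is a write, and the later one is not ordered after the earlier one.
bool conflicts(const Access& earlier, const Access& later) {
    return earlier.thread != later.thread &&
           (earlier.is_write || later.is_write) &&
           later.ordered_after.count(earlier.thread) == 0;
}

// Replays `trace` against a shadow word with `capacity` cells and returns
// the number of races reported. capacity == 0 means unbounded storage.
int count_races(const std::vector<Access>& trace, std::size_t capacity) {
    std::vector<Access> cells;
    std::size_t next_victim = 0;
    int races = 0;
    for (const Access& current : trace) {
        for (const Access& old : cells) {
            if (conflicts(old, current)) ++races;
        }
        if (capacity == 0 || cells.size() < capacity) {
            cells.push_back(current);
        } else {
            cells[next_victim] = current;  // capacity conflict: evict a cell
            next_victim = (next_victim + 1) % capacity;
        }
    }
    return races;
}

// Thread 0 writes; threads 1 to 4 read after synchronizing with thread 0;
// thread 5 reads with no synchronization, racing with thread 0's write.
std::vector<Access> example_trace() {
    std::vector<Access> trace;
    trace.push_back(Access{0, true, std::set<int>{}});
    for (int reader = 1; reader <= 4; ++reader) {
        trace.push_back(Access{reader, false, std::set<int>{0}});
    }
    trace.push_back(Access{5, false, std::set<int>{}});
    return trace;
}
```

With four cells, the four ordered reads push the write out before thread 5 arrives, so `count_races(example_trace(), 4)` reports 0 races. With unbounded storage, `count_races(example_trace(), 0)` reports the one real race.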

  41. Archer misses races due to HB-masking

  42. Archer misses races due to HB-masking. These are concurrent; there are two races here! These races are missed in this interleaving!

  43. Solution: get rid of shadow cells!!

  44. Need new approach with online/offline split. [Diagram: Core 0 to Core 3, each with its own Compression stage, followed by Offline Analysis, which produces Race Reports.]

  45. Details of the online phase. [Diagram: Core 0 to Core 3, each with its own Compression stage.] • Collect traces per core, uncoordinated • Trace-collection speeds increased; we use the OMPT tracing method • Employ data compression to bring FULL traces out • Only 2.5 MB compression buffer per thread (fits in L3 cache)
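The bounded-memory idea on slide 45 can be sketched as a fixed-capacity per-thread buffer that is handed to a sink whenever it fills up. Everything here is hypothetical: `TraceEvent`, `BoundedTraceBuffer`, and the `Sink` callback are made-up names, and the sink only stands in for the "compress and write out" step, which this sketch does not implement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// One logged memory access (hypothetical record layout).
struct TraceEvent {
    std::uint64_t address;
    bool is_write;
};

// Per-thread buffer whose memory use never exceeds `capacity` events.
class BoundedTraceBuffer {
public:
    using Sink = std::function<void(const std::vector<TraceEvent>&)>;

    BoundedTraceBuffer(std::size_t capacity, Sink sink)
        : capacity_(capacity), sink_(std::move(sink)) {
        assert(capacity_ > 0);
        events_.reserve(capacity_);
    }

    // Appends one event; flushes as soon as the buffer is full.
    void record(const TraceEvent& event) {
        events_.push_back(event);
        if (events_.size() >= capacity_) flush();
    }

    // Hands the buffered events to the sink and empties the buffer.
    // A real tool would compress the chunk here before writing it out.
    void flush() {
        if (events_.empty()) return;
        sink_(events_);
        events_.clear();  // keeps the reserved capacity
    }

    std::size_t pending() const { return events_.size(); }

private:
    std::size_t capacity_;
    Sink sink_;
    std::vector<TraceEvent> events_;
};
```

Because each thread owns its own buffer, no coordination between threads is needed while tracing, which matches the "uncoordinated" bullet above.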

  46. Consequences for the offline phase. [Diagram: Core 0 to Core 3, each with its own Compression stage.] • We would have lost all the synchronization information • We only know what each thread is doing • We must recover the concurrency structure • And in the context of its happens-before order, detect races!

  47. Offline synchronization recovery and analysis. [Diagram: Core 0 to Core 3, each with its own Compression stage, feeding OpSem (HIPS’18). Thread tree with offset-span labels: 0 - [0,1]; 1 - [0,1][0,2]; 2 - [0,1][1,2]; 3 - [0,1][0,2][0,2]; 4 - [0,1][0,2][1,2]; 5 - [0,1][1,2][0,2]; 6 - [0,1][1,2][1,2]; 7 - [0,1][2,2]; 8 - [0,1][2,2][0,2]; 9 - [0,1][2,2][1,2]; 10 - [0,1][4,2]; 11 - [0,1][3,2]; 12 - [1,1]. Events shown: m_acq()/m_rel(), m_acq(M1)/m_rel(M1), read and write of x and y, Barrier(1), Barrier(2), IBarrier(3) to IBarrier(7), FOR-LOOP. Races marked: R1: race on y; R2: race on y; R3: race on x.]

  48. Offset-Span Labels: How we record concurrency (Mellor-Crummey, 1991)
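A sketch of the ordering test commonly given for offset-span labels, checked against labels from slide 47. It assumes the rule as usually stated for Mellor-Crummey's scheme: label A precedes label B if A is a proper prefix of B, or if at the first position where they differ the spans are equal, A's offset is smaller, and both offsets are congruent modulo the span. The type and function names are made up, and Sword's own handling of barriers is not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using OffsetSpan = std::pair<unsigned, unsigned>;  // (offset, span), span >= 1
using Label = std::vector<OffsetSpan>;

// True if the code region labeled `a` is ordered before the one labeled `b`.
bool precedes(const Label& a, const Label& b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
    if (i == a.size()) return i < b.size();  // a is a proper prefix of b
    if (i == b.size()) return false;         // b is a proper prefix of a
    const OffsetSpan& x = a[i];
    const OffsetSpan& y = b[i];
    assert(x.second > 0 && y.second > 0);
    return x.second == y.second && x.first < y.first &&
           x.first % x.second == y.first % y.second;
}

// Two distinct labels are concurrent when neither precedes the other.
bool concurrent(const Label& a, const Label& b) {
    return a != b && !precedes(a, b) && !precedes(b, a);
}
```

For example, the sibling labels `[0,1][0,2]` and `[0,1][1,2]` are concurrent, while `[0,1]` precedes `[0,1][0,2]` and `[0,1][0,2]` precedes the post-join label `[1,1]`.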

  49. Key state in OpSem: maintain barrier intervals. [Diagram: the labeled thread tree of slide 47, annotated with Barrier Interval 1, Barrier Interval 2, Barrier Interval 3, and Barrier Interval 5.]

  50. Examples of races reported. [Diagram: the labeled thread tree of slide 47 with races R1 (on y), R2 (on y), and R3 (on x); Barrier Interval 3 is highlighted.] Race within the same barrier interval.

  51. Examples of races reported. [Diagram: the labeled thread tree of slide 47 with races R1 (on y), R2 (on y), and R3 (on x); Barrier Interval 2, Barrier Interval 3, and Barrier Interval 5 are highlighted.] Races across parallel regions.
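A toy version of the "race within the same barrier interval" check from slide 50. It is a simplification and not Sword's offline algorithm: accesses in different barrier intervals are simply treated as ordered, so races across parallel regions (slide 51) are out of scope, and all names and the quadratic pairwise scan are assumptions made for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// One access recovered from a thread's trace (hypothetical record layout).
struct LoggedAccess {
    int thread;
    int barrier_interval;
    std::uint64_t address;
    bool is_write;
    std::set<int> held_mutexes;  // mutexes held at the time of the access
};

// Two accesses may race if they are by different threads in the same
// barrier interval, touch the same address, include at least one write,
// and are not protected by a common mutex.
bool may_race(const LoggedAccess& a, const LoggedAccess& b) {
    if (a.thread == b.thread) return false;
    if (a.barrier_interval != b.barrier_interval) return false;
    if (a.address != b.address) return false;
    if (!a.is_write && !b.is_write) return false;
    for (int mutex : a.held_mutexes) {
        if (b.held_mutexes.count(mutex) != 0) return false;
    }
    return true;
}

// Returns index pairs (i, j), i < j, of accesses in `log` that may race.
std::vector<std::pair<std::size_t, std::size_t>> find_races(
        const std::vector<LoggedAccess>& log) {
    std::vector<std::pair<std::size_t, std::size_t>> races;
    for (std::size_t i = 0; i < log.size(); ++i) {
        for (std::size_t j = i + 1; j < log.size(); ++j) {
            if (may_race(log[i], log[j])) races.emplace_back(i, j);
        }
    }
    return races;
}
```

In this model, two writes to x that both hold M1 are not reported, while two unprotected writes to y in the same barrier interval are.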
