Scalable Observation System (SOS) for Scientific Workflows: Project Overview & Discussion


  1. Scalable Observation System (SOS) for Scientific Workflows: Project Overview & Discussion. Chad D. Wood. Supervisors: Prof. Allen Malony and Kevin Huck

  2. “So, where is this talk going?” To advocate and demonstrate a run-time system designed to enable the characterization and analysis of complex scientific workflow performance at scale.

  3. So, What is the Problem?
     - It is reasonable to want to see “information” during application execution
     - Information could come from the application as well as from the environment in which the application is executing
     - Application: Performance, problem-specific data and metadata, ...
     - Environment: System state, resource usage, runtime properties, ...
     - Multiple application components may be running together as a workflow, and higher-level workflow behavior might be interesting

  4. Scientific Workflows
     [Figure: workflow components plotted against compute time. Labels: A: Parallel; B: Serial; C1: Irregular; C2: Serial; C1 + C2: Parallel; VIZ; DATA. Legend: Unit of Work, Result.]
     - Multiple components with data flow
     - Complex interactions with dynamic behavior
     - Components (or entire flows) may be parallelized differently
     - Offline episodic performance analysis has limited benefits

  5. What are the Requirements?
     - Scalable
     - Portable
     - Easy to use
     - Multi-purpose
     - Multiple information sources
     - Operates at the time of application (workflow) execution
     - Supports in situ access
     - Low overhead and low intrusion
     - Ability to allocate additional resources to control overhead

  6. Design Approach
     - Base on a model of a “global” information space
     - Utilize database technology
     - Utilize MPI high-performance communication
     - Build on launch support in scheduler
     - Allow for additional (dedicated) resource allocation
     - Flexible publishing interface
     - SOS architecture

  7. SOSflow: HPC Allocations. [Diagram: Db, Adb, and Ai nodes.] SOSflow forms a functional overlay.

  8. SOSflow: HPC Allocations. [Diagram: Db, Adb, and Ai nodes.] In situ daemon with its local database

  9. SOSflow: HPC Allocations. [Diagram: Db, Adb, and Ai nodes; PE and SOS processes.] SOS lives side by side with your tasks

  10. SOSflow: HPC Allocations. [Diagram: Db, Adb, and Ai nodes.] Dedicated nodes for aggregate databases

  11. SOSflow: HPC Allocations. [Diagram: Db, Adb, and Ai nodes.] Dedicated nodes for analytics processing

  12. SOSflow: HPC Allocations. [Diagram: Db, Adb, and Ai nodes.] Co-located analytics query modules

  13. SOSflow: HPC Allocations. [Diagram: Db, Adb, and Ai nodes.] Independent ranks of analytics engines

  14. SOSflow: HPC Allocations. [Diagram: Db, Adb, and Ai nodes.] Analytics modules form independent communication channels

  15. SOSflow: HPC Allocations. [Diagram: Db, Adb, and Ai nodes.] SOSflow data is continuous and asynchronous

  16. SOSflow: Data Structure
      [Diagram: PUBLICATION HANDLE with Metadata (about both Pub Handle and Source) and Values. Field labels shown: Scope, Source, Layer, Nature, Retain, Frequency, State, Class, Type, Semantic, Pattern, Compare, Mood, Relationship Hints. Enumerated values shown: Create_Input, Create_Output, Create_Viz, Exec_Work, Buffer, Support_Exec, Support_Flow, Control_Flow, SOS; Time_Start, Time_Stop, Time_Stamp, Time_Span, Sample, Counter, Log.]

  17. SOSflow: Data Structure. Every value is conserved, with its full history and evolving metadata.
      [Diagram: a PUB. HANDLE holding <definitions> and <val_snaps>. Each value snapshot shows mood, semantic, val=___, and three timestamps: time.pack, time.send, time.recv. Annotations: stored by client, pushed to daemon, injected into db.]
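The idea on slide 17, that packing a value appends a snapshot rather than overwriting the previous one, can be sketched in a few lines. This is a minimal illustration only: `ValSnap`, `PubHandle`, `mark_sent`, and `history` are hypothetical names, not the SOSflow types or API, and pairing each timestamp with one annotation (pack = stored by client, send = pushed to daemon, recv = injected into db) is an assumption read off the diagram.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ValSnap:
    """One recorded value plus the three timestamps named on the slide."""
    val: Any
    time_pack: float                   # assumed: set when stored by the client
    time_send: Optional[float] = None  # assumed: set when pushed to the daemon
    time_recv: Optional[float] = None  # assumed: set when injected into the db


@dataclass
class PubHandle:
    """Hypothetical stand-in for a publication handle (not the SOSflow type)."""
    snaps: Dict[str, List[ValSnap]] = field(default_factory=dict)

    def pack(self, name: str, val: Any) -> None:
        # Append instead of overwrite, so earlier values are never lost.
        self.snaps.setdefault(name, []).append(ValSnap(val, time.time()))

    def mark_sent(self) -> None:
        # Stamp every snapshot that has not been sent yet.
        now = time.time()
        for history in self.snaps.values():
            for snap in history:
                if snap.time_send is None:
                    snap.time_send = now

    def history(self, name: str) -> List[Any]:
        # Full value history for one name, oldest first.
        return [snap.val for snap in self.snaps.get(name, [])]


pub = PubHandle()
pub.pack("iteration", 1)
pub.pack("iteration", 2)
pub.mark_sent()
pub.pack("iteration", 3)
print(pub.history("iteration"))  # [1, 2, 3]
```

The third snapshot keeps `time_send` unset until the next send, which is how a reader could later measure how long a value waited on the client.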

  18. SOSflow: Easy to use API. [Code listing not recovered.]

  19. SOSflow: In Situ Socket Communication Protocol
      [Diagram: message sequence between Source (SOS) and sosd: INIT, GUID_BLOCK, ANNOUNCE (Metadata, Defs. / Structure), PUBLISH, VAL_SNAPS (all pack()’ed values), ..., VAL_SNAPS.]
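The message order on slide 19 can be modeled without any sockets. The sketch below is a toy, in-process stand-in: `ToyDaemon`, the `"ACK"` reply, the block size of four GUIDs, and the payload shapes are all assumptions made for illustration, and the direction of each message (client sends INIT, daemon answers with GUID_BLOCK) is inferred from the diagram rather than stated in the slides.

```python
from itertools import count
from typing import Dict, List, Tuple


class ToyDaemon:
    """Hypothetical stand-in for the in situ daemon; no real sockets."""

    def __init__(self) -> None:
        self._next_guid = count(1)
        self.announced: Dict[int, List[str]] = {}
        self.stored: List[Tuple[int, str, object]] = []

    def handle(self, msg_type: str, payload=None):
        if msg_type == "INIT":
            # Reply with a block of unique IDs the client can hand out locally.
            return ("GUID_BLOCK", [next(self._next_guid) for _ in range(4)])
        if msg_type == "ANNOUNCE":
            # Metadata and definitions: here, just the value names of a pub.
            guid, names = payload
            self.announced[guid] = list(names)
            return ("ACK", None)
        if msg_type == "PUBLISH":
            # Value snapshots are only accepted for an announced pub.
            guid, val_snaps = payload
            if guid not in self.announced:
                raise ValueError("PUBLISH before ANNOUNCE")
            for name, val in val_snaps:
                self.stored.append((guid, name, val))
            return ("ACK", None)
        raise ValueError(f"unknown message type: {msg_type}")


daemon = ToyDaemon()
_, guids = daemon.handle("INIT")
pub_guid = guids[0]
daemon.handle("ANNOUNCE", (pub_guid, ["temperature"]))
daemon.handle("PUBLISH", (pub_guid, [("temperature", 21.5), ("temperature", 21.7)]))
print(len(daemon.stored))  # 2
```

Handing out GUIDs in blocks means the client does not need a round trip to the daemon for every new identifier, which fits the low-intrusion requirement listed earlier in the deck.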

  20. SOSflow: Distributed Asynchronous Runtime (Simple)
      [Diagram: Source (SOS), sosd with DB, sosd with AGGREGATE DB.]

  21. SOSflow: Distributed Asynchronous Runtime
      [Diagram: Client App nodes, each with a Source (SOS), a sosd, and a DB; local_sync and cloud_sync paths; socket and db transport links; sosa; analytics on an analytics node; a helper node; local (on-node) query; an aggregate store labeled “massive database of doom”.]
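The two-tier layout on slides 20 and 21, a local database per node plus an aggregate database, can be illustrated with in-memory databases. This is a sketch under stated assumptions: Python's standard `sqlite3` module is used purely for convenience, and the table layout, the `synced` flag, and the function names `make_db`, `local_insert`, and `sync_to_aggregate` are hypothetical, not taken from SOSflow.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # One in-memory database stands in for a node-local or aggregate store.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE vals (node TEXT, name TEXT, val REAL, synced INTEGER DEFAULT 0)"
    )
    return db


def local_insert(local: sqlite3.Connection, node: str, name: str, val: float) -> None:
    # The on-node daemon stores a value locally first.
    local.execute("INSERT INTO vals (node, name, val) VALUES (?, ?, ?)", (node, name, val))


def sync_to_aggregate(local: sqlite3.Connection, aggregate: sqlite3.Connection) -> int:
    # Copy rows not yet forwarded, then flag them so they are sent only once.
    rows = local.execute(
        "SELECT rowid, node, name, val FROM vals WHERE synced = 0"
    ).fetchall()
    for rowid, node, name, val in rows:
        aggregate.execute(
            "INSERT INTO vals (node, name, val, synced) VALUES (?, ?, ?, 1)",
            (node, name, val),
        )
        local.execute("UPDATE vals SET synced = 1 WHERE rowid = ?", (rowid,))
    return len(rows)


aggregate = make_db()
node_dbs = {"node0": make_db(), "node1": make_db()}
local_insert(node_dbs["node0"], "node0", "mem_used", 0.42)
local_insert(node_dbs["node1"], "node1", "mem_used", 0.58)
moved = sum(sync_to_aggregate(db, aggregate) for db in node_dbs.values())
total = aggregate.execute("SELECT COUNT(*) FROM vals").fetchone()[0]
print(moved, total)  # 2 2
```

Because values land in the node-local database first, a local (on-node) query can be answered without waiting for the aggregate tier, and the sync step can run whenever resources allow.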

  22. SOSflow: Where it Runs
      - NERSC
        - Cori
        - Edison
      - LLNL
        - CAB
        - Catalyst
      - University of Oregon
        - ACISS
      - Software:
        - OpenMPI
        - MPICH
        - Slurm
        - PBS

  23. SOSflow: Evaluation
      - Experimental Setup
        - Explore performance of work-in-progress implementation
        - Synthetic and real-world cases
        - What is the latency cost of being async?
      - Synthetic Sweep of Parameters
        - Iterations: 2 to 10, steps of 2
        - Size: 100 to 500 unique values per pub, steps of 100
        - Delay: 0.5 to 1.0 second, each 0.1 second
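The sweep ranges on slide 23 can be enumerated directly. The sketch below assumes the sweep is a full cross-product of the three parameters, which the slides do not state explicitly; the variable names are illustrative.

```python
from itertools import product

iterations = list(range(2, 11, 2))                    # 2, 4, 6, 8, 10
sizes = list(range(100, 501, 100))                    # 100, 200, ..., 500 values per pub
delays = [round(0.5 + 0.1 * i, 1) for i in range(6)]  # 0.5, 0.6, ..., 1.0 seconds

# Assumption: every combination of the three parameters is run once.
sweep = list(product(iterations, sizes, delays))
print(len(sweep))  # 150
```

Under that assumption the sweep has 5 x 5 x 6 = 150 configurations.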

  24. [Figure: grid of results, panels from 2 iterations to 10 iterations, axis ticks 100 to 500. SOS_publish() frequency shown as transparency, 0.5 sec to 1.0 sec (darkest).]

  25. [Figure: grid of results, panels from 2 iterations to 10 iterations, axis ticks 100 to 500. Translucency represents SOS_publish() frequency, 0.5 sec to 1.0 sec (darkest).]

  26. SOSflow: Evaluation
      - Real-World Scenario
        - TAU Instrumented LULESH on Cori
        - TAU reports results to SOSflow on a timer
        - LULESH calls SOSflow API directly at iteration
        - SOSflow gathers metrics from the OS
      <Video>

  27. 216 Processes

  28. 343 Processes

  29. 512 Processes

  30. Future Work
      - Performance improvements
      - Integrate more automatic data gathering for node-level metrics
      - Support for deep analytics
      - Testing with additional real-world workflows and operating environments
