Scalable Observation System (SOS) for Scientific Workflows: Project Overview & Discussion (PowerPoint PPT Presentation)


SLIDE 1

Scalable Observation System (SOS) for Scientific Workflows
Project Overview & Discussion

Chad D. Wood
Supervisors: Prof. Allen Malony and Kevin Huck

SLIDE 2

"So, where is this talk going?"

To advocate and demonstrate a runtime system designed to enable the characterization and analysis of complex scientific workflow performance at scale.

SLIDE 3

So, What is the Problem?

• It is reasonable to want to see "information" during application execution
• Information could come from the application as well as from the environment in which the application is executing
• Application: performance, problem-specific data and metadata, ...
• Environment: system state, resource usage, runtime properties, ...
• Multiple application components may be running together as a workflow, and higher-level workflow behavior might be interesting

SLIDE 4

Scientific Workflows

[Diagram: compute timeline with DATA and VIZ stages; A: Parallel, B: Serial, C1: Irregular, C2: Serial, C1 + C2: Parallel; legend marks units of work and results]

• Multiple components with data flow
• Complex interactions with dynamic behavior
• Components (or entire flows) may be parallelized differently
• Offline episodic performance analysis has limited benefits

SLIDE 5

What are the Requirements?

• Scalable
• Portable
• Easy to use
• Multi-purpose
• Multiple information sources
• Operates at the time of application (workflow) execution
• Supports in situ access
• Low overhead and low intrusion
• Ability to allocate additional resources to control overhead

SLIDE 6

Design Approach

• Based on a model of a "global" information space
• Utilize database technology
• Utilize MPI high-performance communication
• Build on launch support in the scheduler
• Allow for additional (dedicated) resource allocation
• Flexible publishing interface
• SOS architecture

SLIDE 7

SOSflow: HPC Allocations

[Diagram: an HPC allocation with in situ daemons and local databases (Db), aggregate databases (Adb), and analytics instances (Ai)]

SOSflow forms a functional overlay.

SLIDE 8

SOSflow: HPC Allocations

In situ daemon with its local database

SLIDE 9

SOSflow: HPC Allocations

SOS lives side by side with your tasks

[Diagram: on each node, one SOS daemon runs alongside the application's processing elements (PE)]

SLIDE 10

SOSflow: HPC Allocations

Dedicated nodes for aggregate databases

SLIDE 11

SOSflow: HPC Allocations

Dedicated nodes for analytics processing

SLIDE 12

SOSflow: HPC Allocations

Co-located analytics query modules

SLIDE 13

SOSflow: HPC Allocations

Independent ranks of analytics engines

SLIDE 14

SOSflow: HPC Allocations

Analytics modules form independent communication channels

SLIDE 15

SOSflow: HPC Allocations

SOSflow data is continuous and asynchronous

SLIDE 16

SOSflow: Data Structure

[Diagram: a Source publishes values to SOS through a publication handle; value relationship hints and metadata describe both the pub handle and the source]

Per-value ENUM descriptors: Scope, Layer, Nature, Retain, Frequency, State, Class, Type, Semantic, Pattern, Compare, Mood
Time stamps: Time_Start, Time_Stop, Time_Span
Example descriptor members: Sample, Counter, Log, Create_Input, Create_Output, Create_Viz, Exec_Work, Buffer, Support_Exec, Support_Flow, Control_Flow, SOS
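The per-value descriptors above can be modeled as enumerations attached to each published value. Here is a minimal sketch (Python purely for illustration; the member lists and grouping are assumptions drawn from the slide's labels, not the real SOS C structs):

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical subset of the per-value descriptors named on the slide.
class Semantic(Enum):
    SAMPLE = auto()       # periodic measurement
    COUNTER = auto()      # monotonically increasing count
    LOG = auto()          # event/log message

class Pattern(Enum):
    CREATE_INPUT = auto()
    CREATE_OUTPUT = auto()
    CREATE_VIZ = auto()
    EXEC_WORK = auto()
    BUFFER = auto()

@dataclass
class ValueMeta:
    """Metadata carried alongside every published value (sketch only)."""
    semantic: Semantic
    pattern: Pattern
    time_start: float = 0.0
    time_stop: float = 0.0

    @property
    def time_span(self) -> float:
        # Time_Span is derived from the Time_Start/Time_Stop stamps.
        return self.time_stop - self.time_start

meta = ValueMeta(Semantic.COUNTER, Pattern.EXEC_WORK,
                 time_start=1.0, time_stop=3.5)
print(meta.time_span)  # 2.5
```

The point of the enumerations is that every value arrives self-describing, so downstream analytics can filter by semantic class without application-specific knowledge.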

SLIDE 17

SOSflow: Data Structure

[Diagram: a publication handle carries <definitions> and numbered <val_snaps>; each snapshot records semantic, value, and mood]

time.pack: stored by client
time.send: pushed to daemon
time.recv: injected into db

Every value is conserved, with its full history and evolving metadata.
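The three time stamps trace each value through the pipeline, and the "every value is conserved" property means snapshots accumulate rather than overwrite. A small sketch of that idea (illustrative only; field names mirror the slide, not the actual implementation):

```python
import time
from dataclasses import dataclass, field

@dataclass
class ValSnap:
    """One snapshot of a value; time stamps mirror the slide's fields."""
    frame: int
    val: float
    time_pack: float = 0.0   # stored by client
    time_send: float = 0.0   # pushed to daemon
    time_recv: float = 0.0   # injected into db

@dataclass
class PubValue:
    name: str
    history: list = field(default_factory=list)   # every snapshot is conserved

    def pack(self, frame: int, val: float) -> ValSnap:
        snap = ValSnap(frame, val, time_pack=time.time())
        self.history.append(snap)    # append, never overwrite
        return snap

v = PubValue("loop_time")
for frame in range(3):
    v.pack(frame, 0.1 * frame)
print(len(v.history))  # 3: no snapshot is lost
```

Because the pack/send/recv stamps are distinct, the latency contributed by each stage of the asynchronous pipeline can be measured per value.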

SLIDE 18

SOSflow: Easy to use API

[Code listing elided in the original slide]
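The slide's code listing did not survive extraction. To give a rough feel for a pack/announce/publish-style client workflow, here is a hypothetical Python rendering (the real SOS API is a C library; every name below is an illustrative assumption, not the actual interface):

```python
class SOSPub:
    """Hypothetical publication handle: announce once, then pack/publish repeatedly."""
    def __init__(self, title: str):
        self.title = title
        self.values = {}
        self.announced = False
        self.outbox = []          # stands in for messages sent to the daemon

    def pack(self, name: str, val) -> None:
        # Client-side, cheap: just record the value locally.
        self.values[name] = val

    def publish(self) -> None:
        if not self.announced:
            # Definitions/structure go out once, before any snapshots.
            self.outbox.append(("ANNOUNCE", list(self.values)))
            self.announced = True
        # All pack()'ed values are shipped as one batch of snapshots.
        self.outbox.append(("VAL_SNAPS", dict(self.values)))

pub = SOSPub("example_pub")
pub.pack("iterations", 10)
pub.pack("residual", 0.003)
pub.publish()
print(pub.outbox[0][0])  # ANNOUNCE
```

The ease-of-use claim rests on this shape: applications only pack and publish, while transport, storage, and aggregation happen asynchronously in the daemon.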

SLIDE 19

SOSflow: In Situ Socket Communication Protocol

[Diagram: message sequence from the Source (SOS client) to the sosd daemon]

INIT
GUID_BLOCK
ANNOUNCE (metadata, definitions / structure)
PUBLISH
VAL_SNAPS (all pack()'ed values)
...
VAL_SNAPS
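The message sequence above can be sketched as a small state machine on the daemon side (illustrative Python; the message names come from the slide, but the framing and ordering rules are assumptions):

```python
# Order shown on the slide: INIT, GUID_BLOCK, ANNOUNCE, PUBLISH, VAL_SNAPS...
VALID_NEXT = {
    None:         {"INIT"},
    "INIT":       {"GUID_BLOCK"},
    "GUID_BLOCK": {"ANNOUNCE"},
    "ANNOUNCE":   {"PUBLISH"},
    "PUBLISH":    {"VAL_SNAPS"},
    "VAL_SNAPS":  {"VAL_SNAPS"},   # snapshots then stream continuously
}

def run_session(messages) -> bool:
    """Return True if the message sequence is legal under the sketch above."""
    state = None
    for msg in messages:
        if msg not in VALID_NEXT[state]:
            return False
        state = msg
    return True

ok = run_session(["INIT", "GUID_BLOCK", "ANNOUNCE", "PUBLISH",
                  "VAL_SNAPS", "VAL_SNAPS"])
print(ok)  # True
```

The one-time handshake (INIT through PUBLISH) amortizes setup cost, so the steady state is an open socket carrying only VAL_SNAPS batches.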

SLIDE 20

SOSflow: Distributed Asynchronous Runtime (Simple)

[Diagram: the Source (SOS client) talks to an in situ sosd daemon with its local DB; daemons forward to an AGGREGATE sosd (DB) instance]

SLIDE 21

SOSflow: Distributed Asynchronous Runtime

[Diagram: on each node, client apps link the SOS library and reach the local sosd daemon over a socket; the daemon's local_sync thread feeds the on-node DB while its cloud_sync thread forwards data over a transport to aggregate sosd (db) instances; sosa analytics ranks and analytics helpers issue local queries, avoiding a single massive "database of doom"]
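The local_sync / cloud_sync split in the diagram can be pictured as two queues that decouple clients from the aggregate store. A sketch under assumptions (the thread structure is inferred from the slide's labels; real sosd internals will differ):

```python
from queue import Queue

# local_sync drains the client-facing queue into the on-node DB;
# cloud_sync forwards a copy toward the aggregate daemon over the transport.
local_q, cloud_q = Queue(), Queue()
on_node_db, aggregate_db = [], []

def client_publish(record: dict) -> None:
    # Clients enqueue and return: they never block on the aggregate store.
    local_q.put(record)

def local_sync() -> None:
    while not local_q.empty():
        rec = local_q.get()
        on_node_db.append(rec)       # inject into the local DB
        cloud_q.put(rec)             # hand off asynchronously

def cloud_sync() -> None:
    while not cloud_q.empty():
        aggregate_db.append(cloud_q.get())

for i in range(4):
    client_publish({"frame": i})
local_sync()
cloud_sync()
print(len(on_node_db), len(aggregate_db))  # 4 4
```

This staging is what makes the runtime "continuous and asynchronous": data is durable on-node immediately, and aggregation lag never stalls the application.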

SLIDE 22

SOSflow: Where it Runs

• NERSC: Cori, Edison
• LLNL: CAB, Catalyst
• University of Oregon: ACISS
• Software: OpenMPI, MPICH, Slurm, PBS

SLIDE 23

SOSflow: Evaluation

• Experimental Setup
  - Explore performance of the work-in-progress implementation
  - Synthetic and real-world cases
  - What is the latency cost of being async?
• Synthetic Sweep of Parameters
  - Iterations: 2 to 10, in steps of 2
  - Size: 100 to 500 unique values per pub, in steps of 100
  - Delay: 0.5 to 1.0 second, in steps of 0.1 second
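The synthetic sweep enumerates the cross product of the three parameter ranges above; a quick sketch of the resulting configuration count:

```python
from itertools import product

# Ranges taken directly from the slide.
iterations = range(2, 11, 2)                    # 2 to 10, steps of 2 -> 5 levels
sizes = range(100, 501, 100)                    # 100 to 500 per pub  -> 5 levels
delays = [0.5 + 0.1 * i for i in range(6)]      # 0.5 to 1.0 s        -> 6 levels

sweep = list(product(iterations, sizes, delays))
print(len(sweep))  # 5 * 5 * 6 = 150 configurations
```

So the sweep covers 150 distinct configurations, enough to separate the effects of publish size, iteration count, and publish frequency on asynchronous latency.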

SLIDE 24

[Plot: 100 to 500 values per pub, 2 to 10 iterations; SOS_publish() frequency shown as transparency, 0.5 sec to 1.0 sec (darkest)]

SLIDE 25

[Plot: 100 to 500 values per pub, 2 to 10 iterations; translucency represents SOS_publish() frequency, 0.5 sec to 1.0 sec (darkest)]

SLIDE 26

SOSflow: Evaluation

• Real-World Scenario
  - TAU-instrumented LULESH on Cori
  - TAU reports results to SOSflow on a timer
  - LULESH calls the SOSflow API directly at each iteration
  - SOSflow gathers metrics from the OS

<Video>

SLIDE 27

216 Processes

SLIDE 28

343 Processes

SLIDE 29

512 Processes

SLIDE 30

Future Work

• Performance improvements
• Integrate more automatic data gathering for node-level metrics
• Support for deep analytics
• Testing with additional real-world workflows and operating environments