Prediction of workflow execution time using provenance traces: practical applications in medical data processing
Hugo Hiden Simon Woodman Paul Watson
How long will my program take to run?
– What version of the program ran?
– How was a result generated?
– Can I repeat my results?
– What are the implications of errors?
Provenance can answer these questions.
e-Science Central (e-SC)
– Platform used for many projects
– Users can add their own code
– Used to collect OPM provenance – now PROV
– Diverse projects – large applications
– Directed acyclic graph
– Composed of connected "Blocks"
– Deploys at reasonable scale in clouds
– Workflow is some combination of individual block models
– The input data sizes
– The configuration of the block
– The machine it is running on
– What types of model are most appropriate?
– How accurate are they?
time = f(input-size, block-code, block-settings, random-factors)
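One way to sketch this relationship is a simple per-block regression over provenance records. A minimal illustration, assuming (hypothetically) an approximately linear link between input size and duration for one block version; the record values are invented:

```python
# Fit a per-block duration model from provenance records.
# Field names and data are illustrative assumptions.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b (one predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

# Provenance records for one block version: (input_size_kb, duration_s)
records = [(100, 12.0), (200, 21.5), (400, 41.0), (800, 80.5)]
a, b = fit_linear([r[0] for r in records], [r[1] for r in records])

def predict_duration(input_size_kb):
    return a * input_size_kb + b
```

Real block behaviour may be non-linear or flat, which is why the talk fits a separate model per block rather than one global form.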
– More data increases execution time
– Each block has different characteristics, so a model is needed for each block
– The configuration of the block instance can change behavior
– Machine load, network traffic, hardware variations, …
A workflow is a connected pathway of blocks…
– In response to more data
– When more blocks are added
– Pick based on best error
– Mechanisms to return estimate if no models are available
– Collect data "non-invasively"
– Different blocks display different characteristics
– Different algorithms and versions
– New blocks being added
– Block behaviour only becomes apparent as data is collected
– Data sizes
– Code versions
– Algorithm settings
– Block start and end times
– Number of concurrent workflows
– CPU / memory usage
[Architecture diagram: e-SC DB (Postgres / MySQL / SQL Server), Provenance Store (Postgres / Neo4j), e-SC Blob Store (Filesystem / S3 / Postgres / Azure Blob Store / HDFS), New e-SC Blob Store, Archive (Filesystem / AWS Glacier); Migration, Provenance, Archive and Workflow queues; Workflow Engines with Service/Lib Cache and Control Topic; core services: Security (OpenID / Shibboleth / External Auth), User MGMT, Processing, Provenance/Audit, Storage, SWORD Tooling; External API (REST / HTTP) and Private API (REST / RMI)]
[Pipeline: provenance and performance data capture → data / model storage → model building / updating → models]
data point: ID, Version (identifying data); Setting_1, Setting_2, Memory Use, Input_size; Duration, Output_size
– The same records feed the data for multiple models (Model X data, Model Y data)
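The per-execution record could be represented as below; the field names follow the slide, but the exact schema, types, and units are assumptions:

```python
# Sketch of one captured provenance/performance data point.
# Schema and units are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class BlockDataPoint:
    block_id: str          # identifying data
    version: str
    setting_1: float       # block configuration
    setting_2: float
    memory_use_mb: float
    input_size_kb: float
    duration_s: float      # targets for the duration / output-size models
    output_size_kb: float

point = BlockDataPoint("CSVImport", "1.2", 0.5, 1.0, 256.0, 1024.0, 8.3, 512.0)
```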
Blocks may exhibit very different behaviors depending on their implementation details / configuration:
– No relationship
– Linear relationship
– Non-linear relationship
We do not know in advance what the best model will be
– Gathering more data may change our view
– Models can be rebuilt and replaced on the fly
– Prediction accuracy may improve as more data arrives
– Rebuild them all when new data arrives
– Use the best one until the next update
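The "rebuild all, keep the best" idea can be sketched as follows. The candidate set (a mean model and a linear model) and the RMSE selection criterion are illustrative assumptions, not the talk's exact model families:

```python
# Refit every candidate model type for a block when new data arrives,
# then keep whichever has the lowest error until the next update.

def rmse(model, data):
    return (sum((model(x) - y) ** 2 for x, y in data) / len(data)) ** 0.5

def mean_model(data):
    m = sum(y for _, y in data) / len(data)
    return lambda x: m

def linear_model(data):
    # ordinary least squares for y = a*x + b
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    a = sxy / sxx if sxx else 0.0
    b = my - a * mx
    return lambda x: a * x + b

def best_model(data):
    candidates = {"mean": mean_model(data), "linear": linear_model(data)}
    return min(candidates.items(), key=lambda kv: rmse(kv[1], data))

data = [(1, 2.0), (2, 4.1), (3, 6.0), (4, 8.1)]
name, model = best_model(data)
```

Because model building is asynchronous, the currently selected model keeps serving predictions while replacements are fitted.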
Quantitative Structure Activity Relationship (QSAR)
Activity ≈ f(structure)
– Still want to return something
– Use version-agnostic model
– Use average execution time of block
– Use average execution time of all blocks
An estimate can always be returned as long as a single block of any type has executed
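The fallback hierarchy can be sketched as a simple chain of lookups; the store layout (a dict keyed by block and version, with `None` marking a version-agnostic model) is an assumption for illustration:

```python
# Fallback chain: version-specific model -> version-agnostic model
# -> block's mean duration -> mean over all blocks.

def estimate(block_id, version, models, block_means, global_mean, input_size):
    model = models.get((block_id, version)) or models.get((block_id, None))
    if model is not None:
        return model(input_size)
    if block_id in block_means:
        return block_means[block_id]
    return global_mean  # available once any block has ever executed

models = {("Classify", "2.0"): lambda size: 0.1 * size + 2.0}
block_means = {"Classify": 50.0, "Import": 5.0}
global_mean = 30.0

r1 = estimate("Classify", "2.0", models, block_means, global_mean, 100)  # versioned model
r2 = estimate("Classify", "1.0", models, block_means, global_mean, 100)  # block mean
r3 = estimate("Export", "1.0", models, block_means, global_mean, 100)    # global mean
```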
Wrist-worn accelerometers
– Typically 100 Hz
– Worn for 2 weeks
– Analyse sleep patterns, general activity levels etc.
– Clinicians view results and modify exercise regime
– Collections of 100k data sets (24 TB)
[Plots: output size model, predicted vs actual (KB); duration model, predicted vs actual (seconds); prediction, fitted and ideal lines shown]
Physical Activity Classification (PAC1)
[Plots: duration model, predicted vs actual (seconds), RMSE=34.670, r²=0.987; output size model, predicted vs actual (KB)]
GGIR GENEActiv processing
[Plot: duration model, predicted vs actual (seconds); prediction, fitted and ideal lines shown]
Modelling is complicated by connected nature of workflow
For a single block, all the data needed for the model is readily available… this is not the case here
How big are the intermediate data transfers?
size = f(input-size, block-code, block-settings, random-factors)
– More data increases execution time
– Each block has different characteristics, so a model is needed for each block
– The configuration of the block instance can change behavior
– Machine load, network traffic, hardware variations, phase of the moon…
Execution time = Sum(block predictions)
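For a linear pipeline, this composition can be sketched by feeding each block's predicted output size into the next block's models and summing the predicted durations. The per-block coefficients below are invented for illustration:

```python
# Each block: (duration_model, output_size_model), both functions of
# input size in KB. Coefficients are illustrative assumptions.
pipeline = [
    (lambda s: 0.02 * s, lambda s: 2.0 * s),   # e.g. decompress
    (lambda s: 0.05 * s, lambda s: 0.5 * s),   # e.g. classify
    (lambda s: 0.01 * s, lambda s: 0.1 * s),   # e.g. summarise
]

def predict_workflow(input_size_kb, pipeline):
    total, size = 0.0, input_size_kb
    for duration_model, size_model in pipeline:
        total += duration_model(size)
        size = size_model(size)   # propagate the *predicted* size downstream
    return total

total_s = predict_workflow(100.0, pipeline)
```

Note that every block after the first sees a predicted size, not a measured one, which is where the error-propagation problem discussed next comes from.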
[Plot: whole-workflow prediction, predicted vs actual (seconds); training RMSE=5.008, r²=0.980; testing RMSE=4.698, r²=0.981]
Chemical property modelling
Propagating size predictions from block to block
– Propagation of errors
– Rapidly fails for larger workflows
– More data collection
– Model groups of blocks
– Build models of whole workflows
– Provenance data can be used to build block performance models
– Asynchronous collection of data and model building
– Block models can be combined to estimate workflow execution time
– Large workflows / poor block models are issues