Prediction of workflow execution time using provenance traces: practical applications in medical data processing
Hugo Hiden Simon Woodman Paul Watson
How long will my program take to run?
– What version of the program ran?
– How was a result generated?
– Can I repeat my results?
– What are the implications of errors?
Provenance can answer these questions.
e-Science Central (e-SC)
– Platform used for many projects
– Users can add their own code
– Used to collect OPM provenance – now PROV
– Diverse projects – large applications
– Directed acyclic graph
– Composed of connected "Blocks"
– Deploys at reasonable scale in clouds
– Workflow is some combination of individual block models
– The input data sizes
– The configuration of the block
– The machine it is running on
– What types of model are most appropriate?
– How accurate are they?
time = f(input-size, block-code, block-settings, random-factors)
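One way to sketch this relationship is a simple per-block regression over provenance records. A minimal illustration, assuming (hypothetically) an approximately linear link between input size and duration for one block version; the record values are invented:

```python
# Fit a per-block duration model from provenance records.
# Field names and data are illustrative assumptions.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b (one predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

# Provenance records for one block version: (input_size_kb, duration_s)
records = [(100, 12.0), (200, 21.5), (400, 41.0), (800, 80.5)]
a, b = fit_linear([r[0] for r in records], [r[1] for r in records])

def predict_duration(input_size_kb):
    return a * input_size_kb + b
```

Real block behaviour may be non-linear or flat, which is why the talk fits a separate model per block rather than one global form.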
– More data increases execution time
– Each block has different characteristics, so a model is needed for each block
– The configuration of the block instance can change behavior
– Machine load, network traffic, hardware variations, …
A workflow is a connected pathway of blocks…
– In response to more data
– When more blocks are added
– Pick based on best error
– Mechanisms to return estimate if no models are available
– Collect data "non-invasively"
– Different blocks display different characteristics
– Different algorithms and versions
– New blocks being added
– Block behaviour only becomes apparent as data is collected
– Data sizes
– Code versions
– Algorithm settings
– Block start and end times
– Number of concurrent workflows
– CPU / memory usage
[Architecture diagram: e-SC DB (Postgres / MySQL / SQL Server), Provenance Store (Postgres / Neo4j), e-SC Blob Store (Filesystem / S3 / Postgres / Azure Blob Store / HDFS), New e-SC Blob Store, Archive (Filesystem / AWS Glacier); Migration, Provenance, Archive and Workflow queues; Workflow Engines with Service/Lib Cache and Control Topic; core services: Security (OpenID / Shibboleth / External Auth), User MGMT, Processing, Provenance/Audit, Storage, SWORD Tooling; External API (REST / HTTP) and Private API (REST / RMI)]
[Pipeline: provenance and performance data capture → data / model storage → model building / updating → models]
data point: ID, Version (identifying data); Setting_1, Setting_2, Memory Use, Input_size; Duration, Output_size
– The same records feed the data for multiple models (Model X data, Model Y data)
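The per-execution record could be represented as below; the field names follow the slide, but the exact schema, types, and units are assumptions:

```python
# Sketch of one captured provenance/performance data point.
# Schema and units are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class BlockDataPoint:
    block_id: str          # identifying data
    version: str
    setting_1: float       # block configuration
    setting_2: float
    memory_use_mb: float
    input_size_kb: float
    duration_s: float      # targets for the duration / output-size models
    output_size_kb: float

point = BlockDataPoint("CSVImport", "1.2", 0.5, 1.0, 256.0, 1024.0, 8.3, 512.0)
```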
Blocks may exhibit very different behaviors depending on their implementation details / configuration:
– No relationship
– Linear relationship
– Non-linear relationship
We do not know in advance what the best model will be
– Gathering more data may change our view
– Models can be rebuilt and replaced on the fly
– Prediction accuracy may improve as more data arrives
– Rebuild them all when new data arrives
– Use the best one until the next update
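The "rebuild all, keep the best" idea can be sketched as follows. The candidate set (a mean model and a linear model) and the RMSE selection criterion are illustrative assumptions, not the talk's exact model families:

```python
# Refit every candidate model type for a block when new data arrives,
# then keep whichever has the lowest error until the next update.

def rmse(model, data):
    return (sum((model(x) - y) ** 2 for x, y in data) / len(data)) ** 0.5

def mean_model(data):
    m = sum(y for _, y in data) / len(data)
    return lambda x: m

def linear_model(data):
    # ordinary least squares for y = a*x + b
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    a = sxy / sxx if sxx else 0.0
    b = my - a * mx
    return lambda x: a * x + b

def best_model(data):
    candidates = {"mean": mean_model(data), "linear": linear_model(data)}
    return min(candidates.items(), key=lambda kv: rmse(kv[1], data))

data = [(1, 2.0), (2, 4.1), (3, 6.0), (4, 8.1)]
name, model = best_model(data)
```

Because model building is asynchronous, the currently selected model keeps serving predictions while replacements are fitted.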
Quantitative Structure Activity Relationship (QSAR)
Activity ≈ f(structure)
– Still want to return something
– Use version-agnostic model
– Use average execution time of block
– Use average execution time of all blocks
An estimate can always be returned as long as a single block of any type has executed
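The fallback hierarchy can be sketched as a simple chain of lookups; the store layout (a dict keyed by block and version, with `None` marking a version-agnostic model) is an assumption for illustration:

```python
# Fallback chain: version-specific model -> version-agnostic model
# -> block's mean duration -> mean over all blocks.

def estimate(block_id, version, models, block_means, global_mean, input_size):
    model = models.get((block_id, version)) or models.get((block_id, None))
    if model is not None:
        return model(input_size)
    if block_id in block_means:
        return block_means[block_id]
    return global_mean  # available once any block has ever executed

models = {("Classify", "2.0"): lambda size: 0.1 * size + 2.0}
block_means = {"Classify": 50.0, "Import": 5.0}
global_mean = 30.0

r1 = estimate("Classify", "2.0", models, block_means, global_mean, 100)  # versioned model
r2 = estimate("Classify", "1.0", models, block_means, global_mean, 100)  # block mean
r3 = estimate("Export", "1.0", models, block_means, global_mean, 100)    # global mean
```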
Wrist-worn accelerometers
– Typically 100 Hz
– Worn for 2 weeks
– Analyse sleep patterns, general activity levels etc.
– Clinicians view results and modify exercise regime
– Collections of 100k data sets (24 TB)
[Plots: output size model, predicted vs actual (KB); duration model, predicted vs actual (seconds); prediction, fitted and ideal lines shown]
Physical Activity Classification (PAC1)
[Plots: duration model, predicted vs actual (seconds), RMSE=34.670, r²=0.987; output size model, predicted vs actual (KB)]
GGIR GENEActiv processing
[Plot: duration model, predicted vs actual (seconds); prediction, fitted and ideal lines shown]
Modelling is complicated by connected nature of workflow
For a single block, all the data needed for the model is readily available… this is not the case here
How big are the intermediate data transfers?
size = f(input-size, block-code, block-settings, random-factors)
– More data increases execution time
– Each block has different characteristics, so a model is needed for each block
– The configuration of the block instance can change behavior
– Machine load, network traffic, hardware variations, phase of the moon…
Execution time = Sum(block predictions)
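For a linear pipeline, this composition can be sketched by feeding each block's predicted output size into the next block's models and summing the predicted durations. The per-block coefficients below are invented for illustration:

```python
# Each block: (duration_model, output_size_model), both functions of
# input size in KB. Coefficients are illustrative assumptions.
pipeline = [
    (lambda s: 0.02 * s, lambda s: 2.0 * s),   # e.g. decompress
    (lambda s: 0.05 * s, lambda s: 0.5 * s),   # e.g. classify
    (lambda s: 0.01 * s, lambda s: 0.1 * s),   # e.g. summarise
]

def predict_workflow(input_size_kb, pipeline):
    total, size = 0.0, input_size_kb
    for duration_model, size_model in pipeline:
        total += duration_model(size)
        size = size_model(size)   # propagate the *predicted* size downstream
    return total

total_s = predict_workflow(100.0, pipeline)
```

Note that every block after the first sees a predicted size, not a measured one, which is where the error-propagation problem discussed next comes from.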
[Plot: whole-workflow prediction, predicted vs actual (seconds); training RMSE=5.008, r²=0.980; testing RMSE=4.698, r²=0.981]
Chemical property modelling
Propagating size predictions from block to block
– Propagation of errors
– Rapidly fails for larger workflows
– More data collection
– Model groups of blocks
– Build models of whole workflows
– Provenance data can be used to build block performance models
– Asynchronous collection of data and model building
– Block models can be combined to estimate workflow execution time
– Large workflows / poor block models are issues