Prediction of workflow execution time using provenance traces

slide-1
SLIDE 1

Prediction of workflow execution time using provenance traces: practical applications in medical data processing

Hugo Hiden Simon Woodman Paul Watson

slide-2
SLIDE 2

How long will my program take to run?

slide-3
SLIDE 3

Part of a bigger picture

  • How long will my program take to run?
  • What version of the program ran?
  • How was a result generated?
  • Can I repeat my results?
  • What are the implications of errors?

slide-4
SLIDE 4

Provenance Research

  • Used to answer these questions
  • Important in scientific research
  • Lots of work done to capture and represent provenance

  • Active research area

– Representations include OPM and PROV

slide-5
SLIDE 5

e-Science Central

  • Source of all our provenance data

– Platform used for many projects

  • Repository of code and data

– Users can add their own code

  • Well instrumented and understood

– Used to collect OPM
– Now PROV

  • Plenty of data sets

– Diverse projects
– Large applications

  • Workflows for data processing
slide-6
SLIDE 6

The workflow model

  • Simple workflow implementation

– Acyclic directed graph
– Composed of connected “Blocks”
– Deploys at reasonable scale in clouds

slide-7
SLIDE 7

Modelling performance

  • Execution time for a single block

– Workflow is some combination of individual block models

  • There should be some predictors:

– The input data sizes
– The configuration of the block
– The machine it is running on

  • The issues are:

– What types of model are most appropriate?
– How accurate are they?

slide-8
SLIDE 8

Execution time of a block

time = f(input-size, block-code, block-settings, random-factors)

– input-size: More data increases execution time
– block-code: Each block has different characteristics, so a model is needed for each block
– block-settings: The configuration of the block instance can change behavior
– random-factors: Machine load, network traffic, hardware variations, …

A workflow is a connected pathway of blocks…
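A minimal sketch of what a single block model could look like, assuming a plain least-squares linear fit over the input size and one numeric block setting. The function names, the choice of predictors and the observations are all hypothetical; the slides do not say which model types were actually used.

```python
import numpy as np

def fit_block_model(features, durations):
    """Least-squares fit of duration = w . features + intercept for one block."""
    # Append a column of ones so the last coefficient acts as the intercept.
    X = np.column_stack([np.asarray(features, dtype=float), np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(durations, dtype=float), rcond=None)
    return coef

def predict_duration(coef, feature_row):
    """Predict the execution time (seconds) for one (input_size, setting) row."""
    return float(np.dot(coef[:-1], feature_row) + coef[-1])

# Hypothetical observations for one block: (input_size, setting) pairs.
observations = [(100, 1), (200, 1), (400, 2), (800, 2)]
# Synthetic durations generated from a known linear rule, for illustration only.
durations = [0.01 * size + 2.0 * setting + 0.5 for size, setting in observations]

block_model = fit_block_model(observations, durations)
estimate = predict_duration(block_model, (1000, 2))
```

The random factors in the formula above are not modelled here; in this sketch they would simply show up as residual error around the fitted line.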

slide-9
SLIDE 9

Requirements for a “real” system

  • Proactively build models

– In response to more data
– When more blocks are added

  • Select the most appropriate model

– Pick based on best error

  • Aim to always return some estimate

– Mechanisms to return estimate if no models are available

slide-10
SLIDE 10

Complications

  • Gathering data

– Collect data “non-invasively”

  • Model types

– Different blocks display different characteristics
– Different algorithms and versions

  • Dynamic environment

– New blocks being added
– Block behaviour only becomes apparent as data is collected

slide-11
SLIDE 11

Data collected via provenance

  • Provenance collection already captures:

– Data sizes
– Code versions
– Algorithm settings

  • Extra instrumentation for:

– Block start and end times
– Number of concurrent workflows
– CPU / Memory usage

slide-12
SLIDE 12

e-SC Architecture

[Architecture diagram. Labels recovered from the figure:]

  • Provenance Store: Postgres, Neo4j
  • e-SC DB: Postgres, MySQL, SQL Server
  • e-SC Blob Store: Filesystem, S3, Postgres, Azure Blob Store, HDFS
  • Archive: Filesystem, AWS Glacier
  • New e-SC Blob Store
  • Migration Queue, Provenance Queue, Archive Queues, Workflow Queue, Control Topic
  • Workflow Engines
  • Service/Lib Cache
  • External Auth: OpenID, Shibboleth
  • Private API: REST, RMI
  • External API: REST, HTTP
  • Security: ACL, Authentication
  • User MGMT: Friends, Groups, Projects, Quotas
  • Processing: Services, Workflows, Libs
  • Provenance/Audit: Capture, Query/Search, Presentation
  • Storage: Versioning, Archiving
  • SWORD
  • Tooling: Maven Plugins, File uploader, Domain specific apps/websites

slide-13
SLIDE 13

Data capture architecture

[Diagram: the e-SC architecture (workflow engines, workflow queue, control topic, service/lib cache, security, user management, processing, provenance/audit, storage, tooling, external and private APIs) annotated with the following labels:]

  • Provenance and performance data capture
  • Data / model storage
  • Model building / updating
  • Data
  • Models

slide-14
SLIDE 14

Data collected

  • Each execution of a block creates a single data point:

– Identifying data: ID, Version
– Model X data: Setting_1, Setting_2, Memory Use, Input_size
– Model Y data: Duration, Output_size
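One way to picture such a data point is as a small record that can be split into identifying data, model inputs (X) and model outputs (Y). This is a sketch only: the class name, field types and example values are assumptions, and only the field names come from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockExecution:
    """One data point: a single execution of a block (hypothetical layout)."""
    block_id: str
    version: int
    setting_1: float
    setting_2: float
    memory_use: float
    input_size: float
    duration: float
    output_size: float

    def key(self):
        # Identifying data: which block and version produced this point.
        return (self.block_id, self.version)

    def x(self):
        # Model X data: the predictors.
        return [self.setting_1, self.setting_2, self.memory_use, self.input_size]

    def y(self):
        # Model Y data: the quantities to be predicted.
        return [self.duration, self.output_size]

def group_by_block(points):
    """Collect data points per (block, version) so each group can train its own model."""
    groups = {}
    for point in points:
        groups.setdefault(point.key(), []).append(point)
    return groups

# Illustrative values, not measurements from the talk.
points = [
    BlockExecution("pca", 2, 1.0, 0.0, 512.0, 1000.0, 12.5, 200.0),
    BlockExecution("pca", 2, 1.0, 0.0, 640.0, 2000.0, 24.0, 400.0),
    BlockExecution("csv-import", 1, 0.0, 0.0, 128.0, 1000.0, 3.0, 1000.0),
]
groups = group_by_block(points)
```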

slide-15
SLIDE 15

Block models

Blocks may exhibit very different behaviors depending on their implementation details / configuration

[Figure: three plots of execution time, each showing observed execution data:]

– No relationship
– Linear relationship
– Non-linear relationship

slide-16
SLIDE 16

Selecting the most appropriate model

slide-17
SLIDE 17

Selecting the most appropriate model

slide-18
SLIDE 18

Selecting the most appropriate model

slide-19
SLIDE 19

Selecting the most appropriate model

slide-20
SLIDE 20

Dynamic model updating

  • Impossible (difficult) to know what the best model will be

– Gathering more data may change our view

  • Need to implement model updating

– Models can be rebuilt and replaced on the fly

  • Return best available estimate at a given time

– This estimate may improve as models are rebuilt

slide-21
SLIDE 21

“Panel of experts” pattern

  • Maintain a suite of different models

– Rebuild them all when new data arrives
– Use the best one until the next update

  • Drug modelling project:

– Quantitative Structure Activity Relationship (QSAR)
– Activity ≈ f(chemical structure)
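The "panel of experts" idea can be sketched as follows: every candidate model is refitted when new data arrives, each is scored on held-out points, and the one with the lowest error is used until the next rebuild. The polynomial candidates, the every-fourth-point hold-out and the synthetic data are assumptions made for illustration; the slides do not list the actual model types or the scoring scheme.

```python
import numpy as np

def make_poly_expert(degree):
    """Return a builder that fits a polynomial of the given degree."""
    def build(x, y):
        coeffs = np.polyfit(x, y, deg=degree)
        return lambda new_x: np.polyval(coeffs, new_x)
    return build

# Hypothetical panel: names and model families are illustrative only.
PANEL = {
    "constant": make_poly_expert(0),
    "linear": make_poly_expert(1),
    "quadratic": make_poly_expert(2),
}

def rebuild_panel(x, y):
    """Refit every expert and return (name, model, rmse) for the best one."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    held_out = np.arange(len(x)) % 4 == 3  # every fourth point is kept for scoring
    best = None
    for name, build in PANEL.items():
        model = build(x[~held_out], y[~held_out])
        error = float(np.sqrt(np.mean((model(x[held_out]) - y[held_out]) ** 2)))
        if best is None or error < best[2]:
            best = (name, model, error)
    return best

# Synthetic block behaviour: duration grows quadratically with input size.
sizes = np.arange(1, 21)
durations = 0.5 * sizes ** 2 + 3.0
best_name, best_model, best_error = rebuild_panel(sizes, durations)
```

Calling `rebuild_panel` again after each batch of new data points gives the rebuild-and-replace behaviour described above; the caller only ever holds the current best model.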

slide-22
SLIDE 22

Model fallbacks

  • What happens if there is no model?

– Still want to return something

  • We used the following logic:

– Use version agnostic model
– Use average execution time of block
– Use average execution time of all blocks

  • This will always return some prediction as long as a single block of any type has executed
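The fallback logic above can be sketched as a simple chain of lookups. The function signature and the dictionaries used to hold models and history are hypothetical, and the first step (trying a version-specific model before the version agnostic one) is an assumption implied by, but not stated on, the slide.

```python
from statistics import mean

def estimate_duration(block_id, version, input_size,
                      versioned_models, agnostic_models, history):
    """Return the best available duration estimate (seconds).

    versioned_models: (block_id, version) -> callable(input_size)
    agnostic_models:  block_id -> callable(input_size)
    history:          block_id -> list of observed durations
    """
    # Assumed first choice: a model built for this exact block version.
    model = versioned_models.get((block_id, version))
    if model is not None:
        return model(input_size)
    # Fallback 1: version agnostic model for the block.
    model = agnostic_models.get(block_id)
    if model is not None:
        return model(input_size)
    # Fallback 2: average execution time of this block.
    if history.get(block_id):
        return mean(history[block_id])
    # Fallback 3: average execution time of all blocks.
    all_durations = [d for durations in history.values() for d in durations]
    if all_durations:
        return mean(all_durations)
    raise LookupError("no block of any type has executed yet")

# Illustrative data only.
history = {"csv-import": [2.0, 4.0], "pca": [10.0]}
versioned = {("pca", 2): lambda size: 0.5 * size}
agnostic = {"pca": lambda size: 1.0 * size}

exact = estimate_duration("pca", 2, 10, versioned, agnostic, history)
other_version = estimate_duration("pca", 1, 10, versioned, agnostic, history)
block_average = estimate_duration("csv-import", 1, 10, versioned, agnostic, history)
global_average = estimate_duration("new-block", 1, 10, versioned, agnostic, history)
```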

slide-23
SLIDE 23

Medical data processing

  • Measure acceleration in 3 axes

– Typically 100Hz
– Worn for 2 weeks
– Analyse sleep patterns, general activity levels etc.

  • Data collected and analysed

– Clinicians view results and modify exercise regime
– Collections of 100k data sets (24TB)

Wrist worn accelerometers

slide-24
SLIDE 24

Results

Physical Activity Classification (PAC1)

[Figure, two panels, each plotting predicted against actual values with Prediction, Fitted and Ideal series:]

– Output size model: Predicted (KB) vs Actual (KB)
– Duration model: Predicted (seconds) vs Actual (seconds)

slide-25
SLIDE 25

Results

GGIR GENEActiv processing

[Figure, two panels, each plotting predicted against actual values with Prediction, Fitted and Ideal series:]

– Output size model: Predicted (KB) vs Actual (KB)
– Duration model: Predicted (seconds) vs Actual (seconds); Prediction RMSE=34.670, r2=0.987
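For reference, error figures of this kind can be computed as in the sketch below. It assumes RMSE is the root mean squared error and that r2 is the coefficient of determination (1 minus residual over total sum of squares); the slides do not state which r2 definition was used, and the numbers in the example are made up.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total (assumed definition)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Illustrative values only: every prediction is one second too high.
actual_seconds = [1.0, 2.0, 3.0, 4.0]
predicted_seconds = [2.0, 3.0, 4.0, 5.0]

error = rmse(actual_seconds, predicted_seconds)
fit = r_squared(actual_seconds, predicted_seconds)
```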

slide-26
SLIDE 26

Not always successful

[Figure: Predicted (seconds) vs Actual (seconds), with Prediction, Fitted and Ideal series]

slide-27
SLIDE 27

Predicting Workflow duration

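Modelling is complicated by the connected nature of the workflow

  • For a single block, all the data needed by the model is readily available
  • This is not the case for a workflow: how big are the intermediate data transfers?

[Diagram: workflow graph annotated with "?" marks]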

slide-28
SLIDE 28

Data volume produced by a block

size = f(input-size, block-code, block-settings, random-factors)

– input-size: More data increases execution time
– block-code: Each block has different characteristics, so a model is needed for each block
– block-settings: The configuration of the block instance can change behavior
– random-factors: Machine load, network traffic, hardware variations, phase of moon

slide-29
SLIDE 29

Modelling total execution time

Execution time = Sum(block predictions)
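A sketch of how the sum could be evaluated when intermediate data sizes are unknown: walk the workflow graph in dependency order, feed each block the predicted output sizes of its upstream blocks, and add up the predicted durations. The per-block models, the rule that a block's input size is the sum of its predecessors' predicted outputs, and the three-block workflow are all assumptions for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def predict_workflow_duration(blocks, upstream, source_size):
    """Sum per-block duration predictions, propagating predicted sizes.

    blocks:   name -> (duration_model, size_model), both callable(input_size)
    upstream: name -> set of predecessor block names (acyclic)
    """
    predicted_output = {}
    total = 0.0
    # static_order() yields each block only after all of its predecessors.
    for name in TopologicalSorter(upstream).static_order():
        predecessors = upstream.get(name, ())
        if predecessors:
            # Assumption: input size is the sum of the upstream predicted outputs.
            input_size = sum(predicted_output[p] for p in predecessors)
        else:
            input_size = source_size
        duration_model, size_model = blocks[name]
        total += duration_model(input_size)
        predicted_output[name] = size_model(input_size)
    return total

# Hypothetical three-block chain with made-up models.
blocks = {
    "import": (lambda s: 0.01 * s, lambda s: s),
    "clean": (lambda s: 0.02 * s, lambda s: 0.5 * s),
    "model": (lambda s: 0.1 * s, lambda s: 10.0),
}
upstream = {"import": set(), "clean": {"import"}, "model": {"clean"}}

total_seconds = predict_workflow_duration(blocks, upstream, source_size=1000)
```

Because each block's input is itself a prediction, any error in an upstream size model feeds into every downstream duration estimate, which is why errors can accumulate along long chains of blocks.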

slide-30
SLIDE 30

Results

Chemical property modelling

  • Models built for each individual block
  • Prediction generated by propagating size predictions

[Figure: Predicted (seconds) vs Actual (seconds), with Fitted Training and Ideal lines. Training Prediction: RMSE=5.008, r2=0.980. Testing Prediction: RMSE=4.698, r2=0.981.]

slide-31
SLIDE 31

Modelling workflows: caveats

  • Much harder to model workflow duration

– Propagation of errors

  • Works for simple workflows

– Rapidly fails for larger workflows

  • Possible solutions

– More data collection
– Model groups of blocks
– Build models of whole workflows

slide-32
SLIDE 32

Conclusions

  • Extended provenance capture to build predictive models

– Asynchronous collection of data and model building

  • Demonstrated it is possible to model block execution time
  • Showed it may be possible to combine predictions to estimate workflow execution time

– Large workflows / poor block models are issues