Rapid Prototyping of Composable Concurrent Workflows using Typed Templates

SLIDE 1

Rapid Prototyping of Composable Concurrent Workflows using Typed Templates

Albert Schimpf

wiki.scraper.server1.link

Technische Universität Kaiserslautern (TUK), Kyoto University

4 February 2020

Schimpf (TUK, Kyoto University) Rapid Prototyping of Workflows

4 February 2020

1 / 16

SLIDE 6

Problem Boundary

Task is ...

◮ Resource-intensive (proxies, I/O bound)
◮ Resume-able, long-running
◮ Flexible stream of fresh data
◮ Easily modifiable structure

Applications

◮ IoT devices (e.g. fridge monitor [1])
◮ Archiving/monitoring websites (e.g. archiving news threads [2])
◮ Stream processing (extract/transform data [3])

[1] https://git.server1.link/scraper/scraper/wikis/examples/fridge
[2] https://git.server1.link/scraper/jobs/hackernews-json
[3] https://git.server1.link/scraper/jobs/extract-vcards

SLIDE 10

Informal Use Case Specification

BEGIN:VCARD .... END:VCARD

Data: entire HDD
Matches can be processed independently
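
The use case above can be sketched in a few lines of Java. This is only an illustration under assumptions: the class `VCardScan` and its method `extract` are hypothetical names, and the sketch assumes a chunk of the disk has already been read into a string. It is not taken from the framework presented in these slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the presented framework.
final class VCardScan {
    // DOTALL lets '.' span line breaks; the lazy '.*?' stops at the first END:VCARD.
    private static final Pattern VCARD =
            Pattern.compile("BEGIN:VCARD.*?END:VCARD", Pattern.DOTALL);

    // Returns every vCard block found in one chunk of raw data, in order of appearance.
    static List<String> extract(String chunk) {
        List<String> found = new ArrayList<>();
        Matcher m = VCARD.matcher(chunk);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Each returned string is self-contained, which is what allows the matches to be processed independently of one another.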

SLIDE 13

Possible Approaches

One program for each task (Java)

◮ too much effort, fragile
◮ code duplication
◮ explicit concurrency handling
  ⋆ Java Stream interface not suited for task

Reuse functionality, abstract and share code

◮ modifications of sub-routines affected other programs
◮ mixed control-flow and data-flow hard to reason about
◮ language focused on control-flow less suited for data-flow problems

Use tools that OS provides (e.g. pipes)

◮ grep -aoz "BEGIN:VCARD.*?END:VCARD" /dev/sda1 | ...
◮ Sequential specification unwieldy

SLIDE 15

Functional Nodes

Functional nodes, what about...

◮ ... connecting them (graph structure)?
◮ ... how data is passed around (API)?
◮ ... concurrent access?
◮ ... configuration?
◮ ... complex control-flow, data-parallelism?

Use specifications instead of programming in Java (DSL)

SLIDE 20

Requirements

Reusable & adaptable nodes

Separation of business logic and program logic

Quasi-static graph-like specification

Reliability

◮ Guarantee concurrent access and processing of data at any time without crashing

Robustness

◮ Errors only happen during initialization of the specification
◮ After initialization, errors are guaranteed to be of business-logic nature

SLIDE 21

Flows - Arrows & Nodes

Simple Graph Nodes ...

◮ ... implement single unit of work
◮ ... forward data to another node

No forward target denotes end with some result data
Where is the process, how is work done?

SLIDE 22

Flows - Arrows & Nodes

Flows (implicit)

Initial (empty) flow map
Fi (flow map) accepted by first node
Result data for input Fi is F1

SLIDE 23

Flows - Arrows & Nodes

Dependent & Dispatched Flows

dispatch node creates a new flow
new flow is independent
F2 does not depend on F1
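
One possible reading of these terms, sketched in Java. Everything here is an assumption made for illustration: `FlowSketch`, `node`, `forward` and `dispatch` are hypothetical names, a flow map is modelled as a plain `Map<String, Object>`, and the actual framework API and semantics may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.UnaryOperator;

// Hypothetical model of the slide's terms; not the framework's actual API.
final class FlowSketch {
    // A node does one unit of work: it copies the incoming flow map and adds one entry.
    static UnaryOperator<Map<String, Object>> node(String key, Object value) {
        return in -> {
            Map<String, Object> out = new HashMap<>(in);
            out.put(key, value);
            return out;
        };
    }

    // Dependent forward: the caller waits, and the target's result becomes its result.
    static Map<String, Object> forward(Map<String, Object> flow,
                                       UnaryOperator<Map<String, Object>> target) {
        return target.apply(flow);
    }

    // Dispatch: a new flow starts from a copy and runs asynchronously;
    // the flow map of the dispatching flow is left untouched.
    static CompletableFuture<Map<String, Object>> dispatch(Map<String, Object> flow,
                                                           UnaryOperator<Map<String, Object>> target) {
        Map<String, Object> copy = new HashMap<>(flow);
        return CompletableFuture.supplyAsync(() -> target.apply(copy));
    }
}
```

In this sketch, `forward(fi, node("a", 1))` returns F1 for the input Fi, while `dispatch(f1, node("b", 2))` produces a separate F2 whose entries never appear in F1.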

SLIDE 24

Nodes - Typed Configuration

  - type: RegexNode
    regex: "(BEGIN:VCARD.*?END:VCARD)"
    content: "{output-chunk}"
    groups:
      content: 1
    collect: false
    streamTarget: vcard-match

Every key of the configuration is typed (implementation, program logic)
Access to flow map content via templates (e.g. {output-chunk})
Format of configuration is not important (declarative, either JSON or YAML for now)
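
To illustrate how a template such as {output-chunk} could be resolved against a flow map with a type check, here is a minimal Java sketch. `TemplateSketch` and `eval` are hypothetical names, and the `{key}` syntax handled here is only the simple form shown on the slide; the framework's real template language and type system are likely richer.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical template evaluator for illustration; not the framework's implementation.
final class TemplateSketch {
    // Matches {key}, capturing the key name without the braces.
    private static final Pattern KEY = Pattern.compile("\\{([^{}]+)\\}");

    // Replaces each {key} with the flow map value. Throws if a key is missing
    // or its value is not an instance of the type the node expects.
    static String eval(String template, Map<String, Object> flow, Class<?> expected) {
        Matcher m = KEY.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            Object value = flow.get(m.group(1));
            if (!expected.isInstance(value)) {
                throw new IllegalArgumentException(
                        "key '" + m.group(1) + "' is not a " + expected.getSimpleName());
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(value.toString()));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

With a flow map holding a string under `output-chunk`, `eval("{output-chunk}", flow, String.class)` returns that string, and a missing or wrongly typed key is rejected with an exception instead of being passed on to the node.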

SLIDE 25

Flows

With arrows (dependent, dispatched, multiple), all types of data-parallelism are possible

◮ Fork, ForkJoin
◮ Map, MapJoin
◮ IfThenElse
◮ Pipe
◮ Retry
◮ Application-specific flow routing

Concurrency is handled by the system with predefined semantics
Concurrency can be visualized in a graph
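
As an illustration of one pattern from the list, a MapJoin can be sketched with `CompletableFuture` from the Java standard library. The class `MapJoinSketch` and the method `mapJoin` are hypothetical; this shows the general idea of starting one independent task per element and joining the results, not how the framework implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical illustration of a map-join; not the framework's implementation.
final class MapJoinSketch {
    static <A, B> List<B> mapJoin(List<A> items, Function<A, B> work) {
        // Map: start one independent asynchronous task per element.
        List<CompletableFuture<B>> running = new ArrayList<>();
        for (A item : items) {
            running.add(CompletableFuture.supplyAsync(() -> work.apply(item)));
        }
        // Join: wait for every task and collect the results in input order.
        List<B> results = new ArrayList<>();
        for (CompletableFuture<B> task : running) {
            results.add(task.join());
        }
        return results;
    }
}
```

The caller never touches threads or locks directly, which mirrors the point above that concurrency is handled by the system rather than by the author of the specification.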

SLIDE 26

Evaluation

Flow graph model with flows, nodes, and arrows
Operational semantics with type safety (Master’s thesis result)
Java framework based on operational semantics

SLIDE 29

Evaluation - Recover VCards

Pipes

sudo grep -a -A 20 BEGIN:VCARD /dev/sda | wc -l

Time: 9 minutes 3 seconds
Max Memory: 1 GB
Average memory usage: 4%

Workflow

Time: 9 minutes 34 seconds
Max Memory: 2.5 GB
Average memory usage: 1.5%

SLIDE 30

Insights and Future Work

Flow graphs are robust, flexible, and adaptable

◮ Nodes don’t crash
◮ If configuration is well-typed, flows are guaranteed to finish
◮ Quasi-static graph makes reasoning about control-flow easy
◮ Business logic can be easily configured via templates without touching program logic
◮ Efficiency dependent on nodes, not the framework (mostly)

Further formalization and improvement of operational semantics

Static analysis

◮ Live variables
◮ Time
◮ ...

SLIDE 32

Data-parallelism: fork join

Multiple dependent arrows connected to other nodes
Control-flow not clear, only data-flow is depicted

◮ Remember: separation of control-flow and data-flow is forced

Arrows are annotated by their meaning

SLIDE 33

Flows are not arrows
Arrows show where the accessing flows are directed to

SLIDE 34

Data-parallelism: map

Crossed arrow: none or many flows are created

SLIDE 35

Data-parallelism: map join

SLIDE 36

Example: Data Aggregation

SLIDE 37

Related Work

Flow-based Programming (FBP)

◮ Nodes are black boxes with ports, send/receive data
◮ Not typed
◮ No frameworks focused on concurrency
  ⋆ JavaFBP, NoFlo, C#FBP, other domain specific flow-based languages...

Actor-based languages, active object languages

◮ Similar problems to FBP
◮ Processes can crash
◮ Message passing

Storm, Kafka, other distributed streaming frameworks

◮ Nodes are processes again
◮ Too complex, even with DSL (incomplete functionality)

Functional Language, Streams

◮ Don’t want to write a program for each task
◮ No separation of control-flow and data-flow is forced
