Formal Semantics for Composable Workflows for Scraper: Understanding Flows

SLIDE 1

Formal Semantics for Composable Workflows for Scraper

Understanding Flows

Albert Schimpf

wiki.scraper.server1.link

Technische Universität Kaiserslautern (TUK), Kyoto University

Schimpf (TUK, Kyoto University) Scraper Master’s Thesis 18/19 1 / 26

SLIDE 6

Content

Motivation & Observation
Flows
Motivation for an Operational Semantics
Scraper Language
Insights & Future Work

SLIDE 9

Problem Boundary

Task is ...

◮ Resource-intensive (proxies, I/O bound)
◮ Resumable, long-running
◮ Flexible stream of fresh data
◮ Easily modifiable structure

and not ...

◮ CPU-intensive
◮ User-interactive

SLIDE 12

Informal Use Case Specification

SLIDE 15

Possible Approaches

One program for each task (Java)

◮ too much effort
◮ fragile
◮ code duplication ...

Reuse functionality, abstract and share code

◮ modifications of sub-routines affected other programs
◮ mixed control-flow and data-flow hard to reason about
◮ language focused on control-flow less suited for data-flow problems

Encapsulate functions into nodes and connect them

→ nodes don’t crash, threads that access a node crash

SLIDE 17

Functional Nodes

Functional nodes, what about...

◮ ... connecting them (graph structure)?
◮ ... how data is passed around (API)?
◮ ... concurrent access?
◮ ... configuration?
◮ ... complex control-flow, data-parallelism?

Use specifications instead of programming in Java (DSL)

SLIDE 22

Summary: Requirements

Reusable & adaptable nodes
Separation of business logic and program logic
Quasi-static graph-like specification

Reliability

◮ Guarantee concurrent access and processing of data at any time without crashing

Robustness

◮ Errors only happen during initialization of the specification
◮ After initialization, errors are guaranteed to be of business-logic nature
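
The robustness requirement can be illustrated with a small sketch: every check that can fail for structural reasons runs once, when the specification is loaded, so that a running flow can only fail for business-logic reasons. The dictionary layout, the node type names, `validate_spec`, and `SpecError` are hypothetical names chosen for illustration; they are not Scraper's actual specification format.

```python
class SpecError(Exception):
    """Raised at initialization when a specification is structurally invalid."""


def validate_spec(spec: dict) -> None:
    """Check a hypothetical node graph before any flow is started.

    `spec` maps a node label to {"type": str, "forward": label or None}.
    """
    for label, node in spec.items():
        if "type" not in node:
            raise SpecError(f"node {label!r} has no type")
        target = node.get("forward")
        if target is not None and target not in spec:
            raise SpecError(f"node {label!r} forwards to unknown node {target!r}")


# A well-formed graph: every forward target is a known label.
good = {
    "fetch": {"type": "HttpRequest", "forward": "parse"},
    "parse": {"type": "HtmlParse", "forward": None},
}
validate_spec(good)  # passes silently

# A broken graph is rejected before any flow runs.
bad = {"fetch": {"type": "HttpRequest", "forward": "missing"}}
try:
    validate_spec(bad)
    rejected = False
except SpecError:
    rejected = True
```

Under this reading, a dangling forward target is an initialization error and never surfaces while flows are being processed.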

SLIDE 23

Flows - Arrows & Nodes

Simple Graph Nodes ...

◮ ... implement a single unit of work
◮ ... forward data to another node

No forward target denotes the end, with some result data
Where is the process? How is work done?

SLIDE 24

Flows - Arrows & Nodes

Flows (implicit)
Initial (empty) flow map
Fi (flow map) accepted by first node
Result data for input Fi is F1

SLIDE 25

Flows - Arrows & Nodes

Dependent & Dispatched Flows
dispatch node creates a new flow
new flow is independent
F2 does not depend on F1
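
One possible reading of dispatched flows, sketched in Python: the dispatch step starts a new flow on its own copy of the flow map, and the originating flow continues without waiting for it or seeing its result. That the new flow starts from a copy, the use of threads, and the names `enrich` and `dispatch` are assumptions made for this sketch, not details taken from the slides.

```python
import copy
import threading


def enrich(flow: dict) -> dict:
    """A node: consumes a flow map and returns a new one (no mutation)."""
    return {**flow, "enriched": True}


def dispatch(flow: dict, target, started: list) -> dict:
    """Start `target` on an independent copy of the flow map.

    The originating flow continues immediately with its own, unchanged map.
    """
    independent = copy.deepcopy(flow)
    worker = threading.Thread(target=target, args=(independent,))
    worker.start()
    started.append(worker)
    return flow


results = []   # collects what dispatched flows produce
started = []   # worker threads, kept only so the example can wait for them

f_initial = {"url": "https://example.org"}
f1 = enrich(f_initial)  # dependent step: F1 is the result for the initial map
f1_after = dispatch(f1, lambda f: results.append({**f, "job": "F2"}), started)

for worker in started:
    worker.join()
```

After the dispatch, `f1_after` carries no trace of the dispatched flow; its result lives only in `results`.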

SLIDE 26

Design Decisions

Data in flows are JSON maps
Nodes are functional map consumers
Nodes are configurable, configuration is typed
Nodes are addressable by label
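
These decisions can be pictured with a minimal Python sketch: a flow map is a plain dictionary, a node is a function from map to map, and nodes refer to each other by label. The `Node` class, the `run` helper, and the two example nodes are hypothetical; typed configuration is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FlowMap = dict  # a JSON-like map: string keys, JSON values


@dataclass(frozen=True)
class Node:
    label: str                          # nodes are addressable by label
    work: Callable[[FlowMap], FlowMap]  # functional map consumer
    forward: Optional[str] = None       # label of the next node, if any


def run(nodes: dict, start: str, flow: FlowMap) -> FlowMap:
    """Pass a flow map from node to node until no forward target is left."""
    label = start
    while label is not None:
        node = nodes[label]
        flow = node.work(dict(flow))  # each node receives its own copy
        label = node.forward
    return flow


nodes = {
    n.label: n
    for n in [
        Node("start", lambda f: {**f, "url": "https://example.org/page"}, "length"),
        Node("length", lambda f: {**f, "url_length": len(f["url"])}),
    ]
}

# An initial empty flow map enters at "start"; "length" has no forward
# target, so its output is the result data.
result = run(nodes, "start", {})
```

Because a node only sees a copy of the map and returns a new one, the same node can be shared by several flows without shared mutable state.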

SLIDE 28

Formalization - Considerations

Coordination languages
CCS/π-calculus and extensions
Concurrent object-oriented calculi
Petri-nets

Operational Semantics

◮ Goal: type safety

SLIDE 29

Scraper Syntax

SLIDE 30

Scraper Syntax

Control-flow defined explicitly
Functional nodes handled with mod
Process lookups to enable complex configurations
Processes are concatenation of nodes

SLIDE 31

Scraper Syntax

Terms used for configuration of nodes
JSON values and templates τ
Templates used to look up values in the current accessing map
Lookup element should match type

SLIDE 32

Scraper Syntax

Map (called FM store) binds terms to keys
Typing: JSON objects

SLIDE 33

Scraper Evaluation

SLIDE 34

Scraper Evaluation - I

Active, forked, and concurrent processes

SLIDE 35

Scraper Evaluation - II

Nested evaluation
Pull concurrent configurations out of forks

SLIDE 36

Scraper Evaluation - III

Function evaluation encapsulated
Process lookup inserts new nodes
DISP and FORK introduce concurrency
JOIN merges forked configurations
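
As an informal analogy for FORK and JOIN (not the formal evaluation rules of the thesis), the sketch below runs several branches concurrently on copies of the current flow map and then merges a chosen list of keys back into it. The function `fork_join`, its parameters, the two example branches, and the use of a thread pool are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fork_join(flow: dict, branches: list, join_keys: list) -> dict:
    """Run each branch on its own copy of `flow`, then merge selected keys.

    `branches` is a list of functions from flow map to flow map;
    `join_keys[i]` names the key copied back from the result of branch i.
    """
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch, dict(flow)) for branch in branches]
        forked = [future.result() for future in futures]

    joined = dict(flow)
    for result, key in zip(forked, join_keys):
        joined[key] = result[key]
    return joined


def count_words(flow: dict) -> dict:
    return {**flow, "words": len(flow["text"].split())}


def count_chars(flow: dict) -> dict:
    # "scratch" is written by this branch but never joined back.
    return {**flow, "chars": len(flow["text"]), "scratch": "not joined"}


out = fork_join(
    {"text": "flows are not arrows"},
    [count_words, count_chars],
    ["words", "chars"],
)
```

Only the keys named in the join list survive; anything else a branch wrote stays local to that branch.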

SLIDE 37

Scraper Functions

Template evaluation inside functional nodes
No templates inside map

SLIDE 38

Scraper Typing - Excerpt

Concurrent configurations typed with same environment
Process and store typed separately
Join stores old typing and joins with a list of keys

SLIDE 42

Scope: Language vs. Implementation

Stateful nodes

◮ Time

Exceptions

Data-parallelism

◮ Map
◮ Map join

Complex templates

◮ Template expressions
◮ Key lookup @|τ : String|
  ⋆ @|a| : Simple template
  ⋆ @@|a| : Look inside maps (UnpackMapNode)
◮ Array lookup |τ : List<T>|[τ : Integer]
◮ String concatenation τ : String + τ : String
◮ Simple value

Record types
Process table
More functionality
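
The template forms listed above (key lookup, array lookup, string concatenation, simple value) can be sketched as a small expression tree that is evaluated against a flow map, with a type check on every result. This is an illustrative Python model only: the class names and `evaluate` are hypothetical, the concrete `@|...|` syntax is not parsed, and the `@@` form is omitted.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class Value:   # simple value
    value: Any


@dataclass(frozen=True)
class Key:     # key lookup: the key is itself a template of type String
    key: "Template"


@dataclass(frozen=True)
class Index:   # array lookup: a list template indexed by an Integer template
    items: "Template"
    index: "Template"


@dataclass(frozen=True)
class Concat:  # string concatenation of two String templates
    left: "Template"
    right: "Template"


Template = Union[Value, Key, Index, Concat]


def evaluate(template: Template, flow: dict, expected: type) -> Any:
    """Evaluate `template` against the flow map and check the result type."""
    if isinstance(template, Value):
        result = template.value
    elif isinstance(template, Key):
        result = flow[evaluate(template.key, flow, str)]
    elif isinstance(template, Index):
        items = evaluate(template.items, flow, list)
        result = items[evaluate(template.index, flow, int)]
    elif isinstance(template, Concat):
        result = evaluate(template.left, flow, str) + evaluate(template.right, flow, str)
    else:
        raise TypeError(f"not a template: {template!r}")
    if not isinstance(result, expected):
        raise TypeError(f"expected {expected.__name__}, got {type(result).__name__}")
    return result


flow = {"host": "example.org", "paths": ["/a", "/b"], "pick": 1}

# "https://" + host + paths[pick], written as a template tree
url = Concat(
    Value("https://"),
    Concat(Key(Value("host")), Index(Key(Value("paths")), Key(Value("pick")))),
)
resolved = evaluate(url, flow, str)
```

Asking for the wrong type, for example `evaluate(Key(Value("pick")), flow, str)`, raises a `TypeError` in this sketch, which mirrors the idea that a looked-up element has to match the expected type.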

SLIDE 43

Insights

Typing nodes instead of data

Implementation is robust

◮ Nodes don’t crash
◮ If configuration is well-typed, flows are guaranteed to finish

Implementation is flexible and adaptable

◮ Quasi-static graph makes reasoning about control-flow easy
◮ Business logic can be easily configured via templates without touching program logic

SLIDE 44

Future Work

More functions in language
More data-parallelism (at least map, map join)
Time
Shared state
Scheduling
Java JSON specification → SL translation
Benchmarking and comparison
Scraper as a distributed system

SLIDE 45

Thank you for your attention.

SLIDE 46

Scraper Store Modification

SLIDE 47

Scraper Typing

SLIDE 48

Data-parallelism: fork join

Multiple dependent arrows connected to other nodes
Control-flow not clear, only data-flow is depicted

◮ Remember: separation of control-flow and data-flow is forced

Arrows are annotated by their meaning

SLIDE 49

Flows are not arrows
Arrows show where the accessing flows are directed to

SLIDE 50

Data-parallelism: map

Crossed arrow: none or many flows are created

SLIDE 51

Data-parallelism: map join

SLIDE 52

Example: Data Aggregation

SLIDE 53

Related Work

Flow-based Programming (FBP)

◮ Nodes are black boxes with ports, send/receive data
◮ Not typed
◮ No frameworks focused on concurrency
  ⋆ JavaFBP, NoFlo, C#FBP, other domain specific flow-based languages...

Actor-based languages, active object languages

◮ Similar problems to FBP
◮ Processes can crash
◮ Message passing

Storm, Kafka, other distributed streaming frameworks

◮ Nodes are processes again
◮ Too complex, even with DSL (incomplete functionality)

Functional Language, Streams

◮ Don’t want to write a program for each task
◮ No separation of control-flow and data-flow is forced