

SLIDE 1

Experiences with the Model-based Generation of Big Data Pipelines

Holger Eichelberger, Cui Qin, Klaus Schmid
{eichelberger, qin, schmid}@sse.uni-hildesheim.de

Software Systems Engineering, University of Hildesheim
www.sse.uni-hildesheim.de

3/6/2017

SLIDE 2

Motivation

  • Background: FP7 QualiMaster
– Configurable and adaptive data processing infrastructure
– Real-time financial risk analysis

  • Programming applications for Big Data frameworks is complex
  • Ideal: Focus on data processing, ignore technical complexity


  • Goal:
– Model-based approach to stream processing
– Hide complexity
– Ease development
– Generate complex parts of code
– Support self-adaptation


Experiences → Lessons learned


SLIDE 3

Model-based design

  • Basis: Concept analysis
– Fixed stream operators (e.g., Borealis, PIPES)
– User-defined operators / algorithms (e.g., Storm, Heron)
– Combinations (e.g., Spark, Flink)

  • Common concept: Data flow graph


  • Typically represented as program
  • Recent trend: DSL


[Figure: data flow graph: data source → data processors (P1, P2, P3) → data sink]
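The data-flow-graph concept can be sketched in a few lines; this is an illustrative model only (all class and field names are invented for this example), not the API of any of the frameworks named above:

```python
# A pipeline as a linear data flow graph: source -> processors -> sink.
class Node:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn  # transformation applied to each data item

class Pipeline:
    def __init__(self, source, processors, sink):
        self.source = source          # iterable producing data items
        self.processors = processors  # ordered list of Node
        self.sink = sink              # list collecting results

    def run(self):
        for item in self.source:
            for node in self.processors:
                item = node.fn(item)
            self.sink.append(item)

sink = []
Pipeline(source=[1, 2, 3],
         processors=[Node("P1", lambda x: x * 2),
                     Node("P2", lambda x: x + 1),
                     Node("P3", str)],
         sink=sink).run()
# sink is now ["3", "5", "7"]
```

Real frameworks distribute such graphs across workers; the point here is only the shared abstraction.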


SLIDE 4

Specific modeling concepts

[Figure: data processing pipeline: Source → P1 → P2 → P3 → Sink, with an algorithm family attached to a processor]


  • Domain restrictions
– Must be a valid data flow graph
– If Ps → Pe, Ps must provide types that Pe can process
– Interface compatibility between families and algorithms
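The Ps → Pe typing restriction can be checked mechanically over the model. An illustrative sketch (function name, edge layout, and the example types are invented, not the QualiMaster tooling):

```python
# Check the typing restriction: along every edge Ps -> Pe, the type that
# Ps produces must be one that Pe can process.
def validate_edges(edges, out_type, in_types):
    """edges: (producer, consumer) pairs; out_type: producer -> produced
    type; in_types: consumer -> set of accepted types."""
    errors = []
    for ps, pe in edges:
        if out_type[ps] not in in_types[pe]:
            errors.append(f"{ps} -> {pe}: {out_type[ps]} not accepted")
    return errors

edges = [("Source", "P1"), ("P1", "P2")]
out_type = {"Source": "Tweet", "P1": "Sentiment"}
in_types = {"P1": {"Tweet"}, "P2": {"Sentiment", "Score"}}
assert validate_edges(edges, out_type, in_types) == []  # valid pipeline
```

Running such checks at modeling time is what moves type errors from pipeline runtime to design time.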


[Figure: alternative family members: simple algorithm, hardware co-processor, sub-pipeline of P2.1 nodes]
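An algorithm family groups such interchangeable members behind one interface, so a member can be exchanged at runtime without touching the rest of the pipeline. A minimal sketch (the `Family` class and its `switch` method are invented for illustration):

```python
# An algorithm family: interchangeable members behind one interface.
class Family:
    def __init__(self, members, active):
        self.members = members  # name -> callable, all with one signature
        self.active = active

    def switch(self, name):
        if name not in self.members:
            raise KeyError(name)
        self.active = name  # runtime exchange; interface stays the same

    def process(self, item):
        return self.members[self.active](item)

fam = Family({"simple": lambda x: x + 1,
              "coprocessor": lambda x: x + 1},  # same contract, other path
             active="simple")
assert fam.process(41) == 42
fam.switch("coprocessor")
assert fam.process(41) == 42  # same result after switching
```

The interface-compatibility restriction above is exactly what guarantees that every member honors the same contract.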


SLIDE 5

Modeling support

  • Domain-specific modeling frontend
  • Underlying: own model-management framework



SLIDE 6

Code generation

  • Architecture
– Heterogeneous resource pool
– Intermediary layer extending Storm
– Management layer for runtime

[Figure: layered architecture: generated pipelines / applications, intermediary layer, stream processing framework (Apache Storm), reconfigurable hardware; management stack alongside]


  • Generation steps
– Family interfaces
– Data serialization support
– Integration of hardware co-processors
– Pipelines / sub-pipelines, switching
– Compile, integrate dependencies, package
  • Scale: 16 pipelines, ×7 code produced, ~880 MB deployable components
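The flavor of these generation steps can be shown with a toy template instantiation. A plain string template stands in for the real transformation language, and the model layout is invented for this sketch:

```python
from string import Template

# Toy stand-in for model-based code generation: instantiate a (Java-like)
# pipeline skeleton from a small pipeline model.
PIPELINE_TMPL = Template(
    "public class ${name}Pipeline {\n"
    "    // wired stages: ${stages}\n"
    "}\n")

def generate(model):
    """Render the pipeline skeleton for one modeled pipeline."""
    return PIPELINE_TMPL.substitute(
        name=model["name"],
        stages=" -> ".join(model["stages"]))

src = generate({"name": "Risk", "stages": ["Source", "P1", "Sink"]})
assert "public class RiskPipeline" in src
```

The real generator additionally emits serialization code, co-processor glue, and switching logic, which is where the ×7 code expansion comes from.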


SLIDE 7

Experiences and Lessons learned (1)

  • 7 data engineers from 3 groups, 6 large pipelines
  • Beginning of the project

– Sceptical about the model-based approach
– Initial version after some months
– Hands-on workshops
– Feedback:



  • Puzzled about type safety
  • First own generated pipelines helped
  • Change of focus: More on algorithms
  • Requests for new features, reports on buggy features
  • Confidence increased with improved versions (~1 year)


SLIDE 8

Experiences and Lessons learned (2)

  • Later phases

– Interfaces help to structure work
– Typing helps to avoid runtime errors
– "Magic" of generated code
  • serialization
  • parameters
  • algorithm switching
    – Complex structures due to additional nodes, communication
    – For sub-pipelines: manual / generated code perform the same
– Shields from complex coding
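The generated serialization support roughly corresponds to emitting symmetric write/read code per record type. A minimal hand-written stand-in (the field layout and record type are invented; real generated serializers are framework-specific):

```python
import struct

# Stand-in for generated serialization: symmetric pack/unpack for one
# record type (symbol, price).
def serialize(symbol: str, price: float) -> bytes:
    raw = symbol.encode("utf-8")
    # length-prefixed string followed by a float64, network byte order
    return struct.pack(f"!I{len(raw)}sd", len(raw), raw, price)

def deserialize(buf: bytes):
    (n,) = struct.unpack_from("!I", buf)
    raw, price = struct.unpack_from(f"!{n}sd", buf, 4)
    return raw.decode("utf-8"), price

assert deserialize(serialize("SAP", 91.5)) == ("SAP", 91.5)
```

Keeping both directions generated from one model is what makes this "magic" safe: the two sides cannot drift apart.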


SLIDE 9

Experiences and Lessons learned (3)

  • Center of integration → Higher workload
  • Supports evolution

– Consistent deployment of changes
– Algorithms must be evolved manually
– Errors are also deployed easily

  • Continuous integration


– Generation and algorithms
– Up-to-date pipelines are available
– Intensive tests increase overall build time → local debugging first

  • Effects

– Focus of work on algorithms
– Allows realization and evolution of complex structures
– Avoids runtime issues
– Stability increases confidence, requires higher quality assurance


SLIDE 10

Conclusions

  • Model-based approach for streaming Big Data applications

– Type-safe
– Heterogeneous data processing (hardware co-processors)
– Flexible exchange of algorithms

  • Code generation for Apache Storm
  • Approach pays off


– Positive feedback
– Requires training, modeling effort, effort for realization of transformation, maintenance and evolution

  • Future: Optimized code generation for self-adaptation

– Switching efficiency
– Multiple target platforms

Optimized resource usage is already reality!
