

SLIDE 1

Experiences with the Model-based Generation of Big Data Pipelines

Holger Eichelberger, Cui Qin, Klaus Schmid
{eichelberger, qin, schmid}@sse.uni-hildesheim.de

Software Systems Engineering, University of Hildesheim
www.sse.uni-hildesheim.de

3/6/2017

SLIDE 2

Motivation

  • Background: FP7 QualiMaster
– Configurable and adaptive data processing infrastructure
– Real-time financial risk analysis

  • Programming applications for Big Data frameworks is complex
  • Ideal: Focus on data processing, ignore technical complexity


  • Goal:
– Model-based approach to stream processing
– Hide complexity
– Ease development
– Generate complex parts of code
– Support self-adaptation


Experiences → Lessons learned


SLIDE 3

Model-based design

  • Basis: Concept analysis
– Fixed stream operators (e.g., Borealis, PIPES)
– User-defined operators / algorithms (e.g., Storm, Heron)
– Combinations (e.g., Spark, Flink)

  • Common concept: Data flow graph


  • Typically represented as program
  • Recent trend: DSL


[Figure: data flow graph: data source → data processors (P1, P2, P3) → data sink]
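The data-flow-graph concept can be sketched in a few lines; this is an illustrative model only (all class and field names are invented for this example), not the API of any of the frameworks named above:

```python
# A pipeline as a linear data flow graph: source -> processors -> sink.
class Node:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn  # transformation applied to each data item

class Pipeline:
    def __init__(self, source, processors, sink):
        self.source = source          # iterable producing data items
        self.processors = processors  # ordered list of Node
        self.sink = sink              # list collecting results

    def run(self):
        for item in self.source:
            for node in self.processors:
                item = node.fn(item)
            self.sink.append(item)

sink = []
Pipeline(source=[1, 2, 3],
         processors=[Node("P1", lambda x: x * 2),
                     Node("P2", lambda x: x + 1),
                     Node("P3", str)],
         sink=sink).run()
# sink is now ["3", "5", "7"]
```

Real frameworks distribute such graphs across workers; the point here is only the shared abstraction.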


SLIDE 4

Specific modeling concepts

[Figure: data processing pipeline: Source → P1 → P2 → P3 → Sink, with an algorithm family attached to a processor]


  • Domain restrictions
– Must be a valid data flow graph
– If Ps → Pe, Ps must provide types that Pe can process
– Interface compatibility between families and algorithms
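The Ps → Pe typing restriction can be checked mechanically over the model. An illustrative sketch (function name, edge layout, and the example types are invented, not the QualiMaster tooling):

```python
# Check the typing restriction: along every edge Ps -> Pe, the type that
# Ps produces must be one that Pe can process.
def validate_edges(edges, out_type, in_types):
    """edges: (producer, consumer) pairs; out_type: producer -> produced
    type; in_types: consumer -> set of accepted types."""
    errors = []
    for ps, pe in edges:
        if out_type[ps] not in in_types[pe]:
            errors.append(f"{ps} -> {pe}: {out_type[ps]} not accepted")
    return errors

edges = [("Source", "P1"), ("P1", "P2")]
out_type = {"Source": "Tweet", "P1": "Sentiment"}
in_types = {"P1": {"Tweet"}, "P2": {"Sentiment", "Score"}}
assert validate_edges(edges, out_type, in_types) == []  # valid pipeline
```

Running such checks at modeling time is what moves type errors from pipeline runtime to design time.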


[Figure: alternative family members: simple algorithm, hardware co-processor, sub-pipeline of P2.1 nodes]
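An algorithm family groups such interchangeable members behind one interface, so a member can be exchanged at runtime without touching the rest of the pipeline. A minimal sketch (the `Family` class and its `switch` method are invented for illustration):

```python
# An algorithm family: interchangeable members behind one interface.
class Family:
    def __init__(self, members, active):
        self.members = members  # name -> callable, all with one signature
        self.active = active

    def switch(self, name):
        if name not in self.members:
            raise KeyError(name)
        self.active = name  # runtime exchange; interface stays the same

    def process(self, item):
        return self.members[self.active](item)

fam = Family({"simple": lambda x: x + 1,
              "coprocessor": lambda x: x + 1},  # same contract, other path
             active="simple")
assert fam.process(41) == 42
fam.switch("coprocessor")
assert fam.process(41) == 42  # same result after switching
```

The interface-compatibility restriction above is exactly what guarantees that every member honors the same contract.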


SLIDE 5

Modeling support

  • Domain-specific modeling frontend
  • Underlying: own model-management framework



SLIDE 6

Code generation

  • Architecture
– Heterogeneous resource pool
– Intermediary layer extending Storm
– Management layer for runtime

[Figure: layered architecture: generated pipelines / applications, intermediary layer, stream processing framework (Apache Storm), reconfigurable hardware; management stack alongside]


  • Generation steps
– Family interfaces
– Data serialization support
– Integration of hardware co-processors
– Pipelines / sub-pipelines, switching
– Compile, integrate dependencies, package
  • Scale: 16 pipelines, ×7 code produced, ~880 MB deployable components
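The flavor of these generation steps can be shown with a toy template instantiation. A plain string template stands in for the real transformation language, and the model layout is invented for this sketch:

```python
from string import Template

# Toy stand-in for model-based code generation: instantiate a (Java-like)
# pipeline skeleton from a small pipeline model.
PIPELINE_TMPL = Template(
    "public class ${name}Pipeline {\n"
    "    // wired stages: ${stages}\n"
    "}\n")

def generate(model):
    """Render the pipeline skeleton for one modeled pipeline."""
    return PIPELINE_TMPL.substitute(
        name=model["name"],
        stages=" -> ".join(model["stages"]))

src = generate({"name": "Risk", "stages": ["Source", "P1", "Sink"]})
assert "public class RiskPipeline" in src
```

The real generator additionally emits serialization code, co-processor glue, and switching logic, which is where the ×7 code expansion comes from.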


SLIDE 7

Experiences and Lessons learned (1)

  • 7 data engineers from 3 groups, 6 large pipelines
  • Beginning of the project

– Sceptical about the model-based approach
– Initial version after some months
– Hands-on workshops
– Feedback:



  • Puzzled about type safety
  • First own generated pipelines helped
  • Change of focus: More on algorithms
  • Requests for new features, reports on buggy features
  • Confidence increased with improved versions (~1 year)


SLIDE 8

Experiences and Lessons learned (2)

  • Later phases

– Interfaces help to structure work
– Typing helps to avoid runtime errors
– "Magic" of generated code
  • serialization
  • parameters
  • algorithm switching
    – Complex structures due to additional nodes, communication
    – For sub-pipelines: manual / generated code perform the same
– Shields from complex coding
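The generated serialization support roughly corresponds to emitting symmetric write/read code per record type. A minimal hand-written stand-in (the field layout and record type are invented; real generated serializers are framework-specific):

```python
import struct

# Stand-in for generated serialization: symmetric pack/unpack for one
# record type (symbol, price).
def serialize(symbol: str, price: float) -> bytes:
    raw = symbol.encode("utf-8")
    # length-prefixed string followed by a float64, network byte order
    return struct.pack(f"!I{len(raw)}sd", len(raw), raw, price)

def deserialize(buf: bytes):
    (n,) = struct.unpack_from("!I", buf)
    raw, price = struct.unpack_from(f"!{n}sd", buf, 4)
    return raw.decode("utf-8"), price

assert deserialize(serialize("SAP", 91.5)) == ("SAP", 91.5)
```

Keeping both directions generated from one model is what makes this "magic" safe: the two sides cannot drift apart.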


SLIDE 9

Experiences and Lessons learned (3)

  • Center of integration → Higher workload
  • Supports evolution

– Consistent deployment of changes
– Algorithms must be evolved manually
– Errors are also deployed easily

  • Continuous integration


– Generation and algorithms
– Up-to-date pipelines are available
– Intensive tests increase overall build time → local debugging first

  • Effects

– Focus of work on algorithms
– Allows realization and evolution of complex structures
– Avoids runtime issues
– Stability increases confidence, requires higher quality assurance


SLIDE 10

Conclusions

  • Model-based approach for streaming Big Data applications

– Type-safe
– Heterogeneous data processing (hardware co-processors)
– Flexible exchange of algorithms

  • Code generation for Apache Storm
  • Approach pays off


– Positive feedback
– Requires training, modeling effort, effort for realization of transformation, maintenance and evolution

  • Future: Optimized code generation for self-adaptation

– Switching efficiency
– Multiple target platforms

Optimized resource usage is already reality!
