DSP Frameworks
Corso di Sistemi e Architetture per Big Data, A.A. 2018/19
Valeria Cardellini, Laurea Magistrale in Ingegneria Informatica
Macroarea di Ingegneria, Dipartimento di Ingegneria Civile e Ingegneria Informatica


DSP frameworks we consider

  • Apache Storm (with lab)
  • Twitter Heron
    – From Twitter, like Storm, and compatible with Storm
  • Apache Spark Streaming (lab)
    – Reduces each stream to a sequence of batches and processes streams of data (micro-batch processing)
  • Apache Flink
  • Apache Samza
  • Cloud-based frameworks
    – Google Cloud Dataflow
    – Amazon Kinesis

Valeria Cardellini - SABD 2018/19

Apache Storm

  • Apache Storm
    – Open-source, real-time, scalable streaming system
    – Provides an abstraction layer to execute DSP applications
    – Initially developed by Twitter
  • Topology
    – DAG of spouts (sources of streams) and bolts (operators and data sinks)
    – Top-level abstraction submitted to Storm for execution

Storm: stream grouping

  • Stream grouping defines how to send tuples between two topology nodes
    – Remember data parallelism: spouts and bolts execute in parallel (multiple threads of execution)
  • Shuffle grouping
    – Randomly partitions the tuples
  • Fields grouping
    – Hashes on a subset of the tuple attributes
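The two basic groupings can be sketched in a framework-independent way. The toy Python below (not Storm's Java API; the helper names are hypothetical) shows random partitioning for shuffle grouping and hash-on-key-fields for fields grouping:

```python
import random

# Framework-independent sketch: route each tuple of a stream to one of
# n parallel consumer tasks.

def shuffle_grouping(tuples, n_tasks, rng=None):
    """Randomly partitions the tuples across tasks (load balancing)."""
    rng = rng or random.Random(42)
    return [rng.randrange(n_tasks) for _ in tuples]

def fields_grouping(tuples, key_fields, n_tasks):
    """Hashes on a subset of the tuple attributes: tuples with equal
    key fields always reach the same task."""
    return [hash(tuple(t[f] for f in key_fields)) % n_tasks for t in tuples]

stream = [{"word": "big", "n": 1}, {"word": "data", "n": 1}, {"word": "big", "n": 1}]
routes = fields_grouping(stream, ["word"], n_tasks=4)
assert routes[0] == routes[2]            # same key -> same task
assert all(0 <= r < 4 for r in routes)   # every tuple lands on a valid task
```

This is why fields grouping is the natural choice for stateful per-key operators (e.g., word counting), while shuffle grouping is preferred for stateless load balancing.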

Storm: stream grouping

  • All grouping (i.e., broadcast)
    – Replicates the entire stream to all the consumer tasks
  • Global grouping
    – Sends the entire stream to a single task of a bolt
  • Direct grouping
    – The producer of the tuple decides which task of the consumer will receive this tuple

Storm: topology API

  • Storm uses tuples as its data model
    – Tuple: named list of values; a field in a tuple can be an object of any type
    – Storm supports all the primitive types, strings, and byte arrays as tuple field values
    – To use an object of another type, you just need to implement a serializer for the type
  • A simple topology: Exclamation
    – The spout emits words, and each bolt appends the string "!!!" to its input

The setSpout and setBolt methods take as input:

  • a user-specified id
  • an object containing the processing logic
  • the amount of parallelism for the operator
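The Exclamation dataflow can be mimicked in a few lines of plain Python (a toy simulation, not Storm's Java API; the source words are made up): a spout generator feeds two chained bolts, each of which appends "!!!".

```python
# Toy simulation of the Exclamation topology's dataflow: a spout emits
# words and two chained bolts each append "!!!" to every tuple.

def word_spout():
    # hypothetical source words, standing in for the spout's stream
    for word in ["storm", "heron"]:
        yield word

def exclaim_bolt(stream):
    for word in stream:
        yield word + "!!!"

# Wiring spout -> bolt -> bolt, as TopologyBuilder's setSpout/setBolt would.
output = list(exclaim_bolt(exclaim_bolt(word_spout())))
assert output == ["storm!!!!!!", "heron!!!!!!"]  # each word gets "!!!" twice
```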

Storm: topology API

  • Example: WordCount
    – For the complete code see https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java
  • Bolts can be defined in any language
    – Bolts written in another language are executed as subprocesses, and Storm communicates with those subprocesses with JSON messages over stdin/stdout
    – Communication protocol implemented in an adapter library already available for Python

Storm: windows

  • Windowing support introduced in Storm from v. 1.0
    – Before, developers had to build their own windowing logic
  • Storm has support for sliding and tumbling windows based on time duration or event count
  • Window types
    – Tumbling windows (aka fixed windows)
      • Length or duration
      • Can be count-based or time-based
    – Sliding windows
      • Length or duration + sliding interval
      • Data can be processed in more than one window
      • Can be count-based or time-based
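The difference between the two window types is easiest to see with count-based windows. A minimal framework-independent sketch (function names are illustrative, not Storm's API):

```python
# Count-based windows over a list of tuples.

def tumbling(stream, length):
    """Non-overlapping windows of `length` tuples each."""
    return [stream[i:i + length] for i in range(0, len(stream), length)]

def sliding(stream, length, interval):
    """Windows of `length` tuples advancing by `interval` tuples;
    a tuple can appear in more than one window."""
    return [stream[i:i + length]
            for i in range(0, len(stream) - length + 1, interval)]

events = [1, 2, 3, 4, 5, 6]
assert tumbling(events, 2) == [[1, 2], [3, 4], [5, 6]]
assert sliding(events, 3, 1) == [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```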
Storm: windows

  • Tuples grouped in a single window based on time or count
    – Count-based windows
      • Specified on the basis of the number of operations rather than a time interval (no relation to clock time)
    – Time-based windows
      • Specified on the basis of a time duration (in seconds) rather than a number of tuples processed

Storm: architecture

  • Master-worker architecture
Storm components: Nimbus and ZooKeeper

  • Nimbus
    – The master node
    – Clients submit topologies to it
    – Responsible for distributing and coordinating the topology execution
  • ZooKeeper
    – Nimbus uses a combination of the local disk(s) and ZooKeeper to store state about the topology

Storm components: worker

[Figure: a worker process (Java process) runs executors (threads), each executing one or more tasks]

  • Task: operator instance
    – Actual work for a bolt or spout is done by the task
  • Executor: smallest schedulable entity
    – Executes one or more tasks related to the same operator
  • Worker process: Java process running one or more executors
  • Worker node: computing resource, a container for one or more worker processes

Storm components: supervisor

  • Each worker node runs a supervisor
  • The supervisor:
    – receives assignments from Nimbus (through ZooKeeper) and spawns workers based on the assignment
    – sends to Nimbus (through ZooKeeper) a periodic heartbeat
    – advertises the topologies that it is currently running, and any vacancies that are available to run more topologies

Storm: running topology

  • The application developer can configure the parallelism of a topology
    – Number of worker processes
    – Number of executors (threads)
    – Number of tasks
  • Parallelism of a running topology can be changed manually using the rebalance command
Storm: reliable message processing

  • What happens if a bolt fails to process a tuple?
  • Storm provides a mechanism by which the originating spout can replay the failed tuple
    – Needs to maintain the link between the spout tuple and its child tuples so as to detect when the tree of tuples is fully processed: anchoring
    – And needs to ack or fail the spout tuple appropriately
      • If the ack is not received within a specified timeout period, the tuple processing is considered as failed
  • Storm offers at-least-once semantics
    – Add Trident for exactly-once semantics
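The spout-side bookkeeping behind at-least-once delivery can be sketched as follows (a toy model, not Storm's implementation; the class and timeout values are illustrative): pending tuples are remembered until acked, and any tuple whose ack misses the timeout is reported for replay.

```python
# Sketch of spout-side at-least-once logic: pending tuples are replayed
# unless an ack arrives within the timeout.

class ReplayingSpout:
    def __init__(self, timeout):
        self.timeout = timeout
        self.pending = {}            # msg_id -> (tuple, emit_time)

    def emit(self, msg_id, tup, now):
        self.pending[msg_id] = (tup, now)

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)   # fully processed: forget it

    def failed(self, now):
        """Msg ids whose ack did not arrive in time and must be replayed."""
        return [mid for mid, (tup, ts) in self.pending.items()
                if now - ts > self.timeout]

spout = ReplayingSpout(timeout=30)
spout.emit("m1", ("word", "a"), now=0)
spout.emit("m2", ("word", "b"), now=0)
spout.ack("m1")
assert spout.failed(now=60) == ["m2"]   # m2 was never acked: replay it
```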

Storm: application monitoring

  • Storm has a built-in monitoring and metrics system
    – Built-in and user-defined metrics
  • Built-in metrics include:
    – Capacity
      • number of messages executed * average execute latency / time window
    – Latency
      • For spouts: completeLatency (total latency for processing the message)
        – Ignore the value if acking is disabled
      • For bolts: executeLatency (avg time the bolt spends in the execute method) and processLatency (avg time from starting execute to ack)
    – JVM memory usage and garbage collection
  • Metrics can be queried via Storm’s UI REST API or reported to a registered consumer (e.g., Graphite)

See https://storm.apache.org/releases/1.2.2/STORM-UI-REST-API.html
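The capacity formula above amounts to the fraction of the time window a bolt spent inside execute(); the numbers below are made up for illustration:

```python
# Capacity per bolt: messages executed * average execute latency,
# divided by the measurement time window. Values close to 1 mean the
# bolt is saturated and may need more parallelism.

def capacity(executed, execute_latency_ms, window_ms):
    return executed * execute_latency_ms / window_ms

# e.g. 60,000 tuples at 8 ms each over a 10-minute (600,000 ms) window:
assert capacity(60_000, 8, 600_000) == 0.8
```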

Twitter Heron

  • Real-time, distributed, fault-tolerant stream processing engine from Twitter
  • Developed as the direct successor of Storm
    – Released as open source in 2016: https://apache.github.io/incubator-heron/
    – Stream data processing engine used at Twitter
  • Goal of overcoming Storm’s performance, reliability, and other shortcomings
  • Compatibility with Storm
    – API compatible with Storm: no code change is required for migration

Heron: in common with Storm

  • Same terminology as Storm
    – Topology, spout, bolt
  • Same stream groupings
    – Shuffle, fields, all, global
  • Example: WordCount topology

Heron: design goals

  • Isolation
    – Process-based topologies rather than thread-based
    – Each process should run in isolation (easy debugging, profiling, and troubleshooting)
  • Resource constraints
    – Safe to run in shared infrastructure: topologies use only initially allocated resources and never exceed bounds
  • Compatibility
    – Fully API and data model compatible with Storm

Heron: design goals

  • Backpressure
    – Built-in rate control mechanism to ensure that topologies can self-adjust in case components lag
    – Heron dynamically adjusts the rate at which data flows through the topology using backpressure
  • Performance
    – Higher throughput and lower latency than Storm
    – Enhanced configurability to fine-tune potential latency/throughput trade-offs
  • Semantic guarantees
    – Support for both at-most-once and at-least-once processing semantics
  • Efficiency
    – Minimum possible resource usage
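The idea behind rate-based backpressure can be sketched with a bounded queue: when a consumer's queue fills past a high watermark, the producer slows down; when it drains, the producer ramps back up. The thresholds and the halve/double policy below are illustrative only, not Heron's actual mechanism:

```python
# Toy rate controller for backpressure: adjust the producer's send rate
# based on how full the consumer's bounded queue is.

def adjust_rate(rate, queue_len, capacity, high=0.8, low=0.2, max_rate=1000):
    if queue_len > high * capacity:
        return rate / 2                   # lagging consumer: slow down
    if queue_len < low * capacity:
        return min(rate * 2, max_rate)    # drained: ramp back up (capped)
    return rate                           # steady state: keep the rate

rate = 1000
rate = adjust_rate(rate, queue_len=90, capacity=100)   # congested
assert rate == 500
rate = adjust_rate(rate, queue_len=10, capacity=100)   # drained
assert rate == 1000
```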

Heron: topology

  • Similarly to Storm, a Heron topology is a DAG used to process streams of data and consists of spouts and bolts
    – Spouts inject data from external sources like pub-sub messaging systems (Apache Kafka, Apache Pulsar, etc.)
    – Bolts apply user-defined processing logic to data supplied by spouts; they can be stateless or stateful

Heron API

  • Heron API based on Storm topology API
  • Window operations supported in both frameworks
  • Same window types as in Storm
    – Tumbling windows
    – Sliding windows
      • Based on count or time
Heron API: shift to functional style

  • Recent shift from procedural to functional style
    – Change common also to Apache Storm
      • Heron: Heron Streamlet API
      • Storm: Stream API
      • Still in beta
    – Let’s examine the Heron Streamlet API

Heron API: shift to functional style

  • Processing graphs consist of streamlets
    – One or more supplier streamlets inject data into the graph to be processed by downstream operators
  • Operations (similar to Spark)
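The functional style can be shown in miniature with a toy "streamlet" wrapper exposing chainable operations. This is only an analogy in Python, not the Heron Streamlet Java API (method names mirror the style, not Heron's signatures):

```python
# Toy streamlet: an iterable wrapper with chainable functional operations.

class Streamlet:
    def __init__(self, source):
        self.source = source

    def map(self, fn):
        return Streamlet(fn(x) for x in self.source)

    def filter(self, pred):
        return Streamlet(x for x in self.source if pred(x))

    def flat_map(self, fn):
        return Streamlet(y for x in self.source for y in fn(x))

    def to_list(self):
        return list(self.source)

result = (Streamlet(["big data", "dsp"])
          .flat_map(str.split)     # split sentences into words
          .map(str.upper)          # transform each element
          .to_list())
assert result == ["BIG", "DATA", "DSP"]
```

The processing graph is declared by chaining transformations instead of wiring named spouts and bolts by hand.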

Heron API: shift to functional style

  • Operations (continued)

Heron: topology lifecycle

  • Topology lifecycle managed through Heron’s CLI tool
  • Stages
    – Submit the topology to the cluster
    – Activate the topology
    – Restart an active topology, e.g., after updating the topology configuration
    – Deactivate the topology
    – Kill a topology to completely remove it from the cluster

Heron topology: logical and physical plans

  • Topology’s logical plan: analogous to a database query plan in that it maps out the basic operations associated with a topology
  • Topology’s physical plan: determines the “physical” execution logic of a topology, i.e. how topology processes are divided between Heron containers
  • Logical and physical plans are automatically created by Heron

Heron architecture per topology

  • Master-worker architecture
  • One Topology Master (TM)
    – Manages a topology throughout its entire lifecycle
  • Multiple Containers
    – Each Container runs multiple Heron Instances, a Stream Manager, and a Metrics Manager
    – A Heron Instance is a process that handles a single task of a spout or bolt
    – Containers communicate with the TM to ensure that the topology forms a fully connected graph


Heron architecture per topology

  • Stream Manager (SM): routing engine for data streams
    – Each Heron container connects to its local SM, while all of the SMs in a given topology connect to one another to form a network
    – Responsible for propagating backpressure

Heron: topology submit sequence

Heron: self-adaptation

  • Dhalion: framework on top of Heron to autonomously reconfigure topologies to meet throughput SLOs, scaling resource consumption up and down as needed
  • Phases in Dhalion:
    – Symptom detection (backpressure, skew, …)
    – Diagnosis generation
    – Resolution
  • Adaptation actions: parallelism changes

Heron environment

  • Heron supports deployment on Apache Mesos
  • Can also run on Mesos using Apache Aurora as a scheduler, or using a local scheduler

Batch processing vs. stream processing

  • Batch processing is just a special case of stream processing

Batch processing vs. stream processing

  • Batched/stateless: scheduled in batches
    – Short-lived tasks (Hadoop, Spark)
    – Distributed streaming over batches (Spark Streaming)
  • Dataflow/stateful: continuous/scheduled once (Storm, Flink, Heron)
    – Long-lived task execution
    – State is kept inside tasks

Native vs. non-native streaming

Apache Flink

  • Distributed data flow processing system
  • One common runtime for DSP applications and batch processing applications
    – Batch processing applications run efficiently as special cases of DSP applications
  • Integrated with many other projects in the open-source data processing ecosystem
  • Derives from the Stratosphere project by TU Berlin, Humboldt University and Hasso Plattner Institute
  • Supports a Storm-compatible API
Flink: software stack

  • Flink is a layered system
  • On top: libraries with high-level APIs for different use cases

https://ci.apache.org/projects/flink/flink-docs-release-1.8/

Flink: programming model

  • Data streams
    – Unbounded, partitioned, immutable sequences of events
  • Stream operators
    – Stream transformations that take one or more streams as input, and produce one or more output streams as a result

DSP and time

  • Different notions of time in a DSP application:
    – Processing time: time at which events are observed in the system (local time of the machine executing the operator)
    – Event time: time at which events actually occurred
      • Usually described by a timestamp in the events
    – Ingestion time: when an event enters the dataflow at the source operator(s)

See https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

Flink: time

  • Flink supports all the 3 notions of time
    – Internally, ingestion time is treated similarly to event time
  • Event time makes it easy to compute over streams where events arrive out of order, and where events may arrive delayed
  • How to measure the progress of event time?
    – Flink uses watermarks

Flink: backpressure

  • Continuous streaming model with backpressure
    – Flink’s streaming runtime provides flow control: slow data sinks backpressure faster sources
    – Flink’s UI allows monitoring the backpressure behavior of running jobs
      • Backpressure warning (e.g., High) for an upstream operator

Flink: other features

  • Highly flexible streaming windows
    – Also user-defined windows
  • Exactly-once semantics for stateful computations
    – Based on two-phase commit

Flink: levels of abstraction

  • Different levels of abstraction to develop streaming/batch applications
  • APIs in Java and Scala
Flink: APIs and libraries

  • Streaming data applications: DataStream API
    – Supports functional transformations on data streams, with user-defined state and flexible windows
    – Example: how to compute a sliding histogram of word occurrences of a data stream of texts
      • WindowWordCount in Flink’s DataStream API: sliding time window of 5 sec length and 1 sec trigger interval

Flink: APIs and libraries

  • Batch processing applications: DataSet API
    – Supports a wide range of data types beyond key/value pairs and a wealth of operators
    – Example: core loop of the PageRank algorithm for graphs

Anatomy of a Flink program

  • Let’s analyze the DataStream API: https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html
  • Each Flink program consists of the same basic parts:
    1. Obtain an execution environment
    2. Load/create the initial data

Anatomy of a Flink program

    3. Specify transformations on data
    4. Specify where to put the results of your computations
    5. Trigger the program execution

Flink: lazy evaluation

  • All Flink programs are executed lazily
    – When the program’s main method is executed, the data loading and transformations do not happen directly
    – Rather, each operation is created and added to the program’s plan
    – Operations are actually executed when the execution is explicitly triggered by an execute() call on the execution environment
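Lazy evaluation can be illustrated with a toy environment that only records operations in a plan and runs them on an explicit execute() call. This is a Python analogue of the idea, not Flink's API (class and method names are made up):

```python
# Toy lazy-evaluation environment: operations are recorded in a plan,
# nothing is computed until execute() is called.

class ToyEnvironment:
    def __init__(self):
        self.plan = []

    def add_operation(self, name, fn):
        self.plan.append((name, fn))    # recorded, not executed

    def execute(self, data):
        for name, fn in self.plan:      # only now does work happen
            data = fn(data)
        return data

env = ToyEnvironment()
env.add_operation("double", lambda xs: [x * 2 for x in xs])
env.add_operation("sum", sum)
assert len(env.plan) == 2               # nothing computed yet
assert env.execute([1, 2, 3]) == 12     # triggered explicitly
```

Building the whole plan before running it is what lets the engine optimize and schedule the dataflow as one unit.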

Flink: data sources

  • Several predefined stream sources accessible from the StreamExecutionEnvironment
  • 1. File-based:
    – E.g., readTextFile(path) to read text files
    – Flink splits the file reading process into two sub-tasks: directory monitoring and data reading
      • Monitoring is implemented by a single, non-parallel task, while reading is performed by multiple tasks running in parallel, whose parallelism is equal to the job parallelism
  • 2. Socket-based
  • 3. Collection-based
  • 4. Custom
    – E.g., to read from Kafka: addSource(new FlinkKafkaConsumer08<>(...))
    – See Apache Bahir for streaming connectors and SQL data sources: https://bahir.apache.org/

Flink: DataStream transformations

  • Map: DataStream → DataStream
    – Example: double the values of the input stream
  • FlatMap: DataStream → DataStream
    – Example: split sentences to words

Flink: DataStream transformations

  • Filter: DataStream → DataStream
    – Example: filter out zero values
  • KeyBy: DataStream → KeyedStream
    – To specify a key that logically partitions a stream into disjoint partitions
    – Internally, implemented with hash partitioning
    – Different ways to specify keys; the simplest case is grouping tuples on one or more fields of the tuple

Flink: DataStream transformations

  • Reduce: KeyedStream → DataStream
    – “Rolling” reduce on a keyed data stream
    – Combines the current element with the last reduced value and emits the new value
    – Example: create a stream of partial sums
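The rolling-reduce semantics (emit a new value for every incoming element, keeping per-key state) can be sketched in a few lines of Python, framework-independently:

```python
# Rolling reduce: combine each element with the last reduced value for
# its key and emit the new value after every element (partial sums).

def rolling_reduce(keyed_stream, reduce_fn):
    state, out = {}, []
    for key, value in keyed_stream:
        state[key] = reduce_fn(state[key], value) if key in state else value
        out.append((key, state[key]))
    return out

stream = [("a", 1), ("a", 2), ("b", 10), ("a", 3)]
assert rolling_reduce(stream, lambda acc, v: acc + v) == \
    [("a", 1), ("a", 3), ("b", 10), ("a", 6)]
```

Note that, unlike a batch reduce, every input element produces an output element: the stream of partial sums.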

Flink: DataStream transformations

  • Fold: KeyedStream → DataStream
    – “Rolling” fold on a keyed data stream with an initial value
    – Combines the current element with the last folded value and emits the new value
    – Example: emits the sequence "start-1", "start-1-2", "start-1-2-3", ... when applied on the sequence (1,2,3,4,5)
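The fold example above can be reproduced with a small framework-independent sketch (a single key is assumed for brevity):

```python
# Rolling fold with an initial value: applied to (1,2,3,4,5) with the
# initial value "start", it emits "start-1", "start-1-2", ...

def rolling_fold(stream, initial, fold_fn):
    acc, out = initial, []
    for value in stream:
        acc = fold_fn(acc, value)
        out.append(acc)                 # emit after every element
    return out

emitted = rolling_fold([1, 2, 3, 4, 5], "start",
                       lambda acc, v: f"{acc}-{v}")
assert emitted[:3] == ["start-1", "start-1-2", "start-1-2-3"]
```

The difference from reduce is the initial value, which also lets the output type differ from the input type (strings out of integers here).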

Flink: DataStream transformations

  • Aggregations: KeyedStream → DataStream
    – To aggregate on a keyed data stream
    – min returns the minimum value, whereas minBy returns the element that has the minimum value in this field
  • Window: KeyedStream → WindowedStream

Flink: DataStream transformations

  • Other transformations available in Flink
    – Join: joins two data streams on a given key
    – Union: union of two or more data streams, creating a new stream containing all the elements from all the streams
    – Split: splits the stream into two or more streams according to some criterion
    – Iterate: creates a “feedback” loop in the flow, by redirecting the output of one operator to some previous operator
      • Useful for algorithms that continuously update a model

See https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/operators/

Example: streaming window WordCount

  • Count the words from a web socket in 5 sec windows
    – Key by the first element of a tuple


Flink: windows support

  • Windows can be applied either to keyed streams or to non-keyed ones
  • General structure of a windowed Flink program

Flink: window lifecycle

  • First, specify if the stream is keyed or not and define the window assigner
    – A keyed stream allows the windowed computation to be performed in parallel by multiple tasks
    – The window is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness
  • Then associate to the window the trigger and the function
    – Trigger determines when a window is ready to be processed by the window function
    – Function specifies the computation to be applied to the window contents

Flink: window assigners

  • How elements are assigned to windows
  • Support for different window assigners
    – Each WindowAssigner comes with a default Trigger
  • Built-in assigners for most common use cases:
    – Tumbling windows
    – Sliding windows
    – Session windows
    – Global windows
  • Except for global windows, they assign elements to windows based on time, which can either be processing time or event time
  • Also possible to implement a custom window assigner

Flink: window assigners

  • Session windows
    – To group elements by sessions of activity
    – Differently from tumbling and sliding windows, they do not overlap and do not have a fixed start and end time
    – A session window closes when a gap of inactivity occurs
  • Global windows
    – To assign all elements with the same key to the same single global window
    – Only useful if you also specify a custom trigger
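The gap-of-inactivity rule can be sketched in a few lines (framework-independent toy, timestamps and gap are illustrative): a new session window starts whenever two consecutive event timestamps are further apart than the gap.

```python
# Session windows over sorted event timestamps: close the current
# window whenever the inactivity gap is exceeded.

def session_windows(timestamps, gap):
    windows, current = [], [timestamps[0]]
    for prev, ts in zip(timestamps, timestamps[1:]):
        if ts - prev > gap:
            windows.append(current)     # gap of inactivity: close session
            current = []
        current.append(ts)
    windows.append(current)
    return windows

events = [1, 2, 3, 10, 11, 30]
assert session_windows(events, gap=5) == [[1, 2, 3], [10, 11], [30]]
```

This is why session windows have no fixed length: each window's extent is determined by the data itself.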

Flink: window functions

  • Different window functions to specify the computation on each window
  • ReduceFunction
    – To incrementally aggregate the elements of a window
    – Example: sum up the second fields of the tuples for all elements in a window

Flink: window functions

  • AggregateFunction: generalized version of a ReduceFunction
    – Example: compute the average of the second field of the elements in the window

Flink: window functions

  • FoldFunction: to specify how an input element of the window is combined with an element of the output type
  • ProcessWindowFunction: gets an Iterable containing all the elements of the window, and a Context object with access to time and state information
    – More flexibility than other window functions, at the cost of performance and resource consumption: elements are buffered until the window is ready for processing
  • ReduceFunction and AggregateFunction can be executed more efficiently
    – Flink can incrementally aggregate the elements for each window as they arrive

Flink: control events

  • Control events: special events injected in the data stream by operators
  • Two types of control events in Flink
    – Watermarks
    – Checkpoint barriers

Flink: watermarks

  • Watermarks signal the progress of event time within a data stream
    – Watermark(t) declares that event time has reached time t in that stream, meaning that there should be no more elements with timestamp t’ <= t
    – Crucial for out-of-order streams, where events are not ordered by their timestamps
  • Flink does not provide ordering guarantees after any form of stream partitioning or broadcasting
    – In such cases, dealing with out-of-order tuples is left to the operator implementation
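One common watermark strategy (a sketch, not Flink's implementation; the fixed delay is an assumed parameter) trails the maximum event timestamp seen so far by a bounded delay, so that moderately late events are still covered:

```python
# Bounded-out-of-orderness watermarks: after each event, emit a
# watermark equal to the maximum seen timestamp minus a fixed delay.

def watermarks(event_times, max_delay):
    max_seen, out = float("-inf"), []
    for t in event_times:
        max_seen = max(max_seen, t)          # late events cannot lower it
        out.append(max_seen - max_delay)
    return out

# Events arrive out of order; the watermark still never goes backwards:
wm = watermarks([10, 12, 11, 15, 13], max_delay=2)
assert wm == [8, 10, 10, 13, 13]
assert wm == sorted(wm)                      # monotonically non-decreasing
```

Events with timestamps at or below the current watermark can be considered complete, which is what allows event-time windows to fire.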
Flink: checkpoint barriers

  • To provide fault tolerance (see next slides), special barrier markers (called checkpoint barriers) are periodically injected at stream sources and then pushed downstream up to the sinks

Fault tolerance

  • To provide consistent results, DSP systems need to be resilient to failures
  • How? By periodically capturing a snapshot of the execution graph, which can be used later to restart in case of failures (checkpointing)
    – Snapshot: global state of the execution graph, capturing all necessary information to restart computation from that specific execution state
  • A common approach is to rely on periodic global state snapshots, but it has drawbacks:
    – Stalls the overall computation
    – Eagerly persists all tuples in transit along with states, which results in larger snapshots than required

Flink: fault tolerance

  • Flink offers a lightweight snapshotting mechanism
    – Allows to maintain high throughput and provide strong consistency guarantees at the same time
  • Such mechanism:
    – Draws consistent snapshots of stream flows and operators’ state
    – Even in presence of failures, the application state will reflect every record from the data stream exactly once
    – State stored at a configurable place
    – Disabled by default
  • Inspired by the Chandy-Lamport algorithm for distributed snapshots and tailored to Flink’s execution model

Chandy-Lamport algorithm

  • The observer process (the process initiating the snapshot):
    – Saves its own local state
    – Sends a snapshot request message bearing a snapshot token to all other processes
  • If a process receives the token for the first time:
    – Sends the observer process its own saved state
    – Attaches the snapshot token to all subsequent messages (to help propagate the snapshot token)
  • When a process that has already received the token receives a message not bearing the token, it will forward that message to the observer process
    – This message was sent before the snapshot “cut off” (as it does not bear a snapshot token) and needs to be included in the snapshot
  • The observer builds up a complete snapshot: a saved state for each process and all messages “in the ether” are saved
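The third rule above, which captures the "in the ether" messages, can be sketched as follows (a toy model of one receiving process, following the description on this slide rather than a full multi-process implementation):

```python
# At a process that already holds the snapshot token: every arriving
# message NOT bearing the token was sent before the snapshot cut-off,
# so it is forwarded to the observer for inclusion in the snapshot.

def in_flight_messages(arrivals):
    """arrivals: (payload, has_token) pairs in arrival order."""
    return [payload for payload, has_token in arrivals if not has_token]

arrivals = [("m1", False), ("m2", True), ("m3", False)]
assert in_flight_messages(arrivals) == ["m1", "m3"]
```

Messages m1 and m3 predate the cut-off and become part of the channel state; m2 carries the token, so its sender had already saved its own state before sending it.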

Flink: fault tolerance

  • Uses checkpoint barriers
    – When an operator has received a barrier for snapshot n from all of its input streams, it emits a barrier for snapshot n into all of its outgoing streams
    – Once a sink operator has received barrier n from all of its input streams, it acknowledges that snapshot n to the checkpoint coordinator
    – After all sinks have acknowledged a snapshot, it is considered completed

https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html

Flink: performance and memory management

  • High throughput and low latency
  • Memory management
    – Flink implements its own memory management inside the JVM

Flink: architecture

  • The usual master-worker architecture
Flink: architecture

  • Master (JobManager): schedules tasks, coordinates checkpoints, coordinates recovery on failures, etc.
  • Workers (TaskManagers): JVM processes that execute tasks of a dataflow, and buffer and exchange the data streams
    – Workers use task slots to control the number of tasks they accept (at least one)
    – Each task slot represents a fixed subset of resources of the worker
Flink: application execution

  • The JobManager receives the JobGraph
    – Representation of the data flow, consisting of operators (JobVertex) and intermediate results (IntermediateDataSet)
    – Each operator has properties, like parallelism and the code that it executes
  • The JobManager transforms the JobGraph into an ExecutionGraph
    – Parallel version of the JobGraph

Flink: application execution

  • Data parallelism
    – Different operators of the same program may have different levels of parallelism
    – The parallelism of an individual operator, data source, or data sink can be defined by calling its setParallelism() method

Flink: application execution

  • The execution plan can be visualized

Flink: application monitoring

  • Flink has a built-in monitoring and metrics system
  • Built-in metrics include
    – Throughput: in terms of number of records per sec. (per operator/task)
    – Latency
      • Support for latency tracking: special markers are periodically inserted at all sources in order to obtain a distribution of latency between sources and each downstream operator
        – But they do not account for time spent in operator processing
        – Assume that all machines’ clocks are in sync
    – Used JVM heap/non-heap/direct memory
  • Application-specific metrics can be added
    – E.g., counters for the number of invalid records
  • All metrics can be either queried via Flink’s REST API or sent to external systems (e.g., Graphite and InfluxDB)

See https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html

Flink: deployment

  • Designed to run on large-scale clusters with many thousands of nodes
  • Can be run in a fully distributed fashion on a static (but possibly heterogeneous) standalone cluster
  • For a dynamically shared cluster, can be deployed on YARN or Mesos
  • Docker images for Apache Flink available on Docker Hub

A recent need

  • A common need for many companies
    – Run both batch and stream processing
  • Alternative solutions
    1. Lambda architecture
    2. Unified frameworks
    3. Unified programming model

Lambda architecture

  • Data-processing design pattern to integrate batch and real-time processing
  • Streaming framework used to process real-time events and, in parallel, batch framework to process the entire dataset
  • Results from the two parallel pipelines are then merged

Source: https://voltdb.com/products/alternatives/lambda-architecture

slide-83
SLIDE 83

Lambda architecture: example

Valeria Cardellini - SABD 2018/19 82

  • Lambda architecture used at LinkedIn before Samza development

SLIDE 84

Lambda architecture: pros and cons

  • Pros:

– Flexibility in the frameworks’ choice

  • Cons:

– Implementing and maintaining two separate frameworks for batch and stream processing can be hard and error-prone
– Overhead of developing and managing multiple code bases
  • The logic in each fork evolves over time, and keeping the two in sync involves duplicated and complex manual effort, often in different languages
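The serving-layer merge of the two pipelines can be illustrated with a minimal sketch (hypothetical page-view counts; real deployments merge a batch view computed by e.g. Hadoop with a speed-layer view computed by a DSP framework):

```python
# Minimal sketch of the Lambda architecture's serving-layer merge
# (illustrative only; views and keys here are invented for the example).

batch_view = {"page_a": 1000, "page_b": 500}   # precomputed over the full dataset
realtime_view = {"page_a": 12, "page_c": 3}    # events since the last batch run

def merged_count(key):
    # Query both pipelines and combine their partial results.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(merged_count("page_a"))  # 1012
```

The duplication cost mentioned above shows up precisely here: the counting logic must be implemented once in the batch pipeline and once in the streaming pipeline, and both must stay consistent.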

SLIDE 85

Unified frameworks

  • Use a unified (Lambda-less) design for processing both real-time and batch data using the same data structure
  • Spark, Flink, Samza and Apex follow this trend

SLIDE 86

Unified programming model: Apache Beam

  • A new layer of abstraction
  • Provides an advanced unified programming model
– Allows defining batch and streaming data processing pipelines that run on any execution engine (for now: Apex, Flink, Spark, Google Cloud Dataflow)
– Java, Python and Go as programming languages
  • Translates the data processing pipeline defined by the user with the Beam program into the API compatible with the chosen distributed processing engine
  • Developed by Google and released as an open-source top-level project
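The core idea of a unified model — one pipeline definition that runs over both bounded (batch) and unbounded (streaming) data — can be sketched in plain Python (a toy illustration of the concept, not the Apache Beam API):

```python
from itertools import islice

# Toy "unified model": the pipeline is defined once, independent of whether
# the input collection is bounded or unbounded.

def pipeline(records):
    # filter -> transform, applied lazily so it works for infinite inputs too
    return (len(word) for word in records if word.startswith("a"))

# Bounded (batch) input: consume the whole result.
batch_input = ["apple", "bee", "ant"]
print(list(pipeline(batch_input)))          # [5, 3]

# Unbounded (streaming) input: consume results incrementally.
def unbounded_source():
    while True:
        yield "abc"

stream_output = pipeline(unbounded_source())
print(list(islice(stream_output, 2)))       # [3, 3]
```

In Beam the same separation holds at scale: the user writes the pipeline once, and a runner translates it to the chosen engine’s batch or streaming API.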

SLIDE 87

Apache Samza

  • A distributed framework for stateful and fault-tolerant stream processing
– Unified framework for batch and stream processing
  • Similarly to Flink, streams as first-class citizens, batch as a special case of streaming
– Used in production at LinkedIn

SLIDE 88

Apache Samza

  • Why stateful and fault-tolerant processing? User profiles, email digests, aggregate counts, …
  • Example: Email Digestion System at LinkedIn
– Production application that digests updates into one email

SLIDE 89

Samza: features

  • Unified processing API for stream and batch
– Supports both stateless and stateful stream processing
– Supports both processing time and event time
  • Configurable and heterogeneous data sources and sinks (e.g., Kafka, HDFS, AWS Kinesis)
  • At-least-once processing
  • Efficient state management
– Local state (in-memory or on disk) partitioned among tasks (rather than a remote data store)
– Incremental checkpointing: only the delta rather than the entire state
  • Flexible deployment
– As a lightweight embedded library that can be integrated with a larger application
– Alternatively, as a managed framework using YARN
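Incremental checkpointing, as listed above, can be sketched as follows (illustrative Python, not Samza’s actual implementation; the `LocalStore` class and its methods are invented for this example):

```python
# Sketch of incremental checkpointing: only keys modified since the last
# checkpoint (the delta) are written out, not the entire state.

class LocalStore:
    def __init__(self):
        self.state = {}
        self.dirty = set()  # keys changed since the last checkpoint

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def checkpoint(self):
        # Ship only the delta to the backing store / changelog, then reset it.
        delta = {k: self.state[k] for k in sorted(self.dirty)}
        self.dirty.clear()
        return delta

store = LocalStore()
store.put("user:1", 10)
store.put("user:2", 20)
print(store.checkpoint())   # {'user:1': 10, 'user:2': 20}
store.put("user:1", 11)
print(store.checkpoint())   # {'user:1': 11}  <- only the changed key
```

With large state, writing only the delta makes checkpointing cost proportional to the update rate rather than to the total state size.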

SLIDE 90

Samza: architecture

  • Task: logical unit of parallelism
  • Container: physical unit of parallelism
  • Usual architecture
– The coordinator manages the assignment of tasks across containers, monitors the liveness of containers, and redistributes the tasks upon a failure
– One coordinator per application
– Host-affinity: during a new deployment, Samza tries to preserve the assignment of tasks to hosts so as to reuse the snapshot of the local state

SLIDE 91

DSP state management


  • How to manage state information, i.e., “intermediate information” that needs to be maintained between tuples to process streams of data correctly?
  • Common approach (e.g., in Storm) to deal with large amounts of state: use a remote data store (e.g., Redis)

SLIDE 92

Samza: state management

  • Samza approach: keep state local to each node and make it robust to failures by replicating state changes across multiple machines

(Figure: remote external store vs. replicated local state)
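This changelog-based approach can be sketched in plain Python (an illustrative toy; in Samza the changelog is a replicated Kafka topic and the local store is typically an on-disk key-value store such as RocksDB):

```python
# Sketch of local state with changelog replication: every update to the
# local store is also appended to a replicated changelog, and a failed
# node's state is rebuilt by replaying that log.

changelog = []          # stands in for a replicated changelog topic

class LocalState:
    """Local key-value store whose writes are also appended to the changelog."""
    def __init__(self):
        self.kv = {}

    def put(self, key, value):
        self.kv[key] = value
        changelog.append((key, value))   # replicate the change, not the whole state

def recover():
    """Rebuild lost local state by replaying the changelog."""
    kv = {}
    for key, value in changelog:
        kv[key] = value                  # later writes overwrite earlier ones
    return kv

s = LocalState()
s.put("clicks", 1)
s.put("clicks", 2)
del s                    # simulate the node failing and losing its local state
print(recover())         # {'clicks': 2}
```

Reads and writes hit fast local storage, while fault tolerance comes from the replicated log rather than from a remote data store on the critical path.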

SLIDE 93

Samza: High Level Streams API

  • Samza offers multiple APIs
– High Level Streams API, Low Level Task API, Samza SQL
– High Level Streams API: includes common stream processing operations such as filter, partition, join, and windowing
– Example: Wikipedia stream application using Samza that consumes events from Wikipedia and produces stats to a Kafka topic

https://samza.apache.org/learn/tutorials/latest/hello-samza-high-level-code.html
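The windowing operation mentioned above can be illustrated with a toy tumbling-window count (plain Python, not the Samza API; `tumbling_window_counts` is a name invented here):

```python
from collections import defaultdict

# Toy tumbling-window count: group timestamped events into fixed-size,
# non-overlapping windows and count the events per window.

def tumbling_window_counts(events, window_size):
    """events: iterable of (timestamp, payload); returns {window_start: count}."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // window_size) * window_size  # window the event falls in
        counts[window_start] += 1
    return dict(counts)

events = [(0, "a"), (3, "b"), (5, "c"), (9, "d"), (12, "e")]
print(tumbling_window_counts(events, 5))   # {0: 2, 5: 2, 10: 1}
```

In the High Level Streams API the same grouping is expressed declaratively on a message stream instead of being coded by hand.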

SLIDE 94

Towards strict delivery guarantees

  • Most frameworks provide at-least-once delivery guarantees (e.g., Storm, Samza)
– For stateful non-idempotent operators such as counting, at-least-once delivery guarantees can give incorrect results
  • Flink, Storm with Trident, and Google’s MillWheel offer stronger delivery guarantees (i.e., exactly-once)
– Exactly-once low-latency stream processing in MillWheel works as follows:
  • The record is checked against de-duplication data from previous deliveries; duplicates are discarded
  • User code is run for the input record, possibly resulting in pending changes to timers, state, and productions
  • Pending changes are committed to the backing store
  • Senders are acked
  • Pending downstream productions are sent

SLIDE 95

Comparing DSP frameworks

  • Let’s compare open-source DSP frameworks according to some features

| Framework | API | Batch | Windows | Delivery semantics | Fault tolerance | State mgmt. | Flow control | Operator elasticity |
|---|---|---|---|---|---|---|---|---|
| Storm | Low-level, high-level, SQL | No | Yes | At-least-once; exactly-once with Trident | Acking; checkpointing (similar to Flink) | Limited; yes with Trident | Back pressure | No |
| Heron | Low-level, high-level, no SQL | No | Yes | At-least-once; effectively-once | – | Limited | Back pressure | Yes, with Dhalion |
| Flink | High-level, SQL | Also batch | Yes, also user-defined | At-least-once; exactly-once | Checkpointing | Yes | Back pressure | No |
| Samza | Low-level, high-level, SQL | Unified | Yes | At-least-once | Incremental checkpointing | Yes | No | No |

SLIDE 96

DSP in the Cloud

  • Data streaming systems also offered as Cloud services
– Amazon Kinesis Data Streams
– Google Cloud Dataflow
– IBM Streaming Analytics
– Microsoft Azure Stream Analytics
  • Abstract the underlying infrastructure and support dynamic scaling of computing resources
  • Appear to execute in a single data center (i.e., no geo-distribution)

SLIDE 97

Google Cloud Dataflow

  • Fully-managed data processing service, supporting both stream and batch data processing
– Automated resource management
– Dynamic work rebalancing
– Horizontal auto-scaling
  • Provides a unified programming model based on Apache Beam
– Apache Beam SDKs in Java and Python
– Enable developers to implement custom extensions and choose other execution engines
  • Provides exactly-once processing
– MillWheel is Google’s internal version of Cloud Dataflow

SLIDE 98

Google Cloud Dataflow

  • Can be seamlessly integrated with GCP services for streaming event ingestion (Cloud Pub/Sub), data warehousing (BigQuery), and machine learning (Cloud Machine Learning)

SLIDE 99

Amazon Kinesis Data Streams

  • Allows to collect and ingest streaming data at scale for real-time analytics

SLIDE 100

Kinesis Data Analytics

  • Allows to process data streams in real time with SQL or Java
– Java open-source libraries based on Apache Flink
  • Usual operators to filter, aggregate, and transform streaming data
– Per-hour pricing based on the number of Kinesis Processing Units (KPUs) used to run the application
  • Horizontal scaling of KPUs

SLIDE 101

References

  • Akidau, “Streaming 101: The world beyond batch”, 2015. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  • Kulkarni et al., “Twitter Heron: Stream processing at scale”, ACM SIGMOD ’15. http://bit.ly/2rUXkux
  • Carbone et al., “Apache Flink: Stream and batch processing in a single engine”, Bulletin IEEE Comp. Soc. Tech. Comm. on Data Eng., 2015. http://bit.ly/2sYzoGb
  • Noghabi et al., “Samza: Stateful scalable stream processing at LinkedIn”, VLDB Endowment, 2017. https://bit.ly/2LushvF
  • Akidau et al., “MillWheel: Fault-tolerant stream processing at Internet scale”, VLDB ’13. http://bit.ly/2rE7Fa3
