Macroarea di Ingegneria, Dipartimento di Ingegneria Civile e Ingegneria Informatica
DSP Frameworks
Corso di Sistemi e Architetture per Big Data, A.A. 2018/19
Valeria Cardellini, Laurea Magistrale in Ingegneria Informatica
DSP frameworks we consider
- Apache Storm (with lab)
- Twitter Heron
– From Twitter, like Storm, and compatible with Storm
- Apache Spark Streaming (lab)
– Splits each stream into small batches and processes streams of data in micro-batches (micro-batch processing)
- Apache Flink
- Apache Samza
- Cloud-based frameworks
– Google Cloud Dataflow – Amazon Kinesis
1 Valeria Cardellini - SABD 2018/19
Apache Storm
- Apache Storm
– Open-source, real-time, scalable streaming system – Provides an abstraction layer to execute DSP applications – Initially developed by Twitter
- Topology
– DAG of spouts (sources of streams) and bolts (operators and data sinks) – Top-level abstraction submitted to Storm for execution
Storm: stream grouping
- Stream grouping defines how to send tuples
between two topology nodes
– Remember data parallelism: spouts and bolts execute in parallel (multiple threads of execution)
- Shuffle grouping
– Randomly partitions the tuples
- Field grouping
– Hashes on a subset of the tuple attributes
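As a rough sketch of how fields grouping can route tuples (the class and method names here are hypothetical, not Storm API), hashing the selected fields guarantees that tuples with the same field values always reach the same consumer task:

```java
import java.util.Objects;

class FieldGrouping {
    // Pick the consumer task for a tuple by hashing the grouping fields,
    // mirroring the idea behind Storm's fields grouping: equal field values
    // always map to the same task index.
    static int targetTask(Object[] groupingFields, int numTasks) {
        int hash = Objects.hash(groupingFields);
        // floorMod avoids negative indices for negative hash codes
        return Math.floorMod(hash, numTasks);
    }
}
```

Shuffle grouping, by contrast, would pick the task randomly (or round-robin), so no such per-key locality is guaranteed.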
Storm: stream grouping
- All grouping (i.e., broadcast)
– Replicates the entire stream to all the consumer tasks
- Global grouping
– Sends the entire stream to a single task of a bolt
- Direct grouping
– The producer of the tuple decides which task of the consumer will receive this tuple
Storm: topology API
- Storm uses tuples as its data model
– Tuple: named list of values; a field in a tuple can be an object of any type
– Storm supports all the primitive types, strings, and byte arrays as tuple field values – To use an object of another type, you just need to implement a serializer for the type
- A simple topology: Exclamation
– The spout emits words, and each bolt appends the string "!!!" to its input
setSpout and setBolt methods take as input:
- a user-specified id
- an object containing the processing logic
- the amount of parallelism for the operator
Storm: topology API
- Example: WordCount
For complete code: https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java
- Bolts can be defined in any language
– Bolts written in another language are executed as subprocesses, and Storm communicates with those subprocesses with JSON messages over stdin/stdout – Communication protocol implemented in an adapter library already available for Python
Storm: windows
- Windowing support introduced in Storm from v. 1.0
– Before, developers had to build their own windowing logic
- Storm has support for sliding and tumbling windows
based on time duration or event count
- Window types
– Tumbling windows
- Length or duration (aka fixed windows)
- Can be count-based or time-based
– Sliding windows
- Length or duration + sliding interval
- Data can be processed in more than one window
- Can be count-based or time-based
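The count-based variants above can be sketched as plain functions over a list of items (hypothetical helpers, not Storm's windowing API). Note how tumbling windows never overlap, while sliding windows let an item fall into several windows:

```java
import java.util.ArrayList;
import java.util.List;

class CountWindows {
    // Count-based tumbling windows: consecutive, non-overlapping groups of `size` items.
    static <T> List<List<T>> tumbling(List<T> items, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i + size <= items.size(); i += size)
            out.add(new ArrayList<>(items.subList(i, i + size)));
        return out;
    }

    // Count-based sliding windows: length `size`, advancing by `slide` items,
    // so an item can belong to more than one window.
    static <T> List<List<T>> sliding(List<T> items, int size, int slide) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i + size <= items.size(); i += slide)
            out.add(new ArrayList<>(items.subList(i, i + size)));
        return out;
    }
}
```

Time-based windows follow the same pattern, but bucket tuples by timestamp ranges instead of by position.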
Storm: windows
- Tuples are grouped in a single window based on time or count
– Count-based windows
- Specified on the basis of the number of tuples rather than a time interval (no relation to clock time)
– Time-based windows
- Specified on the basis of a time duration (in seconds) rather than a number of tuples processed
Storm: architecture
- Master-worker architecture
Storm components: Nimbus and Zookeeper
- Nimbus
– The master node – Clients submit topologies to it – Responsible for distributing and coordinating the topology execution
- Zookeeper
– Nimbus uses a combination of the local disk(s) and Zookeeper to store state about the topology
[Figure: a worker process is a Java process running executors (threads); each executor runs one or more tasks]
Storm components: worker
- Task: operator instance
– Actual work for bolt or spout is done by task
- Executor: smallest schedulable entity
– Execute one or more tasks related to same operator
- Worker process: Java process running one or
more executors
- Worker node: computing resource, a container for one or more worker processes
Storm components: supervisor
- Each worker node runs a supervisor
The supervisor:
– receives assignments from Nimbus (through ZooKeeper) and spawns workers based on the assignment – sends a periodic heartbeat to Nimbus (through ZooKeeper) – advertises the topologies that it is currently running, and any vacancies available to run more topologies
Storm: running topology
- The application developer can configure the
parallelism of a topology
– Number of worker processes – Number of executors (threads) – Number of tasks
- Parallelism of running
topology can be changed manually using rebalance command
Storm: reliable message processing
- What happens if a bolt fails to process a tuple?
- Storm provides a mechanism by which the originating
spout can replay the failed tuple
– Needs to maintain the link between the spout tuple and its child tuples so as to detect when the tree of tuples is fully processed: anchoring – And needs to ack or fail the spout tuple appropriately
- If an ack is not received within a specified timeout period, the tuple processing is considered failed
- Storm offers at-least-once semantics
– Add Trident for exactly-once semantics
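Storm's acker tracks each tuple tree with an XOR ledger: every tuple id is XORed in when a tuple is anchored and XORed again when it is acked, so the ledger returns to zero exactly when the whole tree has been processed. A minimal sketch of this bookkeeping (hypothetical class, not Storm code):

```java
class AckTracker {
    // Running XOR of anchored and acked tuple ids. With random 64-bit ids,
    // the chance of a spurious zero before completion is negligible, which
    // is why the real acker can track a whole tree in constant space.
    private long ledger = 0;

    void anchor(long tupleId) { ledger ^= tupleId; }  // child tuple emitted
    void ack(long tupleId)    { ledger ^= tupleId; }  // child tuple processed

    boolean treeFullyProcessed() { return ledger == 0; }
}
```

When the ledger of a spout tuple reaches zero, the spout is notified with an ack; if the timeout expires first, the spout is told to fail (and typically replays) the tuple.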
Storm: application monitoring
See https://storm.apache.org/releases/1.2.2/STORM-UI-REST-API.html
- Storm has a built-in monitoring and metrics system
– Built-in and user-defined metrics
- Built-in metrics include:
– Capacity: number of messages executed * average execute latency / time window
– JVM memory usage and garbage collection
– Latency
- For spouts: completeLatency (total latency for processing the message); ignore its value if acking is disabled
- For bolts: executeLatency (avg time the bolt spends in the execute method) and processLatency (avg time from starting execute to ack)
- Metrics can be queried via Storm's UI REST API or reported to a registered consumer (e.g., Graphite)
Twitter Heron
- Real-time, distributed, fault-tolerant stream
processing engine from Twitter
- Developed as direct successor of Storm
– Released as open source in 2016
https://apache.github.io/incubator-heron/
– Stream data processing engine used at Twitter
- Goal of overcoming Storm’s performance,
reliability, and other shortcomings
- Compatibility with Storm
– API compatible with Storm: no code change is required for migration
Heron: in common with Storm
- Same terminology of Storm
– Topology, spout, bolt
- Same stream groupings
– Shuffle, fields, all, global
- Example: WordCount topology
Heron: design goals
- Isolation
– Process-based topologies rather than thread-based – Each process should run in isolation (easy debugging, profiling, and troubleshooting) – Goal: overcoming Storm’s performance, reliability, and other shortcomings
- Resource constraints
– Safe to run in shared infrastructure: topologies use only the initially allocated resources and never exceed bounds
- Compatibility
– Fully API and data model compatible with Storm
Heron: design goals
- Backpressure
– Built-in rate control mechanism to ensure that topologies can self-adjust in case components lag – Heron dynamically adjusts the rate at which data flows through the topology using backpressure
- Performance
– Higher throughput and lower latency than Storm – Enhanced configurability to fine-tune potential latency/throughput trade-offs
- Semantic guarantees
– Support for both at-most-once and at-least-once processing semantics
- Efficiency
– Minimum possible resource usage
Heron: topology
- Similarly to Storm, a Heron topology is a DAG used to process streams of data and consists of spouts and bolts
– Spouts inject data from external sources like pub-sub messaging systems (Apache Kafka, Apache Pulsar, etc.) – Bolts apply user-defined processing logic to data supplied by spouts, and can be stateless or stateful
Heron API
- Heron API based on Storm topology API
- Window operations supported in both
frameworks
- Same window types as in Storm
– Tumbling windows – Sliding windows
- Based on count or time
- Recent shift from procedural to functional style
– A change common also to Apache Storm
- Heron: Heron Streamlet API
- Storm: Stream API
- Still in beta
– Let’s examine Heron Streamlet API
Heron API: shift to functional style
- Processing graphs consist of streamlets
– One or more supplier streamlets inject data into the graph to be processed by downstream operators
- Operations (similar to Spark)
Heron API: shift to functional style
- Operations (continued)
Heron: topology lifecycle
- Topology lifecycle managed through Heron’s
CLI tool
- Stages
– Submit the topology to the cluster – Activate the topology – Restart an active topology, e.g., after updating the topology configuration – Deactivate the topology – Kill a topology to completely remove it from the cluster
Heron topology: logical and physical plans
- Topology’s logical plan:
analogous to a database query plan in that it maps out the basic operations associated with a topology
- Topology’s physical plan:
determines the “physical” execution logic of a topology, i.e. how topology processes are divided between Heron containers
- Logical and physical plans are automatically created by Heron
Heron architecture per topology
- Master-worker architecture
- One Topology Master (TM)
– Manages a topology throughout its entire lifecycle
- Multiple Containers
– Each Container runs multiple Heron Instances, a Stream Manager, and a Metrics Manager – A Heron Instance is a process that handles a single task of a spout or bolt – Containers communicate with the TM to ensure that the topology forms a fully connected graph
Heron architecture per topology
Heron architecture per topology
- Stream Manager (SM): routing engine for data
streams
– Each Heron container connects to its local SM, while all of the SMs in a given topology connect to one another to form a network – Responsible for propagating backpressure
Heron: topology submit sequence
Heron: self-adaptation
- Dhalion: framework on top of Heron to autonomously
reconfigure topologies to meet throughput SLOs, scaling resource consumption up and down as needed
- Phases in Dhalion:
- Symptom detection (backpressure, skew, …)
- Diagnosis generation
- Resolution
- Adaptation actions: parallelism changes
Heron environment
- Heron supports deployment on Apache
Mesos
- Can also run on Mesos using Apache Aurora
as a scheduler or using a local scheduler
Batch processing vs. stream processing
- Batch processing is just a special case of
stream processing
Batch processing vs. stream processing
- Batched/stateless: scheduled in batches
– Short-lived tasks (Hadoop, Spark) – Distributed streaming over batches (Spark Streaming)
- Dataflow/stateful: continuous/scheduled once
(Storm, Flink, Heron)
– Long-lived task execution – State is kept inside tasks
Native vs. non-native streaming
Apache Flink
- Distributed data flow processing system
- One common runtime for DSP applications
and batch processing applications
– Batch processing applications run efficiently as special cases of DSP applications
- Integrated with many other projects in the open-source data processing ecosystem
- Derives from Stratosphere
project by TU Berlin, Humboldt University and Hasso Plattner Institute
- Supports a Storm-compatible API
Flink: software stack
- Flink is a layered system
- On top: libraries with high-level APIs for different
use cases
https://ci.apache.org/projects/flink/flink-docs-release-1.8/
Flink: programming model
- Data streams
– Unbounded, partitioned, immutable sequences of events
- Stream operators
– Stream transformations that take one or more streams as input, and produce one or more output streams as a result
DSP and time
- Different notions of time in a DSP application:
– Processing time: time at which events are observed in the system (local time of the machine executing the operator) – Event time: time at which events actually occurred
- Usually described by a timestamp in the events
– Ingestion time: when an event enters the dataflow at the source operator(s)
See https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Flink: time
- Flink supports all three notions of time
– Internally, ingestion time is treated similarly to event time
- Event time makes it easy to compute over streams
where events arrive out-of-order, and where events may arrive delayed
- How to measure the progress of event time?
– Flink uses watermarks
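One common way to generate watermarks is the bounded-out-of-orderness heuristic: assume events arrive at most some fixed delay late, and emit watermark(t) = maxSeenTimestamp - maxDelay. Flink ships a strategy of this kind; the sketch below uses hypothetical names, not Flink's API:

```java
class BoundedLatenessWatermark {
    // Emits watermark(t) = maxSeenTimestamp - maxDelayMs, declaring that no
    // element with timestamp <= t is expected anymore. Events that arrive
    // behind the watermark are considered late.
    private final long maxDelayMs;
    private long maxTimestamp = Long.MIN_VALUE;

    BoundedLatenessWatermark(long maxDelayMs) { this.maxDelayMs = maxDelayMs; }

    void onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    long currentWatermark() { return maxTimestamp - maxDelayMs; }

    boolean isLate(long eventTimestamp) {
        return eventTimestamp <= currentWatermark();
    }
}
```

The trade-off: a larger delay bound tolerates more disorder but holds windows open longer, increasing latency.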
Flink: backpressure
- Continuous streaming model with backpressure
– Flink's streaming runtime provides flow control: slow data sinks backpressure faster sources – Flink's UI allows monitoring the backpressure behavior of running jobs
- Backpressure warning (e.g., High) for an upstream operator
Flink: other features
- Highly flexible streaming windows
– Also user-defined windows
- Exactly-once semantics for stateful
computations
– Based on two-phase commit
Flink: levels of abstraction
- Different levels of abstraction to develop
streaming/batch applications
- APIs in Java and Scala
Flink: APIs and libraries
- Streaming data applications: DataStream API
– Supports functional transformations on data streams, with user-defined state and flexible windows – Example: how to compute a sliding histogram of word occurrences of a data stream of texts
WindowWordCount in Flink's DataStream API
Sliding time window of 5 sec length and 1 sec trigger interval
Flink: APIs and libraries
- Batch processing applications: DataSet API
– Supports a wide range of data types beyond key/value pairs and a wealth of operators
Core loop of the PageRank algorithm for graphs
Anatomy of a Flink program
- Let’s analyze the DataStream API
https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html
- Each Flink program consists of the same basic
parts:
1. Obtain an execution environment 2. Load/create the initial data
Anatomy of a Flink program
3. Specify transformations on data 4. Specify where to put the results of your computations 5. Trigger the program execution
Flink: lazy evaluation
- All Flink programs are executed lazily
– When the program's main method is executed, the data loading and transformations do not happen directly – Rather, each operation is created and added to the program's plan – Operations are actually executed when the execution is explicitly triggered by the execute() call on the execution environment
Flink: data sources
- Several predefined stream sources accessible from
the StreamExecutionEnvironment
- 1. File-based:
– E.g., readTextFile(path) to read text files – Flink splits file reading process into two sub-tasks: directory monitoring and data reading
- Monitoring is implemented by a single, non-parallel task, while
reading is performed by multiple tasks running in parallel, whose parallelism is equal to the job parallelism
- 2. Socket-based
- 3. Collection-based
- 4. Custom
– E.g., to read from Kafka addSource(new FlinkKafkaConsumer08<>(...)) – See Apache Bahir for streaming connectors and SQL data sources https://bahir.apache.org/
Flink: DataStream transformations
- Map
DataStream → DataStream
– Example: double the values of the input stream
- FlatMap
DataStream → DataStream
– Example: split sentences to words
Flink: DataStream transformations
- Filter
DataStream → DataStream
– Example: filter out zero values
- KeyBy
DataStream → KeyedStream
– To specify a key that logically partitions a stream into disjoint partitions – Internally, implemented with hash partitioning – Different ways to specify keys; the simplest case is grouping tuples on one or more fields of the tuple:
– Examples:
Flink: DataStream transformations
- Reduce
KeyedStream → DataStream
– “Rolling” reduce on a keyed data stream – Combines the current element with the last reduced value and emits the new value – Example: create a stream of partial sums
Flink: DataStream transformations
- Fold
KeyedStream → DataStream
– “Rolling” fold on a keyed data stream with an initial value – Combines the current element with the last folded value and emits the new value – Example: to emit the sequence "start-1", "start-1-2", "start-1-2-3", ... when applied on the sequence (1,2,3,4,5)
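The rolling behavior is the key difference from a batch fold: every intermediate accumulator value is emitted, not just the final one. A self-contained sketch of rolling fold over a finite list (hypothetical helper, not Flink's API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

class RollingFold {
    // Rolling fold: combines each element with the last folded value and
    // emits every intermediate result, as Flink's fold does on a keyed stream.
    static <T, A> List<A> fold(List<T> stream, A initial, BiFunction<A, T, A> f) {
        List<A> out = new ArrayList<>();
        A acc = initial;
        for (T t : stream) {
            acc = f.apply(acc, t);
            out.add(acc);  // emit the new folded value downstream
        }
        return out;
    }
}
```

Applied to (1, 2, 3) with initial value "start" and string concatenation, it reproduces the "start-1", "start-1-2", … sequence described above; a rolling reduce is the special case where the accumulator has the element type and no initial value.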
Flink: DataStream transformations
- Aggregations
KeyedStream → DataStream
– To aggregate on a keyed data stream – min returns the minimum value, whereas minBy returns the element that has the minimum value in this field
- Window
KeyedStream → WindowedStream
Flink: DataStream transformations
- Other transformations available in Flink
– Join: joins two data streams on a given key – Union: union of two or more data streams creating a new stream containing all the elements from all the streams – Split: splits the stream into two or more streams according to some criterion – Iterate: creates a “feedback” loop in the flow, by redirecting the output of one operator to some previous operator
- Useful for algorithms that continuously update a model
See https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/operators/
Example: streaming window WordCount
- Count the words from a web socket in 5 sec windows
// Key by the first element of a Tuple
Example: streaming window WordCount
Flink: windows support
- Windows can be applied either to keyed streams or
to non-keyed ones
- General structure of a windowed Flink program
Flink: window lifecycle
- First, specify if stream is keyed or not and define the
window assigner
– A keyed stream allows the windowed computation to be performed in parallel by multiple tasks – The window is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness
- Then associate to the window the trigger and function
– Trigger determines when a window is ready to be processed by the window function – Function specifies the computation to be applied to the window contents
Flink: window assigners
- How elements are assigned to windows
- Support for different window assigners
– Each WindowAssigner comes with a default Trigger
- Built-in assigners for most common use cases:
– Tumbling windows – Sliding windows – Session windows – Global windows
- Except for global windows, they assign elements to windows based on time, which can be either processing time or event time
- Also possible to implement a custom window
assigner
Flink: window assigners
- Session windows
– To group elements by sessions of activity
– Differently from tumbling and sliding windows, they do not overlap and do not have a fixed start and end time – A session window closes when a gap of inactivity occurs
- Global windows
– To assign all elements with the same key to the same single global window – Only useful if you also specify a custom trigger
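The session-window rule above can be sketched over a timestamp-ordered stream of one key: start a new window whenever the gap since the previous event exceeds the inactivity timeout (hypothetical helper; Flink additionally merges overlapping sessions when events arrive out of order):

```java
import java.util.ArrayList;
import java.util.List;

class SessionWindows {
    // Groups timestamp-ordered events into sessions: a new session window
    // starts whenever the gap to the previous event exceeds `gap`, i.e.,
    // the inactivity timeout that closes the current session.
    static List<List<Long>> assign(List<Long> timestamps, long gap) {
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        Long last = null;
        for (long ts : timestamps) {
            if (last != null && ts - last > gap) {
                sessions.add(current);          // close the session
                current = new ArrayList<>();    // open a new one
            }
            current.add(ts);
            last = ts;
        }
        if (!current.isEmpty()) sessions.add(current);
        return sessions;
    }
}
```

Note that, unlike tumbling windows, session boundaries are data-driven: they depend on the gaps in the stream, not on the clock.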
Flink: window functions
- Different window functions to specify the computation on each window
- ReduceFunction
– To incrementally aggregate the elements of a window – Example: sum up the second fields of the tuples for all elements in a window
Flink: window functions
- AggregateFunction: generalized version of a ReduceFunction
– Example: compute the average of the second field of the elements in the window
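The average example shows why AggregateFunction is more general than ReduceFunction: the accumulator type (sum, count) differs from both the input and the output type. A sketch in the style of Flink's add/getResult/merge contract (hypothetical class, not the actual Flink interface):

```java
class AverageAggregate {
    // Accumulator with a different type than input (double) and output (average).
    static final class Acc { double sum; long count; }

    // Called once per element as it arrives: incremental, no buffering needed.
    static Acc add(Acc acc, double value) {
        acc.sum += value;
        acc.count++;
        return acc;
    }

    // Called when the window fires, to derive the final result.
    static double getResult(Acc acc) {
        return acc.count == 0 ? 0.0 : acc.sum / acc.count;
    }

    // Needed when windows merge (e.g., session windows).
    static Acc merge(Acc a, Acc b) {
        a.sum += b.sum;
        a.count += b.count;
        return a;
    }
}
```

Because only the small (sum, count) pair is kept per window, this is far cheaper than a ProcessWindowFunction, which would buffer every element.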
Flink: window functions
- FoldFunction: to specify how an input element of the window is
combined with an element of the output type
- ProcessWindowFunction: gets an Iterable containing all the
elements of the window, and a Context object with access to time and state information
– More flexibility than other window functions, at the cost of performance and resource consumption: elements are buffered until the window is ready for processing
- ReduceFunction and AggregateFunction can be
executed more efficiently
– Flink can incrementally aggregate the elements for each window as they arrive
Flink: control events
- Control events: special events injected in the
data stream by operators
- Two types of control events in Flink
– Watermarks – Checkpoint barriers
Flink: watermarks
- Watermarks signal the progress of event time within a
data stream
– Watermark(t) declares that event time has reached time t in that stream, meaning that there should be no more elements with timestamp t' <= t – Crucial for out-of-order streams, where events are not ordered by their timestamps
- Flink does not provide ordering guarantees after any form of stream partitioning or broadcasting
– In such cases, dealing with out-of-order tuples is left to the operator implementation
Flink: checkpoint barriers
- To provide fault tolerance (see next slides), special
barrier markers (called checkpoint barriers) are periodically injected at streams sources and then pushed downstream up to sinks
Fault tolerance
- To provide consistent results, DSP systems need to
be resilient to failures
- How? By periodically capturing a snapshot of the
execution graph which can be used later to restart in case of failures (checkpointing)
Snapshot: global state of the execution graph, capturing all necessary information to restart computation from that specific execution state
- A common approach is to rely on periodic global state snapshots, but it has drawbacks:
– Stalls the overall computation – Eagerly persists all tuples in transit along with operator states, which results in larger snapshots than required
Flink: fault tolerance
- Flink offers a lightweight snapshotting mechanism
– Allows to maintain high throughput and provide strong consistency guarantees at the same time
- Such mechanism:
– Draws consistent snapshots of stream flows and operators' state – Even in presence of failures, the application state will reflect every record from the data stream exactly once – State is stored at a configurable place – Disabled by default
- Inspired by Chandy-Lamport algorithm for distributed
snapshot and tailored to Flink’s execution model
Chandy-Lamport algorithm
- The observer process (the process initiating the snapshot):
– Saves its own local state – Sends a snapshot request message bearing a snapshot token to all other processes
- If a process receives the token for the first time:
– Sends the observer process its own saved state – Attaches the snapshot token to all subsequent messages (to help propagate the snapshot token)
- When a process that has already received the token receives a
message not bearing the token, it will forward that message to the observer process
– This message was sent before the snapshot “cut off” (as it does not bear a snapshot token) and needs to be included in the snapshot
- The observer builds up a complete snapshot: a saved state for
each process and all messages “in the ether” are saved
Flink: fault tolerance
- Uses checkpoint barriers
– When an operator has received a barrier for snapshot n from all of its input streams, it emits a barrier for snapshot n into all of its outgoing streams. Once a sink operator has received barrier n from all of its input streams, it acknowledges that snapshot n to the checkpoint coordinator. After all sinks have acknowledged a snapshot, it is considered completed
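The per-operator bookkeeping for barrier alignment can be sketched as follows (hypothetical helper, not Flink's internal API): the operator records which input channels have delivered barrier n, and only when all of them have may it snapshot its state and forward the barrier:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class BarrierAligner {
    // Tracks, per snapshot id, the set of input channels whose barrier
    // has arrived. An operator forwards barrier n downstream only once
    // it has been received on every input channel.
    private final int numInputs;
    private final Map<Long, Set<Integer>> seen = new HashMap<>();

    BarrierAligner(int numInputs) { this.numInputs = numInputs; }

    // Returns true when barrier `n` has now arrived from all inputs,
    // i.e., the operator may snapshot its state and emit barrier `n`.
    boolean onBarrier(long n, int inputChannel) {
        Set<Integer> channels = seen.computeIfAbsent(n, k -> new HashSet<>());
        channels.add(inputChannel);
        return channels.size() == numInputs;
    }
}
```

While waiting for the last barrier, records from already-aligned channels are buffered rather than processed, which is what keeps the snapshot consistent.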
https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html
Flink: performance and memory management
- High throughput and low latency
- Memory management
– Flink implements its own memory management inside the JVM
Flink: architecture
- The usual master-worker architecture
Flink: architecture
- Master (JobManager): schedules tasks, coordinates
checkpoints, coordinates recovery on failures, etc.
- Workers (TaskManagers): JVM processes that
execute tasks of a dataflow, and buffer and exchange the data streams
– Workers use task slots to control the number of tasks they accept (at least one) – Each task slot represents a fixed subset of resources of the worker
Flink: application execution
- The JobManager receives
the JobGraph
– Representation of the data flow consisting of operators (JobVertex) and intermediate results (IntermediateDataSet) – Each operator has properties, like parallelism and code that it executes
- The JobManager
transforms the JobGraph into an ExecutionGraph
– Parallel version of JobGraph
Flink: application execution
- Data parallelism
– Different operators of the same program may have different levels of parallelism – The parallelism of an individual operator, data source, or data sink can be defined by calling its setParallelism() method
Flink: application execution
- Execution plan can be visualized
Flink: application monitoring
- Flink has a built-in monitoring and metrics system
- Built-in metrics include
– Throughput: in terms of number of records per second (per operator/task)
– Latency
- Support for latency tracking: special markers are periodically inserted at all sources in order to obtain a distribution of latency between sources and each downstream operator – But these do not account for time spent in operator processing – They assume that all machine clocks are synchronized
– JVM heap/non-heap/direct memory usage
- Application-specific metrics can be added
– E.g., counters for the number of invalid records
- All metrics can be either queried via Flink's REST API or sent to external systems (e.g., Graphite and InfluxDB)
See https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html
Flink: deployment
- Designed to run on large-scale clusters with many
thousands of nodes
- Can be run in a fully distributed fashion on a static
(but possibly heterogeneous) standalone cluster
- For a dynamically shared cluster, can be deployed on
YARN or Mesos
- Docker images for Apache Flink available on Docker
Hub
A recent need
- A common need for many companies
– Run both batch and stream processing
- Alternative solutions
- 1. Lambda architecture
- 2. Unified frameworks
- 3. Unified programming model
Lambda architecture
- Data-processing design pattern to integrate batch and
real-time processing
- Streaming framework used to process real-time events,
and, in parallel, batch framework to process the entire dataset
- Results from the two parallel pipelines are then merged
Source: https://voltdb.com/products/alternatives/lambda-architecture
Lambda architecture: example
- Lambda architecture used at LinkedIn before Samza
development
Lambda architecture: pros and cons
- Pros:
– Flexibility in the frameworks’ choice
- Cons:
– Implementing and maintaining two separate frameworks for batch and stream processing can be hard and error-prone – Overhead of developing and managing multiple code bases
- The logic in each fork evolves over time, and keeping
them in sync involves duplicated and complex manual effort, often with different languages
Unified frameworks
- Use a unified (Lambda-less) design for
processing both real-time as well as batch data using the same data structure
- Spark, Flink, Samza and Apex follow this
trend
Unified programming model: Apache Beam
- A new layer of abstraction
- Provides advanced unified programming model
– Allows to define batch and streaming data processing pipelines that run on any execution engine (for now: Apex, Flink, Spark, Google Cloud Dataflow) – Java, Python and Go as programming languages
- Translates the data processing pipeline defined
by the user with the Beam program into the API compatible with the chosen distributed processing engine
- Developed by Google and released as an open-source top-level Apache project
Apache Samza
- A distributed framework for stateful and fault-
tolerant stream processing
– Unified framework for batch and stream processing
- Similarly to Flink, streams as first-class citizen, batch as
special case of streaming
– Used in production at LinkedIn
Apache Samza
- Why stateful and fault-tolerant processing? User
profiles, email digests, aggregate counts, …
- Example: Email Digestion System at LinkedIn
– Production application running to digest updates into one email
Samza: features
- Unified processing API for stream and batch
– Supports both stateless and stateful stream processing – Supports both processing time and event time
- Configurable and heterogeneous data sources and
sinks (e.g., Kafka, HDFS, AWS Kinesis)
- At-least-once processing
- Efficient state management
– Local state (in-memory or on disk) partitioned among tasks (rather than remote data store) – Incremental checkpointing: only the delta rather than the entire state
- Flexible deployment
– As a lightweight embedded library that can be integrated with a larger application – Alternatively, as a managed framework using YARN
Samza: architecture
- Task: logical unit of parallelism
- Container: physical unit of parallelism
- Usual architecture
– The coordinator manages the assignment of tasks across containers, monitors the liveness of containers and redistributes the tasks during a failure – One coordinator per application
– Host-affinity: during a new deployment, Samza tries to preserve the assignment of tasks to hosts, so that each task can reuse the snapshot of its local state
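A minimal sketch of the host-affinity policy, under stated assumptions (this is not Samza's coordinator code; the two-pass strategy and names are illustrative): on redeployment, keep each task on its previous host if that host is still available, and spread only the remaining tasks.

```python
# Illustrative host-affinity sketch (not Samza's actual coordinator logic):
# tasks keep their previous host when possible, so local state snapshots
# on that host can be reused instead of rebuilt from the changelog.

def assign(tasks, hosts, previous=None):
    previous = previous or {}
    assignment = {}
    # First pass: honor previous placements on still-available hosts
    for t in tasks:
        if previous.get(t) in hosts:
            assignment[t] = previous[t]
    # Second pass: place the remaining tasks round-robin
    rest = [t for t in tasks if t not in assignment]
    for i, t in enumerate(rest):
        assignment[t] = hosts[i % len(hosts)]
    return assignment

first = assign(["t0", "t1", "t2"], ["hostA", "hostB"])
# hostB is replaced by hostC; tasks previously on hostA stay put
second = assign(["t0", "t1", "t2"], ["hostA", "hostC"], previous=first)
```

Only the task that lost its host (here the one on `hostB`) must restore state from the replicated changelog; the others reuse their local snapshots.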
DSP state management
- How to manage state information, i.e., the “intermediate
information” that needs to be maintained between tuples to process data streams correctly?
- A common approach (e.g., in Storm) to deal with large
amounts of state: use a remote data store (e.g., Redis)
Samza: state management
- Samza approach: keep state local to each node and
make it robust to failures by replicating state changes across multiple machines
(Figure: external remote store vs. Samza's local state)
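The local-state approach described above can be sketched as follows (illustrative, not the real Samza API): every state change is applied locally and also appended to a changelog, which stands in for a replicated Kafka topic; after a failure, replaying the log rebuilds the state.

```python
# Hedged sketch of changelog-backed local state (not actual Samza code):
# state lives locally for fast access, and each change is appended to a
# replicated changelog; replaying the log recovers the state after a crash.

class LocalStateTask:
    def __init__(self, changelog):
        self.state = {}
        self.changelog = changelog    # stands in for a replicated Kafka topic

    def increment(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.changelog.append((key, self.state[key]))  # log the new value

    @classmethod
    def recover(cls, changelog):
        task = cls(changelog)
        for key, value in changelog:  # replay to rebuild local state
            task.state[key] = value
        return task

log = []
task = LocalStateTask(log)
for user in ["alice", "bob", "alice"]:
    task.increment(user)

restored = LocalStateTask.recover(log)  # after a crash: identical state
```

Reads and writes never leave the node, avoiding the per-tuple round trips of a remote store, at the cost of replaying (or compacting) the changelog on recovery.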
Samza: High Level Streams API
- Samza offers multiple APIs
– High Level Streams API, Low Level Task API, Samza SQL
– High Level Streams API: includes common stream processing operations such as filter, partition, join, and windowing
– Example: a Wikipedia stream application using Samza that consumes events from Wikipedia and produces stats to a Kafka topic
https://samza.apache.org/learn/tutorials/latest/hello-samza-high-level-code.html
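The style of those high-level operations can be shown with a Python stand-in (the real High Level Streams API is Java; the event data and function names here are made up): a filter stage followed by a tumbling-window count over event timestamps, as a Wikipedia-stats job might do.

```python
# Python stand-in for the style of a high-level streams API (illustrative;
# Samza's actual API is Java): filter a stream, then count keys per
# fixed-size tumbling window of event time.

from collections import Counter

def tumbling_window_counts(events, window_size):
    """events: (timestamp, key) pairs; counts keys per tumbling window."""
    counts = Counter()
    for ts, key in events:
        window_start = ts - (ts % window_size)   # window the event falls in
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical Wikipedia-style edit events: (event time, language)
edits = [(1, "en"), (3, "it"), (7, "en"), (12, "en")]
not_bots = [(ts, lang) for ts, lang in edits if lang != "bot"]   # filter stage
stats = tumbling_window_counts(not_bots, window_size=10)          # window+count
```

In the real API these stages would be chained on a `MessageStream` and the per-window counts written out to a Kafka topic.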
Towards strict delivery guarantees
- Most frameworks provide at-least-once delivery
guarantees (e.g., Storm, Samza)
– For stateful non-idempotent operators such as counting, at-least-once delivery guarantees can give incorrect results
- Flink, Storm with Trident, and Google’s MillWheel offer
stronger delivery guarantees (i.e., exactly-once)
– Exactly-once low latency stream processing in MillWheel works as follows:
1. The record is checked against de-duplication data from previous deliveries; duplicates are discarded
2. User code is run for the input record, possibly resulting in pending changes to timers, state, and productions
3. Pending changes are committed to the backing store
4. Senders are acked
5. Pending downstream productions are sent
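The de-duplication step is what turns at-least-once delivery into exactly-once results for a non-idempotent operator like a counter. A hedged sketch of that recipe (illustrative, not MillWheel code):

```python
# Illustrative sketch (not actual MillWheel code) of exactly-once on top of
# at-least-once delivery: each record carries an ID; redeliveries are
# discarded by checking de-duplication data before running user code.

class ExactlyOnceCounter:
    def __init__(self):
        self.seen = set()             # de-duplication data (checkpointed)
        self.count = 0                # non-idempotent state

    def deliver(self, record_id):
        if record_id in self.seen:    # redelivery after a lost ack
            return "duplicate-discarded"
        self.count += 1               # run user code
        self.seen.add(record_id)      # commit state + dedup data together
        return "acked"                # only then ack the sender

op = ExactlyOnceCounter()
op.deliver("r1")
op.deliver("r2")
op.deliver("r1")                      # at-least-once retry: not counted again
```

The crucial detail is that the state change and the de-duplication entry are committed atomically (step 3 above) before the ack: a crash between them can only cause a redelivery, never a double count.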
Comparing DSP frameworks
- Let’s compare open source DSP frameworks
according to some features
| Framework | API | Windows | Delivery semantics | Fault tolerance | State mgmt. | Flow control | Operator elasticity |
|---|---|---|---|---|---|---|---|
| Storm | Low-level, high-level, SQL; no batch | Yes | At least once; exactly once with Trident | Acking, checkpointing (similar to Flink) | Limited; yes with Trident | Back pressure | No |
| Heron | Low-level, high-level, no SQL; no batch | Yes | At least once; effectively once | — | Limited | Back pressure | Yes, with Dhalion |
| Flink | High-level, SQL; also batch | Yes, also user-defined | At least once; exactly once | Checkpointing | Yes | Back pressure | No |
| Samza | Low-level, high-level, SQL; unified | Yes | At least once | Incremental checkpointing | Yes | No | No |
DSP in the Cloud
- Data streaming systems also as Cloud services
– Amazon Kinesis Data Streams
– Google Cloud Dataflow
– IBM Streaming Analytics
– Microsoft Azure Stream Analytics
- Abstract the underlying infrastructure and support
dynamic scaling of computing resources
- Appear to execute in a single data center (i.e., no
geo-distribution)
Google Cloud Dataflow
- Fully-managed data processing service, supporting
both stream and batch data processing
– Automated resource management
– Dynamic work rebalancing
– Horizontal auto-scaling
- Provides a unified programming model based on
Apache Beam
– Apache Beam SDKs in Java and Python
– Enables developers to implement custom extensions and to choose other execution engines
- Provides exactly-once processing
– MillWheel is Google’s internal predecessor of Cloud Dataflow
Google Cloud Dataflow
- Can be seamlessly integrated with GCP services for
streaming event ingestion (Cloud Pub/Sub), data warehousing (BigQuery), and machine learning (Cloud Machine Learning)
Amazon Kinesis Data Streams
- Allows collecting and ingesting streaming data at scale for
real-time analytics
Kinesis Data Analytics
- Allows processing data streams in real time with SQL or Java
– Java open-source libraries based on Apache Flink
– Usual operators to filter, aggregate, and transform streaming data
– Per-hour pricing based on the number of Kinesis Processing Units (KPUs) used to run the application
- Horizontal scaling of KPUs
References
- Akidau, “Streaming 101: The world beyond batch”, 2015.
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
- Kulkarni et al., “Twitter Heron: stream processing at scale”, ACM
SIGMOD'15. http://bit.ly/2rUXkux
- Carbone et al., “Apache Flink: Stream and batch processing in a
single engine”, Bulletin IEEE Comp. Soc. Tech. Comm. on Data Eng., 2015. http://bit.ly/2sYzoGb
- Noghabi et al., “Samza: Stateful scalable stream processing at
LinkedIn”, VLDB Endowment, 2017. https://bit.ly/2LushvF
- Akidau et al., “MillWheel: Fault-tolerant stream processing at
Internet scale”, VLDB'13. http://bit.ly/2rE7Fa3