
Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

DSP Frameworks

Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini

DSP frameworks we consider

  • Apache Storm (with lab)
  • Twitter Heron

– From Twitter, like Storm, and compatible with Storm

  • Apache Spark Streaming (lab)

– Reduces the size of each stream and processes streams of data (micro-batch processing)
  • Apache Flink
  • Apache Samza
  • Cloud-based frameworks

– Google Cloud Dataflow
– Amazon Kinesis Streams

1 Valeria Cardellini - SABD 2017/18


Apache Storm

  • Apache Storm

– Open-source, real-time, scalable streaming system
– Provides an abstraction layer to execute DSP applications
– Initially developed by Twitter

  • Topology

– DAG of spouts (sources of streams) and bolts (operators and data sinks)


Stream grouping in Storm

  • Data parallelism in Storm: how are streams partitioned among multiple tasks (threads of execution)?
  • Shuffle grouping

– Randomly partitions the tuples

  • Fields grouping

– Hashes on a subset of the tuple attributes


Stream grouping in Storm

  • All grouping (i.e., broadcast)

– Replicates the entire stream to all the consumer tasks

  • Global grouping

– Sends the entire stream to a single task of a bolt

  • Direct grouping

– The producer of the tuple decides which task of the consumer will receive this tuple
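As a rough illustration of the groupings listed above, the following Python sketch maps a tuple to the consumer task(s) that would receive it. It is not Storm's API: the function names, the CRC32 hash, and the choice of task 0 for global grouping are assumptions made for this example.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle grouping: pick one consumer task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, fields):
    """Fields grouping: hash a subset of the tuple attributes, so that
    tuples with equal values for those fields reach the same task."""
    key = "|".join(str(tup[f]) for f in fields).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """All grouping (broadcast): every consumer task gets a copy."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Global grouping: the whole stream goes to a single task
    (task 0 is an arbitrary choice in this sketch)."""
    return [0]


def direct_grouping(tup, num_tasks, chosen_task):
    """Direct grouping: the producer names the receiving task."""
    if not 0 <= chosen_task < num_tasks:
        raise ValueError("chosen_task out of range")
    return [chosen_task]


if __name__ == "__main__":
    word = {"word": "storm", "count": 1}
    print("fields ->", fields_grouping(word, 4, ["word"]))
    print("all    ->", all_grouping(word, 4))
```

Only fields grouping depends on the tuple content, which is why it is the grouping to use when a bolt keeps per-key state such as a word count.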


Storm architecture

  • Master-worker architecture

Storm components: Nimbus and ZooKeeper

  • Nimbus

– The master node
– Clients submit topologies to it
– Responsible for distributing and coordinating the topology execution

  • ZooKeeper

– Nimbus uses a combination of the local disk(s) and ZooKeeper to store state about the topology

Storm components: worker

[Figure: a worker process (a Java process) hosts executors (threads), each of which runs one or more tasks]

  • Task: operator instance

– The actual work for a bolt or a spout is done in the task

  • Executor: smallest schedulable entity

– Executes one or more tasks of the same operator

  • Worker process: Java process running one or more executors
  • Worker node: computing resource, a container for one or more worker processes

Storm components: supervisor

  • Each worker node runs a supervisor

The supervisor:

  • receives assignments from Nimbus (through ZooKeeper) and spawns workers based on the assignment
  • sends a periodic heartbeat to Nimbus (through ZooKeeper)
  • advertises the topologies that it is currently running, and any vacancies that are available to run more topologies

Twitter Heron

  • Real-time, distributed, fault-tolerant stream processing engine from Twitter
  • Developed as direct successor of Storm

– Released as open source in 2016 https://twitter.github.io/heron/
– De facto stream data processing engine inside Twitter

  • Goal of overcoming Storm’s performance, reliability, and other shortcomings
  • Compatibility with Storm

– API compatible with Storm: no code change is required for migration

Heron: in common with Storm

  • Same terminology as Storm

– Topology, spout, bolt

  • Same stream groupings

– Shuffle, fields, all, global

  • Example: WordCount topology


Heron: design goals

  • Isolation

– Process-based topologies rather than thread-based
– Each process should run in isolation (easy debugging, profiling, and troubleshooting)
– Goal: overcoming Storm’s performance, reliability, and other shortcomings

  • Resource constraints

– Safe to run in shared infrastructure: topologies use only initially allocated resources and never exceed bounds

  • Compatibility

– Fully API and data model compatible with Storm


Heron: design goals

  • Backpressure

– Built-in rate control mechanism to ensure that topologies can self-adjust in case components lag
– Heron dynamically adjusts the rate at which data flows through the topology using backpressure

  • Performance

– Higher throughput and lower latency than Storm
– Enhanced configurability to fine-tune potential latency/throughput trade-offs

  • Semantic guarantees

– Support for both at-most-once and at-least-once processing semantics

  • Efficiency

– Minimum possible resource usage
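The backpressure goal can be pictured with a bounded buffer between a fast source and a slower sink. The sketch below is a discrete-time simulation, not Heron's mechanism: the function name `simulate` and all parameters are hypothetical, and the only point is that a full buffer forces the source down to the sink's rate instead of letting data pile up.

```python
from collections import deque


def simulate(ticks, source_rate, sink_rate, capacity):
    """Simulate a source feeding a sink through a bounded buffer.

    Returns (emitted, processed, throttled_ticks), where throttled_ticks
    counts the ticks in which the source had to emit less than it wanted.
    """
    buffer = deque()
    emitted = processed = throttled = 0
    for _ in range(ticks):
        # The source may only fill the free room: this is the backpressure.
        room = capacity - len(buffer)
        to_emit = min(source_rate, room)
        if to_emit < source_rate:
            throttled += 1
        for _ in range(to_emit):
            buffer.append(emitted)
            emitted += 1
        # The sink drains at its own (slower) rate.
        for _ in range(min(sink_rate, len(buffer))):
            buffer.popleft()
            processed += 1
    return emitted, processed, throttled


if __name__ == "__main__":
    print(simulate(ticks=10, source_rate=5, sink_rate=2, capacity=10))
```

Once the buffer fills, the source settles at the sink's rate, and the number of in-flight items never exceeds `capacity`.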


Heron topology architecture

  • Master-worker architecture
  • One Topology Master (TM)

– Manages a topology throughout its entire lifecycle

  • Multiple Containers

– Each Container runs multiple Heron Instances, a Stream Manager, and a Metrics Manager
– A Heron Instance is a process that handles a single task of a spout or bolt
– Containers communicate with TM to ensure that the topology forms a fully connected graph


Heron topology architecture

  • Stream Manager (SM): routing engine for data streams

– Each Heron Instance connects to its local SM, while all of the SMs in a given topology connect to one another to form a network
– Responsible for propagating backpressure

Topology submit sequence


Self-adaptation in Heron

  • Dhalion: framework on top of Heron to autonomously reconfigure topologies to meet throughput SLOs, scaling resource consumption up and down as needed
  • Phases in Dhalion:

– Symptom detection (backpressure, skew, …)
– Diagnosis generation
– Resolution

  • Adaptation actions: parallelism changes
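The three phases can be read as a small control loop. The sketch below is only a toy model of that loop, not Dhalion's API: the metric name `backpressure_ms`, the single diagnosis rule, and the "add one instance" action are all assumptions chosen to keep the example short.

```python
def detect_symptoms(metrics):
    """Phase 1: turn raw metrics into symptoms (here: only backpressure)."""
    return [
        {"type": "backpressure", "component": name}
        for name, m in metrics.items()
        if m["backpressure_ms"] > 0
    ]


def diagnose(symptoms):
    """Phase 2: explain the symptoms. Assumed rule: a component that
    initiates backpressure is under-provisioned."""
    return [
        {"cause": "under-provisioned", "component": s["component"]}
        for s in symptoms
        if s["type"] == "backpressure"
    ]


def resolve(diagnoses, parallelism):
    """Phase 3: adaptation action, here a parallelism change (+1 instance)."""
    updated = dict(parallelism)
    for d in diagnoses:
        updated[d["component"]] += 1
    return updated


def run_policy(metrics, parallelism):
    """One pass of detection, diagnosis, and resolution."""
    return resolve(diagnose(detect_symptoms(metrics)), parallelism)


if __name__ == "__main__":
    metrics = {"splitter": {"backpressure_ms": 0},
               "counter": {"backpressure_ms": 1200}}
    print(run_policy(metrics, {"splitter": 2, "counter": 2}))
```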


Heron environment

  • Heron supports deployment on Apache Mesos
  • Can also run on Mesos using Apache Aurora as a scheduler or using a local scheduler

Batch processing vs. stream processing

  • Batch processing is just a special case of stream processing

Batch processing vs. stream processing

  • Batched/stateless: scheduled in batches

– Short-lived tasks (Hadoop, Spark)
– Distributed streaming over batches (Spark Streaming)

  • Dataflow/stateful: continuous/scheduled once (Storm, Flink, Heron)

– Long-lived task execution
– State is kept inside tasks

Native vs. non-native streaming


Apache Flink

  • Distributed data flow processing system
  • One common runtime for DSP applications and batch processing applications

– Batch processing applications run efficiently as special cases of DSP applications

  • Integrated with many other projects in the open-source data processing ecosystem
  • Derives from the Stratosphere project by TU Berlin, Humboldt University and Hasso Plattner Institute
  • Supports a Storm-compatible API

Flink: software stack

  • On top: libraries with high-level APIs for different use cases, still in beta

Flink: programming model

  • Data stream

– An unbounded, partitioned, immutable sequence of events

  • Stream operators

– Stream transformations that generate new output data streams from input ones

Flink: some features

  • Supports stream processing and windowing with Event Time semantics

– Event time makes it easy to compute over streams where events arrive out of order, and where events may arrive delayed

  • Exactly-once semantics for stateful computations
  • Highly flexible streaming windows

Flink: some features

  • Continuous streaming model with backpressure
  • Flink's streaming runtime has natural flow control: slow data sinks backpressure faster sources

Flink: APIs and libraries

  • Streaming data applications: DataStream API

– Supports functional transformations on data streams, with user-defined state, and flexible windows
– Example: how to compute a sliding histogram of word occurrences of a data stream of texts

WindowWordCount in Flink's DataStream API

Sliding time window of 5 sec length and 1 sec trigger interval
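The slide's Flink listing did not survive extraction. As a stand-in, the plain-Python sketch below shows what a sliding window of 5 seconds evaluated every 1 second computes for a word count. It does not use the DataStream API; `sliding_word_counts` and its integer-second timestamps are assumptions made for this illustration.

```python
from collections import Counter

WINDOW = 5  # window length in seconds
SLIDE = 1   # trigger interval in seconds


def sliding_word_counts(events, window=WINDOW, slide=SLIDE):
    """Count words per sliding window.

    events: iterable of (timestamp_seconds, text), timestamps as integers.
    Returns {window_start: Counter} for each window [start, start + window)
    that contains at least one event.
    """
    results = {}
    for ts, text in events:
        words = text.lower().split()
        # Latest window start that still covers ts, aligned to the slide.
        start = (ts // slide) * slide
        # Every window whose start lies in (ts - window, ts] contains ts.
        while start > ts - window:
            results.setdefault(start, Counter()).update(words)
            start -= slide
    return results


if __name__ == "__main__":
    stream = [(0, "to be or not"), (2, "to be"), (6, "be")]
    for start, counts in sorted(sliding_word_counts(stream).items()):
        print(f"[{start}, {start + WINDOW}) -> {dict(counts)}")
```

Each event lands in window/slide = 5 overlapping windows, which is the cost of sliding windows compared with tumbling ones.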


Flink: APIs and libraries

  • Batch processing applications: DataSet API
  • Supports a wide range of data types beyond key/value pairs, and a wealth of operators

Core loop of the PageRank algorithm for graphs

Flink: program optimization

  • Batch programs are automatically optimized to exploit situations where expensive operations (like shuffles and sorts) can be avoided, and when intermediate data should be cached

Flink: control events

  • Control events: special events injected in the data stream by operators
  • Periodically, the data source injects checkpoint barriers into the data stream, dividing the stream into a pre-checkpoint and a post-checkpoint part

– More coarse-grained approach than Storm: acks sequences of records instead of individual records

  • Watermarks signal the progress of event time within a stream partition
  • Flink does not provide ordering guarantees after any form of stream repartitioning or broadcasting

– Dealing with out-of-order records is left to the operator implementation
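To make the role of watermarks concrete, here is a simplified sketch of event-time tumbling windows driven by a watermark. It is not Flink code: the watermark rule (largest timestamp seen minus a fixed `max_delay`), the window size, and the policy of dropping late records are assumptions chosen for the example.

```python
from collections import defaultdict


def run_event_time_windows(events, window=10, max_delay=3):
    """Sum values per tumbling event-time window.

    events: list of (event_time, value) in ARRIVAL order (may be out of order).
    Returns (fired, late): fired is a list of (window_start, total) in firing
    order; late holds records whose window had already fired.
    """
    open_windows = defaultdict(int)
    fired, late = [], []
    watermark = float("-inf")
    for ts, value in events:
        start = (ts // window) * window
        if start + window <= watermark:
            late.append((ts, value))      # too late: window already closed
        else:
            open_windows[start] += value  # out of order but still in time
        # Watermark: "no event older than this is expected any more".
        watermark = max(watermark, ts - max_delay)
        # Fire every window that the watermark has passed.
        for s in sorted(w for w in open_windows if w + window <= watermark):
            fired.append((s, open_windows.pop(s)))
    return fired, late


if __name__ == "__main__":
    arrivals = [(1, 1), (4, 1), (12, 1), (9, 1), (14, 1), (3, 1), (25, 1)]
    print(run_event_time_windows(arrivals))
```

In this run the record with event time 9 arrives after the one with event time 12 and is still counted in window [0, 10), while the record with event time 3 arrives after that window has fired and is reported as late.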


Flink: fault-tolerance

  • Based on Chandy-Lamport distributed snapshots
  • Lightweight mechanism

– Maintains high throughput rates and provides strong consistency guarantees at the same time

Flink: performance and memory management

  • High performance and low latency
  • Memory management

– Flink implements its own memory management inside the JVM


Flink: architecture

  • The usual master-worker architecture

Flink: architecture

  • Master (Job Manager): schedules tasks, coordinates checkpoints, coordinates recovery on failures, etc.
  • Workers (Task Managers): JVM processes that execute tasks of a dataflow, and buffer and exchange the data streams

– Each worker uses task slots to control the number of tasks it accepts
– Each task slot represents a fixed subset of resources of the worker

Flink: application execution

  • Jobs are expressed as data flows
  • The job graph is transformed into the execution graph
  • The execution graph contains the information needed to schedule and execute a job

Flink: infrastructure

  • Designed to run on large-scale clusters with many thousands of nodes
  • Provides support for YARN and Mesos

A recent need

  • A common need for many companies

– Run both batch and stream processing

  • Alternative solutions

– Lambda architecture
– Unified frameworks: Spark, Flink, Samza
– Unified programming model: Apache Beam

Lambda architecture

  • Data-processing design pattern to integrate batch and real-time processing
  • Streaming framework used to process real-time events, and, in parallel, batch framework (e.g., Hadoop/Spark) deployed to process the entire dataset (perhaps periodically)
  • Results from the two parallel pipelines are then merged

Source: https://voltdb.com/products/alternatives/lambda-architecture
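A minimal sketch of the pattern, assuming a page-view counting use case: one function stands for the batch pipeline, one for the streaming pipeline, and a query merges the two results. The function names and the event layout are hypothetical, and real deployments would run each pipeline on its own framework.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch pipeline: recompute counts over the entire dataset."""
    return Counter(event["page"] for event in master_dataset)


def realtime_view(recent_events):
    """Streaming pipeline: count only events newer than the last batch run."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, realtime):
    """Merge step: the answer combines both pipelines' results."""
    return batch[page] + realtime[page]


if __name__ == "__main__":
    history = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
    recent = [{"page": "/home"}]
    print(query("/home", batch_view(history), realtime_view(recent)))
```

Note that the counting logic is written twice, once per pipeline; in practice the two copies often live in different frameworks and languages.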

Lambda architecture: example

  • Lambda architecture used at LinkedIn before Samza development

Lambda architecture

  • Pros:

– Flexibility in the choice of frameworks

  • Cons:

– Implementing and maintaining two separate frameworks for batch and stream processing can be hard and error-prone
– Overhead of developing and managing multiple code bases

  • The logic in each fork evolves over time, and keeping them in sync involves duplicated and complex manual effort, often with different languages

Unified frameworks

  • Use a unified (Lambda-less) design for processing both real-time and batch data using the same data structure
  • Spark, Flink and Samza follow this trend

Apache Beam

  • A new layer of abstraction
  • Provides an advanced unified programming model

– Allows defining batch and streaming data processing pipelines that run on any supported execution engine (for now: Flink, Spark, Google Cloud Dataflow)
– Java, Python and Go as programming languages

  • Translates the data processing pipeline defined by the user with the Beam program into the API compatible with the chosen distributed processing engine
  • Developed by Google and released as an open-source top-level project

Apache Samza

  • A distributed system for stateful and fault-tolerant stream processing

– A unified framework for batch and stream processing: similarly to Flink, streams as first-class citizen, batch as special case of streaming
– In production at LinkedIn since 2014

  • Uses Kafka for messaging and YARN to provide fault tolerance, processor isolation, security, and resource management

Apache Samza

  • Why stateful and fault-tolerant processing? User profiles, email digests, aggregate counts, …
  • Example: Email Digestion System

– Production application running at LinkedIn using Samza to digest updates into one email

Apache Samza: features

  • Provides distributed and scalable data processing with

– Configurable and heterogeneous data sources and sinks (e.g., Kafka, HDFS, AWS Kinesis)
– Efficient state management: local state (in memory or on disk) partitioned among tasks (rather than using a remote data store) and incremental checkpointing (more efficient than full state checkpointing)
– Unified processing API for stream and batch processing

  • Unlike Flink

– Flexible deployment models

Apache Samza: architecture

  • Samza is made up of three layers:

– A streaming layer: Kafka
– An execution layer: YARN
– A processing layer: Samza API

Apache Samza: architecture

  • Samza, YARN and Kafka interaction

  • RM: YARN ResourceManager
  • NM: YARN NodeManager
  • AM: ApplicationMaster
  • Samza client uses YARN to run a Samza job
  • YARN starts and supervises one or more SamzaContainers, and the processing code (using the StreamTask API) runs inside those containers
  • The input and output for the Samza StreamTasks come from Kafka brokers

Apache Samza: running an application

  • Count the number of page views aggregated by user ID
  • Two Samza jobs: one to group messages by user ID, the other to do the counting
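The two-job split can be sketched in plain Python. This is not the Samza API: the function names, the CRC32 hash, and the in-memory lists standing in for partitioned streams are assumptions for the example.

```python
import zlib
from collections import Counter, defaultdict


def repartition_by_user(page_views, num_partitions):
    """Job 1: route each message to a partition chosen by hashing its
    user ID, so that all views of one user end up in the same partition."""
    partitions = defaultdict(list)
    for view in page_views:
        p = zlib.crc32(view["user_id"].encode("utf-8")) % num_partitions
        partitions[p].append(view)
    return partitions


def count_per_user(partition):
    """Job 2 (one task per partition): count page views per user."""
    counts = Counter()
    for view in partition:
        counts[view["user_id"]] += 1
    return counts


if __name__ == "__main__":
    views = [{"user_id": "alice", "page": "/a"},
             {"user_id": "bob", "page": "/b"},
             {"user_id": "alice", "page": "/c"}]
    for pid, part in sorted(repartition_by_user(views, 4).items()):
        print(pid, dict(count_per_user(part)))
```

Because the first job groups by user ID, each counting task sees every view of the users it owns and can keep its counters without coordinating with other tasks.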


Apache Samza: state management

  • Common approach (e.g., in Storm) to deal with large amounts of state: use a remote data store (e.g., Redis)
  • Samza approach: keep state local to each node and make it robust to machine failures by replicating state changes across multiple machines
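A minimal sketch of the idea, assuming the replicated state changes take the form of an append-only log: every write is logged before it is applied locally, and a replacement task rebuilds its state by replaying the log. The class name `ChangelogBackedStore` is hypothetical, and a Python list stands in for the replicated log.

```python
class ChangelogBackedStore:
    """Task-local key-value state whose changes are also written to a log."""

    def __init__(self, changelog):
        self.changelog = changelog  # stand-in for a replicated, durable log
        self.state = {}             # fast local state, lost if the node dies

    def put(self, key, value):
        # Log the change first, so it survives a crash of this node.
        self.changelog.append((key, value))
        self.state[key] = value

    def get(self, key, default=None):
        # Reads are purely local: no remote data store is involved.
        return self.state.get(key, default)

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state on another machine by replaying the log."""
        store = cls(changelog)
        for key, value in changelog:
            store.state[key] = value
        return store


if __name__ == "__main__":
    log = []
    store = ChangelogBackedStore(log)
    store.put("alice", 1)
    store.put("alice", 2)
    recovered = ChangelogBackedStore.restore(log)  # after a simulated failure
    print(recovered.get("alice"))
```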


Apache Samza: high-level API


Apache Samza: example application

  • Samza job to find top-k trending tags

– DAG for the logical representation of the application


Towards strict delivery guarantees

  • Most frameworks provide at-least-once delivery guarantees (e.g., Storm, Samza)

– For stateful non-idempotent operators such as counting, at-least-once delivery guarantees can give incorrect results

  • Flink, Storm with Trident, and Google’s MillWheel offer stronger delivery guarantees (i.e., exactly-once)

– Exactly-once low latency stream processing in MillWheel works as follows:

  • The record is checked against de-duplication data from previous deliveries; duplicates are discarded
  • User code is run for the input record, possibly resulting in pending changes to timers, state, and productions
  • Pending changes are committed to the backing store
  • Senders are acked
  • Pending downstream productions are sent

Source: MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.
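The five steps can be traced in a toy counting operator. This is a sketch under stated assumptions, not MillWheel code: the class name is hypothetical, in-memory fields stand in for the backing store, and re-acking a duplicate (so that the sender stops retrying) is an assumption of this example.

```python
class ExactlyOnceCounter:
    """Counting operator that tolerates redelivered records."""

    def __init__(self):
        self.seen_ids = set()  # de-duplication data from previous deliveries
        self.count = 0         # operator state
        self.acked = []        # acks returned to senders
        self.sent = []         # productions sent downstream

    def deliver(self, record_id, value):
        # 1. Check against de-duplication data; discard duplicates.
        if record_id in self.seen_ids:
            self.acked.append(record_id)
            return
        # 2. Run user code, producing pending changes only.
        pending_count = self.count + value
        pending_output = (record_id, pending_count)
        # 3. Commit pending changes (state and record ID together).
        self.count = pending_count
        self.seen_ids.add(record_id)
        # 4. Ack the sender.
        self.acked.append(record_id)
        # 5. Send pending downstream productions.
        self.sent.append(pending_output)


if __name__ == "__main__":
    deliveries = [("r1", 1), ("r2", 1), ("r1", 1)]  # r1 is redelivered
    exact, naive = ExactlyOnceCounter(), 0
    for rid, v in deliveries:
        exact.deliver(rid, v)
        naive += v  # at-least-once counting without de-duplication
    print("exactly-once:", exact.count, "naive:", naive)
```

The naive counter overcounts the redelivered record, which is the incorrect result that at-least-once delivery can give for non-idempotent operators.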


DSP in the Cloud

  • Data streaming systems are also offered as Cloud services

– Amazon Kinesis Data Streams
– Google Cloud Dataflow
– IBM Streaming Analytics
– Microsoft Azure Stream Analytics

  • Abstract the underlying infrastructure and support dynamic scaling of the computing resources
  • Appear to execute in a single data center

Google Cloud Dataflow

  • Fully-managed data processing service, supporting both stream and batch execution of pipelines

– Transparently handles resource lifetime and can dynamically provision resources to minimize latency while maintaining high utilization efficiency
– On-demand and auto-scaling

  • Provides a unified programming model and a managed service for developing and executing a wide range of data processing patterns including ETL, batch computation, and continuous computation

– Apache Beam model

  • Provides exactly-once processing

– MillWheel is Google’s internal version of Cloud Dataflow


Google Cloud Dataflow

  • Seamlessly integrates with other Google cloud services

– Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery

  • Apache Beam SDKs, available in Java and Python

– Enable developers to implement custom extensions and choose other execution engines


Amazon Kinesis Data Streams

  • Allows building applications that process and analyze streaming data

Kinesis Data Streams: components

  • Data record

– Unit of data stored by Kinesis

  • Stream: group of data records

– Data producers write data records to Kinesis streams
– Data records in the stream are distributed into shards

  • Shard: sequence of data records in a stream

– Number of shards in a stream specified when the stream is created

  • Can then be increased or decreased as needed, but manually

– Base unit of capacity: up to 1 MB/s of written data and 2 MB/s of read data
– Partition key used to group data by shard within a stream
– Used also for service pricing http://amzn.to/2szRTkG
– Data records are stored in shards temporarily (24 hours by default)
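Two small calculations follow from the shard definition: how many shards a given load needs, and which shard a partition key lands in. The sketch below uses the per-shard limits stated above; the helper names are hypothetical, and splitting the 128-bit MD5 hash space into equal ranges is an assumption (a real stream's ranges depend on how it was created and re-sharded).

```python
import hashlib
import math

WRITE_MB_PER_SHARD = 1.0  # up to 1 MB/s of written data per shard
READ_MB_PER_SHARD = 2.0   # up to 2 MB/s of read data per shard


def shards_needed(write_mb_s, read_mb_s):
    """Smallest shard count that covers both the write and the read load."""
    return max(1,
               math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
               math.ceil(read_mb_s / READ_MB_PER_SHARD))


def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard index: MD5 of the key as a 128-bit
    integer, looked up in equal-sized hash ranges (assumed layout)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * num_shards // 2**128


if __name__ == "__main__":
    n = shards_needed(write_mb_s=4.5, read_mb_s=6.0)
    print("shards:", n)
    print("user-42 ->", shard_for_key("user-42", n))
```

Records with the same partition key always map to the same shard, so a skewed key distribution can overload one shard even when the total load is within capacity.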


Kinesis Data Streams: consuming data

  • Kinesis Data Streams is used to capture streaming data
  • An application reads data from a Kinesis stream as data records, then uses the Kinesis Client Library (KCL) for the processing logic

– KCL takes care of: load-balancing across multiple EC2 instances, responding to instance failures, checkpointing processed records, reacting to re-sharding (that adjusts the number of shards)

References

  • Overview on DSP frameworks

– Wingerath et al., “Real-time stream processing for Big Data”, 2016. http://bit.ly/2rUhXXG

  • Kulkarni et al., “Twitter Heron: stream processing at scale”, ACM SIGMOD'15. http://bit.ly/2rUXkux
  • Carbone et al., “Apache Flink: Stream and batch processing in a single engine”, Bulletin IEEE Comp. Soc. Tech. Comm. on Data Eng., 2015. http://bit.ly/2sYzoGb
  • Noghabi et al., “Samza: Stateful scalable stream processing at LinkedIn”, VLDB Endowment, 2017. https://bit.ly/2LushvF
  • Akidau et al., “MillWheel: fault-tolerant stream processing at Internet scale”, VLDB'13. http://bit.ly/2rE7Fa3