

Data Acquisition and Ingestion

Corso di Sistemi e Architetture per Big Data, A.A. 2019/2020
Valeria Cardellini
Laurea Magistrale in Ingegneria Informatica
Macroarea di Ingegneria, Dipartimento di Ingegneria Civile e Ingegneria Informatica

The reference Big Data stack

[Figure: the reference Big Data stack, with layers Resource Management, Data Storage, Data Processing, High-level Frameworks, Support / Integration]


Data acquisition and ingestion

  • How to collect data from external (and multiple) data sources and ingest it into a system where it can be stored and later analyzed using batch processing?
    – Distributed file systems (e.g., HDFS), NoSQL data stores (e.g., HBase), …
  • How to connect external data sources to stream or in-memory processing systems for immediate use?
  • How to also perform some preprocessing (e.g., data transformation or conversion)?

Driving factors

  • Source type and location
    – Batch data sources: files, logs, RDBMS, …
    – Real-time data sources: sensors, IoT systems, social media feeds, stock market feeds, …
    – Source location
  • Velocity
    – How fast is data generated?
    – How frequently does data vary?
    – Real-time or streaming data require low latency and low overhead
  • Ingestion mechanism
    – Depends on the data consumer
    – Pull vs. push based approach


Requirements

  • Ingestion
    – Batch data, streaming data
    – Easy writing to storage (e.g., HDFS)
  • Decoupling
    – Data sources should not be directly coupled to the processing framework
  • High availability and fault tolerance
    – Data ingestion should be available 24x7
    – Data should be buffered (persisted) in case the processing framework is not available
  • Scalability and high throughput
    – Number of sources and consumers will increase, amount of data will increase

Requirements

  • Data provenance
  • Security
    – Authentication and encryption of data in motion
  • Data conversion
    – From multiple sources: transform data into a common format
    – Also to speed up processing
  • Data integration
    – From multiple flows to a single flow
  • Data compression
  • Data preprocessing (e.g., filtering)
  • Backpressure and routing
    – Buffer data in case of temporary spikes in workload and provide a mechanism to replay it later


A unifying view: Lambda architecture


Data acquisition layer

  • Allows collecting, aggregating and moving data
  • From various sources (server logs, social media, streaming sensor data, …)
  • To a data store (distributed file system, NoSQL data store, messaging system)
  • We analyze
    – Apache Flume
    – Apache Sqoop
    – Apache NiFi


Apache Flume

  • Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of stream data (e.g., log data)
  • Robust and fault tolerant, with tunable reliability mechanisms and failover and recovery mechanisms
    – Tunable reliability levels
      • Best effort: “fast and loose”
      • Guaranteed delivery: “deliver no matter what”
  • Suitable for streaming analytics


Flume architecture

  • Agent: JVM process running Flume
    – One per machine
    – Can run many sources, sinks and channels
  • Event
    – Basic unit of data that is moved using Flume (e.g., Avro event)
    – Normally ~4 KB
  • Source
    – Produces data in the form of events
  • Channel
    – Connects source to sink (like a queue)
    – Implements the reliability semantics
  • Sink
    – Removes an event from a channel and forwards it either to a destination (e.g., HDFS) or to another agent


Flume data flows

  • Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination
  • Supports multiplexing the event flow to one or more destinations
  • Multiple built-in sources and sinks (e.g., Avro, Kafka)


Flume reliability

  • Events are staged in a channel on each agent
    – A channel can be either durable (FILE, persists data to disk) or non-durable (MEMORY, loses data if a machine fails)
  • Events are then delivered to the next agent or to the final destination (e.g., HDFS) in the flow
  • Events are removed from a channel only after they are stored in the channel of the next agent or in the final destination
  • Transactional approach to guarantee the reliable delivery of events
    – Sources and sinks encapsulate the storage/retrieval of events in a transaction

Apache Sqoop

  • A commonly used tool for SQL data transfer to Hadoop
    – SQL to Hadoop = SQOOP
  • To import bulk data from structured data stores such as RDBMS into HDFS, HBase or Hive
  • Also to export data from HDFS to RDBMS
  • Supports a variety of file formats (e.g., Avro)


Apache NiFi


  • Powerful and reliable system to automate the flow of data between systems
  • Mainly used for data routing and transformation
  • Highly configurable
    – Flow-specific QoS: loss tolerant vs. guaranteed delivery, low latency vs. high throughput
    – Prioritized queueing
    – Flow can be modified at runtime
  • Useful for data preprocessing
    – Back pressure
  • Data governance and security
  • Ease of use: visual command and control
    – Web-based UI where to define the sources from which to collect data, the processors for data conversion, and the destinations to store the data

Apache NiFi: core concepts

  • Based on flow-based programming
  • Main concepts:
    – FlowFile: each object moving through the system
    – FlowFile Processor: performs the work of data routing, transformation, or mediation between systems
    – Connection: actual linkage between processors; acts as a queue
    – Flow Controller: maintains the knowledge of how processes connect and manages threads and allocations
    – Process Group: specific set of processes and their connections


Apache NiFi: architecture


  • NiFi executes within a JVM
  • Multiple NiFi servers can be clustered for scalability

Apache NiFi: use case

  • Use NiFi to fetch tweets by means of NiFi’s ‘GetTwitter’ processor
    – It uses the Twitter Streaming API to retrieve tweets
  • Move the data stream to Apache Kafka using NiFi’s ‘PublishKafka’ processor


Data serialization formats for Big Data

  • Serialization: process of converting structured data into a compact (binary) form
  • Some data serialization formats you already know
    – JSON
    – XML
  • Other serialization formats
    – Apache Avro (row-oriented)
    – Apache Parquet (column-oriented)
    – Protocol Buffers
    – Thrift

Apache Avro

  • Key features
    – Compact, fast, binary data format
    – Supports a number of data structures for serialization
    – Neutral to programming language
    – Simple integration with dynamic languages
    – Relies on schemas: data + schema is fully self-describing
      • JSON-based schema segregated from data
    – RPC
    – Both Hadoop and Spark can access Avro as a data source
  • Comparing performance of serialization formats: https://bit.ly/2qrMnOz
    – Avro should not be used for small objects (high serialization and deserialization times)
    – Interesting for large objects
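A minimal Java sketch of Avro's generic (schema-driven) serialization; the "User" schema and its fields are invented for the example:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;

public class AvroSketch {
    public static void main(String[] args) throws Exception {
        // JSON-based schema, kept separate from the data it describes
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
            "{\"name\":\"name\",\"type\":\"string\"}," +
            "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Build a record conforming to the schema
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // Serialize to Avro's compact binary encoding
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);
        encoder.flush();
        System.out.println("Serialized " + out.size() + " bytes");
    }
}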


Messaging layer: architectural choices

  • Message queue
    – ActiveMQ
    – RabbitMQ
    – ZeroMQ
    – Amazon SQS
  • Publish/subscribe
    – Kafka
    – NATS http://www.nats.io
    – Apache Pulsar https://pulsar.apache.org/
  • Geo-replication of stored messages
    – Redis

Messaging layer: use cases

  • Mainly used in data processing pipelines for data ingestion or aggregation
  • Envisioned mainly to be used at the beginning or end of a data processing pipeline
  • Example
    – Incoming data from various sensors: ingest this data into a streaming system for real-time analytics or into a distributed file system for batch analytics


Message queue pattern

  • Messages are put into a queue
  • Multiple consumers can read from the queue
  • Each message is delivered to only one consumer
  • Pros (not only in the Big Data domain)
    – Loose coupling
    – Service statelessness

Message queue API

  • Basic interface to a queue in a MQS:
    – put: nonblocking send
      • Append a message to a specified queue
    – get: blocking receive
      • Block until the specified queue is nonempty and remove the first message
      • Variations: allow searching for a specific message in the queue, e.g., using a matching pattern
    – poll: nonblocking receive
      • Check a specified queue for messages and remove the first
      • Never blocks
    – notify: nonblocking receive
      • Install a handler to be automatically called when a message is put into the specified queue
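To make the four primitives concrete, a hypothetical Java interface mirroring them; the names and signatures are illustrative, not those of any particular MQS:

import java.util.Optional;
import java.util.function.Consumer;

// Hypothetical message queue interface; M is the message type
public interface MessageQueue<M> {
    // put: nonblocking send, appends a message to the specified queue
    void put(String queue, M message);

    // get: blocking receive, waits until the queue is nonempty,
    // then removes and returns the first message
    M get(String queue) throws InterruptedException;

    // poll: nonblocking receive, removes and returns the first
    // message if one is present; never blocks
    Optional<M> poll(String queue);

    // notify: nonblocking receive, installs a handler that is called
    // automatically when a message is put into the specified queue
    void notify(String queue, Consumer<M> handler);
}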


Publish/subscribe pattern

  • Application components can publish asynchronous messages (e.g., event notifications), and/or declare their interest in message topics (or content) by issuing a subscription
    – Let’s consider only topic-based pub/sub

Publish/subscribe pattern

  • Multiple consumers can subscribe to a topic with or without filters
  • Subscriptions are collected by a distributed event dispatcher component, responsible for routing events to all matching subscribers
  • Pros
    – High degree of decoupling
      • Spatial, temporal, asynchronous


Publish/subscribe API

  • Basic interface:
    – publish(event): publish an event
      • Events can be of any data type supported by the given implementation languages and may also contain metadata
    – subscribe(filter_expr, notify_cb, expiry) → sub_handle: subscribe to events
      • Takes a filter expression, a reference to a notify callback for event delivery, and an expiry time for the subscription registration
      • Returns a subscription handle
    – unsubscribe(sub_handle)
    – notify_cb(sub_handle, event): called by the pub/sub system to deliver a matching event to the corresponding subscribers

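The same interface as a hypothetical Java sketch; the handle and callback types are illustrative:

import java.time.Instant;
import java.util.function.BiConsumer;

// Hypothetical topic-based pub/sub interface; E is the event type
public interface PubSubSystem<E> {
    // publish(event): deliver the event to all matching subscribers
    void publish(E event);

    // subscribe(filter_expr, notify_cb, expiry) -> sub_handle:
    // register a callback invoked for each event matching the filter,
    // until the subscription expires; returns a subscription handle
    long subscribe(String filterExpr, BiConsumer<Long, E> notifyCb, Instant expiry);

    // unsubscribe(sub_handle): cancel the subscription
    void unsubscribe(long subHandle);

    // notify_cb itself is the BiConsumer passed to subscribe: the
    // pub/sub system calls it with (sub_handle, event) on a match
}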

Pub/sub vs. message queue

  • A sibling of the message queue pattern, but further generalizes it by delivering a message to multiple consumers
    – Message queue: delivers messages to only one receiver, i.e., one-to-one communication
    – Pub/sub: delivers messages to multiple receivers, i.e., one-to-many communication
  • Some frameworks (e.g., RabbitMQ, Kafka, NATS) support both patterns


Apache Kafka

  • General-purpose, distributed pub/sub system
  • Originally developed in 2010 at LinkedIn
  • Used at scale by tech giants (Netflix, Uber, LinkedIn, …)
  • Written in Scala
  • Horizontally scalable
  • Fault-tolerant
  • High-throughput ingestion
    – Billions of messages
  • Not only messaging, but also processing of data
    – We focus on messaging

Kreps et al., “Kafka: A Distributed Messaging System for Log Processing”, NetDB ’11.
https://kafka.apache.org/documentation/

Kafka at a glance

  • Kafka maintains feeds of messages in categories called topics
    – A topic can have 0, 1, or many consumers subscribing to data written to it
  • Producers: publish messages to a Kafka topic
  • Consumers: subscribe to topics and process the feed of published messages
  • Kafka cluster: distributed log of data over servers known as brokers
    – A broker is responsible for receiving and storing published data


Kafka: topics and partitions

  • Topic: a category to which messages are published
  • For each topic, the Kafka cluster maintains a partitioned log
    – Log (data structure!): append-only, totally-ordered sequence of records ordered by time
  • A topic is split into a pre-defined number of partitions
    – Partition: unit of parallelism of the topic (allows for parallel access)
  • Each partition is replicated with some replication factor
  • Create a topic with 1 partition and 1 replica using CLI tools:

$> bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test

Kafka: partitions

  • Producers publish their records to partitions of a topic (round-robin or partitioned by keys), and consumers consume the published records of that topic
  • Each partition is an ordered, numbered, immutable sequence of records that is continually appended to
    – Like a commit log
  • Each record is associated with a monotonically increasing sequence number, called offset


Kafka: partitions

  • Partitions are distributed across brokers for scalability
  • Each partition is replicated for fault tolerance across a configurable number of brokers
  • Each partition has one leader broker and 0 or more followers
  • The leader handles read and write requests
    – Read from leader
    – Write to leader
  • A follower replicates the leader and acts as a backup
  • Each broker is a leader for some of its partitions and a follower for others, to balance load
    – Brokers rely on Apache ZooKeeper for coordination



Kafka: producers

  • Producers = data sources
  • Publish data to topics of their choice
    – The producer sends data directly to the broker that is the leader for the partition, without any routing tier
  • Also responsible for choosing which record to assign to which partition within the topic
    – Random or key-based partitioning
    – E.g., if user id is the key, all data for a given user will be sent to the same partition
  • Send some messages using CLI tools:

$> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
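The same publication can also be done programmatically with the Producer API; a minimal Java sketch, where the topic name, key and serializer choices are illustrative:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key-based partitioning: every record with key "user-42"
            // is routed to the same partition of topic "test"
            producer.send(new ProducerRecord<>("test", "user-42", "This is a message"));
        }
    }
}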

Kafka: design choice for consumers

  • Push vs. pull model for consumers
  • Push model
    – The brokers actively push messages to the consumers
    – Challenging for the broker to deal with different types of consumers, as it controls the rate at which data is transferred
    – Needs to decide whether to send a message immediately or accumulate more data and send
  • Pull model
    – The consumer assumes the primary responsibility for retrieving messages from the brokers
    – The consumer has to maintain an offset that identifies the next message to be transmitted and processed
    – Pros: better scalability and flexibility (different consumers with diverse needs and capabilities)
    – Cons: in case the broker has no data, consumers may end up busy waiting for data to arrive


Kafka: consumers

  • In Kafka’s design, a pull approach rather than a push approach for consumers
    – Why? Less burden on brokers
      https://kafka.apache.org/documentation/#design_pull
  • Consumer Group: set of consumers sharing a common group ID
    – A Consumer Group maps to a logical subscriber
    – Each group consists of multiple consumers for scalability and fault tolerance


Kafka: consumers

  • Consumers use the offset to track which messages have been consumed
    – Messages can be replayed using the offset
  • Run the consumer using CLI tools:

$> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
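Programmatically, a minimal Java sketch of the Consumer API poll loop; the group id and topic are illustrative, and auto.offset.reset=earliest plays the role of --from-beginning when the group has no committed offset:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test-group"); // consumers sharing this id form a group
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test"));
            while (true) {
                // Pull model: the consumer asks the broker for new records
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records)
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}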

Kafka: ordering guarantees

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent
  • A consumer sees records in the order they are stored in the partition
  • Strong guarantees about ordering only within a partition
    – Total order over messages within a partition
    – But Kafka cannot preserve message order between different partitions in a topic
  • However, such per-partition ordering, combined with the ability to partition data by key among topic partitions, is sufficient for most applications


Kafka: delivery semantics

  • Delivery guarantees supported by Kafka
    – At-least-once (default): guarantees no message loss, but duplicated messages, possibly out of order
    – Exactly-once: guarantees no loss and no duplicates, but requires expensive end-to-end 2PC
      • But not fully exactly-once
      • Support depends on the destination system
  • Users can also implement at-most-once delivery by disabling retries on the producer and committing offsets in the consumer prior to processing a message

See https://kafka.apache.org/documentation/#semantics
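These guarantees surface as client configuration; a Java sketch of the relevant producer settings, with values that are illustrative rather than a recommended setup:

import java.util.Properties;

public class DeliveryConfigSketch {
    public static void main(String[] args) {
        // At-least-once: wait for broker acks and retry failed sends,
        // so a record is never lost but may be duplicated
        Properties atLeastOnce = new Properties();
        atLeastOnce.put("acks", "all");
        atLeastOnce.put("retries", Integer.MAX_VALUE);

        // At-most-once: disable producer retries (and, on the consumer
        // side, commit offsets before processing each message)
        Properties atMostOnce = new Properties();
        atMostOnce.put("retries", 0);

        // Toward exactly-once within Kafka: idempotent producer, which
        // deduplicates broker-side retries
        Properties idempotent = new Properties();
        idempotent.put("enable.idempotence", "true");
    }
}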

Kafka: fault tolerance

  • Kafka replicates partitions for fault tolerance
  • Kafka makes a message available for consumption only after all the followers acknowledge to the leader a successful write
    – Implies that a message may not be immediately available for consumption (the usual tradeoff between consistency and availability)
    – This default behavior can be relaxed if such a strong guarantee is not required
  • Kafka retains messages for a configured period of time
    – Messages can be “replayed” in case a consumer fails


Kafka and ZooKeeper

  • Kafka uses ZooKeeper to coordinate between producers, consumers and brokers
  • ZooKeeper stores Kafka metadata
    – List of brokers
    – List of consumers and their offsets
    – List of producers

Kafka: APIs

  • Four core APIs
  • Producer API: allows apps to publish records (e.g., clickstream, logs, IoT) to topics
  • Consumer API: allows apps to read records from topics
  • Connect API: reusable connectors (producers or consumers) that connect Kafka topics to existing applications or data systems, so as to move large collections of data into and out of Kafka
    – Connectors for AWS S3, HDFS, RabbitMQ, MySQL, Postgres, AWS Lambda, MongoDB, Twitter, …

https://kafka.apache.org/documentation/#api


Kafka: APIs

  • Streams API: allows transforming streams of data from input topics to output topics
    – Kafka is not only a pub/sub system but also a real-time streaming platform
      • Use Kafka Streams to process data in pipelines consisting of multiple stages
  • Kafka APIs support Java and Scala only
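A minimal Kafka Streams sketch of a single-stage pipeline; the application id and topic names are illustrative:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());

        // Read from an input topic, transform each value, write to an output topic
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> in = builder.stream("input-topic");
        in.mapValues(v -> v.toUpperCase()).to("output-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}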

Kafka: client library

  • JVM internal client
  • Plus a rich ecosystem of clients, among which:
    – Sarama: Go client library
      https://shopify.github.io/sarama/
    – Python client libraries
      https://pypi.org/project/kafka-python/
      https://github.com/confluentinc/confluent-kafka-python/
    – NodeJS client library
      https://github.com/Blizzard/node-rdkafka


Performance comparison: Kafka vs RabbitMQ

  • Both guarantee millisecond-level low latency
    – At-least-once delivery guarantee is more expensive on Kafka (latency almost doubles)
  • Replication has a drastic impact in both cases
    – Performance reduced by 50% (RabbitMQ) and 75% (Kafka)
  • Kafka is best suited as a scalable ingestion system
  • The two systems can be chained
  • For a feature comparison see also
    https://www.cloudkarafka.com/blog/2020-02-02-which-service-rabbitmq-vs-apache-kafka.html

Dobbelaere and Esmaili, “Kafka versus RabbitMQ”, ACM DEBS 2017

Kafka: limitations

  • No complete set of monitoring tools
  • No support for wildcard topic selection
  • Limited support for geo-replication
    – Can run a single Apache Kafka cluster across multiple geographic regions, but suffers from latency
    – The MirrorMaker tool allows maintaining a replica of an existing Kafka cluster, but mainly for fault tolerance


Kafka: evolution

  • Kafka is no longer just a pub/sub system but rather a distributed streaming platform
    – Kafka Streams
    – Including KSQL, a streaming SQL engine for Kafka on top of Kafka Streams
  • You will see Kafka Streams in the hands-on course

Kafka @ Netflix

  • Netflix uses Kafka for data collection and buffering

http://techblog.netflix.com/2016/04/kafka-inside-keystone-pipeline.html


Kafka @ Uber

  • Uber has one of the largest Kafka deployments in the industry

https://eng.uber.com/ureplicator/

Kafka @ Audi

  • Audi uses Kafka for real-time data processing

https://www.youtube.com/watch?v=yGLKi3TMJv8


Kafka @ CINI Smart City Challenge ’17


By M. Adriani, D. Magnanimi, M. Ponza, F. Rossi


Cloud services for Kafka

  • Fully-managed services based on Kafka
  • Amazon MSK (Managed Streaming for Apache Kafka)
    https://aws.amazon.com/msk/
  • Confluent Cloud
    https://www.confluent.io/confluent-cloud/
    – Led by the creators of Kafka
  • CloudKarafka
    https://www.cloudkarafka.com/


Cloud services for data ingestion

  • Amazon Kinesis Data Firehose
    – Fully managed service for delivering streaming data directly to S3, used as a data lake
    – Can transform and compress streaming data before storing it
    – Can invoke Lambda functions to transform incoming source data and deliver it to S3
  • Google Cloud Pub/Sub
    https://cloud.google.com/pubsub/
    – Fully-managed real-time pub/sub messaging service

Cloud services for IoT data ingestion

  • Let’s consider some AWS cloud services devoted to IoT data ingestion (and analysis)
    – AWS IoT Events
    – AWS IoT Core
    – AWS IoT Analytics


AWS IoT Events

  • IoT service to detect and respond to events from IoT sensors and applications
    – Select the data sources to ingest, define the logic for each event using if-then-else statements, and select the alert or custom action to trigger when an event occurs
    – Integrated with other services (AWS IoT Core and IoT Analytics) to enable detection and insights into events

AWS IoT Core


  • Managed cloud service that lets connected devices interact with cloud applications and other devices


AWS IoT Analytics

  • Fully-managed cloud service to run analytics on massive volumes of IoT data
  • Filters, transforms, and enriches IoT data before storing it in a time-series data store for analysis

References

  • Apache Flume documentation, https://flume.apache.org/FlumeUserGuide.html
  • Apache NiFi documentation, https://nifi.apache.org/docs.html
  • Apache Kafka documentation, https://kafka.apache.org/documentation/
  • Kreps et al., “Kafka: A Distributed Messaging System for Log Processing”, NetDB 2011. http://bit.ly/2oxpael
  • Dobbelaere and Esmaili, “Kafka versus RabbitMQ: A Comparative Study of Two Industry Reference Publish/Subscribe Implementations”, ACM DEBS 2017. https://doi.org/10.1145/3093742.3093908