

Data Acquisition and Ingestion

Corso di Sistemi e Architetture per Big Data, A.A. 2019/2020
Valeria Cardellini
Laurea Magistrale in Ingegneria Informatica
Macroarea di Ingegneria, Dipartimento di Ingegneria Civile e Ingegneria Informatica

The reference Big Data stack

[Figure: the reference Big Data stack, with layers Resource Management, Data Storage, Data Processing, High-level Frameworks, Support / Integration]


Data acquisition and ingestion

  • How to collect data from external (and multiple) data sources and ingest it into a system where it can be stored and later analyzed using batch processing?
    – Distributed file systems (e.g., HDFS), NoSQL data stores (e.g., HBase), …
  • How to connect external data sources to stream or in-memory processing systems for immediate use?
  • How to also perform some preprocessing (e.g., data transformation or conversion)?

Driving factors

  • Source type and location
    – Batch data sources: files, logs, RDBMS, …
    – Real-time data sources: sensors, IoT systems, social media feeds, stock market feeds, …
    – Source location
  • Velocity
    – How fast is data generated?
    – How frequently does data vary?
    – Real-time or streaming data require low latency and low overhead
  • Ingestion mechanism
    – Depends on the data consumer
    – Pull vs. push based approach


Requirements

  • Ingestion
    – Batch data, streaming data
    – Easy writing to storage (e.g., HDFS)
  • Decoupling
    – Data sources should not be directly coupled to the processing framework
  • High availability and fault tolerance
    – Data ingestion should be available 24x7
    – Data should be buffered (persisted) in case the processing framework is not available
  • Scalability and high throughput
    – Number of sources and consumers will increase, amount of data will increase

Requirements

  • Data provenance
  • Security
    – Authentication and encryption of data in motion
  • Data conversion
    – From multiple sources: transform data into a common format
    – Also to speed up processing
  • Data integration
    – From multiple flows to a single flow
  • Data compression
  • Data preprocessing (e.g., filtering)
  • Backpressure and routing
    – Buffer data in case of temporary spikes in workload and provide a mechanism to replay it later


A unifying view: Lambda architecture


Data acquisition layer

  • Allows collecting, aggregating and moving data
  • From various sources (server logs, social media, streaming sensor data, …)
  • To a data store (distributed file system, NoSQL data store, messaging system)
  • We analyze
    – Apache Flume
    – Apache Sqoop
    – Apache NiFi


Apache Flume

  • Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of stream data (e.g., log data)
  • Robust and fault tolerant, with tunable reliability mechanisms and failover and recovery mechanisms
    – Tunable reliability levels
      • Best effort: “fast and loose”
      • Guaranteed delivery: “deliver no matter what”
  • Suitable for streaming analytics


Flume architecture

  • Agent: JVM process running Flume
    – One per machine
    – Can run many sources, sinks and channels
  • Event
    – Basic unit of data that is moved using Flume (e.g., Avro event)
    – Normally ~4 KB
  • Source
    – Produces data in the form of events
  • Channel
    – Connects source to sink (like a queue)
    – Implements the reliability semantics
  • Sink
    – Removes an event from a channel and forwards it either to a destination (e.g., HDFS) or to another agent


Flume data flows

  • Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination
  • Supports multiplexing the event flow to one or more destinations
  • Multiple built-in sources and sinks (e.g., Avro, Kafka)


Flume reliability

  • Events are staged in a channel on each agent
    – A channel can be either durable (FILE, persists data to disk) or non-durable (MEMORY, loses data if a machine fails)
  • Events are then delivered to the next agent or to the final destination (e.g., HDFS) in the flow
  • Events are removed from a channel only after they are stored in the channel of the next agent or in the final destination
  • Transactional approach to guarantee the reliable delivery of events
    – Sources and sinks encapsulate the storage/retrieval of events in a transaction

Apache Sqoop

  • A commonly used tool for SQL data transfer to Hadoop
    – SQL to Hadoop = SQOOP
  • To import bulk data from structured data stores such as RDBMS into HDFS, HBase or Hive
  • Also to export data from HDFS to RDBMS
  • Supports a variety of file formats (e.g., Avro)


Apache NiFi


  • Powerful and reliable system to automate the flow of data between systems
  • Mainly used for data routing and transformation
  • Highly configurable
    – Flow-specific QoS: loss tolerant vs. guaranteed delivery, low latency vs. high throughput
    – Prioritized queueing
    – Flow can be modified at runtime
  • Useful for data preprocessing
    – Back pressure
  • Data governance and security
  • Ease of use: visual command and control
    – Web-based UI where to define the sources from which to collect data, the processors for data conversion, and the destinations to store the data

Apache NiFi: core concepts

  • Based on flow-based programming
  • Main concepts:
    – FlowFile: each object moving through the system
    – FlowFile Processor: performs the work of data routing, transformation, or mediation between systems
    – Connection: actual linkage between processors; acts as a queue
    – Flow Controller: maintains the knowledge of how processes connect and manages threads and allocations
    – Process Group: specific set of processes and their connections


Apache NiFi: architecture


  • NiFi executes within a JVM
  • Multiple NiFi servers can be clustered for scalability

Apache NiFi: use case

  • Use NiFi to fetch tweets by means of NiFi’s ‘GetTwitter’ processor
    – It uses the Twitter Streaming API to retrieve tweets
  • Move the data stream to Apache Kafka using NiFi’s ‘PublishKafka’ processor


Data serialization formats for Big Data

  • Serialization: process of converting structured data into a compact (binary) form
  • Some data serialization formats you already know
    – JSON
    – XML
  • Other serialization formats
    – Apache Avro (row-oriented)
    – Apache Parquet (column-oriented)
    – Protocol Buffers
    – Thrift

Apache Avro

  • Key features
    – Compact, fast, binary data format
    – Supports a number of data structures for serialization
    – Neutral to programming language
    – Simple integration with dynamic languages
    – Relies on schemas: data + schema is fully self-describing
      • JSON-based schema segregated from data
    – RPC
    – Both Hadoop and Spark can access Avro as a data source
  • Comparing performance of serialization formats: https://bit.ly/2qrMnOz
    – Avro should not be used for small objects (high serialization and deserialization times)
    – Interesting for large objects
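A minimal Java sketch of Avro's generic (schema-driven) serialization; the "User" schema and its fields are invented for the example:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;

public class AvroSketch {
    public static void main(String[] args) throws Exception {
        // JSON-based schema, kept separate from the data it describes
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
            "{\"name\":\"name\",\"type\":\"string\"}," +
            "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Build a record conforming to the schema
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // Serialize to Avro's compact binary encoding
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);
        encoder.flush();
        System.out.println("Serialized " + out.size() + " bytes");
    }
}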


Messaging layer: architectural choices

  • Message queue
    – ActiveMQ
    – RabbitMQ
    – ZeroMQ
    – Amazon SQS
  • Publish/subscribe
    – Kafka
    – NATS http://www.nats.io
    – Apache Pulsar https://pulsar.apache.org/
  • Geo-replication of stored messages
    – Redis

Messaging layer: use cases

  • Mainly used in data processing pipelines for data ingestion or aggregation
  • Envisioned mainly to be used at the beginning or end of a data processing pipeline
  • Example
    – Incoming data from various sensors: ingest this data into a streaming system for real-time analytics or into a distributed file system for batch analytics


Message queue pattern

  • Messages are put into a queue
  • Multiple consumers can read from the queue
  • Each message is delivered to only one consumer
  • Pros (not only in the Big Data domain)
    – Loose coupling
    – Service statelessness

Message queue API

  • Basic interface to a queue in a MQS:
    – put: nonblocking send
      • Append a message to a specified queue
    – get: blocking receive
      • Block until the specified queue is nonempty and remove the first message
      • Variations: allow searching for a specific message in the queue, e.g., using a matching pattern
    – poll: nonblocking receive
      • Check a specified queue for messages and remove the first
      • Never blocks
    – notify: nonblocking receive
      • Install a handler to be automatically called when a message is put into the specified queue
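To make the four primitives concrete, a hypothetical Java interface mirroring them; the names and signatures are illustrative, not those of any particular MQS:

import java.util.Optional;
import java.util.function.Consumer;

// Hypothetical message queue interface; M is the message type
public interface MessageQueue<M> {
    // put: nonblocking send, appends a message to the specified queue
    void put(String queue, M message);

    // get: blocking receive, waits until the queue is nonempty,
    // then removes and returns the first message
    M get(String queue) throws InterruptedException;

    // poll: nonblocking receive, removes and returns the first
    // message if one is present; never blocks
    Optional<M> poll(String queue);

    // notify: nonblocking receive, installs a handler that is called
    // automatically when a message is put into the specified queue
    void notify(String queue, Consumer<M> handler);
}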


Publish/subscribe pattern

  • Application components can publish asynchronous messages (e.g., event notifications), and/or declare their interest in message topics (or content) by issuing a subscription
    – Let’s consider only topic-based pub/sub

Publish/subscribe pattern

  • Multiple consumers can subscribe to a topic with or without filters
  • Subscriptions are collected by a distributed event dispatcher component, responsible for routing events to all matching subscribers
  • Pros
    – High degree of decoupling
      • Spatial, temporal, asynchronous


Publish/subscribe API

  • Basic interface:
    – publish(event): publish an event
      • Events can be of any data type supported by the given implementation languages and may also contain metadata
    – subscribe(filter_expr, notify_cb, expiry) → sub_handle: subscribe to events
      • Takes a filter expression, a reference to a notify callback for event delivery, and an expiry time for the subscription registration
      • Returns a subscription handle
    – unsubscribe(sub_handle)
    – notify_cb(sub_handle, event): called by the pub/sub system to deliver a matching event to the corresponding subscribers

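The same interface as a hypothetical Java sketch; the handle and callback types are illustrative:

import java.time.Instant;
import java.util.function.BiConsumer;

// Hypothetical topic-based pub/sub interface; E is the event type
public interface PubSubSystem<E> {
    // publish(event): deliver the event to all matching subscribers
    void publish(E event);

    // subscribe(filter_expr, notify_cb, expiry) -> sub_handle:
    // register a callback invoked for each event matching the filter,
    // until the subscription expires; returns a subscription handle
    long subscribe(String filterExpr, BiConsumer<Long, E> notifyCb, Instant expiry);

    // unsubscribe(sub_handle): cancel the subscription
    void unsubscribe(long subHandle);

    // notify_cb itself is the BiConsumer passed to subscribe: the
    // pub/sub system calls it with (sub_handle, event) on a match
}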

Pub/sub vs. message queue

  • A sibling of the message queue pattern, but further generalizes it by delivering a message to multiple consumers
    – Message queue: delivers messages to only one receiver, i.e., one-to-one communication
    – Pub/sub: delivers messages to multiple receivers, i.e., one-to-many communication
  • Some frameworks (e.g., RabbitMQ, Kafka, NATS) support both patterns


Apache Kafka

  • General-purpose, distributed pub/sub system
  • Originally developed in 2010 at LinkedIn
  • Used at scale by tech giants (Netflix, Uber, LinkedIn, …)
  • Written in Scala
  • Horizontally scalable
  • Fault-tolerant
  • High-throughput ingestion
    – Billions of messages
  • Not only messaging, but also processing of data
    – We focus on messaging

Kreps et al., “Kafka: A Distributed Messaging System for Log Processing”, NetDB ’11.
https://kafka.apache.org/documentation/

Kafka at a glance

  • Kafka maintains feeds of messages in categories called topics
    – A topic can have 0, 1, or many consumers subscribing to data written to it
  • Producers: publish messages to a Kafka topic
  • Consumers: subscribe to topics and process the feed of published messages
  • Kafka cluster: distributed log of data over servers known as brokers
    – A broker is responsible for receiving and storing published data


Kafka: topics and partitions

  • Topic: a category to which messages are published
  • For each topic, the Kafka cluster maintains a partitioned log
    – Log (data structure!): append-only, totally-ordered sequence of records ordered by time
  • A topic is split into a pre-defined number of partitions
    – Partition: unit of parallelism of the topic (allows for parallel access)
  • Each partition is replicated with some replication factor
  • Create a topic with 1 partition and 1 replica using CLI tools:

$> bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test

Kafka: partitions

  • Producers publish their records to partitions of a topic (round-robin or partitioned by keys), and consumers consume the published records of that topic
  • Each partition is an ordered, numbered, immutable sequence of records that is continually appended to
    – Like a commit log
  • Each record is associated with a monotonically increasing sequence number, called offset


Kafka: partitions

  • Partitions are distributed across brokers for scalability
  • Each partition is replicated for fault tolerance across a configurable number of brokers
  • Each partition has one leader broker and 0 or more followers
  • The leader handles read and write requests
    – Read from leader
    – Write to leader
  • A follower replicates the leader and acts as a backup
  • Each broker is a leader for some of its partitions and a follower for others, to balance load
    – Brokers rely on Apache ZooKeeper for coordination



Kafka: producers

  • Producers = data sources
  • Publish data to topics of their choice
    – The producer sends data directly to the broker that is the leader for the partition, without any routing tier
  • Also responsible for choosing which record to assign to which partition within the topic
    – Random or key-based partitioning
    – E.g., if user id is the key, all data for a given user will be sent to the same partition
  • Send some messages using CLI tools:

$> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
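The same publication can also be done programmatically with the Producer API; a minimal Java sketch, where the topic name, key and serializer choices are illustrative:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key-based partitioning: every record with key "user-42"
            // is routed to the same partition of topic "test"
            producer.send(new ProducerRecord<>("test", "user-42", "This is a message"));
        }
    }
}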

Kafka: design choice for consumers

  • Push vs. pull model for consumers
  • Push model
    – The brokers actively push messages to the consumers
    – Challenging for the broker to deal with different types of consumers, as it controls the rate at which data is transferred
    – Needs to decide whether to send a message immediately or accumulate more data and send
  • Pull model
    – The consumer assumes the primary responsibility for retrieving messages from the brokers
    – The consumer has to maintain an offset that identifies the next message to be transmitted and processed
    – Pros: better scalability and flexibility (different consumers with diverse needs and capabilities)
    – Cons: in case the broker has no data, consumers may end up busy waiting for data to arrive


Kafka: consumers

  • In Kafka’s design, a pull approach rather than a push approach for consumers
    – Why? Less burden on brokers
      https://kafka.apache.org/documentation/#design_pull
  • Consumer Group: set of consumers sharing a common group ID
    – A Consumer Group maps to a logical subscriber
    – Each group consists of multiple consumers for scalability and fault tolerance


Kafka: consumers

  • Consumers use the offset to track which messages have been consumed
    – Messages can be replayed using the offset
  • Run the consumer using CLI tools:

$> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
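Programmatically, a minimal Java sketch of the Consumer API poll loop; the group id and topic are illustrative, and auto.offset.reset=earliest plays the role of --from-beginning when the group has no committed offset:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test-group"); // consumers sharing this id form a group
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test"));
            while (true) {
                // Pull model: the consumer asks the broker for new records
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records)
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}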

Kafka: ordering guarantees

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent
  • A consumer sees records in the order they are stored in the partition
  • Strong guarantees about ordering only within a partition
    – Total order over messages within a partition
    – But Kafka cannot preserve message order between different partitions in a topic
  • However, such per-partition ordering, combined with the ability to partition data by key among topic partitions, is sufficient for most applications


Kafka: delivery semantics

  • Delivery guarantees supported by Kafka
    – At-least-once (default): guarantees no message loss, but duplicated messages, possibly out of order
    – Exactly-once: guarantees no loss and no duplicates, but requires expensive end-to-end 2PC
      • But not fully exactly-once
      • Support depends on the destination system
  • Users can also implement at-most-once delivery by disabling retries on the producer and committing offsets in the consumer prior to processing a message

See https://kafka.apache.org/documentation/#semantics
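These guarantees surface as client configuration; a Java sketch of the relevant producer settings, with values that are illustrative rather than a recommended setup:

import java.util.Properties;

public class DeliveryConfigSketch {
    public static void main(String[] args) {
        // At-least-once: wait for broker acks and retry failed sends,
        // so a record is never lost but may be duplicated
        Properties atLeastOnce = new Properties();
        atLeastOnce.put("acks", "all");
        atLeastOnce.put("retries", Integer.MAX_VALUE);

        // At-most-once: disable producer retries (and, on the consumer
        // side, commit offsets before processing each message)
        Properties atMostOnce = new Properties();
        atMostOnce.put("retries", 0);

        // Toward exactly-once within Kafka: idempotent producer, which
        // deduplicates broker-side retries
        Properties idempotent = new Properties();
        idempotent.put("enable.idempotence", "true");
    }
}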

Kafka: fault tolerance

  • Kafka replicates partitions for fault tolerance
  • Kafka makes a message available for consumption only after all the followers acknowledge to the leader a successful write
    – Implies that a message may not be immediately available for consumption (the usual tradeoff between consistency and availability)
    – This default behavior can be relaxed if such a strong guarantee is not required
  • Kafka retains messages for a configured period of time
    – Messages can be “replayed” in case a consumer fails


Kafka and ZooKeeper

  • Kafka uses ZooKeeper to coordinate between producers, consumers and brokers
  • ZooKeeper stores Kafka metadata
    – List of brokers
    – List of consumers and their offsets
    – List of producers

Kafka: APIs

  • Four core APIs
  • Producer API: allows apps to publish records (e.g., clickstream, logs, IoT) to topics
  • Consumer API: allows apps to read records from topics
  • Connect API: reusable connectors (producers or consumers) that connect Kafka topics to existing applications or data systems, so as to move large collections of data into and out of Kafka
    – Connectors for AWS S3, HDFS, RabbitMQ, MySQL, Postgres, AWS Lambda, MongoDB, Twitter, …

https://kafka.apache.org/documentation/#api


Kafka: APIs

  • Streams API: allows transforming streams of data from input topics to output topics
    – Kafka is not only a pub/sub system but also a real-time streaming platform
      • Use Kafka Streams to process data in pipelines consisting of multiple stages
  • Kafka APIs support Java and Scala only
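A minimal Kafka Streams sketch of a single-stage pipeline; the application id and topic names are illustrative:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());

        // Read from an input topic, transform each value, write to an output topic
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> in = builder.stream("input-topic");
        in.mapValues(v -> v.toUpperCase()).to("output-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}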

Kafka: client library

  • JVM internal client
  • Plus a rich ecosystem of clients, among which:
    – Sarama: Go client library
      https://shopify.github.io/sarama/
    – Python client libraries
      https://pypi.org/project/kafka-python/
      https://github.com/confluentinc/confluent-kafka-python/
    – NodeJS client library
      https://github.com/Blizzard/node-rdkafka


Performance comparison: Kafka vs RabbitMQ

  • Both guarantee millisecond-level low latency
    – At-least-once delivery guarantee is more expensive on Kafka (latency almost doubles)
  • Replication has a drastic impact in both cases
    – Performance reduced by 50% (RabbitMQ) and 75% (Kafka)
  • Kafka is best suited as a scalable ingestion system
  • The two systems can be chained
  • For a feature comparison see also
    https://www.cloudkarafka.com/blog/2020-02-02-which-service-rabbitmq-vs-apache-kafka.html

Dobbelaere and Esmaili, “Kafka versus RabbitMQ”, ACM DEBS 2017

Kafka: limitations

  • No complete set of monitoring tools
  • No support for wildcard topic selection
  • Limited support for geo-replication
    – Can run a single Apache Kafka cluster across multiple geographic regions, but suffers from latency
    – The MirrorMaker tool allows maintaining a replica of an existing Kafka cluster, but mainly for fault tolerance


Kafka: evolution

  • Kafka is no longer just a pub/sub system but rather a distributed streaming platform
    – Kafka Streams
    – Including KSQL, a streaming SQL engine for Kafka on top of Kafka Streams
  • You will see Kafka Streams in the hands-on course

Kafka @ Netflix

  • Netflix uses Kafka for data collection and buffering

http://techblog.netflix.com/2016/04/kafka-inside-keystone-pipeline.html


Kafka @ Uber

  • Uber has one of the largest Kafka deployments in the industry

https://eng.uber.com/ureplicator/

Kafka @ Audi

  • Audi uses Kafka for real-time data processing

https://www.youtube.com/watch?v=yGLKi3TMJv8


Kafka @ CINI Smart City Challenge ’17


By M. Adriani, D. Magnanimi, M. Ponza, F. Rossi


Cloud services for Kafka

  • Fully-managed services based on Kafka
  • Amazon MSK (Managed Streaming for Apache Kafka)
    https://aws.amazon.com/msk/
  • Confluent Cloud
    https://www.confluent.io/confluent-cloud/
    – Led by the creators of Kafka
  • CloudKarafka
    https://www.cloudkarafka.com/


Cloud services for data ingestion

  • Amazon Kinesis Data Firehose
    – Fully managed service for delivering streaming data directly to S3, used as a data lake
    – Can transform and compress streaming data before storing it
    – Can invoke Lambda functions to transform incoming source data and deliver it to S3
  • Google Cloud Pub/Sub
    https://cloud.google.com/pubsub/
    – Fully-managed real-time pub/sub messaging service

Cloud services for IoT data ingestion

  • Let’s consider some AWS cloud services devoted to IoT data ingestion (and analysis)
    – AWS IoT Events
    – AWS IoT Core
    – AWS IoT Analytics


AWS IoT Events

  • IoT service to detect and respond to events from IoT sensors and applications
    – Select the data sources to ingest, define the logic for each event using if-then-else statements, and select the alert or custom action to trigger when an event occurs
    – Integrated with other services (AWS IoT Core and IoT Analytics) to enable detection and insights into events

AWS IoT Core


  • Managed cloud service that lets connected devices interact with cloud applications and other devices


AWS IoT Analytics

  • Fully-managed cloud service to run analytics on massive volumes of IoT data
  • Filters, transforms, and enriches IoT data before storing it in a time-series data store for analysis

References

  • Apache Flume documentation, https://flume.apache.org/FlumeUserGuide.html
  • Apache NiFi documentation, https://nifi.apache.org/docs.html
  • Apache Kafka documentation, https://kafka.apache.org/documentation/
  • Kreps et al., “Kafka: A Distributed Messaging System for Log Processing”, NetDB 2011. http://bit.ly/2oxpael
  • Dobbelaere and Esmaili, “Kafka versus RabbitMQ: A Comparative Study of Two Industry Reference Publish/Subscribe Implementations”, ACM DEBS 2017. https://doi.org/10.1145/3093742.3093908